unsure on how to quantize model

brokenlcd · 2 years ago

unsure on how to quantize model

Mechanize · 2 years ago

I’ve never used oobabooga but if you use llama.cpp directly you can specify the number of layers that you want to run on the GPU with the -ngl flag, followed by the number.

So, as an example, a command (on linux) from the directory you have the binary, to run its server would look something like: ./llama-server -m "/path/to/model.gguf" -ngl 10

Another important flag that could interest you is -c for the context size.

This will put 10 layers of the model on the GPU, the rest will be on RAM for the CPU.

I would be surprised if you can’t just connect to the llama.cpp server or just set text-generation-webui to do the same with some setting.

At worst you can consider using ollama, which is a llama.cpp wrapper.

But probably you would want to invest the time to understand how to use llama.cpp directly and put a UI in front of it, Sillytavern is a good one for many usecases, OpenWebUI can be another but - in my experience - it tends to have more half baked features and the development jumps around a lot.

As a more general answer, no, the safetensor format doesn’t directly support quantization, as far as I know

Universal Monk@sh.itjust.works · 2 years ago

Thank you for this!