I was experimenting with oobabooga, trying to run this model, but due to its size it wasn’t going to fit in RAM, so I tried to quantize it using llama.cpp. That worked, but because of the GGUF format it was only running on the CPU. Searching for ways to quantize the model while keeping it in safetensors turned up nothing, so is there any way to do that?
I’m sorry if this is a stupid question; I still know almost nothing about this field.
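For reference, the closest thing I’ve found so far is loading in 4-bit on the fly with bitsandbytes through transformers, roughly like the sketch below (the model id is just a placeholder, and I’m not sure this is the “proper” way to do it):

```python
# Rough sketch: on-the-fly 4-bit quantization with bitsandbytes via transformers.
# The model id is only a placeholder; swap in whatever repo you're actually using.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-Nemo-Instruct-2407"  # placeholder

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize weights to 4-bit at load time
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # do the matmuls in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # put as much as fits on the GPU, spill the rest to CPU
)

prompt = "Write the opening line of a story set in a lighthouse."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

This keeps the original safetensors on disk and runs on the GPU, but it quantizes at load time rather than producing a smaller safetensors file, which is what I was really hoping for.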
You’ll want to look up how to offload GPU layers in ollama. A lower-quant GGUF should work great with offloading.
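Something along these lines with the ollama Python client should do it, if I’m remembering the option name right (num_gpu is the number of layers to offload; double-check the docs, and the model tag and layer count are just examples):

```python
# Sketch: chat through ollama with an explicit GPU layer count.
# Assumes `ollama pull mistral-nemo` has been run and the `ollama` package is installed.
import ollama

response = ollama.chat(
    model="mistral-nemo",
    messages=[{"role": "user", "content": "Write a short tavern scene."}],
    options={"num_gpu": 28},  # number of transformer layers to push onto the GPU
)
print(response["message"]["content"])
```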
Most people use kobold.cpp now; ollama and llama.cpp kind of fell behind. kobold.cpp is a bleeding-edge fork of llama.cpp with all the latest and greatest features, and its GPU offloading is so damn easy: if you have an NVIDIA card use CuBLAS, if you have an AMD card use Vulkan.
Is there a particular reason you’re trying to run a mixture-of-experts model for an RP/storytelling-purposed LLM? Usually MoE is better suited to logical reasoning and critical analysis of a complex problem. If you’re a newbie just starting out, you may be better off with an RP finetune of a MistralAI LLM like alriamax, based on NeMo 12B.
There’s always a tradeoff with finetunes: typically a model that’s finetuned for RP/storytelling sacrifices capability in other important areas like reasoning, encyclopedic knowledge, and mathematical/coding ability.
Here’s an example starting command for offloading. I have an NVIDIA 1070 Ti 8 GB and can get 25–35 layers offloaded onto it depending on context size:
./koboldcpp --model Mistral-Nemo-Instruct-2407-Q4_K_M.gguf --threads 6 --usecublas --gpulayers 28 --contextsize 8192
I think I may try it this way if kobold can use Vulkan instead of ROCm; it’s most likely going to be way less of a headache.
As for the model, it’s just what came out of a random search on Reddit for a decent small model. No reason in particular; thanks for the suggestion.