I was experimenting with oobabooga, trying to run this model, but due to its size it wasn’t going to fit in RAM, so I tried to quantize it using llama.cpp. That worked, but because of the GGUF format it was only running on the CPU. Searching for ways to quantize the model while keeping it in safetensors turned up nothing, so is there any way to do that?
I’m sorry if this is a stupid question; I still know almost nothing about this field.
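For reference, the closest thing I’ve found so far is loading in 4-bit on the fly with bitsandbytes through transformers, roughly like the sketch below (the model id is just a placeholder, and I’m not sure this is the “proper” way to do it):

```python
# Rough sketch: on-the-fly 4-bit quantization with bitsandbytes via transformers.
# The model id is only a placeholder; swap in whatever repo you're actually using.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-Nemo-Instruct-2407"  # placeholder

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize weights to 4-bit at load time
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # do the matmuls in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # put as much as fits on the GPU, spill the rest to CPU
)

prompt = "Write the opening line of a story set in a lighthouse."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

This keeps the original safetensors on disk and runs on the GPU, but it quantizes at load time rather than producing a smaller safetensors file, which is what I was really hoping for.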
You’ll want to look up how to offload GPU layers in ollama. A lower-quant GGUF should work great with offloading.
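Something along these lines with the ollama Python client should do it, if I’m remembering the option name right (num_gpu is the number of layers to offload; double-check the docs, and the model tag and layer count are just examples):

```python
# Sketch: chat through ollama with an explicit GPU layer count.
# Assumes `ollama pull mistral-nemo` has been run and the `ollama` package is installed.
import ollama

response = ollama.chat(
    model="mistral-nemo",
    messages=[{"role": "user", "content": "Write a short tavern scene."}],
    options={"num_gpu": 28},  # number of transformer layers to push onto the GPU
)
print(response["message"]["content"])
```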
Most people use kobold.cpp now; ollama and llama.cpp kind of fell behind. kobold.cpp is a bleeding-edge fork of llama.cpp with all the latest and greatest features, and its GPU offloading is so damn easy: if you have an NVIDIA card use CuBLAS, if you have an AMD card use Vulkan.
Is there a particular reason you’re trying to run a mixture-of-experts model for an RP/storytelling-purposed LLM? Usually MoE is better suited to logical reasoning and critical analysis of a complex problem. If you’re a newbie just starting out, you may be better off with an RP finetune of a MistralAI LLM like alriamax, based on NeMo 12B.
There’s always a tradeoff with finetunes: typically a model that’s finetuned for RP/storytelling sacrifices capability in other important areas like reasoning, encyclopedic knowledge, and mathematical/coding ability.
Here’s an example starting command for offloading. I have an NVIDIA 1070 Ti 8 GB and can get 25–35 layers offloaded onto it depending on context size:
./koboldcpp --model Mistral-Nemo-Instruct-2407-Q4_K_M.gguf --threads 6 --usecublas --gpulayers 28 --contextsize 8192
I think I may try it this way if kobold can use Vulkan instead of ROCm; it’s most likely going to be way less of a headache.
As for the model, it’s just what came out of a random search on Reddit for a decent small model. No reason in particular; thanks for the suggestion.