This one took me some time to figure out, and unfortunately it requires a significantly larger image (it needs much more of NVIDIA's toolkit D: I couldn't find a way around it…)
If people would prefer a smaller image, I can start maintaining two: one with exllama and one without. For now, 1.0 is identical minus exllama support (and it's also built from an older commit), so you can use that one until there's actual new functionality :)
Also, do note that the model needs to be quantized with GPTQ-for-LLaMa, not AutoGPTQ.