this post was submitted on 21 Feb 2025
LocalLLaMA
In my experience, anything similar to qwen-2.5:32B comes closest to gpt-4o. I think it should run on your setup. The 14b model is alright too, but definitely inferior. Mistral Small 3 also seems really good. Anything smaller is usually really dumb, and I doubt it would work for you.
You could probably run some larger 70b models at a snail's pace too.
Try the Deepseek R1 - qwen 32b distill, something like deepseek-r1:32b-qwen-distill-q4_K_M (name on ollama) or some fine-tune of it. It'll be by far the smartest model you can run.
There are various fine-tunes that remove some of the censorship (ablated/abliterated) or are optimized for RP, which might do better for your use case. But I personally haven't used them, so I can't promise anything.
Thank you so much for the suggestion! I tried Q8 of the model you mentioned, and I am very impressed with the results! The output itself was exactly what I wanted, though the speed was a little on the slower side. Loading my previous conversation with a context of over 15k tokens took about 10 minutes to get the first response, but the later messages were much faster. The web UI loses connection almost every time though, so I just manually copy the response from the terminal window into the web UI to save it for future context. I am currently downloading the Q6 model, and might experiment with going even lower for faster speeds and more stability, if the quality of the output doesn't degrade too much.
Q4 will give you something like 98% of the quality of Q8 at roughly twice the speed, plus room for much longer context lengths.
If you don't need the full context length, you can try loading the model with a shorter context length. A shorter context needs less VRAM for the KV cache, so you can fit more layers on the GPU, which makes it faster.
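To put rough numbers on that, here is a back-of-the-envelope sketch of where the memory goes. Everything in it is an assumption for illustration: the bits per weight are nominal (real GGUF quants like Q4_K_M use a bit more), and the layer count, KV head count and head size are my guess at a Qwen2.5-32B-style architecture, so check your model's config before trusting the output.

```python
# Rough memory arithmetic for a quantized model plus its KV cache.
# All figures are estimates; real runtimes add overhead on top.

GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: one key and one value vector
    per layer, per KV head, per token (fp16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# Assumed parameter count and nominal bit widths.
N_PARAMS = 32e9
for name, bits in [("Q8", 8), ("Q4", 4)]:
    print(f"{name}: ~{weight_bytes(N_PARAMS, bits) / 1e9:.0f} GB of weights")

# Assumed architecture (64 layers, 8 KV heads, head size 128).
for ctx in (4096, 16384, 32768):
    size = kv_cache_bytes(ctx, n_layers=64, n_kv_heads=8, head_dim=128)
    print(f"context {ctx}: ~{size / GIB:.1f} GiB of KV cache")
```

Under those assumptions, dropping from a 16k context to a 4k context frees about 3 GiB, and that is VRAM you can spend on extra model layers instead.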
And you can usually configure your inference engine to keep the model loaded at all times, so you're not losing so much time when you first start the model up.
Ollama attempts to dynamically load the right context length for your request, but in my experience that just results in really inconsistent and long times to first token.
The nice thing about vLLM is that your model is always loaded, so you don't have to worry about that. But then again, it needs much more VRAM.
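For Ollama, the fixed context length and the keep-loaded suggestions above could look roughly like this. It's a minimal sketch, assuming Ollama is listening on its default local port and that the model tag mentioned earlier is already pulled. The `num_ctx` value of 8192 is just an example, and `build_request` and `generate` are helper names made up for this sketch.

```python
import json
import urllib.request

# Assumed: Ollama running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str,
                  model: str = "deepseek-r1:32b-qwen-distill-q4_K_M",
                  num_ctx: int = 8192,
                  keep_alive: int = -1) -> dict:
    """Build a generate request with a pinned context length.

    num_ctx fixes the context window instead of leaving it to the server,
    and keep_alive=-1 asks Ollama to keep the model loaded indefinitely.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,
        "options": {"num_ctx": num_ctx},
    }


def generate(prompt: str, **kwargs) -> str:
    """Send the request to the local Ollama server and return the reply text."""
    body = json.dumps(build_request(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


# Example (needs a running Ollama server):
# print(generate("Summarize our conversation so far.", num_ctx=4096))
```

Using the same `num_ctx` on every request should avoid reloads caused by a changing context size, though I haven't measured how much that helps on any particular setup.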