this post was submitted on 02 May 2025
LocalLLaMA
Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Let's explore cutting-edge open source neural network technology together.
Get support from the community! Ask questions, share prompts, discuss benchmarks, get hyped at the latest and greatest model releases! Enjoy talking about our awesome hobby.
As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive, constructive way.
you are viewing a single comment's thread
view the rest of the comments
Is 13k your max context at Q4_K_M?
I'm close to the limit at 23886 MiB / 24564 MiB of VRAM used when the server is running. I like to have a bit of headroom for other tasks.
But I'm by no means a llama.cpp expert. If you have any tips for better performance, I'd love to hear them!
Enable flash attention if you haven't already.
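Something like this, if you're on a reasonably recent build (a minimal sketch; the model path, context size, and layer count are just placeholders, and the exact flag spelling can differ between releases):

    # --flash-attn (or -fa) enables flash attention; -ngl 99 offloads all layers; -c sets the context size
    ./llama-server -m ./model-Q4_K_M.gguf -c 13000 -ngl 99 --flash-attn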
22466 MiB / 24564 MiB, awesome, thank you!
You're welcome. Also, what's your GPU, and are you using cuBLAS (NVIDIA), Vulkan (works on both AMD and NVIDIA), or something else for GPU acceleration?
It's a 4090 using cuBLAS. I just run the stock llama.cpp server with CUDA support. Do you know if there'd be any advantage to building it from source or using something else?
If you were running an AMD GPU, there are versions of the llama.cpp engine you can compile with ROCm support. And if you're ever tempted to run a huge model partially offloaded to CPU/RAM for inference, you can run the program at the highest scheduling priority (lowest niceness value), which, believe it or not, pushes the token speed up slightly.
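Roughly, as a sketch rather than exact commands (the ROCm build flag has changed across llama.cpp releases, older checkouts used GGML_HIPBLAS or LLAMA_HIPBLAS, and gfx1030 plus the -ngl split below are just example values):

    # build with ROCm/HIP support for AMD GPUs
    cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1030
    cmake --build build --config Release

    # run at the highest scheduling priority (lowest niceness, needs root) with a partial GPU offload
    sudo nice -n -20 ./build/bin/llama-server -m ./model.gguf -c 13000 -ngl 40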