A lot of the speed depends on hardware. Generally, in my experience so far the most accurate models are the largest you can run in 4 bit. I can barely run a Llama2 70B GGML in 4 bit with a 16 layers on a 3080Ti and everything else on 64GB of DDR5. A solid block of 200-300 words takes about 3 minutes to complete, but the quality of the results is well worth it. I also use a WizardLM 30B with 4 bit GGML. It takes around 2 minutes for an equivalent output. Anything in the 7-20B range is like asking the average teenager for technical help. It is possible to have a functional smalltalk conversation with one, but don't hand them your investment portfolio, ask them to toss a new clutch in the car, or secure a corporate server rack even if they claim expertise. Maybe with some embeddings and a bunch of tuning better results are possible. I have only tried 2 13B's a dozen 7B's half a dozen ~30B's and 2 70B's.
LocalLLaMA
Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Lets explore cutting edge open source neural network technology together.
Get support from the community! Ask questions, share prompts, discuss benchmarks, get hyped at the latest and greatest model releases! Enjoy talking about our awesome hobby.
As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive constructive way.
I'm still on 3060Ti, but then speed isn't my biggest concern. I'm primarily focused on reasonably accurate "understanding" of the source material. I got pretty good results with GPT 4, but I feel like focusing my training data could help avoid irrelevant responses.
Can you give us a link to the software you use? Or is the ttrpg assistant something you develop?
Edit: Sry. I didn't get it. You write you're feeding rulebooks into privateGPT. I somehow thought you had something roleplay specific.