LocalLLaMA


Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Let's explore cutting-edge open-source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, get hyped at the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive constructive way.

founded 2 years ago


What is a good model that runs on 6GB Vram? (discuss.online)

submitted 4 months ago by [email protected] to c/[email protected]


It should be good at conversation and creative writing, since it'll be for worldbuilding.

Best if uncensored, as I prefer that over a filter kicking in when I least want it.

I'm fine with roleplaying models as long as they can actually give me ideas and talk to me logically.


[–] [email protected] 6 points 4 months ago* (last edited 4 months ago) (1 children)

Try the IQ4_XS quant of Mistral Nemo.

If you want a more roleplay-oriented model with more creativity at the cost of other things, you can try the ArliAI finetune of Nemo.

If you want the model to remember more of a long conversation, you need to bump its context size up. To make room for that, you can put fewer layers on the GPU, go down a quant, or go to a smaller model like Llama 8B.
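For a rough sense of what extra context costs: the KV cache grows linearly with context length. Here is a minimal sketch, assuming an fp16 cache and a Mistral-Nemo-like shape (40 layers, 8 KV heads, head dimension 128). Those numbers are an assumption, so check your model's config. The function names are made up for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Approximate KV cache size.

    Each token stores one key and one value vector (hence the factor 2)
    per layer, each of size n_kv_heads * head_dim elements.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Assumed Mistral-Nemo-like shape; not read from any real model file.
NEMO_LIKE = dict(n_layers=40, n_kv_heads=8, head_dim=128)


def kv_cache_gib(context_len, **shape):
    """Same estimate, expressed in GiB."""
    return kv_cache_bytes(context_len=context_len, **shape) / 2**30


if __name__ == "__main__":
    for ctx in (4096, 8192, 16384, 32768):
        print(f"{ctx:>6} tokens -> {kv_cache_gib(ctx, **NEMO_LIKE):.2f} GiB")
```

Doubling the context doubles this number, which is why long contexts eat into the space left for model layers on a 6GB card. A quantized KV cache (smaller `bytes_per_elem`) shrinks it proportionally.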

[–] [email protected] 1 points 4 months ago (2 children)

Can't you just increase context length at the cost of paging and slowdown?

[–] [email protected] 2 points 4 months ago (1 children)

At some point you'll run out of VRAM on the GPU. You can make room for more context by keeping fewer layers on the GPU and running the rest on the CPU, but that makes generation slower.
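A back-of-the-envelope sketch of that trade-off, under loud assumptions: weights are spread evenly across layers, each layer's share of the KV cache sits wherever that layer runs, and a fixed chunk of VRAM is lost to overhead. The function name and all the numbers are hypothetical, and real runtimes allocate extra buffers, so treat the result as a starting guess and not a guarantee.

```python
def gpu_layers_that_fit(vram_gib, model_gib, n_layers, kv_cache_gib, overhead_gib=0.5):
    """Rough count of layers that fit in VRAM.

    Assumes each layer costs an equal slice of the weights plus an equal
    slice of the KV cache, and that overhead_gib is unavailable to the model.
    """
    per_layer = (model_gib + kv_cache_gib) / n_layers
    free = vram_gib - overhead_gib
    if free <= 0:
        return 0
    return min(n_layers, int(free // per_layer))


if __name__ == "__main__":
    # Illustrative numbers only: a 6 GiB card, a 6.5 GiB quantized model file,
    # 40 layers, and two different KV cache sizes.
    for kv in (1.25, 2.5):
        n = gpu_layers_that_fit(vram_gib=6.0, model_gib=6.5, n_layers=40, kv_cache_gib=kv)
        print(f"KV cache {kv} GiB -> about {n} of 40 layers on the GPU")
```

The bigger the context, the fewer layers stay on the GPU, and every layer that moves to the CPU slows generation down.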

[–] [email protected] 1 points 4 months ago

Yes, but if he's world building, a larger, slower model might just be an acceptable compromise.

I was getting OOM errors doing speech-to-text on my 4070 Ti. I know (now) that I should have gone for the 3090 Ti. Such is life.

[–] [email protected] 1 points 4 months ago

At a certain point, layers will be pushed to system RAM, leading to incredibly slow inference. You don't want to wait hours for the model to generate a single response.