this post was submitted on 04 Nov 2023
21 points (100.0% liked)

LocalLLaMA

2819 readers
62 users here now

Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Lets explore cutting edge open source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, get hyped at the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive constructive way.

founded 2 years ago
MODERATORS
 

"This feature utilizes KV cache shifting to automatically remove old tokens from context and add new ones without requiring any reprocessing."

This means a major speed increase for people like me who rely on (slow) CPU inference (or big models). Consider a chatbot scenario and a long chat where old lines of dialogue need to be evicted from the context to stay within the (4096 token) context size. Previously the context had to be re-computed starting with the first changed/now missing token. This feature detects that, deletes the affected tokens from the KV cache and shifts the subsequent tokens in the KV cache so it can be re-used. Avoiding a computationally expensive re-calculation.

This is probably also more or less related to recent advancements like Streaming-LLM

This won't help once text gets inserted "in the middle" or the prompt gets changed in another way. But I managed to connect KoboldCPP as a backend for SillyTavern/Oobabooga and now I'm able to have unlimited length conversations without waiting excessively, once the chat history hits max tokens and the frontend starts dropping text.

It's just a clever way to re-use the KV cache in one specific case. But I've wished for this for quite some time.

top 2 comments
sorted by: hot top controversial new old
[โ€“] [email protected] 3 points 1 year ago (1 children)

Cool stuff! Smarter than smart contexts.

[โ€“] [email protected] 2 points 1 year ago* (last edited 1 year ago)

I wasn't able to get good use out if the old 'Smartcontext' anyways and seems other people had the same problem. To me, this is a huge improvement. And it doesn't even need extra memory or anything.

I really like how the KoboldCPP dev(s(?)) and the llama.cpp community constantly implement all the crazy stuff.