Reading up on the speculation on the internet: there must be a caveat... There's probably a reason why they only trained models up to 3B parameters... I mean, the team has Microsoft's name behind it, so they should have access to enough GPUs. Maybe the training is just super (computationally) expensive.
They say that the models would have to be trained from scratch, and so far that has always been super expensive.
Sure, I meant considerably more expensive than current methods... It's not really a downside if it's only as expensive as other methods, given the huge benefits it has once training is finished (at inference).
If that's all it is, the next base/foundation models would surely be designed around this. And companies would pick up on it soon, since the initial investment in training would pay off quickly. And then you have something like an 8x competitive advantage.
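For what it's worth, the ~8x figure lines up with simple weight-storage arithmetic: fp16 uses 16 bits per weight, while ternary weights ({-1, 0, +1}) can be packed into 2 bits each. A quick back-of-the-envelope sketch (the 7B model size is just an illustrative assumption, not a number from the paper):

```python
# Back-of-the-envelope memory comparison (illustrative numbers, not measurements).
params = 7e9                      # hypothetical 7B-parameter model

fp16_bits_per_weight = 16         # standard half-precision weights
ternary_bits_per_weight = 2       # {-1, 0, +1} packed into 2 bits each

fp16_gb = params * fp16_bits_per_weight / 8 / 1e9
ternary_gb = params * ternary_bits_per_weight / 8 / 1e9

print(f"fp16 weights:    {fp16_gb:.1f} GB")              # ~14.0 GB
print(f"ternary weights: {ternary_gb:.1f} GB")            # ~1.8 GB
print(f"reduction:       {fp16_gb / ternary_gb:.0f}x")    # 8x
```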
Ah, I thought you meant why the researchers themselves hadn't produced any larger models. AFAIK neither MS nor OAI has released even a 7B model; they might have larger BitNet models that they only use internally.
Hmm. I meant kind of both. I think them not releasing a model isn't a good sign to begin with. That wouldn't matter if somebody else picked it up. (What I read from the paper is that they did some training up to 3B(?!) and then extrapolated in some way to get additional measurements without actually training larger models. So internally they don't seem to have any real larger models either. But even the small models don't seem to have been published. Then again, I also don't have any insight into how many GPUs the researchers/companies have sitting around, or what they're currently working on and using them for. It's a considerable amount, though.)
It's only been a few weeks, and I couldn't find a comprehensive test / follow-up of their approach yet. However, last week they released some more information: https://github.com/microsoft/unilm/blob/master/bitnet/The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ.pdf
And I found this post from 2 days ago where someone did a small training run and published the loss curve.
And some people have started working on implementations on GitHub. I'm not sure, though, where this is supposed to go without actual models being available.
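For reference, the core trick those implementations reproduce (going by the paper's description) is a drop-in replacement for a regular linear layer that quantizes the full-precision latent weights to {-1, 0, +1} on the forward pass and passes gradients straight through, which is also why the model has to be trained from scratch rather than converted afterwards. A rough PyTorch sketch under those assumptions (simplified; the real recipe also quantizes activations and has more scaling details):

```python
import torch
import torch.nn as nn

class BitLinear(nn.Linear):
    """Rough sketch of a BitNet b1.58-style linear layer (weights only;
    activation quantization and normalization are omitted for brevity)."""

    def forward(self, x):
        w = self.weight                           # full-precision "latent" weights
        scale = w.abs().mean().clamp(min=1e-5)    # absmean scaling factor
        # Quantize to {-1, 0, +1}; the detach trick is a straight-through
        # estimator, so gradients still flow to the latent fp weights.
        w_q = (w / scale).round().clamp(-1, 1)
        w_q = w + (w_q * scale - w).detach()
        return nn.functional.linear(x, w_q, self.bias)

# Usage: swap nn.Linear for BitLinear when building the transformer,
# then train from scratch as usual.
layer = BitLinear(1024, 1024, bias=False)
out = layer(torch.randn(2, 1024))
```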