LocalLLaMA

Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Let's explore cutting-edge open-source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, get hyped at the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive, constructive way.

Rules:

Rule 1 - No harassment or personal character attacks of community members. I.e., no name-calling, no generalizing about entire groups of people that make up our community, no baseless personal insults.

Rule 2 - No comparing artificial intelligence/machine learning models to cryptocurrency. I.e., no comparing the usefulness of models to that of NFTs, no claiming that the resources required to train a model are anything close to those for maintaining a blockchain or mining crypto, and no implying it's just a fad/bubble that will leave people with nothing of value when it bursts.

Rule 3 - No comparing artificial intelligence/machine learning to simple text prediction algorithms. I.e., no statements such as "LLMs are basically just simple text predictors like your phone keyboard's autocorrect, and they're still using the same algorithms as <over 10 years ago>."

Rule 4 - No implying that models are devoid of purpose or potential for enriching people's lives.

Qwen3-32b: Windows95 starfield screensaver web app with warp drive on click (lemmynsfw.com)

submitted 2 months ago* (last edited 1 month ago) by [email protected] to c/[email protected]

It's amazing how far open-source LLMs have come.

Qwen3-32b recreated the Windows95 Starfield screensaver as a web app with the bonus feature to enable "warp drive" on click. This was generated with reasoning disabled (/no_think) using a 4-bit quant running locally on a 4090.

Here's the result: https://codepen.io/mekelef486/pen/xbbWGpX

Model: Qwen3-32B-Q4_K_M.gguf (Unsloth quant)

Llama.cpp Server Docker Config:

docker run \
  -p 8080:8080 \
  -v /path/to/models:/models \
  --name llama-cpp-qwen3-32b \
  --gpus all \
  ghcr.io/ggerganov/llama.cpp:server-cuda \
  -m /models/qwen3-32b-q4_k_m.gguf \
  --host 0.0.0.0 --port 8080 \
  --n-gpu-layers 65 \
  --ctx-size 13000 \
  --temp 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  --min-p 0

System Prompt:

You are a helpful expert and aid. Communicate clearly and succinctly. Avoid emojis.

User Prompt:

Create a simple web app that uses javascript to visualize a simple starfield, where the user is racing forward through the stars from a first person point of view like in the old Microsoft screensaver. Stars must be uniformly distributed. Clicking inside the window enables "warp speed" mode, where the visualization speeds up and star trails are added. The app must be fully contained in a single HTML file. /no_think

top 11 comments

[–] [email protected] 5 points 2 months ago (2 children)

Is 13k your max context at Q4_K_M?

[–] [email protected] 6 points 2 months ago* (last edited 2 months ago) (1 children)

I'm close to the limit at 23886MiB / 24564MiB of VRAM used when the server is running. I like to have a bit of headroom for other tasks.

But I'm by no means a llama.cpp expert. If you have any tips for better performance I'd love to hear them!
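For a rough sense of how context length eats into that headroom, here is a back-of-the-envelope sketch of the KV-cache size. The model shape (64 layers, 8 KV heads, head dimension 128) and the fp16 cache are assumptions not stated in the thread, and `kv_cache_mib` is a made-up helper name; real usage also includes the weights and compute buffers, so treat the result as a lower bound on what the context costs.

```shell
#!/usr/bin/env bash
# Rough fp16 KV-cache size in MiB for a given context length.
# ASSUMED Qwen3-32B shape (not from the thread): 64 layers,
# 8 KV heads, head dimension 128, 2 bytes per cached value.
kv_cache_mib() {
  ctx_tokens="$1"
  layers=64; kv_heads=8; head_dim=128; bytes_per_value=2
  # The factor 2 covers one K tensor plus one V tensor per layer.
  bytes_per_token=$(( 2 * layers * kv_heads * head_dim * bytes_per_value ))
  echo $(( ctx_tokens * bytes_per_token / 1024 / 1024 ))
}

kv_cache_mib 13000   # the --ctx-size used in the post
kv_cache_mib 16384   # a 16k context for comparison
```

Under these assumptions each token costs 256 KiB of cache, so the 13000-token context accounts for roughly 3.2 GiB of the VRAM in use.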

[–] [email protected] 5 points 2 months ago (1 children)

Enable flash attention if you haven't already.

[–] [email protected] 3 points 2 months ago (1 children)

22466MiB / 24564MiB, awesome, thank you!
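For anyone wanting to reproduce that saving, the tip maps to one extra server flag. This is a sketch of the command from the post with `--flash-attn` appended, assuming a llama.cpp build where the flag is a bare switch; some builds may instead expect a value after it, so check the server's `--help` output for the image you are running.

```shell
# Same configuration as in the post, plus flash attention.
# ASSUMPTION: --flash-attn is accepted as a bare switch by this image;
# if the server rejects it, consult `--help` for the form your build expects.
docker run \
  -p 8080:8080 \
  -v /path/to/models:/models \
  --name llama-cpp-qwen3-32b \
  --gpus all \
  ghcr.io/ggerganov/llama.cpp:server-cuda \
  -m /models/qwen3-32b-q4_k_m.gguf \
  --host 0.0.0.0 --port 8080 \
  --n-gpu-layers 65 \
  --ctx-size 13000 \
  --temp 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  --min-p 0 \
  --flash-attn
```

The VRAM freed this way can be left as headroom or spent on a larger `--ctx-size`.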

[–] [email protected] 2 points 2 months ago (1 children)

You're welcome. Also, what's your GPU, and are you using cuBLAS (NVIDIA), Vulkan (universal, AMD + NVIDIA), or something else as the GPU backend?

[–] [email protected] 2 points 1 month ago (1 children)

It's a 4090 using cuBLAS. I just run the stock llama.cpp server with CUDA support. Do you know if there'd be any advantage to building it from source or using something else?

[–] [email protected] 2 points 1 month ago

If you were running an AMD GPU, there are builds of llama.cpp you can compile with ROCm support. If you're ever tempted to run a huge model with partial offloading to CPU/RAM, you can run the program at the highest scheduling priority (the lowest niceness value), which, believe it or not, pushes up the token speed slightly.
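A minimal sketch of the niceness idea on Linux, for anyone who wants to try it. `run_with_niceness` is a made-up wrapper, and the binary name, model path and layer count in the usage comment are placeholders; whether it helps token speed at all will depend on what else is competing for the CPU.

```shell
#!/usr/bin/env bash
# Run a command at a chosen niceness. Lower values mean higher
# scheduling priority; negative values (down to -20) normally need root.
run_with_niceness() {
  niceness="$1"
  shift
  nice -n "$niceness" "$@"
}

# Harmless demo: `nice` with no arguments prints the niceness it runs at.
run_with_niceness 10 nice

# Hypothetical real usage (placeholders, run as root for a negative value):
#   sudo nice -n -20 llama-server -m /models/model.gguf --n-gpu-layers 20
```

For a server that is already running, `renice` with the process ID achieves the same thing without a restart.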

[–] [email protected] 4 points 1 month ago

ExLlamaV3 is still in development, so it's not fully optimized and could have bugs, but I get 16k context with 4bpw (which has perplexity very similar to Q4_K_M, according to the developer's own measurements) using only 22GB VRAM, since I also run my desktop env on the same computer.

[–] [email protected] 2 points 1 month ago (2 children)

People implement this on their calculator during class. This is the kind of thing you would write to learn programming, the definition of entry-level. You're using a device that can execute billions of trigonometric calculations per millisecond to produce code that calculates X and Y coordinates for a few dozen points on a radial trajectory.

What the fuck...

[–] [email protected] 6 points 1 month ago

There is, seriously, no pleasing some people. You appear to be a vocal member of this highly undignified and odious demographic.

[–] [email protected] 3 points 1 month ago* (last edited 1 month ago)

Fair point. My original prompt asked for more, but the model wasn't capable enough. Not sure if the "warp drive" part would be part of any standard algo.

Any ideas on challenges that are new and more fun than the "balls rolling in a hexa-, hepta-, octagon" or "simulate a solar system" prompts everyone's using these days?