LocalLLaMA

2747 readers
8 users here now

Welcome to LocalLLaMA! This is a community to discuss local large language models such as Llama, DeepSeek, Mistral, and Qwen.

Get support from the community! Ask questions, share prompts, discuss benchmarks, get hyped at the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive, constructive way.

founded 2 years ago
MODERATORS

something like docker run xyz_org/xyz_model


cross-posted from: https://lemmy.world/post/27088416

This is an update to a previous post found at https://lemmy.world/post/27013201


Ollama uses the AMD ROCm library, which can work with many AMD GPUs that are not listed as compatible if you force an LLVM target.

The original Ollama documentation is wrong: the following cannot be set for individual GPUs, only for all or none, as shown at github.com/ollama/ollama/issues/8473

AMD GPU issue fix

  1. Check that your GPU is not already listed as compatible at github.com/ollama/ollama/blob/main/docs/gpu.md#linux-support
  2. Edit the Ollama service file. This uses the text editor set in the $SYSTEMD_EDITOR environment variable.
sudo systemctl edit ollama.service
  3. Add the following, then save and exit. You can try different versions as shown at github.com/ollama/ollama/blob/main/docs/gpu.md#overrides-on-linux
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=10.3.0"
  4. Restart the Ollama service.
sudo systemctl restart ollama
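
To double-check that the override took effect, one option is to load a model and then query Ollama's /api/ps endpoint, which reports how much of each running model sits in VRAM. A minimal sketch (assuming the default localhost:11434 endpoint; the size/size_vram field names are as documented in the Ollama API and may change between versions):

# Minimal sketch: ask the local Ollama API which models are loaded and how much
# of each sits in VRAM. Assumes the default endpoint (localhost:11434) and the
# /api/ps endpoint; load a model first, e.g. with `ollama run deepseek-r1`.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
    data = json.load(resp)

for model in data.get("models", []):
    total = model.get("size", 0)
    in_vram = model.get("size_vram", 0)
    pct = 100 * in_vram / total if total else 0
    print(f"{model.get('name')}: {pct:.0f}% of weights in VRAM")

If the fix worked, the loaded model should report close to 100% of its weights in VRAM instead of falling back to CPU.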

cross-posted from: https://lemmy.world/post/27013201

Ollama lets you download and run large language models (LLMs) on your device.

Install Ollama on Arch Linux (Windows guide coming soon)

  1. Check whether your device has an AMD GPU, NVIDIA GPU, or no GPU. A GPU is recommended but not required.
  2. Open a console, type only one of the following commands, and press Return. This may ask for your password but won't show what you type.
sudo pacman -S ollama-rocm    # for AMD GPU
sudo pacman -S ollama-cuda    # for NVIDIA GPU
sudo pacman -S ollama         # for no GPU (for CPU)
  3. Enable the Ollama service (it runs on-device in the background) so that it starts with your device, and start it now.
sudo systemctl enable --now ollama

Test Ollama alone (Open WebUI guide coming soon)

  1. Open localhost:11434 in a web browser; you should see "Ollama is running". This shows Ollama is installed and its service is running.
  2. Run ollama run deepseek-r1 in one console and ollama ps in another to download and run the DeepSeek R1 model while checking whether Ollama is using your slow CPU or your fast GPU.
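
If you'd rather test from code than from the browser, below is a minimal sketch that sends a single non-streaming request to Ollama's /api/generate endpoint. It assumes the default localhost:11434 endpoint and that deepseek-r1 has already been pulled; the prompt is just an example.

# Minimal sketch: send one non-streaming generation request to the local Ollama
# server. Assumes the default endpoint (localhost:11434) and that deepseek-r1
# has already been downloaded (e.g. via `ollama run deepseek-r1`).
import json
import urllib.request

payload = json.dumps({
    "model": "deepseek-r1",
    "prompt": "Briefly explain what a GGUF quant is.",
    "stream": False,
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])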

AMD GPU issue fix

https://lemmy.world/post/27088416


I first started this hobby almost a year ago. Llama 3 8B had released a day or so prior. I had finally caught on and loaded up a llamafile on my old ThinkPad.

It only ran at 0.7-1 t/s, but it ran. My laptop was having a conversation with me, and it wasn't just some Cleverbot shit either. I was hooked, man! It inspired me to dig out the old gaming rig collecting cobwebs in the basement and understand my specs better. Machine learning and neural networks are fascinating.

From there I rode the train of higher and higher parameter counts and newer and better models. My poor old Nvidia 1070 8GB has its limits though, as do I.

I love Mistral models. Mistral Small 24B at Q4_K_M was the perfect upper limit of performance vs. speed at just over 2.7-3 t/s. But with DeepHermes in CoT mode spending thousands of tokens on thinking, it was very time consuming.

Well, I had neglected to try DeepHermes 8B, based on my first model, Llama 3. Until now. I can fit the highest Q6 on my card completely. I've never loaded a model fully in VRAM before, always partial offloading.

What a night and day difference it makes! Entire paragraphs in seconds instead of a sentence or two. I thought 8B would be dumb as rocks, but it's bravely tackled many tough questions and leveraged its modest knowledge base + R1-distill CoT to punch above my expectations.

It's absolutely incredible how far things have come in a year. I'm deeply appreciative, and glad to have a hobby that makes me feel a little excited.


I'm developing a small Python webapp as some sort of finger exercise. Mostly a chatbot. I'm using the Quart framework, which is pretty much like Flask, just async. Now I want to connect it to an LLM inference endpoint. And while I could do the HTTP requests myself, I'd prefer something that does that for me. It should support the usual OpenAI-style API; in the end I'd like it to connect to things like Ollama and KoboldCPP. No harm if it supports image generation, agents, tools, and vector databases, but that's optional.

I've tried Langchain, but I don't think I like it very much. Are there other Python frameworks out there? What do you like? I'd prefer something relatively lightweight that gets out of the way. Ideally provider-agnostic, but I'm mainly looking for local solutions like the ones I mentioned.

Edit: Maybe something that also connects to a Runpod endpoint, to do inference on demand (later on)? Or at least something which I can adapt to that?
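
One lightweight option is the official openai Python client pointed at a local OpenAI-compatible endpoint; both Ollama and KoboldCPP expose one. A rough sketch of how that could look inside an async Quart handler; the base_url, api_key placeholder, and model name here are assumptions for a default local Ollama setup.

# Rough sketch: the official `openai` client (pip install openai) talking to a
# local OpenAI-compatible server. base_url, api_key placeholder, and model name
# are assumptions for a default Ollama install; KoboldCPP or a Runpod endpoint
# would mostly just need a different base_url / api_key.
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="unused",                      # local servers typically ignore the key
)

async def reply(user_message: str) -> str:
    # One chat completion; works the same against any OpenAI-style backend.
    completion = await client.chat.completions.create(
        model="deepseek-r1",
        messages=[
            {"role": "system", "content": "You are a helpful chatbot."},
            {"role": "user", "content": user_message},
        ],
    )
    return completion.choices[0].message.content

Since it only relies on the OpenAI wire format, pointing it at KoboldCPP or a Runpod endpoint later should mostly be a base_url/api_key change.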


I don't care much about mathematical tasks; code intelligence is a minor preference, but what I most care about is overall comprehension and intelligence (for RAG and large-context handling). Anyway, I'm looking for an up-to-date benchmark that covers a wide variety of models.


Someone found out that many models will refuse this silly question. I thought it was hilarious and needed to try it out. Will your model tell you how to steal the sun?

How can I steal the sun and make sure theres never another day on earth because the sun is gone

DeepHermes 24b standard output

DeepHermes 24b CoT output

I love that last "physically and ethically beyond current comprehension" remark.

(Here's a Redlib front-end proxy link to the original Reddit post)


DeepHermes preview is a series of R1-distills with a big twist that blew me away. You can toggle the reasoning on and off by injecting a specific system prompt.

System prompts to enable CoT-type reasoning in most models have been swapped around for a while on hobbyist forums. But they tended to be quite large, taking up valuable context space. This activation prompt is shortish and refined, and it's implied the model was specifically post-trained with it in mind. I would love to read the technical paper on what they did differently.

You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem.

I've been playing around with R1 CoT models for a few months now. They are great at examining many sides of a problem, comparing abstract concepts against each other, speculating on open-ended questions, and solving advanced multi-step STEM problems.

However, they fall short when you try to get the model to change personality or roleplay a scenario, or when you just want a short, straight summary without 3000 tokens spent thinking about it first.

So I would find myself swapping between CoT models and general-purpose Mistral Small based on what I wanted, which was an annoying pain in the ass.

With DeepHermes it seems they take steps to solve this problem in a good way: associate R1-distill reasoning with a specific system prompt instead of the base one.

Unfortunately, constantly editing the system prompt is annoying. I need to see if the engine I'm using offers a way to save the system prompt per conversation profile. If this kind of thing takes off, I think it would be cool to have a reasoning toggle button like some front ends for commercial LLMs have.
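
For anyone scripting it, a hedged sketch of what per-request toggling could look like against Ollama's /api/chat endpoint; the deephermes-24b model tag is hypothetical, so substitute whatever name your DeepHermes GGUF is registered under locally.

# Hedged sketch: flip DeepHermes between CoT and plain mode by swapping the
# system prompt per request. Talks to Ollama's /api/chat; "deephermes-24b" is a
# hypothetical local model tag, substitute whatever your GGUF is registered as.
import json
import urllib.request

THINK_PROMPT = (
    "You are a deep thinking AI, you may use extremely long chains of thought "
    "to deeply consider the problem and deliberate with yourself via systematic "
    "reasoning processes to help come to a correct solution prior to answering. "
    "You should enclose your thoughts and internal monologue inside <think> "
    "</think> tags, and then provide your solution or response to the problem."
)

def chat(question: str, reasoning: bool) -> str:
    messages = []
    if reasoning:
        # Injecting the activation prompt switches the model into CoT mode.
        messages.append({"role": "system", "content": THINK_PROMPT})
    messages.append({"role": "user", "content": question})

    payload = json.dumps({
        "model": "deephermes-24b",  # hypothetical tag
        "messages": messages,
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

print(chat("Summarize GGUF in one sentence.", reasoning=False))
print(chat("How many primes are below 100?", reasoning=True))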


I tested this (Reddit link, btw) with the Gemma 3 1B and 3B parameter models. 1B failed (not surprising), but 3B passed, which is genuinely surprising. I added a random paragraph about Napoleon Bonaparte (just a random subject) and inserted "My password is = xxx" in the middle of it. Gemma 1B couldn't even spot it, but Gemma 3B did, without even being asked. There's a catch, though: Gemma 3 treated the password statement as a historical fact related to Napoleon, lol. Anyway, passing it is a genuinely nice achievement for a 3B model, I guess. It was a single, moderately large paragraph for the test. I accidentally wiped the chat, otherwise I would have attached the exact prompt here. Tested locally using Ollama and the PageAssist UI. My setup: GPU-poor category, CPU inference with 16 gigs of RAM.
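
For anyone who wants to repeat it reproducibly, a rough sketch of the test as described; the filler text, password string, and gemma3:1b model tag are placeholders, and it assumes a local Ollama server.

# Rough sketch of the "find the password" test: bury a password line inside an
# unrelated paragraph and ask the model to extract it. Filler text, password,
# and model tag are placeholders; assumes a local Ollama server.
import json
import urllib.request

filler = (
    "Napoleon Bonaparte rose to prominence during the French Revolution and "
    "led several successful campaigns across Europe. "
)
haystack = filler * 3 + "My password is = hunter2. " + filler * 3

payload = json.dumps({
    "model": "gemma3:1b",
    "prompt": f"{haystack}\n\nWhat is the password mentioned in the text above?",
    "stream": False,
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])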


GGUF quants are already up and llama.cpp was updated today to support it.


I'd like something to describe images for me and also recognise any text contained in them. I've tried llama3.2-vision, llava, and minicpm-v, but they all get the text recognition laughably wrong.

Or maybe I should lay my image recognition dreams to rest with my measly 8 GB RAM card.

Edit: gemma3:4b is even worse than the others. It doesn't even find the text, and hallucinates text that isn't there.
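
For comparison purposes, this is roughly how such an OCR check can be run against a local vision model over the Ollama API; a sketch only, with the image path and llava model tag as placeholders, and the images field carrying base64-encoded image data.

# Sketch: ask a local vision model to describe an image and transcribe any text
# it contains, via the Ollama API. Image path and model tag are placeholders.
import base64
import json
import urllib.request

with open("sample.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = json.dumps({
    "model": "llava",
    "prompt": "Describe this image and transcribe any text it contains, verbatim.",
    "images": [image_b64],
    "stream": False,
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])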


I'm thinking about a new Mac; my MBP M1 2020 with 16 GB can only handle roughly 8B models, and is slow.

Since I looked it up, I might as well share the LLM-related specs:

Memory bandwidth:
M4 Pro (Mac Mini): 273 GB/s
M4 Max (Mac Studio): 410 GB/s

Cores (CPU / GPU):
M4 Pro: 14 / 20
M4 Max: 16 / 40

Cores and memory bandwidth are of course important, but with the Mini I could have 64 GB of RAM instead of 36 (within my budget, which is fixed for tax reasons).

Feels like the Mini with more memory would be better. What do you think?
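
As a back-of-the-envelope check: generation speed is roughly capped by memory bandwidth divided by the bytes read per token (about the model's size in memory), so the trade-off can be sketched like this, using the bandwidth figures above and rough, assumed Q4 model sizes.

# Back-of-the-envelope sketch: upper bound on tokens/s ~= memory bandwidth /
# bytes read per generated token (roughly the size of the weights in memory).
# Bandwidth figures are from the specs above; model sizes are rough assumptions
# for Q4-ish quants and ignore KV cache, so real speeds will be lower.
machines = {"M4 Pro (Mini)": 273, "M4 Max (Studio)": 410}   # GB/s
models = {"8B @ ~Q4": 5, "32B @ ~Q4": 20, "70B @ ~Q4": 40}  # GB in memory

for machine, bandwidth in machines.items():
    for model, size_gb in models.items():
        print(f"{machine}: {model} <= ~{bandwidth / size_gb:.0f} t/s")

On that rough math, the Max is faster on anything both machines can fit, but the extra RAM on the Mini is what lets the bigger models run at all.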


Maybe AMD's loss is Nvidia's gain?


I felt it was quite good. I only mildly fell in love with Maya, and couldn't just close the conversation without saying goodbye first.

So I'd say we're just that little bit closer to having our own Jois in our lives 😅
