by @Aukejw
+1
bartowski/Qwen_Qwen3.5-2B-GGUF Q4_K_M
llama-server

Good tool-calling model for limited compute settings

chat tool-use

Setup details

I'm using a Raspberry pi 5 for home automation. I'm storing some local data on it, and I'd like to search and edit (think address books, private documents) with custom grep-based tools.

This model works surprisingly well for its size. It's not fast enough to read documents, and the pi does get warm, but it runs and is useful at around 7.5 tokens/s generation!

I tried MTP as well but this reduced speed to +/- 4.0 tokens/s, looks like the benefit of the draft model is limited at this level of compute.

Setup:

llama-server \
 --ctx-size 4096
 --threads 4 \
 --temp 0.6 \
 --cache-type-k q8_0 \
 --cache-type-v q8_0
Cortex-A76
barely
likely
1x