Recipe #3 — Should I run this model?

I'm using a Raspberry pi 5 for home automation. I'm storing some local data on it, and I'd like to search and edit (think address books, private documents) with custom grep-based tools.

This model works surprisingly well for its size. It's not fast enough to read documents, and the pi does get warm, but it runs and is useful at around 7.5 tokens/s generation!

I tried MTP as well but this reduced speed to +/- 4.0 tokens/s, looks like the benefit of the draft model is limited at this level of compute.

Setup:

llama-server \
 --ctx-size 4096
 --threads 4 \
 --temp 0.6 \
 --cache-type-k q8_0 \
 --cache-type-v q8_0

Setup details