
The ShiftMaker

AI Intelligence Daily

Google's Gemini got distilled down to 26M parameters for wearables

Cactus-Compute's Needle model handles function calls on-device at 6,000 tokens/sec prefill — no cloud round-trip. Finetuning runs locally via a one-command web UI.

Published 13 May 2026 · ID 2026-05-13-google-s-gemini-got-distilled-down-to-26m-parameters-for-wearables

Cactus-Compute distilled Google's Gemini into a 26M-parameter model called Needle, built on a Simple Attention Network architecture designed for phones, watches, and glasses. The weights are fully open on the Cactus-Compute/needle repo, along with the dataset generation code.

On production hardware, Needle hits 6,000 tokens/sec prefill and 1,200 tokens/sec decode — fast enough that the cloud round-trip stops being the default assumption for function-call tasks. On single-shot function calling, it beats FunctionGemma-270m, Qwen-0.6B, Granite-350m, and LFM2.5-350m, each at least 10x its size.
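To see why those throughput numbers make local inference competitive, here is a back-of-envelope latency estimate using the published prefill and decode rates. The prompt and output token counts are illustrative assumptions, not benchmarks from the repo:

```python
# Rough end-to-end latency for one on-device function call, using Needle's
# published throughput: 6,000 tok/s prefill, 1,200 tok/s decode.
PREFILL_TPS = 6_000   # tokens/sec, prompt ingestion
DECODE_TPS = 1_200    # tokens/sec, generation

def call_latency(prompt_tokens: int, output_tokens: int) -> float:
    """Seconds to prefill the prompt and decode the function-call output."""
    return prompt_tokens / PREFILL_TPS + output_tokens / DECODE_TPS

# Assumed sizes: a 512-token prompt (system + tool schemas) and a
# 64-token JSON function call.
latency = call_latency(512, 64)
print(f"{latency * 1000:.0f} ms")  # ~139 ms
```

At roughly 139 ms for a medium-sized prompt, the on-device path undercuts the typical cloud round-trip before network latency is even counted.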

The repo is upfront about the tradeoff: those larger models carry more capacity and hold up better in conversational settings. Needle is narrow. Small models are also finicky, and single-shot benchmarks don't tell the whole story.

Finetuning runs locally via a one-command web UI at 127.0.0.1:7860, with weights auto-downloaded. If you're building anything that needs on-device function calls, clone it and test your own tools today.
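If you do clone it to test your own tools, the harness can be as small as a dispatcher that parses the model's single-shot call and invokes the matching function. Everything below — the JSON call shape, the `run_model` stub, and the `set_timer` tool — is a hypothetical sketch; check the Cactus-Compute/needle repo for the model's actual I/O format:

```python
# Hypothetical test harness for a local function-calling model.
# The call format and run_model() are assumptions for illustration only.
import json

# Your tools, keyed by the name the model is expected to emit.
TOOLS = {
    "set_timer": lambda minutes: f"timer set for {minutes} min",
}

def run_model(prompt: str) -> str:
    # Stand-in for the real on-device inference call; returns a canned response.
    return '{"name": "set_timer", "arguments": {"minutes": 5}}'

def dispatch(raw: str) -> str:
    """Parse a single-shot function call and invoke the matching tool."""
    call = json.loads(raw)
    fn = TOOLS.get(call["name"])
    if fn is None:
        raise KeyError(f"model requested unknown tool: {call['name']}")
    return fn(**call["arguments"])

print(dispatch(run_model("Set a timer for five minutes")))  # timer set for 5 min
```

Swapping the canned `run_model` for a real inference call is the only change needed to turn this into a live check of your own tool schemas.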
