OpenClaw + Gemma 4

Local agentic AI on a 2017 iMac

Building an autonomous AI agent that texts me through Telegram, searches the web, and runs entirely on consumer hardware. No cloud. No API keys. Just a 9-year-old desktop with a 4 GB GPU and some stubbornness.

The Stack

Model

Gemma 4 E4B

7.5B dense · Q4_K_M · 4.5 GB

Inference

llama.cpp v38

Vulkan/MoltenVK · 7.46 tok/s

Orchestration

OpenClaw

Gateway + tool-calling + compaction

Interface

Telegram Bot

ClawdyDawdy · web search · always on

The Machine

2017 iMac 27"

CPU

i7-7700K

RAM

40 GB DDR4

GPU

Radeon Pro 575

VRAM

4 GB GDDR5

The GPU runs via Vulkan/MoltenVK— not Metal (crashes on discrete AMD), not ROCm (Linux-only), not Ollama (can't see it). Three hours of empirical testing turned "worthless for LLMs" into "conversational speed for 7.5B models."

The Journey

Apr 6–10Done

GPU Resurrection

Everyone said the Radeon Pro 575 was dead for LLMs. Metal crashes on discrete AMD GPUs. ROCm is Linux-only. Ollama can't see it. So I compiled llama.cpp with the LunarG Vulkan SDK — MoltenVK, the layer nobody tests — and the GPU appeared. E4B went from 7.3 tok/s (CPU) to 37.6 tok/s on Vulkan. 5.1× speedup from hardware everyone wrote off.

Apr 10–11Done

403 Tests, 4 Models, Zero Hand-Holding

Built an automated benchmark pipeline that ran overnight. 13 hours, no human intervention. The 31B model scored 94% despite everyone saying it couldn't run on this hardware. Published the full source code, every prompt, every grading function, every raw JSON result. MIT licensed.

Apr 12Done

From Benchmarks to Bot

Wired up OpenClaw to the local Vulkan-accelerated llama-server and built a Telegram bot. The E4B model handles tool-calling flawlessly. We stopped testing models and started deploying agents. Named the bot ClawdyDawdy because why not.

Apr 14Done

Full Pipeline Validated

Asked the agent to search the web for Google I/O 2026 dates. Seven minutes later, Telegram delivered the answer: May 19–20, Shoreline Amphitheatre, Mountain View. The full chain: Telegram → OpenClaw → Gemma 4 E4B on Vulkan GPU → DuckDuckGo search → formatted response → Telegram. All on a 2017 iMac.

Apr 14Blocked

The MoE Bug Hunt

Tried to upgrade to the 26B-A4B MoE model for better quality. Both unsloth and bartowski quants produced infinite <unused50> tokens. Tested 7 configurations exhaustively. Discovered it's a Vulkan compute shader bug in MoE expert routing — the dense E4B works perfectly on the same binary. Filed reproduction data on llama.cpp #21516. Waiting for upstream fix.

NextActive

E4B Optimization & System Prompt Tuning

Making the daily-driver pipeline as good as possible. System prompt tuning for personality and brevity, multi-step agentic tasks (search → fetch → summarize), and reliability hardening for the Telegram integration.

NextNext

Tiered Model Routing

Running a fast E2B as the conversational router in GPU VRAM, delegating heavy research tasks to the E4B. Like having a receptionist who's quick with answers and a researcher who's thorough with analysis.

What Actually Works

✅ Validated Capabilities

●Web search via DuckDuckGo — no API keys needed
●Tool-calling chains (search → parse → respond)
●Context compaction — prevents overflow crashes on long conversations
●Telegram bot — always-on, texts responses to phone
●Auto-restart via launchd — survives reboots

❌ What Doesn't Work (Yet)

●26B MoE models on Vulkan — llama.cpp shader bug (#21516)
●Flash attention on AMD — Metal/CUDA only
●Models over 4 GB VRAM — won't fit on Radeon Pro 575

Myths Debunked

"You need Apple Silicon for local LLMs"

A 2017 Intel iMac with Vulkan runs Gemma 4 at 7.5 tok/s — fast enough for an autonomous agent.

"20+ tok/s on 26B with --cpu-moe"

We measured 0.5–10 tok/s. And the output was garbage tokens. Performance claims without hardware specs are worthless.

"Just use the bartowski quants, they fix the bug"

Tested bartowski Q3_K_M. Same <unused50> tokens. The bug is in Vulkan's MoE shaders, not the quantization.

"--flash-attn works on AMD"

Flash attention requires Metal (Apple Silicon) or CUDA (NVIDIA). Not available on Vulkan.

This page is a living document. I update it as the pipeline evolves. Follow the journey in the builder's log, or check the Gemma 4 benchmarks for raw performance data.