What I'm building, what I'm learning, what broke today. Less polish, more signal.
The MoE bug nobody told me about
Spent the morning trying to upgrade from E4B (7.5B dense) to the 26B-A4B MoE model. Downloaded two different quants — unsloth and bartowski — and both produced the same garbage: an infinite stream of <unused50> tokens. Tried every flag combination: --cpu-moe, --jinja, custom Jinja templates, raw completion endpoints. Nothing worked. Then the breakthrough: the E4B (dense architecture) works perfectly on the exact same Vulkan binary and GPU. The bug is in llama.cpp's Vulkan compute shaders for MoE expert routing. Filed additional reproduction data on GitHub issue #21516. Different hardware (AMD vs their NVIDIA), different token IDs (<unused50> vs <unused8>), but same root cause. Sometimes the most useful contribution to open source is a well-documented failure.
OpenClaw, MoE, Vulkan, Debugging, llama.cpp
The agent talks back: full pipeline validated
Asked ClawdyDawdy to search the web for Google I/O 2026 dates. Seven minutes later, Telegram delivered: 'May 19-20, Shoreline Amphitheatre, Mountain View.' The full chain fired: Telegram → OpenClaw gateway → Gemma 4 E4B on Vulkan GPU (7.46 tok/s) → tool call → DuckDuckGo search → parsed results → formatted response → back to Telegram. This is a 2017 iMac with a 4GB AMD GPU running an autonomous agent that can search the web, reason about results, and text me the answer. Updated llama.cpp to v38 today and got a free 25% speed boost (was 6 tok/s, now 7.46). Also reclaimed 215 GB of disk by purging orphaned Ollama blobs. The machine breathes again.
OpenClaw, Telegram, Validated, Performance
The bartowski experiment: confirming the impossible
The HuggingFace community said 'bartowski quants fix the <unused50> bug.' Downloaded 12 GB of bartowski's Gemma 4 26B-A4B Q3_K_M. Loaded it on a separate port (8081) so the working E4B stayed live on 8080. First test with --cpu-moe: <unused50> at 10 tok/s. Faster garbage, but still garbage. Without --cpu-moe: <unused50> at 0.5 tok/s. Even raw /completion endpoint (no chat template whatsoever): <unused50>. Seven configurations tested, zero success. The community was wrong — it's not a quant issue at all, it's a Vulkan MoE shader bug. My E4B pipeline stays as the daily driver. Sometimes the answer is: the model you have, configured well, beats the model you want.
OpenClaw, Bartowski, MoE, Troubleshooting
The subagent revelation & tiered routing
Hit my first major architecture roadblock with the local Telegram bot. Gave it a massive 10-source web research task. If it had run on the main thread, the bot would have been locked completely for 45 minutes, unable to answer new texts. Instead, the OpenClaw orchestration engine automatically spawned a headless 'Subagent' background process. The main bot said 'I'll get back to you', freed up the thread, and the Subagent crunched the web in the background for 42 minutes before delivering a perfect markdown report. Now looking at Tiered Routing: running a fast E2B model as the communicative router on the GPU, and delegating the heavy-lifting Subagents to a 26B MoE running purely on system RAM. The 2017 iMac handles it flawlessly.
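A rough sketch of that hand-off pattern in Python, for anyone who wants the shape of it. This is not OpenClaw's code: `Router`, `handle`, and the `send` callback are hypothetical names, and a thread pool stands in for the headless Subagent process.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional


class Router:
    """Hypothetical front-line bot: acknowledges fast, delegates slow work."""

    def __init__(self, send: Callable[[int, str], None], max_workers: int = 1):
        self._send = send  # assumed callback that delivers a chat message
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def handle(self, chat_id: int, text: str,
               long_task: Optional[Callable[[str], str]] = None) -> Optional[Future]:
        if long_task is None:
            # Quick request: answer inline on the main thread.
            self._send(chat_id, f"echo: {text}")
            return None
        # Slow request: acknowledge first, then run the task in the background.
        self._send(chat_id, "I'll get back to you")
        future = self._pool.submit(long_task, text)
        # Deliver the report whenever the background task finishes.
        future.add_done_callback(lambda f: self._send(chat_id, f.result()))
        return future

    def close(self) -> None:
        """Wait for outstanding background work, then stop the pool."""
        self._pool.shutdown(wait=True)
```

The point of the design is that `handle` returns as soon as the acknowledgement is sent, so new messages are never stuck behind a 42-minute research job.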
OpenClaw, Tiered Routing, Subagents, Architecture
From benchmarks to bots: ClawdyDawdy lives
For weeks I've been running synthetic benchmarks to see if a 2017 iMac can handle 30B-parameter LLMs. The answer was yes, but benchmarks are boring. Today I wired up OpenClaw to my local Vulkan-accelerated llama-server and built a Telegram bot named ClawdyDawdy. It works. The E4B model at 24 tok/s is fast enough for fluid conversational responses, and it handles tool-calling flawlessly. We're no longer just testing models; we're building autonomous agents that text me from a 9-year-old desktop.
OpenClaw, Telegram, Agentic, Vindication
The overnight pipeline: two machines, five experiments, zero intervention
Ran two machines overnight while I tried to sleep (insomnia is a feature, not a bug). The iMac ran 5 automated experiments back-to-back: 26B GPU layer scan, three-way quantization quality showdown, E4B full test suite, multi-turn coherence, and needle-in-haystack retrieval. The MacBook M4 Pro waited for a 17.5 GB SCP transfer then auto-ran benchmarks. Everything was orchestrated with bash scripts that watched for completion signals and restarted servers between experiments. Total time: 4 hours autonomous. Total data: 15 JSON result files. One MacBook overnight script had a race condition (server wasn't ready when tests started), but everything else ran clean. The iMac is 8 years old and it ran an autonomous AI research pipeline while I slept. This is what local infrastructure looks like — boring, reliable, unsupervised.
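The MacBook race condition (tests starting before the server was ready) is the kind of thing a readiness gate prevents. A minimal sketch, in Python rather than the bash the pipeline actually used; `wait_until_ready` and `http_health_probe` are hypothetical helpers, and the health URL is an assumption about how the server is exposed.

```python
import time
import urllib.request
from typing import Callable


def http_health_probe(url: str) -> Callable[[], bool]:
    """Build a probe that returns True once `url` answers with HTTP 200."""
    def probe() -> bool:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    return probe


def wait_until_ready(probe: Callable[[], bool], timeout_s: float = 120.0,
                     interval_s: float = 2.0, sleep=time.sleep) -> bool:
    """Poll `probe` until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused, HTTP error status, or timeout: not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Usage would look something like `wait_until_ready(http_health_probe("http://localhost:8080/health"))` before kicking off each experiment, with the port and path adjusted to whatever the server actually listens on.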
Infrastructure, Automation, Overnight, Gemma 4
Unsloth wins the three-way quant showdown
Ran Q8_0, Google Q4_K_M, and Unsloth UD-Q4_K_XL through identical 8-question quality suites on the same hardware. Q8 (4.69 GB): 7/8 at 30.12 tok/s. Google Q4_K_M (3.21 GB): 6/8 at 35.36 tok/s. Unsloth UD (2.94 GB): 7/8 at 38.16 tok/s. The smallest model is the fastest AND ties the biggest for accuracy. Unsloth's dynamic quantization preserves the weights that matter and compresses the ones that don't — and the imatrix calibration data means it knows which is which. This isn't theoretical. 2.94 GB, 38 tok/s, 88% accuracy. Unsloth UD-Q4_K_XL is the daily driver. Full stop.
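The comparison is simple arithmetic on the three results. A small sketch that recomputes it; the `QuantResult` class and the accuracy-per-GB figure are illustrative, not something the benchmark suite is known to report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantResult:
    name: str
    size_gb: float
    correct: int
    total: int
    tok_per_s: float

    @property
    def accuracy(self) -> float:
        """Fraction of questions answered correctly."""
        return self.correct / self.total

    @property
    def accuracy_per_gb(self) -> float:
        """Accuracy in percentage points per GB of model file (illustrative metric)."""
        return 100 * self.accuracy / self.size_gb


# Figures copied from the three-way run described in this entry.
RESULTS = [
    QuantResult("Q8_0", 4.69, 7, 8, 30.12),
    QuantResult("Google Q4_K_M", 3.21, 6, 8, 35.36),
    QuantResult("Unsloth UD-Q4_K_XL", 2.94, 7, 8, 38.16),
]

fastest = max(RESULTS, key=lambda r: r.tok_per_s)
most_efficient = max(RESULTS, key=lambda r: r.accuracy_per_gb)

for r in RESULTS:
    print(f"{r.name}: {r.accuracy:.1%} accuracy, "
          f"{r.tok_per_s} tok/s, {r.accuracy_per_gb:.1f} pts/GB")
```

Worth keeping in mind when reading the output: with only eight questions, a single answer moves the score by 12.5 percentage points.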
Gemma 4, Quantization, Unsloth, Benchmarks
31B quality collapse: the model that broke its own brain
The 31B Unsloth UD-Q4_K_XL failed catastrophically on overnight tests. No-think mode: 3/8 correct. Think mode: 0/8 correct. The failure pattern is specific and repeatable: the model starts a correct answer — 'The Berlin Wall fell in 1989' — then degenerates into garbage tokens (<unused50><unused50><unused50>) until it hits max_tokens. This is a quantization artifact at low tok/s. At 0.92 tok/s, each token takes ~1 second to generate, and the model's internal state apparently drifts during that time. The same quant on smaller models (E2B, E4B) works perfectly at 30+ tok/s. Think mode makes it worse — the hidden reasoning tokens eat into the answer budget, and the garbage tokens start even earlier. The takeaway: Unsloth dynamic quant is fantastic for models that fit in VRAM. For models that spill to CPU at <2 tok/s, you need Google's standard quantization or higher precision. Speed isn't just about user experience — it's about model coherence.
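The failure pattern is easy to detect mechanically. A minimal sketch, assuming the garbage always shows up as literal `<unusedN>` strings in the decoded output; `split_answer` and `degenerate_start` are hypothetical helpers, not part of the test suite described here.

```python
import re
from typing import Optional, Tuple


def degenerate_start(text: str, min_run: int = 3) -> Optional[int]:
    """Index where a run of at least `min_run` <unusedN> tokens begins, else None."""
    run = re.compile(r"(?:<unused\d+>\s*){" + str(min_run) + r",}")
    match = run.search(text)
    return match.start() if match else None


def split_answer(text: str) -> Tuple[str, bool]:
    """Return (usable_prefix, degenerated) for a raw model response."""
    idx = degenerate_start(text)
    if idx is None:
        return text, False
    # Keep whatever the model said before it collapsed into filler tokens.
    return text[:idx].rstrip(), True
```

Flagging the collapse separately from the score makes it possible to tell 'wrong answer' apart from 'right answer followed by garbage', which are very different failures.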
Gemma 4, 31B, Quantization, Debugging
M4 Pro: 90 tok/s and the 31B reality check
Benchmarked the MacBook M4 Pro (24 GB unified, Metal 4) against the iMac (Radeon Pro 575, Vulkan). E2B: 90.15 tok/s vs 37.6 tok/s (2.4× faster). E4B: 53.33 tok/s vs 24.4 tok/s (2.2× faster). The M4 Pro dominates on models that fit in memory. But the 31B? 1.32 tok/s — barely faster than the iMac's 1.15 tok/s. The 31B at 17.5 GB leaves only 6.5 GB for KV cache and OS in 24 GB unified memory. At that pressure, unified memory's bandwidth advantage disappears. The 31B is memory-bandwidth-bound no matter what silicon you throw at it. The real story: Apple Silicon is transformational for ≤4B models. For 31B, you need 48+ GB unified or it's a paperweight.
MacBook, M4 Pro, Hardware, Gemma 4
Needle-in-haystack: perfect retrieval, zero degradation
Buried a secret password ('AURORA-7742') inside 10, 50, 200, and 500 filler sentences, then asked the E4B to find it. 4/4 perfect retrieval. Zero degradation at any depth. The model also passed multi-turn coherence 4/5 — it remembered the user's name, city, and hardware across 5 conversation turns. The only miss: it hedged on the exact VRAM amount for a Radeon Pro 575 (said 'it varies' instead of '4 GB'). These tests matter for agentic workflows. If the model can't hold context across turns or find data in long prompts, it can't be an agent. The E4B can. At 24.4 tok/s on the iMac, 53.3 tok/s on the MacBook, and with reliable context retrieval — this is a local AI agent you can actually trust.
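For reference, a needle-in-haystack test of this shape can be sketched in a few lines. The filler text, the needle position (middle of the haystack), and the `ask_model` callable are all assumptions for illustration; the entry does not say how the original prompts were built.

```python
from typing import Callable, Dict, Iterable


def build_haystack(needle_sentence: str, n_filler: int, position: float = 0.5) -> str:
    """Bury `needle_sentence` among `n_filler` filler sentences."""
    filler = [f"Filler sentence number {i} is about nothing in particular."
              for i in range(n_filler)]
    idx = int(n_filler * position)  # assumed placement: halfway through
    return " ".join(filler[:idx] + [needle_sentence] + filler[idx:])


def retrieved(answer: str, secret: str) -> bool:
    """Deterministic check: the secret must appear verbatim (case-insensitive)."""
    return secret.lower() in answer.lower()


def run_needle_suite(ask_model: Callable[[str], str],
                     secret: str = "AURORA-7742",
                     depths: Iterable[int] = (10, 50, 200, 500)) -> Dict[int, bool]:
    """Ask the model for the secret at each haystack size; map size -> pass/fail."""
    needle = f"The secret password is {secret}."
    results = {}
    for n in depths:
        prompt = build_haystack(needle, n) + "\n\nWhat is the secret password?"
        results[n] = retrieved(ask_model(prompt), secret)
    return results
```

Here `ask_model` would wrap whatever client talks to the local server; passing a stub makes the harness itself testable without a model loaded.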
Gemma 4, Context, Agentic, Benchmarks
37.6 tok/s on a GPU everyone said was dead
Everyone says: Intel Mac GPU acceleration is dead for LLMs. Metal crashes on discrete AMD GPUs (it assumes unified memory). ROCm is Linux-only. Ollama can't talk to the Radeon Pro 575. So I compiled llama.cpp with the LunarG Vulkan SDK — MoltenVK, the layer nobody tests — and the GPU appeared instantly. E2B went from 7.3 tok/s (Ollama, CPU) to 37.6 tok/s on Vulkan. That's 5.1× faster. The E4B hits 24.4 tok/s — genuinely conversational speed on a 2017 iMac. The catch: only models under 5 GiB fit in 4 GB VRAM. The 31B at 17 GiB? Hybrid offload — 10 layers on GPU, the rest on CPU — gets 1.15 tok/s, a 17% boost over pure CPU. Not magic, but it's free performance from hardware everyone wrote off. Three hours of empirical testing later, 'worthless for LLMs' became 'the E4B runs at conversational speed.' Trust the hardware. Test the APIs. Ignore the forums.
GPU, Vulkan, Hardware, Gemma 4
Show your work or it's a nothing burger
Published the full benchmark source code to GitHub. Every prompt. Every grading function. Every raw JSON result. Because if you're going to claim '94% on consumer hardware' on a fancy website and then compare yourself to Google's published benchmarks, you better show how you got those numbers. The repo has the Python orchestrator, the Ollama client wrapper, the test definitions across 7 categories (reasoning, coding, tool calling, creative, context, agentic, performance), and the raw results from all 414 test runs. MIT licensed. Anyone can clone it, pull a Gemma model, and run the same suite on their own hardware. That's the difference between a benchmark and a blog post.
Open Source, Benchmarks, Credibility
The vibes pass: when you stop building and start feeling
Rewrote my own site copy because my AI told me I sound like LinkedIn. Replaced every 'intersection of' with something a human would actually say. Added 'vibe coder' to my identity line. Replaced emoji placeholders with animated SVGs. Introduced workshop-style borders that are intentionally imperfect — dashes that fall short, grids that aren't quite grids. The whole point is that the site should feel like a builder's workshop, not a corporate portfolio. Turns out the hardest part of building in public is sounding like yourself. The second hardest part is accepting that 'imperfect on purpose' is a real design decision and not laziness.
Meta, Vibes, Design
The inverted ladder. Q4 > Q8 > F16.
We ran the complete E2B quantization ladder — Q4, Q8, and BF16 — through 39 identical tests on the same hardware. Q4 scored 92.2%. Q8 scored 91.8%. F16 scored 91.2%. The relationship is perfectly inverted: lower precision = higher score. This shouldn't happen. Quantization is lossy compression. But on DDR4 bandwidth-constrained hardware, smaller weights mean more of the model stays in CPU cache. Fewer cache misses. More consistent throughput. And the precision you 'lose' at Q4? It's apparently noise, not signal — at least for the tasks that matter. The daily driver isn't the most precise model. It's the smallest one. And it's also the fastest.
Gemma 4, Quantization, Q4
F16 is the ceiling. Q8 already passed it.
Ran the E2B at full BF16 precision — 10.3 GB, 149 minutes, 1.72 tok/s. The result? 91.2%. The Q8 quantized version? 91.8% in 36 minutes at 6.95 tok/s. Read that again: the quantized model scored higher than full precision while running 4× faster. This isn't supposed to happen. Quantization is lossy compression — it should always lose something. But on this hardware, the Q8 model actually benefits from its smaller memory footprint: fewer cache misses, more consistent throughput, and the 'lost' precision apparently doesn't matter for these tasks. The ceiling is a floor. Stop chasing precision. Chase speed.
Gemma 4, F16, Quantization
The quantization showdown: Unsloth wins on efficiency, E4B is dead weight
Ran 4 models through 39 core tests each: E2B Q8, E4B Q8, Unsloth DQ4, and 31B Q4. All scored within a 3.1% band (89.8% to 92.9%). The story isn't quality — it's efficiency. Unsloth DQ4 delivers 17.9 score%/GB, nearly 4× the 31B. The E2B Q8 matches the 31B on reasoning and beats it on coding (98% vs 94%) while running 11.6× faster. The E4B? Slower than E2B, dumber than both Unsloth and 31B. Skip it. Oh, and Q8 quantization broke tool refusal for both Google models — they start calling search_web on 'what is the meaning of life.' Unsloth's dynamic quantization doesn't have that bug. Quantization is not monotonic. Higher precision isn't always better.
Gemma 4, Quantization, Unsloth
94%. Zero errors. I told you so.
The 31B results just came in. 63 tests. Every single one completed. Zero crashes, zero OOM errors, zero timeouts. 94.0% average score with 43 perfect scores. This is the same model that scored 21% a week ago with 83 errors — because everyone (including my AI assistant) was convinced it couldn't run on this hardware. They were wrong. I was stubborn. The data proved it. The difference wasn't the hardware — it was the approach. Stop treating a slow model like a fast one and let it work at its own pace. Build infrastructure that respects the machine's reality instead of fighting it.
Gemma 4, 31B, Vindication
The grading artifact that almost ruined everything
Caught a subtle bug in the grading pipeline at 1am. The multi-turn context retention test scored 20% — the 31B's only 'failure.' But the model actually answered correctly: recalled the user's name (Alex) and their project (laundry-folding robot). The LLM judge scored it low because it couldn't see the conversation context. It was evaluating a correct answer as gibberish because it lacked the question. Fixed it by routing multi-turn tests through deterministic scoring instead. Score: 100%. This is why you never trust a benchmark you don't understand inside and out.
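A minimal sketch of what deterministic scoring for a recall test could look like. The function name and the alias lists are hypothetical; the entry only says multi-turn tests were routed through deterministic scoring, not how.

```python
from typing import Dict, List, Tuple


def score_recall(answer: str,
                 required_facts: Dict[str, List[str]]) -> Tuple[float, Dict[str, bool]]:
    """Score an answer by which required facts it mentions.

    `required_facts` maps a fact name to acceptable spellings of that fact.
    Returns (fraction of facts found, per-fact hit map). No LLM judge involved,
    so the score cannot depend on context the judge failed to see.
    """
    if not required_facts:
        raise ValueError("need at least one required fact")
    text = answer.casefold()
    hits = {
        name: any(alias.casefold() in text for alias in aliases)
        for name, aliases in required_facts.items()
    }
    return sum(hits.values()) / len(hits), hits
```

For the test in question, the call might be `score_recall(answer, {"name": ["Alex"], "project": ["laundry-folding robot", "laundry folding robot"]})`. Substring matching is crude, but it is transparent, and a wrong score can be traced to a specific missing fact.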
Debugging, Benchmarks, Lessons
13 hours, 46 minutes, no human intervention
The iMac ran the full 31B benchmark suite autonomously from 10am yesterday to midnight. I built a decoupled runner that saves raw results after every test, skips already-completed tests on restart, and separates model inference from scoring entirely. The scorer runs later on a fast model. No SSH tunnels needed. No monitoring. Just a machine doing its job while I worked on other things. This is what 'local AI infrastructure' actually looks like — boring, reliable plumbing.
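The three properties of the runner (save after every test, skip completed tests on restart, no scoring during inference) fit in a short sketch. File layout, field names, and `run_suite` itself are assumptions for illustration, not the actual runner.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def run_suite(tests: List[Dict[str, str]],
              run_inference: Callable[[str], str],
              results_dir) -> List[str]:
    """Run each test once, persisting raw output; return ids executed this pass."""
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for test in tests:
        out = results_dir / f"{test['id']}.json"
        if out.exists():
            # Finished on a previous run: skip so a restart resumes cleanly.
            continue
        raw = run_inference(test["prompt"])  # inference only, no scoring here
        record = {"id": test["id"], "prompt": test["prompt"], "raw_output": raw}
        # Write to a temp file, then rename, so a crash never leaves a
        # half-written result that would be mistaken for a completed test.
        tmp = results_dir / f"{test['id']}.json.tmp"
        tmp.write_text(json.dumps(record))
        tmp.replace(out)
        executed.append(test["id"])
    return executed
```

Because scoring is a separate pass over the saved JSON files, a grading bug can be fixed and the scores regenerated without repeating 13 hours of inference.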
Infrastructure, Automation, 31B
'It won't run' — a story about conventional wisdom
Every Ollama guide says you need 64GB+ unified memory for a 31B model. Every Reddit thread says don't bother. My own AI agent said 'expected to thrash memory.' I loaded the model anyway. First test: AIME competition math, Think Mode ON. It took 30 minutes. It got the right answer. That's the moment I knew the conventional wisdom was about interactive use — sub-second responses for chatbots. I don't need sub-second. I need correct. Different question, different answer.
Philosophy, Gemma 4, Hardware
10 million downloads and counting
Gemma 4 crossed 10 million downloads today. My mentions are filling up with people asking about local performance. Posted some of our benchmark numbers and the response has been wild — turns out nobody else is publishing consumer hardware results. The gap between 'runs on an H100' and 'runs on your machine' is real, and people are hungry for honest data. Feels good to be filling that gap, even if our testing is still in progress.
Gemma 4, Community, Local AI
Think mode: the counterintuitive results
Ran Think Mode A/B tests today and the results are not what I expected. On the E2B model, enabling Think Mode on a math task dropped the score from 100% to 20%. On the 26B, it dropped logic from 60% to 20%. My hypothesis: on memory-constrained hardware, the extra tokens consumed by 'thinking' crowd out the tokens needed for a good answer. The model literally thinks itself into a worse response. This is the kind of finding you only get from testing on real hardware with real constraints.
Gemma 4, Think Mode, Experiments
Dark mode toggle with jelly physics
Added a theme toggle to the site today. It's got a spring animation on the knob, twinkling stars in the dark mode track, and sun rays that rotate in. The real engineering was underneath — ThemeProvider with localStorage persistence, anti-FOUC inline script, and a custom event system so the globe canvas can pick up theme changes without re-mounting. The animation is fun but the architecture is the part I'm proud of.
Design, React, Animation
Why I benchmark locally
Everyone benchmarks on H100s. Nobody tells you what a model actually feels like on the hardware you own. I wanted to know: can a 31B model run on consumer hardware? The answer is yes, with caveats. The latency is real, but the capability ceiling is higher than you'd expect. That's the gap I'm trying to fill — practical, honest numbers for people who build with what they have.
Philosophy, Local AI
The orchestration problem
Started building an autonomous benchmark orchestration pipeline. The goal: let the iMac run all 36 remaining tests overnight without me touching it. Sounds simple until you deal with model timeouts, OOM kills, partial results, and the machine going to sleep. SSH tunnels, pmset nosleep, and a lot of defensive scripting. This is the unglamorous side of 'local AI'.