Gemma 4 Benchmark Suite
414+ tests across 7 models on two machines. iMac 2017 via Vulkan, MacBook M4 Pro via Metal. Autonomous overnight pipeline. 15 result files. Zero cloud. Zero API keys.
🐘 The headline: The 31B Dense model scored 94% across 63 tests with zero errors. One week earlier, the same model scored 21% with 83 errors because the testing infrastructure wasn't built for a model this slow. The rerun shows the low score came from the harness, not the model.
Test Lab
The Machines
Every number on this page came from one of these two machines. No cloud instances, no rented GPUs — just hardware we own.
iMac 27" 5K Retina
Mid 2017 · iMac 18,3
Intel Core i7-7700K @ 4.20 GHz
4 cores · 8 threads
40 GB DDR4
38.4 GB/s bandwidth
Radeon Pro 575 · 4 GB GDDR5
217 GB/s bandwidth
macOS 13.7.8 Ventura
27" 5120×2880 Retina
MacBook Pro 16"
Late 2024 · Mac16,7
Apple M4 Pro
14 cores · 10P + 4E
24 GB Unified LPDDR5
273 GB/s bandwidth
Integrated · 20-core
Shared unified memory
macOS 26.5 Tahoe
16.2" 3456×2234 XDR
🔑 Why this matters: The iMac is 8 years old with a discrete AMD GPU that most frameworks ignore. The MacBook is current-gen Apple Silicon. Benchmarking both tells you the floor and ceiling of what Gemma 4 can do on hardware you can actually buy on eBay or at the Apple Store.
Model Overview
| Avg Score | Tests | Avg Duration | Errors |
|---|---|---|---|
| 71% | 78 | 1.4m | 13 |
| 52% | 78 | 2.2m | |
| 49% | 78 | 2.0m | |
| 94% | 63 | 13.2m | |
Reality Check
iMac vs Google's published benchmarks
Google runs their benchmarks on datacenter hardware. We ran ours on a 2017 iMac. Different tests, different conditions — but the question is the same: does the model actually work?
| Benchmark | Google (published) | iMac (our tests) | Notes |
|---|---|---|---|
| Math (AIME-style) | 89.2% | 92% | Our 31B scored 100% on all 3 AIME tasks. Different problems, same difficulty band. |
| Code Generation | 80.0% | 95% | LiveCodeBench vs our 10-test coding suite. Our tests are less exhaustive but more practical. |
| Science (GPQA-style) | 84.3% | 68% | Chemistry pulled us down. Our 31B got 75% on chem vs 100% on physics. |
| Tool Calling | — | 99% | Google doesn't publish tool calling scores. The 31B aced it; smaller models struggled. |
| Creative Writing | — | 91% | Subjective scoring. All models scored 85%+ on creative tasks. |
⚠️ Apples and oranges: Google's published scores use standardized academic benchmarks on datacenter GPUs. Our tests are custom-designed, run locally via GPU-accelerated llama.cpp on two consumer machines. The point isn't to match their numbers — it's to answer: can you actually use this model on hardware you own?
📂 Show your work: Every prompt, grading function, and raw result is open source. View the full test suite on GitHub →
🔬 April 11 update: Added cross-hardware comparison (M4 Pro vs iMac), three-way quantization showdown, multi-turn coherence, needle-in-haystack, and 31B Think A/B testing. All collected via autonomous overnight pipeline.
Scores by Category
| Category | 2B | 4B | 26B | 31B |
|---|---|---|---|---|
| Performance | 88% (n=8) | 86% (n=7) | 86% (n=7) | 100% (n=8) |
| Reasoning | 58% (n=17) | 59% (n=17) | 59% (n=17) | 92% (n=11) |
| Coding | 70% (n=14) | 67% (n=14) | 65% (n=14) | 95% (n=10) |
| Tool Calling | 57% (n=14) | 57% (n=14) | 71% (n=14) | 99% (n=10) |
| Creative | 91% (n=10) | 94% (n=10) | 93% (n=10) | 91% (n=10) |
| Multimodal | 87% (n=6) | 94% (n=6) | 50% (n=6) | 88% (n=6) |
| Agentic | 73% (n=9) | 77% (n=10) | 57% (n=10) | 92% (n=8) |
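The per-category numbers can be rolled up into a single score by weighting each category by its test count. Here is a minimal sketch of that arithmetic for the 2B and 31B columns; the weighting scheme is our assumption about how an overall average would be formed, and the variable names are illustrative only.

```python
# (test count, score %) per category, copied from the 2B and 31B columns above.
CATEGORY_RESULTS = {
    "2B": [(8, 88), (17, 58), (14, 70), (14, 57), (10, 91), (6, 87), (9, 73)],
    "31B": [(8, 100), (11, 92), (10, 95), (10, 99), (10, 91), (6, 88), (8, 92)],
}


def weighted_average(rows):
    """Return (total tests, test-weighted mean score in percent)."""
    total_tests = sum(n for n, _ in rows)
    weighted_sum = sum(n * score for n, score in rows)
    return total_tests, weighted_sum / total_tests


for model, rows in CATEGORY_RESULTS.items():
    tests, avg = weighted_average(rows)
    print(f"{model}: {tests} tests, weighted average {avg:.1f}%")
```

Under this weighting the 2B column lands at about 71% over 78 tests and the 31B column at about 94% over 63 tests, which lines up with the first and last Model Overview cards.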
New
Quantization Showdown
7 models. 39 core tests each. The inverted ladder: Q4 > Q8 > F16. Lower precision = higher score on consumer hardware. All scores within a 3.1% band — so the real story is speed, memory, and edge-case behavior.
| Model | Speed | Disk | Perfect | Weak spot | Score%/GB |
|---|---|---|---|---|---|
| E2B Q4 | 6.95 tok/s | 7.2 GB | 33/39 | ⚠️ Tool Refusal (0%) | 12.8 |
| E2B Q8 | 6.95 tok/s | 8.1 GB | 31/39 | ⚠️ Tool Refusal (0%) | 11.3 |
| E2B F16 | 1.72 tok/s | 10.3 GB | 31/39 | ⚠️ Speed (1.7 tok/s) for 0% gain | 8.9 |
| E4B Q8 | 3.34 tok/s | 11.6 GB | 30/39 | ⚠️ Everything (worst value) | 7.7 |
| E4B F16 | 0.87 tok/s | 16.0 GB | 31/39 | ⚠️ 5 hrs for +0.6% over Q8 | 5.7 |
| Unsloth DQ4 | 3.41 tok/s | 5.1 GB | 32/39 | ⚠️ Think Mode (33%) | 17.9 |
| 31B Q4 | 0.6 tok/s | 19.9 GB | 43/63 | ⚠️ Speed (0.6 tok/s) | 4.7 |
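The score%/GB figure reads as mean score divided by disk footprint. A minimal sketch of that calculation follows, assuming the 5.1 GB and 7.2 GB cards belong to the Unsloth DQ4 and E2B Q4 columns of the full test table and that the mean is taken over all 39 tests. The function name is ours, not part of the test suite.

```python
def score_per_gb(perfect_count, other_scores, disk_gb):
    """Return (mean score %, mean score % per GB of disk)."""
    scores = [100] * perfect_count + list(other_scores)
    mean = sum(scores) / len(scores)
    return mean, mean / disk_gb


# Unsloth DQ4: 32 perfect scores plus the seven non-perfect ones from the full table.
unsloth_mean, unsloth_value = score_per_gb(32, [50, 93, 75, 67, 33, 33, 0], 5.1)

# E2B Q4: 33 perfect scores plus the six non-perfect ones.
e2b_mean, e2b_value = score_per_gb(33, [88, 75, 67, 67, 0, 0], 7.2)

print(f"Unsloth DQ4: mean {unsloth_mean:.1f}%, {unsloth_value:.1f} score%/GB")
print(f"E2B Q4:      mean {e2b_mean:.1f}%, {e2b_value:.1f} score%/GB")
```

Both results round to the published 17.9 and 12.8, so the metric rewards small files at least as much as it rewards accuracy.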
Visual Analysis
The Data, Visualized
Seven models tested on the same iMac. Same tests, same grading. The story is in the shapes.
[Headline counters: total test results across 7 models · tests scored perfectly by every model, out of 39 · peak tok/s (E2B Q4) · longest single test in minutes (Context Retention)]
Daily Driver
E2B Q4
Fastest + highest score
Best Value
Unsloth DQ4
17.9% per GB
Quality King
31B Q4
93% accuracy
Model Fingerprints
Each model has a shape. Click to compare. Hover to isolate.
Speed vs Accuracy
Bubble size = disk footprint. Top-right is the sweet spot.
The Runtime Race
Same 39 tests. The fastest model finishes before the slowest has loaded.
Wall-clock time for 39 identical tests · score at right
Every Test. Every Model.
39 tests scored across 7 models. Bigger dots = higher scores. The pattern tells you where each model shines — and where it breaks.
The 31B Endurance
The longest single tests — each one a marathon at 0.60 tok/s.
31B Q4 at 0.60 tok/s · each test is a marathon
The Inverted Ladder
Less precision. Higher score. The counterintuitive finding.
Same 5.1B architecture. Same 39 tests. Same iMac.
Lower precision → higher score.
New · April 10
GPU Acceleration Discovery
Everyone said GPU acceleration on Intel Mac was dead. Metal crashes on discrete AMD GPUs. ROCm is Linux-only. Ollama can't talk to the Radeon Pro 575.
Nobody tested Vulkan via MoltenVK. We compiled llama.cpp with the LunarG Vulkan SDK, and the GPU appeared instantly. The results changed everything.
| Model | Size | Ollama (CPU only) | iMac Vulkan (Radeon 575) | MacBook Metal (M4 Pro) | Best | Notes |
|---|---|---|---|---|---|---|
| E2B | 2.9 GB | 7.3 | 37.6 | 90.2 | 12.4× | Full GPU on both |
| E4B | 4.8 GB | 3.7 | 24.4 | 53.3 | 14.4× | Sweet spot model |
| 26B MoE | 16 GB | 2.0 | 3.7 | — | 1.9× | CPU-only on iMac (GPU offload fails) |
| 31B | 17.5 GB | 0.69 | 1.15 | 1.32 | 1.9× | Memory-bandwidth bound on both |
All speeds in tok/s (text generation, tg128). Best = fastest ÷ Ollama baseline.
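The "Best" column is plain division, so it is easy to recompute. The sketch below uses the tok/s figures from the table; `None` marks the one configuration with no MacBook result, and the dictionary layout is just our choice for illustration.

```python
# tok/s from the table: (Ollama CPU baseline, iMac Vulkan, MacBook Metal).
SPEEDS = {
    "E2B": (7.3, 37.6, 90.2),
    "E4B": (3.7, 24.4, 53.3),
    "26B MoE": (2.0, 3.7, None),  # no MacBook figure reported
    "31B": (0.69, 1.15, 1.32),
}


def best_speedup(baseline, *accelerated):
    """Fastest available accelerated speed divided by the CPU baseline."""
    available = [s for s in accelerated if s is not None]
    return max(available) / baseline


speedups = {model: best_speedup(*row) for model, row in SPEEDS.items()}

for model, factor in speedups.items():
    print(f"{model}: {factor:.2f}x over the Ollama baseline")
```

The same data gives the MacBook-over-iMac ratios quoted below: 90.2 / 37.6 is about 2.4 and 53.3 / 24.4 is about 2.2.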
The VRAM cliff
Models under 5 GiB see massive GPU acceleration (5–14×). Models over 15 GiB are barely helped — the memory bus becomes the bottleneck on both iMac (PCIe 3.0) and MacBook (unified LPDDR5).
M4 Pro = 2.4× faster
Apple Silicon's unified memory and Metal 4 backend deliver consistent 2.2–2.4× speedups over the iMac's Vulkan path. 90 tok/s on E2B is genuinely instant-feeling.
31B is memory-bound
The 31B runs at ~1.2 tok/s on both machines. At 17.5 GB, it saturates the memory bus regardless of GPU architecture. You need 48+ GB unified for this model to breathe.
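A common back-of-envelope check for memory-bound generation is bandwidth divided by model size, since a dense model reads roughly all of its weights once per token. The sketch below applies that rule of thumb to the numbers on this page. It assumes the iMac serves the 31B from DDR4 system RAM (17.5 GB cannot fit in 4 GB of VRAM) and it ignores KV cache traffic, compute limits, and memory-capacity pressure, so treat the output as a rough ceiling rather than a prediction.

```python
MODEL_SIZE_GB = 17.5  # 31B weights, from the table above

# Bandwidth from the machine specs; measured tok/s from the GPU table.
MACHINES = {
    "iMac (DDR4 system RAM)": {"bandwidth_gb_s": 38.4, "measured_tok_s": 1.15},
    "MacBook M4 Pro (unified)": {"bandwidth_gb_s": 273.0, "measured_tok_s": 1.32},
}


def bandwidth_ceiling(bandwidth_gb_s, model_size_gb):
    """Upper bound on tok/s if every token reads all weights exactly once."""
    return bandwidth_gb_s / model_size_gb


for name, m in MACHINES.items():
    ceiling = bandwidth_ceiling(m["bandwidth_gb_s"], MODEL_SIZE_GB)
    share = m["measured_tok_s"] / ceiling
    print(f"{name}: ceiling {ceiling:.1f} tok/s, measured {m['measured_tok_s']} "
          f"({share:.0%} of ceiling)")
```

The iMac lands at roughly half of its ceiling. The MacBook sits much further below its own, which may mean that a 17.5 GB model squeezed into 24 GB of unified memory is limited by more than raw bandwidth; we did not measure that directly.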
🔧 How to reproduce: iMac (Vulkan): Compile llama.cpp with LunarG Vulkan SDK (MoltenVK). MacBook (Metal): Use prebuilt llama.cpp releases — Metal works out of the box. Use Unsloth UD-Q4_K_XL GGUFs for models ≤4B. Set -ngl 99.
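For scripting the reproduction, here is a minimal sketch that assembles the build and benchmark commands as argument lists. It assumes a recent llama.cpp checkout where `GGML_VULKAN` is the CMake option for the Vulkan backend, that the LunarG SDK environment is already set up, and that binaries land in `build/bin`. The model path is a hypothetical placeholder. Nothing is executed here; the commands are only printed.

```python
import shlex


def vulkan_build_commands(build_dir="build"):
    """CMake configure + build steps for a Vulkan-enabled llama.cpp (assumed flags)."""
    return [
        ["cmake", "-B", build_dir, "-DGGML_VULKAN=ON"],
        ["cmake", "--build", build_dir, "--config", "Release"],
    ]


def bench_command(model_path, build_dir="build", gpu_layers=99, gen_tokens=128):
    """llama-bench invocation: offload all layers (-ngl 99), generate 128 tokens (tg128)."""
    return [
        f"{build_dir}/bin/llama-bench",
        "-m", model_path,
        "-ngl", str(gpu_layers),
        "-n", str(gen_tokens),
    ]


# Hypothetical file name; substitute the GGUF you actually downloaded.
MODEL = "models/example-UD-Q4_K_XL.gguf"

for cmd in vulkan_build_commands() + [bench_command(MODEL)]:
    print(shlex.join(cmd))
```

To actually run a step, pass one of the lists to `subprocess.run(cmd, check=True)` from the llama.cpp source directory.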
Spotlight
Reasoning Tasks — The Heavy Lifts
The longest and hardest reasoning tasks across all models. Think mode on, math hard.
Deep Dive
Experiments beyond the scores
Beyond the standard benchmarks, we ran targeted experiments to answer specific questions about how these models behave in practice. Expand each to dig in.
We ran the same tasks with Think Mode ON and OFF across E2B and 26B models. The results were surprising — thinking doesn't always help, and sometimes it actively hurts.
| Task | E2B Off | E2B Think | 26B Off | 26B Think |
|---|---|---|---|---|
| Math | 100% | 20% | 100% | 100% |
| Logic | 20% | 20% | 60% | 20% |
| Code | 70% | 70% | 70% | 70% |
| Creative | 100% | 100% | 100% | 85% |
Think mode hurt E2B on math (100% → 20%) and 26B on logic (60% → 20%). On this hardware, the extra tokens spent "thinking" can actually degrade quality.
We buried a passphrase in progressively larger documents to find the real context ceiling on this hardware. The advertised limits don't match reality.
| Document size | E2B | 26B |
|---|---|---|
| 1K tokens | ✓ | ✓ |
| 4K tokens | ✓ | ✗ |
| 8K tokens | ✗ | ✗ |
| 16K tokens | ✗ | ✗ |

On 40GB RAM, E2B tops out around 4K tokens reliably. The 26B can only handle ~1K before OOM pressure causes timeouts. Forget about the advertised 128K context window on consumer hardware.
Can the models maintain coherence across a 5-turn conversation? Both E2B and 26B averaged 95%, losing points only on turn 4 (a follow-up question that required referencing context from turn 1).
Both models: turn scores [100, 100, 100, 75, 100]. Impressive coherence.
Full Test Suite
Every test, every model, every score
All 39 unique tests across 7 model configurations.
| Test | Cat | E2B Q4 | E2B Q8 | E2B F16 | E4B Q8 | E4B F16 | Unsloth DQ4 | 31B Q4 | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Cold Start Latency | Perf | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Short Generation (50 tokens) | Perf | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Medium Generation (200 tokens) | Perf | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Long Generation (500 tokens) | Perf | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Prompt Processing (100 tokens) | Perf | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Prompt Processing (500 tokens) | Perf | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| TTFT: Warm Start (streaming) | Perf | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| TTFT: Medium Prompt (streaming) | Perf | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| AIME Math: Number Theory | Reas | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| AIME Math: Combinatorics | Reas | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| AIME Math: Algebra | Reas | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Logic: Knights and Knaves | Reas | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Logic: Constraint Satisfaction | Reas | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Science: Physics | Reas | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Common Sense: Physical World | Reas | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Common Sense: Temporal | Reas | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Function Gen: Python Fibonacci | Codi | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Function Gen: JS Array Flatten | Codi | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Function Gen: SQL Query | Codi | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Bug Detection: Off-by-One | Codi | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Bug Detection: Memory Leak | Codi | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Algorithm: Two Sum | Codi | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Algorithm: Graph BFS | Codi | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Single Tool: Weather Query | Tool | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Single Tool: Calculator | Tool | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Single Tool: Web Search | Tool | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Parameter Extraction: Implicit Units | Tool | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Tool Refusal: Conversational Query | Tool | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Structured Output: JSON Response | Tool | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Refactoring: Cleanup Messy Code | Codi | 100 | 98 | 98 | 98 | 98 | 100 | 100 | 99 |
| Error Recovery: Failed Tool Call | Tool | 100 | 90 | 100 | 90 | 100 | 100 | 90 | 96 |
| Multi-Tool: Research + Email | Tool | 100 | 100 | 100 | 100 | 100 | 50 | 100 | 93 |
| Real-World: API Data Pipeline Review | Codi | 88 | 85 | 83 | 83 | 97 | 93 | 80 | 87 |
| Science: Chemistry | Reas | 75 | 75 | 75 | 50 | 50 | 75 | 75 | 68 |
| Thinking Mode: OFF | Reas | 67 | 67 | 67 | 67 | 67 | 67 | 67 | 67 |
| Code Explanation: Complex Regex | Codi | 100 | 100 | 67 | 50 | 50 | 33 | 67 | 67 |
| Thinking Mode: ON | Reas | 67 | 67 | 67 | 67 | 67 | 33 | 67 | 62 |
| Tool Refusal: No Tool Needed | Tool | 0 | 0 | 0 | 0 | 0 | 100 | 100 | 29 |
| Parameter Extraction: Calendar Event | Tool | 0 | 0 | 0 | 0 | 0 | 0 | 100 | 14 |
39 tests shown · 273 total results
All benchmarks were run locally using Ollama on a 2017 iMac (i7-7700K, 40GB RAM, macOS). Nothing fancy — this is a machine I actually use. Each test was executed sequentially with cold-start measurements, no prompt caching, and real wall-clock timing.
Models tested span Google's Gemma 4 family from the smallest E2B (2 billion parameters) up to the full 31B. Tests cover 10 categories: performance profiling, mathematical reasoning (AIME-style problems), code generation, tool calling, creative writing, multimodal analysis, agentic task completion, parameter optimization, context window stress testing, and score calibration.
Scoring uses a 0-1 scale. Performance tests are scored on metrics only (latency, throughput). Reasoning and coding tests use automated verification against expected answers.
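To make "automated verification" concrete, here is a minimal sketch of two graders on a 0-1 scale: an exact-answer check for math and a checklist for open-ended replies. These are illustrative stand-ins written for this page, not the grading functions from the repository, and both function names are hypothetical.

```python
import re


def grade_exact(reply, expected):
    """1.0 if the last number in the reply equals the expected answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", reply.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(expected) else 0.0


def grade_checklist(reply, required_phrases):
    """Fraction of required phrases that appear in the reply (case-insensitive)."""
    if not required_phrases:
        return 0.0
    text = reply.lower()
    hits = sum(1 for phrase in required_phrases if phrase.lower() in text)
    return hits / len(required_phrases)


print(grade_exact("After simplifying, the answer is 1,024.", 1024))
print(grade_checklist("Use a queue and a visited set.", ["queue", "visited", "deque"]))
```

Under a checklist scheme like this, a score such as 67% would correspond to two of three checks passing.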
I'm not a research lab. I'm one person who wanted to know if these models are actually useful on hardware I already own. I publish the failures alongside the wins because that's what's actually helpful.
— The whole point
● 7 models tested
Some tests are still running. Some results will change as I learn more. The messy build log is more useful than the polished launch post.
New · April 11
Three-Way Quantization Showdown
We ran Q8_0, Google Q4_K_M, and Unsloth UD-Q4_K_XL through identical 8-question quality suites on the same iMac hardware. Same prompts, same seed, same temperature.
| Quantization | Size | Speed | Quality | Score%/GB |
|---|---|---|---|---|
| Q8_0 (8-bit baseline) | 4.69 GB | 30.12 tok/s | 7/8 | 18.8 |
| Q4_K_M (Google) | 3.21 GB | 35.36 tok/s | 6/8 | 23.4 |
| UD-Q4_K_XL (Unsloth) 🏆 winner | 2.94 GB | 38.16 tok/s | 7/8 | 29.8 |
Why Unsloth wins
Unsloth's dynamic quantization uses imatrix calibration data to identify which weights matter most. High-impact weights get Q5/Q6 precision while low-impact weights get Q4. The result: smaller file, faster inference, same quality as Q8.
⚠️ The 31B exception
Unsloth UD-Q4_K_XL fails on the 31B at <2 tok/s. The model starts correct answers then degenerates into garbage tokens. At slow inference speeds, the quantization's approximations compound across tokens. For 31B, use Google's standard Q4_K_M or higher precision.
New · April 11
Agentic Readiness
Can these models actually work as local agents? We tested context retrieval, multi-turn memory, and Think Mode overhead.
Needle-in-Haystack
Buried a secret password in 10, 50, 200, and 500 filler sentences. E4B found it every time. Zero degradation at any depth.
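A needle-in-haystack check like this is simple to sketch. The version below builds the haystack at the four depths mentioned above and scores a reply by substring match. The filler sentence, the secret, the prompt wording, and the `ask_model` callback are all hypothetical; `echo_model` is a stand-in that reads the answer straight out of the prompt so the harness can be exercised without a real model.

```python
import random

FILLER = "The quick brown fox jumps over the lazy dog."
MARKER = "The secret password is "


def build_haystack(n_sentences, secret, seed=0):
    """Insert one needle sentence at a random position among the filler sentences."""
    rng = random.Random(seed)
    sentences = [FILLER] * n_sentences
    position = rng.randrange(n_sentences + 1)
    sentences.insert(position, f"{MARKER}{secret}.")
    return " ".join(sentences), position


def build_prompt(haystack):
    return f"{haystack}\n\nWhat is the secret password? Answer with the password only."


def passed(model_reply, secret):
    """Pass if the secret appears anywhere in the reply (case-insensitive)."""
    return secret.lower() in model_reply.lower()


def run_suite(ask_model, secret="marmalade-42", depths=(10, 50, 200, 500)):
    results = {}
    for depth in depths:
        haystack, _ = build_haystack(depth, secret)
        results[depth] = passed(ask_model(build_prompt(haystack)), secret)
    return results


def echo_model(prompt):
    """Stand-in for a real model: extracts the text that follows the marker."""
    start = prompt.index(MARKER) + len(MARKER)
    return prompt[start:prompt.index(".", start)]


print(run_suite(echo_model))
```

Swapping `echo_model` for a function that sends the prompt to a local server turns this into a real retrieval test.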
Multi-Turn Memory
5-turn conversation testing fact retention. The E4B remembered name, city, and hardware across all turns.
31B Think Mode
Think Mode on the 31B made everything worse. Without thinking: 3/8. With thinking: 0/8. The hidden reasoning tokens eat into the answer budget at low tok/s.
🤖 The verdict: The E4B is agent-ready. Perfect context retrieval, strong multi-turn memory, and 24–53 tok/s depending on hardware. Think Mode is only useful on models fast enough that the thinking overhead doesn't crowd out the answer. For agentic workflows, disable Think Mode on anything below 10 tok/s.
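The rule of thumb can be written down as a small gate plus a token-budget estimate. In the sketch below the 10 tok/s threshold comes from the verdict above, while the 600-second timeout and the 400 thinking tokens are hypothetical numbers picked only to show the effect.

```python
def answer_budget(tok_per_s, timeout_s, thinking_tokens):
    """Tokens left for the visible answer after hidden reasoning, within a timeout."""
    total_tokens = round(tok_per_s * timeout_s)
    return max(total_tokens - thinking_tokens, 0)


def think_mode_enabled(tok_per_s, min_tok_per_s=10.0):
    """Enable Think Mode only when generation is fast enough to afford it."""
    return tok_per_s >= min_tok_per_s


# Hypothetical per-test timeout and reasoning length.
TIMEOUT_S = 600
THINKING_TOKENS = 400

for label, speed in [("31B", 0.6), ("E4B on iMac", 24.4), ("E4B on MacBook", 53.3)]:
    budget = answer_budget(speed, TIMEOUT_S, THINKING_TOKENS)
    print(f"{label}: think={think_mode_enabled(speed)}, answer budget={budget} tokens")
```

With these assumed numbers the 31B spends its entire budget on thinking and has nothing left for the answer, which matches the failure pattern we saw.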
New · April 12
From Benchmarks to Bots: The Clawdy Pipeline
Benchmarking is just theory until you build something real. We took all the data from our Vulkan optimization tests and wired the unsloth-e4b model directly into an autonomous OpenClaw agent bridged to Telegram. The result? A fully functional AI assistant texting us from a 2017 iMac at 24 tok/s.
The Subagent Revelation
When told to autonomously research 10 web sources simultaneously, the OpenClaw orchestration engine realized the task would lock up the single-threaded Telegram chat. Instead of freezing, it elegantly spawned an invisible background "Subagent" that crunched internet data for 42 minutes, leaving the main bot thread perfectly free for casual conversation.
Optimizing for "Tiered Routing"
The next evolution of this local infrastructure is mapping these subagents to the iMac's physical constraints. We are building a "Router + Heavy Lifter" architecture: a tiny 2B model lives on the GPU responding to texts instantly, while complex agent tasks are silently passed to a 26B MoE running purely on the 40GB system CPU memory. Zero resource contention.
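Here is a minimal sketch of what the routing decision could look like. The backend names, the keyword list, and the word-count cutoff are all hypothetical; a real router would more likely ask the small model itself to classify the request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    device: str


# Hypothetical backend labels for the two tiers described above.
ROUTER = Backend(name="small-2b", device="gpu")
HEAVY_LIFTER = Backend(name="moe-26b", device="cpu")

# Hypothetical trigger words for long-running agent work.
HEAVY_KEYWORDS = {"research", "summarize", "analyze", "compare", "plan"}


def route(message, max_chat_words=40):
    """Send short chat to the GPU model, long or task-like requests to the CPU model."""
    words = [w.strip(".,!?:;").lower() for w in message.split()]
    is_long = len(words) > max_chat_words
    is_task = any(w in HEAVY_KEYWORDS for w in words)
    return HEAVY_LIFTER if (is_long or is_task) else ROUTER


for text in ["hey, are you there?", "Research 10 web sources on Vulkan drivers"]:
    backend = route(text)
    print(f"{text!r} -> {backend.name} on {backend.device}")
```

Keeping the two tiers on separate devices is what makes the "zero resource contention" goal plausible: the chat path never waits on the CPU-bound job.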

💡 The bottom line: Benchmarks proved the iMac could run it. The Telegram bot proved it is running it. We are no longer testing models; we are deploying autonomous AI infrastructure on 8-year-old consumer hardware.
Takeaways
What we learned
Unsloth is the daily driver
UD-Q4_K_XL: 2.94 GB, 38.16 tok/s, 7/8 accuracy. Smallest file, fastest inference, tied for best quality. On the MacBook M4 Pro it reaches 90.15 tok/s, which is instant-feeling inference on a laptop. Dynamic quantization preserves what matters and compresses what doesn't.
The E4B is the sweet spot
At 24.4 tok/s (iMac) or 53.3 tok/s (MacBook), the E4B delivers 7.5B-parameter quality at conversational speed. Perfect needle-in-haystack retrieval. 4/5 multi-turn memory. Agent-ready.
Speed affects quality, not just UX
The Unsloth quant works perfectly at 30+ tok/s on smaller models, but on the 31B it collapses at <2 tok/s into garbage token generation. In these runs, quantization quality tracked speed, a finding you only get from testing on real hardware with real constraints.
Two machines tell different stories
The iMac (Vulkan, 4 GB VRAM) proved GPU acceleration works on legacy hardware. The MacBook (Metal, 24 GB unified) showed the speed ceiling for Apple Silicon. Both agree: models under 5 GB are transformative. Models over 15 GB are paperweights on consumer hardware.
The 31B redemption (and limits)
The original benchmark suite showed 94% accuracy (0 errors) via Ollama on the 31B. GPU-accelerated Unsloth quant dropped to 3/8 due to the garbage token issue. The 31B is a phenomenal model — but only with the right quantization at the right speed.
Autonomous pipelines work
15 JSON result files from an overnight pipeline that ran across two machines with zero human intervention. Bash orchestrators watching for completion signals, auto-restarting servers, saving data after every test. Local AI research doesn't need a cluster — it needs good plumbing.
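The real orchestrators are Bash, but the core pattern (run each test, retry on failure, write the JSON file after every test, resume from whatever is already on disk) fits in a few lines. This is a Python sketch of that pattern under our own naming, not a port of the actual scripts; `demo_runner` is a stand-in that fails once to exercise the retry path.

```python
import json
import tempfile
from pathlib import Path


def run_pipeline(tests, run_test, out_path, max_retries=2):
    """Run tests in order, retrying failures and saving results after every test."""
    out_path = Path(out_path)
    results = json.loads(out_path.read_text()) if out_path.exists() else {}
    for name in tests:
        if name in results:
            continue  # already recorded in an earlier run (including recorded failures)
        for attempt in range(1, max_retries + 2):
            try:
                results[name] = {"score": run_test(name), "attempts": attempt}
                break
            except Exception as exc:
                results[name] = {"error": str(exc), "attempts": attempt}
        out_path.write_text(json.dumps(results, indent=2))  # checkpoint after each test
    return results


attempts_seen = {}


def demo_runner(name):
    """Stand-in test runner: 'flaky_test' times out on its first attempt only."""
    attempts_seen[name] = attempts_seen.get(name, 0) + 1
    if name == "flaky_test" and attempts_seen[name] == 1:
        raise TimeoutError("server not responding")
    return 1.0


with tempfile.TemporaryDirectory() as tmp:
    results = run_pipeline(["cold_start", "flaky_test"], demo_runner, Path(tmp) / "results.json")

print(results)
```

Checkpointing after every test is the part that matters overnight: a crash at test 30 costs one test, not the whole run.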
Project Logs
Deep Dives & Technical Reports
Fine-Tuning Gemma 4 for the Edge
How we used Kaggle T4s and Unsloth to train an r=32 LoRA, creating our custom concierge with 88.6% accuracy against hallucinations.
Read the report →
Multi-Model Orchestration on Legacy Hardware
Running a dual-node OpenClaw pipeline on a 2017 iMac. Overcoming context exhaustion, tokenizer bugs, and VRAM limits.
Read the report →