Gemma 4 Benchmark Suite
403 tests across 4 models running locally on a 2017 iMac i7-7700K via Ollama. No cloud, no API — just consumer hardware doing real work.
Why this matters: Gemma 4 has crossed 10 million downloads and the community is benchmarking on datacenter GPUs. These numbers show what the models actually do on the hardware most people own. The 2B model is shockingly capable. The 31B model is... ambitious.
I'm just a person hacking on this hardware to see what works. Traditional benchmarks don't tell me much about real-life use, so I built my own. Testing is ongoing — the 31B model is still running as I type this — and these numbers will evolve. This is a snapshot, not a verdict.
Model Overview
| Model | Avg Score | Tests | Avg Duration | Errors |
|---|---|---|---|---|
| 2B (E2B) | 71% | 78 | 1.4m | 13 |
| 4B (E4B) | 52% | 108 | 2.2m | 47 |
| 26B | 49% | 108 | 2.0m | 47 |
| 31B | 21% | 109 | 8.1m | 83 |
Scores by Category
| Category | 2B | 4B | 26B | 31B |
|---|---|---|---|---|
| Performance | 88% (n=8) | 86% (n=7) | 86% (n=7) | 75% (n=8) |
| Reasoning | 58% (n=17) | 59% (n=17) | 59% (n=17) | 12% (n=17) |
| Coding | 70% (n=14) | 67% (n=14) | 65% (n=14) | 0% (n=14) |
| Tool Calling | 57% (n=14) | 57% (n=14) | 71% (n=14) | 68% (n=14) |
| Creative | 91% (n=10) | 94% (n=10) | 93% (n=10) | 25% (n=10) |
| Multimodal | 87% (n=6) | 94% (n=6) | 50% (n=6) | 8% (n=6) |
| Agentic | 73% (n=9) | 77% (n=10) | 57% (n=10) | 25% (n=10) |
| Optimization | — | 0% (n=16) | 0% (n=16) | 0% (n=16) |
| Context | — | 0% (n=6) | 0% (n=6) | 0% (n=6) |
| Calibration | — | 0% (n=8) | 0% (n=8) | 0% (n=8) |
Spotlight
Reasoning Tasks — The Heavy Lifts
The longest and hardest reasoning tasks across all models. Think mode on, math hard.
Methodology
All benchmarks were run locally using Ollama on a 2017 iMac (i7-7700K, 40GB RAM, macOS). Nothing fancy — this is a machine I actually use. Each test was executed sequentially with cold-start measurements, no prompt caching, and real wall-clock timing.
Models tested span Google's Gemma 4 family from the smallest E2B (2 billion parameters) up to the full 31B. Tests cover 10 categories: performance profiling, mathematical reasoning (AIME-style problems), code generation, tool calling, creative writing, multimodal analysis, agentic task completion, parameter optimization, context window stress testing, and score calibration.
Scoring uses a 0-1 scale. Performance tests are scored on metrics only (latency, throughput). Reasoning and coding tests use automated verification against expected answers.
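To make the 0–1 scale concrete, here's a minimal sketch of how automated verification and category averaging might work. The function names (`score_exact`, `category_score`) are my own illustrative stand-ins, not the suite's actual code:

```python
def score_exact(expected: str, actual: str) -> float:
    """Automated verification: 1.0 on an exact (trimmed) match, else 0.0.
    A simplified stand-in for checking a model's answer against the key."""
    return 1.0 if expected.strip() == actual.strip() else 0.0

def category_score(scores: list[float]) -> float:
    """Average the 0-1 test scores for one category; multiply by 100
    to get the percentages reported in the tables above."""
    return sum(scores) / len(scores) if scores else 0.0
```

So a category with scores `[1.0, 0.0, 1.0, 1.0]` reports as 75%. Real reasoning tasks would need fuzzier matching (extracting the final answer from a chain of thought), but the scale works the same way.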
A note on honesty: I'm not a research lab. I'm one person who wanted to know if these models are actually useful on hardware I already own. Some tests are still running. Some results will change as I learn more. I publish the failures alongside the wins because that's what's actually helpful.
Deep Dive
Experiments beyond the scores
Beyond the standard benchmarks, we ran targeted experiments to answer specific questions about how these models behave in practice.
Think Mode: Does it actually help?
We ran the same tasks with Think Mode ON and OFF across E2B and 26B models. The results were surprising — thinking doesn't always help, and sometimes it actively hurts.
| Task | E2B Off | E2B Think | 26B Off | 26B Think |
|---|---|---|---|---|
| Math | 100% | 20% | 100% | 100% |
| Logic | 20% | 20% | 60% | 20% |
| Code | 70% | 70% | 70% | 70% |
| Creative | 100% | 100% | 100% | 85% |
Hot take: Think mode hurt E2B on math (100% → 20%) and 26B on logic (60% → 20%). On this hardware, the extra tokens spent "thinking" can actually degrade quality.
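For reference, toggling think mode via Ollama's local HTTP API looks roughly like this. The request shape below assumes a recent Ollama version with thinking support (the `think` field), and the model tag is a placeholder — check `ollama list` for your actual tags:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

def build_chat_request(model: str, prompt: str, think: bool) -> dict:
    """Payload for Ollama's /api/chat endpoint. The `think` flag toggles
    thinking on models that support it (assumes a recent Ollama version)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "think": think,
        "stream": False,
    }

def run(model: str, prompt: str, think: bool) -> str:
    """POST the request and return the model's reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_chat_request(model, prompt, think)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# Same task, two payloads — only the think flag differs.
payload_off = build_chat_request("gemma-e2b", "What is 17 * 24?", think=False)
payload_on = build_chat_request("gemma-e2b", "What is 17 * 24?", think=True)
```

Running the same prompt with both payloads and scoring the answers is all the A/B comparison above amounts to.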
Needle in a Haystack: Context window limits
We buried a passphrase in progressively larger documents to find the real context ceiling on this hardware. The advertised limits don't match reality.
| Context size | E2B | 26B |
|---|---|---|
| 1K tokens | ✓ | ✓ |
| 4K tokens | ✓ | ✗ |
| 8K tokens | ✗ | ✗ |
| 16K tokens | ✗ | ✗ |

Hot take: On 40GB RAM, E2B tops out around 4K tokens reliably. The 26B can only handle ~1K before OOM pressure causes timeouts. Forget about the advertised 128K context window on consumer hardware.
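The test construction is simple to reproduce. Here's a hedged sketch: bury a passphrase sentence in filler prose of a target length, then check whether the model's answer contains the secret. The token estimate uses the rough 1 token ≈ 0.75 words rule of thumb, and all names here are illustrative:

```python
import random

FILLER = ("The quarterly report was filed on time and nothing "
          "unusual happened during the review period.")

def build_haystack(needle: str, approx_tokens: int, seed: int = 0) -> str:
    """Bury a needle sentence at a random position in filler prose
    roughly approx_tokens long (1 token ~ 0.75 words, a rule of thumb)."""
    n_words = int(approx_tokens * 0.75)
    base = FILLER.split()
    words = (base * (n_words // len(base) + 1))[:n_words]
    pos = random.Random(seed).randrange(len(words))
    words.insert(pos, needle)
    return " ".join(words)

def found_needle(model_output: str, secret: str) -> bool:
    """Pass/fail: did the model surface the secret?"""
    return secret in model_output

doc = build_haystack("The secret passphrase is MAGENTA-42.", approx_tokens=4000)
```

Feed `doc` plus "What is the secret passphrase?" to the model and score with `found_needle`. Scaling `approx_tokens` up until retrieval fails gives the ceilings in the table above.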
Multi-turn Consistency
Can the models maintain coherence across a 5-turn conversation? Both E2B and 26B scored 95% overall, stumbling only on turn 4 (a follow-up question that required referencing context from turn 1).
Both models: turn scores [100, 100, 100, 75, 100]. Impressive coherence.
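The bookkeeping behind this test is trivial but worth stating: each turn appends to a growing message history so the model always sees full prior context, and the overall score is a plain average of per-turn scores. A minimal sketch (function names are illustrative):

```python
def append_turn(history: list[dict], role: str, content: str) -> list[dict]:
    """Grow the chat history so each new turn sees all prior context."""
    return history + [{"role": role, "content": content}]

def conversation_score(turn_scores: list[float]) -> float:
    """Overall multi-turn score: plain average of the per-turn scores."""
    return sum(turn_scores) / len(turn_scores)

overall = conversation_score([100, 100, 100, 75, 100])  # the runs above
```

With the turn scores reported above, `overall` comes out to exactly 95.0.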
Takeaways
What we learned
The 2B model punches up
At 71% average across all categories, Gemma 4 E2B is the best performer relative to its size. 91% on creative tasks, 88% on performance benchmarks. Sub-25s average response times make it genuinely usable for real workflows.
Bigger ≠ better (on this hardware)
The 31B model scored 21% average with 83 errors. Memory pressure on 40GB RAM causes timeouts on complex reasoning tasks. The model is capable — it just can't breathe on this hardware.
Tool calling is solid across the board
Even the smallest model hits 57% on tool calling. The 26B reaches 71%. For agentic workflows on local hardware, structured tool use is already production-viable.
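Ollama's `/api/chat` accepts OpenAI-style function schemas in its `tools` field, which is the format these tests exercised. A hedged sketch of building one (the `get_weather` tool and `make_tool` helper are illustrative, not from the suite):

```python
def make_tool(name: str, description: str,
              params: dict, required: list[str]) -> dict:
    """OpenAI-style function schema, the shape Ollama's /api/chat
    accepts in its `tools` field."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": params,
                "required": required,
            },
        },
    }

weather_tool = make_tool(
    "get_weather",
    "Look up current weather for a city.",
    {"city": {"type": "string", "description": "City name"}},
    ["city"],
)
```

Pass `[weather_tool]` alongside the messages and score whether the model emits a well-formed call with the right arguments — that's the essence of the tool-calling category.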
Creative writing is the sleeper
All models score 90%+ on creative tasks (E4B hits 94%). If you're running local AI for writing assistance, even the smallest Gemma 4 model is excellent.