
Gemma 4 Benchmark Suite

403 tests across 4 models running locally on a 2017 iMac i7-7700K via Ollama. No cloud, no API — just consumer hardware doing real work.

403 total tests · 10 categories · iMac 2017 · i7-7700K · 40GB RAM

Why this matters: Gemma 4 has crossed 10 million downloads and the community is benchmarking on datacenter GPUs. These numbers show what the models actually do on the hardware most people own. The 2B model is shockingly capable. The 31B model is... ambitious.

I'm just a person hacking on this hardware to see what works. Traditional benchmarks don't tell me much about real-life use, so I built my own. Testing is ongoing — the 31B model is still running as I type this — and these numbers will evolve. This is a snapshot, not a verdict.

Model Overview

| Model | Avg Score | Tests | Avg Duration | Errors |
| --- | --- | --- | --- | --- |
| Gemma 4 E2B (2B) | 71% | 78 | 1.4m | 13 |
| Gemma 4 E4B (4B) | 52% | 108 | 2.2m | 47 |
| Gemma 4 26B | 49% | 108 | 2.0m | 47 |
| Gemma 4 31B | 21% | 109 | 8.1m | 83 |

Scores by Category

| Category | 2B | 4B | 26B | 31B |
| --- | --- | --- | --- | --- |
| Performance | 88% (n=8) | 86% (n=7) | 86% (n=7) | 75% (n=8) |
| Reasoning | 58% (n=17) | 59% (n=17) | 59% (n=17) | 12% (n=17) |
| Coding | 70% (n=14) | 67% (n=14) | 65% (n=14) | 0% (n=14) |
| Tool Calling | 57% (n=14) | 57% (n=14) | 71% (n=14) | 68% (n=14) |
| Creative | 91% (n=10) | 94% (n=10) | 93% (n=10) | 25% (n=10) |
| Multimodal | 87% (n=6) | 94% (n=6) | 50% (n=6) | 8% (n=6) |
| Agentic | 73% (n=9) | 77% (n=10) | 57% (n=10) | 25% (n=10) |
| Optimization | n/a | 0% (n=16) | 0% (n=16) | 0% (n=16) |
| Context | n/a | 0% (n=6) | 0% (n=6) | 0% (n=6) |
| Calibration | n/a | 0% (n=8) | 0% (n=8) | 0% (n=8) |

Spotlight

Reasoning Tasks — The Heavy Lifts

The longest and hardest reasoning tasks across all models. Think mode on, math hard.

| Model | Task | Duration | Score |
| --- | --- | --- | --- |
| 31B | Science: Physics [Think ON] | 15.2m | 0% |
| 31B | Science: Physics [Think OFF] | 15.2m | 0% |
| 31B | Science: Physics | 15.2m | 0% |
| 31B | Science: Chemistry | 15.2m | 0% |
| 31B | Thinking Mode: ON | 15.2m | 0% |
| 31B | Thinking Mode: OFF | 15.2m | 0% |
| 31B | Logic: Constraint Satisfaction [Think ON] | 15.2m | 0% |
| 31B | AIME Math: Number Theory [Think ON] | 15.2m | 0% |

Methodology

All benchmarks were run locally using Ollama on a 2017 iMac (i7-7700K, 40GB RAM, macOS). Nothing fancy — this is a machine I actually use. Each test was executed sequentially with cold-start measurements, no prompt caching, and real wall-clock timing.
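In practice each test reduces to one timed HTTP call against Ollama's local `/api/generate` endpoint. A minimal sketch of that loop, with illustrative function names (this is not the actual harness):

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def time_prompt(model: str, prompt: str) -> tuple[str, float]:
    """Send one prompt to the local Ollama server and measure wall-clock time."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["response"], time.perf_counter() - start

def summarize(durations: list[float]) -> dict:
    """Aggregate raw per-test timings into the stats reported above."""
    return {
        "tests": len(durations),
        "avg_minutes": round(sum(durations) / len(durations) / 60, 1),
    }
```

Because each request is non-streamed and nothing is cached between tests, the wall-clock number includes model load time on cold starts, which is exactly what a local user experiences.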

Models tested span Google's Gemma 4 family from the smallest E2B (2 billion parameters) up to the full 31B. Tests cover 10 categories: performance profiling, mathematical reasoning (AIME-style problems), code generation, tool calling, creative writing, multimodal analysis, agentic task completion, parameter optimization, context window stress testing, and score calibration.

Scoring uses a 0-1 scale. Performance tests are scored on metrics only (latency, throughput). Reasoning and coding tests use automated verification against expected answers.
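For the exact-match categories, the 0-1 scoring can be sketched like this (helper names are illustrative, not the actual scorer):

```python
def normalize(answer: str) -> str:
    """Lowercase and strip whitespace and a trailing period so '42.' matches '42'."""
    return answer.strip().rstrip(".").lower()

def score_exact(response: str, expected: str) -> float:
    """Automated verification: 1.0 on a normalized exact match, else 0.0."""
    return 1.0 if normalize(response) == normalize(expected) else 0.0

def category_average(scores: list[float]) -> int:
    """Fold 0-1 scores into the percentages shown in the tables."""
    return round(100 * sum(scores) / len(scores))
```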

A note on honesty: I'm not a research lab. I'm one person who wanted to know if these models are actually useful on hardware I already own. Some tests are still running. Some results will change as I learn more. I publish the failures alongside the wins because that's what's actually helpful.

Deep Dive

Experiments beyond the scores

Beyond the standard benchmarks, I ran targeted experiments to answer specific questions about how these models behave in practice.

Experiment 1

Think Mode: Does it actually help?

I ran the same tasks with Think Mode ON and OFF across the E2B and 26B models. The results were surprising — thinking doesn't always help, and sometimes it actively hurts.

| Task | E2B Off | E2B Think | 26B Off | 26B Think |
| --- | --- | --- | --- | --- |
| Math | 100% | 20% | 100% | 100% |
| Logic | 20% | 20% | 60% | 20% |
| Code | 70% | 70% | 70% | 70% |
| Creative | 100% | 100% | 100% | 85% |

Hot take: Think mode hurt E2B on math (100% → 20%) and 26B on logic (60% → 20%). On this hardware, the extra tokens spent "thinking" can actually degrade quality.
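The regressions called out above are easy to compute mechanically. A small sketch (the dict literal just re-encodes the E2B columns from the table):

```python
# (off, think) score pairs for E2B, taken from the table above.
E2B = {"Math": (100, 20), "Logic": (20, 20), "Code": (70, 70), "Creative": (100, 100)}

def think_deltas(results: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Score change when Think Mode is switched on; negative means thinking hurt."""
    return {task: think - off for task, (off, think) in results.items()}

def regressions(results: dict[str, tuple[int, int]]) -> list[str]:
    """Tasks where Think Mode lowered the score."""
    return [task for task, delta in think_deltas(results).items() if delta < 0]
```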

Experiment 2

Needle in a Haystack: Context window limits

I buried a passphrase in progressively larger documents to find the real context ceiling on this hardware. The advertised limits don't match reality.

| Context size | E2B | 26B |
| --- | --- | --- |
| 1K tokens | pass | pass |
| 4K tokens | pass | fail |
| 8K tokens | fail | fail |
| 16K tokens | fail | fail |

Hot take: On 40GB RAM, E2B tops out around 4K tokens reliably. The 26B can only handle ~1K before OOM pressure causes timeouts. Forget about the advertised 128K context window on consumer hardware.
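The needle test itself is simple to sketch. The helper below is illustrative, and since real token counts depend on each model's tokenizer, it sizes the haystack with a rough words-per-token heuristic:

```python
FILLER = "The quick brown fox jumps over the lazy dog. "

def build_haystack(passphrase: str, approx_tokens: int, depth: float = 0.5) -> str:
    """Bury a passphrase at a relative depth inside roughly approx_tokens of filler.

    Uses ~0.75 words per token as a crude heuristic; the real ratio varies by
    tokenizer, so document sizes are approximate by design.
    """
    n_words = int(approx_tokens * 0.75)
    words = (FILLER.split() * (n_words // 9 + 1))[:n_words]
    words.insert(int(len(words) * depth), f"The secret passphrase is: {passphrase}.")
    return " ".join(words)

def found(response: str, passphrase: str) -> bool:
    """Pass/fail check: did the model echo the passphrase back?"""
    return passphrase in response
```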

Experiment 3

Multi-turn Consistency

Can the models maintain coherence across a 5-turn conversation? Both E2B and 26B scored 95%, losing points only on turn 4 (a follow-up question that required referencing context from turn 1).

E2B: 95% · 26B: 95%

Both models: turn scores [100, 100, 100, 75, 100]. Impressive coherence.
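For reference, the headline 95% is just the mean of those per-turn scores:

```python
def conversation_score(turn_scores: list[int]) -> int:
    """Average per-turn scores (0-100) into one conversation-level score."""
    return round(sum(turn_scores) / len(turn_scores))

conversation_score([100, 100, 100, 75, 100])  # → 95
```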

Takeaways

What I learned

The 2B model punches up

At 71% average across all categories, Gemma 4 E2B is the best performer relative to its size. 91% on creative tasks, 88% on performance benchmarks. Sub-25s average response times make it genuinely usable for real workflows.

Bigger ≠ better (on this hardware)

The 31B model scored 21% average with 83 errors. Memory pressure on 40GB RAM causes timeouts on complex reasoning tasks. The model is capable — it just can't breathe on this hardware.

Tool calling is solid across the board

Even the smallest model hits 57% on tool calling. The 26B reaches 71%. For agentic workflows on local hardware, structured tool use is already production-viable.
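Scoring structured tool use can be as simple as validating the model's emitted call against a declared schema. A hedged sketch (the helper and schema shape are illustrative, not the actual grader):

```python
import json

def valid_tool_call(raw: str, tools: dict[str, set[str]]) -> bool:
    """True if raw is well-formed JSON, names a known tool, and passes only
    parameters that the tool declares."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict) or call.get("name") not in tools:
        return False
    args = call.get("arguments", {})
    return isinstance(args, dict) and set(args) <= tools[call["name"]]

TOOLS = {"get_weather": {"city", "unit"}}  # example schema for one tool
```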

Creative writing is the sleeper

Every model except the memory-starved 31B scores 90%+ on creative tasks (E4B hits 94%). If you're running local AI for writing assistance, even the smallest Gemma 4 model is excellent.