Research · Updated April 11

Gemma 4 Benchmark Suite

414+ tests across 7 models on two machines. iMac 2017 via Vulkan, MacBook M4 Pro via Metal. Autonomous overnight pipeline. 15 result files. Zero cloud. Zero API keys.

414+ total tests · 7 categories · 3 quantizations compared · 2 machines · 2 GPU backends

🐘 The headline: The 31B Dense model scored 94% across 63 tests with zero errors. One week earlier, the same model scored 21% with 83 errors because the testing infrastructure wasn't built for a model this slow. That first run measured the harness, not the model.


Test Lab

The Machines

Every number on this page came from one of these two machines. No cloud instances, no rented GPUs — just hardware we own.

🖥

iMac 27" 5K Retina

Mid 2017 · iMac 18,3
CPU: Intel Core i7-7700K @ 4.20 GHz · 4 cores · 8 threads
RAM: 40 GB DDR4 · 38.4 GB/s bandwidth
GPU: Radeon Pro 575 · 4 GB GDDR5 · 217 GB/s bandwidth
OS: macOS 13.7.8 Ventura
Display: 27" 5120×2880 Retina


Inference Stack

Ollama 0.20.0: CPU-only · 8 threads · 4096 ctx
llama.cpp (ff5ef82): Vulkan via MoltenVK · LunarG SDK 1.4.341
💻

MacBook Pro 16"

Late 2024 · Mac16,7
Chip: Apple M4 Pro · 14 cores · 10P + 4E
RAM: 24 GB Unified LPDDR5 · 273 GB/s bandwidth
GPU: Integrated · 20-core · Shared unified memory
OS: macOS 26.5 Tahoe
Display: 16.2" 3456×2234 XDR


Inference Stack

llama.cpp (prebuilt): Metal · native Apple Silicon

🔑 Why this matters: The iMac is 8 years old with a discrete AMD GPU that most frameworks ignore. The MacBook is current-gen Apple Silicon. Benchmarking both tells you the floor and ceiling of what Gemma 4 can do on hardware you can actually buy on eBay or at the Apple Store.


Model Overview

Gemma 4 E2B (2B): avg score 71% · 78 tests · avg duration 1.4m · 13 errors
Gemma 4 E4B (4B): avg score 52% · 78 tests · avg duration 2.2m
Gemma 4 26B (26B): avg score 49% · 78 tests · avg duration 2.0m
Gemma 4 31B (31B): avg score 94% · 63 tests · avg duration 13.2m


Reality Check

iMac vs Google's published benchmarks

Google runs their benchmarks on datacenter hardware. We ran ours on a 2017 iMac. Different tests, different conditions — but the question is the same: does the model actually work?

Benchmark | Google (published) | iMac (our tests) | Notes
Math (AIME-style) | 89.2% | 92% | Our 31B scored 100% on all 3 AIME tasks. Different problems, same difficulty band.
Code Generation | 80.0% | 95% | LiveCodeBench vs our 10-test coding suite. Our tests are less exhaustive but more practical.
Science (GPQA-style) | 84.3% | 68% | Chemistry pulled us down. Our 31B got 75% on chem vs 100% on physics.
Tool Calling | - | 99% | Google doesn't publish tool calling scores. The 31B aced it; smaller models struggled.
Creative Writing | - | 91% | Subjective scoring. All models scored 85%+ on creative tasks.

⚠️ Apples and oranges: Google's published scores use standardized academic benchmarks on datacenter GPUs. Our tests are custom-designed, run locally via GPU-accelerated llama.cpp on two consumer machines. The point isn't to match their numbers — it's to answer: can you actually use this model on hardware you own?

📂 Show your work: Every prompt, grading function, and raw result is open source. View the full test suite on GitHub →

🔬 April 11 update: Added cross-hardware comparison (M4 Pro vs iMac), three-way quantization showdown, multi-turn coherence, needle-in-haystack, and 31B Think A/B testing. All collected via autonomous overnight pipeline.


Scores by Category

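Category | 2B | 4B | 26B | 31B
Performance | 88% (n=8) | 86% (n=7) | 86% (n=7) | 100% (n=8)
Reasoning | 58% (n=17) | 59% (n=17) | 59% (n=17) | 92% (n=11)
Coding | 70% (n=14) | 67% (n=14) | 65% (n=14) | 95% (n=10)
Tool Calling | 57% (n=14) | 57% (n=14) | 71% (n=14) | 99% (n=10)
Creative | 91% (n=10) | 94% (n=10) | 93% (n=10) | 91% (n=10)
Multimodal | 87% (n=6) | 94% (n=6) | 50% (n=6) | 88% (n=6)
Agentic | 73% (n=9) | 77% (n=10) | 57% (n=10) | 92% (n=8)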
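
To relate these category scores to the per-model averages in the Model Overview, one option is an n-weighted mean. The sketch below assumes that is a reasonable way to combine them; the page does not say how the overview averages were actually computed, and the names `CATEGORY_SCORES` and `weighted_mean` are ours.

```python
# (score %, n) pairs per category, copied from the Scores by Category table.
# Order: Performance, Reasoning, Coding, Tool Calling, Creative, Multimodal, Agentic.
CATEGORY_SCORES = {
    "2B":  [(88, 8), (58, 17), (70, 14), (57, 14), (91, 10), (87, 6), (73, 9)],
    "4B":  [(86, 7), (59, 17), (67, 14), (57, 14), (94, 10), (94, 6), (77, 10)],
    "26B": [(86, 7), (59, 17), (65, 14), (71, 14), (93, 10), (50, 6), (57, 10)],
    "31B": [(100, 8), (92, 11), (95, 10), (99, 10), (91, 10), (88, 6), (92, 8)],
}


def total_tests(pairs):
    """Number of tests behind a model's category scores."""
    return sum(n for _, n in pairs)


def weighted_mean(pairs):
    """Average score where each category counts in proportion to its test count."""
    return sum(score * n for score, n in pairs) / total_tests(pairs)


for model, pairs in CATEGORY_SCORES.items():
    print(f"{model}: {total_tests(pairs)} tests, weighted mean {weighted_mean(pairs):.1f}%")
```

On these inputs the weighted mean is about 71% for the 2B and 94% for the 31B, in line with the overview. For the 4B and 26B it is about 72% and 68%, higher than the 52% and 49% shown in the overview, so those overview figures presumably account for something this table does not. That is our guess, not something the page states.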

New

Quantization Showdown

7 models. 39 core tests each. The inverted ladder: Q4 > Q8 > F16. Lower precision = higher score on consumer hardware. All scores fall within a 3.1-point band, so the real story is speed, memory, and edge-case behavior.

Model | Params | Score | Speed | Disk | Perfect | Performance | Reasoning | Coding | Tool Calling | Weak spot | Score %/GB
E2B Q4 | 5.1B | 92% | 6.95 tok/s | 7.2 GB | 33/39 | 100% | 92% | 99% | 80% | ⚠️ Tool Refusal (0%) | 12.8
E2B Q8 | 5.1B | 92% | 6.95 tok/s | 8.1 GB | 31/39 | 100% | 92% | 98% | 79% | ⚠️ Tool Refusal (0%) | 11.3
E2B F16 | 5.1B | 91% | 1.72 tok/s | 10.3 GB | 31/39 | 100% | 92% | 95% | 80% | ⚠️ Speed (1.7 tok/s) for 0% gain | 8.9
E4B Q8 | 8.0B | 90% | 3.34 tok/s | 11.6 GB | 30/39 | 100% | 89% | 93% | 79% | ⚠️ Everything (worst value) | 7.7
E4B F16 | 8.0B | 90% | 0.87 tok/s | 16.0 GB | 31/39 | 100% | 89% | 94% | 80% | ⚠️ 5 hrs for +0.6% over Q8 | 5.7
Unsloth DQ4 (🏆 efficiency) | 4B (DQ4) | 91% | 3.41 tok/s | 5.1 GB | 32/39 | 100% | 89% | 93% | 85% | ⚠️ Think Mode (33%) | 17.9
31B Q4 (🏆 quality) | 31.3B | 93% | 0.6 tok/s | 19.9 GB | 43/63 | 100% | 92% | 94% | 100% | ⚠️ Speed (0.6 tok/s) | 4.7
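
The "score %/GB" figure reads as overall score divided by disk footprint. A minimal sketch of that ratio, assuming this definition and using the rounded scores shown on the cards (so results can differ from the page by about 0.1, since the page likely uses unrounded scores):

```python
# (overall score %, disk footprint in GB) per configuration, from the cards above.
MODELS = {
    "E2B Q4": (92, 7.2),
    "E2B Q8": (92, 8.1),
    "E2B F16": (91, 10.3),
    "E4B Q8": (90, 11.6),
    "E4B F16": (90, 16.0),
    "Unsloth DQ4": (91, 5.1),
    "31B Q4": (93, 19.9),
}


def score_per_gb(score_pct, disk_gb):
    """Score percentage points bought per gigabyte of disk."""
    return score_pct / disk_gb


# Rank configurations from best to worst value.
ranked = sorted(MODELS, key=lambda name: score_per_gb(*MODELS[name]), reverse=True)

for name in ranked:
    print(f"{name}: {score_per_gb(*MODELS[name]):.1f} score%/GB")
```

Under this definition Unsloth DQ4 ranks first and 31B Q4 last, which matches the efficiency and quality badges.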

Visual Analysis

The Data, Visualized

Seven models tested on the same iMac. Same tests, same grading. The story is in the shapes.

273 test results across 7 models
29 universally perfect out of 39 tests
6.95 tok/s peak (E2B Q4)
35.4 min longest test (Context Retention)

Daily Driver: E2B Q4 · Fastest + highest score · 92% · 6.95 tok/s · 7.2 GB
💎 Best Value: Unsloth DQ4 · 17.9% per GB · 91% · 3.41 tok/s · 5.1 GB
👑 Quality King: 31B Q4 · 93% accuracy · 93% · 0.6 tok/s · 19.9 GB

Model Fingerprints

Each model has a shape.

[Radar chart: one shape per model across four axes: Performance, Reasoning, Coding, Tool Calling.]

Speed vs Accuracy

Bubble size = disk footprint. Top-right is the sweet spot.

[Bubble chart: tokens/sec on the x-axis (ticks 1 to 7), accuracy on the y-axis (ticks 89% to 93%), one bubble each for E2B Q4, E2B Q8, E2B F16, E4B Q8, E4B F16, DQ4, and 31B Q4.]

The Runtime Race

Same 39 tests. The fastest model finishes before the slowest has loaded.

E2B Q8 · 36 min · 92%
E2B Q4 · 38 min · 92%
Unsloth DQ4 · 76 min · 91%
E4B Q8 · 85 min · 90%
E2B F16 · 149 min · 91%
E4B F16 · 295 min · 90%
31B Q4 · 13h 46m · 93%

Wall-clock time for 39 identical tests · score at right

Every Test. Every Model.

39 tests scored across 7 models. Bigger dots = higher scores. The pattern tells you where each model shines — and where it breaks.

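[Dot matrix: 39 tests (rows) × 7 models (columns: E2B Q4, E2B Q8, E2B F16, E4B Q8, E4B F16, DQ4, 31B Q4). Legend: 100%, 70–99%, 1–69%, 0%. Labeled rows: 🔧 Calendar Event, 🔧 No Tool Needed, 🧠 Thinking Mode ON, 🧠 Thinking Mode OFF, 💻 Complex Regex, 🔧 Web Search, 🔧 Implicit Units, 🔧 Conversational Query, 🔧 JSON Response.]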

The 31B Endurance

The longest single tests — each one a marathon at 0.60 tok/s.

Context Retention · 35.4m · 100%
Automation Planning · 34.8m · 82%
PDF Data Extraction Plan · 33.1m · 73%
Constraint Satisfaction · 31m · 100%
Thinking Mode: ON · 30.6m · 67%
API Data Pipeline Review · 28.2m · 80%
Architecture Understanding · 27.2m · 85%
Project Breakdown · 25.9m · 84%

31B Q4 at 0.60 tok/s · each test is a marathon

The Inverted Ladder

Less precision. Higher score. The counterintuitive finding.

Q4 · 92.2% · 6.95 tok/s · 7.2 GB · 38 min
Q8 · 91.8% · 6.95 tok/s · 8.1 GB · 36 min
F16 · 91.2% · 1.72 tok/s · 10.3 GB · 149 min

Same 5.1B architecture. Same 39 tests. Same iMac.
Lower precision → higher score.


New · April 10

GPU Acceleration Discovery

Everyone said GPU acceleration on Intel Macs was dead. Metal crashes on discrete AMD GPUs. ROCm is Linux-only. Ollama can't talk to the Radeon Pro 575.

Nobody tested Vulkan via MoltenVK. We compiled llama.cpp with the LunarG Vulkan SDK, and the GPU appeared instantly. The results changed everything.

Model | Size | Ollama (CPU only) | iMac Vulkan (Radeon 575) | MacBook Metal (M4 Pro) | Best | Notes
E2B | 2.9 GB | 7.3 | 37.6 | 90.2 | 12.4× | Full GPU on both
E4B | 4.8 GB | 3.7 | 24.4 | 53.3 | 14.4× | Sweet spot model
26B MoE | 16 GB | 2.0 | - | 3.7 | 1.9× | CPU-only on iMac (GPU offload fails)
31B | 17.5 GB | 0.69 | 1.15 | 1.32 | 1.9× | Memory-bandwidth bound on both

All speeds in tok/s (text generation, tg128). Best = fastest ÷ Ollama baseline.
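
The "Best" column can be recomputed from the speeds in the table. The sketch below follows the caption's definition (fastest ÷ Ollama baseline) and also derives the Metal-over-Vulkan ratio. It only includes rows where all three speeds are listed; the dictionary and function names are ours.

```python
# tok/s per model: (Ollama CPU baseline, iMac Vulkan, MacBook Metal).
SPEEDS = {
    "E2B": (7.3, 37.6, 90.2),
    "E4B": (3.7, 24.4, 53.3),
    "31B": (0.69, 1.15, 1.32),
}


def best_speedup(model):
    """Fastest measured speed divided by the Ollama CPU baseline."""
    baseline, vulkan, metal = SPEEDS[model]
    return max(vulkan, metal) / baseline


def metal_over_vulkan(model):
    """How much faster the M4 Pro (Metal) is than the iMac (Vulkan)."""
    _, vulkan, metal = SPEEDS[model]
    return metal / vulkan


for model in SPEEDS:
    print(f"{model}: best {best_speedup(model):.1f}x, "
          f"Metal vs Vulkan {metal_over_vulkan(model):.1f}x")
```

The first ratio reproduces the 12.4×, 14.4×, and 1.9× in the table. The second shows the M4 Pro advantage shrinking from about 2.4× on E2B to about 1.1× on the 31B.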

The VRAM cliff

Models under 5 GB see massive GPU acceleration (5–14×). Models over 15 GB are barely helped: the memory bus becomes the bottleneck on both iMac (PCIe 3.0) and MacBook (unified LPDDR5).

M4 Pro = 2.4× faster

On the two small models (E2B and E4B), Apple Silicon's unified memory and Metal 4 backend deliver consistent 2.2–2.4× speedups over the iMac's Vulkan path. 90 tok/s on E2B is genuinely instant-feeling.

31B is memory-bound

The 31B runs at ~1.2 tok/s on both machines. At 17.5 GB, it saturates the memory bus regardless of GPU architecture. You need 48+ GB unified for this model to breathe.
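
One way to put numbers on "memory-bound" is the usual back-of-envelope ceiling for dense models: if every weight is read once per generated token, decode speed cannot exceed memory bandwidth divided by model size. The sketch below applies that rule of thumb to the bandwidth figures from the spec cards. The rule is an assumption on our part and ignores compute time, KV-cache traffic, and any paging.

```python
MODEL_SIZE_GB = 17.5  # 31B model size from the table above

# Memory bandwidth in GB/s, from the machine spec cards.
BANDWIDTH_GB_S = {
    "iMac DDR4 (CPU path)": 38.4,
    "MacBook M4 Pro unified": 273.0,
}


def decode_ceiling_tok_s(bandwidth_gb_s, model_size_gb):
    """Upper bound on tokens/sec if each token requires one full pass over the weights."""
    return bandwidth_gb_s / model_size_gb


for name, bandwidth in BANDWIDTH_GB_S.items():
    ceiling = decode_ceiling_tok_s(bandwidth, MODEL_SIZE_GB)
    print(f"{name}: at most about {ceiling:.1f} tok/s")
```

This gives roughly 2.2 tok/s for the iMac's DDR4 and 15.6 tok/s for the M4 Pro. The measured 0.69 and 1.32 tok/s sit below both, so treat the ceiling as an upper bound rather than a prediction.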

🔧 How to reproduce: iMac (Vulkan): Compile llama.cpp with LunarG Vulkan SDK (MoltenVK). MacBook (Metal): Use prebuilt llama.cpp releases — Metal works out of the box. Use Unsloth UD-Q4_K_XL GGUFs for models ≤4B. Set -ngl 99.
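
A minimal sketch of scripting the speed measurement, assuming the numbers come from llama.cpp's `llama-bench` tool (the tg128 label suggests it, but the page does not name the tool). The model path is hypothetical, and `build_bench_cmd` is our own helper, not part of llama.cpp.

```python
import shutil
import subprocess


def build_bench_cmd(model_path, n_gpu_layers=99, n_gen=128, binary="llama-bench"):
    """Build a llama-bench invocation that offloads layers to the GPU."""
    return [
        binary,
        "-m", model_path,           # GGUF file to benchmark
        "-ngl", str(n_gpu_layers),  # layers to offload; 99 effectively means all
        "-n", str(n_gen),           # tokens to generate (tg128 when n_gen=128)
        "-o", "json",               # machine-readable output
    ]


if __name__ == "__main__":
    # Hypothetical path; point this at a GGUF you have downloaded.
    cmd = build_bench_cmd("models/example-UD-Q4_K_XL.gguf")
    if shutil.which(cmd[0]):
        subprocess.run(cmd, check=False)
    else:
        print("llama-bench not on PATH; would run:", " ".join(cmd))
```

The same command works on both machines; whether it uses Vulkan or Metal depends on how llama.cpp was built, not on the flags.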


Spotlight

Reasoning Tasks — The Heavy Lifts

The longest and hardest reasoning tasks across all models. Think mode on, math hard.

31B | AIME Math: Number Theory | 16.3m | 100%
31B | Logic: Constraint Satisfaction | 31.0m | 100%
31B | Multi-Tool: Research + Email | 15.0m | 100%
31B | Context Stress: Needle in Haystack | 11.4m | 100%
31B | eBay Buyer Negotiation | 15.6m | 100%
31B | Chart Reading: Bar Chart | 15.1m | 70%

Deep Dive

Experiments beyond the scores

Beyond the standard benchmarks, we ran targeted experiments to answer specific questions about how these models behave in practice.

We ran the same tasks with Think Mode ON and OFF across E2B and 26B models. The results were surprising — thinking doesn't always help, and sometimes it actively hurts.

Task | E2B Off | E2B Think | 26B Off | 26B Think
Math | 100% | 20% | 100% | 100%
Logic | 20% | 20% | 60% | 20%
Code | 70% | 70% | 70% | 70%
Creative | 100% | 100% | 100% | 85%

Think mode hurt E2B on math (100% → 20%) and 26B on logic (60% → 20%). On this hardware, the extra tokens spent "thinking" can actually degrade quality.
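
The A/B comparison boils down to a per-task delta (Think minus Off). A small sketch that lists every cell where Think Mode lowered the score, using the numbers from the table; the structure and names are ours.

```python
# (score with Think Mode off, score with Think Mode on), in percent.
THINK_AB = {
    "Math": {"E2B": (100, 20), "26B": (100, 100)},
    "Logic": {"E2B": (20, 20), "26B": (60, 20)},
    "Code": {"E2B": (70, 70), "26B": (70, 70)},
    "Creative": {"E2B": (100, 100), "26B": (100, 85)},
}


def think_delta(off, think):
    """Percentage-point change from enabling Think Mode (negative means it hurt)."""
    return think - off


# Collect every task/model pair where Think Mode reduced the score.
regressions = [
    (task, model, think_delta(*scores))
    for task, per_model in THINK_AB.items()
    for model, scores in per_model.items()
    if think_delta(*scores) < 0
]

for task, model, delta in regressions:
    print(f"{model} on {task}: {delta} points")
```

No cell in this table improves with Think Mode on; the three regressions are E2B on math, 26B on logic, and 26B on creative.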

We buried a passphrase in progressively larger documents to find the real context ceiling on this hardware. The advertised limits don't match reality.

[Pass/fail grid: E2B and 26B tested at 1K, 4K, 8K, and 16K tokens. Per-cell results are not recoverable from the page text.]

On 40GB RAM, E2B tops out around 4K tokens reliably. The 26B can only handle ~1K before OOM pressure causes timeouts. Forget about the advertised 128K context window on consumer hardware.

Can the models maintain coherence across a 5-turn conversation? Both E2B and 26B averaged 95%, losing points only on turn 4 (a follow-up question that required referencing context from turn 1), where each scored 75%.

E2B: 95% · 26B: 95%

Both models: turn scores [100, 100, 100, 75, 100]. Impressive coherence.


Full Test Suite

Every test, every model, every score

All 39 unique tests across 7 model configurations.

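Test | Category | E2B Q4 | E2B Q8 | E2B F16 | E4B Q8 | E4B F16 | Unsloth DQ4 | 31B Q4 | Avg
Cold Start Latency | Performance | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Short Generation (50 tokens) | Performance | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Medium Generation (200 tokens) | Performance | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Long Generation (500 tokens) | Performance | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Prompt Processing (100 tokens) | Performance | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Prompt Processing (500 tokens) | Performance | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
TTFT: Warm Start (streaming) | Performance | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
TTFT: Medium Prompt (streaming) | Performance | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
AIME Math: Number Theory | Reasoning | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
AIME Math: Combinatorics | Reasoning | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
AIME Math: Algebra | Reasoning | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Logic: Knights and Knaves | Reasoning | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Logic: Constraint Satisfaction | Reasoning | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Science: Physics | Reasoning | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Common Sense: Physical World | Reasoning | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Common Sense: Temporal | Reasoning | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Function Gen: Python Fibonacci | Coding | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Function Gen: JS Array Flatten | Coding | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Function Gen: SQL Query | Coding | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Bug Detection: Off-by-One | Coding | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Bug Detection: Memory Leak | Coding | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Algorithm: Two Sum | Coding | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Algorithm: Graph BFS | Coding | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Single Tool: Weather Query | Tool Calling | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Single Tool: Calculator | Tool Calling | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Single Tool: Web Search | Tool Calling | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Parameter Extraction: Implicit Units | Tool Calling | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Tool Refusal: Conversational Query | Tool Calling | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Structured Output: JSON Response | Tool Calling | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Refactoring: Cleanup Messy Code | Coding | 100 | 98 | 98 | 98 | 98 | 100 | 100 | 99
Error Recovery: Failed Tool Call | Tool Calling | 100 | 90 | 100 | 90 | 100 | 100 | 90 | 96
Multi-Tool: Research + Email | Tool Calling | 100 | 100 | 100 | 100 | 100 | 50 | 100 | 93
Real-World: API Data Pipeline Review | Coding | 88 | 85 | 83 | 83 | 97 | 93 | 80 | 87
Science: Chemistry | Reasoning | 75 | 75 | 75 | 50 | 50 | 75 | 75 | 68
Thinking Mode: OFF | Reasoning | 67 | 67 | 67 | 67 | 67 | 67 | 67 | 67
Code Explanation: Complex Regex | Coding | 100 | 100 | 67 | 50 | 50 | 33 | 67 | 67
Thinking Mode: ON | Reasoning | 67 | 67 | 67 | 67 | 67 | 33 | 67 | 62
Tool Refusal: No Tool Needed | Tool Calling | 0 | 0 | 0 | 0 | 0 | 100 | 100 | 29
Parameter Extraction: Calendar Event | Tool Calling | 0 | 0 | 0 | 0 | 0 | 0 | 100 | 14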

39 tests shown · 273 total results
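
The Avg column appears to be the rounded mean of the seven per-model scores. The sketch below checks that reading against the ten tests that are not 100% everywhere, and counts how many of the 39 tests every model aced. The names `ROWS` and `row_avg` are ours.

```python
# Per-model scores in column order (E2B Q4, E2B Q8, E2B F16, E4B Q8, E4B F16,
# Unsloth DQ4, 31B Q4), paired with the Avg value listed in the table.
ROWS = {
    "Refactoring: Cleanup Messy Code": ([100, 98, 98, 98, 98, 100, 100], 99),
    "Error Recovery: Failed Tool Call": ([100, 90, 100, 90, 100, 100, 90], 96),
    "Multi-Tool: Research + Email": ([100, 100, 100, 100, 100, 50, 100], 93),
    "Real-World: API Data Pipeline Review": ([88, 85, 83, 83, 97, 93, 80], 87),
    "Science: Chemistry": ([75, 75, 75, 50, 50, 75, 75], 68),
    "Thinking Mode: OFF": ([67, 67, 67, 67, 67, 67, 67], 67),
    "Code Explanation: Complex Regex": ([100, 100, 67, 50, 50, 33, 67], 67),
    "Thinking Mode: ON": ([67, 67, 67, 67, 67, 33, 67], 62),
    "Tool Refusal: No Tool Needed": ([0, 0, 0, 0, 0, 100, 100], 29),
    "Parameter Extraction: Calendar Event": ([0, 0, 0, 0, 0, 0, 100], 14),
}
TOTAL_TESTS = 39


def row_avg(scores):
    """Mean score across model configurations, rounded to a whole percent."""
    return round(sum(scores) / len(scores))


# Any row whose recomputed average disagrees with the listed one.
mismatches = [name for name, (scores, listed) in ROWS.items() if row_avg(scores) != listed]

# Every test not listed in ROWS scored 100 on all seven configurations.
universally_perfect = TOTAL_TESTS - len(ROWS)

print("rows that disagree with the listed Avg:", mismatches)
print("universally perfect tests:", universally_perfect)
```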


The original benchmark suite was run locally using Ollama on a 2017 iMac (i7-7700K, 40GB RAM, macOS); the GPU acceleration and cross-hardware results elsewhere on this page used llama.cpp. Nothing fancy: the iMac is a machine I actually use. Each test was executed sequentially with cold-start measurements, no prompt caching, and real wall-clock timing.

Models tested span Google's Gemma 4 family from the smallest E2B (2 billion effective parameters, 5.1B total) up to the full 31B. Tests cover 10 categories: performance profiling, mathematical reasoning (AIME-style problems), code generation, tool calling, creative writing, multimodal analysis, agentic task completion, parameter optimization, context window stress testing, and score calibration.

Scoring uses a 0-1 scale, shown on this page as percentages. Performance tests are scored on metrics only (latency, throughput). Reasoning and coding tests use automated verification against expected answers.
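
To make the 0-1 scale concrete, here is a hypothetical grader that scores a response as the fraction of automated checks it passes. This is an illustration only: the real grading functions live in the test suite repository, and the `grade` function and the three checks below are our own inventions.

```python
from typing import Callable, Sequence


def grade(response: str, checks: Sequence[Callable[[str], bool]]) -> float:
    """Return the fraction of checks the response passes, on a 0-1 scale."""
    if not checks:
        raise ValueError("at least one check is required")
    passed = sum(1 for check in checks if check(response))
    return passed / len(checks)


# Hypothetical checks for a "write a Fibonacci function" task.
fib_checks = [
    lambda r: "def " in r,         # defines a function
    lambda r: "fib" in r.lower(),  # names it after the task
    lambda r: "return" in r,       # returns a value instead of printing it
]

# A response that defines the function but never returns anything.
score = grade("def fib(n):\n    print(n)", fib_checks)
print(f"{score:.2f} on the 0-1 scale, shown as {round(score * 100)}%")
```

A grader with three checks produces scores of 0%, 33%, 67%, or 100%, which is one plausible source of the 33% and 67% values in the table above.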

I'm not a research lab. I'm one person who wanted to know if these models are actually useful on hardware I already own. I publish the failures alongside the wins because that's what's actually helpful.

The whole point

● 7 models tested

Some tests are still running. Some results will change as I learn more. The messy build log is more useful than the polished launch post.


New · April 11

Three-Way Quantization Showdown

We ran Q8_0, Google Q4_K_M, and Unsloth UD-Q4_K_XL through identical 8-question quality suites on the same iMac hardware. Same prompts, same seed, same temperature.

Quantization | Size | Speed | Quality | Score%/GB
Q8_0 (8-bit baseline) | 4.69 GB | 30.12 tok/s | 7/8 | 18.8
Q4_K_M (Google) | 3.21 GB | 35.36 tok/s | 6/8 | 23.4
UD-Q4_K_XL (Unsloth) 🏆 winner | 2.94 GB | 38.16 tok/s | 7/8 | 29.8

Why Unsloth wins

Unsloth's dynamic quantization uses imatrix calibration data to identify which weights matter most. High-impact weights get Q5/Q6 precision while low-impact weights get Q4. The result: smaller file, faster inference, same quality as Q8.

⚠️ The 31B exception

Unsloth UD-Q4_K_XL fails on the 31B at <2 tok/s. The model starts correct answers then degenerates into garbage tokens. At slow inference speeds, the quantization's approximations compound across tokens. For 31B, use Google's standard Q4_K_M or higher precision.


New · April 11

Agentic Readiness

Can these models actually work as local agents? We tested context retrieval, multi-turn memory, and Think Mode overhead.

Needle-in-Haystack · 4/4

Buried a secret password in 10, 50, 200, and 500 filler sentences. E4B found it every time. Zero degradation at any depth.

depth=10: 3.9s
depth=50: 7.6s
depth=200: 19.4s
depth=500: 37.4s
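
A minimal sketch of how such a haystack could be built: a number of filler sentences with one sentence carrying the password. The filler text, the password, and the choice to bury the needle in the middle are all assumptions of ours; the page does not describe the exact prompt.

```python
def build_haystack(depth: int, secret: str) -> str:
    """Return `depth` filler sentences with the secret sentence buried in the middle."""
    filler = [f"Filler sentence number {i} says nothing important." for i in range(depth)]
    needle = f"The secret password is {secret}."
    middle = depth // 2
    return " ".join(filler[:middle] + [needle] + filler[middle:])


def found(response: str, secret: str) -> bool:
    """Pass if the model's reply contains the secret, ignoring case."""
    return secret.lower() in response.lower()


# Same four depths as the test above; the password is hypothetical.
for depth in (10, 50, 200, 500):
    prompt = build_haystack(depth, "heron-42") + " What is the secret password?"
    print(depth, len(prompt.split()), "words")
```

Sending each prompt to the model and passing its reply to `found` would give a pass/fail result per depth.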

Multi-Turn Memory · 4/5

5-turn conversation testing fact retention. The E4B remembered name, city, and hardware across all turns.

Inferred city from riddle
Recalled user name
Hedged on VRAM (miss)
Recalled hardware model
Full summary correct

31B Think Mode · 0/8

Think Mode on the 31B made everything worse. Without thinking: 3/8. With thinking: 0/8. The hidden reasoning tokens eat into the answer budget at low tok/s.

No-think: 3/8 · avg 206s
Think (1024 budget): 0/8 · avg 292s
Overhead: 1.4× wall time

🤖 The verdict: The E4B is agent-ready. Perfect context retrieval, strong multi-turn memory, and 24–53 tok/s depending on hardware. Think Mode is only useful on models fast enough that the thinking overhead doesn't crowd out the answer. For agentic workflows, disable Think Mode on anything below 10 tok/s.
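
The 10 tok/s rule of thumb is easy to encode as a guard in an agent's configuration. The sketch below does that and estimates how long a full thinking budget takes to generate at a given speed. The function names are ours, and the time estimate assumes the whole budget is used.

```python
THINK_MIN_TOK_S = 10.0  # threshold from the verdict above


def use_think_mode(measured_tok_s: float, threshold: float = THINK_MIN_TOK_S) -> bool:
    """Enable Think Mode only when the model generates at or above the threshold."""
    return measured_tok_s >= threshold


def think_budget_seconds(budget_tokens: int, tok_s: float) -> float:
    """Wall-clock time to emit a full thinking budget at the given speed."""
    return budget_tokens / tok_s


# Measured iMac Vulkan speeds from the GPU acceleration table.
for name, tok_s in {"E4B": 24.4, "31B": 1.15}.items():
    seconds = think_budget_seconds(1024, tok_s)
    print(f"{name}: think={use_think_mode(tok_s)}, 1024-token budget takes {seconds:.0f}s")
```

A 1024-token budget costs about 42 seconds at 24.4 tok/s but roughly 15 minutes at 1.15 tok/s, which illustrates why the overhead only pays off on fast models.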


New · April 12

From Benchmarks to Bots: The Clawdy Pipeline

Benchmarking is just theory until you build something real. We took all the data from our Vulkan optimization tests and wired the unsloth-e4b model directly into an autonomous OpenClaw agent bridged to Telegram. The result? A fully functional AI assistant texting us from a 2017 iMac at 24 tok/s.

The Subagent Revelation

When told to autonomously research 10 web sources simultaneously, the OpenClaw orchestration engine realized the task would lock up the single-threaded Telegram chat. Instead of freezing, it elegantly spawned an invisible background "Subagent" that crunched internet data for 42 minutes, leaving the main bot thread perfectly free for casual conversation.

Optimizing for "Tiered Routing"

The next evolution of this local infrastructure is mapping these subagents to the iMac's physical constraints. We are building a "Router + Heavy Lifter" architecture: a tiny 2B model lives on the GPU responding to texts instantly, while complex agent tasks are silently passed to a 26B MoE running purely on the CPU from the 40 GB of system RAM. Zero resource contention.
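
A rough sketch of what the routing decision could look like. Everything here is hypothetical: the model tags, the `Tier` type, and the rule itself (tool use or a long message goes to the heavy tier) are our assumptions, not the OpenClaw implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    model: str   # hypothetical model tag
    device: str  # where the model is meant to run


ROUTER = Tier("router", "gemma-4-e2b", "gpu")     # small model, instant replies
HEAVY = Tier("heavy", "gemma-4-26b-moe", "cpu")   # large model, system RAM only


def pick_tier(message: str, needs_tools: bool, max_chat_chars: int = 280) -> Tier:
    """Send tool-using or long requests to the heavy tier, everything else to the router."""
    if needs_tools or len(message) > max_chat_chars:
        return HEAVY
    return ROUTER


print(pick_tier("what's up?", needs_tools=False).name)
print(pick_tier("research 10 web sources and summarize them", needs_tools=True).name)
```

Because the two tiers sit on different devices, a long-running heavy task should not slow down chat replies, which is the "zero resource contention" goal described above.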

[Screenshot: Telegram bot explaining it is spawning a one-shot subagent]

💡 The bottom line: Benchmarks proved the iMac could run it. The Telegram bot proved it is running it. We are no longer testing models; we are deploying autonomous AI infrastructure on 8-year-old consumer hardware.


Takeaways

What we learned

Unsloth is the daily driver

UD-Q4_K_XL: 2.94 GB, 38.16 tok/s, 7/8 accuracy. Smallest file, fastest inference, tied for best quality. On the MacBook M4 Pro it reaches 90.15 tok/s, which is instant-feeling inference on a laptop. Dynamic quantization preserves what matters and compresses what doesn't.

The E4B is the sweet spot

At 24.4 tok/s (iMac) or 53.3 tok/s (MacBook), the E4B delivers 7.5B-parameter quality at conversational speed. Perfect needle-in-haystack retrieval. 4/5 multi-turn memory. Agent-ready.

Speed affects quality, not just UX

The Unsloth quant works perfectly at 30+ tok/s on smaller models, but on the 31B at <2 tok/s it collapses into garbage token generation. Quantization quality is speed-dependent, a finding you only get from testing on real hardware with real constraints.

Two machines tell different stories

The iMac (Vulkan, 4 GB VRAM) proved GPU acceleration works on legacy hardware. The MacBook (Metal, 24 GB unified) showed the speed ceiling for Apple Silicon. Both agree: models under 5 GB are transformative. Models over 15 GB are paperweights on consumer hardware.

The 31B redemption (and limits)

The original benchmark suite showed 94% accuracy (0 errors) via Ollama on the 31B. GPU-accelerated Unsloth quant dropped to 3/8 due to the garbage token issue. The 31B is a phenomenal model — but only with the right quantization at the right speed.

Autonomous pipelines work

15 JSON result files from an overnight pipeline that ran across two machines with zero human intervention. Bash orchestrators watching for completion signals, auto-restarting servers, saving data after every test. Local AI research doesn't need a cluster — it needs good plumbing.
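
The "save after every test" idea is the part that makes an overnight run survivable. The author's orchestrators are Bash scripts; the sketch below shows the same pattern in Python with hypothetical test callables, and is not the actual pipeline.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict


def run_suite(tests: Dict[str, Callable[[], float]], out_path: Path) -> dict:
    """Run tests in order, recording failures and rewriting the JSON file after each one."""
    results = {}
    for name, test in tests.items():
        try:
            results[name] = {"score": test(), "error": None}
        except Exception as exc:  # one failed test should not end the night
            results[name] = {"score": 0.0, "error": repr(exc)}
        # Persist after every test so a crash loses at most the test in flight.
        out_path.write_text(json.dumps(results, indent=2))
    return results


if __name__ == "__main__":
    def flaky() -> float:
        raise TimeoutError("server did not respond")

    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "results.json"
        run_suite({"cold_start": lambda: 1.0, "flaky": flaky}, out)
        print(out.read_text())
```

Restarting a hung server between tests would slot into the `except` branch; that part is left out here because it depends on how the server is launched.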