Research · Updated April 11

Gemma 4 Benchmark Suite

414+ tests across 7 models on two machines. iMac 2017 via Vulkan, MacBook M4 Pro via Metal. Autonomous overnight pipeline. 15 result files. Zero cloud. Zero API keys.

414+ total tests · 7 categories · 3 quantizations compared · 2 machines · 2 GPU backends

🐘 The headline: The 31B Dense model scored 94% across 63 tests with zero errors. One week earlier, the same model scored 21% with 83 errors, because the testing infrastructure wasn't built for a model this slow. The 21% measured the harness, not the model.

Test Lab

The Machines

Every number on this page came from one of these two machines. No cloud instances, no rented GPUs — just hardware we own.

🖥

iMac 27" 5K Retina

Mid 2017 · iMac 18,3

CPU

Intel Core i7-7700K @ 4.20 GHz

4 cores · 8 threads

RAM

40 GB DDR4

38.4 GB/s bandwidth

GPU

Radeon Pro 575 · 4 GB GDDR5

217 GB/s bandwidth

macOS 13.7.8 Ventura

Display

27" 5120×2880 Retina

Inference Stack

Ollama 0.20.0 · CPU-only · 8 threads · 4096 ctx

llama.cpp (ff5ef82) · Vulkan via MoltenVK · LunarG SDK 1.4.341

💻

MacBook Pro 16"

Late 2024 · Mac16,7

Chip

Apple M4 Pro

14 cores · 10P + 4E

RAM

24 GB Unified LPDDR5

273 GB/s bandwidth

GPU

Integrated · 20-core

Shared unified memory

macOS 26.5 Tahoe

Display

16.2" 3456×2234 XDR

Inference Stack

llama.cpp (prebuilt) · Metal · native Apple Silicon

🔑 Why this matters: The iMac is 8 years old with a discrete AMD GPU that most frameworks ignore. The MacBook is current-gen Apple Silicon. Benchmarking both tells you the floor and ceiling of what Gemma 4 can do on hardware you can actually buy on eBay or at the Apple Store.

Model Overview

Model	Size	Avg Score	Avg Duration
Gemma 4 E2B	2B	71%	1.4m
Gemma 4 E4B	4B	52%	2.2m
Gemma 4 26B	26B	49%	2.0m
Gemma 4 31B	31B	94%	13.2m

Reality Check

iMac vs Google's published benchmarks

Google runs their benchmarks on datacenter hardware. We ran ours on a 2017 iMac. Different tests, different conditions — but the question is the same: does the model actually work?

Benchmark	Google (published)	iMac (our tests)	Notes
Math (AIME-style)	89.2%	92%	Our 31B scored 100% on all 3 AIME tasks. Different problems, same difficulty band.
Code Generation	80.0%	95%	LiveCodeBench vs our 10-test coding suite. Our tests are less exhaustive but more practical.
Science (GPQA-style)	84.3%	68%	Chemistry pulled us down. Our 31B got 75% on chem vs 100% on physics.
Tool Calling	—	99%	Google doesn't publish tool calling scores. The 31B aced it; smaller models struggled.
Creative Writing	—	91%	Subjective scoring. All models scored 85%+ on creative tasks.

⚠️ Apples and oranges: Google's published scores use standardized academic benchmarks on datacenter GPUs. Our tests are custom-designed, run locally via GPU-accelerated llama.cpp on two consumer machines. The point isn't to match their numbers — it's to answer: can you actually use this model on hardware you own?

📂 Show your work: Every prompt, grading function, and raw result is open source. View the full test suite on GitHub →

🔬 April 11 update: Added cross-hardware comparison (M4 Pro vs iMac), three-way quantization showdown, multi-turn coherence, needle-in-haystack, and 31B Think A/B testing. All collected via autonomous overnight pipeline.

Scores by Category

Category	2B	4B	26B	31B
Performance	88% (n=8)	86% (n=7)	86% (n=7)	100% (n=8)
Reasoning	58% (n=17)	59% (n=17)	59% (n=17)	92% (n=11)
Coding	70% (n=14)	67% (n=14)	65% (n=14)	95% (n=10)
Tool Calling	57% (n=14)	57% (n=14)	71% (n=14)	99% (n=10)
Creative	91% (n=10)	94% (n=10)	93% (n=10)	91% (n=10)
Multimodal	87% (n=6)	94% (n=6)	50% (n=6)	88% (n=6)
Agentic	73% (n=9)	77% (n=10)	57% (n=10)	92% (n=8)

New

Quantization Showdown

7 models. 39 core tests each. The inverted ladder: Q4 > Q8 > F16. Lower precision = higher score on consumer hardware. All scores within a 3.1% band — so the real story is speed, memory, and edge-case behavior.

Model	Params	Score	Speed	Disk	Perfect	Performance	Reasoning	Coding	Tool Calling	⚠️ Weak spot	Score%/GB
E2B Q4	5.1B	92%	6.95 tok/s	7.2 GB	33/39	100%	92%	99%	80%	Tool Refusal (0%)	12.8
E2B Q8	5.1B	92%	6.95 tok/s	8.1 GB	31/39	100%	92%	98%	79%	Tool Refusal (0%)	11.3
E2B F16	5.1B	91%	1.72 tok/s	10.3 GB	31/39	100%	92%	95%	80%	Speed (1.7 tok/s) for 0% gain	8.9
E4B Q8	8.0B	90%	3.34 tok/s	11.6 GB	30/39	100%	89%	93%	79%	Everything (worst value)	7.7
E4B F16	8.0B	90%	0.87 tok/s	16.0 GB	31/39	100%	89%	94%	80%	5 hrs for +0.6% over Q8	5.7
Unsloth DQ4 🏆 efficiency	4B (DQ4)	91%	3.41 tok/s	5.1 GB	32/39	100%	89%	93%	85%	Think Mode (33%)	17.9
31B Q4 🏆 quality	31.3B	93%	0.6 tok/s	19.9 GB	43/63	100%	92%	94%	100%	Speed (0.6 tok/s)	4.7

Visual Analysis

The Data, Visualized

Seven models tested on the same iMac. Same tests, same grading. The story is in the shapes.

273 test results across 7 models · 29 universally perfect out of 39 tests · 6.95 tok/s peak (E2B Q4) · 35.4 min longest test (Context Retention)

Context Retention

⚡

Daily Driver

E2B Q4

Fastest + highest score

92% · 6.95 tok/s · 7.2 GB

💎

Best Value

Unsloth DQ4

17.9% per GB

91% · 3.41 tok/s · 5.1 GB

👑

Quality King

31B Q4

93% accuracy

93% · 0.6 tok/s · 19.9 GB

[Figure: Model Fingerprints. One radar shape per model.]

[Figure: Speed vs Accuracy. Bubble size = disk footprint. Top-right is the sweet spot.]

The Runtime Race

Same 39 tests. The fastest model finishes before the slowest has loaded.

Model	Wall-clock time	Score
E2B Q8	36 min	92%
E2B Q4	38 min	92%
Unsloth DQ4	76 min	91%
E4B Q8	85 min	90%
E2B F16	149 min	91%
E4B F16	295 min	90%
31B Q4	13h 46m	93%

Wall-clock time for 39 identical tests.

Every Test. Every Model.

39 tests scored across 7 models. Bigger dots = higher scores. The pattern tells you where each model shines — and where it breaks.

[Figure: dot matrix of 39 tests × 7 models (E2B Q4, E2B Q8, E2B F16, E4B Q8, E4B F16, DQ4, 31B Q4). Legend: 100%, 70–99%, 1–69%.]

The 31B Endurance

The longest single tests — each one a marathon at 0.60 tok/s.

Test	Duration	Score
Context Retention	35.4m	100%
Automation Planning	34.8m	82%
PDF Data Extraction Plan	33.1m	73%
Constraint Satisfaction	31m	100%
—	30.6m	67%
API Data Pipeline Review	28.2m	80%
Architecture Understanding	27.2m	85%
Project Breakdown	25.9m	84%

The Inverted Ladder

Less precision. Higher score. The counterintuitive finding.

Precision	Score	Speed	Disk	Runtime
Q4	92.2%	6.95 tok/s	7.2 GB	38 min
Q8	91.8%	6.95 tok/s	8.1 GB	36 min
F16	91.2%	1.72 tok/s	10.3 GB	149 min

Same 5.1B architecture. Same 39 tests. Same iMac.
Lower precision → higher score.
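The score%/GB figure shown for each model looks like the score divided by disk size. That formula is inferred from the numbers rather than stated anywhere on this page, so treat the sketch below as a sanity check under that assumption. It uses the unrounded E2B scores and disk sizes listed here.

```python
# Unrounded E2B scores (%) and disk sizes (GB) as listed in "The Inverted Ladder".
E2B_CONFIGS = {
    "E2B Q4": (92.2, 7.2),
    "E2B Q8": (91.8, 8.1),
    "E2B F16": (91.2, 10.3),
}


def score_per_gb(score_pct: float, disk_gb: float) -> float:
    """Benchmark score (percent) divided by on-disk size (GB), to one decimal."""
    return round(score_pct / disk_gb, 1)


efficiency = {name: score_per_gb(s, gb) for name, (s, gb) in E2B_CONFIGS.items()}
for name, value in efficiency.items():
    print(f"{name}: {value} score%/GB")
```

Under this assumption the three E2B values agree with the published 12.8, 11.3, and 8.9. Using the rounded whole-number scores instead can shift the last digit.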

New · April 10

GPU Acceleration Discovery

Everyone said GPU acceleration on Intel Mac was dead. Metal crashes on discrete AMD GPUs. ROCm is Linux-only. Ollama can't talk to the Radeon Pro 575.

Nobody tested Vulkan via MoltenVK. We compiled llama.cpp with the LunarG Vulkan SDK, and the GPU appeared instantly. The results changed everything.

Model	Size	Ollama (CPU only)	iMac Vulkan (Radeon 575)	MacBook Metal (M4 Pro)	Best	Notes
E2B	2.9 GB	7.3	37.6	90.2	12.4×	Full GPU on both
E4B	4.8 GB	3.7	24.4	53.3	14.4×	Sweet spot model
26B MoE	16 GB	2.0	3.7	—	1.9×	CPU-only on iMac (GPU offload fails)
31B	17.5 GB	0.69	1.15	1.32	1.9×	Memory-bandwidth bound on both

All speeds in tok/s (text generation, tg128). Best = fastest ÷ Ollama baseline.
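A small sketch of how the Best column follows from the definition above (fastest run divided by the Ollama baseline). The dictionary simply restates the table; `best_speedup` is an illustrative helper name, not part of the test suite.

```python
# tok/s from the table: (Ollama CPU, iMac Vulkan, MacBook Metal).
# None marks the model with no MacBook result.
SPEEDS = {
    "E2B": (7.3, 37.6, 90.2),
    "E4B": (3.7, 24.4, 53.3),
    "26B MoE": (2.0, 3.7, None),
    "31B": (0.69, 1.15, 1.32),
}


def best_speedup(ollama, *accelerated):
    """Fastest measured run divided by the Ollama CPU baseline."""
    fastest = max(s for s in accelerated if s is not None)
    return fastest / ollama


speedups = {model: best_speedup(*row) for model, row in SPEEDS.items()}
for model, factor in speedups.items():
    print(f"{model}: {factor:.1f}x")
```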

The VRAM cliff

Models under 5 GB see massive GPU acceleration (5–14×). Models over 15 GB are barely helped (under 2×): the memory bus becomes the bottleneck on both iMac (PCIe 3.0) and MacBook (unified LPDDR5).

M4 Pro = 2.4× faster

Apple Silicon's unified memory and Metal 4 backend deliver 2.2–2.4× speedups over the iMac's Vulkan path on E2B and E4B; on the 31B the gap shrinks to about 1.15×. 90 tok/s on E2B is genuinely instant-feeling.

31B is memory-bound

The 31B runs at ~1.2 tok/s on both machines. At 17.5 GB, it saturates the memory bus regardless of GPU architecture. You need 48+ GB unified for this model to breathe.
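A common back-of-envelope bound for dense models is that each generated token has to stream the full weight file through memory once, so tok/s cannot exceed bandwidth divided by model size. The sketch below applies that rule of thumb to the 31B using the bandwidth figures from the machine specs. It is an assumption-laden estimate that ignores caches, KV-cache traffic, and partial GPU offload, not a measurement.

```python
MODEL_GB = 17.5  # 31B file size from the speed table

# Memory bandwidth figures from "The Machines" section, in GB/s.
BANDWIDTH_GBPS = {
    "iMac system RAM (DDR4)": 38.4,
    "MacBook unified memory (LPDDR5)": 273.0,
}


def max_tokens_per_second(bandwidth_gbps: float, model_gb: float) -> float:
    """Upper bound if every token streams the full weight file once."""
    return bandwidth_gbps / model_gb


bounds = {
    name: max_tokens_per_second(bw, MODEL_GB)
    for name, bw in BANDWIDTH_GBPS.items()
}
for name, bound in bounds.items():
    print(f"{name}: at most ~{bound:.1f} tok/s")
```

On the iMac the measured 0.69–1.15 tok/s sits within a factor of two or three of this bound. On the MacBook the measured 1.32 tok/s is much further below it, which may point to the 24 GB capacity limit (hence the 48+ GB remark) rather than raw bandwidth alone. That reading is a guess from the numbers, not something these tests isolated.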

🔧 How to reproduce: iMac (Vulkan): Compile llama.cpp with LunarG Vulkan SDK (MoltenVK). MacBook (Metal): Use prebuilt llama.cpp releases — Metal works out of the box. Use Unsloth UD-Q4_K_XL GGUFs for models ≤4B. Set -ngl 99.
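As a rough sketch of the measurement step, the helper below builds a `llama-bench` command with all layers offloaded (`-ngl 99`) and 128 generated tokens (the tg128 figure). The binary path and the GGUF filename are assumptions for illustration; adjust both to your own build and download.

```python
import shlex


def bench_command(model_path: str, binary: str = "./build/bin/llama-bench") -> list[str]:
    """Build a llama-bench invocation with all layers offloaded to the GPU."""
    return [
        binary,              # assumed location of a local llama.cpp build
        "-m", model_path,    # GGUF file to benchmark
        "-ngl", "99",        # offload up to 99 layers, as in the note above
        "-n", "128",         # generate 128 tokens (tg128)
        "-o", "json",        # machine-readable output
    ]


# Hypothetical filename; use whichever GGUF you downloaded.
cmd = bench_command("models/gemma-4-e2b-UD-Q4_K_XL.gguf")
print(shlex.join(cmd))
```

The list form can be passed straight to `subprocess.run(cmd, capture_output=True, text=True)` once the binary exists.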

Spotlight

Reasoning Tasks — The Heavy Lifts

The longest and hardest reasoning tasks across all models. Think mode on, math hard.

Model	Task	Duration	Score
31B	AIME Math: Number Theory	16.3m	100%
31B	Logic: Constraint Satisfaction	31.0m	100%
31B	Multi-Tool: Research + Email	15.0m	100%
31B	Context Stress: Needle in Haystack	11.4m	100%
31B	eBay Buyer Negotiation	15.6m	100%
31B	Chart Reading: Bar Chart	15.1m	70%

Deep Dive

Experiments beyond the scores

Beyond the standard benchmarks, we ran targeted experiments to answer specific questions about how these models behave in practice. Expand each to dig in.

We ran the same tasks with Think Mode ON and OFF across E2B and 26B models. The results were surprising — thinking doesn't always help, and sometimes it actively hurts.

Task	E2B Off	E2B Think	26B Off	26B Think
Math	100%	20%	100%	100%
Logic	20%	20%	60%	20%
Code	70%	70%	70%	70%
Creative	100%	100%	100%	85%

Think mode hurt E2B on math (100% → 20%) and 26B on logic (60% → 20%). On this hardware, the extra tokens spent "thinking" can actually degrade quality.
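To make the A/B comparison explicit, this sketch restates the table and lists every task where turning Think Mode on changed the score. The helper name `think_deltas` is illustrative.

```python
# Scores (%) from the table above, as (Think OFF, Think ON).
RESULTS = {
    "E2B": {"Math": (100, 20), "Logic": (20, 20), "Code": (70, 70), "Creative": (100, 100)},
    "26B": {"Math": (100, 100), "Logic": (60, 20), "Code": (70, 70), "Creative": (100, 85)},
}


def think_deltas(results):
    """Return {(model, task): on - off} for every task whose score changed."""
    return {
        (model, task): on - off
        for model, tasks in results.items()
        for task, (off, on) in tasks.items()
        if on != off
    }


deltas = think_deltas(RESULTS)
for (model, task), delta in deltas.items():
    print(f"{model} {task}: {delta:+d} points with Think Mode")
```

Every change in this table is negative: Think Mode never raised a score in these eight comparisons.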

We buried a passphrase in progressively larger documents to find the real context ceiling on this hardware. The advertised limits don't match reality.

Context	E2B	26B
1K tokens	✓	✓
4K tokens	✓	✗
8K tokens	✗	✗
16K tokens	✗	✗

On 40 GB RAM, E2B tops out around 4K tokens reliably. The 26B can only handle ~1K before OOM pressure causes timeouts. Forget about the advertised 128K context window on consumer hardware.

Can the models maintain coherence across a 5-turn conversation? Both E2B and 26B scored 95%, losing points only on turn 4 (a follow-up question that required referencing context from turn 1).

E2B

95%

26B

95%

Both models: turn scores [100, 100, 100, 75, 100]. Impressive coherence.
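The 95% is consistent with a plain average of the five turn scores; that averaging rule is an assumption, checked here:

```python
from statistics import mean

turn_scores = [100, 100, 100, 75, 100]  # per-turn scores reported above

overall = mean(turn_scores)
weakest_turn = turn_scores.index(min(turn_scores)) + 1  # 1-based turn number

print(f"overall: {overall}%, weakest turn: {weakest_turn}")
```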

Full Test Suite

Every test, every model, every score

All 39 unique tests across 7 model configurations.

Test	Cat	E2B Q4	E2B Q8	E2B F16	E4B Q8	E4B F16	Unsloth DQ4	31B Q4	Avg ↓
Cold Start Latency	Perf	100	100	100	100	100	100	100	100
Short Generation (50 tokens)	Perf	100	100	100	100	100	100	100	100
Medium Generation (200 tokens)	Perf	100	100	100	100	100	100	100	100
Long Generation (500 tokens)	Perf	100	100	100	100	100	100	100	100
Prompt Processing (100 tokens)	Perf	100	100	100	100	100	100	100	100
Prompt Processing (500 tokens)	Perf	100	100	100	100	100	100	100	100
TTFT: Warm Start (streaming)	Perf	100	100	100	100	100	100	100	100
TTFT: Medium Prompt (streaming)	Perf	100	100	100	100	100	100	100	100
AIME Math: Number Theory	Reas	100	100	100	100	100	100	100	100
AIME Math: Combinatorics	Reas	100	100	100	100	100	100	100	100
AIME Math: Algebra	Reas	100	100	100	100	100	100	100	100
Logic: Knights and Knaves	Reas	100	100	100	100	100	100	100	100
Logic: Constraint Satisfaction	Reas	100	100	100	100	100	100	100	100
Science: Physics	Reas	100	100	100	100	100	100	100	100
Common Sense: Physical World	Reas	100	100	100	100	100	100	100	100
Common Sense: Temporal	Reas	100	100	100	100	100	100	100	100
Function Gen: Python Fibonacci	Codi	100	100	100	100	100	100	100	100
Function Gen: JS Array Flatten	Codi	100	100	100	100	100	100	100	100
Function Gen: SQL Query	Codi	100	100	100	100	100	100	100	100
Bug Detection: Off-by-One	Codi	100	100	100	100	100	100	100	100
Bug Detection: Memory Leak	Codi	100	100	100	100	100	100	100	100
Algorithm: Two Sum	Codi	100	100	100	100	100	100	100	100
Algorithm: Graph BFS	Codi	100	100	100	100	100	100	100	100
Single Tool: Weather Query	Tool	100	100	100	100	100	100	100	100
Single Tool: Calculator	Tool	100	100	100	100	100	100	100	100
Single Tool: Web Search	Tool	100	100	100	100	100	100	100	100
Parameter Extraction: Implicit Units	Tool	100	100	100	100	100	100	100	100
Tool Refusal: Conversational Query	Tool	100	100	100	100	100	100	100	100
Structured Output: JSON Response	Tool	100	100	100	100	100	100	100	100
Refactoring: Cleanup Messy Code	Codi	100	98	98	98	98	100	100	99
Error Recovery: Failed Tool Call	Tool	100	90	100	90	100	100	90	96
Multi-Tool: Research + Email	Tool	100	100	100	100	100	50	100	93
Real-World: API Data Pipeline Review	Codi	88	85	83	83	97	93	80	87
Science: Chemistry	Reas	75	75	75	50	50	75	75	68
Thinking Mode: OFF	Reas	67	67	67	67	67	67	67	67
Code Explanation: Complex Regex	Codi	100	100	67	50	50	33	67	67
Thinking Mode: ON	Reas	67	67	67	67	67	33	67	62
Tool Refusal: No Tool Needed	Tool	0	0	0	0	0	100	100	29
Parameter Extraction: Calendar Event	Tool	0	0	0	0	0	0	100	14

39 tests shown · 273 total results
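The Avg column appears to be the rounded mean of the seven per-model scores. A short check on a few of the non-perfect rows, under that assumption:

```python
# Column order: E2B Q4, E2B Q8, E2B F16, E4B Q8, E4B F16, Unsloth DQ4, 31B Q4.
ROWS = {
    "Refactoring: Cleanup Messy Code": [100, 98, 98, 98, 98, 100, 100],
    "Error Recovery: Failed Tool Call": [100, 90, 100, 90, 100, 100, 90],
    "Science: Chemistry": [75, 75, 75, 50, 50, 75, 75],
    "Tool Refusal: No Tool Needed": [0, 0, 0, 0, 0, 100, 100],
    "Parameter Extraction: Calendar Event": [0, 0, 0, 0, 0, 0, 100],
}


def row_average(scores):
    """Mean of the per-model scores, rounded to a whole percent."""
    return round(sum(scores) / len(scores))


averages = {test: row_average(scores) for test, scores in ROWS.items()}
for test, avg in averages.items():
    print(f"{avg:>3}  {test}")
```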

The core benchmark suite was run locally using Ollama on a 2017 iMac (i7-7700K, 40 GB RAM, macOS); the GPU-accelerated and cross-hardware runs used llama.cpp. Nothing fancy: this is a machine I actually use. Each test was executed sequentially with cold-start measurements, no prompt caching, and real wall-clock timing.

Models tested span Google's Gemma 4 family from the smallest E2B (2 billion parameters) up to the full 31B. Tests cover 10 categories: performance profiling, mathematical reasoning (AIME-style problems), code generation, tool calling, creative writing, multimodal analysis, agentic task completion, parameter optimization, context window stress testing, and score calibration.

Scoring uses a 0–1 scale, reported on this page as percentages. Performance tests are scored on metrics only (latency, throughput). Reasoning and coding tests use automated verification against expected answers.

I'm not a research lab. I'm one person who wanted to know if these models are actually useful on hardware I already own. I publish the failures alongside the wins because that's what's actually helpful.
— The whole point

● 7 models tested

Some tests are still running. Some results will change as I learn more. The messy build log is more useful than the polished launch post.

Full source: prompts, grading, raw results→

New · April 11

Three-Way Quantization Showdown

We ran Q8_0, Google Q4_K_M, and Unsloth UD-Q4_K_XL through identical 8-question quality suites on the same iMac hardware. Same prompts, same seed, same temperature.

Quantization	Size	Speed	Quality	Score%/GB
Q8_0 (8-bit baseline)	4.69 GB	30.12 tok/s	7/8	18.8
Q4_K_M (Google)	3.21 GB	35.36 tok/s	6/8	23.4
UD-Q4_K_XL (Unsloth) 🏆 winner	2.94 GB	38.16 tok/s	7/8	29.8

Why Unsloth wins

Unsloth's dynamic quantization uses imatrix calibration data to identify which weights matter most. High-impact weights get Q5/Q6 precision while low-impact weights get Q4. The result: smaller file, faster inference, same quality as Q8.

⚠️ The 31B exception

Unsloth UD-Q4_K_XL fails on the 31B at <2 tok/s. The model starts correct answers then degenerates into garbage tokens. At slow inference speeds, the quantization's approximations compound across tokens. For 31B, use Google's standard Q4_K_M or higher precision.

New · April 11

Agentic Readiness

Can these models actually work as local agents? We tested context retrieval, multi-turn memory, and Think Mode overhead.

4/4

Needle-in-Haystack

Buried a secret password in 10, 50, 200, and 500 filler sentences. E4B found it every time. Zero degradation at any depth.

Depth	Result	Time
10	✅	3.9s
50	✅	7.6s
200	✅	19.4s
500	✅	37.4s
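A minimal sketch of how a needle-in-haystack prompt like this can be assembled. The filler text, the needle wording, and the mid-document placement are all made up for illustration; the actual prompts live in the test suite.

```python
def build_haystack(depth: int, needle: str, position: float = 0.5) -> str:
    """Return `depth` filler sentences with `needle` inserted part-way through."""
    filler = [f"Filler sentence number {i} says nothing important." for i in range(depth)]
    insert_at = int(depth * position)
    sentences = filler[:insert_at] + [needle] + filler[insert_at:]
    return " ".join(sentences)


NEEDLE = "The secret password is tangerine-42."  # hypothetical needle
prompts = {depth: build_haystack(depth, NEEDLE) for depth in (10, 50, 200, 500)}

for depth, text in prompts.items():
    question = text + "\n\nWhat is the secret password?"
    print(depth, len(question.split()), "words")
```

A run passes when the model's reply contains the password; varying `position` would also probe whether placement inside the document matters.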

4/5

Multi-Turn Memory

5-turn conversation testing fact retention. The E4B remembered name, city, and hardware across all turns.

✅ Inferred city from riddle

✅ Recalled user name

❌ Hedged on VRAM (miss)

✅ Recalled hardware model

✅ Full summary correct

0/8

31B Think Mode

Think Mode on the 31B made everything worse. Without thinking: 3/8. With thinking: 0/8. The hidden reasoning tokens eat into the answer budget at low tok/s.

No-think: 3/8 · avg 206s

Think (1024 budget): 0/8 · avg 292s

Overhead: 1.4× wall time

🤖 The verdict: The E4B is agent-ready. Perfect context retrieval, strong multi-turn memory, and 24–53 tok/s depending on hardware. Think Mode is only useful on models fast enough that the thinking overhead doesn't crowd out the answer. For agentic workflows, disable Think Mode on anything below 10 tok/s.

New · April 12

From Benchmarks to Bots: The Clawdy Pipeline

Benchmarking is just theory until you build something real. We took all the data from our Vulkan optimization tests and wired the unsloth-e4b model directly into an autonomous OpenClaw agent bridged to Telegram. The result? A fully functional AI assistant texting us from a 2017 iMac at 24 tok/s.

The Subagent Revelation

When told to autonomously research 10 web sources simultaneously, the OpenClaw orchestration engine realized the task would lock up the single-threaded Telegram chat. Instead of freezing, it elegantly spawned an invisible background "Subagent" that crunched internet data for 42 minutes, leaving the main bot thread perfectly free for casual conversation.

Optimizing for "Tiered Routing"

The next evolution of this local infrastructure is mapping these subagents to the iMac's physical constraints. We are building a "Router + Heavy Lifter" architecture: a tiny 2B model lives on the GPU responding to texts instantly, while complex agent tasks are silently passed to a 26B MoE running entirely on the CPU out of the 40 GB of system RAM. Zero resource contention.
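One way the "Router + Heavy Lifter" split could look in code. This is a hypothetical sketch: the `Tier` type, the `route` function, and the word-count threshold are invented for illustration and are not OpenClaw's API or the routing rule actually being built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    device: str


ROUTER = Tier("E2B", "gpu")      # small model answering chat directly
HEAVY = Tier("26B MoE", "cpu")   # large model for long-running agent tasks


def route(message: str, needs_tools: bool, max_chat_words: int = 60) -> Tier:
    """Hypothetical rule: tool use or long requests go to the heavy tier."""
    if needs_tools or len(message.split()) > max_chat_words:
        return HEAVY
    return ROUTER


print(route("what's up?", needs_tools=False).name)
print(route("research 10 web sources on Vulkan", needs_tools=True).name)
```

Because the two tiers sit on different devices (GPU VRAM versus system RAM), a rule like this keeps chat latency independent of whatever the heavy tier is doing.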

[Image: Telegram bot explaining it is spawning a one-shot subagent]

💡 The bottom line: Benchmarks proved the iMac could run it. The Telegram bot proved it is running it. We are no longer testing models; we are deploying autonomous AI infrastructure on 8-year-old consumer hardware.

Takeaways

What we learned

Unsloth is the daily driver

UD-Q4_K_XL: 2.94 GB, 38.16 tok/s, 7/8 accuracy. Smallest file, fastest inference, tied for best quality. On the MacBook M4 Pro it reaches 90.15 tok/s, which feels instant on a laptop. Dynamic quantization preserves what matters and compresses what doesn't.

The E4B is the sweet spot

At 24.4 tok/s (iMac) or 53.3 tok/s (MacBook), the E4B delivers 7.5B-parameter quality at conversational speed. Perfect needle-in-haystack retrieval. 4/5 multi-turn memory. Agent-ready.

Speed affects quality, not just UX

The Unsloth UD-Q4_K_XL quant holds up at 30+ tok/s on the smaller models but collapses on the 31B at <2 tok/s, degenerating into garbage tokens. In our runs, quantization quality tracked inference speed, a finding you only get from testing on real hardware with real constraints.

Two machines tell different stories

The iMac (Vulkan, 4 GB VRAM) proved GPU acceleration works on legacy hardware. The MacBook (Metal, 24 GB unified) showed the speed ceiling for Apple Silicon. Both agree: models under 5 GB are transformative. Models over 15 GB are paperweights on consumer hardware.

The 31B redemption (and limits)

The original benchmark suite showed 94% accuracy (0 errors) via Ollama on the 31B. GPU-accelerated Unsloth quant dropped to 3/8 due to the garbage token issue. The 31B is a phenomenal model — but only with the right quantization at the right speed.

Autonomous pipelines work

15 JSON result files from an overnight pipeline that ran across two machines with zero human intervention. Bash orchestrators watching for completion signals, auto-restarting servers, saving data after every test. Local AI research doesn't need a cluster — it needs good plumbing.
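The real orchestrators are Bash scripts; the sketch below shows the same "save after every test" idea in Python for consistency with the other examples on this page. `run_suite`, `fake_runner`, and the retry count are hypothetical stand-ins, not the pipeline's actual code.

```python
import json
import tempfile
from pathlib import Path


def run_suite(tests, run_one, out_path, max_attempts=2):
    """Run tests in order, rewriting the results file after every test."""
    results = []
    for name in tests:
        record = {"test": name, "score": None, "error": None}
        for attempt in range(1, max_attempts + 1):
            try:
                record["score"] = run_one(name)
                record["error"] = None
                break
            except Exception as exc:  # a real runner might restart the server here
                record["error"] = f"attempt {attempt}: {exc}"
        results.append(record)
        # Checkpoint so a crash overnight loses at most the current test.
        Path(out_path).write_text(json.dumps(results, indent=2))
    return results


def fake_runner(name):
    """Stand-in for a real model call; one test always fails."""
    if name == "flaky":
        raise RuntimeError("server not responding")
    return 1.0


out_file = Path(tempfile.mkdtemp()) / "results.json"
results = run_suite(["cold_start", "flaky", "two_sum"], fake_runner, out_file)
saved = json.loads(out_file.read_text())
print(f"{len(saved)} results saved to {out_file}")
```

Writing the whole file after each test is wasteful for large suites but keeps the on-disk JSON valid at every point, which matters more than speed when no one is watching the run.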

Project Logs

Deep Dives & Technical Reports

🧠

Fine-Tuning Gemma 4 for the Edge

How we used Kaggle T4s and Unsloth to train an r=32 LoRA, creating our custom concierge with 88.6% accuracy against hallucinations.

Read the report →

⚡

Multi-Model Orchestration on Legacy Hardware

Running a dual-node OpenClaw pipeline on a 2017 iMac. Overcoming context exhaustion, tokenizer bugs, and VRAM limits.

Read the report →