Fine-Tuning Gemma 4 for the Edge

We wanted to turn the base Gemma 4 E4B model into a hyper-specific expert on Google I/O 2026. Retrieval-Augmented Generation (RAG) is the industry standard for this, but we wanted to see whether we could bake the knowledge directly into the model's weights, so that it could run entirely on-device without a separate vector database alongside it.

The Training Setup

We ran the fine-tuning pipelines on Kaggle using T4 and P100 GPUs. The base model of choice was a 4B parameter instruction-tuned variant. We used Unsloth to accelerate the training and handle the quantization directly to a compressed format.

The Capacity Problem: Moving from r=8 to r=32

Our initial runs used a standard LoRA rank of r=8. The model nailed the persona and could get high-level facts right (like keynote times), but it started hallucinating specific details. It simply didn't have enough parameter capacity to store the hundreds of facts about 56 sessions and 84 speakers.

// LoRA Configuration Shift

Old: r=8 (~18M trainable parameters)
New: r=32, alpha=64 (~73M trainable parameters)

We bumped the LoRA rank to r=32 (maintaining lora_alpha=64). This gave us ~73 million trainable parameters, about 4x the capacity of our r=8 runs and ~0.9% of the total model, which allowed the model to internalize the dense event data.
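To see why going from r=8 to r=32 roughly quadruples the trainable parameter count, here is a minimal sketch of the standard LoRA arithmetic: each adapted linear layer gains two low-rank matrices, so its added parameters grow linearly with the rank. The layer shapes below are hypothetical placeholders chosen for illustration, not Gemma's actual dimensions, and the helper names are ours.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer.

    LoRA learns A (r x d_in) and B (d_out x r), so the count is r * (d_in + d_out).
    """
    return r * (d_in + d_out)


# Hypothetical (d_in, d_out) shapes for adapted layers; NOT the real Gemma shapes.
HYPOTHETICAL_LAYERS = [(2048, 2048), (2048, 8192), (8192, 2048)]


def total_lora_params(layers, r: int) -> int:
    """Sum the LoRA parameter count over every adapted layer."""
    return sum(lora_params(d_in, d_out, r) for d_in, d_out in layers)


def lora_scaling(alpha: float, r: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update (alpha / r)."""
    return alpha / r


p8 = total_lora_params(HYPOTHETICAL_LAYERS, r=8)
p32 = total_lora_params(HYPOTHETICAL_LAYERS, r=32)
print(f"r=8:  {p8:,} params")
print(f"r=32: {p32:,} params ({p32 / p8:.1f}x)")
print(f"scaling at r=32, alpha=64: {lora_scaling(64, 32)}")
```

Because the count is linear in r, the 4x ratio holds whatever the real layer shapes are, which is consistent with the ~18M to ~73M jump reported above.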

The training took around 2-3 hours on a Kaggle T4, outputting directly to Unsloth's dynamic format, ready to be dropped into llama.cpp.

Hardening and Evaluation

You can't deploy an event concierge without knowing exactly how it fails. We built a custom evaluation gauntlet (eval_v7) to test the model against 70 brutal edge-case scenarios.

Final hardened score: 88.6% (final fine-tuned model · 62 passed, 8 failed)

We tested for specific failure modes:

Temporal Conflicts: Q: "Can I see 'What's new in Android' and 'What's new in Chrome'?" ✓ It correctly identified that they were scheduled at exactly the same time: 3:30 PM on Day 1.
Negative Person Checks (Anti-Hallucination): Q: "Is Sam Altman at IO?" or "Is Jensen Huang presenting?" ✓ It correctly responded that it was a Google event and pointed to speakers like Jeff Dean and Demis Hassabis instead.
Year Guarding: Q: "What happened at last year's IO?" ✓ It successfully restricted its persona to being the 2026 guide only, refusing to answer about 2024 or 2025.
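The failure-mode checks above could be wired into a small harness along the following lines. This is only a sketch under stated assumptions: the `Scenario` class, the substring-based checks, and `stub_model` are all hypothetical stand-ins, and the actual eval_v7 implementation is not shown in this post.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Scenario:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the answer is acceptable


def run_eval(model: Callable[[str], str],
             scenarios: List[Scenario]) -> Tuple[Dict[str, bool], int, int]:
    """Run every scenario and return (per-scenario results, passed, failed)."""
    results = {s.name: s.check(model(s.prompt)) for s in scenarios}
    passed = sum(results.values())
    return results, passed, len(scenarios) - passed


def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / total, 1)


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for the fine-tuned model, with canned answers."""
    if "Android" in prompt:
        return "Those two sessions overlap: both start at 3:30 PM on Day 1."
    if "Sam Altman" in prompt:
        return "This is a Google event; speakers include Jeff Dean and Demis Hassabis."
    return "I can only help with I/O 2026."


SCENARIOS = [
    Scenario("temporal_conflict",
             "Can I see 'What's new in Android' and 'What's new in Chrome'?",
             lambda a: "3:30 PM" in a),
    Scenario("negative_person",
             "Is Sam Altman at IO?",
             lambda a: "Google event" in a and "Sam Altman is" not in a),
    Scenario("year_guard",
             "What happened at last year's IO?",
             lambda a: "2026" in a and "2025" not in a),
]

results, passed, failed = run_eval(stub_model, SCENARIOS)
print(results, passed, failed)
print(pass_rate(62, 70))  # the reported 62 of 70 scenarios
```

Substring checks are the simplest possible grader and will miss paraphrases; a real harness would likely need something more robust per scenario.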

By baking the facts directly into a 4B parameter model and aggressively evaluating its hallucination boundaries, we built an agent that fits comfortably into consumer GPU VRAM while maintaining domain-expert accuracy.
