Summary: Gemini in One Sentence
Gemini is Google’s unified multimodal model family: it maps text, images, audio, and video into one token sequence, feeds that sequence through a single Transformer with efficient attention, and reports 90.04% on MMLU (Gemini Ultra), which supports the case that native multimodality can beat bolted-on vision adapters.
Core Idea Recap
| Aspect | Details |
|---|---|
| Problem | Text-first models bolt on vision inefficiently; true multimodal reasoning requires joint training |
| Solution | Train one model from scratch on text + images + audio + video simultaneously |
| Architecture | Unified tokenisation → shared embedding space → Transformer with efficient attention |
| Key Achievement | Gemini Ultra: 90.04% on MMLU (first model reported to exceed the 89.8% human-expert baseline) |
| Sizes | Nano (on-device), Pro (balanced), Ultra (most capable) |
| Impact | Restored Google’s credibility in AI, sparked competition, led to Gemini 1.5 (1M context), influenced industry toward multimodal-first design |
Indian Analogy Recap
Gemini is like a student who learned language, visual reasoning, and audio understanding together from the start — not a student who memorized English vocabulary first, then tried to understand maps later. This simultaneous learning creates fluency that bolted-on approaches can’t match.
The Math, Simply
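- Images → split into 14×14-pixel patches (16 × 16 = 256 patches for a 224×224 image; a ViT-style illustration, since Gemini’s actual patching scheme is not published)
- All modalities → embedded to the same dimension (for example d_model = 2048; an illustrative value, as Google has not published Gemini’s model dimensions)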
- All tokens → positional encoding added
- Attention → each token attends to all others (image patches attend to text, text to images)
- Output → next token predicted (text, image patch, or other modality)
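The first steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Gemini's actual tokeniser: the patch and image sizes are taken from the bullets, and the function names `num_patches` and `build_sequence` are hypothetical helpers introduced only for this example.

```python
# Illustrative only: sizes come from the bullets above; Gemini's real
# tokenisation details are not public.

def num_patches(image_side: int, patch_side: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_side % patch_side != 0:
        raise ValueError("image side must be a multiple of the patch side")
    per_side = image_side // patch_side
    return per_side * per_side


def build_sequence(text_tokens, image_side, patch_side):
    """Return (modality, position) pairs: text tokens first, then image patches.

    Every element gets a position index, mirroring the "positional encoding
    added to all tokens" step, regardless of modality.
    """
    n_img = num_patches(image_side, patch_side)
    modalities = ["text"] * len(text_tokens) + ["image"] * n_img
    return [(modality, pos) for pos, modality in enumerate(modalities)]


patches = num_patches(224, 14)
sequence = build_sequence(["describe", "this", "picture"], 224, 14)

print(patches)                   # 256
print(len(sequence))             # 259
print(sequence[2], sequence[3])  # ('text', 2) ('image', 3)
```

The point of the sketch is that, once patchified, the image contributes 256 ordinary sequence positions that sit next to the 3 text positions in a single list, which is what lets one Transformer attend across both.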
What Changed in AI
Before Gemini (2023):
- OpenAI had momentum with GPT-4
- Multimodal was “bolt-on vision to language models”
- Context lengths were limited (4K–128K tokens)
After Gemini (2024+):
- Google proved it could compete
- Multimodal-from-scratch became a mainstream design goal for frontier models
- Context lengths exploded (Gemini 1.5: 1M tokens)
- Open-weight models from the same research line (Gemma) became accessible
Three Key Numbers
- 90.04%: MMLU score, exceeding human experts (89.8%)
- 32K → 1M tokens: Gemini 1.0 to 1.5 context leap
- $0.0005 per 1K tokens: quoted price point, about 60x cheaper than GPT-4’s $0.03 per 1K tokens
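The ratios behind the last two numbers are easy to check. The sketch below takes the per-1K-token prices quoted in this summary at face value (assumed to be US dollars, with $0.03 as the GPT-4 figure from the discussion questions) and treats 32K as 32,000 tokens; exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

# Prices per 1K tokens as quoted in this summary (assumed to be USD).
gemini_price = Fraction("0.0005")
gpt4_price = Fraction("0.03")

# Context windows in tokens (32K taken as 32,000 for simplicity).
context_v1_0 = 32_000
context_v1_5 = 1_000_000

price_ratio = gpt4_price / gemini_price
context_ratio = Fraction(context_v1_5, context_v1_0)

print(price_ratio)           # 60
print(float(context_ratio))  # 31.25
```

On these figures the price gap is 60x and the context window grew by roughly 31x between Gemini 1.0 and 1.5.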
If You Remember Nothing Else
- Gemini processes text, images, audio, video as one unified language (not separate streams)
- This works because all modalities are tokenised identically and fed to the same Transformer
- The result: multimodal reasoning that’s more efficient and capable than “vision encoder + language model” approaches
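To make the "same Transformer" point concrete, here is a toy sketch of one attention step over a mixed text-and-image sequence. Everything in it is an assumption for illustration: the 2-dimensional embeddings are hand-picked, the token labels are hypothetical, and the sketch omits the learned projections, multiple heads, and masking that a real model would use.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights for one query over all keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


# Hand-picked toy embeddings; text and image entries share one vector space.
tokens = [
    ("text:cat", [1.0, 0.0]),
    ("text:sat", [0.0, 1.0]),
    ("image:patch_with_cat", [0.9, 0.1]),
    ("image:patch_with_sky", [-1.0, 0.0]),
]
labels = [label for label, _ in tokens]
vectors = [vec for _, vec in tokens]

# The text token "cat" attends over every position, text and image alike.
weights = attention_weights(vectors[0], vectors)
for label, weight in zip(labels, weights):
    print(f"{label:24s} {weight:.3f}")
```

Because the image patch that resembles "cat" sits close to the text token in the shared space, it receives more attention weight than the unrelated text token or the sky patch. No separate vision pathway is involved; the modalities differ only in how their vectors were produced.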
What Came Next
- Gemini 1.5 (February 2024): 1M token context; better understanding of long documents and code
- Gemma (February 2024): Open-weight models built from Gemini research; 2B and 7B parameters at launch
- Claude 3 and GPT-4V iterations: Industry-wide push for better multimodal models
- Mamba (December 2023): Linear-time alternative to Transformers (next paper)
How to Deepen Your Understanding
- Read Gemini’s competitors: Paper 18 (Mistral) for efficient attention ideas, Paper 19 (Ring Attention) for long-context techniques
- Understand Vision: If multimodality interests you, read about Vision Transformers (ViT)
- Follow-up: Read the Gemini 1.5 and Gemma technical reports (both publicly available)
Discussion Questions
- Why did Google choose “native multimodality” over the faster approach of bolting vision onto an existing language model?
  - Because native multimodality leads to better understanding of cross-modal relationships (the word “cat” and a cat image align naturally in the embedding space)
- If Gemini 1.0 has a 32K-token context and GPT-4 Turbo has 128K, why is Gemini considered better?
  - Gemini 1.5 (released later) has a 1M-token context. But even early Gemini competed on quality (MMLU score) rather than context length. These are different trade-offs for different tasks.
- Why is the price ($0.0005 per 1K tokens) so much lower than GPT-4’s ($0.03 per 1K tokens)?
  - Google runs its own large-scale compute infrastructure (TPUs), and the pricing is likely also strategic, aimed at gaining market share. Prices also tend to drop as models become more efficient.
- What’s the difference between “native multimodality” and “bolted-on vision”?
  - Native: All modalities are trained together from day one; the model learns cross-modal alignments naturally (e.g., the “cat” embedding aligns with the cat-image embedding)
  - Bolted-on: A text model is trained first, then a vision encoder is added and fine-tuned; the two parts are aligned only after the fact, so cross-modal understanding tends to be shallower
- Why does Gemini’s 1M-token context (in 1.5) matter?
  - You can now summarise a full novel, analyse a large codebase, or process 12 hours of meeting transcripts in one prompt. Earlier mainstream models, with context windows in the tens or low hundreds of thousands of tokens, could handle only a fraction of that at once.
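A rough back-of-the-envelope check shows why the 12-hour transcript example needs a long context. The speaking rate (150 words per minute) and the tokens-per-word factor (1.3) are assumptions chosen for illustration, not measured values, and `transcript_tokens` is a hypothetical helper.

```python
def transcript_tokens(hours, words_per_minute=150, tokens_per_word=1.3):
    """Estimate the token count of a spoken transcript (rough heuristic)."""
    words = hours * 60 * words_per_minute
    return round(words * tokens_per_word)


estimate = transcript_tokens(12)

# Compare against the context sizes mentioned in this summary.
for name, limit in [("32K", 32_000), ("128K", 128_000), ("1M", 1_000_000)]:
    verdict = "fits" if estimate <= limit else "does not fit"
    print(f"{estimate} tokens vs {name} context: {verdict}")
```

Under these assumptions the transcript comes to about 140,400 tokens, which overflows both a 32K and a 128K window but uses only a small part of a 1M-token context.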
Final thought: Gemini’s story is the story of frontier AI in 2023–2024: rapid iteration, competitive pressure driving innovation, and the shift from “one company dominates” (OpenAI) to “multiple capable models exist” (Gemini, Claude, LLaMA). This competition is good for everyone — prices fall, quality rises, and capabilities expand. The best time to learn AI is now.