Gemini: A Family of Highly Capable Multimodal Models
Paper by: Gemini Team, Google DeepMind
Published: December 2023
Venue: arXiv (Technical Report)
URL: https://arxiv.org/abs/2312.11805
What This Paper Did
Gemini is Google’s answer to GPT-4 — but built differently from the ground up. Instead of starting with a text-only model and bolting on vision later, Google trained a single model that understood text, images, audio, and video natively and simultaneously. Think of the difference between a student who learns English vocabulary in isolation and then later studies diagrams separately, versus a student whose textbooks, lectures, diagrams, and videos all arrive together as one unified learning experience.
The paper presents three model sizes: Gemini Ultra (most capable), Gemini Pro (balanced performance and speed), and Gemini Nano (efficient, runs on a Pixel phone). Gemini Ultra became the first model to exceed human expert performance on MMLU — a benchmark of 57 diverse academic subjects — scoring 90.04% vs 89.8% (human expert baseline).
Key Numbers
| Benchmark | Gemini Ultra | GPT-4 | Task |
|---|---|---|---|
| MMLU | 90.04% | 86.4% | World knowledge (57 subjects) |
| HumanEval | 74.4% | 67.0% | Code generation |
| GSM8K | 94.4% | 92.0% | Grade-school math |
| Needle-in-haystack | 98% retrieval accuracy (32K context) | Not reported in the paper | Long-context retrieval |
Core Innovation: Native Multimodality
Instead of this (traditional approach):
Text input → Text encoder → Shared representation → Output
                                    ↑
Image input → Image encoder ────────┘
Gemini does this:
Text + Image + Audio + Video → Joint tokeniser → Unified Transformer → Output
All modalities are tokenised into the same representation space from the start. The technical report does not disclose the exact image tokeniser, but a ViT-style scheme illustrates the idea: a 224×224 image is cut into a 14×14 grid of 16×16-pixel patches (196 tokens), and each patch is projected to the same embedding dimension as the text tokens. The model never “sees” them as separate problems: everything is just a sequence of tokens.
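The 196-token figure can be reproduced with a ViT-style patch embedding. This is a minimal sketch, not Gemini's actual tokeniser, which the report does not describe: the 16-pixel patch size, the `d_model` value of 512, the random projection matrix, and the `patchify` helper are all assumptions made for illustration.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    grid_h, grid_w = h // patch_size, w // patch_size
    # (grid_h, p, grid_w, p, C) -> (grid_h, grid_w, p, p, C)
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, each row holding p * p * C pixel values
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))   # stand-in for a real RGB image
d_model = 512                       # hypothetical embedding width

patches = patchify(image, patch_size=16)        # 14 x 14 grid of patches
projection = rng.normal(scale=0.02, size=(patches.shape[1], d_model))
image_tokens = patches @ projection             # one d_model vector per patch

print(patches.shape, image_tokens.shape)
```

Each row of `image_tokens` plays the same role as a text-token embedding, which is why the downstream Transformer can treat both as one sequence.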
Training Scale
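- Model sizes: parameter counts for Gemini Ultra and Pro are not disclosed in the technical report (the ~1.3T and ~50B figures that circulate are unofficial estimates); Nano comes in two sizes, Nano-1 (1.8B params) and Nano-2 (3.25B params)
- Hardware: TPU v4 and TPU v5e accelerators, coordinated with JAX and Pathways (Google’s system for orchestrating a single training run across many accelerators and datacenters)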
- Training data: Large-scale multimodal corpus (details not fully disclosed in the technical report)
- Context: 32K tokens initially (later extended to 1 million in Gemini 1.5)
The Indian Analogy
For native multimodality: Imagine a student in a small-town school who has two ways to learn:
- Old way (traditional models): The teacher writes everything on the blackboard, and the student learns to read and understand text. Then, much later, the teacher shows a map, and the student has to “translate” the map back into words to understand it. (“This blue line is a river. A river is water flowing from hills to ocean.”)
- New way (Gemini): The student learns from the very start with a textbook that has words and maps and diagrams all together. When learning about rivers, they read the word “river,” see the shape on the map, and understand both simultaneously. The river is not a “picture to decode”; it is part of the same language as the word.
Gemini is the second student. It doesn’t learn language first and then retrofit vision. It learns them together.
For three model sizes: Like a government office with three roles:
- Gemini Ultra (IAS Officer): Handles the most complex, nuanced decisions. Slow, expensive, needs full resources. Best for the hardest problems.
- Gemini Pro (District Officer): Handles 80% of everyday work. Balanced speed and intelligence. Works well for most users.
- Gemini Nano (Peon): Handles simple, routine tasks. Walks around to every office doing fast basic work. Can even run on your phone for offline use.
Read This Paper in This Order
| Section | What You Will Learn | Difficulty | Time |
|---|---|---|---|
| 01 — Context | Why Google needed Gemini; GPT-4’s success; multimodal race | 🟢 Beginner | 8 min |
| 02 — The Problem | Weaknesses in text-first approaches; why bolted-on vision isn’t enough | 🟡 Intermediate | 6 min |
| 03 — The Idea | Native multimodality; unified tokenisation; the three sizes | 🟡 Intermediate | 10 min |
| 04 — The Math | Token encoding; patch embeddings; Transformer modifications | 🟡 Intermediate | 12 min |
| 05 — Worked Example | Image tokenisation walkthrough; token concatenation | 🟡 Intermediate | 8 min |
| 06 — The Code | Call Gemini API with text + image | 🟢 Beginner | 5 min |
| 07 — Limitations | Delayed release; benchmark questions; context vs GPT-4 Turbo | 🟡 Intermediate | 7 min |
| 08 — Impact | Gemini 1.5 (million-token context); Gemma; Google’s AI acceleration | 🟢 Beginner | 6 min |
| 09 — Summary | One-paragraph recap; what came next | 🟢 Beginner | 2 min |
Total reading time: ~64 minutes
Before You Read: Math Tutorials You Need
You should already know these concepts. If not, read them first:
- Softmax and Cross-Entropy — How attention weights and logits work
- Transformers: Self-Attention — Core of Gemini’s architecture
- Matrix Multiplication and Projections — How embeddings are transformed
Architecture Diagram
Input Modality (Text / Image / Audio / Video)
|
↓
Unified Tokeniser
(SentencePiece + Patch Embeddings)
|
↓
Token Embedding Layer
(All tokens → same d_model dimension)
|
↓
Positional Encoding + Token Type Encoding
|
↓
Transformer Stack
(Efficient Attention + Dense Layers)
|
↓
Output Projections
(Language head / Image head / Task-specific head)
|
↓
Predictions (Next tokens / Image regions / Classification)
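The first three stages of the diagram (tokenise, embed to `d_model`, add positional and token-type encodings) can be sketched in a few lines. This is an illustrative toy only: the report does not specify these components, so the vocabulary size, `d_model`, the sinusoidal positional scheme, the two-entry type table, and the `build_sequence` helper are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64               # hypothetical embedding width
vocab_size = 1000          # hypothetical text vocabulary size
patch_dim = 16 * 16 * 3    # flattened 16x16 RGB patch (assumed patch size)

text_embedding = rng.normal(scale=0.02, size=(vocab_size, d_model))
patch_projection = rng.normal(scale=0.02, size=(patch_dim, d_model))
type_embedding = rng.normal(scale=0.02, size=(2, d_model))  # 0 = text, 1 = image

def sinusoidal_positions(length, dim):
    """Standard sinusoidal position encodings (assumed; dim must be even)."""
    pos = np.arange(length)[:, None]
    i = np.arange(dim // 2)[None, :]
    angles = pos / (10000 ** (2 * i / dim))
    out = np.zeros((length, dim))
    out[:, 0::2] = np.sin(angles)
    out[:, 1::2] = np.cos(angles)
    return out

def build_sequence(text_ids, patches):
    """Embed text ids and image patches, then join them into one token sequence."""
    text_tokens = text_embedding[text_ids]        # (T, d_model)
    image_tokens = patches @ patch_projection     # (P, d_model)
    tokens = np.concatenate([text_tokens, image_tokens], axis=0)
    types = np.array([0] * len(text_ids) + [1] * len(patches))
    return tokens + type_embedding[types] + sinusoidal_positions(len(tokens), d_model)

text_ids = np.array([5, 42, 7])               # made-up token ids
patches = rng.random((196, patch_dim))        # stand-in for patchified image
sequence = build_sequence(text_ids, patches)
print(sequence.shape)
```

The resulting array is what a Transformer stack would consume: text and image positions differ only in their type and position offsets, not in shape.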