Gemini: A Family of Highly Capable Multimodal Models

Paper by: Gemini Team, Google DeepMind
Published: December 2023
Venue: arXiv (Technical Report)
URL: https://arxiv.org/abs/2312.11805

What This Paper Did

Gemini is Google’s answer to GPT-4, but built differently from the ground up. Instead of starting with a text-only model and bolting on vision later, Google trained a single model jointly on text, images, audio, and video from the start. Think of the difference between a student who learns English vocabulary in isolation and studies diagrams separately later, and a student whose textbooks, lectures, diagrams, and videos all arrive together as one unified learning experience.

The paper presents three model sizes: Gemini Ultra (most capable), Gemini Pro (balanced performance and speed), and Gemini Nano (efficient, runs on a Pixel phone). Gemini Ultra became the first model to exceed human expert performance on MMLU — a benchmark of 57 diverse academic subjects — scoring 90.04% vs 89.8% (human expert baseline).

Key Numbers

Benchmark	Gemini Ultra	GPT-4	Task
MMLU	90.04%	86.4%	World knowledge (57 subjects)
HumanEval	74.4%	67.0%	Code generation
GSM8K	94.4%	92.0%	Grade-school math
Needle-in-haystack	98% retrieval accuracy across the full 32K context	Not reported in the paper (GPT-4 Turbo offers a 128K context window)	Long-context retrieval
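
The long-context row refers to a synthetic retrieval test: one key-value pair is hidden in a long context and the model is asked to return the value. A minimal sketch of such a harness follows. It assumes a hypothetical `ask_model(prompt)` callable standing in for any real model API; the key format, distractor layout, and scoring rule are illustrative choices, not the paper's exact protocol.

```python
import random

QUESTION_PREFIX = "What is the value stored under "

def build_haystack(n_pairs, needle_key, needle_value, seed=0):
    """Return a prompt with one needle key-value pair hidden among distractors."""
    rng = random.Random(seed)
    pairs = [(f"key-{i:05d}", f"{rng.randrange(10**6):06d}") for i in range(n_pairs)]
    # Drop the needle at a random depth in the context.
    pairs.insert(rng.randrange(n_pairs + 1), (needle_key, needle_value))
    lines = [f"{k}: {v}" for k, v in pairs]
    return "\n".join(lines) + "\n\n" + QUESTION_PREFIX + needle_key + "?"

def retrieval_accuracy(ask_model, trials, n_pairs):
    """Fraction of trials in which the model's answer contains the needle value."""
    hits = 0
    for t in range(trials):
        value = f"{(t * 7919) % 10**6:06d}"
        prompt = build_haystack(n_pairs, "needle-key", value, seed=t)
        if value in ask_model(prompt):
            hits += 1
    return hits / trials

def lookup_baseline(prompt):
    """Hypothetical stand-in for a model: parses the prompt and looks the key up."""
    body, question = prompt.rsplit("\n\n", 1)
    key = question[len(QUESTION_PREFIX):].rstrip("?")
    table = dict(line.split(": ", 1) for line in body.split("\n"))
    return table.get(key, "")

print(retrieval_accuracy(lookup_baseline, trials=5, n_pairs=200))  # 1.0
```

Swapping `lookup_baseline` for a function that calls a real model, and growing `n_pairs` until the prompt fills the context window, turns this into a rough long-context probe.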

Core Innovation: Native Multimodality

Instead of this (traditional approach):

Text input → Text encoder → Shared representation → Output
                                     ↑
Image input → Image encoder ────────┘

Gemini does this:

Text + Image + Audio + Video → Joint tokeniser → Unified Transformer → Output

All modalities are tokenised into the same representation space from the start. An image is split into a 14×14 grid of patches, each 16×16 pixels (196 tokens for a 224×224 image), and each patch is projected to the same embedding dimension as text tokens. These figures are a ViT-style illustration rather than values stated in the technical report. The model never treats the modalities as separate problems; it processes one sequence of tokens.
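
The patch arithmetic can be checked with a small sketch. It assumes ViT-style non-overlapping 16×16-pixel patches on a 224×224 grayscale image, which gives the 14×14 grid of 196 tokens. The projection weights, the tiny `d_model`, and the helper names `patchify` and `project` are made up for illustration and are not Gemini's actual tokeniser.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order, one flat list of pixels per patch.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

def project(vector, weights):
    """Linear projection of a flat patch with a (len(vector) x d_model) matrix."""
    d_model = len(weights[0])
    return [sum(v * weights[i][j] for i, v in enumerate(vector))
            for j in range(d_model)]

# Toy 224 x 224 image whose pixel value encodes its position.
image = [[r * 224 + c for c in range(224)] for r in range(224)]
patches = patchify(image, patch=16)

d_model = 8  # illustrative; real models use far wider embeddings
weights = [[((i + j) % 5) / 100.0 for j in range(d_model)] for i in range(16 * 16)]
image_tokens = [project(p, weights) for p in patches]

# Stand-in for 5 already-embedded text tokens of the same width.
text_tokens = [[0.0] * d_model for _ in range(5)]
sequence = text_tokens + image_tokens  # one sequence for the Transformer

print(len(patches), len(patches[0]), len(sequence))  # 196 256 201
```

Because every token, text or image, ends up as a vector of width `d_model`, the two lists can simply be concatenated into one sequence.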

Training Scale

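Model sizes: Nano ships in two versions, Nano-1 (1.8B params) and Nano-2 (3.25B params). Google did not disclose sizes for Ultra or Pro; figures such as ~1.3T (Ultra) and ~50B (Pro) are unofficial outside estimates.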
Hardware: TPU v4 and TPU v5e clusters, programmed with JAX and Pathways (Google’s system for orchestrating computation across many accelerators)
Training data: Large-scale multimodal corpus (details not fully disclosed in the technical report)
Context: 32K tokens initially (later extended to 1 million in Gemini 1.5)
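
To get a feel for what a 32K window means for multimodal prompts, here is a back-of-the-envelope sketch. It assumes "32K" means 32,768 tokens and reuses the illustrative 196 tokens per 224×224 image; both numbers are assumptions for the arithmetic, not figures the report gives for Gemini's image encoding.

```python
CONTEXT_TOKENS = 32 * 1024   # assumption: "32K" read as 32,768 tokens
TOKENS_PER_IMAGE = 196       # illustrative: 14 x 14 patches per image

def images_that_fit(text_tokens, context=CONTEXT_TOKENS, per_image=TOKENS_PER_IMAGE):
    """How many whole images fit after reserving room for the text prompt."""
    remaining = context - text_tokens
    if remaining < 0:
        raise ValueError("text prompt alone exceeds the context window")
    return remaining // per_image

print(images_that_fit(0))     # 167
print(images_that_fit(1000))  # 162
```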

The Indian Analogy

For native multimodality: Imagine a student in a small-town school who has two ways to learn:

Old way (traditional models): The teacher writes everything on the blackboard, and the student learns to read and understand text. Then — much later — the teacher shows a map, and the student has to “translate” the map back into words to understand it. (“This blue line is a river. A river is water flowing from hills to ocean.”)
New way (Gemini): The student learns from the very start with a textbook that has words and maps and diagrams all together. When learning about rivers, they read the word “river,” see the shape on the map, and understand both simultaneously. The river is not a “picture to decode” — it’s part of the same language as the word.

Gemini is the second student. It doesn’t learn language first and then retrofit vision. It learns them together.

For three model sizes: Like a government office with three roles:

Gemini Ultra (IAS Officer): Handles the most complex, nuanced decisions. Slow, expensive, needs full resources. Best for the hardest problems.
Gemini Pro (District Officer): Handles 80% of everyday work. Balanced speed and intelligence. Works well for most users.
Gemini Nano (Peon): Handles simple, routine tasks. Walks around to every office doing fast basic work. Can even run on your phone for offline use.

Read This Paper in This Order

Section	What You Will Learn	Difficulty	Time
01 — Context	Why Google needed Gemini; GPT-4’s success; multimodal race	🟢 Beginner	8 min
02 — The Problem	Weaknesses in text-first approaches; why bolted-on vision isn’t enough	🟡 Intermediate	6 min
03 — The Idea	Native multimodality; unified tokenisation; the three sizes	🟡 Intermediate	10 min
04 — The Math	Token encoding; patch embeddings; Transformer modifications	🟡 Intermediate	12 min
05 — Worked Example	Image tokenisation walkthrough; token concatenation	🟡 Intermediate	8 min
06 — The Code	Call Gemini API with text + image	🟢 Beginner	5 min
07 — Limitations	Delayed release; benchmark questions; context vs GPT-4 Turbo	🟡 Intermediate	7 min
08 — Impact	Gemini 1.5 (million-token context); Gemma; Google’s AI acceleration	🟢 Beginner	6 min
09 — Summary	One-paragraph recap; what came next	🟢 Beginner	2 min

Total reading time: ~64 minutes

Before You Read: Math Tutorials You Need

You should already know these concepts. If not, work through these tutorials first:

Softmax and Cross-Entropy — How attention weights and logits work
Transformers: Self-Attention — Core of Gemini’s architecture
Matrix Multiplication and Projections — How embeddings are transformed

Architecture Diagram

Input Modality (Text / Image / Audio / Video)
         |
         ↓
Unified Tokeniser
(SentencePiece + Patch Embeddings)
         |
         ↓
Token Embedding Layer
(All tokens → same d_model dimension)
         |
         ↓
Positional Encoding + Token Type Encoding
         |
         ↓
Transformer Stack
(Efficient Attention + Dense Layers)
         |
         ↓
Output Projections
(Language head / Image head / Task-specific head)
         |
         ↓
Predictions (Next tokens / Image regions / Classification)
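
The "same d_model dimension" and "Positional Encoding + Token Type Encoding" stages of the diagram can be sketched as follows. This is a toy under stated assumptions: the sinusoidal positions come from the original Transformer, the one-hot type vectors are an invented stand-in, and `D_MODEL = 8` is chosen only for readability. The report does not publish how Gemini encodes position or modality.

```python
import math

D_MODEL = 8  # illustrative width
MODALITIES = ["text", "image", "audio", "video"]

def sinusoidal_position(pos, d_model=D_MODEL):
    """Sinusoidal positional encoding: sin on even indices, cos on odd ones."""
    return [math.sin(pos / 10000 ** (i / d_model)) if i % 2 == 0
            else math.cos(pos / 10000 ** ((i - 1) / d_model))
            for i in range(d_model)]

def type_encoding(modality, d_model=D_MODEL):
    """Toy token-type vector: one-hot over modalities, zero-padded to d_model."""
    vec = [0.0] * d_model
    vec[MODALITIES.index(modality)] = 1.0
    return vec

def encode_sequence(tokens):
    """tokens: list of (modality, embedding) pairs, each embedding of width D_MODEL."""
    encoded = []
    for pos, (modality, emb) in enumerate(tokens):
        if len(emb) != D_MODEL:
            raise ValueError("every modality must be embedded to the same width")
        pe = sinusoidal_position(pos)
        te = type_encoding(modality)
        encoded.append([e + p + t for e, p, t in zip(emb, pe, te)])
    return encoded

# One text token followed by two image-patch tokens, all with zero content
# embeddings so the position and type contributions are easy to inspect.
tokens = [("text", [0.0] * D_MODEL),
          ("image", [0.0] * D_MODEL),
          ("image", [0.0] * D_MODEL)]
encoded = encode_sequence(tokens)
print(len(encoded), len(encoded[0]))  # 3 8
```

The point of the sketch is the shape contract: whatever the modality, each token enters the Transformer stack as one vector of width `D_MODEL`, with position and type information added on top.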
