Paper 20
Intermediate

Gemini: A Family of Highly Capable Multimodal Models

Paper by: Gemini Team, Google DeepMind
Published: December 2023
Venue: arXiv (Technical Report)
URL: https://arxiv.org/abs/2312.11805


What This Paper Did

Gemini is Google’s answer to GPT-4 — but built differently from the ground up. Instead of starting with a text-only model and bolting on vision later, Google trained a single model that understood text, images, audio, and video natively and simultaneously. Think of the difference between a student who learns English vocabulary in isolation and then later studies diagrams separately, versus a student whose textbooks, lectures, diagrams, and videos all arrive together as one unified learning experience.

The paper presents three model sizes: Gemini Ultra (most capable), Gemini Pro (balanced performance and speed), and Gemini Nano (efficient, runs on a Pixel phone). Gemini Ultra became the first model to exceed human expert performance on MMLU — a benchmark of 57 diverse academic subjects — scoring 90.04% vs 89.8% (human expert baseline).

Key Numbers

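| Benchmark | Gemini Ultra | GPT-4 | Task |
|---|---|---|---|
| MMLU | 90.04% | 86.4% | World knowledge (57 subjects) |
| HumanEval | 74.4% | 67.0% | Code generation |
| GSM8K | 94.4% | 92.0% | Grade-school math |
| Needle-in-haystack | 98% retrieval accuracy across the 32K context | No comparison reported (GPT-4 Turbo offers a 128K window) | Long-context retrieval |

GPT-4 figures are the ones quoted in the Gemini technical report.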

Core Innovation: Native Multimodality

Instead of this (traditional approach):

Text input  → Text encoder  ─┐
                             ├→ Shared representation → Output
Image input → Image encoder ─┘

Gemini does this:

Text + Image + Audio + Video → Joint tokeniser → Unified Transformer → Output

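All modalities are tokenised into the same representation space from the start. In a ViT-style scheme, for example, a 224×224 image is cut into a 14×14 grid of 16×16-pixel patches (196 tokens), each projected to the same embedding dimension as text tokens. The technical report does not disclose Gemini’s exact image tokeniser, so treat these numbers as illustrative. The model never “sees” the modalities as separate problems; it’s all just a sequence of tokens.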
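The patch arithmetic can be checked with a small NumPy sketch. It assumes a ViT-style tokeniser with 16×16-pixel patches and a plain linear projection; `d_model = 64`, the vocabulary size, and the text token ids are hypothetical values chosen for illustration, not figures from the Gemini report.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch            # patch grid, e.g. 14 x 14
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)             # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
d_model = 64                                   # hypothetical embedding width

# Image branch: 224x224 RGB image -> 196 patches -> 196 embedding vectors.
image = rng.random((224, 224, 3))
patches = patchify(image)                      # shape (196, 768)
w_patch = rng.normal(0, 0.02, (patches.shape[1], d_model))
image_tokens = patches @ w_patch               # shape (196, d_model)

# Text branch: token ids index into an embedding table of the same width.
embed_table = rng.normal(0, 0.02, (1000, d_model))  # hypothetical vocab of 1000
text_ids = np.array([5, 17, 42])               # hypothetical token ids
text_tokens = embed_table[text_ids]            # shape (3, d_model)

# Both modalities now live in one sequence the Transformer can attend over.
sequence = np.concatenate([text_tokens, image_tokens], axis=0)
print(sequence.shape)                          # (199, 64)
```

Because both branches end in vectors of width `d_model`, the Transformer needs no modality-specific path after this point.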

Training Scale

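  • Model sizes: Ultra and Pro parameter counts are not disclosed in the report (figures such as ~1.3T for Ultra and ~50B for Pro are unofficial estimates); Nano ships in two versions, Nano-1 (1.8B params) and Nano-2 (3.25B params)
  • Hardware: TPU v4 and TPU v5e clusters, programmed with JAX and Pathways (Google’s system for orchestrating a single training job across many accelerator pods)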
  • Training data: Large-scale multimodal corpus (details not fully disclosed in the technical report)
  • Context: 32K tokens initially (later extended to 1 million in Gemini 1.5)
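To get a feel for what a 32K window holds, here is a rough token-budget calculation. It assumes "32K" means 32 × 1024 tokens and reuses the illustrative 196 tokens per 224×224 image from the ViT-style example; neither number is a disclosed Gemini detail, and `image_token_count` is a hypothetical helper.

```python
def image_token_count(height: int, width: int, patch: int = 16) -> int:
    """Tokens produced by a ViT-style patch grid (illustrative assumption)."""
    return (height // patch) * (width // patch)

CONTEXT = 32 * 1024                       # 32,768 tokens, assuming 32K = 32 * 1024

per_image = image_token_count(224, 224)   # 14 * 14 = 196
max_images = CONTEXT // per_image         # whole images that fit in the window
left_over = CONTEXT - max_images * per_image  # tokens left for text

print(per_image, max_images, left_over)   # 196 167 36
```

Under these assumptions, images alone would fill the window after about 167 frames, which is why long video pushes towards larger contexts.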

The Indian Analogy

For native multimodality: Imagine a student in a small-town school who has two ways to learn:

  1. Old way (traditional models): The teacher writes everything on the blackboard, and the student learns to read and understand text. Then — much later — the teacher shows a map, and the student has to “translate” the map back into words to understand it. (“This blue line is a river. A river is water flowing from hills to ocean.”)

  2. New way (Gemini): The student learns from the very start with a textbook that has words and maps and diagrams all together. When learning about rivers, they read the word “river,” see the shape on the map, and understand both simultaneously. The river is not a “picture to decode” — it’s part of the same language as the word.

Gemini is the second student. It doesn’t learn language first and then retrofit vision. It learns them together.

For three model sizes: Like a government office with three roles:

  • Gemini Ultra (IAS Officer): Handles the most complex, nuanced decisions. Slow, expensive, needs full resources. Best for the hardest problems.
  • Gemini Pro (District Officer): Handles 80% of everyday work. Balanced speed and intelligence. Works well for most users.
  • Gemini Nano (Peon): Handles simple, routine tasks. Walks around to every office doing fast basic work. Can even run on your phone for offline use.

Read This Paper in This Order

| Section | What You Will Learn | Difficulty | Time |
|---|---|---|---|
| 01 — Context | Why Google needed Gemini; GPT-4’s success; multimodal race | 🟢 Beginner | 8 min |
| 02 — The Problem | Weaknesses in text-first approaches; why bolted-on vision isn’t enough | 🟡 Intermediate | 6 min |
| 03 — The Idea | Native multimodality; unified tokenisation; the three sizes | 🟡 Intermediate | 10 min |
| 04 — The Math | Token encoding; patch embeddings; Transformer modifications | 🟡 Intermediate | 12 min |
| 05 — Worked Example | Image tokenisation walkthrough; token concatenation | 🟡 Intermediate | 8 min |
| 06 — The Code | Call Gemini API with text + image | 🟢 Beginner | 5 min |
| 07 — Limitations | Delayed release; benchmark questions; context vs GPT-4 Turbo | 🟡 Intermediate | 7 min |
| 08 — Impact | Gemini 1.5 (million-token context); Gemma; Google’s AI acceleration | 🟢 Beginner | 6 min |
| 09 — Summary | One-paragraph recap; what came next | 🟢 Beginner | 2 min |

Total reading time: ~64 minutes


Before You Read: Math Tutorials You Need

You should already know these concepts. If not, read them first:

  1. Softmax and Cross-Entropy — How attention weights and logits work
  2. Transformers: Self-Attention — Core of Gemini’s architecture
  3. Matrix Multiplication and Projections — How embeddings are transformed

Architecture Diagram

Input Modality (Text / Image / Audio / Video)
         ↓
Unified Tokeniser
(SentencePiece + Patch Embeddings)
         ↓
Token Embedding Layer
(All tokens → same d_model dimension)
         ↓
Positional Encoding + Token Type Encoding
         ↓
Transformer Stack
(Efficient Attention + Dense Layers)
         ↓
Output Projections
(Language head / Image head / Task-specific head)
         ↓
Predictions (Next tokens / Image regions / Classification)
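The stages in the diagram can be traced end to end with a toy NumPy forward pass. This is a minimal sketch under stated assumptions, not Gemini's implementation: one causal self-attention head stands in for the whole Transformer stack, the encodings are random rather than learned, and every size (`d_model`, `vocab`, token counts) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_text, n_img = 32, 100, 4, 6   # toy sizes, all hypothetical

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# 1. Token embedding layer: text ids use a lookup table, patches a projection.
text_emb = rng.normal(0, 0.02, (vocab, d_model))[rng.integers(0, vocab, n_text)]
img_emb = rng.random((n_img, 48)) @ rng.normal(0, 0.02, (48, d_model))
x = np.concatenate([text_emb, img_emb])          # (10, d_model)

# 2. Positional encoding + token type encoding (0 = text, 1 = image).
seq_len = x.shape[0]
pos = rng.normal(0, 0.02, (seq_len, d_model))
type_table = rng.normal(0, 0.02, (2, d_model))
type_ids = np.array([0] * n_text + [1] * n_img)
x = x + pos + type_table[type_ids]

# 3. Transformer stack, reduced to one causal attention head plus residual.
wq, wk, wv = (rng.normal(0, 0.02, (d_model, d_model)) for _ in range(3))
q, k, v = x @ wq, x @ wk, x @ wv
scores = q @ k.T / np.sqrt(d_model)
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                           # block attention to later tokens
attn = softmax(scores)
x = x + attn @ v

# 4. Language head: last position predicts a distribution over the next token.
w_out = rng.normal(0, 0.02, (d_model, vocab))
next_token_probs = softmax(x[-1] @ w_out)
print(next_token_probs.shape)                    # (100,)
```

Note that text and image positions share the same attention weights; the only modality-specific pieces are the embedding step and the type encoding.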
