Worked Example: Tokenising an Image and Text Together

Let’s trace through the full process of converting a simple image and caption into tokens that Gemini can process.

Scenario

We have:

Image: A 28×28 pixel photo (small, for easy calculation)
Caption: “A cat”
Patch size: 7×7 (to keep numbers manageable)
d_model: 8 (simplified; real models use thousands of dimensions, e.g. 2048)

Step 1: Divide Image into Patches

A 28×28 image with 7×7 patches:

Number of patches = (28 / 7) × (28 / 7) = 4 × 4 = 16 patches

Imagine the image divided into a 4×4 grid:

┌─────────┬─────────┬─────────┬─────────┐
│ Patch 0 │ Patch 1 │ Patch 2 │ Patch 3 │
├─────────┼─────────┼─────────┼─────────┤
│ Patch 4 │ Patch 5 │ Patch 6 │ Patch 7 │
├─────────┼─────────┼─────────┼─────────┤
│ Patch 8 │ Patch 9 │ Patch10 │ Patch11 │
├─────────┼─────────┼─────────┼─────────┤
│Patch 12 │Patch 13 │Patch 14 │Patch 15 │
└─────────┴─────────┴─────────┴─────────┘

Each patch is 7×7×3 = 147 values.
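
The patch split above can be sketched in a few lines of NumPy. This is a minimal illustration, not Gemini's actual preprocessing: the function name split_into_patches and the placeholder image are assumptions made for the example.

```python
import numpy as np

def split_into_patches(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened patches in row-major grid order."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    # Flatten each patch into one vector of patch * patch * C values.
    return grid.reshape(rows * cols, patch * patch * c)

# Placeholder image (hypothetical data): black, with a white top-left 7x7 block.
image = np.zeros((28, 28, 3), dtype=np.uint8)
image[:7, :7, :] = 255

patches = split_into_patches(image, patch=7)
print(patches.shape)  # (16, 147)
```

Patch 0 here is the white top-left block, so every one of its 147 values is 255, while the other patches are all zeros.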

Step 2: Embed Each Patch

Each patch is flattened into a 147-dimensional vector. We project it to d_model = 8 with a learned weight matrix and bias:

W_patch ∈ ℝ^(8 × 147), b_patch ∈ ℝ^8

For Patch 0 (top-left, contains mostly white sky):
  Raw patch values: [255, 255, 255, ... 147 times] (white pixels)
  
  e_patch[0] = W_patch @ [255, 255, ..., 255] + b_patch
             ≈ [0.5, -0.2, 0.8, 0.1, -0.3, 0.6, 0.2, 0.4]  (example embedding)
             ∈ ℝ^8

For Patch 1 (top-middle, contains edge of cat's head):
  Raw patch: [200, 150, 100, ... varied pixel values ...]
  
  e_patch[1] ≈ [0.2, 0.9, -0.1, 0.7, 0.3, -0.2, 0.5, 0.1]
             ∈ ℝ^8

We get 16 embeddings, each 8-dimensional:

e_patch = [[0.5, -0.2, 0.8, 0.1, -0.3, 0.6, 0.2, 0.4],   # Patch 0
           [0.2, 0.9, -0.1, 0.7, 0.3, -0.2, 0.5, 0.1],   # Patch 1
           ...
           [0.1, 0.3, 0.6, -0.4, 0.2, 0.7, -0.1, 0.5]]   # Patch 15

Shape: (16, 8)  [16 patches × 8-D embeddings]
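
As a rough sketch, the projection is a single matrix multiply. The weights and pixel data below are random stand-ins (a trained model would have learned W_patch and b_patch), and scaling pixels to [0, 1] is an assumption of this example rather than something the walkthrough specifies.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, patch_dim, num_patches = 8, 147, 16

# Hypothetical stand-ins for learned parameters.
W_patch = rng.normal(scale=0.02, size=(d_model, patch_dim))
b_patch = np.zeros(d_model)

# Random pixel data, scaled to [0, 1] (an assumption of this sketch).
patches = rng.integers(0, 256, size=(num_patches, patch_dim)).astype(np.float64) / 255.0

# Batched form of e_patch[i] = W_patch @ patches[i] + b_patch.
e_patch = patches @ W_patch.T + b_patch
print(e_patch.shape)  # (16, 8)
```

The batched form patches @ W_patch.T computes all 16 projections at once and gives the same rows as applying W_patch to each patch separately.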

Step 3: Tokenise Text

“A cat” is tokenised using SentencePiece:

Text: "A cat"
Tokens: ["A", "cat"]  (2 tokens)
Token IDs: [15, 234]  (hypothetical IDs from 256K vocabulary)

Word embeddings (from W_text ∈ ℝ^(256000 × 8)):
  e_text[0] = W_text[15] ≈ [0.3, 0.1, -0.2, 0.5, 0.6, -0.1, 0.2, 0.8]
  e_text[1] = W_text[234] ≈ [0.7, -0.3, 0.4, 0.2, -0.5, 0.6, 0.1, 0.3]

Shape: (2, 8) [2 tokens × 8-D embeddings]
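
The embedding lookup is plain row indexing into W_text. In this sketch the table is filled with random numbers and the token IDs are the hypothetical ones from above; no real tokeniser is called.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 256_000, 8

# Random stand-in for the learned embedding table.
W_text = rng.normal(scale=0.02, size=(vocab_size, d_model))

# Hypothetical token IDs for "A" and "cat".
token_ids = np.array([15, 234])

# Each ID selects one row of the table.
e_text = W_text[token_ids]
print(e_text.shape)  # (2, 8)
```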

Step 4: Concatenate All Tokens

Combine image patches and text tokens:

X = [
  [0.5, -0.2, 0.8, 0.1, -0.3, 0.6, 0.2, 0.4],     # Patch 0
  [0.2, 0.9, -0.1, 0.7, 0.3, -0.2, 0.5, 0.1],     # Patch 1
  ... (patches 2-15) ...
  [0.3, 0.1, -0.2, 0.5, 0.6, -0.1, 0.2, 0.8],     # Text token "A"
  [0.7, -0.3, 0.4, 0.2, -0.5, 0.6, 0.1, 0.3]      # Text token "cat"
]

Shape: (18, 8)  [18 total tokens × 8-D embeddings]

Key point: Tokens 0–15 came from an image. Tokens 16–17 came from text. The model treats them identically.
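
A minimal sketch of the concatenation, using random arrays in place of the real embeddings. The is_image mask is a hypothetical bookkeeping helper for the reader; the model itself receives only X.

```python
import numpy as np

rng = np.random.default_rng(0)
e_patch = rng.normal(size=(16, 8))  # stand-in for the 16 patch embeddings
e_text = rng.normal(size=(2, 8))    # stand-in for the 2 text embeddings

# Stack along the sequence axis: image tokens first, then text tokens.
X = np.concatenate([e_patch, e_text], axis=0)

# Record which rows came from the image (not passed to the model).
is_image = np.arange(X.shape[0]) < e_patch.shape[0]

print(X.shape)               # (18, 8)
print(int(is_image.sum()))   # 16
```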

Step 5: Add Positional Encodings

Compute the positional encoding for each position i using the sinusoidal formula from the original Transformer. Each sine/cosine pair k shares one frequency:

pos_enc[i, 2k]   = sin(i / 10000^(2k/d_model))
pos_enc[i, 2k+1] = cos(i / 10000^(2k/d_model))

For d_model = 8 there are four pairs (k = 0, 1, 2, 3), with divisors 10000^(0/8) = 1, 10000^(2/8) = 10, 10000^(4/8) = 100, and 10000^(6/8) = 1000. Compute for positions 0, 1, and 16:

Position 0 (Patch 0)

pos_enc[0, 0] = sin(0 / 10000^(0/8)) = sin(0) = 0
pos_enc[0, 1] = cos(0 / 10000^(0/8)) = cos(0) = 1
pos_enc[0, 2] = sin(0 / 10000^(2/8)) = sin(0) = 0
pos_enc[0, 3] = cos(0 / 10000^(2/8)) = cos(0) = 1
pos_enc[0, 4] = sin(0 / 10000^(4/8)) = sin(0) = 0
pos_enc[0, 5] = cos(0 / 10000^(4/8)) = cos(0) = 1
pos_enc[0, 6] = sin(0 / 10000^(6/8)) = sin(0) = 0
pos_enc[0, 7] = cos(0 / 10000^(6/8)) = cos(0) = 1

pos_enc[0] = [0, 1, 0, 1, 0, 1, 0, 1]

Position 1 (Patch 1)

pos_enc[1, 0] = sin(1 / 10000^(0/8)) = sin(1) ≈ 0.841
pos_enc[1, 1] = cos(1 / 10000^(0/8)) = cos(1) ≈ 0.540
pos_enc[1, 2] = sin(1 / 10000^(2/8)) = sin(1/10) ≈ 0.100
pos_enc[1, 3] = cos(1 / 10000^(2/8)) = cos(1/10) ≈ 0.995
pos_enc[1, 4] = sin(1 / 10000^(4/8)) = sin(1/100) ≈ 0.010
pos_enc[1, 5] = cos(1 / 10000^(4/8)) = cos(1/100) ≈ 1.000
pos_enc[1, 6] = sin(1 / 10000^(6/8)) = sin(1/1000) ≈ 0.001
pos_enc[1, 7] = cos(1 / 10000^(6/8)) = cos(1/1000) ≈ 1.000

pos_enc[1] ≈ [0.841, 0.540, 0.100, 0.995, 0.010, 1.000, 0.001, 1.000]

Position 16 (Text Token “A”, after all image patches)

pos_enc[16, 0] = sin(16 / 1) = sin(16) ≈ -0.288
pos_enc[16, 1] = cos(16 / 1) = cos(16) ≈ -0.958
pos_enc[16, 2] = sin(16 / 10) = sin(1.6) ≈ 1.000
pos_enc[16, 3] = cos(16 / 10) = cos(1.6) ≈ -0.029
pos_enc[16, 4] = sin(16 / 100) = sin(0.16) ≈ 0.159
pos_enc[16, 5] = cos(16 / 100) = cos(0.16) ≈ 0.987
pos_enc[16, 6] = sin(16 / 1000) = sin(0.016) ≈ 0.016
pos_enc[16, 7] = cos(16 / 1000) = cos(0.016) ≈ 1.000

pos_enc[16] ≈ [-0.288, -0.958, 1.000, -0.029, 0.159, 0.987, 0.016, 1.000]

Key insight: Positions far apart (0 vs 16) have very different positional encodings, so the Transformer can tell that the text token comes after all the image patches.

Step 6: Add Positional Encodings to Token Embeddings

X_pos = X + pos_enc

X_pos[0] = [0.5, -0.2, 0.8, 0.1, -0.3, 0.6, 0.2, 0.4]
         + [0, 1, 0, 1, 0, 1, 0, 1]
         = [0.5, 0.8, 0.8, 1.1, -0.3, 1.6, 0.2, 1.4]

X_pos[1] = [0.2, 0.9, -0.1, 0.7, 0.3, -0.2, 0.5, 0.1]
         + [0.841, 0.540, 0.100, 0.995, 0.010, 1.000, 0.001, 1.000]
         = [1.041, 1.440, 0.000, 1.695, 0.310, 0.800, 0.501, 1.100]

X_pos[16] = [0.3, 0.1, -0.2, 0.5, 0.6, -0.1, 0.2, 0.8]
          + [-0.288, -0.958, 1.000, -0.029, 0.159, 0.987, 0.016, 1.000]
          = [0.012, -0.858, 0.800, 0.471, 0.759, 0.887, 0.216, 1.800]

Final shape: (18, 8)
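
For readers who want to check the arithmetic, here is a small sketch of the standard sinusoidal encoding from the original Transformer, in which each sine/cosine pair shares one frequency. The function name sinusoidal_encoding is an assumption of this example, and whether a production model uses exactly this scheme is not something this walkthrough establishes.

```python
import numpy as np

def sinusoidal_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """Standard sinusoidal positional encoding; d_model must be even."""
    positions = np.arange(num_positions)[:, None]   # shape (N, 1)
    pair_index = np.arange(d_model // 2)[None, :]   # shape (1, d_model / 2)
    # Pair k uses the divisor 10000^(2k / d_model).
    angles = positions / (10000.0 ** (2 * pair_index / d_model))
    pos_enc = np.empty((num_positions, d_model))
    pos_enc[:, 0::2] = np.sin(angles)  # even dimensions
    pos_enc[:, 1::2] = np.cos(angles)  # odd dimensions
    return pos_enc

pos_enc = sinusoidal_encoding(num_positions=18, d_model=8)

# Patch 0's example embedding from Step 2, plus its positional encoding.
x0 = np.array([0.5, -0.2, 0.8, 0.1, -0.3, 0.6, 0.2, 0.4])
x_pos0 = x0 + pos_enc[0]
print(np.round(x_pos0, 3))  # approximately [0.5, 0.8, 0.8, 1.1, -0.3, 1.6, 0.2, 1.4]
```

Because position 0 always encodes to alternating zeros and ones, X_pos[0] simply adds 1 to every odd dimension of the embedding.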

Step 7: Feed Into Transformer

Now X_pos (18 tokens, 8 dimensions each) is fed into the Transformer stack:

X_pos (18 × 8)
  ↓
Multi-head attention (18 tokens attend to each other)
  ├─ Patches 0-15 attend to each other (spatial relationships in image)
  ├─ Patches 0-15 attend to tokens 16-17 (image understanding text context)
  └─ Tokens 16-17 attend to patches 0-15 (language grounded in image)
  ↓
Feed-forward network (applied per token)
  ↓
(Repeat N times)
  ↓
Output: (18, 8) embeddings ready for prediction
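
To make the attention step concrete, here is a heavily simplified sketch: a single attention head with random weights, no masking, and no residual connections or layer normalisation. All names (self_attention, W_q, W_k, W_v) are assumptions of this example; a real Transformer layer has multiple heads and learned parameters.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over all tokens."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (tokens, tokens)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X_pos = rng.normal(size=(18, 8))  # stand-in for the position-aware embeddings
W_q, W_k, W_v = (rng.normal(scale=0.5, size=(8, 8)) for _ in range(3))

out, weights = self_attention(X_pos, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (18, 8) (18, 18)

# Blocks of the 18x18 weight matrix correspond to the three arrows above.
image_to_image = weights[:16, :16]
image_to_text = weights[:16, 16:]
text_to_image = weights[16:, :16]
```

The same weight matrix covers image-to-image, image-to-text, and text-to-image attention, which is why no separate cross-modal module appears in the diagram.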

What Did the Model Learn?

After training, we would expect the model’s attention weights to show patterns like these:

When processing Patch 5 (likely contains part of the cat):

High attention to patches 1, 2, 4, 6, 9, 10 (nearby patches, capturing spatial structure)
High attention to token 17 “cat” (grounding the visual feature in language)
Low attention to background patches

When processing token 17 “cat”:

High attention to patches 1, 5, 9 (where the cat appears in the image)
Low attention to patches with just background (sky, grass)
Moderate attention to token 16 “A” (understanding grammar)

This cross-modal reasoning emerges automatically from the unified architecture.

Summary

Step                 Input                 Output                      Output shape
Patch division       28×28×3 image         16 patches of 7×7×3         (16, 147)
Patch embedding      16 patches (147-D)    16 embeddings               (16, 8)
Text tokenisation    “A cat”               2 tokens                    (2,)
Text embedding       2 tokens              2 embeddings                (2, 8)
Concatenation        Patches + text        Combined sequence           (18, 8)
Positional encoding  (18, 8) + positions   Position-aware embeddings   (18, 8)
Transformer          (18, 8)               Processed representations   (18, 8)

The beauty of native multimodality: once the inputs are embedded, every step is identical regardless of modality. No special cases, no bolted-on components. Just tokens and attention.

Next: The Code: Using Gemini’s API