The Problem: Why Text-First Models Struggle with Vision

The Limitation of Bolted-On Vision

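OpenAI has not published GPT-4V’s architecture, but a bolted-on vision-language model (which GPT-4V, the vision-capable version of GPT-4, is widely assumed to be) works roughly like this: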

Take a Transformer trained primarily on text
Attach a vision encoder (likely a ViT — Vision Transformer) on the side
Connect the vision encoder’s outputs to the main language model via a projection layer
Fine-tune slightly
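
The four steps above can be sketched in a few lines. This is a minimal illustration, not GPT-4V's actual implementation: the dimensions, the random "embeddings", and the helper names (matvec, project_patches, build_lm_input) are all assumptions made up for the example. It only shows the shape of the idea: vision-encoder outputs pass through a projection so they match the language model's embedding width, then get placed in front of the text embeddings.

```python
import random

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def project_patches(patch_embeddings, projection):
    """Map each vision-encoder output into the language model's embedding width."""
    return [matvec(projection, patch) for patch in patch_embeddings]

def build_lm_input(patch_embeddings, text_embeddings, projection):
    """Place the projected image embeddings in front of the text embeddings."""
    return project_patches(patch_embeddings, projection) + text_embeddings

# Hypothetical sizes, chosen only to keep the example small.
VISION_DIM, TEXT_DIM = 4, 6
rng = random.Random(0)

# Projection layer: TEXT_DIM rows, each of length VISION_DIM.
projection = [[rng.uniform(-1, 1) for _ in range(VISION_DIM)] for _ in range(TEXT_DIM)]
# Stand-ins for the vision encoder's output (3 patches) and 5 text token embeddings.
patches = [[rng.uniform(-1, 1) for _ in range(VISION_DIM)] for _ in range(3)]
text = [[rng.uniform(-1, 1) for _ in range(TEXT_DIM)] for _ in range(5)]

lm_input = build_lm_input(patches, text, projection)
print(len(lm_input), len(lm_input[0]))  # 8 6
```

In this sketch the projection matrix is the only place where the two pretrained parts meet, which is the "seam" the rest of this section is about.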

This is cheap to build (you don’t retrain the entire model), but it is a poor way to learn. The vision encoder and language model were born in different worlds:

The language model learned to predict “the next word” in English
The vision encoder learned to classify ImageNet categories or detect objects in COCO

When you mash them together, neither part deeply understands the other. The language model sees the vision embeddings as just another modality to predict from, not as truly integrated information.

Real Problems This Creates

1. Inefficient Token Allocation

A naive vision-language model might represent an image as 4,000+ tokens to preserve detail. Every one of those tokens takes up context and attention compute, whether it covers the subject of the image or empty background. A text-first model has learned from enormous amounts of text which of 4,000 words matter; it has had far less training to learn which of 4,000 image patches are background noise.
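
To see where a figure like 4,000+ tokens comes from, here is a small back-of-the-envelope calculation. It assumes the image is cut into non-overlapping square patches with one token per patch; the 16-pixel patch size and the resolutions are illustrative assumptions, not the settings of any particular model.

```python
def patch_token_count(width, height, patch_size):
    """Number of non-overlapping square patches needed to tile an image.

    Uses ceiling division on each axis so partial patches at the edges count.
    """
    cols = -(-width // patch_size)
    rows = -(-height // patch_size)
    return cols * rows

print(patch_token_count(1024, 1024, 16))  # 4096
print(patch_token_count(224, 224, 16))    # 196
```

Token count grows with the square of the resolution, so doubling the image side length quadruples the number of patches the model has to attend over.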

Analogy: Imagine a student (the language model) listening to a 4,000-word lecture delivered in English. That’s manageable — the student’s brain learned English. But now you pipe in another 4,000 words in an entirely different modality (music notation). The student has to learn to “hear” music notation while listening to English. Inefficient and hard.

2. Translation into Text-Space

When a model is trained text-first, its internal representations are optimized for text. If it later learns vision through fine-tuning, the vision embeddings have to be translated into text-space before the model can use them.

For example, if you show GPT-4V a chart:

GPT-4V's process:
Image → Vision Encoder → (X-axis: numbers, Y-axis: "sales") 
        → Projects into text-space → Generates: "This chart shows..."

A natively multimodal model can do:

Gemini's process:
Chart image → Tokenize directly → "This chart shows..." emerges
(no translation step)

The second approach learns richer alignments. The model learns that the visual shape of an upward curve and the word “growth” belong together, without explicit translation.

3. Slower Inference for Long Context

GPT-4V has to:

Run the vision encoder (a separate forward pass)
Project its outputs
Then run the language model’s attention

For a user sending a 100-page PDF with images, this means:

Encoding every image
Waiting for the full pipeline

A unified architecture can batch image tokens and text tokens together, using the same efficient attention mechanisms for both.
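
At the data level, "batching image tokens and text tokens together" could look something like the sketch below. Everything here is hypothetical: the Token class, the whitespace "tokenizer", and the fixed patches_per_image are stand-ins for illustration and say nothing about how Gemini actually tokenizes its inputs. The point is only that a mixed document becomes a single ordered sequence that one model can attend over.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    modality: str   # "text" or "image"
    value: object   # a word, or an (image_id, patch_index) pair

def tokenize_text(text):
    """Toy tokenizer: one token per whitespace-separated word."""
    return [Token("text", word) for word in text.split()]

def tokenize_image(image_id, num_patches):
    """Toy image tokenizer: one token per patch, identified by index."""
    return [Token("image", (image_id, i)) for i in range(num_patches)]

def build_sequence(segments, patches_per_image=4):
    """Flatten an ordered list of (kind, payload) segments into one token sequence."""
    sequence = []
    for kind, payload in segments:
        if kind == "text":
            sequence.extend(tokenize_text(payload))
        elif kind == "image":
            sequence.extend(tokenize_image(payload, patches_per_image))
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return sequence

page = [
    ("text", "Revenue grew in Q3"),
    ("image", "chart-1"),
    ("text", "as shown above"),
]
seq = build_sequence(page)
print(len(seq))  # 11
print([t.modality for t in seq])
```

Because image patches sit in the same sequence as the words around them, there is no separate encode-then-project stage to wait on in this picture.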

4. Training Data Misalignment

Text-first models were trained on internet-scale text data:

Wikipedia, books, web pages, code repositories

Vision encoders were trained on:

ImageNet (classification)
COCO (detection)
Crawled web images with alt-text

These are fundamentally different distributions. When you bolt them together, there’s a “seam” where the training objectives don’t align. A natively trained multimodal model can learn from data where text and images naturally co-occur (captions, documents, videos with audio) from the start.

Concrete Failure Cases

Example 1: Reading a Handwritten Exam

GPT-4V sees:
- Image: handwritten answer
- Vision encoder extracts: "shapes and strokes"
- Projects to text-space: "these shapes could be letters"
- Finally: "answer is likely 'E'"

Gemini sees:
- Image + patches as tokens + character recognition learned jointly
- "E" recognized in one unified step

Example 2: Understanding a Scientific Diagram

A diagram of the water cycle (text labels + arrows + cloud shapes) requires:

Understanding the visual layout (clouds → rain → river)
Understanding the text labels (evaporation, condensation)
Understanding the relationship (how the words and shapes align)

Bolted-on vision learns these separately. Unified multimodality learns them together.

Example 3: Processing a Video

A video is not just “25 frames per second.” It’s:

Visual: the changing pixels (movement)
Temporal: the sequence of changes
Audio: speech and sound effects
Text: captions, subtitles, overlays

A text-first model has to handle video by:

Extracting key frames
Encoding each frame with a vision model
Running an audio encoder separately
Merging the separate outputs afterward

A natively multimodal model processes all of this as one unified sequence of tokens.
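
One way to picture "one unified sequence" for video is a single timeline in which frames, audio chunks, and subtitle text are ordered by timestamp. The sketch below is an assumption-laden illustration, not a description of Gemini's input pipeline: the (timestamp, kind, payload) tuples and the merge_streams helper are invented for the example. It relies on Python's standard heapq.merge, which lazily merges already-sorted inputs.

```python
import heapq

def merge_streams(*streams):
    """Merge per-modality event lists into one timeline.

    Each stream must already be sorted by timestamp (the first tuple element).
    """
    return list(heapq.merge(*streams, key=lambda event: event[0]))

# Hypothetical events: (timestamp in seconds, modality, payload)
frames = [(0.0, "frame", "f0"), (0.5, "frame", "f1"), (1.0, "frame", "f2")]
audio = [(0.2, "audio", "a0"), (0.7, "audio", "a1")]
subtitles = [(0.6, "text", "Hello")]

timeline = merge_streams(frames, audio, subtitles)
print([kind for _, kind, _ in timeline])
# ['frame', 'audio', 'frame', 'text', 'audio', 'frame']
```

In the bolted-on pipeline, each of these streams would be encoded by its own model and reconciled afterward; here the ordering itself carries the temporal relationship between what is seen, heard, and read.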

Why This Matters

The fundamental problem: Bolted-on multimodality limits how well a model can reason across modalities.

If your model was born understanding only language, bolting on vision is like a person who spent 30 years reading only books suddenly learning to interpret photographs through translated descriptions. It’s possible, but it’s always a translation step away from true fluency.

Google’s bet with Gemini: Train a model from scratch on language, vision, audio, and video simultaneously. Not as separate problems, but as one unified language.

Next: The Idea: How Gemini Solves This