Vision Transformer (ViT)

Appears in 1 paper

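A Transformer applied to images: the image is split into fixed-size patches, each patch is flattened and linearly projected into an embedding, and the resulting sequence of patch embeddings is processed as tokens.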
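The patch-to-token step can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the implementation used by any particular model: the function name `patchify`, the nested-list image layout (height x width x channels), and the toy 4x4 image are all hypothetical choices made here. A full ViT would additionally apply a learned linear projection to each flattened patch and add position embeddings, both omitted below.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patch tokens.

    Returns a list of tokens in row-major patch order. Each token is a flat
    list of patch_size * patch_size * C values.
    (Hypothetical helper for illustration only.)
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = []
            # Walk the pixels of one patch row by row and flatten them.
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    token.extend(pixel)  # append every channel value
            tokens.append(token)
    return tokens


# Toy 4x4 image with 3 channels; each pixel stores [row, col, 0].
image = [[[r, c, 0] for c in range(4)] for r in range(4)]
tokens = patchify(image, patch_size=2)
print(len(tokens), len(tokens[0]))  # prints: 4 12
```

With a patch size of 2, the 4x4 image yields 4 tokens of 12 values each (2 x 2 pixels x 3 channels). The Transformer then treats this list much as it would a sequence of word tokens.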

As used in Paper 20 — Gemini: A Family of Highly Capable Multimodal Models →

Gemini uses a similar patch-based approach for image inputs, though the paper does not disclose the full architecture details.
