Vision Transformer (ViT)
A Transformer applied to images: the image is divided into fixed-size patches, and each patch is embedded and treated as a token, much like a word in a sentence. Gemini uses a similar approach for images (though full architecture details aren't disclosed).
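
As a rough illustration of the "patches as tokens" step, here is a minimal sketch in plain Python. The function name `patchify`, the nested-list image layout (height x width x channels), and the toy 4x4 image are assumptions made for this example; they are not taken from any particular ViT implementation, and nothing here reflects how Gemini handles images.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patches.

    Returns the patches in row-major order; each patch is a flat list of
    patch_size * patch_size * C values.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])  # append every channel value
            patches.append(patch)
    return patches


# Hypothetical 4x4 image with 3 channels; each pixel stores [row, col, 0].
image = [[[r, c, 0] for c in range(4)] for r in range(4)]
tokens = patchify(image, patch_size=2)
print(len(tokens), len(tokens[0]))  # 4 patches, 12 values each
```

In a full ViT, each flattened patch would next be passed through a learned linear projection and combined with a position embedding before entering the Transformer layers; those steps are omitted here.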