Context Length

Appears in 1 paper

The maximum number of tokens a model can process in a single input.

Standard Transformers are constrained mainly by memory: the KV cache grows linearly with sequence length. Mistral supports up to 32,768 tokens in context, though practical quality is best up to ~8K tokens because of sliding window attention (SWA) and a mismatch between training and inference conditions.
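
To make the memory constraint concrete, here is a rough sketch of how KV cache size scales with sequence length. The function name `kv_cache_bytes` is hypothetical, and the configuration values (32 layers, 8 key/value heads, head dimension 128, a 4,096-token window, 16-bit storage) are assumptions chosen to resemble a 7B-class model, not figures taken from this entry.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, window=None):
    """Estimate KV cache size in bytes for a single sequence.

    Each layer stores one key and one value vector per cached token
    per KV head. If `window` is given, assume a rolling buffer that
    keeps only the most recent `window` tokens (an assumption about
    how sliding window attention is served).
    """
    cached_tokens = seq_len if window is None else min(seq_len, window)
    keys_and_values = 2  # one K tensor and one V tensor per layer
    return (keys_and_values * n_layers * n_kv_heads * head_dim
            * cached_tokens * bytes_per_elem)


# Assumed, illustrative configuration (not stated in this entry).
ASSUMED = dict(n_layers=32, n_kv_heads=8, head_dim=128)

full = kv_cache_bytes(32_768, **ASSUMED)                 # cache every token
rolling = kv_cache_bytes(32_768, window=4_096, **ASSUMED)  # cache one window

GIB = 1024 ** 3
print(f"full cache:    {full / GIB:.2f} GiB")     # 4.00 GiB
print(f"rolling cache: {rolling / GIB:.2f} GiB")  # 0.50 GiB
```

Under these assumptions, caching all 32,768 tokens costs about 4 GiB per sequence, while a rolling buffer capped at the window size stays at 0.5 GiB regardless of how long the input grows.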