Inference Latency

Appears in 1 paper

The time to generate a single token during inference.

The time to generate a single token during inference. Lower latency means faster response times and better user experience. Mistral 7B's efficiency (GQA + SWA) reduces inference latency by 2–4× compared to larger models, making it practical for real-time applications.