Causal Masking
In autoregressive language modelling, the practice of preventing the model from attending to future tokens (tokens that come after the current position). It is implemented by setting the attention scores for future positions to -∞ before the softmax, so that those positions receive zero attention weight. This is essential for ensuring that models cannot cheat by reading ahead to the tokens they are supposed to predict. Mistral combines causal masking with sliding window masking.
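A minimal NumPy sketch of the masking step described above. The function names and the `window` parameter are illustrative assumptions, not taken from any particular library, and the exact window convention used here (a query sees itself and the `window - 1` positions before it) is likewise an assumption rather than a statement of how Mistral defines it.

```python
import numpy as np

def causal_mask(seq_len, window=None):
    """Boolean mask: True where query position i may attend to key position j."""
    i = np.arange(seq_len)[:, None]  # query positions, as a column
    j = np.arange(seq_len)[None, :]  # key positions, as a row
    allowed = j <= i                 # causal: no attention to later positions
    if window is not None:
        # Assumed sliding-window convention: keep only the last `window` keys.
        allowed &= (i - j) < window
    return allowed

def masked_softmax(scores, allowed):
    """Set disallowed scores to -inf, then apply a row-wise softmax."""
    masked = np.where(allowed, scores, -np.inf)
    # Subtract the row maximum for numerical stability; each row keeps at
    # least its diagonal entry, so the maximum is always finite.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)         # exp(-inf) evaluates to exactly 0
    return weights / weights.sum(axis=-1, keepdims=True)

# Uniform scores make the effect of the mask easy to read off.
scores = np.zeros((4, 4))
weights = masked_softmax(scores, causal_mask(4))
```

With uniform scores, each row of `weights` spreads its mass evenly over the current and earlier positions, and every entry above the diagonal is zero.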
In autoregressive language generation, preventing attention to future tokens: the token at position t cannot attend to any position beyond t. Ring Attention requires careful masking to enforce causality as key-value (KV) chunks circulate across GPUs.
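To illustrate why chunked schemes need care, the sketch below builds the mask for a single (query chunk, KV chunk) pair from global token positions. It is not the Ring Attention implementation: the function name, the contiguous chunk layout, and the chunk size are assumptions made for illustration only.

```python
import numpy as np

def chunk_causal_mask(q_start, kv_start, q_len, kv_len):
    """Mask for one (query chunk, KV chunk) pair.

    q_start and kv_start are the global positions of each chunk's first
    token (hypothetical bookkeeping; a contiguous layout is assumed).
    """
    q_pos = q_start + np.arange(q_len)[:, None]    # global query positions
    k_pos = kv_start + np.arange(kv_len)[None, :]  # global key positions
    return k_pos <= q_pos                          # True where attention is allowed

chunk = 2
# The query chunk covering positions 2..3 meets three different KV chunks.
past = chunk_causal_mask(2, 0, chunk, chunk)    # earlier chunk: fully visible
diag = chunk_causal_mask(2, 2, chunk, chunk)    # same chunk: lower triangular
future = chunk_causal_mask(2, 4, chunk, chunk)  # later chunk: fully masked

# Tiling the per-chunk masks reproduces the ordinary causal mask.
starts = [0, 2, 4]
full = np.block([[chunk_causal_mask(q, k, chunk, chunk) for k in starts]
                 for q in starts])
```

The mask depends on where both chunks sit in the full sequence, not on their local indices, so that offset has to travel with each KV chunk as it moves between devices. A fully masked chunk contributes nothing to the output, so an implementation could in principle skip it.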