Batch Size
The number of training examples (or sequences) processed in a single gradient update. Batch size affects training dynamics and speed, but the scaling laws themselves are relatively independent of it.
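As an illustration of the definition, here is a minimal sketch of mini-batch gradient descent on a toy one-parameter model. The helper names (`iter_batches`, `sgd_epoch`), the data, and the learning rate are hypothetical choices made for this example only. It shows that each gradient update sees exactly one batch, so a smaller batch size means more updates per pass over the same data.

```python
def iter_batches(examples, batch_size):
    """Yield consecutive slices of `examples`, each with at most `batch_size` items."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


def sgd_epoch(w, examples, batch_size, lr):
    """One pass over the data, fitting y = w * x under mean squared error.

    Returns the updated weight and the number of gradient updates performed.
    """
    updates = 0
    for batch in iter_batches(examples, batch_size):
        # Gradient of mean((w*x - y)^2), averaged over this batch only.
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad  # one gradient update per batch
        updates += 1
    return w, updates


# Toy data (an assumption for illustration): 8 examples with true weight 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]

w_small, n_small = sgd_epoch(0.0, data, batch_size=2, lr=0.01)
w_large, n_large = sgd_epoch(0.0, data, batch_size=8, lr=0.01)
print(n_small, n_large)  # 4 1
```

With 8 examples, a batch size of 2 gives four updates per pass and a batch size of 8 gives one. The two runs reach different weights after a single pass, which is one simple way batch size shows up in training dynamics.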