Learning Rate
A small positive number (e.g.
A small positive number (e.g. 0.1) that controls how big each weight update step is. Too high: learning overshoots and oscillates. Too low: learning is painfully slow. Choosing the right learning rate remains an important practical consideration in all neural network training today.
The η (eta) in the weight update rule w ← w - η×∂L/∂w. Controls the step size in gradient descent. Too high: overshoots, diverges. Too low: learns very slowly. Finding the right learning rate is one of the most important practical tasks in training neural networks.
A hyperparameter controlling the step size in gradient descent. The scaling laws assume learning rate is optimized for each model size. Poor learning rate scheduling breaks the scaling laws.