Mean, Variance, and Standard Deviation
Understanding the central tendency and spread of data is fundamental to machine learning. This tutorial covers three essential concepts: mean (average), variance (spread, in squared units), and standard deviation (spread, in the same units as the data).
1. Mean (Average)
The mean is the sum of all values divided by the count. It’s the “center” of a distribution.
Formula
Mean = (1/n) * Σ(x_i)
where:
n = number of values
x_i = each individual value
Σ = sum over all values
Worked Example
Data: Loss values from 5 training runs of a language model
Run 1: 2.5 bits per token
Run 2: 2.4 bits per token
Run 3: 2.6 bits per token
Run 4: 2.3 bits per token
Run 5: 2.4 bits per token
Calculate the mean:
Mean = (1/5) * (2.5 + 2.4 + 2.6 + 2.3 + 2.4)
= (1/5) * (12.2)
= 2.44 bits per token
The average loss across the 5 runs is 2.44 bits per token.
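The same arithmetic can be sketched in a few lines of Python. This is a minimal illustration using only built-ins; the variable names are chosen here for the example.

```python
# Loss values (bits per token) from the five training runs above
losses = [2.5, 2.4, 2.6, 2.3, 2.4]

# Mean = (1/n) * sum of all values
n = len(losses)
total = sum(losses)
mean = total / n

print(f"n = {n}, sum = {total:.1f}, mean = {mean:.2f}")
```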
Interpretation
The mean tells you the “typical” value. If you had to pick a single number to represent all losses, the mean is a good choice. But it doesn’t tell you if the losses are all close together or spread out.
2. Variance
Variance measures how spread out the values are from the mean. High variance means values are scattered; low variance means they’re clustered.
Formula
Variance = (1/n) * Σ(x_i - mean)^2
where:
x_i = each value
mean = the average (calculated above)
(x_i - mean) = deviation from the mean
(x_i - mean)^2 = squared deviation
Worked Example (Continuing from Above)
We have:
- Values: 2.5, 2.4, 2.6, 2.3, 2.4
- Mean: 2.44
Step 1: Compute deviations from the mean
Run 1: 2.5 - 2.44 = 0.06
Run 2: 2.4 - 2.44 = -0.04
Run 3: 2.6 - 2.44 = 0.16
Run 4: 2.3 - 2.44 = -0.14
Run 5: 2.4 - 2.44 = -0.04
Step 2: Square each deviation
Run 1: (0.06)^2 = 0.0036
Run 2: (-0.04)^2 = 0.0016
Run 3: (0.16)^2 = 0.0256
Run 4: (-0.14)^2 = 0.0196
Run 5: (-0.04)^2 = 0.0016
Step 3: Average the squared deviations
Variance = (1/5) * (0.0036 + 0.0016 + 0.0256 + 0.0196 + 0.0016)
= (1/5) * (0.052)
= 0.0104
The variance is 0.0104 (bits per token)^2.
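The three steps above translate directly into code. The sketch below follows them one at a time and then cross-checks the result against `statistics.pvariance` from the standard library, which uses the same divide-by-n formula.

```python
import statistics

losses = [2.5, 2.4, 2.6, 2.3, 2.4]
mean = sum(losses) / len(losses)

# Step 1: deviation of each value from the mean
deviations = [x - mean for x in losses]

# Step 2: square each deviation
squared = [d ** 2 for d in deviations]

# Step 3: average the squared deviations (divide by n, as in the formula above)
variance = sum(squared) / len(losses)

print("deviations:", [round(d, 2) for d in deviations])
print("squared:   ", [round(s, 4) for s in squared])
print("variance:  ", round(variance, 4))

# Cross-check with the standard library's population variance
library_variance = statistics.pvariance(losses)
```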
Interpretation
Variance is in squared units, which is hard to interpret intuitively. For this reason, we use standard deviation (below), which is the square root of variance.
3. Standard Deviation
Standard deviation is the square root of variance. It measures spread in the same units as the original data.
Formula
Std Dev = √Variance
or
Std Dev = √( (1/n) * Σ(x_i - mean)^2 )
Worked Example (Continuing)
From above, Variance = 0.0104.
Std Dev = √(0.0104)
= 0.102 bits per token
The standard deviation is 0.102.
Interpretation
Standard deviation tells you the typical spread. In this example, three of the five loss values are within one std dev (0.102) of the mean (2.44):
- Range: [2.44 - 0.102, 2.44 + 0.102] = [2.338, 2.542]
- Actual values: 2.5, 2.4, 2.6, 2.3, 2.4
- Inside the range: 2.5, 2.4, 2.4. Outside: 2.6 and 2.3.
- Most values fall within ±1 std dev of the mean: ✓
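The check above can be reproduced with a short sketch that computes the std dev and lists which values land inside one std dev of the mean. The names `lower`, `upper`, and `inside` are illustrative.

```python
import math

losses = [2.5, 2.4, 2.6, 2.3, 2.4]
mean = sum(losses) / len(losses)
variance = sum((x - mean) ** 2 for x in losses) / len(losses)

# Standard deviation is the square root of the variance
std_dev = math.sqrt(variance)

# Values within one standard deviation of the mean
lower, upper = mean - std_dev, mean + std_dev
inside = [x for x in losses if lower <= x <= upper]

print(f"std dev = {std_dev:.3f}")
print(f"range = [{lower:.3f}, {upper:.3f}]")
print(f"{len(inside)} of {len(losses)} values inside: {inside}")
```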
4. Relationship Between Mean, Variance, and Std Dev
Start with data: [2.5, 2.4, 2.6, 2.3, 2.4]
↓
Calculate mean: 2.44
↓
Calculate deviations: [0.06, -0.04, 0.16, -0.14, -0.04]
↓
Square deviations: [0.0036, 0.0016, 0.0256, 0.0196, 0.0016]
↓
Average squares: 0.0104 (this is variance)
↓
Take square root: 0.102 (this is std dev)
5. The 68-95-99.7 Rule (Normal Distribution)
For data that follows a normal distribution, the 68-95-99.7 rule tells you how much data falls within certain ranges:
68% of data falls within 1 std dev of the mean: [mean - 1σ, mean + 1σ]
95% of data falls within 2 std devs of the mean: [mean - 2σ, mean + 2σ]
99.7% of data falls within 3 std devs of the mean: [mean - 3σ, mean + 3σ]
Example
If model loss has:
- Mean: 1.5 bits per token
- Std Dev: 0.1 bits per token
Then, if losses across runs are approximately normal:
- About 68% of training runs have loss between 1.4 and 1.6
- About 95% of training runs have loss between 1.3 and 1.7
- About 99.7% of training runs have loss between 1.2 and 1.8
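To see where these percentages come from, here is a small sketch using the standard library's `statistics.NormalDist`. It assumes the losses are exactly normal with the example's mean and std dev, which real training runs only approximate.

```python
from statistics import NormalDist

# Hypothetical loss distribution from the example: mean 1.5, std dev 0.1
loss_dist = NormalDist(mu=1.5, sigma=0.1)

coverages = {}
for k in (1, 2, 3):
    lower = loss_dist.mean - k * loss_dist.stdev
    upper = loss_dist.mean + k * loss_dist.stdev
    # Probability mass between lower and upper = CDF(upper) - CDF(lower)
    coverages[k] = loss_dist.cdf(upper) - loss_dist.cdf(lower)
    print(f"within {k} std dev: [{lower:.1f}, {upper:.1f}] -> {coverages[k]:.1%}")
```

The exact values are slightly different from the rounded 68, 95, and 99.7 figures, which is why the rule is usually stated with "about".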
6. Why This Matters in Machine Learning
Scaling Laws (Why This Tutorial Exists)
When studying scaling laws for language models, you run hundreds of experiments at different model sizes. Each size has variability due to:
- Different random seeds
- Different batch orderings
- Hardware noise
You report:
- Mean loss: The average loss at that scale
- Variance/Std Dev: The uncertainty or spread in loss
This allows you to fit power laws while accounting for noise:
Experiment: 10B parameters
Mean loss: 1.2
Std Dev: 0.05
Experiment: 100B parameters
Mean loss: 1.0
Std Dev: 0.04
Fit a line on log-log axes: these two points give L = a * N^(-0.079)
The std dev around this line shows uncertainty.
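As a rough illustration of the log-log fit, the sketch below solves for the power-law exponent implied by just the two mean losses listed above. A real scaling-law fit would use many model sizes and weight each point by its std dev; the names `alpha` and `a` are chosen here for the example.

```python
import math

# Illustrative (parameter count, mean loss) pairs from the text
n1, loss1 = 10e9, 1.2
n2, loss2 = 100e9, 1.0

# On log-log axes, L = a * N^(-alpha) is a straight line with slope -alpha
slope = (math.log(loss2) - math.log(loss1)) / (math.log(n2) - math.log(n1))
alpha = -slope

# Solve for the prefactor using the first point
a = loss1 * n1 ** alpha

print(f"alpha = {alpha:.3f}, a = {a:.2f}")
print(f"predicted loss at 100B: {a * n2 ** (-alpha):.2f}")
```

With only these two points the slope comes out near -0.079. Because two points always lie exactly on a line, this says nothing yet about uncertainty; that requires more sizes or repeated runs per size.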
Confidence Intervals
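With mean and std dev, you can compute an approximate range that covers about 95% of individual runs, assuming the losses are roughly normally distributed:
95% range = [Mean - 2*Std Dev, Mean + 2*Std Dev]
= [1.2 - 2*0.05, 1.2 + 2*0.05]
= [1.1, 1.3]
Interpretation: About 95% of runs at this scale should land between 1.1 and 1.3. A confidence interval for the mean itself is narrower: it uses the standard error, Std Dev / √n, in place of the std dev.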
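The sketch below contrasts two quantities on the five-run data from section 1: the mean ± 2 std dev range for individual runs, and an interval for the mean based on the standard error (std dev divided by √n). The factor 2 is a rough normal-approximation multiplier, and using the n - 1 sample formula for the standard error is an assumption of this sketch rather than something the tutorial specifies.

```python
import math
import statistics

losses = [2.5, 2.4, 2.6, 2.3, 2.4]  # five runs from section 1
n = len(losses)
mean = statistics.fmean(losses)

# Spread of individual runs (population formula, divide by n, as in this tutorial)
std_dev = statistics.pstdev(losses)
run_range = (mean - 2 * std_dev, mean + 2 * std_dev)

# Uncertainty of the mean: standard error = sample std dev / sqrt(n)
# (statistics.stdev divides by n - 1, the usual choice when estimating from a sample)
std_err = statistics.stdev(losses) / math.sqrt(n)
mean_interval = (mean - 2 * std_err, mean + 2 * std_err)

print(f"range for individual runs: [{run_range[0]:.3f}, {run_range[1]:.3f}]")
print(f"interval for the mean:     [{mean_interval[0]:.3f}, {mean_interval[1]:.3f}]")
```

With only five runs, a t-distribution multiplier larger than 2 would be more appropriate for the interval on the mean, so treat the second line as a lower bound on its width.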
Comparing Models
Model A: Mean loss = 2.0, Std Dev = 0.5
Model B: Mean loss = 1.9, Std Dev = 0.1
Model B is better (lower mean loss) AND more reliable (lower variance).
7. Complete Worked Example: Comparing Training Runs
Scenario: You train a model three times with different random seeds and measure validation loss.
Run 1 Losses: [2.1, 2.2, 2.0, 2.3, 2.1]
Run 2 Losses: [2.5, 2.4, 2.6, 2.5, 2.4]
Run 3 Losses: [1.8, 2.0, 1.9, 1.7, 1.9]
Step 1: Calculate mean for each run
Run 1: (2.1 + 2.2 + 2.0 + 2.3 + 2.1) / 5 = 10.7 / 5 = 2.14
Run 2: (2.5 + 2.4 + 2.6 + 2.5 + 2.4) / 5 = 12.4 / 5 = 2.48
Run 3: (1.8 + 2.0 + 1.9 + 1.7 + 1.9) / 5 = 9.3 / 5 = 1.86
Step 2: Calculate std dev for each run
Run 1:
Deviations: [−0.04, 0.06, −0.14, 0.16, −0.04]
Squared: [0.0016, 0.0036, 0.0196, 0.0256, 0.0016]
Variance: 0.052 / 5 = 0.0104
Std Dev: √0.0104 = 0.102
Run 2:
Deviations: [0.02, −0.08, 0.12, 0.02, −0.08]
Squared: [0.0004, 0.0064, 0.0144, 0.0004, 0.0064]
Variance: 0.028 / 5 = 0.0056
Std Dev: √0.0056 = 0.075
Run 3:
Deviations: [−0.06, 0.14, 0.04, −0.16, 0.04]
Squared: [0.0036, 0.0196, 0.0016, 0.0256, 0.0016]
Variance: 0.052 / 5 = 0.0104
Std Dev: √0.0104 = 0.102
Step 3: Summary Table
| Run | Mean Loss | Std Dev | Mean ± 2 Std Dev |
|---|---|---|---|
| 1 | 2.14 | 0.102 | [1.94, 2.34] |
| 2 | 2.48 | 0.075 | [2.33, 2.63] |
| 3 | 1.86 | 0.102 | [1.66, 2.06] |
Interpretation:
- Run 3 is best: Lowest mean loss (1.86), with a std dev of 0.102.
- Run 2 is worst: Highest mean loss (2.48), even though it has the lowest std dev (0.075).
- Run 1 is middle: Mean loss of 2.14, with the same std dev as Run 3 (0.102).
Choose Run 3 for deployment.
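The whole comparison can be reproduced with a short sketch. The helper `summarize` is a hypothetical name introduced for this example; it applies the divide-by-n formulas used throughout this tutorial.

```python
import math

runs = {
    "Run 1": [2.1, 2.2, 2.0, 2.3, 2.1],
    "Run 2": [2.5, 2.4, 2.6, 2.5, 2.4],
    "Run 3": [1.8, 2.0, 1.9, 1.7, 1.9],
}

def summarize(values):
    """Return (mean, population std dev) using the divide-by-n formulas above."""
    mean = sum(values) / len(values)
    variance = sum((x - mean) ** 2 for x in values) / len(values)
    return mean, math.sqrt(variance)

summary = {name: summarize(values) for name, values in runs.items()}

for name, (mean, std_dev) in summary.items():
    low, high = mean - 2 * std_dev, mean + 2 * std_dev
    print(f"{name}: mean={mean:.2f}, std dev={std_dev:.3f}, range=[{low:.2f}, {high:.2f}]")

# Pick the run with the lowest mean loss
best = min(summary, key=lambda name: summary[name][0])
print("Lowest mean loss:", best)
```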
8. Python Code Example
import numpy as np
# Data from Run 1 (losses)
losses = np.array([2.1, 2.2, 2.0, 2.3, 2.1])
# Calculate mean
mean = np.mean(losses)
print(f"Mean: {mean:.2f}")  # Output: Mean: 2.14
# Calculate variance (np.var divides by n by default, matching the formula above)
variance = np.var(losses)
print(f"Variance: {variance:.4f}")  # Output: Variance: 0.0104
# Calculate standard deviation
std_dev = np.std(losses)
print(f"Std Dev: {std_dev:.3f}")  # Output: Std Dev: 0.102
# Approximate 95% range for individual values: mean ± 2 std dev
# (assumes roughly normal data; not a confidence interval for the mean)
range_lower = mean - 2 * std_dev
range_upper = mean + 2 * std_dev
print(f"95% range: [{range_lower:.2f}, {range_upper:.2f}]")
# Output: 95% range: [1.94, 2.34]
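One detail worth knowing: `np.var` and `np.std` divide by n by default (`ddof=0`), which matches the formulas in this tutorial. Passing `ddof=1` divides by n - 1 instead, which is the common choice when the data is a sample used to estimate the spread of a larger population. The sketch below shows the difference using the standard-library equivalents, so it runs without NumPy.

```python
import statistics

losses = [2.1, 2.2, 2.0, 2.3, 2.1]

# Population formulas (divide by n); same as np.var(losses) and np.std(losses)
pop_var = statistics.pvariance(losses)
pop_std = statistics.pstdev(losses)

# Sample formulas (divide by n - 1); same as np.var(losses, ddof=1) and np.std(losses, ddof=1)
sample_var = statistics.variance(losses)
sample_std = statistics.stdev(losses)

print(f"population: variance={pop_var:.4f}, std dev={pop_std:.3f}")
print(f"sample:     variance={sample_var:.4f}, std dev={sample_std:.3f}")
```

With only five values the two versions differ noticeably, so it helps to state which one you are reporting.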
9. Key Takeaways
- Mean: The center of the data. Single number summary.
- Variance: Squared spread. Hard to interpret (units are squared).
- Standard Deviation: Square root of variance. Same units as data. Measures typical spread.
- Why both? Variance has convenient mathematical properties (it is easier to work with algebraically). Std dev has an intuitive interpretation.
- In ML: Report mean and std dev of model performance across runs to show both average quality and reliability.
- Normal distribution rule: ~68% of data within 1 std dev, ~95% within 2, ~99.7% within 3.
10. Practice Problems
Problem 1: Five students’ exam scores: [75, 82, 78, 85, 80]
a) Calculate the mean
b) Calculate the variance
c) Calculate the std dev
Problem 2: A model is trained 3 times. Accuracy scores: [92%, 91%, 93%]
a) What’s the mean accuracy?
b) What’s the std dev?
c) What’s the approximate 95% range (mean ± 2 std devs)?
Problem 3: Why might a model with lower variance be preferable even if its mean performance is slightly worse? (Think: deployment, reliability.)
Answers to Practice Problems
Problem 1: a) Mean = (75 + 82 + 78 + 85 + 80) / 5 = 400 / 5 = 80
b) Deviations: [-5, 2, -2, 5, 0]
Squared: [25, 4, 4, 25, 0]
Variance = (25 + 4 + 4 + 25 + 0) / 5 = 58 / 5 = 11.6
c) Std Dev = √11.6 = 3.41
Problem 2: a) Mean = (92 + 91 + 93) / 3 = 276 / 3 = 92%
b) Deviations: [0, -1, 1]
Squared: [0, 1, 1]
Variance = 2 / 3 = 0.667
Std Dev = √0.667 = 0.816%
c) Approximate 95% range = [92 - 2(0.816), 92 + 2(0.816)] = [90.37%, 93.63%]
Problem 3: Lower variance means more consistent performance. In production, consistency matters: you want the model to perform predictably, not fluctuate wildly from run to run. A model with a slightly worse mean but much lower variance can be the more trustworthy choice.
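To check your own answers to Problems 1 and 2, a minimal sketch using the standard library's population formulas (divide by n, as in this tutorial) is shown below. The helper name `describe` is introduced here for illustration.

```python
import statistics

scores = [75, 82, 78, 85, 80]   # Problem 1: exam scores
accuracies = [92, 91, 93]       # Problem 2: accuracy in percent

def describe(values):
    """Return (mean, population variance, population std dev)."""
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values)
    std_dev = statistics.pstdev(values)
    return mean, variance, std_dev

for label, values in (("Problem 1", scores), ("Problem 2", accuracies)):
    mean, variance, std_dev = describe(values)
    print(f"{label}: mean={mean:.2f}, variance={variance:.3f}, std dev={std_dev:.3f}")
```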
Related Concepts
- Normal Distribution: The bell curve. Many natural phenomena (and, often, model losses across random seeds) are approximately normal.
- Z-Score: How many standard deviations a value is from the mean. Useful for standardization.
- Confidence Intervals: Range around the sample mean where the true mean likely falls.
- Hypothesis Testing: Using mean and std dev to compare whether two models are statistically different.
Further Reading
- “Statistics for Machine Learning” — Chapter on Descriptive Statistics
- Probability Distributions — How to model data
- Cross-Entropy Loss — Why loss matters in ML
Good luck with your studies!