Mean, Variance, and Standard Deviation

Understanding the central tendency and spread of data is fundamental to machine learning. This tutorial covers three essential concepts: mean (average), variance (squared spread), and standard deviation (spread in the same units as the data).

1. Mean (Average)

The mean is the sum of all values divided by the count. It’s the “center” of a distribution.

Formula

Mean = (1/n) * Σ(x_i)

where:
  n = number of values
  x_i = each individual value
  Σ = sum over all values

Worked Example

Data: Loss values from 5 training runs of a language model

Run 1: 2.5 bits per token
Run 2: 2.4 bits per token
Run 3: 2.6 bits per token
Run 4: 2.3 bits per token
Run 5: 2.4 bits per token

Calculate the mean:

Mean = (1/5) * (2.5 + 2.4 + 2.6 + 2.3 + 2.4)
     = (1/5) * (12.2)
     = 2.44 bits per token

The average loss across the 5 runs is 2.44 bits per token.
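The same calculation can be sketched in plain Python. The helper name `mean` and the variable names are illustrative choices, not part of the tutorial.

```python
# Losses (bits per token) from the five runs in the worked example.
losses = [2.5, 2.4, 2.6, 2.3, 2.4]

def mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)

mean_loss = mean(losses)
print(f"Mean loss: {mean_loss:.2f} bits per token")
```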

Interpretation

The mean tells you the “typical” value. If you had to pick a single number to represent all losses, the mean is a good choice. But it doesn’t tell you if the losses are all close together or spread out.

2. Variance

Variance measures how spread out the values are from the mean. High variance means values are scattered; low variance means they’re clustered.

Formula

Variance = (1/n) * Σ(x_i - mean)^2

where:
  x_i = each value
  mean = the average (calculated above)
  (x_i - mean) = deviation from the mean
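  (x_i - mean)^2 = squared deviation

This is the population variance: it divides by n. When the values are a sample used to estimate the spread of a larger population, the sample variance divides by n - 1 instead. This tutorial uses the 1/n form throughout.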

Worked Example (Continuing from Above)

We have:

Values: 2.5, 2.4, 2.6, 2.3, 2.4
Mean: 2.44

Step 1: Compute deviations from the mean

Run 1: 2.5 - 2.44 = 0.06
Run 2: 2.4 - 2.44 = -0.04
Run 3: 2.6 - 2.44 = 0.16
Run 4: 2.3 - 2.44 = -0.14
Run 5: 2.4 - 2.44 = -0.04

Step 2: Square each deviation

Run 1: (0.06)^2 = 0.0036
Run 2: (-0.04)^2 = 0.0016
Run 3: (0.16)^2 = 0.0256
Run 4: (-0.14)^2 = 0.0196
Run 5: (-0.04)^2 = 0.0016

Step 3: Average the squared deviations

Variance = (1/5) * (0.0036 + 0.0016 + 0.0256 + 0.0196 + 0.0016)
         = (1/5) * (0.052)
         = 0.0104

The variance is 0.0104 (bits per token)^2.
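The three steps above map directly onto a few lines of Python. This is a minimal sketch using the 1/n (population) form from the formula; the variable names are assumptions made for illustration.

```python
losses = [2.5, 2.4, 2.6, 2.3, 2.4]
n = len(losses)
mean_loss = sum(losses) / n

# Step 1: deviation of each value from the mean.
deviations = [x - mean_loss for x in losses]
# Step 2: square each deviation.
squared = [d ** 2 for d in deviations]
# Step 3: average the squared deviations (population variance, divide by n).
variance = sum(squared) / n

print("Deviations:", [round(d, 2) for d in deviations])
print("Squared:   ", [round(s, 4) for s in squared])
print("Variance:  ", round(variance, 4))
```

Rounding is applied only when printing, so the intermediate values keep full floating-point precision.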

Interpretation

Variance is in squared units, which is hard to interpret intuitively. For this reason, we use standard deviation (below), which is the square root of variance.

3. Standard Deviation

Standard deviation is the square root of variance. It measures spread in the same units as the original data.

Formula

Std Dev = √Variance

or

Std Dev = √( (1/n) * Σ(x_i - mean)^2 )

Worked Example (Continuing)

From above, Variance = 0.0104.

Std Dev = √(0.0104)
        = 0.102 bits per token

The standard deviation is 0.102.

Interpretation

Standard deviation tells you the typical distance of a value from the mean. In this example, three of the five loss values lie within 0.102 of the mean (2.44):

Range: [2.44 - 0.102, 2.44 + 0.102] = [2.338, 2.542]
Actual values: 2.5, 2.4, 2.6, 2.3, 2.4
Inside the range: 2.5, 2.4, 2.4 (3 of 5). Outside: 2.6 and 2.3.
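A short sketch of the same check in Python: it takes the square root of the population variance and then counts how many losses land within one standard deviation of the mean. Variable names are illustrative.

```python
import math

losses = [2.5, 2.4, 2.6, 2.3, 2.4]
n = len(losses)
mean_loss = sum(losses) / n
variance = sum((x - mean_loss) ** 2 for x in losses) / n
std_dev = math.sqrt(variance)

# Keep the values that lie inside [mean - std_dev, mean + std_dev].
lower, upper = mean_loss - std_dev, mean_loss + std_dev
inside = [x for x in losses if lower <= x <= upper]

print(f"Std dev: {std_dev:.3f}")
print(f"Within 1 std dev: {len(inside)} of {n} values -> {inside}")
```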

4. Relationship Between Mean, Variance, and Std Dev

Start with data:         [2.5, 2.4, 2.6, 2.3, 2.4]
                              ↓
Calculate mean:         2.44
                              ↓
Calculate deviations:   [0.06, -0.04, 0.16, -0.14, -0.04]
                              ↓
Square deviations:      [0.0036, 0.0016, 0.0256, 0.0196, 0.0016]
                              ↓
Average squares:        0.0104  (this is variance)
                              ↓
Take square root:       0.102   (this is std dev)

5. The 68-95-99.7 Rule (Normal Distribution)

For data that follows a normal distribution, the 68-95-99.7 rule tells you how much data falls within certain ranges:

About 68% of data falls within 1 std dev of the mean:    [mean - 1σ, mean + 1σ]
About 95% of data falls within 2 std devs of the mean:   [mean - 2σ, mean + 2σ]
About 99.7% of data falls within 3 std devs of the mean: [mean - 3σ, mean + 3σ]

Example

If model loss has:

Mean: 1.5 bits per token
Std Dev: 0.1 bits per token

Then:

About 68% of training runs have loss between 1.4 and 1.6
About 95% of training runs have loss between 1.3 and 1.7
About 99.7% of training runs have loss between 1.2 and 1.8
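The percentages in the rule come from the normal distribution itself: the probability of landing within k standard deviations of the mean is erf(k / √2). The sketch below evaluates that with Python's standard library and prints the matching loss ranges for the example; the function name `fraction_within` is an illustrative choice.

```python
import math

def fraction_within(k):
    """Probability that a normal variable lies within k std devs of its mean."""
    return math.erf(k / math.sqrt(2))

mean, std_dev = 1.5, 0.1  # values from the example above
for k in (1, 2, 3):
    low, high = mean - k * std_dev, mean + k * std_dev
    print(f"within {k} std dev: [{low:.1f}, {high:.1f}] covers {fraction_within(k):.1%}")
```

The rule only applies to the extent that the losses really are normally distributed; for skewed or heavy-tailed data the fractions can differ noticeably.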

6. Why This Matters in Machine Learning

Scaling Laws (Why This Tutorial Exists)

When studying scaling laws for language models, you run hundreds of experiments at different model sizes. Each size has variability due to:

Different random seeds
Different batch orderings
Hardware noise

You report:

Mean loss: The average loss at that scale
Variance/Std Dev: The uncertainty or spread in loss

This allows you to fit power laws while accounting for noise:

Experiment: 10B parameters
  Mean loss: 1.2
  Std Dev: 0.05
  
Experiment: 100B parameters
  Mean loss: 1.0
  Std Dev: 0.04
  
Fit a line on log-log axes: Loss follows L = a * N^(-0.076)
The std dev around this line shows uncertainty.
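As a rough illustration of the log-log fit, the sketch below runs a straight-line fit through the two mean losses listed above. With only two rounded data points the line passes through both exactly, so this shows the mechanics rather than a real scaling-law estimate. The fitted exponent comes out near -0.079, close to the -0.076 quoted above given that the means are rounded to one decimal.

```python
import numpy as np

# (model size, mean loss) pairs taken from the two experiments above.
params = np.array([10e9, 100e9])
mean_loss = np.array([1.2, 1.0])

# A power law L = a * N^slope is a straight line in log-log space:
# log L = log a + slope * log N.
slope, intercept = np.polyfit(np.log(params), np.log(mean_loss), deg=1)
a = np.exp(intercept)

print(f"Fitted exponent: {slope:.3f}")
print(f"Fitted prefactor a: {a:.2f}")
```

With more than two model sizes, the per-size std devs could be used to down-weight noisier points, for example through the `w` argument of `np.polyfit`.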

Confidence Intervals

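With mean and std dev, you can compute an approximate 95% range:

95% Range = [Mean - 2*Std Dev, Mean + 2*Std Dev]
          = [1.2 - 0.1, 1.2 + 0.1]
          = [1.1, 1.3]

Interpretation: If losses are roughly normal, about 95% of individual runs at this scale should land between 1.1 and 1.3. Strictly speaking, a confidence interval for the true mean loss is narrower: it uses the standard error, Std Dev / √n, in place of Std Dev, where n is the number of runs.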
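A band of mean ± 2 std dev describes where individual runs are expected to land. For an interval around the mean itself, a common approach replaces the std dev with the standard error (std dev divided by √n). The text does not give the number of runs behind the 1.2 / 0.05 example, so the sketch below reuses the five losses from the first worked example as an assumption. The multiplier 2 is a rough stand-in for the exact normal or t-distribution value.

```python
import math
import statistics

losses = [2.5, 2.4, 2.6, 2.3, 2.4]  # assumed data: runs from the first worked example
n = len(losses)
mean_loss = statistics.fmean(losses)

# Sample std dev (divides by n - 1), the usual choice when estimating from a sample.
sample_std = statistics.stdev(losses)
# Standard error: how much the mean itself is expected to vary between repeated samples.
std_error = sample_std / math.sqrt(n)

# Rough 95% interval for the mean (for n this small, a t multiplier would be wider).
ci_mean = (mean_loss - 2 * std_error, mean_loss + 2 * std_error)
# Band expected to contain roughly 95% of individual runs, if losses are normal.
run_band = (mean_loss - 2 * sample_std, mean_loss + 2 * sample_std)

print(f"Interval for the mean:    [{ci_mean[0]:.3f}, {ci_mean[1]:.3f}]")
print(f"Band for individual runs: [{run_band[0]:.3f}, {run_band[1]:.3f}]")
```

The interval for the mean shrinks as more runs are added, while the band for individual runs does not.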

Comparing Models

Model A: Mean loss = 2.0, Std Dev = 0.5
Model B: Mean loss = 1.9, Std Dev = 0.1

Model B is better (lower mean loss) AND more reliable (lower variance).

7. Complete Worked Example: Comparing Training Runs

Scenario: You train a model three times with different random seeds and measure validation loss.

Run 1 Losses: [2.1, 2.2, 2.0, 2.3, 2.1]
Run 2 Losses: [2.5, 2.4, 2.6, 2.5, 2.4]
Run 3 Losses: [1.8, 2.0, 1.9, 1.7, 1.9]

Step 1: Calculate mean for each run

Run 1: (2.1 + 2.2 + 2.0 + 2.3 + 2.1) / 5 = 10.7 / 5 = 2.14
Run 2: (2.5 + 2.4 + 2.6 + 2.5 + 2.4) / 5 = 12.4 / 5 = 2.48
Run 3: (1.8 + 2.0 + 1.9 + 1.7 + 1.9) / 5 = 9.3 / 5 = 1.86

Step 2: Calculate std dev for each run

Run 1:
  Deviations: [−0.04, 0.06, −0.14, 0.16, −0.04]
  Squared: [0.0016, 0.0036, 0.0196, 0.0256, 0.0016]
  Variance: 0.052 / 5 = 0.0104
  Std Dev: √0.0104 = 0.102

Run 2:
  Deviations: [0.02, −0.08, 0.12, 0.02, −0.08]
  Squared: [0.0004, 0.0064, 0.0144, 0.0004, 0.0064]
  Variance: 0.028 / 5 = 0.0056
  Std Dev: √0.0056 = 0.075

Run 3:
  Deviations: [−0.06, 0.14, 0.04, −0.16, 0.04]
  Squared: [0.0036, 0.0196, 0.0016, 0.0256, 0.0016]
  Variance: 0.052 / 5 = 0.0104
  Std Dev: √0.0104 = 0.102

Step 3: Summary Table

Run	Mean Loss	Std Dev	Mean ± 2 Std Dev
1	2.14	0.102	[1.94, 2.34]
2	2.48	0.075	[2.33, 2.63]
3	1.86	0.102	[1.66, 2.06]

Interpretation:

Run 3 is best: Lowest mean loss (1.86), with a std dev (0.102) no worse than Run 1's.
Run 2 is worst: Highest mean loss (2.48), even though it has the lowest std dev (0.075).
Run 1 is in the middle: Mean loss of 2.14 and the same std dev as Run 3 (0.102).

Choose Run 3 for deployment.
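The summary for all three runs can be reproduced in one pass with NumPy by treating each run as a row. This is a sketch under the same 1/n convention as the hand calculation; the array layout and variable names are illustrative.

```python
import numpy as np

runs = np.array([
    [2.1, 2.2, 2.0, 2.3, 2.1],  # Run 1
    [2.5, 2.4, 2.6, 2.5, 2.4],  # Run 2
    [1.8, 2.0, 1.9, 1.7, 1.9],  # Run 3
])

means = runs.mean(axis=1)  # one mean per row (run)
stds = runs.std(axis=1)    # population std dev (ddof=0), matching the hand calculation

for i, (m, s) in enumerate(zip(means, stds), start=1):
    print(f"Run {i}: mean={m:.2f}  std={s:.3f}  range=[{m - 2 * s:.2f}, {m + 2 * s:.2f}]")

best = int(np.argmin(means)) + 1  # 1-based index of the run with the lowest mean loss
print(f"Lowest mean loss: Run {best}")
```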

8. Python Code Example

import numpy as np

# Data from Run 1 (losses)
losses = np.array([2.1, 2.2, 2.0, 2.3, 2.1])

# Calculate mean
mean = np.mean(losses)
print(f"Mean: {mean:.2f}")  # Output: Mean: 2.14

# Calculate variance (np.var divides by n by default, matching the formula above)
variance = np.var(losses)
print(f"Variance: {variance:.4f}")  # Output: Variance: 0.0104

# Calculate standard deviation (np.std also divides by n by default)
std_dev = np.std(losses)
print(f"Std Dev: {std_dev:.3f}")  # Output: Std Dev: 0.102

# Calculate the mean ± 2 std dev range (covers ~95% of values if they are normal)
lower = mean - 2 * std_dev
upper = mean + 2 * std_dev
print(f"Mean ± 2 Std Dev: [{lower:.2f}, {upper:.2f}]")
# Output: Mean ± 2 Std Dev: [1.94, 2.34]
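One detail worth knowing when using NumPy here: `np.var` and `np.std` divide by n unless told otherwise. Passing `ddof=1` switches to the n - 1 (sample) form. The sketch below shows both side by side on the Run 1 data.

```python
import numpy as np

losses = np.array([2.1, 2.2, 2.0, 2.3, 2.1])

pop_var = np.var(losses)             # divides by n (ddof=0, NumPy's default)
sample_var = np.var(losses, ddof=1)  # divides by n - 1
sample_std = np.std(losses, ddof=1)

print(f"Population variance: {pop_var:.4f}")
print(f"Sample variance:     {sample_var:.4f}")
print(f"Sample std dev:      {sample_std:.3f}")
```

For this data the sample variance is 0.052 / 4 = 0.013, somewhat larger than the population value of 0.0104. The gap shrinks as the number of values grows.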

9. Key Takeaways

Mean: The center of the data. Single number summary.
Variance: Squared spread. Hard to interpret (units are squared).
Standard Deviation: Square root of variance. Same units as data. Measures typical spread.
Why both? Variance has convenient mathematical properties (for example, the variances of independent quantities add). Std dev is easier to interpret because it shares the data's units.
In ML: Report mean and std dev of model performance across runs to show both average quality and reliability.
Normal distribution rule: ~68% of data within 1 std dev, ~95% within 2, ~99.7% within 3.

10. Practice Problems

Problem 1: Five students’ exam scores: [75, 82, 78, 85, 80]
a) Calculate the mean
b) Calculate the variance
c) Calculate the std dev

Problem 2: A model is trained 3 times. Accuracy scores: [92%, 91%, 93%]
a) What’s the mean accuracy?
b) What’s the std dev?
c) What’s the approximate 95% range (mean ± 2 std dev)?

Problem 3: Why might a model with lower variance be preferable even if its mean loss is slightly higher? (Think: deployment, reliability.)

Answers to Practice Problems

Problem 1: a) Mean = (75 + 82 + 78 + 85 + 80) / 5 = 400 / 5 = 80

b) Deviations: [-5, 2, -2, 5, 0]
Squared: [25, 4, 4, 25, 0]
Variance = (25 + 4 + 4 + 25 + 0) / 5 = 58 / 5 = 11.6

c) Std Dev = √11.6 ≈ 3.41

Problem 2: a) Mean = (92 + 91 + 93) / 3 = 276 / 3 = 92%

b) Deviations: [0, -1, 1]
Squared: [0, 1, 1]
Variance = 2 / 3 = 0.667
Std Dev = √0.667 = 0.816%

c) Mean ± 2 Std Dev = [92 - 2(0.816), 92 + 2(0.816)] = [90.37%, 93.63%]

Problem 3: Lower variance means more consistent performance. In production, consistency matters: you want the model to perform the same way every time, not fluctuate wildly. A model with a slightly worse mean but much lower variance is more predictable and can be more trustworthy.
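The numeric answers to Problems 1 and 2 can be checked with Python's standard `statistics` module. `pvariance` and `pstdev` use the 1/n (population) form that this tutorial follows; the `results` dictionary is just an illustrative way to hold the output.

```python
import statistics

scores = [75, 82, 78, 85, 80]  # Problem 1
accuracies = [92, 91, 93]      # Problem 2, in percent

results = {}
for name, data in (("Problem 1", scores), ("Problem 2", accuracies)):
    mean = statistics.fmean(data)
    variance = statistics.pvariance(data)  # population variance (divide by n)
    std_dev = statistics.pstdev(data)      # square root of the population variance
    results[name] = (mean, variance, std_dev)
    print(f"{name}: mean={mean:.2f}  variance={variance:.3f}  std_dev={std_dev:.3f}")
```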

Related Concepts

Normal Distribution: The bell curve. Many natural phenomena are approximately normal, and run-to-run variation in model loss is often modeled this way.
Z-Score: How many standard deviations a value is from the mean. Useful for standardization.
Confidence Intervals: Range around the mean where the true value likely falls.
Hypothesis Testing: Using mean and std dev to compare whether two models are statistically different.
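As a small illustration of the z-score idea, the sketch below standardizes the five losses from the first worked example: each value has the mean subtracted and is then divided by the std dev. The choice of the population std dev (ddof=0) is an assumption made to stay consistent with the rest of the tutorial.

```python
import numpy as np

losses = np.array([2.5, 2.4, 2.6, 2.3, 2.4])

# z-score: how many std devs each value sits above (+) or below (-) the mean.
z_scores = (losses - losses.mean()) / losses.std()  # population std dev (ddof=0)

print(np.round(z_scores, 2))
```

After standardization the values have mean 0 and std dev 1, which is what makes z-scores comparable across different metrics.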
