Statistics From Scratch: The Complete No-BS Guide¶
Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data. In less formal terms: it’s how we make sense of the world through numbers, detect patterns, test ideas, and make informed decisions when we’re not 100% sure about things (which is always).
Whether you’re a data scientist, a student, a trader, or someone who just wants to call BS on misleading charts in the news, this guide is for you. We’re going from zero to confident, with formulas, worked examples, and the occasional reminder that “68% of statistics are made up on the spot.”
1. Descriptive Statistics¶
Descriptive statistics summarize and describe the main features of a dataset. Before we can model or infer anything, we must first understand what we’re looking at. Think of it as getting to know your data before asking it to do tricks.
1.1 Population vs Sample¶
A population is the entire group of individuals or observations we are interested in studying. A sample is a subset drawn from the population to make the study feasible.
| Concept | Population | Sample |
|---|---|---|
| Definition | Every member of the group of interest | A subset selected from the population |
| Size notation | N | n |
| Mean notation | mu (population mean) | x-bar (sample mean) |
| Std dev notation | sigma | s |
| Purpose | Complete truth (often unknown) | Estimate population parameters |
Key insight: We almost never have access to the entire population. We rely on samples and use inferential statistics to draw conclusions about the population from which the sample was drawn. It’s like tasting a spoonful of soup to decide if the whole pot needs more salt; you don’t need to drink the whole thing.
Example: If we want to know the average height of all adults in France (population ~50 million adults), we cannot measure everyone. Instead, we take a sample of, say, 2,000 adults and compute the sample mean to estimate the population mean.
1.2 Measures of Central Tendency¶
Central tendency describes the “center” or “typical value” of a dataset. There are three main ways to find it, and they each tell you something different.
Mean (Arithmetic Average)¶
The mean is the sum of all values divided by the number of values. It’s the one everyone knows.
- Population mean: mu = (1/N) * sum(x_i), for i = 1 to N
- Sample mean: x-bar = (1/n) * sum(x_i), for i = 1 to n
Worked Example:
Dataset: {4, 8, 6, 5, 3, 8, 9, 7}
Step 1: Sum all values: 4 + 8 + 6 + 5 + 3 + 8 + 9 + 7 = 50
Step 2: Count the values: n = 8
Step 3: Divide: x-bar = 50 / 8 = 6.25
The mean is 6.25.
Median¶
The median is the middle value when the data is sorted in ascending order. If there is an even number of observations, the median is the average of the two middle values.
Worked Example (odd count):
Dataset: {12, 3, 7, 9, 15}
Sorted: {3, 7, 9, 12, 15}
The middle value (3rd of 5) is 9. The median is 9.
Worked Example (even count):
Dataset: {4, 8, 6, 5, 3, 8, 9, 7}
Sorted: {3, 4, 5, 6, 7, 8, 8, 9}
The two middle values (4th and 5th of 8) are 6 and 7.
Median = (6 + 7) / 2 = 6.5
The median is 6.5.
When to use what: The mean is sensitive to outliers. If one person in a room earns $10 million while nine others earn $40,000, the mean income is over $1 million, which misrepresents the group. The median ($40,000) would be far more representative. Use the median for skewed distributions and the mean for symmetric ones.
This is why news articles say “median household income” instead of “mean household income.” If they used the mean, Jeff Bezos would personally drag the number up by a few thousand dollars for everyone in his zip code.
Mode¶
The mode is the value that appears most frequently in the dataset.
Worked Example:
Dataset: {4, 8, 6, 5, 3, 8, 9, 7}
Frequency count: 8 appears twice, all others appear once.
The mode is 8.
A dataset can be unimodal (one mode), bimodal (two modes), multimodal (more than two), or have no mode (all values appear equally often).
Bimodal Example:
Dataset: {2, 3, 3, 5, 7, 7, 9}, modes are 3 and 7.
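All three measures are quick to verify with Python's built-in statistics module; here is a minimal sketch using the example datasets above:

```python
import statistics

data = [4, 8, 6, 5, 3, 8, 9, 7]

print(statistics.mean(data))    # 6.25
print(statistics.median(data))  # 6.5
print(statistics.mode(data))    # 8

# multimode handles bimodal/multimodal data:
print(statistics.multimode([2, 3, 3, 5, 7, 7, 9]))  # [3, 7]
```

Note that statistics.mode raises an error on older Pythons (pre-3.8) when there are ties; statistics.multimode is the safe choice for possibly-multimodal data.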
1.3 Measures of Dispersion¶
Dispersion (or spread) measures describe how spread out the values in a dataset are. The mean tells you where the center is; dispersion tells you how far the data wanders from that center.
Range¶
Range = Maximum value - Minimum value
Example: Dataset {3, 4, 5, 6, 7, 8, 8, 9} has Range = 9 - 3 = 6.
The range is simple but heavily influenced by outliers. One crazy data point, and your range is meaningless.
Variance¶
Variance measures the average squared deviation from the mean. Why squared? Because if you just averaged the deviations, positives and negatives would cancel out and you’d always get zero. Squaring fixes that (and also punishes big deviations more than small ones).
- Population variance: sigma^2 = (1/N) * sum((x_i - mu)^2)
- Sample variance: s^2 = (1/(n-1)) * sum((x_i - x-bar)^2)
Why n - 1 for sample variance? This is called Bessel’s correction. Using n - 1 instead of n corrects for the bias that arises because a sample tends to underestimate the population variance. Dividing by n - 1 gives an unbiased estimator. It’s a small mathematical kindness that makes your estimate more honest.
Worked Example (step-by-step):
Dataset: {4, 8, 6, 5, 3}
Step 1: Compute the mean. x-bar = (4 + 8 + 6 + 5 + 3) / 5 = 26 / 5 = 5.2
Step 2: Compute each deviation from the mean and square it.
| x_i | x_i - x-bar | (x_i - x-bar)^2 |
|---|---|---|
| 4 | 4 - 5.2 = -1.2 | 1.44 |
| 8 | 8 - 5.2 = 2.8 | 7.84 |
| 6 | 6 - 5.2 = 0.8 | 0.64 |
| 5 | 5 - 5.2 = -0.2 | 0.04 |
| 3 | 3 - 5.2 = -2.2 | 4.84 |
Step 3: Sum the squared deviations. 1.44 + 7.84 + 0.64 + 0.04 + 4.84 = 14.80
Step 4: Divide by n - 1 = 4 (sample variance). s^2 = 14.80 / 4 = 3.70
The sample variance is 3.70.
Standard Deviation¶
The standard deviation is the square root of the variance. It brings the measure of spread back to the original units of the data. If your data is in dollars, the variance is in “dollars squared” (which makes no sense), but the standard deviation is back in dollars.
s = sqrt(s^2) = sqrt(3.70) = 1.924
The sample standard deviation is approximately 1.92.
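The same calculation takes two lines with the statistics module, which applies Bessel's n - 1 correction by default (pvariance and pstdev are the population versions):

```python
import statistics

data = [4, 8, 6, 5, 3]

s2 = statistics.variance(data)  # sample variance: divides by n - 1
s = statistics.stdev(data)      # sample standard deviation: sqrt(s2)

print(s2)           # 3.7
print(round(s, 3))  # 1.924
```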
1.4 Percentiles, Quartiles, and IQR¶
A percentile is a value below which a given percentage of data falls. The k-th percentile P_k is the value such that k% of the data points are at or below it.
If you scored in the 90th percentile on a test, 90% of test-takers scored at or below your score. It doesn’t mean you got 90% of the answers right; that’s a common and slightly embarrassing misconception.
Quartiles divide sorted data into four equal parts:
- Q1 (25th percentile): 25% of the data falls below this value.
- Q2 (50th percentile): This is the median.
- Q3 (75th percentile): 75% of the data falls below this value.
Interquartile Range (IQR):
IQR = Q3 - Q1
The IQR measures the spread of the middle 50% of the data. It is robust to outliers.
Worked Example:
Dataset (sorted): {2, 4, 5, 7, 8, 9, 11, 13, 15} (n = 9)
- Q2 (median) = 8 (the 5th value)
- Q1 = median of the lower half {2, 4, 5, 7} = (4 + 5) / 2 = 4.5
- Q3 = median of the upper half {9, 11, 13, 15} = (11 + 13) / 2 = 12
IQR = 12 - 4.5 = 7.5
Outlier detection rule: A data point is considered a potential outlier if it falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
- Lower fence: 4.5 - 1.5 * 7.5 = 4.5 - 11.25 = -6.75
- Upper fence: 12 + 1.5 * 7.5 = 12 + 11.25 = 23.25
Any data point below -6.75 or above 23.25 would be flagged as an outlier. In our dataset, no outliers are present.
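Here is a sketch of the quartile-and-fences computation in Python. One caveat: there are several competing conventions for computing quartiles; statistics.quantiles with its default "exclusive" method happens to match the median-of-halves approach used above for this dataset.

```python
import statistics

data = [2, 4, 5, 7, 8, 9, 11, 13, 15]

q1, q2, q3 = statistics.quantiles(data, n=4)  # default method="exclusive"
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3)                      # 4.5 8.0 12.0
print(iqr, lower_fence, upper_fence)   # 7.5 -6.75 23.25
print(outliers)                        # []
```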
1.5 Skewness and Kurtosis¶
Skewness¶
Skewness measures the asymmetry of a distribution.
- Skewness = 0: The distribution is symmetric (e.g., normal distribution).
- Positive skew (right-skewed): The tail extends to the right. Most values are clustered on the left. Example: income distribution; most people earn “normal” amounts, but a few billionaires pull the tail way out to the right.
- Negative skew (left-skewed): The tail extends to the left. Most values are clustered on the right. Example: age at retirement; most people retire around 60-65, but some retire very early.
Intuition: If the mean is greater than the median, the distribution is likely right-skewed. If the mean is less than the median, it is likely left-skewed. The mean gets pulled toward the tail; the median doesn’t care.
Kurtosis¶
Kurtosis measures the tailedness of a distribution: how heavy or light the tails are relative to a normal distribution.
- Mesokurtic (kurtosis ~ 3): Similar tails to the normal distribution.
- Leptokurtic (kurtosis > 3): Heavier tails; more extreme outliers likely. “Tall and thin” peak. Financial returns often look like this: more crashes and booms than a normal distribution would predict.
- Platykurtic (kurtosis < 3): Lighter tails, fewer outliers. “Short and flat” peak.
Excess kurtosis is defined as kurtosis - 3, so a normal distribution has excess kurtosis of 0.
Leptokurtic, mesokurtic, platykurtic. If you can say these at a party without laughing, you might be a statistician.
2. Data Visualization Concepts¶
Data visualization is the graphical representation of data. Choosing the right chart is essential for communicating insights clearly, and for not accidentally lying with your data.
Histograms¶
A histogram displays the frequency distribution of a continuous variable by dividing the data into bins (intervals) and showing how many observations fall into each bin.
When to use: To understand the shape of a distribution (symmetric, skewed, bimodal), to detect outliers, and to choose appropriate statistical models.
Key properties: The x-axis represents the variable, the y-axis represents frequency (or relative frequency). The bars are adjacent (no gaps) because the variable is continuous.
Box Plots (Box-and-Whisker Plots)¶
A box plot summarizes data using five values: minimum, Q1, median (Q2), Q3, and maximum.
- The box spans from Q1 to Q3 (the IQR).
- The line inside the box is the median.
- The whiskers extend to the smallest and largest values within 1.5 * IQR from the box.
- Points beyond the whiskers are plotted individually as outliers.
When to use: To compare distributions across groups, to quickly spot outliers, and to see the spread and center of data side-by-side. Box plots are the Swiss Army knife of exploratory data analysis.
Scatter Plots¶
A scatter plot displays the relationship between two continuous variables by plotting each observation as a point in a 2D plane.
When to use: To explore the relationship (linear, curved, none) between two variables, to detect clusters, and to identify outliers.
3. Probability Distributions¶
A probability distribution describes how the probabilities are distributed over the possible values of a random variable. Think of it as a recipe that tells you “what values can this thing take, and how likely is each one?”
3.1 Discrete Distributions¶
Bernoulli Distribution¶
A single trial with two outcomes: success (probability p) or failure (probability 1 - p). This is the atom of probability: the simplest possible distribution.
P(X = 1) = p and P(X = 0) = 1 - p
- Mean: E(X) = p
- Variance: Var(X) = p(1 - p)
Example: Flipping a fair coin. Let success = heads, p = 0.5. Then P(X = 1) = 0.5, P(X = 0) = 0.5.
Binomial Distribution¶
The number of successes in n independent Bernoulli trials, each with success probability p. This is the Bernoulli distribution’s bigger sibling.
P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
where C(n, k) = n! / (k! * (n - k)!)
- Mean: E(X) = np
- Variance: Var(X) = np(1 - p)
Worked Example: A fair coin is flipped 10 times. What is the probability of getting exactly 6 heads?
n = 10, k = 6, p = 0.5
C(10, 6) = 10! / (6! * 4!) = (10 * 9 * 8 * 7) / (4 * 3 * 2 * 1) = 5040 / 24 = 210
P(X = 6) = 210 * (0.5)^6 * (0.5)^4 = 210 * (0.5)^10 = 210 / 1024 = 0.2051
There is approximately a 20.51% chance of getting exactly 6 heads. Not bad, but also not as likely as you might intuitively think.
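The binomial PMF is a one-liner with math.comb; a minimal sketch verifying the coin example:

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for a Binomial(n, p) random variable."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

print(round(binom_pmf(6, 10, 0.5), 4))  # 0.2051

# Sanity check: the PMF sums to 1 over k = 0..n
print(round(sum(binom_pmf(k, 10, 0.5) for k in range(11)), 10))  # 1.0
```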
Poisson Distribution¶
Models the number of events occurring in a fixed interval of time or space, when events occur independently at a constant average rate lambda.
P(X = k) = (e^(-lambda) * lambda^k) / k!
- Mean: E(X) = lambda
- Variance: Var(X) = lambda
Worked Example: A call center receives an average of 4 calls per hour (lambda = 4). What is the probability of receiving exactly 2 calls in a given hour?
P(X = 2) = (e^(-4) * 4^2) / 2! = (0.01832 * 16) / 2 = 0.2932 / 2 = 0.1465
There is approximately a 14.65% chance of receiving exactly 2 calls.
The Poisson distribution: for when you want to count how many times something happens in a fixed window. Calls per hour, typos per page, goals per soccer game, meteorites hitting your car per lifetime (lambda is very small for that one).
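The Poisson PMF is equally short in code; this sketch verifies the call-center example and checks that the mean really is lambda:

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson(lam) random variable."""
    return math.exp(-lam) * lam**k / math.factorial(k)

print(round(poisson_pmf(2, 4), 4))  # 0.1465

# Expected value check: sum of k * P(X = k) should be lambda = 4
print(round(sum(k * poisson_pmf(k, 4) for k in range(100)), 6))  # 4.0
```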
3.2 Continuous Distributions¶
Uniform Distribution¶
Every value in the interval [a, b] is equally likely. It’s the “I have no idea, so everything is equally possible” distribution.
- PDF: f(x) = 1 / (b - a) for a <= x <= b
- Mean: E(X) = (a + b) / 2
- Variance: Var(X) = (b - a)^2 / 12
Example: A random number is generated uniformly between 0 and 10. The probability that it falls between 3 and 7 is (7 - 3) / (10 - 0) = 4/10 = 0.40 or 40%.
Normal (Gaussian) Distribution¶
The most important distribution in statistics. It’s the celebrity of the distribution world. It appears everywhere (heights, test scores, measurement errors) and the Central Limit Theorem explains why.
f(x) = (1 / (sigma * sqrt(2*pi))) * e^(-(x - mu)^2 / (2 * sigma^2))
The 68-95-99.7 Rule (Empirical Rule):
- Approximately 68% of data falls within mu +/- 1*sigma
- Approximately 95% of data falls within mu +/- 2*sigma
- Approximately 99.7% of data falls within mu +/- 3*sigma
Example: If exam scores are normally distributed with mu = 70 and sigma = 10:
- 68% of scores fall between 60 and 80
- 95% of scores fall between 50 and 90
- 99.7% of scores fall between 40 and 100
So if you scored below 40, you’re a 0.15%-er, and not in the good way.
Exponential Distribution¶
Models the time between events in a Poisson process with rate lambda.
- PDF: f(x) = lambda * e^(-lambda * x) for x >= 0
- Mean: E(X) = 1 / lambda
- Variance: Var(X) = 1 / lambda^2
Key property: The exponential distribution is memoryless; the probability of waiting another t minutes is the same regardless of how long you have already waited. The bus doesn’t care that you’ve been standing there for 20 minutes.
3.3 Standard Normal Distribution and Z-Scores¶
The standard normal distribution has mu = 0 and sigma = 1. Any normal distribution can be converted to it using the Z-score:
Z = (X - mu) / sigma
The Z-score tells you how many standard deviations a value is from the mean. A Z-score of 2 means “this value is 2 standard deviations above average.” A Z-score of -1.5 means “this value is 1.5 standard deviations below average.”
Worked Example:
Student scores on a test are normally distributed with mu = 75 and sigma = 8. A student scored 91. What percentage of students scored lower?
Step 1: Compute the Z-score. Z = (91 - 75) / 8 = 16 / 8 = 2.0
Step 2: Look up Z = 2.0 in the standard normal table. P(Z < 2.0) = 0.9772
97.72% of students scored lower than 91. This student is in the top 2.28%.
Common Z-table values:
| Z-score | P(Z < z) |
|---|---|
| -2.0 | 0.0228 |
| -1.0 | 0.1587 |
| 0.0 | 0.5000 |
| 1.0 | 0.8413 |
| 1.645 | 0.9500 |
| 1.96 | 0.9750 |
| 2.0 | 0.9772 |
| 2.576 | 0.9950 |
Memorize 1.96 and 2.576. They show up everywhere in confidence intervals and hypothesis testing. They’re the 42 of statistics, the answer to (almost) everything.
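You don't actually need a printed table: the standard normal CDF can be written with math.erf from the standard library. A minimal sketch reproducing the table values and the exam example above:

```python
import math

def phi(z):
    """Standard normal CDF: P(Z < z)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Reproduce some table entries
print(round(phi(1.96), 4))   # 0.975
print(round(phi(2.576), 4))  # 0.995

# Exam example: mu = 75, sigma = 8, score = 91
z = (91 - 75) / 8
print(z, round(phi(z), 4))   # 2.0 0.9772
```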
4. Sampling and Estimation¶
4.1 Sampling Methods¶
- Simple Random Sampling: Every member of the population has an equal chance of being selected. Example: drawing names from a hat.
- Stratified Sampling: The population is divided into subgroups (strata) based on a characteristic (e.g., age, gender), and random samples are taken from each stratum proportionally. This ensures representation.
- Systematic Sampling: Every k-th element is selected from a list. For example, selecting every 10th person from an alphabetical list.
4.2 Central Limit Theorem (CLT)¶
Central Limit Theorem: Regardless of the shape of the population distribution, the distribution of sample means approaches a normal distribution as the sample size n increases (typically n >= 30). The mean of the sampling distribution equals the population mean mu, and the standard deviation of the sampling distribution (called the standard error) is sigma / sqrt(n).
Why is the CLT so powerful? It allows us to use the normal distribution to make inferences about population parameters even when the population itself is not normally distributed. This is the theoretical foundation for confidence intervals and hypothesis tests. It’s arguably the most important theorem in all of statistics.
The CLT is like a magic spell: take any weird, ugly, skewed distribution, average enough samples from it, and (poof) you get a beautiful normal distribution. Math is wild.
Example: Suppose the time a customer spends in a store is exponentially distributed (highly right-skewed) with mu = 15 minutes and sigma = 15 minutes. If we take random samples of n = 36 customers:
- The sampling distribution of the sample mean is approximately normal.
- Mean of sampling distribution: mu_x-bar = 15 minutes
- Standard error: SE = 15 / sqrt(36) = 15 / 6 = 2.5 minutes
So even though individual customer times are skewed, the average time for groups of 36 customers will follow a bell curve centered at 15 minutes with a standard error of 2.5 minutes.
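You can watch the CLT in action with a short simulation. This is a sketch; the exact printed numbers depend on the random seed, so treat them as approximate:

```python
import random
import statistics

random.seed(42)

# Individual customer times: exponential with mean 15 (rate lambda = 1/15), right-skewed
def sample_mean(n=36):
    return statistics.mean(random.expovariate(1 / 15) for _ in range(n))

means = [sample_mean() for _ in range(10_000)]

print(round(statistics.mean(means), 2))   # close to 15 (the population mean)
print(round(statistics.stdev(means), 2))  # close to 2.5 (= 15 / sqrt(36))
```

A histogram of `means` would look bell-shaped even though each individual observation is drawn from a heavily skewed exponential.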
4.3 Confidence Intervals¶
A confidence interval provides a range of plausible values for a population parameter. Instead of saying “the mean is 42,” you say “the mean is somewhere between 39 and 45, and I’m 95% confident about that.”
Formula for a confidence interval for the mean (known sigma or large n):
CI = x-bar +/- z* * (sigma / sqrt(n))
where z* is the critical value from the standard normal distribution for the desired confidence level.
| Confidence Level | z* |
|---|---|
| 90% | 1.645 |
| 95% | 1.96 |
| 99% | 2.576 |
Worked Example: A sample of n = 100 light bulbs has a mean lifetime of x-bar = 1200 hours with a known population standard deviation of sigma = 100 hours. Construct a 95% confidence interval.
Step 1: Find the critical value. For 95% confidence, z* = 1.96.
Step 2: Compute the standard error. SE = 100 / sqrt(100) = 100 / 10 = 10
Step 3: Compute the margin of error. ME = 1.96 * 10 = 19.6
Step 4: Construct the interval. CI = 1200 +/- 19.6 = [1180.4, 1219.6]
Interpretation: We are 95% confident that the true population mean lifetime lies between 1180.4 and 1219.6 hours. This does NOT mean there is a 95% probability the true mean is in this interval. The true mean is fixed; it either is or is not in the interval. The “95%” refers to the long-run success rate of the method.
This distinction drives students crazy, but it matters. The true mean isn’t a random variable; it’s a fixed number we don’t know. The interval is what’s random. If we repeated this process 100 times, about 95 of our intervals would contain the true mean.
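The light-bulb interval takes a few lines of Python; a minimal sketch of the formula with z* hard-coded for 95% confidence:

```python
import math

xbar, sigma, n = 1200, 100, 100
z_star = 1.96  # critical value for 95% confidence

se = sigma / math.sqrt(n)    # standard error
me = z_star * se             # margin of error
print(se, me)                # 10.0 19.6
print(xbar - me, xbar + me)  # 1180.4 1219.6
```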
4.4 Margin of Error¶
The margin of error is the half-width of the confidence interval:
ME = z* * (sigma / sqrt(n))
A smaller margin of error means more precision. You can reduce the margin of error by:
- Increasing the sample size n (the most common approach).
- Decreasing the confidence level (a trade-off between confidence and precision).
- Having a smaller population standard deviation (not usually under your control).
Want a tighter confidence interval? Get more data. Always more data. Data is the answer to almost every statistics problem.
5. Hypothesis Testing¶
Hypothesis testing is a formal procedure for deciding whether sample data provides enough evidence to reject a claim about a population. It’s the scientific method, but with math.
5.1 Null and Alternative Hypotheses¶
- Null hypothesis H_0: The default claim, typically “no effect” or “no difference.” It represents the status quo. Think of it as “innocent until proven guilty.”
- Alternative hypothesis H_1 (or H_a): The claim we are testing for, typically “there is an effect” or “there is a difference.”
Example: A pharmaceutical company claims its new drug lowers blood pressure by an average of 10 mmHg.
- H_0: mu = 10 (the drug works as claimed)
- H_1: mu != 10 (the drug does not work as claimed; two-tailed test)
5.2 Type I and Type II Errors¶
| | H_0 is true | H_0 is false |
|---|---|---|
| Reject H_0 | Type I error (false positive) | Correct decision (power) |
| Fail to reject H_0 | Correct decision | Type II error (false negative) |
- Type I error (alpha): Rejecting H_0 when it is actually true. Like convicting an innocent person. The probability of a Type I error is the significance level alpha (commonly 0.05).
- Type II error (beta): Failing to reject H_0 when it is actually false. Like acquitting a guilty person.
- Power = 1 - beta: The probability of correctly rejecting a false H_0.
Type I error: your COVID test says positive, but you’re healthy. Type II error: your COVID test says negative, but you’re actually sick. Both are bad, but in different ways.
5.3 The p-value¶
The p-value is the probability of observing a test statistic at least as extreme as the one computed from the data, assuming the null hypothesis is true.
- If p-value <= alpha: Reject H_0. The result is “statistically significant.”
- If p-value > alpha: Fail to reject H_0. There is insufficient evidence against H_0.
Intuition: A small p-value means the observed data would be very unlikely under the null hypothesis, so we doubt the null hypothesis. Think of it as: “If nothing special is happening (H_0 is true), the chance of seeing data this extreme is only p%. That’s suspicious.”
5.4 Steps of a Hypothesis Test¶
1. State H_0 and H_1.
2. Choose a significance level alpha (e.g., 0.05).
3. Select the appropriate test and compute the test statistic.
4. Determine the p-value (or compare the test statistic to the critical value).
5. Make a decision: reject or fail to reject H_0.
6. State the conclusion in context.
Worked Example (One-sample Z-test):
A factory claims its bolts have a mean length of mu_0 = 5.00 cm. A quality inspector measures a random sample of n = 50 bolts and finds x-bar = 5.08 cm. The population standard deviation is known to be sigma = 0.20 cm. Test at alpha = 0.05.
Step 1: H_0: mu = 5.00, H_1: mu != 5.00 (two-tailed)
Step 2: alpha = 0.05
Step 3: Compute the test statistic. Z = (x-bar - mu_0) / (sigma / sqrt(n)) = (5.08 - 5.00) / (0.20 / sqrt(50)) = 0.08 / 0.02828 = 2.83
Step 4: Find the p-value. For a two-tailed test: p-value = 2 * P(Z > 2.83) = 2 * (1 - 0.9977) = 2 * 0.0023 = 0.0046
Step 5: Since p-value = 0.0046 < 0.05 = alpha, we reject H_0.
Step 6: There is statistically significant evidence that the mean bolt length differs from 5.00 cm. Those bolts are off-spec, and someone on the factory floor needs to recalibrate something.
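The whole test fits in a few lines, with the p-value coming from the same erf-based normal CDF used for Z-tables. A sketch; the tiny difference from the hand-computed 0.0046 is just table rounding:

```python
import math

def phi(z):
    """Standard normal CDF: P(Z < z)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu0, xbar, sigma, n, alpha = 5.00, 5.08, 0.20, 50, 0.05

z = (xbar - mu0) / (sigma / math.sqrt(n))
p_value = 2 * (1 - phi(abs(z)))  # two-tailed

print(round(z, 2), round(p_value, 4))  # 2.83 0.0047
print(p_value <= alpha)                # True -> reject H_0
```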
5.5 t-test¶
When the population standard deviation is unknown (which is almost always in practice), we use the t-test instead of the Z-test. The test statistic follows a t-distribution with n - 1 degrees of freedom.
t = (x-bar - mu_0) / (s / sqrt(n))
The t-distribution is like the normal distribution’s scrappier cousin: a bit wider, a bit heavier in the tails. As your sample size grows, the t-distribution approaches the normal distribution. With n = 30+, they’re almost identical.
One-sample t-test¶
Tests whether a sample mean differs from a hypothesized value.
Example: A teacher believes the average score on a test is 72. A sample of 15 students gives x-bar = 76 and s = 8.
t = (76 - 72) / (8 / sqrt(15)) = 4 / 2.066 = 1.936
With df = 14 and alpha = 0.05 (two-tailed), the critical value is approximately t_crit = 2.145.
Since |1.936| < 2.145, we fail to reject H_0. There is not enough evidence to say the true mean differs from 72. The difference looks suggestive, but we can’t be sure it isn’t just noise.
Two-sample t-test (independent samples)¶
Tests whether two groups have different means.
t = (x-bar_1 - x-bar_2) / sqrt(s_1^2/n_1 + s_2^2/n_2)
Example: Compare test scores for two teaching methods.
- Method A: n_1 = 25, x-bar_1 = 78, s_1 = 10
- Method B: n_2 = 30, x-bar_2 = 73, s_2 = 12
t = (78 - 73) / sqrt(100/25 + 144/30) = 5 / sqrt(4 + 4.8) = 5 / sqrt(8.8) = 5 / 2.966 = 1.686
Using approximate degrees of freedom and alpha = 0.05 (two-tailed), the critical value for large df is about 2.01. Since 1.686 < 2.01, we fail to reject H_0. The difference is not statistically significant. Method A looks better, but we can’t rule out random chance.
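Both t statistics are straightforward to compute by hand or in code (a sketch; the critical values still come from a t-table, since the Python standard library has no t-distribution):

```python
import math

# One-sample t: teacher's test, H_0: mu = 72
xbar, mu0, s, n = 76, 72, 8, 15
t_one = (xbar - mu0) / (s / math.sqrt(n))

# Two-sample t (unequal variances): teaching methods A vs B
t_two = (78 - 73) / math.sqrt(10**2 / 25 + 12**2 / 30)

print(round(t_one, 3))  # 1.936
print(round(t_two, 3))  # 1.685 (the text's 1.686 comes from rounding sqrt(8.8) first)
```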
5.6 Chi-squared Test¶
The chi-squared test assesses whether there is a significant association between two categorical variables using a contingency table.
chi^2 = sum((O_i - E_i)^2 / E_i)
where O_i = observed frequency and E_i = expected frequency under the null hypothesis of independence.
E_i = (row total * column total) / grand total
Worked Example:
A survey of 200 people asks about gender and preference for tea or coffee.
Observed frequencies:
| | Tea | Coffee | Row Total |
|---|---|---|---|
| Male | 30 | 70 | 100 |
| Female | 50 | 50 | 100 |
| Column Total | 80 | 120 | 200 |
- H_0: Gender and drink preference are independent.
- H_1: Gender and drink preference are associated.
Expected frequencies (under independence):
- E(Male, Tea) = (100 * 80) / 200 = 40
- E(Male, Coffee) = (100 * 120) / 200 = 60
- E(Female, Tea) = (100 * 80) / 200 = 40
- E(Female, Coffee) = (100 * 120) / 200 = 60
Compute chi-squared:
chi^2 = (30-40)^2/40 + (70-60)^2/60 + (50-40)^2/40 + (50-60)^2/60
chi^2 = 100/40 + 100/60 + 100/40 + 100/60
chi^2 = 2.5 + 1.667 + 2.5 + 1.667 = 8.334
Degrees of freedom: (rows - 1) * (cols - 1) = (2-1)(2-1) = 1
Critical value for chi^2 at df = 1 and alpha = 0.05 is 3.841.
Since 8.334 > 3.841, we reject H_0. There is a statistically significant association between gender and drink preference. Apparently, men and women in this sample have genuinely different hot beverage preferences. Stop the presses.
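The chi-squared statistic for any contingency table can be computed directly from the observed counts. A minimal sketch (statistic only; the critical value still comes from a chi-squared table):

```python
def chi_squared(observed):
    """Chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand  # expected count
            stat += (o - e) ** 2 / e
    return stat

table = [[30, 70],   # Male: tea, coffee
         [50, 50]]   # Female: tea, coffee
print(round(chi_squared(table), 3))  # 8.333 (the text's 8.334 reflects intermediate rounding)
```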
6. Correlation and Regression¶
6.1 Pearson Correlation Coefficient¶
The Pearson correlation coefficient r measures the strength and direction of the linear relationship between two continuous variables. It ranges from -1 to +1.
r = [n * sum(x_i * y_i) - sum(x_i) * sum(y_i)] / sqrt([n * sum(x_i^2) - (sum(x_i))^2] * [n * sum(y_i^2) - (sum(y_i))^2])
| r value | Interpretation |
|---|---|
| r = +1 | Perfect positive linear relationship |
| 0.7 < r < 1 | Strong positive correlation |
| 0.3 < r < 0.7 | Moderate positive correlation |
| 0 < r < 0.3 | Weak positive correlation |
| r = 0 | No linear correlation |
| -1 < r < 0 | Negative correlation (analogous scale) |
Worked Example:
| Student | Hours studied (x) | Exam score (y) |
|---|---|---|
| 1 | 2 | 55 |
| 2 | 3 | 60 |
| 3 | 5 | 72 |
| 4 | 7 | 80 |
| 5 | 8 | 90 |
Compute the necessary sums (n = 5):
- sum(x) = 2 + 3 + 5 + 7 + 8 = 25
- sum(y) = 55 + 60 + 72 + 80 + 90 = 357
- sum(xy) = 110 + 180 + 360 + 560 + 720 = 1930
- sum(x^2) = 4 + 9 + 25 + 49 + 64 = 151
- sum(y^2) = 3025 + 3600 + 5184 + 6400 + 8100 = 26309
Numerator: 5 * 1930 - 25 * 357 = 9650 - 8925 = 725
Denominator:
- 5 * 151 - 25^2 = 755 - 625 = 130
- 5 * 26309 - 357^2 = 131545 - 127449 = 4096
- sqrt(130 * 4096) = sqrt(532480) = 729.71
r = 725 / 729.71 = 0.9935
The correlation coefficient is approximately 0.99, indicating a very strong positive linear relationship between hours studied and exam scores. Study more, score better. Your parents were right. Sorry.
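The computational formula translates directly into a small function; a sketch verifying the study-hours example:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation via the computational formula above."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))
    return num / den

hours = [2, 3, 5, 7, 8]
scores = [55, 60, 72, 80, 90]
print(round(pearson_r(hours, scores), 4))  # 0.9935
```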
6.2 Correlation vs Causation¶
Correlation does not imply causation. Just because two variables are correlated does not mean one causes the other. There may be a confounding variable, or the relationship could be coincidental.
Classic example: Ice cream sales and drowning deaths are positively correlated. But ice cream does not cause drowning. The confounding variable is summer/warm weather, which increases both ice cream consumption and swimming activity.
There’s an entire website (tylervigen.com) dedicated to spurious correlations. My favorite: US spending on science correlates with suicides by hanging. Does science cause despair? No. Correlation is not causation. Tattoo it on your brain.
6.3 Simple Linear Regression¶
Simple linear regression models the relationship between a dependent variable y and an independent variable x with a straight line:
y = a + bx
where:
- b (slope) = [n * sum(xy) - sum(x) * sum(y)] / [n * sum(x^2) - (sum(x))^2]
- a (intercept) = y-bar - b * x-bar
6.4 Coefficient of Determination (R-squared)¶
R^2 = r^2
R-squared represents the proportion of the variance in y that is explained by x. An R^2 of 0.85 means 85% of the variability in y is explained by the linear relationship with x. The remaining 15% is unexplained, due to other factors, noise, or the universe being messy.
6.5 Worked Example: Predicting House Prices from Size¶
| House | Size (sq ft) x | Price ($1000s) y |
|---|---|---|
| 1 | 850 | 150 |
| 2 | 1200 | 200 |
| 3 | 1500 | 260 |
| 4 | 1800 | 310 |
| 5 | 2200 | 380 |
| 6 | 2600 | 430 |
n = 6
Step 1: Compute the necessary sums.
- sum(x) = 850 + 1200 + 1500 + 1800 + 2200 + 2600 = 10150
- sum(y) = 150 + 200 + 260 + 310 + 380 + 430 = 1730
- sum(xy) = 127500 + 240000 + 390000 + 558000 + 836000 + 1118000 = 3269500
- sum(x^2) = 722500 + 1440000 + 2250000 + 3240000 + 4840000 + 6760000 = 19252500
Step 2: Compute the slope b.
b = (6 * 3269500 - 10150 * 1730) / (6 * 19252500 - 10150^2)
b = (19617000 - 17559500) / (115515000 - 103022500)
b = 2057500 / 12492500 = 0.16470
Step 3: Compute the intercept a.
- x-bar = 10150 / 6 = 1691.67
- y-bar = 1730 / 6 = 288.33
a = 288.33 - 0.16470 * 1691.67 = 288.33 - 278.62 ≈ 9.72
Regression equation: y = 9.72 + 0.1647x
Interpretation:
- For each additional square foot, the predicted price increases by about $164.70.
- The intercept (9.72, or about $9,720) is a mathematical artifact; it does not make practical sense for a house of 0 sq ft. (A house with no square footage is just… land. Or an existential crisis.)
Step 4: Predict the price of a 2000 sq ft house.
y = 9.72 + 0.1647 * 2000 = 9.72 + 329.40 ≈ 339.1
The predicted price is approximately $339,100.
Step 5: Compute R-squared.
First compute the Pearson correlation r using the same sums:
sum(y^2) = 22500 + 40000 + 67600 + 96100 + 144400 + 184900 = 555500
Numerator: 6 * 3269500 - 10150 * 1730 = 2057500
Denominator: sqrt((6 * 19252500 - 10150^2) * (6 * 555500 - 1730^2)) = sqrt(12492500 * (3333000 - 2992900)) = sqrt(12492500 * 340100) = sqrt(4248699250000) ≈ 2061237.3
r = 2057500 / 2061237.3 ≈ 0.99819
R^2 = (0.99819)^2 = 0.9964
99.64% of the variance in house price is explained by house size in this dataset. This is an excellent fit. Almost suspiciously good, actually; real-world data is rarely this clean.
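The full fit and prediction can be reproduced in a few lines. This sketch carries full precision throughout, which is why the intercept differs slightly from hand-rounded arithmetic:

```python
def fit_line(xs, ys):
    """Least-squares slope b and intercept a for the line y = a + b*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx**2)
    a = sy / n - b * sx / n
    return a, b

sizes = [850, 1200, 1500, 1800, 2200, 2600]
prices = [150, 200, 260, 310, 380, 430]   # in $1000s

a, b = fit_line(sizes, prices)
print(round(a, 2), round(b, 4))  # 9.72 0.1647
print(round(a + b * 2000, 1))    # 339.1 -> about $339,100 for a 2000 sq ft house
```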
7. Summary of Key Formulas¶
| Concept | Formula |
|---|---|
| Sample Mean | x-bar = (1/n) * sum(x_i) |
| Sample Variance | s^2 = (1/(n-1)) * sum((x_i - x-bar)^2) |
| Standard Deviation | s = sqrt(s^2) |
| IQR | IQR = Q3 - Q1 |
| Z-score | Z = (X - mu) / sigma |
| Standard Error | SE = sigma / sqrt(n) |
| Confidence Interval | x-bar +/- z* * (sigma / sqrt(n)) |
| Z-test statistic | Z = (x-bar - mu_0) / (sigma / sqrt(n)) |
| t-test statistic | t = (x-bar - mu_0) / (s / sqrt(n)) |
| Chi-squared | chi^2 = sum((O - E)^2 / E) |
| Pearson r | r = [n*sum(xy) - sum(x)*sum(y)] / sqrt([n*sum(x^2) - (sum(x))^2]*[n*sum(y^2) - (sum(y))^2]) |
| Regression slope | b = [n*sum(xy) - sum(x)*sum(y)] / [n*sum(x^2) - (sum(x))^2] |
| Regression intercept | a = y-bar - b * x-bar |
| R-squared | R^2 = r^2 |
| Binomial | P(X=k) = C(n,k) * p^k * (1-p)^(n-k) |
| Poisson | P(X=k) = (e^(-lambda) * lambda^k) / k! |
| Normal PDF | f(x) = (1/(sigma*sqrt(2*pi))) * e^(-(x-mu)^2 / (2*sigma^2)) |
Final note: Statistics is not just about computing numbers. It’s about asking the right questions, choosing the right methods, understanding assumptions, and interpreting results with intellectual honesty. A statistically significant result is not necessarily practically important, and the absence of significance does not mean there is no effect. Always think critically about what the data is telling you, and what it’s not.
As the saying goes: “There are three kinds of lies: lies, damned lies, and statistics.” The best defense against misleading stats? Actually understanding statistics. Which, if you’ve made it this far, you now do.