Probability Theory: From Coin Flips to Bayes’ Theorem¶
Probability theory is the mathematical framework for reasoning about uncertainty. It gives us the language and rules for quantifying how likely events are, making predictions, and making rational decisions when we don’t know what’s going to happen next (which is basically always).
It’s the foundation of statistics, machine learning, finance, gambling, weather forecasting, medical testing, and pretty much every field where the answer isn’t “definitely yes” or “definitely no.”
And the best part? It all starts with a coin flip.
Why do probabilists never get invited to parties? Because they always say “it depends.” Q: “Want a beer?” A: “Well, what’s the prior probability that I’ll enjoy it given my historical beer consumption data…”
1. Foundations of Probability¶
1.1 What Is Probability?¶
Probability assigns a number between 0 and 1 to an event, representing how likely that event is to occur.
- P(event) = 0 means the event is impossible.
- P(event) = 1 means the event is certain.
- P(event) = 0.5 means the event is equally likely to happen or not.
Key terminology:
- Experiment: A process that produces an outcome (e.g., rolling a die, flipping a coin).
- Sample space S (or Omega): The set of all possible outcomes of an experiment.
- Event: A subset of the sample space. An event “occurs” if the outcome of the experiment belongs to that subset.
Example: Rolling a six-sided die.
- Sample space: S = {1, 2, 3, 4, 5, 6}
- Event A = “rolling an even number” = {2, 4, 6}
- P(A) = |A| / |S| = 3 / 6 = 0.5
You have a 50% chance of rolling even. You also have a 50% chance of rolling odd. And you have a 100% chance of arguing about this at game night.
1.2 Three Interpretations of Probability¶
Classical (Laplace) Interpretation¶
Probability is the ratio of favorable outcomes to total equally likely outcomes.
P(A) = number of favorable outcomes / total number of equally likely outcomes
This works well for fair dice, coins, and cards, but breaks down when outcomes are not equally likely. (The probability of rain tomorrow is not “1 out of 2 because it either rains or it doesn’t.”)
Frequentist Interpretation¶
Probability is the long-run relative frequency of an event occurring in repeated experiments.
P(A) = lim (n -> infinity) [count of A occurring / n]
If you flip a fair coin 10,000 times, the proportion of heads will converge to 0.5. This interpretation is empirical and requires repeatable experiments.
Subjective (Bayesian) Interpretation¶
Probability is a degree of belief about how likely an event is, based on available evidence. Different people can assign different probabilities to the same event based on their information.
Example: “There is a 70% chance it will rain tomorrow” expresses a personal assessment of uncertainty, not the result of a repeatable experiment.
Weather forecasters are Bayesians whether they know it or not. When they say “30% chance of rain,” they’re expressing uncertainty, not saying that exactly 3 out of 10 parallel universes are getting wet.
1.3 Probability Axioms (Kolmogorov)¶
All of probability theory is built on three axioms proposed by Andrey Kolmogorov in 1933. Just three. The entire field. From these three simple rules, everything else follows.
- Axiom 1 (Non-negativity): For any event A, P(A) >= 0.
- Axiom 2 (Normalization): P(S) = 1, where S is the entire sample space.
- Axiom 3 (Additivity): For any two mutually exclusive events A and B (i.e., A intersection B = empty set), P(A union B) = P(A) + P(B).
From these three axioms, all other rules of probability can be derived.
Derived results:
- P(empty set) = 0 (the impossible event has probability 0)
- P(A) <= 1 for any event A
- P(A union B) = P(A) + P(B) - P(A intersection B) (inclusion-exclusion for any two events)
1.4 Complementary Events¶
The complement of event A, written A' or A^c, is the event “A does not occur.”
P(A') = 1 - P(A)
Worked Example:
A bag contains 3 red balls and 7 blue balls. What is the probability of NOT drawing a red ball?
- P(red) = 3/10 = 0.30
- P(not red) = 1 - 0.30 = 0.70
There is a 70% chance of not drawing a red ball.
Another Example: The probability that a student passes an exam is 0.85. The probability they fail is 1 - 0.85 = 0.15.
Tip: The complement rule is especially useful when computing P(A) directly is hard but computing P(A') is easy. For instance, “the probability of getting at least one head in 5 coin flips” is easier to compute as 1 - P(no heads at all) = 1 - (0.5)^5 = 1 - 0.03125 = 0.96875.
The complement trick is like answering “how likely is it that SOMETHING interesting happens?” by computing “how likely is it that NOTHING interesting happens?” and subtracting from 1. Lazy? Maybe. Effective? Absolutely.
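The complement trick can be sanity-checked with a short Python sketch using exact fractions, verified against a brute-force enumeration of all 32 flip sequences:

```python
from fractions import Fraction
from itertools import product

# Complement rule: P(at least one head in 5 flips) = 1 - P(no heads)
p_at_least_one = 1 - Fraction(1, 2) ** 5

# Brute-force check over all 2^5 equally likely outcomes
flips = list(product("HT", repeat=5))
hits = sum(1 for seq in flips if "H" in seq)
assert Fraction(hits, len(flips)) == p_at_least_one

print(float(p_at_least_one))  # 0.96875
```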
2. Combinatorics Essentials¶
Combinatorics is the mathematics of counting. It’s essential for computing probabilities in finite sample spaces because probability often boils down to: “how many ways can event A happen, out of the total number of possible outcomes?”
2.1 Factorials¶
The factorial of a non-negative integer n, denoted n!, is the product of all positive integers from 1 to n.
n! = n * (n-1) * (n-2) * ... * 2 * 1
By convention, 0! = 1. (Yes, really. It makes the math work.)
Examples:
- 5! = 5 * 4 * 3 * 2 * 1 = 120
- 3! = 3 * 2 * 1 = 6
- 1! = 1
- 0! = 1
Factorials grow absurdly fast. 10! = 3,628,800. 20! = 2,432,902,008,176,640,000. By the time you get to 70!, the number is larger than the estimated number of atoms in the observable universe. Counting is serious business.
2.2 Permutations (Ordered Arrangements)¶
A permutation is an ordered arrangement of objects. The order matters.
All objects: The number of ways to arrange n distinct objects is n!.
Example: How many ways can 5 books be arranged on a shelf?
5! = 120
There are 120 different arrangements.
Subset of objects: The number of ways to choose and arrange r objects from n distinct objects is:
P(n, r) = n! / (n - r)!
Example: In a race with 10 runners, how many different ways can the gold, silver, and bronze medals be awarded?
P(10, 3) = 10! / 7! = 10 * 9 * 8 = 720
There are 720 possible medal arrangements.
Example: How many 4-letter codes can be formed from the 26 letters of the alphabet if no letter is repeated?
P(26, 4) = 26! / 22! = 26 * 25 * 24 * 23 = 358,800
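The permutation counts above can be checked directly with Python's standard library (`math.perm` computes n! / (n - r)!):

```python
from math import factorial, perm

assert factorial(5) == 120                             # 5 books on a shelf
assert perm(10, 3) == 10 * 9 * 8 == 720                # gold/silver/bronze podiums
assert perm(26, 4) == 26 * 25 * 24 * 23 == 358_800     # 4-letter codes, no repeats
```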
2.3 Combinations (Unordered Selections)¶
A combination is a selection of objects where the order does NOT matter. This is the “I don’t care who’s first” version of counting.
C(n, r) = n! / (r! * (n - r)!)
Also written as “n choose r” or (n r) in binomial coefficient notation.
Example: How many ways can you choose 3 students from a class of 10 for a committee?
C(10, 3) = 10! / (3! * 7!) = (10 * 9 * 8) / (3 * 2 * 1) = 720 / 6 = 120
There are 120 possible committees.
Lottery Example: A lottery requires choosing 6 numbers from 1 to 49. How many possible combinations are there?
C(49, 6) = 49! / (6! * 43!) = (49 * 48 * 47 * 46 * 45 * 44) / (6 * 5 * 4 * 3 * 2 * 1)
= 10,068,347,520 / 720 = 13,983,816
The chance of winning with one ticket is 1 / 13,983,816, which is approximately 0.00000715%.
To put that in perspective: you’re about 45 times more likely to be struck by lightning in your lifetime than to win the lottery with a single ticket. But hey, “you can’t win if you don’t play!” Technically true, but also technically true for Russian roulette.
Poker Hands Example: How many 5-card hands can be dealt from a standard 52-card deck?
C(52, 5) = 52! / (5! * 47!) = (52 * 51 * 50 * 49 * 48) / (5 * 4 * 3 * 2 * 1) = 311,875,200 / 120 = 2,598,960
There are 2,598,960 possible poker hands.
How many of these are a flush (all 5 cards of the same suit)?
- Choose 1 suit from 4: C(4, 1) = 4
- Choose 5 cards from that suit’s 13: C(13, 5) = 1287
- Total flushes (including straight flushes): 4 * 1287 = 5148
P(flush) = 5148 / 2,598,960 = 0.00198 or about 0.2%.
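All the combination counts in this section can be reproduced with `math.comb`, a quick check worth running:

```python
from math import comb

assert comb(10, 3) == 120                   # committees of 3 from a class of 10
assert comb(49, 6) == 13_983_816            # lottery combinations
assert comb(52, 5) == 2_598_960             # 5-card poker hands

flushes = comb(4, 1) * comb(13, 5)          # pick a suit, then 5 of its 13 cards
assert flushes == 5148
print(f"P(flush) = {flushes / comb(52, 5):.5f}")  # P(flush) = 0.00198
```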
2.4 With vs Without Replacement¶
- Without replacement: Once an item is selected, it is removed from the pool. The total number of available items decreases with each selection. This is the default in most combinatorics problems.
- With replacement: After an item is selected, it is put back. Each selection is from the full pool.
Example: Drawing 2 cards from a deck of 52.
Without replacement:
- P(1st card is ace) = 4/52
- P(2nd card is ace | 1st was ace) = 3/51
- P(both aces) = (4/52) * (3/51) = 12/2652 = 1/221 = 0.00452
With replacement:
- P(1st card is ace) = 4/52
- P(2nd card is ace) = 4/52 (card was returned)
- P(both aces) = (4/52) * (4/52) = 16/2704 = 1/169 = 0.00592
Without replacement makes the second draw depend on the first. With replacement, it’s like the deck has amnesia: it forgets what just happened.
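The two-aces calculation, done with exact fractions so the 1/221 and 1/169 results fall out directly:

```python
from fractions import Fraction

p_ace = Fraction(4, 52)

both_without = p_ace * Fraction(3, 51)   # second draw: 51 cards, 3 aces left
both_with = p_ace * p_ace                # card returned, deck unchanged

assert both_without == Fraction(1, 221)
assert both_with == Fraction(1, 169)
```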
3. Conditional Probability¶
3.1 Definition¶
The conditional probability of event A given that event B has occurred is:
P(A|B) = P(A intersection B) / P(B), provided P(B) > 0
This formula restricts the sample space to only the outcomes where B has occurred, then asks what fraction of those outcomes also satisfy A.
Worked Example (Drawing Cards):
A card is drawn from a standard 52-card deck. What is the probability that it is a King given that it is a face card?
- Event A: the card is a King.
- Event B: the card is a face card (Jack, Queen, King).
- P(A intersection B) = P(King) = 4/52 (all Kings are face cards, so A intersection B = A)
- P(B) = 12/52 (there are 12 face cards)
P(A|B) = (4/52) / (12/52) = 4/12 = 1/3 = 0.3333
Given the card is a face card, there is a 33.33% chance it is a King. Makes sense: there are 3 types of face cards (J, Q, K), and King is one of them.
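The conditional probability formula applied to the card example, again with exact fractions:

```python
from fractions import Fraction

p_king_and_face = Fraction(4, 52)    # every King is a face card, so P(A and B) = P(A)
p_face = Fraction(12, 52)            # J, Q, K in each of four suits
p_king_given_face = p_king_and_face / p_face

assert p_king_given_face == Fraction(1, 3)
```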
Worked Example (Medical Test):
A disease affects 1% of the population. A test for the disease has a 95% true positive rate (sensitivity) and a 90% true negative rate (specificity).
- P(Disease) = 0.01
- P(Positive | Disease) = 0.95
- P(Negative | No Disease) = 0.90, so P(Positive | No Disease) = 0.10 (false positive rate)
If a person tests positive, what is the probability they actually have the disease? (We will solve this fully in the Bayes’ Theorem section, and the answer will surprise you.)
3.2 Independence¶
Two events A and B are independent if the occurrence of one does not affect the probability of the other.
P(A intersection B) = P(A) * P(B)
Equivalently, P(A|B) = P(A) and P(B|A) = P(B).
Example: Flipping a coin and rolling a die are independent events. The coin doesn’t care what the die does.
- P(Heads) = 1/2
- P(Roll 6) = 1/6
- P(Heads AND Roll 6) = (1/2) * (1/6) = 1/12
Non-independence example: Drawing two cards from a deck without replacement. The probability of the second card depends on what the first card was.
3.3 Multiplication Rule¶
General multiplication rule:
P(A intersection B) = P(A) * P(B|A) = P(B) * P(A|B)
For independent events:
P(A intersection B) = P(A) * P(B)
Extended to multiple events:
P(A intersection B intersection C) = P(A) * P(B|A) * P(C|A intersection B)
Example: A bag contains 5 red and 3 blue marbles. Two marbles are drawn without replacement. What is the probability both are red?
- P(1st red) = 5/8
- P(2nd red | 1st red) = 4/7
- P(both red) = (5/8) * (4/7) = 20/56 = 5/14 = 0.3571
There is approximately a 35.71% chance both marbles are red.
4. Bayes’ Theorem¶
4.1 Formula and Intuition¶
Bayes’ Theorem:
P(A|B) = [P(B|A) * P(A)] / P(B)
In words: the probability of A given B equals the probability of B given A, times the prior probability of A, divided by the total probability of B.
The denominator can be expanded using the law of total probability:
P(B) = P(B|A) * P(A) + P(B|A') * P(A')
So the full formula becomes:
P(A|B) = [P(B|A) * P(A)] / [P(B|A) * P(A) + P(B|A') * P(A')]
Intuition: Bayes’ theorem lets us update our beliefs. We start with a prior belief P(A), observe evidence B, and compute the posterior belief P(A|B). The evidence either strengthens or weakens our initial belief. It’s like being a detective: you start with a hunch, gather evidence, and revise your suspicions.
4.2 Classic Example: Disease Testing¶
Let us return to the medical test example from Section 3.1 and solve it completely. This is one of the most important examples in all of probability.
Given:
- Prevalence: P(Disease) = 0.01 (1% of the population has the disease)
- Sensitivity: P(Positive | Disease) = 0.95
- Specificity: P(Negative | No Disease) = 0.90, so P(Positive | No Disease) = 0.10
Question: If a person tests positive, what is the probability they actually have the disease?
Step-by-step calculation:
Step 1: Identify what we need.
We need P(Disease | Positive).
Step 2: Apply Bayes’ theorem.
P(Disease | Positive) = [P(Positive | Disease) * P(Disease)] / P(Positive)
Step 3: Compute P(Positive) using the law of total probability.
P(Positive) = P(Positive | Disease) * P(Disease) + P(Positive | No Disease) * P(No Disease)
P(Positive) = 0.95 * 0.01 + 0.10 * 0.99
P(Positive) = 0.0095 + 0.099 = 0.1085
Step 4: Plug into Bayes’ formula.
P(Disease | Positive) = (0.95 * 0.01) / 0.1085 = 0.0095 / 0.1085 = 0.0876
Result: Even after testing positive, there is only an 8.76% chance the person actually has the disease.
Wait, what?! A test that’s 95% accurate says you’re positive, and there’s only an 8.76% chance you’re actually sick? Yes. Welcome to Bayes’ theorem, where your intuition goes to die.
Why so low? Because the disease is rare (1%). The 10% false positive rate generates many more false positives among the 99% of healthy people than the 95% true positive rate generates true positives among the 1% of sick people. The base rate matters enormously.
Let us verify with a concrete population of 10,000 people:
| | Has Disease | No Disease | Total |
|---|---|---|---|
| Test Positive | 0.95 * 100 = 95 | 0.10 * 9900 = 990 | 1085 |
| Test Negative | 0.05 * 100 = 5 | 0.90 * 9900 = 8910 | 8915 |
| Total | 100 | 9900 | 10000 |
Of the 1085 positive tests, only 95 are true positives. So P(Disease | Positive) = 95 / 1085 = 0.0876. This confirms our calculation.
See the problem? 990 healthy people got false positives vs. only 95 sick people got true positives. The rare disease drowns in a sea of false alarms. This is why doctors order a second test when the first comes back positive.
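The full Bayes calculation fits in a few lines; the posterior matches the 95 / 1085 ratio from the population table:

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    # Law of total probability: positives come from the sick AND the healthy
    p_pos = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_pos

p = posterior(prior=0.01, sensitivity=0.95, specificity=0.90)
print(round(p, 4))  # 0.0876, same as 95 / 1085 from the table
```

Try raising the prior to 0.10 (a high-risk group) and the posterior jumps above 50%: the base rate is doing most of the work.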
4.3 Why Bayes Matters in Real Life¶
- Spam filters: Email spam filters use Bayes’ theorem to update the probability that a message is spam based on the words it contains. If the word “lottery” appears, P(spam | "lottery") is computed using the prior probability of spam and the likelihood of “lottery” appearing in spam vs legitimate emails.
- Medical diagnosis: Doctors use Bayesian reasoning (implicitly or explicitly) when interpreting test results. A positive mammogram does not mean a patient definitely has cancer; the base rate of cancer in the population matters enormously.
- Legal reasoning: The probability of guilt given evidence must account for the prior probability and the likelihood of the evidence under both guilt and innocence.
- Machine learning: Naive Bayes classifiers, Bayesian networks, and Bayesian inference methods all rest on Bayes’ theorem.
Bayes’ theorem is like a superpower for thinking clearly about uncertain information. Once you understand it, you start seeing it everywhere, and you start questioning every “95% accurate” claim you encounter.
5. Random Variables¶
5.1 Discrete vs Continuous¶
A random variable is a function that assigns a numerical value to each outcome in the sample space. It’s a way of turning “stuff that happens” into “numbers we can do math with.”
- Discrete random variable: Takes on a countable number of values (often integers). Examples: number of heads in 10 coin flips, number of defective items in a batch.
- Continuous random variable: Takes on any value within an interval. Examples: height, temperature, time.
5.2 PMF, PDF, and CDF¶
Probability Mass Function (PMF): for discrete variables¶
p(x) = P(X = x)
The PMF gives the probability that the random variable takes each specific value. The sum of all probabilities equals 1: sum(p(x_i)) = 1.
Example: For a fair six-sided die:
| x | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| P(X = x) | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 |
Probability Density Function (PDF): for continuous variables¶
f(x) such that P(a <= X <= b) = integral from a to b of f(x) dx
The PDF does not give probabilities directly. For continuous variables, P(X = x) = 0 for any specific value. Probabilities are computed as areas under the curve over an interval.
The probability of being EXACTLY 180.0000000… cm tall is zero. But the probability of being between 179 and 181 cm? That’s a real number. Continuous distributions are all about intervals, not points.
Properties: f(x) >= 0 for all x, and integral from -infinity to +infinity of f(x) dx = 1.
Cumulative Distribution Function (CDF)¶
F(x) = P(X <= x)
The CDF gives the probability that the random variable takes a value less than or equal to x. It works for both discrete and continuous random variables.
Properties:
- F(x) is non-decreasing.
- F(-infinity) = 0 and F(+infinity) = 1.
- For continuous variables: F(x) = integral from -infinity to x of f(t) dt.
5.3 Expected Value¶
The expected value (or expectation) of a random variable is its long-run average value. If you repeated the experiment millions of times and averaged the results, that’s what you’d get.
For a discrete random variable: E(X) = sum(x_i * P(X = x_i))
For a continuous random variable: E(X) = integral from -infinity to +infinity of x * f(x) dx
Worked Example (Fair Die):
E(X) = 1*(1/6) + 2*(1/6) + 3*(1/6) + 4*(1/6) + 5*(1/6) + 6*(1/6)
E(X) = (1 + 2 + 3 + 4 + 5 + 6) / 6 = 21 / 6 = 3.5
The expected value is 3.5. Note that 3.5 is not a value the die can actually land on. The expected value is the long-run average: if you roll a die thousands of times and average the results, you will get very close to 3.5.
It’s like saying the average family has 2.3 children. Nobody actually has 0.3 of a child, but the math works.
Worked Example (Is This Game Fair?):
Suppose a game costs $5 to play. You flip a coin: heads wins $10, tails wins $0.
- P(Win $10) = 0.5, P(Win $0) = 0.5
- Expected winnings: E(X) = 10 * 0.5 + 0 * 0.5 = $5
- Expected profit: $5 - $5 = $0
This is a fair game because the expected profit is zero.
Now consider a modified game: heads wins $12, tails wins $0, cost is $5.
- E(winnings) = 12 * 0.5 + 0 * 0.5 = $6
- E(profit) = $6 - $5 = $1
On average, you gain $1 per game. In the long run, you should play this game. In the short run, you might lose, but probability has your back eventually.
5.4 Variance and Standard Deviation¶
The variance measures how spread out the values of a random variable are around the expected value.
Var(X) = E[(X - E(X))^2] = E(X^2) - [E(X)]^2
The standard deviation is: SD(X) = sqrt(Var(X))
Worked Example (Fair Die):
We already know E(X) = 3.5.
E(X^2) = 1^2*(1/6) + 2^2*(1/6) + 3^2*(1/6) + 4^2*(1/6) + 5^2*(1/6) + 6^2*(1/6)
E(X^2) = (1 + 4 + 9 + 16 + 25 + 36) / 6 = 91 / 6 = 15.1667
Var(X) = 15.1667 - (3.5)^2 = 15.1667 - 12.25 = 2.9167
SD(X) = sqrt(2.9167) = 1.7078
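Both die calculations (Sections 5.3 and 5.4) reduce to averaging over six equally likely faces, which Python can do directly:

```python
from statistics import fmean

faces = [1, 2, 3, 4, 5, 6]              # each outcome has probability 1/6
mean = fmean(faces)                     # E(X)
ex2 = fmean([v * v for v in faces])     # E(X^2)
var = ex2 - mean ** 2                   # Var(X) = E(X^2) - [E(X)]^2
sd = var ** 0.5

assert mean == 3.5
assert abs(var - 2.9167) < 1e-3
assert abs(sd - 1.7078) < 1e-3
```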
5.5 Properties of Expected Value and Variance¶
Linearity of expectation:
E(aX + b) = a * E(X) + b
This holds always, whether or not X has any special distribution. Expectation is linear, and this is one of the most useful properties in all of probability.
Variance scaling:
Var(aX + b) = a^2 * Var(X)
Adding a constant b shifts the distribution but does not change the spread. Multiplying by a scales the spread by a^2.
Example: If E(X) = 10 and Var(X) = 4:
- E(3X + 5) = 3 * 10 + 5 = 35
- Var(3X + 5) = 3^2 * 4 = 9 * 4 = 36
- SD(3X + 5) = sqrt(36) = 6
Sum of independent variables: If X and Y are independent:
- E(X + Y) = E(X) + E(Y) (this holds even without independence)
- Var(X + Y) = Var(X) + Var(Y) (this requires independence)
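A Monte Carlo sketch of the additivity of variance for independent variables, using Uniform(0, 1) draws (each has variance 1/12); the agreement is up to sampling noise, not exact:

```python
import random

random.seed(0)
n = 200_000

def sample_var(v):
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

xs = [random.random() for _ in range(n)]   # Uniform(0, 1): Var = 1/12 each
ys = [random.random() for _ in range(n)]

# For independent draws, Var(X + Y) is close to Var(X) + Var(Y)
lhs = sample_var([x + y for x, y in zip(xs, ys)])
rhs = sample_var(xs) + sample_var(ys)
assert abs(lhs - rhs) < 0.01
assert abs(rhs - 2 / 12) < 0.01
```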
6. Common Distributions (Detailed)¶
Here’s the lineup, the all-star team of probability distributions.
6.1 Bernoulli Distribution¶
The simplest distribution: a single trial with two outcomes. The hydrogen atom of probability.
| Parameter | Value |
|---|---|
| Notation | X ~ Bernoulli(p) |
| Support | {0, 1} |
| PMF | P(X = 1) = p, P(X = 0) = 1 - p |
| Mean | E(X) = p |
| Variance | Var(X) = p(1 - p) |
Example: A basketball player has a 70% free throw rate. Each shot is a Bernoulli trial with p = 0.7.
- P(make) = 0.7, P(miss) = 0.3
- E(X) = 0.7, Var(X) = 0.7 * 0.3 = 0.21
6.2 Binomial Distribution¶
The number of successes in n independent Bernoulli trials. It answers: “If I try this n times, how many times will it work?”
| Parameter | Value |
|---|---|
| Notation | X ~ Binomial(n, p) |
| Support | {0, 1, 2, ..., n} |
| PMF | P(X = k) = C(n,k) * p^k * (1-p)^(n-k) |
| Mean | E(X) = np |
| Variance | Var(X) = np(1-p) |
Worked Example (Quality Control):
A factory produces items with a 5% defect rate (p = 0.05). In a batch of 20 items (n = 20), what is the probability of finding exactly 2 defective items?
P(X = 2) = C(20, 2) * (0.05)^2 * (0.95)^18
C(20, 2) = 20! / (2! * 18!) = (20 * 19) / (2 * 1) = 190
(0.05)^2 = 0.0025
(0.95)^18 = 0.3972 (computed by successive multiplication)
P(X = 2) = 190 * 0.0025 * 0.3972 = 190 * 0.000993 = 0.1887
There is approximately an 18.87% chance of finding exactly 2 defective items.
What is the probability of finding at most 1 defective item?
P(X <= 1) = P(X = 0) + P(X = 1)
P(X = 0) = C(20,0) * (0.05)^0 * (0.95)^20 = 1 * 1 * 0.3585 = 0.3585
P(X = 1) = C(20,1) * (0.05)^1 * (0.95)^19 = 20 * 0.05 * 0.3774 = 0.3774
P(X <= 1) = 0.3585 + 0.3774 = 0.7359
There is approximately a 73.59% chance of finding at most 1 defective item.
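The binomial PMF is short enough to write by hand from the formula above, which reproduces both quality-control answers:

```python
from math import comb

def binom_pmf(n, p, k):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 20, 0.05                                        # 20 items, 5% defect rate
p_exactly_2 = binom_pmf(n, p, 2)
p_at_most_1 = binom_pmf(n, p, 0) + binom_pmf(n, p, 1)

assert abs(p_exactly_2 - 0.1887) < 1e-3
assert abs(p_at_most_1 - 0.7359) < 1e-3
```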
6.3 Geometric Distribution¶
The number of trials needed to get the first success. It answers: “How long until this finally works?”
| Parameter | Value |
|---|---|
| Notation | X ~ Geometric(p) |
| Support | {1, 2, 3, ...} |
| PMF | P(X = k) = (1-p)^(k-1) * p |
| Mean | E(X) = 1/p |
| Variance | Var(X) = (1-p) / p^2 |
Example: A telemarketer has a 10% success rate per call (p = 0.10). What is the probability they make their first sale on the 5th call?
P(X = 5) = (1 - 0.10)^(5-1) * 0.10 = (0.90)^4 * 0.10 = 0.6561 * 0.10 = 0.0656
There is approximately a 6.56% chance the first sale is on the 5th call.
On average, how many calls until the first sale?
E(X) = 1 / 0.10 = 10 calls.
If your success rate is 10%, expect to fail 9 times for every success. That’s not discouraging, that’s statistics giving you realistic expectations. You’re welcome.
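The telemarketer numbers follow directly from the geometric PMF:

```python
def geom_pmf(p, k):
    """P(first success on trial k): k - 1 failures, then a success."""
    return (1 - p) ** (k - 1) * p

p = 0.10                                       # 10% success rate per call
assert abs(geom_pmf(p, 5) - 0.0656) < 1e-3     # first sale on the 5th call
assert 1 / p == 10                             # expected calls until first sale
```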
6.4 Poisson Distribution¶
Models the count of events in a fixed interval when events occur independently at a constant average rate. Great for rare events.
| Parameter | Value |
|---|---|
| Notation | X ~ Poisson(lambda) |
| Support | {0, 1, 2, ...} |
| PMF | P(X = k) = (e^(-lambda) * lambda^k) / k! |
| Mean | E(X) = lambda |
| Variance | Var(X) = lambda |
Key property: For the Poisson distribution, the mean equals the variance. If you see this in data, think Poisson.
Worked Example (Defects per Unit):
A manufacturer knows that fabric has an average of 3 defects per 100 meters (lambda = 3). What is the probability of finding exactly 0 defects in a randomly chosen 100-meter roll?
P(X = 0) = (e^(-3) * 3^0) / 0! = e^(-3) * 1 / 1 = 0.0498
There is approximately a 4.98% chance of zero defects. Perfect fabric is pretty rare.
What about finding 5 or more defects?
P(X >= 5) = 1 - P(X <= 4)
- P(X = 0) = 0.0498
- P(X = 1) = e^(-3) * 3 / 1 = 0.0498 * 3 = 0.1494
- P(X = 2) = e^(-3) * 9 / 2 = 0.0498 * 4.5 = 0.2240
- P(X = 3) = e^(-3) * 27 / 6 = 0.0498 * 4.5 = 0.2240
- P(X = 4) = e^(-3) * 81 / 24 = 0.0498 * 3.375 = 0.1680
P(X <= 4) = 0.0498 + 0.1494 + 0.2240 + 0.2240 + 0.1680 = 0.8152
P(X >= 5) = 1 - 0.8152 = 0.1848
There is approximately an 18.48% chance of 5 or more defects.
Worked Example (Calls per Hour):
A help desk receives an average of 8 calls per hour (lambda = 8). What is the probability of receiving exactly 10 calls in a given hour?
P(X = 10) = (e^(-8) * 8^10) / 10!
- e^(-8) = 0.000335
- 8^10 = 1,073,741,824
- 10! = 3,628,800
P(X = 10) = (0.000335 * 1,073,741,824) / 3,628,800 = 359,703 / 3,628,800 = 0.0993
There is approximately a 9.93% chance of exactly 10 calls.
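Both Poisson examples (fabric defects and help-desk calls) can be checked with a three-line PMF:

```python
from math import exp, factorial

def poisson_pmf(lam, k):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

# Fabric defects, lambda = 3 per 100 meters
assert abs(poisson_pmf(3, 0) - 0.0498) < 1e-3
p_5_plus = 1 - sum(poisson_pmf(3, k) for k in range(5))   # complement trick again
assert abs(p_5_plus - 0.1847) < 1e-3

# Help-desk calls, lambda = 8 per hour
assert abs(poisson_pmf(8, 10) - 0.0993) < 1e-3
```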
6.5 Uniform Distribution (Continuous)¶
| Parameter | Value |
|---|---|
| Notation | X ~ Uniform(a, b) |
| Support | [a, b] |
| PDF | f(x) = 1/(b-a) for a <= x <= b |
| CDF | F(x) = (x-a)/(b-a) for a <= x <= b |
| Mean | E(X) = (a+b)/2 |
| Variance | Var(X) = (b-a)^2 / 12 |
Example: A bus arrives at a stop uniformly at random between 0 and 30 minutes. What is the probability you wait less than 10 minutes?
P(X < 10) = (10 - 0) / (30 - 0) = 10/30 = 1/3 = 0.3333
There is a 33.33% chance of waiting less than 10 minutes.
And a 66.67% chance of standing there wondering why you didn’t check the bus schedule. Probability won’t save you from poor planning.
6.6 Normal (Gaussian) Distribution¶
The most important continuous distribution. It appears everywhere due to the Central Limit Theorem. It’s the Beyoncé of distributions.
| Parameter | Value |
|---|---|
| Notation | X ~ N(mu, sigma^2) |
| Support | (-infinity, +infinity) |
| PDF | f(x) = (1/(sigma*sqrt(2*pi))) * e^(-(x-mu)^2 / (2*sigma^2)) |
| Mean | E(X) = mu |
| Variance | Var(X) = sigma^2 |
Properties of the bell curve:
- Symmetric about mu.
- The mean, median, and mode are all equal to mu.
- The curve extends infinitely in both directions but approaches zero rapidly.
- The inflection points are at mu - sigma and mu + sigma.
68-95-99.7 Rule:
- P(mu - sigma < X < mu + sigma) = 0.6827 (approximately 68%)
- P(mu - 2*sigma < X < mu + 2*sigma) = 0.9545 (approximately 95%)
- P(mu - 3*sigma < X < mu + 3*sigma) = 0.9973 (approximately 99.7%)
Standardization: Any normal variable can be converted to the standard normal Z ~ N(0, 1):
Z = (X - mu) / sigma
Worked Example: Heights of adult men are normally distributed with mu = 175 cm and sigma = 7 cm. What is the probability that a randomly selected man is taller than 185 cm?
Z = (185 - 175) / 7 = 10/7 = 1.4286
P(X > 185) = P(Z > 1.43) = 1 - P(Z < 1.43) = 1 - 0.9236 = 0.0764
There is approximately a 7.64% chance of being taller than 185 cm.
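No Z-table is needed in code: the standard normal CDF can be written with `math.erf`. Note the exact z = 10/7 gives about 0.0766; the 0.0764 above comes from rounding z to 1.43 before the table lookup.

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# 68-95-99.7 sanity check on the standard normal
assert abs((normal_cdf(1) - normal_cdf(-1)) - 0.6827) < 1e-4

# P(height > 185 cm) for mu = 175 cm, sigma = 7 cm
p_taller = 1 - normal_cdf(185, mu=175, sigma=7)
print(round(p_taller, 4))
```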
6.7 Exponential Distribution¶
Models the waiting time between events in a Poisson process. If Poisson counts how many events happen, exponential measures how long you wait between them.
| Parameter | Value |
|---|---|
| Notation | X ~ Exponential(lambda) |
| Support | [0, +infinity) |
| PDF | f(x) = lambda * e^(-lambda*x) |
| CDF | F(x) = 1 - e^(-lambda*x) |
| Mean | E(X) = 1/lambda |
| Variance | Var(X) = 1/lambda^2 |
Memoryless property: P(X > s + t | X > s) = P(X > t)
The probability of waiting at least t more time units is the same regardless of how long you have already waited. This is a unique property of the exponential distribution among continuous distributions.
The exponential distribution has no memory. It doesn’t care that you’ve been waiting for 20 minutes, the probability of waiting 10 more minutes is the same as it was when you started. Light bulbs and radioactive atoms work this way. Humans do not, which is why we get impatient.
Worked Example: Customers arrive at a store at an average rate of 12 per hour (lambda = 12). What is the probability that the time between two consecutive customers is more than 10 minutes?
First, convert units. lambda = 12 per hour = 0.2 per minute.
P(X > 10) = 1 - F(10) = e^(-0.2 * 10) = e^(-2) = 0.1353
There is approximately a 13.53% chance of waiting more than 10 minutes between customers.
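The survival function P(X > x) = e^(-lambda*x) makes both the waiting-time answer and the memoryless property one-liners:

```python
from math import exp

def survival(x, lam):
    """P(X > x) = e^(-lambda * x) for X ~ Exponential(lam)."""
    return exp(-lam * x)

lam = 12 / 60   # 12 customers per hour = 0.2 per minute
assert abs(survival(10, lam) - 0.1353) < 1e-3   # gap longer than 10 minutes

# Memorylessness: having already waited s minutes changes nothing
s, t = 20, 10
assert abs(survival(s + t, lam) / survival(s, lam) - survival(t, lam)) < 1e-12
```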
7. Summary Table of Distributions¶
| Distribution | Type | Parameters | Mean | Variance | Key Use |
|---|---|---|---|---|---|
| Bernoulli | Discrete | p | p | p(1-p) | Single yes/no trial |
| Binomial | Discrete | n, p | np | np(1-p) | Count of successes in n trials |
| Geometric | Discrete | p | 1/p | (1-p)/p^2 | Trials until first success |
| Poisson | Discrete | lambda | lambda | lambda | Count of rare events |
| Uniform | Continuous | a, b | (a+b)/2 | (b-a)^2/12 | Equal likelihood over interval |
| Normal | Continuous | mu, sigma | mu | sigma^2 | Natural phenomena, CLT |
| Exponential | Continuous | lambda | 1/lambda | 1/lambda^2 | Time between events |
Print this table. Laminate it. Put it on your wall. These seven distributions cover about 90% of real-world probability problems.
8. Law of Large Numbers¶
8.1 Statement and Intuition¶
Law of Large Numbers (LLN): As the number of independent, identically distributed trials increases, the sample mean converges to the expected value (population mean).
In notation: as n -> infinity, x-bar_n -> E(X) (in probability or almost surely).
Casino Example: A casino game has a house edge of 2%. On any single bet, the casino might win or lose. But the law of large numbers guarantees that over millions of bets, the casino’s average profit per bet will converge to 2%. This is why casinos are profitable in the long run despite the randomness of individual games.
The LLN is why casinos don’t sweat individual gamblers winning. They WANT you to win sometimes, it keeps you playing. Over millions of bets, the house always wins. Vegas wasn’t built on luck.
Coin Flip Example: If you flip a fair coin 10 times, you might get 7 heads (70%). Flip it 100 times and you will likely get between 40-60 heads (closer to 50%). Flip it 10,000 times and the proportion of heads will be extremely close to 50%. The larger the sample, the closer the average is to the true probability.
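The coin-flip convergence is easy to watch in simulation; the exact fractions printed depend on the seed, but the drift toward 0.5 does not:

```python
import random

random.seed(42)

def head_fraction(n):
    """Simulate n fair coin flips and return the proportion of heads."""
    return sum(random.random() < 0.5 for _ in range(n)) / n

# The proportion wanders for small n and pins down near 0.5 for large n
for n in (10, 100, 10_000, 1_000_000):
    print(n, head_fraction(n))
```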
8.2 Weak vs Strong¶
- Weak LLN: The sample mean converges in probability to the expected value. For any epsilon > 0, P(|x-bar_n - mu| > epsilon) -> 0 as n -> infinity.
- Strong LLN: The sample mean converges almost surely to the expected value: P(lim x-bar_n = mu as n -> infinity) = 1.
The strong form is more powerful (almost sure convergence implies convergence in probability), but the weak form is sufficient for most practical applications.
9. Central Limit Theorem¶
9.1 Statement¶
Central Limit Theorem (CLT): Let X_1, X_2, ..., X_n be independent and identically distributed random variables with mean mu and finite variance sigma^2. As n -> infinity, the distribution of the sample mean x-bar_n approaches a normal distribution: x-bar_n ~ N(mu, sigma^2/n) approximately, for large n.
Equivalently, the standardized sum converges to a standard normal: Z_n = (x-bar_n - mu) / (sigma / sqrt(n)) -> N(0, 1) in distribution.
9.2 Why It Is the Most Important Theorem in Statistics¶
The CLT is profoundly powerful for several reasons:
Universality: It does not matter what the original distribution of the data looks like: skewed, bimodal, discrete, or continuous. As long as the sample size is large enough, the sampling distribution of the mean will be approximately normal.
Foundation for inference: Confidence intervals and hypothesis tests for means are built on the normality of the sampling distribution, which is justified by the CLT.
Practical rule of thumb: A sample size of n >= 30 is usually sufficient for the CLT approximation to be good, though this depends on how skewed the original distribution is.
The CLT is essentially magic. Take any distribution (ugly, weird, skewed, bizarre) and average enough samples from it. The result? A beautiful bell curve. Every time. It’s the “just add water” of mathematics.
9.3 Practical Implications¶
Example: Suppose customer waiting times at a bank have a mean of mu = 8 minutes and a standard deviation of sigma = 5 minutes. The distribution is right-skewed (not normal). If we take a random sample of 40 customers, what is the probability that the average waiting time exceeds 9 minutes?
By the CLT, even though individual waiting times are not normally distributed, the sample mean is approximately normal:
x-bar ~ N(8, 5^2/40) = N(8, 0.625)
Standard error: SE = 5 / sqrt(40) = 5 / 6.3246 = 0.7906
Z = (9 - 8) / 0.7906 = 1 / 0.7906 = 1.265
P(x-bar > 9) = P(Z > 1.265) = 1 - P(Z < 1.265) = 1 - 0.8971 = 0.1029
There is approximately a 10.29% chance that the average waiting time for 40 customers exceeds 9 minutes.
If we increased the sample to n = 100:
SE = 5 / sqrt(100) = 0.5
Z = (9 - 8) / 0.5 = 2.0
P(x-bar > 9) = P(Z > 2.0) = 1 - 0.9772 = 0.0228
With a larger sample, the probability drops to only 2.28%. This illustrates how increasing the sample size reduces variability in the sample mean. More data = more precision. Always.
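The bank calculation for both sample sizes, using an erf-based normal CDF (exact z-values rather than two-decimal table lookups, so the last digit can differ slightly from the text):

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, threshold = 8, 5, 9
for n in (40, 100):
    se = sigma / sqrt(n)                          # standard error of the mean
    p = 1 - normal_cdf((threshold - mu) / se)     # P(x-bar > 9) by the CLT
    print(f"n={n}: SE={se:.4f}, P={p:.4f}")
```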
10. Joint Probability¶
10.1 Joint PMF and PDF¶
When we have two random variables X and Y, their joint distribution describes the probability of any combination of values.
Joint PMF (discrete):
p(x, y) = P(X = x AND Y = y)
The sum over all x and y must equal 1.
Joint PDF (continuous):
f(x, y) such that P((X, Y) in A) = double integral over A of f(x, y) dx dy
Example (Joint PMF):
Two dice are rolled. Let X be the value of die 1 and Y be the value of die 2.
Since the dice are independent, P(X = x, Y = y) = (1/6) * (1/6) = 1/36 for each combination. The joint PMF is uniform over the 6 * 6 = 36 possible outcomes.
P(X = 3, Y = 5) = 1/36
P(X + Y = 7) = P(1,6) + P(2,5) + P(3,4) + P(4,3) + P(5,2) + P(6,1) = 6/36 = 1/6
Seven is the most likely sum when rolling two dice. This is why 7 is special in craps. The casino didn't pick that number at random; probability did.
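A quick enumeration of all 36 outcomes (a minimal sketch; exact arithmetic via `fractions` avoids floating-point noise) confirms both probabilities:

```python
from fractions import Fraction
from collections import Counter

# All 36 equally likely outcomes of two fair dice.
outcomes = [(x, y) for x in range(1, 7) for y in range(1, 7)]
p = Fraction(1, 36)  # the joint PMF is uniform: each pair has probability 1/36

sums = Counter(x + y for x, y in outcomes)
p_seven = sums[7] * p             # six outcomes sum to 7
print(p_seven)                    # 1/6
print(max(sums, key=sums.get))    # 7 is the most likely sum
```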
10.2 Marginal Probability¶
The marginal probability of one variable is obtained by summing (or integrating) the joint distribution over all values of the other variable.
For discrete variables: P(X = x) = sum over all y of P(X = x, Y = y)
Example: Consider the following joint PMF table.
|  | Y = 0 | Y = 1 | Marginal P(X) |
|---|---|---|---|
| X = 0 | 0.20 | 0.10 | 0.30 |
| X = 1 | 0.30 | 0.40 | 0.70 |
| Marginal P(Y) | 0.50 | 0.50 | 1.00 |

- `P(X = 0) = 0.20 + 0.10 = 0.30`
- `P(X = 1) = 0.30 + 0.40 = 0.70`
- `P(Y = 0) = 0.20 + 0.30 = 0.50`
- `P(Y = 1) = 0.10 + 0.40 = 0.50`
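Summing the table programmatically reproduces the marginals (a small sketch; the dictionary `joint` is my encoding of the table, keyed by `(x, y)`):

```python
# Joint PMF from the table, keyed by (x, y).
joint = {(0, 0): 0.20, (0, 1): 0.10, (1, 0): 0.30, (1, 1): 0.40}

# Marginal of X: sum the joint PMF over all values of Y (and vice versa).
p_x = {x: round(sum(p for (xi, _), p in joint.items() if xi == x), 2) for x in (0, 1)}
p_y = {y: round(sum(p for (_, yi), p in joint.items() if yi == y), 2) for y in (0, 1)}

print(p_x)  # {0: 0.3, 1: 0.7}
print(p_y)  # {0: 0.5, 1: 0.5}
```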
10.3 Covariance and Correlation¶
Covariance measures the direction of the linear relationship between two random variables.
Cov(X, Y) = E[(X - E(X))(Y - E(Y))] = E(XY) - E(X) * E(Y)
- `Cov(X, Y) > 0`: X and Y tend to increase together.
- `Cov(X, Y) < 0`: When X increases, Y tends to decrease.
- `Cov(X, Y) = 0`: No linear relationship (but there could still be a non-linear relationship).
Correlation standardizes the covariance to always lie between -1 and +1.
rho(X, Y) = Cov(X, Y) / (SD(X) * SD(Y))
Worked Example: Using the joint PMF table above.
E(X) = 0 * 0.30 + 1 * 0.70 = 0.70
E(Y) = 0 * 0.50 + 1 * 0.50 = 0.50
E(XY) = 0*0*0.20 + 0*1*0.10 + 1*0*0.30 + 1*1*0.40 = 0.40
Cov(X, Y) = 0.40 - 0.70 * 0.50 = 0.40 - 0.35 = 0.05
The covariance is positive (0.05), indicating a slight positive relationship.
Var(X) = E(X^2) - [E(X)]^2 = (0^2*0.30 + 1^2*0.70) - 0.70^2 = 0.70 - 0.49 = 0.21
Var(Y) = E(Y^2) - [E(Y)]^2 = (0^2*0.50 + 1^2*0.50) - 0.50^2 = 0.50 - 0.25 = 0.25
rho(X, Y) = 0.05 / (sqrt(0.21) * sqrt(0.25)) = 0.05 / (0.4583 * 0.5) = 0.05 / 0.2291 = 0.2182
The correlation is approximately 0.22, indicating a weak positive linear relationship.
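The same arithmetic can be done numerically from the joint PMF (a sketch; variable names are mine, dictionary keyed by `(x, y)` as before):

```python
import math

# Joint PMF from the worked example.
joint = {(0, 0): 0.20, (0, 1): 0.10, (1, 0): 0.30, (1, 1): 0.40}

e_x  = sum(x * p for (x, y), p in joint.items())        # E(X) = 0.70
e_y  = sum(y * p for (x, y), p in joint.items())        # E(Y) = 0.50
e_xy = sum(x * y * p for (x, y), p in joint.items())    # E(XY) = 0.40

cov = e_xy - e_x * e_y                                  # Cov = E(XY) - E(X)E(Y)
var_x = sum(x**2 * p for (x, y), p in joint.items()) - e_x**2
var_y = sum(y**2 * p for (x, y), p in joint.items()) - e_y**2
rho = cov / (math.sqrt(var_x) * math.sqrt(var_y))       # standardized covariance

print(round(cov, 4), round(rho, 4))  # about 0.05 and 0.2182
```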
10.4 Independent Random Variables¶
Two random variables X and Y are independent if and only if:
P(X = x, Y = y) = P(X = x) * P(Y = y) for all x, y
Or equivalently, their joint PMF/PDF factors into the product of their marginals.
Check independence in our table:
P(X=0, Y=0) = 0.20 but P(X=0) * P(Y=0) = 0.30 * 0.50 = 0.15
Since 0.20 != 0.15, X and Y are not independent.
Important consequences of independence:
- If `X` and `Y` are independent: `E(XY) = E(X) * E(Y)`
- If `X` and `Y` are independent: `Cov(X, Y) = 0` and `rho(X, Y) = 0`
- If `X` and `Y` are independent: `Var(X + Y) = Var(X) + Var(Y)`

Warning: Zero covariance does NOT imply independence. Two variables can have `Cov(X, Y) = 0` but still be dependent (e.g., `Y = X^2` where `X ~ N(0,1)` gives `Cov(X, Y) = 0` but `Y` is completely determined by `X`).
This is one of the most common mistakes in probability: assuming uncorrelated means independent. It doesn't. Correlation only measures LINEAR relationships. `Y = X^2` is perfectly dependent but has zero correlation with `X` when `X` is symmetric around 0. Sneaky, right?
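A quick simulation (an illustrative sketch, using sample moments rather than the exact expectations) shows the `Y = X^2` case in action:

```python
import random
from statistics import mean

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(100_000)]  # X ~ N(0, 1), symmetric around 0
ys = [x * x for x in xs]                            # Y is completely determined by X

# Sample covariance via E(XY) - E(X)E(Y).
cov = mean(x * y for x, y in zip(xs, ys)) - mean(xs) * mean(ys)
print(round(cov, 2))  # near zero, despite total dependence
```

Theoretically `Cov(X, X^2) = E(X^3) - E(X)E(X^2) = 0` for any distribution symmetric around 0, since all odd moments vanish.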
Summary of Key Formulas¶
| Concept | Formula |
|---|---|
| Complement | P(A') = 1 - P(A) |
| Addition Rule | P(A union B) = P(A) + P(B) - P(A intersection B) |
| Conditional Probability | P(A\|B) = P(A intersection B) / P(B) |
| Multiplication Rule | P(A intersection B) = P(A) * P(B\|A) |
| Independence | P(A intersection B) = P(A) * P(B) |
| Bayes’ Theorem | P(A\|B) = P(B\|A) * P(A) / P(B) |
| Permutations | P(n, r) = n! / (n-r)! |
| Combinations | C(n, r) = n! / (r! * (n-r)!) |
| Expected Value | E(X) = sum(x_i * P(x_i)) |
| Variance | Var(X) = E(X^2) - [E(X)]^2 |
| Linearity of E | E(aX + b) = aE(X) + b |
| Variance Scaling | Var(aX + b) = a^2 * Var(X) |
| Covariance | Cov(X,Y) = E(XY) - E(X)*E(Y) |
| Correlation | rho = Cov(X,Y) / (SD(X) * SD(Y)) |
| Standard Error | SE = sigma / sqrt(n) |
Final reflection: Probability is not just an abstract mathematical exercise. It is the language of uncertainty, and uncertainty is a fundamental feature of reality. From predicting the weather to diagnosing diseases, from designing experiments to training neural networks, probability gives us the rigorous tools to reason about what we do not know.
The concepts in this course (from counting principles through Bayes’ theorem, from random variables through the Central Limit Theorem) form the bedrock upon which all of modern data science and inferential statistics are built. Master them, and you’ve got the keys to understanding an enormous range of quantitative disciplines.
And the next time someone says “there’s a 50% chance of rain, either it rains or it doesn’t,” you’ll know exactly why that reasoning is hilariously, beautifully wrong.