Web Scraping 101: Extracting Financial Data with Python¶
Every finance professional has done it. You open a website, see a table full of numbers you need, and start copying them into Excel. One cell at a time. Ctrl+C, click, Ctrl+V, repeat. After 20 minutes you realize you are only halfway through, you have already made three paste errors, and your will to live is declining faster than a meme stock after earnings.
There is a better way. It is called web scraping, and it is the art of writing code that extracts data from websites automatically. What takes you 20 minutes of painful copy-pasting takes Python about 3 seconds. And unlike you at 4 PM on a Friday, Python does not make paste errors.
Web scraping is one of the most practically useful skills you can learn as someone who works with data. It sits at the intersection of programming, data engineering, and “I refuse to do this manually one more time.” In finance specifically, it opens the door to pulling market data, economic indicators, central bank publications, earnings reports, and anything else that lives on a web page.
The first rule of web scraping: if you are copying data from a website by hand, you are doing it wrong.
What Is Web Scraping?¶
Web scraping is the automated extraction of data from websites. At its core, it works like this:
- Request a web page (just like your browser does)
- Parse the HTML content to find the data you want
- Extract the data into a structured format (like a DataFrame or CSV)
- Store or analyze the data
When you visit a website, your browser sends an HTTP request to a server, receives HTML back, and renders it into the pretty page you see. Web scraping does the same thing, except instead of rendering the page visually, it reads the raw HTML and pulls out the numbers and text you care about.
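Those four steps fit in a dozen lines. A minimal sketch, using a hard-coded HTML snippet in place of a live request (step 1 would normally be a `requests.get` call):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Step 1 (simulated): the HTML a server would return
html = """
<table>
  <tr><th>Ticker</th><th>Price</th></tr>
  <tr><td>AAPL</td><td>148.20</td></tr>
  <tr><td>MSFT</td><td>375.50</td></tr>
</table>
"""

# Step 2: parse the HTML into a navigable tree
soup = BeautifulSoup(html, "html.parser")

# Step 3: extract into a structured format
headers = [th.text for th in soup.find_all("th")]
rows = [
    [td.text for td in tr.find_all("td")]
    for tr in soup.find_all("tr")[1:]  # skip the header row
]
df = pd.DataFrame(rows, columns=headers)

# Step 4: store or analyze
df.to_csv("prices.csv", index=False)
print(df)
```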
Key Insight: Web scraping is not hacking. You are making the same HTTP requests a browser makes. The data is public (the website is showing it to anyone who visits). You are just automating the “reading” part. That said, always check a website’s robots.txt file and terms of service before scraping. Some sites explicitly prohibit it, and hitting a server with thousands of requests per second is not polite.
Think of it this way: web scraping is reading a newspaper, but with a robot arm that clips the articles for you. You are not breaking in. You are just reading faster.
The Python Toolkit¶
You need three libraries. That is it.
1. requests: The Fetcher¶
requests sends HTTP requests and returns the response. It is how you “visit” a web page from Python.
pip install requests
2. BeautifulSoup: The Parser¶
BeautifulSoup (from the bs4 package) parses HTML and lets you navigate the document tree to find specific elements. It is how you say “give me the third table on this page.”
pip install beautifulsoup4
3. pandas: The Organizer¶
pandas takes the extracted data and turns it into clean DataFrames. It also has a built-in read_html() function that can scrape tables directly.
pip install pandas
Or install all three at once:
pip install requests beautifulsoup4 pandas
Three libraries. Three purposes. No more, no less. If someone tells you that you need Selenium to scrape a simple table, they are overcomplicating it.
HTTP Requests: How the Web Works (In 2 Minutes)¶
Before we scrape anything, a quick refresher on how the web actually works.
The Request-Response Cycle¶
Your Script                    Web Server
     |                              |
     |--- GET /page.html ---------->|
     |                              |
     |<-- 200 OK + HTML content ----|
     |                              |
- Your script sends a GET request to a URL
- The server responds with a status code (200 = success) and the HTML content
- You parse the HTML to extract data
Common Status Codes¶
| Code | Meaning | What to Do |
|---|---|---|
| 200 | OK, here is the page | Parse it |
| 301/302 | Redirected | requests handles this automatically |
| 403 | Forbidden | You are being blocked, add headers |
| 404 | Page not found | Check your URL |
| 429 | Too many requests | Slow down, add delays |
| 500 | Server error | Not your fault, retry later |
Your First Request¶
import requests
url = "https://example.com"
response = requests.get(url)
print(f"Status code: {response.status_code}")
print(f"Content length: {len(response.text)} characters")
print(response.text[:500]) # First 500 characters of HTML
Key Insight: Always check response.status_code before parsing. If you get a 403 or 429, your parsing code will silently return empty results, and you will spend an hour debugging your selectors when the real problem was that the server blocked you on the first line.
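A simple guard makes that failure mode impossible to miss. A hedged sketch (the helper names `check_status` and `fetch_checked` are my own, not from any library):

```python
import requests

def check_status(status_code):
    """Return None if the status is OK, otherwise a human-readable hint."""
    if status_code == 200:
        return None
    hints = {
        403: "blocked: try adding a User-Agent header",
        404: "not found: check the URL",
        429: "rate limited: slow down and add delays",
    }
    hint = hints.get(
        status_code,
        "server error, retry later" if status_code >= 500 else "check the request",
    )
    return f"HTTP {status_code} ({hint})"

def fetch_checked(url):
    """Fetch a URL and raise with a diagnostic hint on any non-200 status."""
    response = requests.get(url, timeout=10)
    error = check_status(response.status_code)
    if error:
        raise RuntimeError(f"GET {url} failed: {error}")
    return response
```

Failing loudly on the first line beats debugging empty selectors an hour later.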
HTML Basics: What You Are Parsing¶
HTML is a tree of nested tags. Understanding the basics is essential for scraping.
<html>
  <body>
    <h1>Stock Prices</h1>
    <table class="data-table" id="prices">
      <thead>
        <tr>
          <th>Ticker</th>
          <th>Price</th>
          <th>Change</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>AAPL</td>
          <td>148.20</td>
          <td class="positive">+2.35%</td>
        </tr>
        <tr>
          <td>MSFT</td>
          <td>375.50</td>
          <td class="negative">-1.18%</td>
        </tr>
      </tbody>
    </table>
  </body>
</html>
Key HTML concepts for scraping:
- Tags like <table>, <tr> (table row), <td> (table data), <div>, and <span>
- Attributes like class="data-table" and id="prices" help you identify specific elements
- Nesting: <td> tags are inside <tr> tags, which are inside <tbody>, which is inside <table>
You do not need to be an HTML expert. You need to know enough to right-click an element in your browser, click “Inspect”, and understand what you are looking at. That is about 15 minutes of learning.
BeautifulSoup: Finding What You Need¶
Parsing HTML¶
from bs4 import BeautifulSoup
html = """
<html>
  <body>
    <h1>Market Data</h1>
    <table id="prices">
      <tr><th>Ticker</th><th>Price</th></tr>
      <tr><td>AAPL</td><td>148.20</td></tr>
      <tr><td>MSFT</td><td>375.50</td></tr>
    </table>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")
Finding Elements¶
# Find by tag name (returns the first match)
title = soup.find("h1")
print(title.text) # "Market Data"
# Find by ID (unique identifier)
table = soup.find("table", id="prices")
# Find by class
# soup.find("div", class_="container")
# Find ALL matching elements (returns a list)
rows = soup.find_all("tr")
print(f"Found {len(rows)} rows") # 3 (header + 2 data rows)
# CSS selectors (most flexible)
cells = soup.select("table#prices tr td")
for cell in cells:
    print(cell.text)  # AAPL, 148.20, MSFT, 375.50
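Besides text, you often need attribute values, like the URLs inside link tags. Elements behave like dictionaries for attribute access; a short sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

html = """
<div class="report-list">
  <a href="/reports/q1.pdf" data-period="Q1">Q1 Earnings</a>
  <a href="/reports/q2.pdf" data-period="Q2">Q2 Earnings</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

for link in soup.find_all("a"):
    # Dictionary-style access raises KeyError if the attribute is missing;
    # .get() returns a default instead
    url = link["href"]
    period = link.get("data-period", "unknown")
    print(f"{period}: {url} ({link.text})")
```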
Extracting Table Data¶
# Extract a full table into a list of dictionaries
table = soup.find("table", id="prices")
headers = [th.text for th in table.find_all("th")]
rows = []
for tr in table.find_all("tr")[1:]:  # Skip header row
    cells = [td.text.strip() for td in tr.find_all("td")]
    rows.append(dict(zip(headers, cells)))
print(rows)
# [{'Ticker': 'AAPL', 'Price': '148.20'},
# {'Ticker': 'MSFT', 'Price': '375.50'}]
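One catch: every extracted cell is a string, so convert numeric columns before doing any arithmetic. A sketch continuing from a list of dicts like the one above:

```python
import pandas as pd

rows = [
    {"Ticker": "AAPL", "Price": "148.20"},
    {"Ticker": "MSFT", "Price": "375.50"},
]

df = pd.DataFrame(rows)
# Scraped values arrive as text; convert what should be numeric
df["Price"] = pd.to_numeric(df["Price"])

print(df.dtypes)           # Ticker: object, Price: float64
print(df["Price"].mean())  # 261.85
```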
Key Insight: The soup.select() method supports CSS selectors, which are incredibly powerful. table#prices tr td means “find all td elements inside tr elements inside a table with id prices.” Learn CSS selectors and you can pinpoint any element on any page.
Project 1: Scraping Wikipedia Financial Tables¶
Let us start with something simple. Wikipedia has tables of data on almost every financial topic. Let us pull the list of S&P 500 companies.
import pandas as pd
# pandas.read_html() is magic: it finds ALL tables on a page
url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
tables = pd.read_html(url)
# The first table is the current S&P 500 constituents
sp500 = tables[0]
print(f"Number of companies: {len(sp500)}")
print(sp500.head(10))
Output:
Number of companies: 503
Symbol Security GICS Sector ...
0 MMM 3M Company Industrials ...
1 AOS A. O. Smith Corp Industrials ...
2 ABT Abbott Laboratories Health Care ...
3 ABBV AbbVie Inc. Health Care ...
4 ACN Accenture plc IT ...
That is one line of actual scraping code. pd.read_html() finds every <table> tag on the page and returns a list of DataFrames. It handles headers, row parsing, and type conversion automatically.
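When a page has dozens of tables, guessing the right index gets fragile. read_html also accepts a match parameter that filters to tables containing a given string or regex. A sketch on an inline snippet (real usage would pass the URL instead of StringIO):

```python
from io import StringIO
import pandas as pd

html = """
<table><tr><th>Year</th><th>Event</th></tr>
<tr><td>1957</td><td>Index launched</td></tr></table>
<table><tr><th>Symbol</th><th>Security</th></tr>
<tr><td>MMM</td><td>3M Company</td></tr></table>
"""

# Only tables whose text matches the pattern are returned
tables = pd.read_html(StringIO(html), match="Symbol")
print(len(tables))                   # 1
print(tables[0].columns.tolist())    # ['Symbol', 'Security']
```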
Analyzing the Data¶
# Sector breakdown
sector_counts = sp500["GICS Sector"].value_counts()
print(sector_counts)
# Industrials 78
# Information Technology 76
# Financials 72
# Health Care 64
# Consumer Discretionary 53
# ...
# Filter for tech stocks
tech_stocks = sp500[sp500["GICS Sector"] == "Information Technology"]
print(f"Tech stocks in S&P 500: {len(tech_stocks)}")
print(tech_stocks[["Symbol", "Security"]].head())
Saving to CSV or Excel¶
# Save to CSV
sp500.to_csv("sp500_companies.csv", index=False)
# Save to Excel
sp500.to_excel("sp500_companies.xlsx", index=False)
You just downloaded, parsed, and saved the entire S&P 500 constituent list in 4 lines of Python. The same task in Excel would involve 20 minutes of copy-paste and at least one “I accidentally pasted it in the wrong column” moment.
Project 2: Fetching ECB Interest Rates via API¶
Not all data requires scraping HTML. Many institutions provide structured data through APIs (Application Programming Interfaces). APIs return data in clean JSON or XML format, which is way easier to work with than raw HTML.
The European Central Bank provides economic data through its Data Portal API (the successor to the Statistical Data Warehouse).
import requests
import pandas as pd
# ECB SDW API: Main Refinancing Operations rate
url = "https://data-api.ecb.europa.eu/service/data/FM/M.U2.EUR.4F.KR.MRR_FR.LEV"
headers = {"Accept": "application/json"}
response = requests.get(url, headers=headers)
print(f"Status: {response.status_code}")
data = response.json()
# Navigate the JSON structure to extract the time series
observations = data["dataSets"][0]["series"]["0:0:0:0:0:0:0"]["observations"]
time_periods = data["structure"]["dimensions"]["observation"][0]["values"]
# Build a DataFrame
records = []
for i, period_info in enumerate(time_periods):
    records.append({
        "date": period_info["id"],
        "rate": observations[str(i)][0]
    })
ecb_rates = pd.DataFrame(records)
ecb_rates["date"] = pd.to_datetime(ecb_rates["date"])
ecb_rates["rate"] = ecb_rates["rate"].astype(float)
print(ecb_rates.tail(10))
Output (last 10 entries):
date rate
2025-07-01 3.40
2025-08-01 3.40
2025-09-01 3.15
2025-10-01 3.15
2025-11-01 2.90
...
Plotting the Rate History¶
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 5))
plt.plot(ecb_rates["date"], ecb_rates["rate"], linewidth=1.5, color="#1f77b4")
plt.title("ECB Main Refinancing Operations Rate", fontsize=14)
plt.xlabel("Date")
plt.ylabel("Rate (%)")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("ecb_rate_history.png", dpi=150)
plt.show()
Key Insight: Always prefer an API over HTML scraping when one is available. APIs return structured data (JSON or XML), are designed for programmatic access, are less likely to break when the website redesigns, and are generally faster. Most central banks, government agencies, and financial data providers offer APIs.
Scraping HTML is like reading a book by photographing every page and running OCR. Using an API is like getting the text file directly from the author. Always check for an API first.
Project 3: Building a Multi-Stock Data Fetcher¶
Let us build a proper data pipeline that fetches stock data for multiple tickers, handles errors gracefully, and outputs a clean consolidated dataset.
We will use the yfinance library, which wraps Yahoo Finance’s data:
pip install yfinance
import yfinance as yf
import pandas as pd
from datetime import datetime, timedelta
def fetch_stock_data(tickers, period="1mo"):
    """
    Fetch historical stock data for a list of tickers.

    Parameters:
        tickers: list of ticker symbols (e.g., ["AAPL", "MSFT", "GOOGL"])
        period: data period ("1d", "5d", "1mo", "3mo", "6mo", "1y", "5y")

    Returns:
        dict of DataFrames, one per ticker
    """
    results = {}
    failed = []
    for ticker in tickers:
        try:
            stock = yf.Ticker(ticker)
            hist = stock.history(period=period)
            if hist.empty:
                print(f"  WARNING: No data for {ticker}")
                failed.append(ticker)
                continue
            results[ticker] = hist
            print(f"  OK: {ticker} - {len(hist)} rows")
        except Exception as e:
            print(f"  ERROR: {ticker} - {e}")
            failed.append(ticker)
    if failed:
        print(f"\nFailed tickers: {failed}")
    return results
# Fetch data for a portfolio
tickers = ["AAPL", "MSFT", "GOOGL", "AMZN", "NVDA", "JPM", "GS", "V"]
print("Fetching stock data...\n")
stock_data = fetch_stock_data(tickers, period="3mo")
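The dict of per-ticker DataFrames is convenient for inspection, but many analyses want a single wide table of closing prices. A hedged sketch (combine_closes is my own helper, shown on synthetic data so it runs without a network connection):

```python
import pandas as pd

def combine_closes(stock_data):
    """Combine a {ticker: OHLCV DataFrame} dict into one wide DataFrame of closes."""
    return pd.concat(
        {ticker: df["Close"] for ticker, df in stock_data.items()},
        axis=1,  # one column per ticker, aligned on the date index
    )

# Synthetic stand-in for the dict returned by fetch_stock_data
dates = pd.date_range("2024-01-01", periods=3)
stock_data = {
    "AAPL": pd.DataFrame({"Close": [148.0, 149.5, 148.2]}, index=dates),
    "MSFT": pd.DataFrame({"Close": [374.0, 375.1, 375.5]}, index=dates),
}

closes = combine_closes(stock_data)
print(closes)
```

As an aside, yfinance can also fetch several tickers in one call via yf.download(tickers, period="3mo"), which returns a similar wide frame directly.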
Building a Summary Dashboard¶
def build_summary(stock_data):
    """Build a summary table from fetched stock data."""
    summary = []
    for ticker, df in stock_data.items():
        if len(df) < 2:  # Need at least two rows to compute a daily change
            continue
        current = df["Close"].iloc[-1]
        previous = df["Close"].iloc[-2]
        first = df["Close"].iloc[0]
        summary.append({
            "Ticker": ticker,
            "Current Price": round(current, 2),
            "Daily Change (%)": round((current - previous) / previous * 100, 2),
            "Period Change (%)": round((current - first) / first * 100, 2),
            "Period High": round(df["High"].max(), 2),
            "Period Low": round(df["Low"].min(), 2),
            "Avg Volume": int(df["Volume"].mean()),
        })
    return pd.DataFrame(summary)
dashboard = build_summary(stock_data)
print(dashboard.to_string(index=False))
Output:
Ticker Current Price Daily Change (%) Period Change (%) Period High Period Low Avg Volume
AAPL 148.20 0.85 5.32 152.40 135.10 58234100
MSFT 375.50 -0.32 3.18 382.10 355.20 22156000
GOOGL 152.80 1.15 8.45 155.30 138.90 25890000
AMZN 185.40 0.62 6.78 189.50 170.20 43120000
NVDA 825.30 2.10 15.60 840.00 695.50 38450000
JPM 195.20 -0.45 2.30 198.80 186.40 10230000
GS 425.60 0.28 4.15 432.10 398.50 2890000
V 275.80 0.52 3.95 280.20 260.10 7560000
Calculating Correlations¶
# Build a DataFrame of closing prices
closes = pd.DataFrame({
    ticker: df["Close"]
    for ticker, df in stock_data.items()
})
# Calculate correlation matrix
correlation = closes.corr().round(3)
print("\nCorrelation Matrix:")
print(correlation)
This shows how stocks move relative to each other. High correlation means they tend to move together. Low correlation means they provide diversification benefits.
If all your stocks have a 0.95 correlation with each other, you do not have a diversified portfolio. You have one bet with extra steps.
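One refinement: correlating price levels tends to overstate co-movement, because most stocks drift upward together over time. Practitioners usually correlate daily returns instead. A sketch on synthetic prices:

```python
import pandas as pd

closes = pd.DataFrame({
    "AAPL": [148.0, 149.5, 148.2, 150.1],
    "MSFT": [374.0, 375.1, 375.5, 377.2],
})

# pct_change() converts price levels to period-over-period returns;
# the first row is NaN and is ignored pairwise by corr()
returns = closes.pct_change()
correlation = returns.corr().round(3)
print(correlation)
```

Return correlations are what you actually want for diversification questions; price-level correlations mostly measure shared trend.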
Project 4: Scraping HTML Tables with BeautifulSoup¶
When there is no API and pd.read_html() does not parse the table correctly, you need to do it manually with BeautifulSoup. Here is a robust, reusable table scraper:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
class TableScraper:
    """A reusable scraper for HTML tables."""

    def __init__(self, headers=None):
        self.session = requests.Session()
        self.session.headers.update(headers or {
            "User-Agent": (
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/120.0.0.0 Safari/537.36"
            )
        })

    def fetch_page(self, url):
        """Fetch a page and return a BeautifulSoup object."""
        response = self.session.get(url, timeout=10)
        response.raise_for_status()
        return BeautifulSoup(response.text, "html.parser")

    def extract_table(self, soup, table_selector):
        """
        Extract a table into a DataFrame.

        table_selector: CSS selector for the target table
            (e.g., "table.data-table", "table#results")
        """
        table = soup.select_one(table_selector)
        if table is None:
            raise ValueError(f"No table found for selector: {table_selector}")
        # Extract headers
        header_row = table.select_one("thead tr") or table.select_one("tr")
        headers = [th.get_text(strip=True) for th in header_row.find_all(["th", "td"])]
        # Extract data rows
        rows = []
        body = table.select_one("tbody") or table
        for tr in body.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells and len(cells) == len(headers):
                rows.append(dict(zip(headers, cells)))
        return pd.DataFrame(rows)

    def scrape_multiple(self, urls, table_selector, delay=1.0):
        """Scrape the same table structure from multiple pages."""
        all_data = []
        for url in urls:
            try:
                soup = self.fetch_page(url)
                df = self.extract_table(soup, table_selector)
                all_data.append(df)
                print(f"  OK: {url} ({len(df)} rows)")
            except Exception as e:
                print(f"  ERROR: {url} - {e}")
            time.sleep(delay)  # Be polite
        if all_data:
            return pd.concat(all_data, ignore_index=True)
        return pd.DataFrame()
# Usage example
scraper = TableScraper()
soup = scraper.fetch_page(
"https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
)
sp500 = scraper.extract_table(soup, "table.wikitable")
print(f"Extracted {len(sp500)} rows")
Key Insight: Always add a User-Agent header to your requests. Many websites block requests that do not include one, because the default python-requests/2.x user agent is a known scraping signature. You are not trying to be sneaky. You are just being a polite citizen of the internet.
Best Practices for Web Scraping¶
1. Rate Limiting¶
Never hammer a server with rapid-fire requests. Add delays between requests.
import time
for url in urls:
    data = scrape(url)
    time.sleep(1)  # Wait 1 second between requests
A good rule of thumb: one request per second for public data. Slower for smaller sites.
2. Check robots.txt¶
Before scraping any site, check https://example.com/robots.txt. This file tells you what paths are allowed and disallowed for automated access.
response = requests.get("https://example.com/robots.txt")
print(response.text)
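You do not have to interpret the file by eye: the standard library's urllib.robotparser answers "may I fetch this path?" directly. A sketch parsing a rules snippet inline (in practice you would call rp.set_url(...) and rp.read() against the live robots.txt):

```python
from urllib.robotparser import RobotFileParser

robots_txt = """
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/data/prices.html"))     # True
print(rp.can_fetch("*", "https://example.com/private/report.html"))  # False
```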
3. Cache Responses¶
Do not re-scrape the same page every time you run your script. Cache the HTML locally.
import os
import hashlib
def cached_fetch(url, cache_dir="cache"):
    """Fetch a URL, caching the response locally."""
    os.makedirs(cache_dir, exist_ok=True)
    # Create a filename from the URL hash
    filename = hashlib.md5(url.encode()).hexdigest() + ".html"
    filepath = os.path.join(cache_dir, filename)
    if os.path.exists(filepath):
        with open(filepath, "r", encoding="utf-8") as f:
            return f.read()
    response = requests.get(url)
    response.raise_for_status()
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(response.text)
    return response.text
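Cached pages eventually go stale, so it is worth bounding their age. A small helper (the name is_fresh is my own) that treats files older than a cutoff as cache misses:

```python
import os
import time

def is_fresh(filepath, max_age_seconds=86400):
    """Return True if the file exists and is younger than max_age_seconds."""
    if not os.path.exists(filepath):
        return False
    age = time.time() - os.path.getmtime(filepath)
    return age < max_age_seconds
```

In cached_fetch, swapping the os.path.exists check for is_fresh(filepath) would re-download anything older than a day.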
4. Handle Failures Gracefully¶
Networks fail. Servers go down. Pages change. Always use try/except and implement retries.
from time import sleep
def fetch_with_retry(url, max_retries=3, delay=2):
    """Fetch a URL with automatic retry on failure."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            print(f"  Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                sleep(delay * (2 ** attempt))  # Exponential backoff: 2s, 4s, ...
    raise Exception(f"Failed to fetch {url} after {max_retries} attempts")
5. API vs. Scraping Decision Matrix¶
| Factor | Use API | Use Scraping |
|---|---|---|
| Structured data available? | Yes, API returns JSON/XML | No API, data only in HTML |
| Data stability | API contracts are versioned | HTML can change anytime |
| Rate limits | Usually documented | You have to guess |
| Legal clarity | Usually has terms of use | Check robots.txt + ToS |
| Ease of use | Higher (structured data) | Lower (parse HTML manually) |
| Reliability | Higher | Lower (sites change layouts) |
Always prefer APIs. Always. Scraping is the backup plan for when the data you need does not have an API. Think of scraping as breaking the glass on the emergency exit. It works, but check if the front door is open first.
Common Pitfalls¶
- Not checking the status code. If you get a 403 and try to parse the response, you will parse an error page and get empty results. Always check response.status_code == 200 before proceeding.
- Hardcoding selectors. A selector like div:nth-child(3) > table > tbody > tr:nth-child(2) > td:nth-child(5) will break the moment the site adds a banner or rearranges elements. Use IDs and classes when possible. They are more stable.
- Not handling encoding. Some pages use unusual character encodings. If you see garbled text, check response.encoding and set it manually if needed: response.encoding = "utf-8".
- Scraping JavaScript-rendered content. If the data you want is loaded dynamically by JavaScript, requests will not see it (it only gets the initial HTML). You need either the site’s underlying API (check the browser’s Network tab) or a browser automation tool like Selenium or Playwright. But try the Network tab first.
- Not being polite. Sending 100 requests per second to a small website can effectively DDoS it. Add delays, respect robots.txt, and cache aggressively. Being a good internet citizen is not optional.
- Forgetting that websites change. Your scraper will break. Not if, when. Build your code to fail gracefully and alert you when the structure changes, rather than silently returning wrong data.
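A cheap way to fail loudly is to validate the scraped frame against what you expect before using it. A sketch (validate_scrape is my own name, not a library function):

```python
import pandas as pd

def validate_scrape(df, required_columns, min_rows=1):
    """Raise ValueError if a scraped DataFrame does not look as expected."""
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    if len(df) < min_rows:
        raise ValueError(f"Expected at least {min_rows} rows, got {len(df)}")
    return df

df = pd.DataFrame({"Ticker": ["AAPL"], "Price": [148.2]})
validate_scrape(df, ["Ticker", "Price"])  # passes silently
```

Run a check like this at the end of every scraping pipeline; a ValueError today beats six months of silently wrong data.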
Wrapping Up¶
Web scraping is one of those skills that pays for itself immediately. The first time you automate a 30-minute manual data collection task into a 3-second Python script, you will wonder why you waited so long.
For financial data specifically, start with APIs whenever they are available. The ECB, Federal Reserve, World Bank, and many exchanges provide clean, structured data through APIs. When APIs are not available, pandas.read_html() handles simple tables in one line. For everything else, requests + BeautifulSoup gives you surgical precision over what you extract.
Build your scrapers with rate limiting, caching, error handling, and graceful failure. The website you are scraping does not owe you anything, so be a respectful consumer. And always remember: your scraper will break eventually. Build it so that when it does, you find out immediately rather than discovering six months later that you have been working with stale data.
And if your boss asks why you are reading HTML source code at your desk, tell them you are “performing automated data acquisition from unstructured web sources.” It sounds way better than “I wrote a script so I don’t have to copy-paste anymore.”
Cheat Sheet¶
Key Questions & Answers¶
What is web scraping?¶
Automated extraction of data from websites using code. You send HTTP requests to get the page HTML, parse the HTML to find the data you want, and extract it into a structured format like a CSV or DataFrame.
When should I use an API instead of scraping?¶
Always prefer APIs when available. APIs return structured data (JSON/XML), are designed for programmatic access, have documented rate limits, and do not break when the website redesigns. Only scrape HTML when no API exists for the data you need.
What is the difference between requests and BeautifulSoup?¶
requests fetches the web page (sends the HTTP request, receives the HTML). BeautifulSoup parses the HTML to find and extract specific elements. You need both: requests to get the page, BeautifulSoup to read it. Think of requests as the delivery truck and BeautifulSoup as the unboxing crew.
How do I handle JavaScript-rendered content?¶
requests only gets the initial HTML. If data is loaded by JavaScript, first check the browser’s Network tab (F12 > Network) for API calls the page makes, as you can often call those APIs directly. If that fails, use Selenium or Playwright to render the page in a real browser.
Key Concepts at a Glance¶
| Concept | Summary |
|---|---|
| requests.get(url) | Sends an HTTP GET request, returns the response |
| response.status_code | HTTP status (200 = OK, 403 = blocked, 404 = not found) |
| BeautifulSoup(html, "html.parser") | Parses HTML into a navigable tree |
| soup.find("tag", attrs) | Finds the first matching element |
| soup.find_all("tag") | Finds all matching elements (returns a list) |
| soup.select("css selector") | Finds elements using CSS selectors |
| element.text / .get_text(strip=True) | Extracts text content from an element |
| pd.read_html(url) | Scrapes all HTML tables from a URL into DataFrames |
| User-Agent header | Identifies your client; set one to avoid being blocked |
| robots.txt | File that specifies which paths can be scraped |
| Rate limiting | Adding delays between requests to be polite |
| Caching | Saving responses locally to avoid re-fetching |
| yfinance | Python wrapper for Yahoo Finance data |
| try/except with retry | Handle network failures gracefully |
| CSS selectors | Powerful syntax for targeting HTML elements |
Sources & Further Reading¶
- Mitchell, R., Web Scraping with Python, O’Reilly
- Requests Library, Quickstart Guide
- BeautifulSoup Documentation, Beautiful Soup 4
- pandas Documentation, pandas.read_html
- yfinance Documentation, yfinance on PyPI
- ECB Statistical Data Warehouse, API Documentation
- Mozilla Developer Network, HTTP Status Codes
- Google, Robots.txt Specification
- Real Python, Web Scraping with Python