
Web Scraping 101 - Extracting Financial Data with Python

2026-04-02 19 min read Computer Science

Every finance professional has done it. You open a website, see a table full of numbers you need, and start copying them into Excel. One cell at a time. Ctrl+C, click, Ctrl+V, repeat. After 20 minutes you realize you are only halfway through, you have already made three paste errors, and your will to live is declining faster than a meme stock after earnings.

There is a better way. It is called web scraping, and it is the art of writing code that extracts data from websites automatically. What takes you 20 minutes of painful copy-pasting takes Python about 3 seconds. And unlike you at 4 PM on a Friday, Python does not make paste errors.

Web scraping is one of the most practically useful skills you can learn as someone who works with data. It sits at the intersection of programming, data engineering, and “I refuse to do this manually one more time.” In finance specifically, it opens the door to pulling market data, economic indicators, central bank publications, earnings reports, and anything else that lives on a web page.

The first rule of web scraping: if you are copying data from a website by hand, you are doing it wrong.


What Is Web Scraping?

Web scraping is the automated extraction of data from websites. At its core, it works like this:

  1. Request a web page (just like your browser does)
  2. Parse the HTML content to find the data you want
  3. Extract the data into a structured format (like a DataFrame or CSV)
  4. Store or analyze the data
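The four steps above can be sketched in a few lines. Step 1 is stubbed out with a literal HTML string here so the sketch runs offline; with a live page you would use requests.get(url).text instead.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Step 1 (requesting the page) is replaced by a literal HTML string
html = "<html><body><h1>Rates</h1><p>Deposit rate: 2.00%</p></body></html>"

# Step 2: parse the HTML into a navigable tree
soup = BeautifulSoup(html, "html.parser")

# Step 3: extract the piece of data you care about
value = soup.find("p").get_text(strip=True)

# Step 4: store it in a structured format
df = pd.DataFrame([{"metric": value}])
print(df)
```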

When you visit a website, your browser sends an HTTP request to a server, receives HTML back, and renders it into the pretty page you see. Web scraping does the same thing, except instead of rendering the page visually, it reads the raw HTML and pulls out the numbers and text you care about.

Key Insight: Web scraping is not hacking. You are making the same HTTP requests a browser makes. The data is public (the website is showing it to anyone who visits). You are just automating the “reading” part. That said, always check a website’s robots.txt file and terms of service before scraping. Some sites explicitly prohibit it, and hitting a server with thousands of requests per second is not polite.

Think of it this way: web scraping is reading a newspaper, but with a robot arm that clips the articles for you. You are not breaking in. You are just reading faster.


The Python Toolkit

You need three libraries. That is it.

1. requests: The Fetcher

requests sends HTTP requests and returns the response. It is how you “visit” a web page from Python.

pip install requests

2. BeautifulSoup: The Parser

BeautifulSoup (from the bs4 package) parses HTML and lets you navigate the document tree to find specific elements. It is how you say “give me the third table on this page.”

pip install beautifulsoup4

3. pandas: The Organizer

pandas takes the extracted data and turns it into clean DataFrames. It also has a built-in read_html() function that can scrape tables directly.

pip install pandas

Or install all three at once:

pip install requests beautifulsoup4 pandas

Three libraries. Three purposes. No more, no less. If someone tells you that you need Selenium to scrape a simple table, they are overcomplicating it.


HTTP Requests: How the Web Works (In 2 Minutes)

Before we scrape anything, a quick refresher on how the web actually works.

The Request-Response Cycle

Your Script                    Web Server
    |                              |
    |--- GET /page.html ---------->|
    |                              |
    |<-- 200 OK + HTML content ----|
    |                              |
  1. Your script sends a GET request to a URL
  2. The server responds with a status code (200 = success) and the HTML content
  3. You parse the HTML to extract data

Common Status Codes

Code      Meaning               What to Do
200       OK, here is the page  Parse it
301/302   Redirected            requests handles this automatically
403       Forbidden             You are being blocked; add headers
404       Page not found        Check your URL
429       Too many requests     Slow down, add delays
500       Server error          Not your fault; retry later
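Those reactions are easy to encode. Here is a hypothetical helper (not part of requests) that maps a status code to the action from the table:

```python
def classify_status(code):
    """Map an HTTP status code to a suggested action
    (a hypothetical helper, not part of requests)."""
    if code == 200:
        return "parse it"
    if code in (301, 302):
        return "nothing to do; requests follows redirects automatically"
    if code == 403:
        return "add headers; you are probably being blocked"
    if code == 404:
        return "check your URL"
    if code == 429:
        return "slow down and add delays"
    if code >= 500:
        return "retry later; the problem is server-side"
    return "inspect the response manually"

print(classify_status(200))  # parse it
print(classify_status(429))  # slow down and add delays
```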

Your First Request

import requests

url = "https://example.com"
response = requests.get(url)

print(f"Status code: {response.status_code}")
print(f"Content length: {len(response.text)} characters")
print(response.text[:500])  # First 500 characters of HTML

Key Insight: Always check response.status_code before parsing. If you get a 403 or 429, your parsing code will silently return empty results, and you will spend an hour debugging your selectors when the real problem was that the server blocked you on the first line.


HTML Basics: What You Are Parsing

HTML is a tree of nested tags. Understanding the basics is essential for scraping.

<html>
  <body>
    <h1>Stock Prices</h1>
    <table class="data-table" id="prices">
      <thead>
        <tr>
          <th>Ticker</th>
          <th>Price</th>
          <th>Change</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>AAPL</td>
          <td>148.20</td>
          <td class="positive">+2.35%</td>
        </tr>
        <tr>
          <td>MSFT</td>
          <td>375.50</td>
          <td class="negative">-1.18%</td>
        </tr>
      </tbody>
    </table>
  </body>
</html>

Key HTML concepts for scraping:

  • Tags like <table>, <tr> (table row), <td> (table data), <div>, <span>
  • Attributes like class="data-table" and id="prices" help you identify specific elements
  • Nesting: <td> tags are inside <tr> tags, which are inside <tbody>, which is inside <table>

You do not need to be an HTML expert. You need to know enough to right-click an element in your browser, click “Inspect”, and understand what you are looking at. That is about 15 minutes of learning.


BeautifulSoup: Finding What You Need

Parsing HTML

from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <h1>Market Data</h1>
    <table id="prices">
      <tr><th>Ticker</th><th>Price</th></tr>
      <tr><td>AAPL</td><td>148.20</td></tr>
      <tr><td>MSFT</td><td>375.50</td></tr>
    </table>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

Finding Elements

# Find by tag name (returns the first match)
title = soup.find("h1")
print(title.text)  # "Market Data"

# Find by ID (unique identifier)
table = soup.find("table", id="prices")

# Find by class
# soup.find("div", class_="container")

# Find ALL matching elements (returns a list)
rows = soup.find_all("tr")
print(f"Found {len(rows)} rows")  # 3 (header + 2 data rows)

# CSS selectors (most flexible)
cells = soup.select("table#prices tr td")
for cell in cells:
    print(cell.text)  # AAPL, 148.20, MSFT, 375.50

Extracting Table Data

# Extract a full table into a list of dictionaries
table = soup.find("table", id="prices")
headers = [th.text for th in table.find_all("th")]
rows = []

for tr in table.find_all("tr")[1:]:  # Skip header row
    cells = [td.text.strip() for td in tr.find_all("td")]
    rows.append(dict(zip(headers, cells)))

print(rows)
# [{'Ticker': 'AAPL', 'Price': '148.20'},
#  {'Ticker': 'MSFT', 'Price': '375.50'}]

Key Insight: The soup.select() method supports CSS selectors, which are incredibly powerful. table#prices tr td means “find all td elements inside tr elements inside a table with id prices.” Learn CSS selectors and you can pinpoint any element on any page.
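A few selector patterns cover most scraping jobs. This small self-contained example (the HTML is made up for illustration) shows selecting by class, by ID plus descendant, and by position:

```python
from bs4 import BeautifulSoup

html = """
<table id="prices">
  <tr><td class="ticker">AAPL</td><td class="positive">+2.35%</td></tr>
  <tr><td class="ticker">MSFT</td><td class="negative">-1.18%</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# By class
print([td.text for td in soup.select("td.ticker")])         # ['AAPL', 'MSFT']

# By ID, then descendant class
print(soup.select_one("#prices .positive").text)            # +2.35%

# By position (second row, first cell)
print(soup.select_one("tr:nth-of-type(2) td").text)         # MSFT
```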


Project 1: Scraping Wikipedia Financial Tables

Let us start with something simple. Wikipedia has tables of data on almost every financial topic. Let us pull the list of S&P 500 companies.

import pandas as pd

# pandas.read_html() is magic: it finds ALL tables on a page
url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
tables = pd.read_html(url)

# The first table is the current S&P 500 constituents
sp500 = tables[0]
print(f"Number of companies: {len(sp500)}")
print(sp500.head(10))

Output:

Number of companies: 503
   Symbol               Security  GICS Sector  ...
0     MMM           3M Company  Industrials  ...
1     AOS     A. O. Smith Corp  Industrials  ...
2     ABT  Abbott Laboratories  Health Care  ...
3    ABBV          AbbVie Inc.  Health Care  ...
4     ACN       Accenture plc    IT          ...

That is one line of actual scraping code. pd.read_html() finds every <table> tag on the page and returns a list of DataFrames. It handles headers, row parsing, and type conversion automatically.
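On pages with dozens of tables, the match parameter narrows the result to tables whose text matches a pattern. A small offline sketch (the HTML is made up; StringIO wraps literal HTML because recent pandas versions deprecate passing raw strings to read_html):

```python
from io import StringIO
import pandas as pd

# Two tables, but only one contains the text "Ticker"
html = """
<table><tr><th>Ticker</th><th>Price</th></tr>
       <tr><td>AAPL</td><td>148.20</td></tr></table>
<table><tr><th>Country</th><th>GDP</th></tr>
       <tr><td>DE</td><td>4.1</td></tr></table>
"""

# `match` keeps only tables whose text matches the pattern
tables = pd.read_html(StringIO(html), match="Ticker")
print(len(tables))                  # 1
print(tables[0].columns.tolist())   # ['Ticker', 'Price']
```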

Analyzing the Data

# Sector breakdown
sector_counts = sp500["GICS Sector"].value_counts()
print(sector_counts)

# Industrials           78
# Information Technology 76
# Financials            72
# Health Care           64
# Consumer Discretionary 53
# ...

# Filter for tech stocks
tech_stocks = sp500[sp500["GICS Sector"] == "Information Technology"]
print(f"Tech stocks in S&P 500: {len(tech_stocks)}")
print(tech_stocks[["Symbol", "Security"]].head())

Saving to CSV or Excel

# Save to CSV
sp500.to_csv("sp500_companies.csv", index=False)

# Save to Excel
sp500.to_excel("sp500_companies.xlsx", index=False)

You just downloaded, parsed, and saved the entire S&P 500 constituent list in 4 lines of Python. The same task in Excel would involve 20 minutes of copy-paste and at least one “I accidentally pasted it in the wrong column” moment.


Project 2: Fetching ECB Interest Rates via API

Not all data requires scraping HTML. Many institutions provide structured data through APIs (Application Programming Interfaces). APIs return data in clean JSON or XML format, which is way easier to work with than raw HTML.

The European Central Bank provides economic data through its Data Portal API (the successor to the Statistical Data Warehouse).

import requests
import pandas as pd

# ECB SDW API: Main Refinancing Operations rate
url = "https://data-api.ecb.europa.eu/service/data/FM/M.U2.EUR.4F.KR.MRR_FR.LEV"
headers = {"Accept": "application/json"}

response = requests.get(url, headers=headers)
print(f"Status: {response.status_code}")

data = response.json()

# Navigate the JSON structure to extract the time series
observations = data["dataSets"][0]["series"]["0:0:0:0:0:0:0"]["observations"]
time_periods = data["structure"]["dimensions"]["observation"][0]["values"]

# Build a DataFrame
records = []
for i, period_info in enumerate(time_periods):
    records.append({
        "date": period_info["id"],
        "rate": observations[str(i)][0]
    })

ecb_rates = pd.DataFrame(records)
ecb_rates["date"] = pd.to_datetime(ecb_rates["date"])
ecb_rates["rate"] = ecb_rates["rate"].astype(float)

print(ecb_rates.tail(10))

Output (last 10 entries):

         date  rate
  2025-07-01  3.40
  2025-08-01  3.40
  2025-09-01  3.15
  2025-10-01  3.15
  2025-11-01  2.90
  ...

Plotting the Rate History

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 5))
plt.plot(ecb_rates["date"], ecb_rates["rate"], linewidth=1.5, color="#1f77b4")
plt.title("ECB Main Refinancing Operations Rate", fontsize=14)
plt.xlabel("Date")
plt.ylabel("Rate (%)")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("ecb_rate_history.png", dpi=150)
plt.show()

Key Insight: Always prefer an API over HTML scraping when one is available. APIs return structured data (JSON or XML), are designed for programmatic access, are less likely to break when the website redesigns, and are generally faster. Most central banks, government agencies, and financial data providers offer APIs.

Scraping HTML is like reading a book by photographing every page and running OCR. Using an API is like getting the text file directly from the author. Always check for an API first.


Project 3: Building a Multi-Stock Data Fetcher

Let us build a proper data pipeline that fetches stock data for multiple tickers, handles errors gracefully, and outputs a clean consolidated dataset.

We will use the yfinance library, which wraps Yahoo Finance’s data:

pip install yfinance

import yfinance as yf
import pandas as pd

def fetch_stock_data(tickers, period="1mo"):
    """
    Fetch historical stock data for a list of tickers.

    Parameters:
        tickers: list of ticker symbols (e.g., ["AAPL", "MSFT", "GOOGL"])
        period: data period ("1d", "5d", "1mo", "3mo", "6mo", "1y", "5y")

    Returns:
        dict of DataFrames, one per ticker
    """
    results = {}
    failed = []

    for ticker in tickers:
        try:
            stock = yf.Ticker(ticker)
            hist = stock.history(period=period)

            if hist.empty:
                print(f"  WARNING: No data for {ticker}")
                failed.append(ticker)
                continue

            results[ticker] = hist
            print(f"  OK: {ticker} - {len(hist)} rows")

        except Exception as e:
            print(f"  ERROR: {ticker} - {e}")
            failed.append(ticker)

    if failed:
        print(f"\nFailed tickers: {failed}")

    return results


# Fetch data for a portfolio
tickers = ["AAPL", "MSFT", "GOOGL", "AMZN", "NVDA", "JPM", "GS", "V"]
print("Fetching stock data...\n")
stock_data = fetch_stock_data(tickers, period="3mo")

Building a Summary Dashboard

def build_summary(stock_data):
    """Build a summary table from fetched stock data."""
    summary = []

    for ticker, df in stock_data.items():
        if len(df) < 2:  # Need at least two closes for a daily change
            continue

        current = df["Close"].iloc[-1]
        previous = df["Close"].iloc[-2]
        first = df["Close"].iloc[0]

        summary.append({
            "Ticker": ticker,
            "Current Price": round(current, 2),
            "Daily Change (%)": round((current - previous) / previous * 100, 2),
            "Period Change (%)": round((current - first) / first * 100, 2),
            "Period High": round(df["High"].max(), 2),
            "Period Low": round(df["Low"].min(), 2),
            "Avg Volume": int(df["Volume"].mean()),
        })

    return pd.DataFrame(summary)


dashboard = build_summary(stock_data)
print(dashboard.to_string(index=False))

Output:

Ticker  Current Price  Daily Change (%)  Period Change (%)  Period High  Period Low  Avg Volume
  AAPL         148.20              0.85               5.32       152.40      135.10    58234100
  MSFT         375.50             -0.32               3.18       382.10      355.20    22156000
 GOOGL         152.80              1.15               8.45       155.30      138.90    25890000
  AMZN         185.40              0.62               6.78       189.50      170.20    43120000
  NVDA         825.30              2.10              15.60       840.00      695.50    38450000
   JPM         195.20             -0.45               2.30       198.80      186.40    10230000
    GS         425.60              0.28               4.15       432.10      398.50     2890000
     V         275.80              0.52               3.95       280.20      260.10     7560000

Calculating Correlations

# Build a DataFrame of closing prices
closes = pd.DataFrame({
    ticker: df["Close"]
    for ticker, df in stock_data.items()
})

# Calculate correlation matrix
correlation = closes.corr().round(3)
print("\nCorrelation Matrix:")
print(correlation)

This shows how stocks move relative to each other. High correlation means they tend to move together. Low correlation means they provide diversification benefits.

If all your stocks have a 0.95 correlation with each other, you do not have a diversified portfolio. You have one bet with extra steps.
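A heatmap makes the correlation matrix much easier to scan than raw numbers. A minimal sketch with plain matplotlib; the correlation values here are a small stand-in, where in the pipeline above you would pass the output of closes.corr():

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in correlation matrix for illustration
correlation = pd.DataFrame(
    [[1.00, 0.82, 0.35],
     [0.82, 1.00, 0.41],
     [0.35, 0.41, 1.00]],
    index=["AAPL", "MSFT", "JPM"], columns=["AAPL", "MSFT", "JPM"],
)

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(correlation.values, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(correlation.columns)))
ax.set_xticklabels(correlation.columns)
ax.set_yticks(range(len(correlation.index)))
ax.set_yticklabels(correlation.index)
fig.colorbar(im, label="Correlation")
plt.tight_layout()
plt.savefig("correlation_heatmap.png", dpi=150)
```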


Project 4: Scraping HTML Tables with BeautifulSoup

When there is no API and pd.read_html() does not parse the table correctly, you need to do it manually with BeautifulSoup. Here is a robust, reusable table scraper:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

class TableScraper:
    """A reusable scraper for HTML tables."""

    def __init__(self, headers=None):
        self.session = requests.Session()
        self.session.headers.update(headers or {
            "User-Agent": (
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/120.0.0.0 Safari/537.36"
            )
        })

    def fetch_page(self, url):
        """Fetch a page and return a BeautifulSoup object."""
        response = self.session.get(url, timeout=10)
        response.raise_for_status()
        return BeautifulSoup(response.text, "html.parser")

    def extract_table(self, soup, table_selector):
        """
        Extract a table into a list of dictionaries.

        table_selector: CSS selector for the target table
                        (e.g., "table.data-table", "table#results")
        """
        table = soup.select_one(table_selector)
        if table is None:
            raise ValueError(f"No table found for selector: {table_selector}")

        # Extract headers
        header_row = table.select_one("thead tr") or table.select_one("tr")
        headers = [th.get_text(strip=True) for th in header_row.find_all(["th", "td"])]

        # Extract data rows
        rows = []
        body = table.select_one("tbody") or table
        for tr in body.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells and len(cells) == len(headers):
                rows.append(dict(zip(headers, cells)))

        return pd.DataFrame(rows)

    def scrape_multiple(self, urls, table_selector, delay=1.0):
        """Scrape the same table structure from multiple pages."""
        all_data = []
        for url in urls:
            try:
                soup = self.fetch_page(url)
                df = self.extract_table(soup, table_selector)
                all_data.append(df)
                print(f"  OK: {url} ({len(df)} rows)")
            except Exception as e:
                print(f"  ERROR: {url} - {e}")
            time.sleep(delay)  # Be polite

        if all_data:
            return pd.concat(all_data, ignore_index=True)
        return pd.DataFrame()


# Usage example
scraper = TableScraper()
soup = scraper.fetch_page(
    "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
)
sp500 = scraper.extract_table(soup, "table.wikitable")
print(f"Extracted {len(sp500)} rows")

Key Insight: Always add a User-Agent header to your requests. Many websites block requests that do not include one, because the default python-requests/2.x user agent is a known scraping signature. You are not trying to be sneaky. You are just being a polite citizen of the internet.


Best Practices for Web Scraping

1. Rate Limiting

Never hammer a server with rapid-fire requests. Add delays between requests.

import time

for url in urls:
    data = scrape(url)
    time.sleep(1)  # Wait 1 second between requests

A good rule of thumb: one request per second for public data. Slower for smaller sites.

2. Check robots.txt

Before scraping any site, check https://example.com/robots.txt. This file tells you what paths are allowed and disallowed for automated access.

response = requests.get("https://example.com/robots.txt")
print(response.text)
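You do not have to parse the file by eye: the standard library's urllib.robotparser answers "may I fetch this path?" directly. The rules here are fed in as text for illustration; rp.set_url(...) followed by rp.read() would load a live robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Example rules: everything allowed except /private/
rules = """User-agent: *
Disallow: /private/
Allow: /"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/data.html"))   # True
print(rp.can_fetch("*", "https://example.com/private/x"))   # False
```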

3. Cache Responses

Do not re-scrape the same page every time you run your script. Cache the HTML locally.

import os
import hashlib

def cached_fetch(url, cache_dir="cache"):
    """Fetch a URL, caching the response locally."""
    os.makedirs(cache_dir, exist_ok=True)

    # Create a filename from the URL hash
    filename = hashlib.md5(url.encode()).hexdigest() + ".html"
    filepath = os.path.join(cache_dir, filename)

    if os.path.exists(filepath):
        with open(filepath, "r", encoding="utf-8") as f:
            return f.read()

    response = requests.get(url)
    response.raise_for_status()

    with open(filepath, "w", encoding="utf-8") as f:
        f.write(response.text)

    return response.text

4. Handle Failures Gracefully

Networks fail. Servers go down. Pages change. Always use try/except and implement retries.

from time import sleep

def fetch_with_retry(url, max_retries=3, delay=2):
    """Fetch a URL with automatic retry on failure."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            print(f"  Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                sleep(delay * (2 ** attempt))  # Exponential backoff: 2s, then 4s

    raise Exception(f"Failed to fetch {url} after {max_retries} attempts")

5. API vs. Scraping Decision Matrix

Factor                      Use API                      Use Scraping
Structured data available?  Yes, API returns JSON/XML    No API, data only in HTML
Data stability              API contracts are versioned  HTML can change anytime
Rate limits                 Usually documented           You have to guess
Legal clarity               Usually has terms of use     Check robots.txt + ToS
Ease of use                 Higher (structured data)     Lower (parse HTML manually)
Reliability                 Higher                       Lower (sites change layouts)

Always prefer APIs. Always. Scraping is the backup plan for when the data you need does not have an API. Think of scraping as breaking the glass on the emergency exit. It works, but check if the front door is open first.


Common Pitfalls

  1. Not checking the status code. If you get a 403 and try to parse the response, you will parse an error page and get empty results. Always check response.status_code == 200 before proceeding.

  2. Hardcoding selectors. A selector like div:nth-child(3) > table > tbody > tr:nth-child(2) > td:nth-child(5) will break the moment the site adds a banner or rearranges elements. Use IDs and classes when possible. They are more stable.

  3. Not handling encoding. Some pages use unusual character encodings. If you see garbled text, check response.encoding and set it manually if needed: response.encoding = "utf-8".

  4. Scraping JavaScript-rendered content. If the data you want is loaded dynamically by JavaScript, requests will not see it (it only gets the initial HTML). You need either the site’s underlying API (check the browser’s Network tab) or a browser automation tool like Selenium or Playwright. But try the Network tab first.

  5. Not being polite. Sending 100 requests per second to a small website can effectively DDoS it. Add delays, respect robots.txt, and cache aggressively. Being a good internet citizen is not optional.

  6. Forgetting that websites change. Your scraper will break. Not if, when. Build your code to fail gracefully and alert you when the structure changes, rather than silently returning wrong data.
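The last two pitfalls are worth automating. Here is a hypothetical guard (not a library function) that raises as soon as scraped data no longer looks right, instead of letting stale selectors silently return wrong results:

```python
import pandas as pd

def validate_scrape(df, expected_columns, min_rows=1):
    """Fail loudly when scraped data no longer looks right.
    Hypothetical guard; adapt the checks to your own dataset."""
    missing = [c for c in expected_columns if c not in df.columns]
    if missing:
        raise ValueError(f"Missing columns {missing}: did the page layout change?")
    if len(df) < min_rows:
        raise ValueError(f"Only {len(df)} rows scraped: selector may be stale")
    return df

# Passes: the scraped frame still has the expected shape
df = pd.DataFrame({"Ticker": ["AAPL", "MSFT"], "Price": [148.20, 375.50]})
validate_scrape(df, ["Ticker", "Price"], min_rows=2)

# Fails loudly instead of silently producing wrong data
try:
    validate_scrape(df, ["Ticker", "Change"])
except ValueError as e:
    print(f"Scrape check failed: {e}")
```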


Wrapping Up

Web scraping is one of those skills that pays for itself immediately. The first time you automate a 30-minute manual data collection task into a 3-second Python script, you will wonder why you waited so long.

For financial data specifically, start with APIs whenever they are available. The ECB, Federal Reserve, World Bank, and many exchanges provide clean, structured data through APIs. When APIs are not available, pandas.read_html() handles simple tables in one line. For everything else, requests + BeautifulSoup gives you surgical precision over what you extract.

Build your scrapers with rate limiting, caching, error handling, and graceful failure. The website you are scraping does not owe you anything, so be a respectful consumer. And always remember: your scraper will break eventually. Build it so that when it does, you find out immediately rather than discovering six months later that you have been working with stale data.

And if your boss asks why you are reading HTML source code at your desk, tell them you are “performing automated data acquisition from unstructured web sources.” It sounds way better than “I wrote a script so I don’t have to copy-paste anymore.”


Cheat Sheet

Key Questions & Answers

What is web scraping?

Automated extraction of data from websites using code. You send HTTP requests to get the page HTML, parse the HTML to find the data you want, and extract it into a structured format like a CSV or DataFrame.

When should I use an API instead of scraping?

Always prefer APIs when available. APIs return structured data (JSON/XML), are designed for programmatic access, have documented rate limits, and do not break when the website redesigns. Only scrape HTML when no API exists for the data you need.

What is the difference between requests and BeautifulSoup?

requests fetches the web page (sends the HTTP request, receives the HTML). BeautifulSoup parses the HTML to find and extract specific elements. You need both: requests to get the page, BeautifulSoup to read it. Think of requests as the delivery truck and BeautifulSoup as the unboxing crew.

How do I handle JavaScript-rendered content?

requests only gets the initial HTML. If data is loaded by JavaScript, first check the browser’s Network tab (F12 > Network) for API calls the page makes, as you can often call those APIs directly. If that fails, use Selenium or Playwright to render the page in a real browser.

Key Concepts at a Glance

Concept                                 Summary
requests.get(url)                       Sends an HTTP GET request, returns the response
response.status_code                    HTTP status (200 = OK, 403 = blocked, 404 = not found)
BeautifulSoup(html, "html.parser")      Parses HTML into a navigable tree
soup.find("tag", attrs)                 Finds the first matching element
soup.find_all("tag")                    Finds all matching elements (returns a list)
soup.select("css selector")             Finds elements using CSS selectors
element.text / .get_text(strip=True)    Extracts text content from an element
pd.read_html(url)                       Scrapes all HTML tables from a URL into DataFrames
User-Agent header                       Identifies your client; set one to avoid being blocked
robots.txt                              File that specifies which paths may be crawled
Rate limiting                           Adding delays between requests to be polite
Caching                                 Saving responses locally to avoid re-fetching
yfinance                                Python wrapper for Yahoo Finance data
try/except with retry                   Handles network failures gracefully
CSS selectors                           Powerful syntax for targeting HTML elements
