Pandas for Finance: The Swiss Army Knife of Data Manipulation¶
If NumPy is the engine of Python’s scientific stack, pandas is the cockpit. You sit in it, grab the wheel, and do everything from loading CSV files to computing correlation matrices to resampling tick data into OHLC bars. Nobody in quantitative finance has a long career without becoming uncomfortably intimate with this library. It is the library that turns “here’s 50 million rows of trade data” into “here’s the answer, ready for my 9 AM meeting.”
Pandas was created by Wes McKinney in 2008 while he was working at AQR Capital Management, a quantitative hedge fund. The library exists because finance people needed something better than raw NumPy arrays for working with labeled, heterogeneous, time-series-heavy data. That origin story is important. Pandas was not designed for machine learning or web scraping or IoT sensor streams. It was designed for financial data, and it shows in every API decision.
This article is a tour of the parts of pandas that actually matter when you are manipulating market data, trade blotters, position files, and risk reports. Not every feature. Just the ones you will use in the next 6 months.
If you cannot read a CSV, filter rows, group by ticker, and compute daily returns in under 5 minutes, you are about to have a bad quarter.
The Two Core Structures: Series and DataFrame¶
Everything in pandas is either a Series or a DataFrame. Understanding these two is 80% of the battle.
Series: A Labeled One-Dimensional Array¶
A Series is like a NumPy array with labels. Each value has an index, and the index can be anything: integers, strings, dates, whatever you want.
import pandas as pd
# A Series of daily closing prices
prices = pd.Series(
[148.20, 149.50, 151.30, 150.80, 152.40],
index=pd.date_range("2026-03-30", periods=5, freq="B"),
name="AAPL"
)
print(prices)
Output:
2026-03-30 148.20
2026-03-31 149.50
2026-04-01 151.30
2026-04-02 150.80
2026-04-03 152.40
Freq: B, Name: AAPL, dtype: float64
The freq="B" means “business days” (Mon-Fri, no holidays handled automatically). This is the kind of detail that makes pandas feel finance-native.
DataFrame: A Labeled Two-Dimensional Array¶
A DataFrame is a table. Rows have an index, columns have names, and every column is a Series.
data = {
"AAPL": [148.20, 149.50, 151.30, 150.80, 152.40],
"MSFT": [375.80, 378.20, 380.50, 379.10, 382.40],
"GOOGL": [152.40, 153.10, 155.20, 154.80, 156.30],
}
df = pd.DataFrame(
data,
index=pd.date_range("2026-03-30", periods=5, freq="B")
)
print(df)
Output:
AAPL MSFT GOOGL
2026-03-30 148.20 375.80 152.40
2026-03-31 149.50 378.20 153.10
2026-04-01 151.30 380.50 155.20
2026-04-02 150.80 379.10 154.80
2026-04-03 152.40 382.40 156.30
Key Insight: The index is not just row numbering. It is a first-class structure that pandas uses for alignment, joining, and time-based operations. Two Series with the same index can be added together and the values line up automatically. Two Series with different indexes get automatically aligned with NaN filling the gaps. This “label-based alignment” is the feature that makes pandas feel magical.
Pandas treats row labels as the soul of your data. Lose them and things get weird fast.
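To see alignment in action, here is a minimal sketch: two Series keyed by ticker, added together. Values match by label, not by position, and unmatched labels get NaN.
import pandas as pd
# Same tickers, different order and one missing from b
a = pd.Series([1.0, 2.0, 3.0], index=["AAPL", "MSFT", "GOOGL"])
b = pd.Series([10.0, 20.0], index=["MSFT", "AAPL"])
print(a + b)
Output:
AAPL 21.0
GOOGL NaN
MSFT 12.0
dtype: float64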
Loading Data: The Starting Point¶
Real work begins with data from somewhere else. Pandas reads every format you will encounter.
CSV Files¶
# Read from CSV
trades = pd.read_csv("trades.csv", parse_dates=["trade_date"])
print(trades.head())
Options that matter (combined in the sketch after this list):
- parse_dates=["col"] parses date columns on the fly
- index_col="col" uses a column as the index
- usecols=[...] reads only specific columns (saves memory)
- dtype={"col": float} forces specific data types
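A sketch combining all four options; the column names here are assumptions about a typical trades file:
trades = pd.read_csv(
"trades.csv",
parse_dates=["trade_date"], # parse dates while reading
index_col="trade_id", # use trade_id as the index
usecols=["trade_id", "trade_date", "ticker", "quantity", "price"],
dtype={"quantity": "int64", "price": "float64"}, # force types up front
)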
Excel Files¶
# Read from Excel
trades = pd.read_excel("trades.xlsx", sheet_name="Q1 2026")
# Multiple sheets at once
all_sheets = pd.read_excel("data.xlsx", sheet_name=None) # dict of DataFrames
Directly from Yahoo Finance¶
import yfinance as yf
# Get 1 year of Apple data
aapl = yf.download("AAPL", start="2025-04-01", end="2026-04-01")
print(aapl.head())
From a Database¶
import sqlite3
conn = sqlite3.connect("trades.db")
df = pd.read_sql("SELECT * FROM trades WHERE trade_date >= '2026-01-01'", conn)
Reading data is 20% of the work. Figuring out why column 7 contains the string “N/A” where you expected a float is the other 80%.
Inspecting Data: The First 5 Commands You Will Ever Run¶
Before doing anything, look at your data. Always.
df.head() # First 5 rows
df.tail(10) # Last 10 rows
df.info() # Column types, non-null counts, memory
df.describe() # Summary statistics for numeric columns
df.shape # (rows, columns)
df.columns # List of column names
df.dtypes # Type of each column
df.index # The index
Example output of df.info() on a realistic trades table:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148532 entries, 0 to 148531
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 trade_id 148532 non-null int64
1 trade_date 148532 non-null datetime64[ns]
2 trader 148532 non-null object
3 ticker 148530 non-null object
4 side 148532 non-null object
5 quantity 148532 non-null int64
6 price 148508 non-null float64
7 notional 148508 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(2), object(3)
memory usage: 9.1+ MB
Notice the ticker and price columns have missing values (148530 and 148508 out of 148532). That is the first thing to investigate.
Key Insight: Always run df.info() on any DataFrame you did not create yourself. Missing values, wrong data types, and memory usage problems all show up there. Skipping this step is the root cause of most “why is my analysis wrong” moments.
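A natural follow-up is to pull the offending rows so you can see what went wrong. A minimal sketch, assuming the trades table above:
# Rows where ticker or price is missing
bad_rows = trades[trades["ticker"].isna() | trades["price"].isna()]
print(bad_rows.head())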
Selecting Data: The Bread and Butter¶
There are three ways to select data: by label, by position, and by condition. Master all three.
By Column¶
df["AAPL"] # Single column, returns Series
df[["AAPL", "MSFT"]] # Multiple columns, returns DataFrame
By Row with .loc (label-based)¶
df.loc["2026-04-01"] # One row by index label
df.loc["2026-03-30":"2026-04-01"] # Slice of rows (inclusive!)
df.loc["2026-04-01", "AAPL"] # Single cell
df.loc[:, "AAPL":"MSFT"] # All rows, column range
By Row with .iloc (position-based)¶
df.iloc[0] # First row
df.iloc[-1] # Last row
df.iloc[0:3] # First 3 rows (exclusive end, like Python slicing)
df.iloc[0, 1] # First row, second column
By Condition: Boolean Indexing¶
This is where pandas really earns its keep.
# Trades with notional over $1M
large = trades[trades["notional"] > 1_000_000]
# Buys by Alice in March
alice_march = trades[
(trades["trader"] == "Alice")
& (trades["side"] == "BUY")
& (trades["trade_date"].dt.month == 3)
]
# Using .query() for cleaner syntax
alice_march = trades.query(
"trader == 'Alice' and side == 'BUY' and trade_date.dt.month == 3"
)
The & and | operators work on Series, but you must wrap conditions in parentheses. Forgetting parentheses is pandas’ favorite gotcha and will cost you 15 minutes the first 10 times it happens.
Adding and Modifying Columns¶
New columns are just assignments.
# Compute notional from quantity and price
trades["notional"] = trades["quantity"] * trades["price"]
# Categorize trade size
trades["size_category"] = pd.cut(
trades["notional"],
bins=[0, 100_000, 1_000_000, float("inf")],
labels=["Small", "Medium", "Large"]
)
# Conditional values with np.where
import numpy as np
trades["direction"] = np.where(trades["side"] == "BUY", 1, -1)
trades["signed_qty"] = trades["quantity"] * trades["direction"]
The .apply() Method¶
For more complex transformations, apply() lets you run arbitrary Python functions across rows or columns.
def classify_trade(row):
if row["notional"] > 5_000_000:
return "Block"
elif row["side"] == "SELL" and row["quantity"] > 10_000:
return "Distribution"
else:
return "Standard"
trades["trade_type"] = trades.apply(classify_trade, axis=1)
Key Insight: Avoid .apply() for anything that can be done with vectorized operations. .apply(axis=1) iterates row by row in Python and is 100 to 1000 times slower than equivalent vectorized code. Use it only when you genuinely need Python logic that cannot be expressed with standard operators.
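For instance, the classify_trade function above can be vectorized with np.select, which evaluates a list of conditions in order, exactly like the if/elif chain. A sketch:
import numpy as np
conditions = [
trades["notional"] > 5_000_000,
(trades["side"] == "SELL") & (trades["quantity"] > 10_000),
]
choices = ["Block", "Distribution"]
# First matching condition wins; default covers the else branch
trades["trade_type"] = np.select(conditions, choices, default="Standard")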
Time Series: The Finance Superpower¶
This is where pandas absolutely dominates. Everything about time series is first-class.
Setting a Date Index¶
trades = trades.set_index("trade_date").sort_index()
Once the index is a DatetimeIndex, everything time-based gets easy.
Slicing by Date¶
trades["2026"] # All trades in 2026
trades["2026-03"] # All trades in March 2026
trades["2026-03-15":"2026-03-31"] # Last two weeks of March
Resampling: The Magic Method¶
Resampling lets you change the time frequency of your data. This is how you turn tick data into minute bars, or daily prices into monthly returns.
# Daily prices
prices = pd.Series(
[148.20, 149.50, 151.30, 150.80, 152.40, 153.10, 154.80, 153.90, 155.20, 156.30],
index=pd.date_range("2026-03-30", periods=10, freq="B")
)
# Weekly mean
weekly_mean = prices.resample("W").mean()
# Monthly last price (end-of-month close); "ME" replaced "M" in pandas 2.2
monthly_close = prices.resample("ME").last()
# OHLC bars from tick data (assuming tick_prices is a Series of trade
# prices with a DatetimeIndex)
ohlc = tick_prices.resample("1min").ohlc()
Common frequency codes:
| Code | Meaning |
|---|---|
| D | Calendar day |
| B | Business day |
| W | Weekly |
| ME | Month end ("M" before pandas 2.2) |
| MS | Month start |
| QE | Quarter end ("Q" before pandas 2.2) |
| YE | Year end ("Y" or "A" before pandas 2.2) |
| h | Hourly ("H" before pandas 2.2) |
| min | Minute; prefix for multiples, e.g. 15min ("15T" before pandas 2.2) |
Rolling Windows¶
Moving averages, rolling standard deviations, and every other “last N days” calculation uses .rolling().
# 20-day moving average (here prices is a DataFrame with a "close" column,
# unlike the Series in the resampling example)
prices["ma_20"] = prices["close"].rolling(window=20).mean()
# 20-day standard deviation (for Bollinger bands)
prices["vol_20"] = prices["close"].rolling(window=20).std()
# Upper and lower Bollinger bands
prices["upper"] = prices["ma_20"] + 2 * prices["vol_20"]
prices["lower"] = prices["ma_20"] - 2 * prices["vol_20"]
If you can write a 20-day moving average in one line of pandas, you have more skill than 90% of people who put “Excel” on their resume.
Computing Returns: The Finance Canonical Example¶
Returns are arguably the single most important calculation in quantitative finance. Pandas makes them trivial.
Simple Returns¶
# r_t = P_t / P_{t-1} - 1
returns = prices.pct_change()
Log Returns¶
# r_t = ln(P_t / P_{t-1})
import numpy as np
log_returns = np.log(prices / prices.shift(1))
Cumulative Returns¶
# Cumulative simple returns
cum_returns = (1 + returns).cumprod() - 1
# Cumulative log returns (just sum them)
cum_log_returns = log_returns.cumsum()
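The two conventions are consistent with each other: the gross cumulative return from simple returns equals the exponential of the cumulative log return, up to floating-point error. A quick sanity check:
# P_t / P_0 computed two ways
gross = (1 + returns).cumprod()
assert np.allclose(gross.dropna(), np.exp(cum_log_returns.dropna()))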
Annualized Volatility¶
# Daily volatility
daily_vol = returns.std()
# Annualized volatility (assume 252 trading days)
annual_vol = daily_vol * np.sqrt(252)
print(f"Annualized vol: {annual_vol:.2%}")
Sharpe Ratio¶
def sharpe_ratio(returns, risk_free=0.04, periods_per_year=252):
excess = returns - risk_free / periods_per_year
return excess.mean() / returns.std() * np.sqrt(periods_per_year)
print(f"Sharpe: {sharpe_ratio(returns):.2f}")
Every hedge fund on Earth computes Sharpe ratios in exactly these three lines, just maybe dressed up in extra classes and type hints. The math does not care about your fancy abstractions.
Grouping and Aggregation¶
Nothing beats .groupby() for summarizing data by category.
Basic Grouping¶
# Total notional by trader
by_trader = trades.groupby("trader")["notional"].sum()
# Number of trades by trader and ticker
counts = trades.groupby(["trader", "ticker"]).size()
Multiple Aggregations at Once¶
# Full trader summary
summary = trades.groupby("trader").agg(
total_notional=("notional", "sum"),
num_trades=("notional", "count"),
avg_trade_size=("notional", "mean"),
max_trade=("notional", "max"),
avg_price=("price", "mean")
).round(2)
print(summary)
Output:
total_notional num_trades avg_trade_size max_trade avg_price
trader
Alice 531660.00 4 132915.00 187900.00 206.09
Bob 272820.00 3 90940.00 165060.00 377.60
Charlie 287200.00 2 143600.00 191000.00 95.85
David 780675.00 1 780675.00 780675.00 520.45
Group Transformations¶
Sometimes you want to compute something within each group but keep the original row structure. Use .transform().
# Each trade's percentage of the trader's total notional
trades["pct_of_trader"] = (
trades["notional"]
/ trades.groupby("trader")["notional"].transform("sum")
)
Merging and Joining¶
Combining multiple DataFrames is where SQL-style joins meet pandas.
Setup¶
trades = pd.DataFrame({
"trade_id": [1, 2, 3, 4],
"ticker": ["AAPL", "MSFT", "GOOGL", "AAPL"],
"quantity": [100, 200, 150, 300]
})
prices = pd.DataFrame({
"ticker": ["AAPL", "MSFT", "GOOGL", "NVDA"],
"price": [148.20, 375.80, 152.40, 825.30]
})
Inner Join¶
# Keep only rows with matches in both
combined = trades.merge(prices, on="ticker", how="inner")
Left Join¶
# Keep all trades, add price where available
combined = trades.merge(prices, on="ticker", how="left")
Joining on Different Column Names¶
left_df.merge(
right_df,
left_on="trade_ticker",
right_on="symbol"
)
Joining on the Index¶
# If index is the key (very common for time series)
combined = df1.join(df2, how="inner")
| How | Description |
|---|---|
| inner | Only rows matching in both |
| left | All rows from left, matching from right |
| right | All rows from right, matching from left |
| outer | All rows, NaN for missing matches |
pd.merge() is just SQL JOIN with nicer syntax and more NaN. Same concepts, same footguns, same occasional 20-minute debugging session when a key column has whitespace.
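One debugging aid worth knowing when a join misbehaves: indicator=True adds a _merge column recording where each row came from. A sketch using the trades and prices frames from the setup above:
combined = trades.merge(prices, on="ticker", how="left", indicator=True)
print(combined["_merge"].value_counts())
# both 4, left_only 0, right_only 0 -- every trade found a price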
Pivoting: Wide vs Long Format¶
Real data alternates between “long format” (one row per observation) and “wide format” (one row per entity, many columns). Pandas makes both directions easy.
Long to Wide: pivot_table¶
# Long format: one row per (date, ticker)
prices_long = pd.DataFrame({
"date": ["2026-04-01", "2026-04-01", "2026-04-02", "2026-04-02"],
"ticker": ["AAPL", "MSFT", "AAPL", "MSFT"],
"close": [148.20, 375.80, 149.50, 378.20]
})
# Pivot to wide: one column per ticker
prices_wide = prices_long.pivot_table(
index="date",
columns="ticker",
values="close"
)
Output:
ticker AAPL MSFT
date
2026-04-01 148.20 375.80
2026-04-02 149.50 378.20
Wide to Long: melt¶
prices_long_again = prices_wide.reset_index().melt(
id_vars="date",
var_name="ticker",
value_name="close"
)
Key Insight: Finance data is usually stored “long” (trade logs, tick databases) but analyzed “wide” (correlation matrices, returns panels). Know how to switch between them without thinking, and you will save yourself hours of misery every week.
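For instance, a correlation matrix falls out of the wide form in one line. A sketch, assuming prices_wide holds a longer price history than the two-row toy example above:
# Daily returns per ticker, then pairwise correlations
returns_wide = prices_wide.pct_change()
print(returns_wide.corr()) # ticker-by-ticker correlation matrix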
Handling Missing Data¶
Real data is full of holes. Pandas has a whole toolkit for dealing with them.
# Check for missing values
df.isna().sum() # Count of NaN per column
df.isna().any(axis=1).sum() # Number of rows with any NaN
# Drop rows with any NaN
df_clean = df.dropna()
# Drop rows where specific columns are NaN
df_clean = df.dropna(subset=["price", "quantity"])
# Fill NaN with a value
df["price"] = df["price"].fillna(0)
# Forward fill (use last known value), common for market data
df["price"] = df["price"].ffill()
# Interpolate numerically
df["price"] = df["price"].interpolate(method="linear")
Forward fill is the finance world’s favorite. When the market is closed, yesterday’s closing price is “still true” until we hear otherwise.
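A common version of this pattern: reindex a price series to a full business-day calendar, then forward-fill the holes. A sketch using the prices Series from earlier (already business-daily, so a no-op there, but the pattern matters for gappy data):
# Full business-day calendar spanning the data
bdays = pd.date_range(prices.index.min(), prices.index.max(), freq="B")
prices_filled = prices.reindex(bdays).ffill()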
Real Example: A Complete Workflow¶
Let us put it all together. We will load trades, enrich with market data, compute P&L, and produce a summary.
import pandas as pd
import numpy as np
# 1. Load data
trades = pd.read_csv("trades.csv", parse_dates=["trade_date"])
market = pd.read_csv("market_data.csv", parse_dates=["trade_date"])
# 2. Clean
trades = trades.dropna(subset=["price", "quantity"])
trades["ticker"] = trades["ticker"].str.upper().str.strip()
# 3. Compute notional
trades["notional"] = trades["quantity"] * trades["price"]
trades["signed_qty"] = np.where(
trades["side"] == "BUY",
trades["quantity"],
-trades["quantity"]
)
# 4. Enrich with current market prices
latest_market = (
market.sort_values("trade_date")
.groupby("ticker")
.tail(1)[["ticker", "close_price"]]
)
enriched = trades.merge(latest_market, on="ticker", how="left")
# 5. Calculate unrealized P&L per trade
enriched["market_value"] = enriched["signed_qty"] * enriched["close_price"]
enriched["cost_basis"] = enriched["signed_qty"] * enriched["price"]
enriched["pnl"] = enriched["market_value"] - enriched["cost_basis"]
# 6. Aggregate by trader and ticker
by_trader_ticker = (
enriched.groupby(["trader", "ticker"])
.agg(
positions=("signed_qty", "sum"),
total_cost=("cost_basis", "sum"),
market_value=("market_value", "sum"),
total_pnl=("pnl", "sum")
)
.round(2)
)
# 7. Sort by P&L
by_trader_ticker = by_trader_ticker.sort_values("total_pnl", ascending=False)
print(by_trader_ticker.head(10))
This is essentially what every P&L reconciliation script does. 20 lines of pandas. The SQL version is similar. The VBA version is 200 lines. The C++ version is 2000 lines. Pick your poison.
Performance Tips¶
Pandas is fast, but not magically so. A few practices make it much faster.
1. Avoid iterrows() and itertuples()¶
Iterating over rows is the slowest thing you can do. 99% of the time there is a vectorized equivalent.
# BAD: 10-100x slower
for index, row in df.iterrows():
df.loc[index, "notional"] = row["quantity"] * row["price"]
# GOOD: vectorized
df["notional"] = df["quantity"] * df["price"]
2. Use Categorical Dtype for Repeated Strings¶
If a column has limited unique values (tickers, traders, sectors), convert it to category. This often cuts memory use by 5-10x and speeds up groupby and comparison operations.
df["ticker"] = df["ticker"].astype("category")
df["trader"] = df["trader"].astype("category")
3. Use .loc for Setting Values¶
Chained indexing like df[df.x > 0]["y"] = 5 can silently fail. Always use .loc.
# Correct
df.loc[df["quantity"] < 0, "side"] = "SELL"
4. Read Only What You Need¶
# Only load columns you will actually use
df = pd.read_csv("trades.csv", usecols=["trade_date", "ticker", "quantity", "price"])
Key Insight: Memory is usually the bottleneck, not CPU. A 10 GB CSV does not need to be loaded all at once. Use chunksize=10000 with read_csv() for iterative processing, or load specific columns and date ranges only. A well-filtered DataFrame is always faster than a giant one.
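A minimal sketch of chunked processing, assuming the same trades.csv columns as before:
# Stream the file 100k rows at a time; only two columns ever live in memory
total_notional = 0.0
for chunk in pd.read_csv("trades.csv", usecols=["quantity", "price"], chunksize=100_000):
total_notional += (chunk["quantity"] * chunk["price"]).sum()
print(f"Total notional: {total_notional:,.2f}")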
Common Pitfalls¶
- SettingWithCopyWarning. Pandas warns you when you might be modifying a copy instead of the original. The fix: use .loc explicitly, or use .copy() when slicing.
- Misaligned operations. Adding two Series with different indexes produces NaN where they do not align. If you are getting unexpected NaNs, check your indexes.
- Type inference surprises. CSV columns with mixed types default to object (Python strings). Always check df.dtypes after loading.
- Forgetting to sort by date. Time-based operations like .rolling() and .resample() assume the index is sorted. Call .sort_index() after loading any time series.
- Using the wrong merge direction. left vs inner vs outer look similar but produce very different results. When in doubt, use how="left" and check row counts before and after.
- Using chained assignment. df[condition]["col"] = value often silently fails. Always use df.loc[condition, "col"] = value.
Wrapping Up¶
Pandas is not glamorous. It is a library for moving tabular data around, cleaning it up, and summarizing it. But that activity is 70% of what quantitative analysts, data scientists, and financial engineers actually do. The other 30% (machine learning, visualization, backtesting, presentation) sits on top of pandas outputs.
Learn the core: Series, DataFrame, indexing, boolean filtering, groupby, merge, resample, rolling. Those primitives cover almost every real-world task. Once you are fluent, you will find yourself reaching for pandas whenever someone hands you a CSV and asks “what’s in this?”
And one day you will realize you have written 10,000 lines of pandas code this year and only 200 lines of actual finance logic. That is not a bug. That is the nature of the job.
Cheat Sheet¶
Key Questions & Answers¶
What is the difference between Series and DataFrame?¶
A Series is a one-dimensional labeled array (one column of data). A DataFrame is a two-dimensional labeled table (multiple columns). Every column of a DataFrame is a Series. Both have an index that enables label-based alignment.
When should I use .loc vs .iloc?¶
Use .loc for label-based selection (by row/column name). Use .iloc for position-based selection (by integer position). .loc is generally the safer habit for financial data because labels stay meaningful when rows are reordered or filtered. Boolean masks work with both plain [] indexing and .loc.
How do I compute returns in pandas?¶
Simple returns: prices.pct_change(). Log returns: np.log(prices / prices.shift(1)). Cumulative returns: (1 + returns).cumprod() - 1. Annualized volatility: returns.std() * np.sqrt(252) for daily data.
How do I reshape from long to wide format?¶
Use pivot_table(index=..., columns=..., values=...) to go from long to wide. Use .melt(id_vars=..., var_name=..., value_name=...) to go from wide to long. Financial data is usually stored long but analyzed wide.
Key Concepts at a Glance¶
| Concept | Summary |
|---|---|
| Series | One-dimensional labeled array (one column) |
| DataFrame | Two-dimensional labeled table (rows + columns) |
| Index | Row labels; enables alignment and time-based operations |
| .loc | Label-based selection |
| .iloc | Position-based selection |
| Boolean indexing | df[df["col"] > value] filters rows by condition |
| .groupby() | Split-apply-combine on categorical columns |
| .agg() | Multiple aggregations in one call |
| .merge() | SQL-style joins between DataFrames |
| .pivot_table() | Reshape from long to wide |
| .melt() | Reshape from wide to long |
| .resample() | Change time frequency (daily to monthly, etc.) |
| .rolling() | Sliding window calculations (moving averages) |
| .pct_change() | Period-over-period percentage change |
| .shift() | Lag or lead values by N periods |
| .ffill() / .bfill() | Forward/backward fill missing values |
| category dtype | Memory-efficient for repeated string columns |
| SettingWithCopyWarning | Fix by using .loc explicitly for assignment |
Sources & Further Reading¶
- McKinney, W., Python for Data Analysis, O’Reilly (the pandas creator’s own book)
- pandas Documentation, User Guide
- pandas Documentation, Time Series / Date Functionality
- Real Python, The Pandas DataFrame
- Quantitative Economics, pandas for Quantitative Finance
- Jansen, S., Machine Learning for Algorithmic Trading, Packt
- yfinance Documentation, yfinance on PyPI
- pandas Cheat Sheet, Official PDF
- McKinney, W. (2010), Data Structures for Statistical Computing in Python, Proceedings of the 9th Python in Science Conference