
Chapter 14: Data Wrangling with pandas and NumPy
Overview
In the previous chapter, you learned the basics of Python syntax and met the powerhouse duo: pandas and NumPy. Now, it’s time to master Data Wrangling—the process of transforming raw, chaotic data into a structured, clean format ready for analysis or machine learning.
As a PHP developer, you’re used to processing data using foreach loops and associative arrays. While that works for small datasets, it becomes painfully slow and complex as data scales. In this chapter, you’ll learn how to replace nested loops with single-line vectorized operations, how to join massive datasets like a SQL pro, and how to handle time-series data with ease. You’ll see firsthand why pandas is 10-100x faster than traditional PHP loops for data transformation.
By the end of this chapter, you’ll be able to build robust data cleaning pipelines that turn messy CSVs and API logs into pristine analytical assets.
Prerequisites
Before starting this chapter, you should have:
- Completed Chapter 13: Python Fundamentals
- Python 3.10+ installed with a virtual environment
- pandas and numpy installed (pip install pandas numpy)
- Estimated Time: ~90 minutes
Verify your setup:
```bash
# Check if libraries are installed
python3 -c "import pandas; import numpy; print(f'pandas: {pandas.__version__}, numpy: {numpy.__version__}')"
```
What You’ll Build
By the end of this chapter, you will have created:
- E-commerce Analytics Pipeline: A script that joins user, order, and product data to find high-value customers.
- Log Aggregator: A time-series tool that takes raw server logs and resamples them into hourly error rates.
- Missing Data Handler: A systematic approach to filling or dropping NaN values without breaking math operations.
- Performance Benchmark Suite: A side-by-side comparison showing exactly when pandas outpaces PHP.
Objectives
- Combine datasets using Inner, Left, and Outer joins (just like SQL).
- Reshape data between “long” and “wide” formats using pivot_table and melt.
- Master Time Series analysis by resampling dates and frequencies.
- Implement vectorized cleaning routines for strings and numerical outliers.
- Optimize memory and speed for datasets exceeding 1M rows.
Step 1: Advanced Joining and Merging (~20 min)
Combine multiple DataFrames using SQL-like joins to build a unified view of your data.
Why It Matters
In PHP, joining two arrays usually requires nested loops or a lookup map. In pandas, merge lets you join millions of rows in milliseconds using optimized routines written in C.
Actions
1. Create a join experimentation script (examples/merging_lab.py):
```python
import pandas as pd

# 1. Setup sample data
users = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'plan': ['Pro', 'Free', 'Pro', 'Free']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'user_id': [1, 2, 1, 5],  # Note: user 5 doesn't exist in 'users'
    'amount': [150.0, 25.0, 300.0, 45.0]
})

print("--- Users ---")
print(users)
print("\n--- Orders ---")
print(orders)

# 2. Inner Join (keep only records present in BOTH)
# PHP equivalent: nested loops with if (isset($lookup[$id]))
inner_join = pd.merge(users, orders, on='user_id', how='inner')
print("\n--- Inner Join (Matches only) ---")
print(inner_join)

# 3. Left Join (keep all users, even those with no orders)
left_join = pd.merge(users, orders, on='user_id', how='left')
print("\n--- Left Join (All Users) ---")
print(left_join)

# 4. Outer Join (keep EVERYTHING from both)
outer_join = pd.merge(users, orders, on='user_id', how='outer')
print("\n--- Outer Join (Full Union) ---")
print(outer_join)

# 5. Right Join (keep all orders, even those with no user)
right_join = pd.merge(users, orders, on='user_id', how='right')
print("\n--- Right Join (All Orders) ---")
print(right_join)
```

2. Run the script:
```bash
python3 examples/merging_lab.py
```
Expected Result
```
--- Inner Join (Matches only) ---
   user_id   name  plan  order_id  amount
0        1  Alice   Pro       101   150.0
1        1  Alice   Pro       103   300.0
2        2    Bob  Free       102    25.0

--- Left Join (All Users) ---
   user_id     name  plan  order_id  amount
0        1    Alice   Pro     101.0   150.0
1        1    Alice   Pro     103.0   300.0
2        2      Bob  Free     102.0    25.0
3        3  Charlie   Pro       NaN     NaN
4        4    David  Free       NaN     NaN

--- Outer Join (Full Union) ---
   user_id     name  plan  order_id  amount
0        1    Alice   Pro     101.0   150.0
1        1    Alice   Pro     103.0   300.0
2        2      Bob  Free     102.0    25.0
3        3  Charlie   Pro       NaN     NaN
4        4    David  Free       NaN     NaN
5        5      NaN   NaN     104.0    45.0
```
Why It Works
When you call merge, pandas identifies the “key” column (like user_id), builds an optimized lookup on it, aligns the data in memory, and performs the set operation (intersection for inner, union for outer).
- Left Joins are crucial for maintaining your “entities” while checking for optional related data (like “Users who haven’t ordered yet”).
- NaN Handling: Notice how users with no orders get NaN (Not a Number) for the order columns. This is how pandas marks missing data (see the sketch below).
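Those NaN markers are immediately useful. A minimal sketch (using a trimmed copy of the lab data) that turns them into a “users who have never ordered” report:

```python
import pandas as pd

users = pd.DataFrame({'user_id': [1, 2, 3, 4],
                      'name': ['Alice', 'Bob', 'Charlie', 'David']})
orders = pd.DataFrame({'order_id': [101, 102, 103],
                       'user_id': [1, 2, 1],
                       'amount': [150.0, 25.0, 300.0]})

# Left join keeps every user; users without orders get NaN in the order columns
left = pd.merge(users, orders, on='user_id', how='left')

# Users who haven't ordered yet: rows where the order side is missing
never_ordered = left[left['order_id'].isna()]
print(never_ordered[['user_id', 'name']])
```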
Troubleshooting
Problem: MergeError: No common columns to merge on
Cause: The column names don’t match exactly (e.g., id vs user_id).
Solution: Use left_on and right_on:
```python
pd.merge(users, orders, left_on='id', right_on='user_id')
```
Problem: Duplicate columns after merge (e.g., name_x, name_y)
Cause: Both DataFrames had a column named name that wasn’t used as the merge key.
Solution: Use the suffixes parameter:
```python
pd.merge(users, orders, on='user_id', suffixes=('_user', '_order'))
```
Step 2: Reshaping and Pivoting Data (~20 min)
Transform data between “long” format (easy for storage) and “wide” format (easy for analysis).
Actions
1. Create a pivoting lab (examples/pivoting_lab.py):
```python
import pandas as pd

# 1. Sample "Long" data (like event logs)
data = {
    'date': ['2025-01-01', '2025-01-01', '2025-01-02', '2025-01-02'],
    'region': ['East', 'West', 'East', 'West'],
    'sales': [100, 150, 120, 180]
}
df = pd.DataFrame(data)
print("--- Original Long Format ---")
print(df)

# 2. Pivot to "Wide" format (Excel-style)
# Goal: rows are dates, columns are regions
pivoted = df.pivot(index='date', columns='region', values='sales')
print("\n--- Wide Format (Pivoted) ---")
print(pivoted)

# 3. Pivot Table (aggregate while pivoting)
# What if there are multiple sales on the same day/region?
data_messy = pd.concat([df, pd.DataFrame({'date': ['2025-01-01'], 'region': ['East'], 'sales': [50]})])
# Use pivot_table to SUM the values
pivot_agg = data_messy.pivot_table(index='date', columns='region', values='sales', aggfunc='sum')
print("\n--- Pivot Table (Summed) ---")
print(pivot_agg)

# 4. Melt (go back from wide to long)
# Useful for preparing data for machine learning or plotting
melted = pivoted.reset_index().melt(id_vars='date', value_name='sales_amount')
print("\n--- Melted Back to Long ---")
print(melted)
```

2. Run the script:
```bash
python3 examples/pivoting_lab.py
```
Expected Result
```
--- Wide Format (Pivoted) ---
region      East  West
date
2025-01-01   100   150
2025-01-02   120   180

--- Pivot Table (Summed) ---
region      East  West
date
2025-01-01   150   150
2025-01-02   120   180
```
Why It Works
- pivot: Simple reshaping when unique index/column pairs exist.
- pivot_table: The “Grand Master” of aggregation. It handles duplicate entries by applying a function like sum, mean, or count (see the sketch below for a useful extra).
- melt: “Unpivots” a DataFrame. It takes column labels and turns them into data values in a new column. This is often required before sending data to visualization libraries like Seaborn.
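pivot_table can also add grand totals while it aggregates. A minimal sketch (the sample data is illustrative):

```python
import pandas as pd

sales = pd.DataFrame({
    'date': ['2025-01-01', '2025-01-01', '2025-01-02', '2025-01-02'],
    'region': ['East', 'West', 'East', 'West'],
    'sales': [100, 150, 120, 180],
})

# margins=True appends an 'All' row and column containing the grand totals
summary = sales.pivot_table(index='date', columns='region', values='sales',
                            aggfunc='sum', margins=True)
print(summary)
```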
Troubleshooting
Problem: ValueError: Index contains duplicate entries, cannot reshape
Cause: You used pivot instead of pivot_table when your data has multiple values for the same index/column combination.
Solution: Use pivot_table and specify an aggfunc.
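A minimal sketch of the fix (the sample data is illustrative):

```python
import pandas as pd

# Two rows share the same (date, region) pair, so pivot() would raise ValueError
df = pd.DataFrame({
    'date': ['2025-01-01', '2025-01-01', '2025-01-02'],
    'region': ['East', 'East', 'West'],
    'sales': [100, 50, 180],
})

# pivot_table aggregates the duplicates instead of failing
print(df.pivot_table(index='date', columns='region', values='sales', aggfunc='sum'))
```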
Step 3: Mastering Time Series Data (~20 min)
Transform timestamps into a powerful index for trend analysis and resampling.
Actions
1. Create a time series lab (examples/timeseries_lab.py):
```python
import pandas as pd
import numpy as np

# 1. Generate 10 days of hourly data
dates = pd.date_range('2025-01-01', periods=240, freq='H')
traffic = np.random.poisson(lam=50, size=240)  # Avg 50 visitors/hr

df = pd.DataFrame({'traffic': traffic}, index=dates)
print("--- Hourly Traffic (Head) ---")
print(df.head())

# 2. Slice by time
# PHP: requires complex date math or string parsing
print("\n--- Traffic on Jan 2nd ---")
print(df.loc['2025-01-02'])

# 3. Resampling (the "magic" of pandas)
# Aggregate hourly data into daily totals
daily = df.resample('D').sum()
print("\n--- Daily Traffic Totals ---")
print(daily)

# Aggregate into daily averages
daily_avg = df.resample('D').mean()
print("\n--- Daily Traffic Average ---")
print(daily_avg)

# 4. Moving averages (smooth out the noise)
df['7hr_moving_avg'] = df['traffic'].rolling(window=7).mean()
print("\n--- Hourly Traffic with Moving Average ---")
print(df.head(10))
```

2. Run the script:
```bash
python3 examples/timeseries_lab.py
```
Expected Result
```
--- Daily Traffic Totals ---
            traffic
2025-01-01     1210
2025-01-02     1185
...

--- Hourly Traffic with Moving Average ---
                     traffic  7hr_moving_avg
2025-01-01 00:00:00       48             NaN
...
2025-01-01 06:00:00       52       50.142857
```
(Your exact numbers will differ, since the traffic values are randomly generated.)
Why It Works
Pandas treats time as a first-class citizen. By setting a DatetimeIndex, you unlock:
- Resampling (resample): Like GROUP BY for time. D for day, M for month, W for week.
- Rolling (rolling): Performs calculations on a sliding window, perfect for identifying trends in volatile data.
- Natural Slicing: You can use strings like '2025-01' to get all data for a specific month (see the sketch after this list).
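Here is a minimal sketch combining natural slicing and resampling. The traffic values are randomly generated, as in the lab script:

```python
import pandas as pd
import numpy as np

# Two weeks of hourly visits on a DatetimeIndex
idx = pd.date_range('2025-01-01', periods=14 * 24, freq='H')
df = pd.DataFrame({'traffic': np.random.poisson(lam=50, size=len(idx))}, index=idx)

# Natural slicing: everything in the first week of January, by date strings
first_week = df.loc['2025-01-01':'2025-01-07']
print(f"Rows in first week: {len(first_week)}")

# Resample to weekly and apply several aggregations at once (like GROUP BY week)
weekly = df['traffic'].resample('W').agg(['sum', 'mean', 'max'])
print(weekly)
```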
Troubleshooting
Problem: Dates are being treated as strings (object)
Cause: You loaded a CSV where the date column is just text.
Solution: Convert it using pd.to_datetime():
```python
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
```
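If the dates come straight from a CSV, you can also parse them at load time. A minimal sketch (the file name and column name are placeholders):

```python
import pandas as pd

# Hypothetical file: parse the 'date' column while reading and use it as the index
df = pd.read_csv('server_logs.csv', parse_dates=['date'], index_col='date')
```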
Step 4: Cleaning and Pipelines at Scale (~15 min)
Apply vectorized string and math operations to clean messy data without row-by-row loops.
Actions
1. Create a cleaning lab (examples/cleaning_lab.py):
```python
import pandas as pd
import numpy as np

# 1. Messy data
data = {
    'name': [' ALICE ', 'bob', ' Charlie', 'DAVID'],
    'email': ['alice@GMAIL.com', 'BOB@outlook.com', 'charlie@gmail.com', np.nan],
    'age': [25, -1, 30, 150],  # -1 and 150 are likely errors
    'score': [85, 90, np.nan, 75]
}
df = pd.DataFrame(data)

# 2. Vectorized string cleaning
# PHP: foreach loop with trim(), strtolower()
df['name'] = df['name'].str.strip().str.title()
df['email'] = df['email'].str.lower()
print("--- Cleaned Strings ---")
print(df[['name', 'email']])

# 3. Handling outliers with clipping
# Clamp age between 0 and 100
df['age'] = df['age'].clip(lower=0, upper=100)

# 4. Fill missing values
# Strategy: fill with mean, median, or a constant
df['score'] = df['score'].fillna(df['score'].mean())
df['email'] = df['email'].fillna('unknown@example.com')

print("\n--- Final Cleaned Data ---")
print(df)

# 5. Query filtering (fast SQL-like syntax)
top_performers = df.query('score > 80')
print("\n--- Top Performers ---")
print(top_performers)
```

2. Run the script:
```bash
python3 examples/cleaning_lab.py
```
Expected Result
```
--- Cleaned Strings ---
      name              email
0    Alice    alice@gmail.com
1      Bob    bob@outlook.com
2  Charlie  charlie@gmail.com
3    David                NaN

--- Final Cleaned Data ---
      name                email  age      score
0    Alice      alice@gmail.com   25  85.000000
1      Bob      bob@outlook.com    0  90.000000
2  Charlie    charlie@gmail.com   30  83.333333
3    David  unknown@example.com  100  75.000000
```
Why It Works
- .str accessor: Provides vectorized string methods. If you have 1M names, str.strip() cleans them all at once.
- .fillna(): Safely replaces NaN with values. Using the mean is a common data science technique called “imputation.”
- .clip(): Fast way to bound values without writing if/else logic.

These steps chain naturally into a reusable pipeline, as sketched below.
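A minimal pipeline sketch under the same assumptions as the lab data (the function names are illustrative), using .pipe() to chain the cleaning steps:

```python
import pandas as pd
import numpy as np

def clean_strings(df: pd.DataFrame) -> pd.DataFrame:
    # Normalize whitespace and casing without row-by-row loops
    return df.assign(
        name=df['name'].str.strip().str.title(),
        email=df['email'].str.lower(),
    )

def handle_missing(df: pd.DataFrame) -> pd.DataFrame:
    # Impute numeric gaps with the mean, fill unknown emails with a constant
    return df.assign(
        score=df['score'].fillna(df['score'].mean()),
        email=df['email'].fillna('unknown@example.com'),
    )

raw = pd.DataFrame({
    'name': [' ALICE ', 'bob'],
    'email': ['alice@GMAIL.com', np.nan],
    'score': [85, np.nan],
})

# Each .pipe() call passes the DataFrame to the next cleaning function
cleaned = raw.pipe(clean_strings).pipe(handle_missing)
print(cleaned)
```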
Troubleshooting
Problem: .str operations failing on numeric columns
Cause: You tried to use string methods on an integer or float column.
Solution: Ensure column type is string or convert first: df['col'].astype(str).str.upper().
Step 5: Performance Optimization - pandas vs PHP (~15 min)
Benchmark a large operation to understand why pandas is the industry standard for data science.
Actions
1. Create the benchmark script (examples/benchmark_tax.py):
Imagine we want to calculate the 15% tax on 1,000,000 transaction amounts.
```python
import pandas as pd
import numpy as np
import time

# Create 1M rows
size = 1_000_000
amounts = np.random.uniform(10, 500, size)
df = pd.DataFrame({'amount': amounts})

# 1. Benchmarking pandas (vectorized)
start = time.time()
df['tax'] = df['amount'] * 0.15
print(f"pandas (Vectorized) time: {(time.time() - start)*1000:.2f}ms")

# 2. Benchmarking a plain Python loop (similar to PHP speed)
start = time.time()
taxes = []
for a in amounts:
    taxes.append(a * 0.15)
print(f"Python (Loop) time: {(time.time() - start)*1000:.2f}ms")
```

2. Run the benchmark:
```bash
python3 examples/benchmark_tax.py
```
Expected Result
```
pandas (Vectorized) time: 2.10ms
Python (Loop) time: 85.40ms
```
(Note: PHP 8.4 loops would land in the same ballpark as the Python loop, often a bit faster thanks to the JIT, but still far slower than the vectorized C routines. Exact timings vary by machine.)
Why It Works
Pandas and NumPy don’t perform the math in the high-level interpreter. They hand a pointer to the entire data block to compiled C loops (and, for linear-algebra operations, to optimized BLAS/LAPACK routines), so the per-element overhead of the interpreter disappears.
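The same principle extends to conditional logic. A minimal sketch (the tax rates and column name are illustrative) that replaces an if/else loop with np.where:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'amount': [120.0, 30.0, 450.0, 80.0]})

# Vectorized if/else: 20% tax over 100, otherwise 10%
# Equivalent to a PHP foreach with if/else, but computed in one pass in C
df['tax'] = np.where(df['amount'] > 100, df['amount'] * 0.20, df['amount'] * 0.10)
print(df)
```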
When to use pandas vs PHP:
- Use PHP: For application logic, routing, UI, and simple data validation.
- Use pandas: For heavy data transformations, aggregations on 10k+ rows, and statistical cleaning.
Exercises
Section titled “Exercises”Exercise 1: The Churn Analyzer
Goal: Join user data with event logs to identify “at risk” users.
Requirements:
- Create a users DataFrame (id, signup_date).
- Create a logins DataFrame (id, user_id, login_date) with multiple logins per user.
- Perform a Left Join.
- Group by user_id and find the max() (latest) login date.
- Filter for users whose latest login was more than 30 days ago.
Validation: Your final DataFrame should only contain users who haven’t logged in for 30+ days.
Exercise 2: Reshaping Sales Data
Goal: Convert a wide dataset into a long one for a chart.
Requirements:
- Create a “Wide” DataFrame with columns: product, sales_jan, sales_feb, sales_mar.
- Use melt to turn it into: product, month, sales.
- Filter out any rows where sales is 0 or NaN.
Validation: The resulting DataFrame should be tidy and ready for a line chart.
Exercise 3: Cleaning Server Logs
Goal: Process a messy log string into a structured DataFrame.
Requirements:
- Create an array of 5 strings in the format: [2025-01-01] ERROR: database connection failed.
- Use .str.extract() or string slicing to split into date, level, and message.
- Convert date to datetime objects.
- Filter for only ERROR-level logs.
Validation: Print the final cleaned DataFrame.
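If you get stuck, here is one possible starting point, not the full solution (the sample log lines and the regex are just one way to structure it):

```python
import pandas as pd

logs = pd.Series([
    '[2025-01-01] ERROR: database connection failed',
    '[2025-01-01] INFO: cache warmed',
])

# Named capture groups become columns: date, level, message
parsed = logs.str.extract(r'\[(?P<date>[\d-]+)\] (?P<level>\w+): (?P<message>.+)')
parsed['date'] = pd.to_datetime(parsed['date'])

errors = parsed[parsed['level'] == 'ERROR']
print(errors)
```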
Wrap-up
Section titled “Wrap-up”What You’ve Learned
In this chapter, you moved beyond basic Python and mastered the primary tools of the data scientist:
- Relational Data: Using merge to perform complex joins that would be nightmarish nested loops in PHP.
- Reshaping: Transforming data layouts using pivot_table and melt to see patterns from different angles.
- Temporal Analysis: Mastering the DatetimeIndex and resample to identify trends over time.
- Vectorized Cleaning: Cleaning millions of rows of strings and numbers instantly using .str accessors and NumPy math.
- Handling “Messy” Data: Strategically managing NaN values and outliers to ensure data quality.
- The Performance Gap: Understanding why vectorized operations are the only choice at modern data scale.
What You’ve Built
You now have a production-ready toolkit:
- Merging Pipeline: A repeatable way to combine business entities.
- Reshaping Suite: Logic for pivoting financial or analytical data.
- Time Series Engine: A template for processing logs and trend data.
- Cleaning Pipeline: A reusable workflow for preparing raw data for ML.
Key Data Wrangling Principles
1. Tidy Data is Happy Data
Each variable is a column, each observation is a row. Use melt to get there.
2. Don’t Loop, Vectorize
If you find yourself writing for index, row in df.iterrows():, stop and look for a pandas function instead.
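A quick before/after illustration (the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({'price': [10.0, 25.0, 40.0], 'qty': [2, 1, 3]})

# Slow: row by row, like a PHP foreach
totals = []
for _, row in df.iterrows():
    totals.append(row['price'] * row['qty'])
df['total_loop'] = totals

# Fast: one vectorized expression over whole columns
df['total'] = df['price'] * df['qty']
print(df)
```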
3. Indexing is Your Friend
Use meaningful indexes (like user IDs or dates) to make your code more readable and your lookups faster.
4. Data Cleaning is 80% of the Job
Machine learning models are “Garbage In, Garbage Out.” Your wrangling skills are the foundation of all AI success.
Connection to Data Science Workflow
You are now a high-performance data engineer:
- ✅ Chapters 1-12: Built end-to-end data systems using PHP.
- ✅ Chapter 13: Mastered Python syntax foundations.
- ✅ Chapter 14: Mastered high-performance Data Wrangling ← You are here
- ➡️ Chapter 15: Moving into rigorous Statistical Analysis with SciPy.
Next Steps
Immediate Practice:
- Find a complex SQL query from your app and try to replicate the same logic using pandas merge and groupby.
- Experiment with the df.plot() method to see how pandas integrates with Matplotlib for quick charts.
- Look into pandas-profiling (now published as ydata-profiling) for automated data reports.
Chapter 15 Preview:
In the next chapter, we’ll move from “cleaning” to “proving” using Advanced Statistical Analysis with SciPy and statsmodels. You’ll learn:
- Hypothesis testing (T-tests, p-values) at scale
- Normality tests and data distributions
- Correlation vs. Causation analysis
- Building Ordinary Least Squares (OLS) regression models
- Interpreting statistical summaries
You’ll gain the mathematical rigor needed to back up your data visualizations with hard evidence!
Further Reading
- Pandas Documentation: Merge, Join, Concatenate — The definitive join guide.
- Python for Data Analysis (3rd Edition) — Specifically Chapter 7 (Data Cleaning) and 8 (Wrangling).
- NumPy Vectorization Explained — Why it’s so much faster.
- Pandas Time Series / Date functionality — Master date frequencies.
::: tip Next Chapter
Continue to Chapter 15: Advanced Statistical Analysis with SciPy and statsmodels to add rigorous statistical methods!
:::