13: Python Fundamentals for Data Science
Syntax Comparison
Python and PHP share many concepts but have different “flavors.” The biggest syntactic difference is that Python uses indentation, not curly braces {}, to delimit blocks.
| Feature | PHP | Python |
|---|---|---|
| Variables | $x = 10; | x = 10 |
| Constants | define('PI', 3.14); | PI = 3.14 (convention) |
| Arrays | $arr = [1, 2, 3]; | arr = [1, 2, 3] (List) |
| Assoc Arrays | ['a' => 1] | {'a': 1} (Dictionary) |
| Comments | // or /* */ | # (""" """ is a string literal, used for docstrings) |
| Concatenation | "Hi " . $name | "Hi " + name or f"Hi {name}" |
| Loops | foreach ($arr as $v) {} | for v in arr: |
| Functions | function name() {} | def name(): |
| Class Prop | public string $name; | self.name = name (in __init__) |
| Null/None | null | None |
| Boolean | true / false | True / False (Capitalized!) |
Actions
1. Create a Python/PHP comparison script (examples/python_vs_php.py):

```python
# 1. Variables and Data Types
name = "PHP Developer"
age = 30
is_ready = True
price = 19.99

# 2. Lists (PHP Indexed Arrays)
fruits = ["apple", "banana", "cherry"]
fruits.append("date")  # $fruits[] = 'date';
print(f"Second fruit: {fruits[1]}")  # Indices start at 0

# 3. Dictionaries (PHP Associative Arrays)
user = {
    "name": "Dale",
    "role": "Developer",
    "skills": ["PHP", "SQL", "Python"]
}
print(f"User Name: {user['name']}")

# 4. Control Structures (Note the indentation!)
if age >= 18:
    print(f"{name} is an adult")  # f-string like "{$name}"
elif age > 13:
    print(f"{name} is a teen")  # PHP: elseif
else:
    print(f"{name} is a minor")

# 5. Loops
print("\nLooping through fruits with index:")
for i, fruit in enumerate(fruits):  # PHP: foreach($fruits as $i => $fruit)
    print(f"  {i}: {fruit}")

# 6. List Comprehensions (Powerful Python idiom!)
# PHP: $squared = array_map(fn($x) => $x * $x, [1, 2, 3, 4, 5]);
numbers = [1, 2, 3, 4, 5]
squared = [x * x for x in numbers]
print(f"\nSquared numbers: {squared}")

# Filtering with list comprehensions
# PHP: $evens = array_filter($numbers, fn($x) => $x % 2 == 0);
evens = [x for x in numbers if x % 2 == 0]
print(f"Even numbers: {evens}")

# 7. Functions and Type Hinting (Modern Python)
def calculate_total(price: float, tax_rate: float = 0.05) -> float:
    """Calculates total with tax (Python uses -> for return types)"""
    return price * (1 + tax_rate)

final_price = calculate_total(100.0, 0.2)
print(f"Final price: ${final_price:.2f}")

# 8. Classes (Briefly)
class DataAnalyzer:
    def __init__(self, data_source):  # Constructor (__construct)
        self.data_source = data_source
        self.is_processed = False

    def process(self):
        print(f"Processing data from {self.data_source}...")
        self.is_processed = True

analyzer = DataAnalyzer("sales.csv")
analyzer.process()
```

2. Run the script:

```bash
python3 examples/python_vs_php.py
```

Expected Result

```
Second fruit: banana
User Name: Dale
PHP Developer is an adult

Looping through fruits with index:
  0: apple
  1: banana
  2: cherry
  3: date

Squared numbers: [1, 4, 9, 16, 25]
Even numbers: [2, 4]
Final price: $120.00
Processing data from sales.csv...
```

Why It Works
Python is dynamically typed like PHP, but it is also strongly typed: it will not silently convert between types such as str and int the way PHP's type juggling does.
- Self instead of $this: Instance methods receive the object explicitly as their first parameter, named self by convention.
- Dunder Methods: Magic methods like __init__ (constructor) or __str__ (PHP's __toString) are prefixed and suffixed with double underscores (“Double UNDERscore”).
- Enumerate: Python’s for loop doesn’t give you the index the way PHP’s foreach ($arr as $k => $v) does. Use enumerate() to get both.
- List Comprehensions: These are the “soul” of idiomatic Python. They are usually somewhat faster than an equivalent for loop that appends to a list, and more readable than array_map/array_filter once you learn the syntax.
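To make the self, dunder-method, and enumerate points concrete, here is a minimal sketch. The Product class and its fields are hypothetical, invented for illustration only; __str__ plays the role of PHP's __toString.

```python
class Product:
    """Hypothetical example class, not part of the chapter's scripts."""

    def __init__(self, name: str, price: float):
        # self is the instance, passed explicitly (PHP: $this)
        self.name = name
        self.price = price

    def __str__(self) -> str:
        # Called by str(), print(), and f-strings, like PHP's __toString()
        return f"{self.name} (${self.price:.2f})"


products = [Product("Mouse", 25), Product("Laptop", 1200)]

# enumerate() yields (index, item) pairs; start=1 makes the index 1-based
lines = [f"{i}. {p}" for i, p in enumerate(products, start=1)]
print("\n".join(lines))
```

Note that the list comprehension and enumerate() combine naturally, which is a common replacement for a foreach with a manual counter.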
Troubleshooting
Problem: IndentationError: expected an indented block
Cause: The block after a colon : is not indented, or its indentation is inconsistent (mixing tabs and spaces raises the related TabError).
Solution: Indent consistently; 4 spaces per level is the convention. Most editors (including Cursor) do this automatically.
Problem: TypeError: can only concatenate str (not "int") to str
Cause: PHP converts types automatically ("Age: " . 30 works). Python does not.
Solution: Use f-strings f"Age: {age}" or explicit conversion str(age).
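A short sketch of the failure and both fixes. The variable names are illustrative, not taken from the chapter's scripts.

```python
age = 30

try:
    message = "Age: " + age  # str + int is not allowed in Python
except TypeError as exc:
    message = None
    error_text = str(exc)

# Fix 1: an f-string formats the value for you
with_fstring = f"Age: {age}"

# Fix 2: convert explicitly before concatenating
with_str = "Age: " + str(age)

print(error_text)
print(with_fstring, "|", with_str)
```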
Problem: NameError: name 'x' is not defined
Cause: Using a variable before assigning it. Unlike PHP, Python does not treat an undefined variable as null with a warning; it raises an exception.
Solution: Assign the variable (for example, x = None) before you read it.
Step 2: Mastering NumPy Arrays (~20 min)
Use NumPy for numerical operations that are significantly faster and more concise than PHP array loops.
What is NumPy?
NumPy (Numerical Python) is the foundation of data science in Python. PHP arrays are flexible (allowing mixed types) but slow for large math operations. NumPy arrays, by contrast, are:
- Vectorized: Perform math on entire arrays at once (no loops!)
- Fixed-Type: Highly optimized memory usage (C-backend)
- Multi-Dimensional: Easy matrix operations
Actions
1. Create a NumPy experimentation script (examples/numpy_lab.py):

```python
import numpy as np
import time

# 1. Creating Arrays
# PHP: $arr = range(1, 5);
arr = np.array([1, 2, 3, 4, 5])
print(f"NumPy Array: {arr}")
print(f"Data type: {arr.dtype}")  # int64 or similar

# 2. Vectorized Operations (The "Magic")
# PHP: $multiplied = array_map(fn($x) => $x * 10, $arr);
multiplied = arr * 10
print(f"Multiplied (* 10): {multiplied}")

added = arr + 5
print(f"Added (+ 5): {added}")

# 3. Indexing and Slicing (Powerful!)
# PHP: array_slice($arr, 1, 2);
print(f"\nFirst 3 elements: {arr[:3]}")  # [1 2 3]
print(f"Last 2 elements: {arr[-2:]}")  # [4 5]
print(f"Elements from index 1 to 3: {arr[1:4]}")  # [2 3 4]

# 4. Multi-Dimensional Arrays (Matrices)
matrix = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
print(f"\nMatrix:\n{matrix}")
print(f"Matrix Shape: {matrix.shape}")  # (2, 3)
print(f"Sum of columns: {matrix.sum(axis=0)}")  # [5 7 9]
print(f"Mean of matrix: {matrix.mean()}")

# 5. Masking (Conditional selection)
# PHP: array_filter logic
mask = arr > 3
print(f"\nMask: {mask}")  # [False False False  True  True]
print(f"Values > 3: {arr[mask]}")  # [4 5]

# 6. Performance Comparison: PHP-style Loop vs NumPy
size = 1_000_000
data = list(range(size))  # Standard Python list (like PHP array)
np_data = np.array(data)  # NumPy array

# Standard loop speed
start = time.time()
loop_result = [x * 2 for x in data]
print(f"\nStandard Loop time: {(time.time() - start)*1000:.2f}ms")

# NumPy vectorized speed
start = time.time()
np_result = np_data * 2
print(f"NumPy Vectorized time: {(time.time() - start)*1000:.2f}ms")
```

2. Run the script:

```bash
python3 examples/numpy_lab.py
```

Expected Result

```
NumPy Array: [1 2 3 4 5]
Data type: int64
Multiplied (* 10): [10 20 30 40 50]
Added (+ 5): [ 6  7  8  9 10]

First 3 elements: [1 2 3]
Last 2 elements: [4 5]
Elements from index 1 to 3: [2 3 4]

Matrix:
[[1 2 3]
 [4 5 6]]
Matrix Shape: (2, 3)
Sum of columns: [5 7 9]
Mean of matrix: 3.5

Mask: [False False False  True  True]
Values > 3: [4 5]

Standard Loop time: 45.20ms
NumPy Vectorized time: 1.15ms
```

The dtype can differ by platform, and exact timings vary by machine.

Why It Works
NumPy uses vectorization. Instead of iterating over elements one by one in the Python interpreter (slow), it passes the entire operation to optimized C/C++ routines (fast).
Key Concepts:
- Slicing: The arr[start:stop:step] syntax allows complex subsetting without array_slice.
- Masking: You can use a boolean array to index another array. This is the basis for advanced data filtering in Python.
- Axes: In a 2-D array, axis=0 runs down the rows, so matrix.sum(axis=0) collapses the rows and returns one value per column. axis=1 runs across the columns and returns one value per row.
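The three concepts above can be sketched in a few lines. The arrays are small made-up examples so the results are easy to check by hand.

```python
import numpy as np

matrix = np.array([
    [1, 2, 3],
    [4, 5, 6],
])

col_sums = matrix.sum(axis=0)  # collapse rows -> one value per column
row_sums = matrix.sum(axis=1)  # collapse columns -> one value per row

arr = np.arange(10)            # integers 0 through 9
every_other = arr[1:8:2]       # start=1, stop=8 (exclusive), step=2
reversed_arr = arr[::-1]       # a negative step walks backwards

# Masks combine with & (and) and | (or); each condition needs parentheses
in_range = arr[(arr > 2) & (arr < 7)]

print(col_sums, row_sums)
print(every_other, in_range)
```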
Troubleshooting
Problem: ValueError: operands could not be broadcast together
Cause: You’re trying to add or multiply two arrays with incompatible shapes (e.g., a 2x2 matrix and a 3x3 matrix).
Solution: Check shapes with arr.shape. NumPy can “broadcast” a smaller array across a larger one when, comparing the shapes from the last dimension backwards, each pair of dimensions is equal or one of them is 1.
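As a rough illustration of that rule, the sketch below shows two shape pairs that broadcast and one that does not. The arrays are invented for the example.

```python
import numpy as np

matrix = np.ones((2, 3))           # shape (2, 3)
row = np.array([10, 20, 30])       # shape (3,), treated as (1, 3)
column = np.array([[1], [2]])      # shape (2, 1)

# Compatible: each trailing dimension pair is equal or contains a 1
plus_row = matrix + row            # result shape (2, 3)
plus_column = matrix + column      # result shape (2, 3)

# Incompatible: (2, 3) vs (2,) -> trailing dimensions 3 and 2 differ
try:
    matrix + np.array([1, 2])
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

print(plus_row.shape, plus_column.shape, broadcast_failed)
```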
Problem: My math results are integers instead of floats
Cause: NumPy infers the type from your input. If you provide all integers, it uses an integer array.
Solution: Specify the type with dtype=float when creating the array, or use a decimal point (e.g., np.array([1., 2.])).
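A small sketch of why the inferred type matters and two ways to get floats. The values are arbitrary.

```python
import numpy as np

ints = np.array([1, 2, 3])
print(ints.dtype.kind)            # 'i' means an integer dtype

# Assigning a float into an integer array silently drops the fraction
ints[0] = 9.7
print(ints[0])                    # 9

# Option 1: ask for floats when creating the array
floats = np.array([1, 2, 3], dtype=float)
floats[0] = 9.7

# Option 2: convert an existing array (astype returns a new array)
converted = ints.astype(float)

print(floats.dtype, converted.dtype)
```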
Step 3: Tabular Data with pandas (~20 min)
Use pandas DataFrames to manipulate tabular data: think associative arrays on steroids with SQL-like powers.
What is pandas?
pandas is the “Excel for Python.” It introduces the DataFrame, a two-dimensional labeled data structure with columns of potentially different types.
- PHP Equivalent: An array of associative arrays (like PDO results).
- Pandas Strength: Built-in filtering, grouping, merging, and time-series handling.
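To connect the PHP equivalent to pandas, the sketch below builds a DataFrame from a list of dicts, which is the shape a PDO fetch in associative mode would give you after json_encode. The rows are made-up sample data.

```python
import pandas as pd

# Shaped like an array of associative arrays (hypothetical rows)
rows = [
    {"id": 1, "name": "Dale", "total": 120.5},
    {"id": 2, "name": "Ana", "total": 80.0},
    {"id": 3, "name": "Raj", "total": 310.0},
]

df = pd.DataFrame(rows)            # dict keys become column names

# Filter on a whole column at once instead of looping over rows
big_spenders = df[df["total"] > 100]["name"].tolist()

# And back again: a list of dicts, ready to serialize as JSON
records = df.to_dict(orient="records")

print(df.shape, big_spenders)
```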
Actions
1. Create a pandas data processing script (examples/pandas_lab.py):

```python
import pandas as pd
import numpy as np

# 1. Creating a DataFrame from scratch
# PHP: $data = [['name' => 'A', 'sales' => 100], ...]
data = {
    'product': ['Laptop', 'Mouse', 'Monitor', 'Keyboard', 'Webcam', 'Microphone'],
    'category': ['Electronics', 'Peripherals', 'Electronics', 'Peripherals', 'Peripherals', 'Audio'],
    'sales': [1200, 25, 300, 75, 150, np.nan],  # Missing value!
    'stock': [15, 100, 8, 40, 25, 10]
}

df = pd.DataFrame(data)

# 2. Handling Missing Data (Crucial!)
# PHP devs often forget to check for nulls in math
print("--- Check for Nulls ---")
print(df.isnull().sum())

# Fill missing sales with the mean
df['sales'] = df['sales'].fillna(df['sales'].mean())
print("\n--- DataFrame after filling NaNs ---")
print(df)

# 3. Advanced Filtering (Like SQL WHERE)
# PHP: complex array_filter
high_value_peripherals = df[(df['category'] == 'Peripherals') & (df['sales'] > 50)]
print("\n--- High Value Peripherals ---")
print(high_value_peripherals)

# 4. Feature Engineering
df['revenue'] = df['sales'] * 0.8
df['status'] = np.where(df['stock'] < 20, 'LOW', 'OK')
print("\n--- Feature Engineering Result ---")
print(df[['product', 'stock', 'status']])

# 5. Grouping and Aggregation (Powerful!)
# PHP: usually requires nested loops and array_reduce
summary = df.groupby('category').agg({
    'sales': ['sum', 'mean'],
    'stock': 'min'
})
print("\n--- Category Summary ---")
print(summary)

# 6. Merging DataFrames (Like SQL JOIN)
inventory_data = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Monitor'],
    'supplier': ['Dell', 'Logitech', 'Samsung']
})

# Left Join
merged_df = df.merge(inventory_data, on='product', how='left')
print("\n--- Merged (Joined) Data ---")
print(merged_df[['product', 'category', 'supplier']])
```

2. Run the script:

```bash
python3 examples/pandas_lab.py
```

Expected Result

```
--- Check for Nulls ---
product     0
category    0
sales       1
stock       0

--- DataFrame after filling NaNs ---
      product     category   sales  stock
0      Laptop  Electronics  1200.0     15
...
5  Microphone        Audio   350.0     10

--- High Value Peripherals ---
    product     category  sales  stock
3  Keyboard  Peripherals   75.0     40
4    Webcam  Peripherals  150.0     25

--- Category Summary ---
              sales        stock
                sum   mean   min
category
Audio         350.0  350.0    10
Electronics  1500.0  750.0     8
Peripherals   250.0   83.3    25
```

Why It Works
Pandas DataFrames are built on NumPy but optimized for heterogeneous data (different types per column).
Key Skills for PHP Developers:
- NaN (Not a Number): pandas represents missing numerical data as np.nan. It has built-in tools (fillna, dropna) to handle this, which is much safer than relying on how PHP treats null in arithmetic.
- Aggregation: The .groupby().agg() pattern replaces dozens of lines of PHP loops. You can calculate sums, counts, and averages for multiple columns in one go.
- Merging: df.merge() works like a SQL JOIN. You don’t need to manually map IDs between two arrays.
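The sketch below illustrates the NaN and merging points on a tiny made-up dataset: aggregations skip NaN by default, dropna and fillna are two explicit ways to deal with it, and a left join leaves NaN where no match exists.

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    "product": ["Laptop", "Mouse", "Webcam"],
    "sales": [1200.0, np.nan, 150.0],
})
suppliers = pd.DataFrame({
    "product": ["Laptop", "Mouse"],
    "supplier": ["Dell", "Logitech"],
})

# Aggregations skip NaN by default: (1200 + 150) / 2
mean_sales = sales["sales"].mean()

dropped = sales.dropna(subset=["sales"])      # remove incomplete rows
filled = sales.fillna({"sales": 0.0})         # or substitute a value

# Left join: every row of sales is kept; unmatched suppliers become NaN
merged = sales.merge(suppliers, on="product", how="left")
missing_supplier = merged[merged["supplier"].isna()]["product"].tolist()

print(mean_sales, len(dropped), filled["sales"].sum(), missing_supplier)
```

Whether to drop or fill depends on the analysis; filling with 0 changes the mean, while dropping changes the row count.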
Troubleshooting
Problem: SettingWithCopyWarning
Cause: You’re assigning to a subset of a DataFrame, and pandas cannot tell whether that subset is a view of the original or an independent copy, so the change may not land where you expect.
Solution: Use .copy() when creating a subset you intend to modify, or use .loc[row_indexer, col_indexer] on the original DataFrame to modify data explicitly.
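Both patterns from the solution, sketched on a small invented DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["Laptop", "Mouse", "Monitor"],
    "stock": [15, 100, 8],
})

# Pattern 1: an explicit, independent copy that is safe to modify
low_stock = df[df["stock"] < 20].copy()
low_stock["status"] = "LOW"        # changes low_stock only

# Pattern 2: modify the original in one step with .loc[rows, column]
df.loc[df["stock"] < 20, "stock"] = 20

print(low_stock)
print(df)
```

The copy keeps its original stock values and gains a status column, while the original DataFrame is updated in place and has no status column.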
Problem: Performance is slow on a loop
Cause: You iterated row by row with .iterrows(), which runs in the Python interpreter and builds a Series object for every row.
Solution: Use Vectorized Methods. Almost any row-wise operation can be done with pandas/NumPy functions (e.g., df['a'] + df['b']).
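For comparison, here are both approaches producing the same result on a small made-up DataFrame. On three rows the difference is invisible; it grows with the row count.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Slow pattern: one Python-level iteration per row
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row["price"] * row["qty"])

# Vectorized pattern: one expression over whole columns
df["total"] = df["price"] * df["qty"]

print(df["total"].tolist() == totals_loop)
```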
Step 4: Interactive Data Science with Jupyter (~15 min)
Set up and use Jupyter Notebooks for interactive data exploration and visualization.
Actions
1. Start Jupyter Lab:

```bash
jupyter lab
```

2. Create a Visual Analysis Notebook: In a new cell, perform basic visualization using Matplotlib:

```python
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Create data
categories = ['Organic', 'Paid', 'Social', 'Referral']
visitors = [4500, 2300, 1800, 1200]

# 1. Bar Chart
plt.figure(figsize=(10, 6))
plt.bar(categories, visitors, color=['skyblue', 'orange', 'green', 'red'])
plt.title('Website Visitors by Source')
plt.ylabel('Number of Visitors')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# 2. Scatter Plot with NumPy
x = np.random.randn(100)
y = x + np.random.randn(100) * 0.5

plt.scatter(x, y, alpha=0.5)
plt.title('Correlation Analysis')
plt.show()
```

3. Interactive Dataframes:
Type df (the name of your dataframe) in a cell and run it. Jupyter renders it as a clean HTML table automatically.
Why It Works
Jupyter is the “IDE” for data science. It allows you to:
- Visualize Instantly: No need to refresh a browser or generate image files.
- Document as you go: Use Markdown cells to explain your methodology to stakeholders.
- Save State: Unlike a PHP script that restarts from zero every run, Jupyter keeps your variables in memory. You can spend 10 minutes loading a massive CSV, then spend hours analyzing it without reloading.
Troubleshooting
Problem: Plot doesn’t show up
Cause: Missing %matplotlib inline (for older versions) or plt.show().
Solution: Use plt.show() at the end of your plotting code.
Problem: Jupyter can’t find my virtual environment
Cause: Jupyter is running from your global Python install instead of your venv.
Solution: Install ipykernel in your venv and register the venv as a kernel:

```bash
pip install ipykernel
python -m ipykernel install --user --name=my-data-env
```

Then select “my-data-env” in the Jupyter “Kernel” menu.
Step 5: Hybrid Orchestration — Connecting PHP and Python (~15 min)
Learn the two primary ways to call your Python data analysis code from a PHP application.
Why Orchestrate?
You shouldn’t rewrite your entire PHP app in Python. Instead, keep your Business Logic in PHP and your Heavy Lifting in Python.
Actions
1. Strategy A: Command Line Integration (Simple)
Use PHP’s proc_open to pass data to a Python script and receive JSON back.

```php
<?php
declare(strict_types=1);

$data = [10, 20, 30, 40, 50];
$json_input = json_encode($data);

// Call Python script
$process = proc_open('python3 examples/process_data.py', [
    0 => ['pipe', 'r'], // stdin
    1 => ['pipe', 'w'], // stdout
], $pipes);

if (is_resource($process)) {
    fwrite($pipes[0], $json_input);
    fclose($pipes[0]);

    $output = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    proc_close($process);

    $result = json_decode($output, true);
    echo "PHP received from Python: " . $result['mean'] . "\n";
}
```

The Python side, examples/process_data.py (the path the PHP code calls):

```python
import sys
import json
import numpy as np

# Read from PHP's stdin
input_data = json.load(sys.stdin)
arr = np.array(input_data)

# Perform heavy math
result = {
    "mean": float(arr.mean()),
    "sum": int(arr.sum())
}

# Output to PHP's stdout
print(json.dumps(result))
```

2. Strategy B: Microservice Integration (Robust)
Expose your pandas analysis via a small Flask or FastAPI service (as seen in Chapter 9).
Why It Works
- Loose Coupling: Your PHP app doesn’t care how the Python code works, as long as the JSON matches the schema.
- Performance: You only pay the “Python tax” when doing complex analysis.
- Scalability: You can eventually move the Python microservice to its own server/container if it becomes a bottleneck.
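Loose coupling only holds if the Python side returns parseable JSON even when the input is bad. The sketch below is one possible way to guarantee that. The summarize function and the {"ok": ..., "error": ...} envelope are hypothetical conventions, not part of the chapter's scripts, and the example uses only the standard library so it runs without NumPy.

```python
import json
from statistics import fmean


def summarize(raw_json: str) -> str:
    """Return a JSON string for any input, valid or not.

    The "ok"/"error" envelope is an assumed convention for illustration.
    """
    try:
        values = json.loads(raw_json)
        if not isinstance(values, list) or not values:
            raise ValueError("expected a non-empty JSON array of numbers")
        numbers = [float(v) for v in values]
    except (ValueError, TypeError) as exc:
        # json.JSONDecodeError is a subclass of ValueError
        return json.dumps({"ok": False, "error": str(exc)})

    return json.dumps({
        "ok": True,
        "mean": fmean(numbers),
        "sum": sum(numbers),
    })


print(summarize("[10, 20, 30, 40, 50]"))
print(summarize('{"not": "a list"}'))
```

With a contract like this, the PHP caller can check the ok field before reading mean, instead of guessing why json_decode returned null.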
Troubleshooting
Problem: Python script can’t find libraries when called from PHP
Cause: PHP is using the global Python path instead of your virtual environment.
Solution: Use the full path to the python executable in your venv:

```php
$venv_python = __DIR__ . '/venv/bin/python';
$command = "{$venv_python} my_script.py";
```

Exercises
Exercise 1: The Idiomatic Transformer
Goal: Convert complex PHP logic into a concise Python list comprehension.
Requirement: Given a list of transactions, calculate the tax (15%) for all “Approved” transactions over $100.

```php
$transactions = [
    ['id' => 1, 'amount' => 150, 'status' => 'Approved'],
    ['id' => 2, 'amount' => 50, 'status' => 'Approved'],
    ['id' => 3, 'amount' => 200, 'status' => 'Pending'],
    ['id' => 4, 'amount' => 300, 'status' => 'Approved']
];

$taxes = [];
foreach ($transactions as $t) {
    if ($t['status'] === 'Approved' && $t['amount'] > 100) {
        $taxes[] = $t['amount'] * 0.15;
    }
}
```

Validation: Your Python code should be a single line (excluding the list creation).
Exercise 2: Anomaly Detection with NumPy
Goal: Use NumPy’s speed to find outliers in a dataset.
Requirement:
- Generate an array of 5,000 random numbers from a normal distribution (np.random.normal).
- Calculate the Mean and Standard Deviation.
- Use Boolean Masking to find all values that are more than 3 standard deviations away from the mean (these are the “outliers”).
- Print the percentage of the data that are outliers.
Validation: You should use zero for loops.
Exercise 3: The Customer Segmenter
Goal: Use pandas to perform business segmentation.
Requirement:
- Create a DataFrame from a dictionary of 10 customers with columns: name, total_spend, last_purchase_days_ago.
- Create a new column segment using np.where or .apply():
  - “VIP” if spend > 500 AND last purchase < 30 days ago.
  - “At Risk” if last purchase > 90 days ago.
  - “Standard” otherwise.
- Group by segment and calculate the average total_spend.
Validation: Print the resulting summary table.
Wrap-up
What You’ve Learned
In this chapter, you bridged the gap between PHP and the Python data science ecosystem:
- Python Syntax: Mastery of indentation, lists, dictionaries, and f-strings from a PHP perspective.
- Environment Management: Using venv and pip to isolate and manage project dependencies safely.
- NumPy Foundations: Understanding vectorized operations and why they outperform standard loops for math.
- Pandas Mastery: Using DataFrames to clean, filter, aggregate, and join tabular data with SQL-like efficiency.
- Functional Idioms: Replacing verbose foreach loops with powerful list comprehensions and lambda functions.
- Interactive Exploration: Leveraging Jupyter Notebooks for rapid experimentation, visualization, and documentation.
- Hybrid Thinking: Knowing when to use PHP for application logic and Python for heavy analytical lifting.
What You’ve Built
You now have a solid foundation in Python data science:
- Comparison Cheat Sheet: A reference library for translating PHP patterns to Python.
- NumPy Matrix Lab: High-performance numerical code for statistical analysis.
- Pandas Analyzer: A robust toolkit for processing structured business datasets.
- Jupyter Portfolio: An interactive document showcasing your visualization and analysis work.
- Environment Template: A professional workflow for managing Python data projects.
PHP vs Python: The Data Science View
| Task | PHP (Chapter 1-12) | Python (Chapter 13+) |
|---|---|---|
| Data Scraping/APIs | Excellent (Guzzle, Symfony) | Good, but often more complex |
| Data Cleaning | Great for strings/regex | Superior for numerical/missing data |
| Math/Stats | Good (MathPHP) | Industry standard (NumPy/SciPy) |
| Tabular Analysis | Verbose associative arrays | Intuitive DataFrames (pandas) |
| Machine Learning | Emerging (PHP-ML) | Cutting edge (scikit-learn, PyTorch) |
| Visualization | Web-based (Chart.js) | Research-grade (Matplotlib, Seaborn) |
Key Python Data Science Principles
1. Avoid Loops Like the Plague
If you’re writing a for loop to do math on an array, there’s probably a NumPy or pandas function that does it in a single line and far faster (roughly 40x in the Step 2 benchmark above).
2. Handle Missing Data Explicitly
Don’t let null values break your models. Use .fillna() or .dropna() to be intentional about your data quality.
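3. Type Hinting Matters
As a PHP developer used to strict typing in 8.4, use Python’s type hints to make your data pipelines robust and self-documenting. Unlike PHP’s type declarations, Python does not enforce hints at runtime; a static checker such as mypy does.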
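A minimal sketch of hinted pipeline code. The OrderLine record and order_total function are hypothetical names chosen for illustration. The last lines show that hints document intent but are not checked when the code runs.

```python
from dataclasses import dataclass


@dataclass
class OrderLine:
    # Hypothetical record type for illustration
    sku: str
    quantity: int
    unit_price: float


def order_total(lines: list[OrderLine], tax_rate: float = 0.05) -> float:
    """Sum quantity * unit_price over all lines, then apply tax."""
    subtotal = sum(line.quantity * line.unit_price for line in lines)
    return subtotal * (1 + tax_rate)


lines = [OrderLine("A-1", 2, 10.0), OrderLine("B-2", 1, 30.0)]
total = order_total(lines, tax_rate=0.2)
print(f"{total:.2f}")

# Hints are not enforced at runtime: this line runs without error even
# though quantity is a str. A static checker such as mypy would flag it.
wrong = OrderLine("C-3", "two", 5.0)
print(type(wrong.quantity).__name__)
```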
4. Notebooks for Research, Scripts for Production
Use Jupyter to find the right insight, then refactor that code into clean Python classes for your PHP application to orchestrate.
Connection to Data Science Workflow
You are now entering the Advanced Specialization phase of your data science journey:
- ✅ Chapter 1-12: Built end-to-end data systems using PHP.
- ✅ Chapter 13: Deepened Python skills for specialized tasks ← You are here
- ➡️ Chapter 14: Mastering advanced Data Wrangling with Python.
Next Steps
Immediate Practice:
- Take a complex data transformation from your current PHP project and try to implement it in pandas.
- Visit Kaggle and download a small dataset to explore in Jupyter.
- Explore the scikit-learn gallery to see what’s possible in the next few chapters.
Chapter 14 Preview:
In the next chapter, we’ll dive deep into Data Wrangling with pandas and NumPy. You’ll learn:
- Advanced joining and merging strategies (Inner, Outer, Right joins)
- Reshaping data with pivot_table and melt
- Working with Time Series data (resampling by day/month/year)
- Cleaning messy strings and handling outliers at scale
- Optimized cleaning pipelines for machine learning
You’ll transform raw, chaotic data into high-quality datasets ready for predictive modeling!
Further Reading
- The Python Tutorial: Official Python documentation
- NumPy Quickstart: Essential NumPy guide
- Pandas Getting Started: Core pandas concepts
- JupyterLab Documentation: Master your interactive environment
- Python for Data Analysis: The definitive guide by the creator of pandas
::: tip Next Chapter
Continue to Chapter 14: Data Wrangling with pandas and NumPy to master advanced data manipulation!
:::