
# Chapter 15: Advanced Statistical Analysis with SciPy and statsmodels

## Overview

In Chapter 7, you learned the fundamentals of statistics using PHP. You built calculators for means, medians, and basic distributions. However, as you move into professional data science, you need statistical rigor—the ability to prove that a pattern in your data isn’t just a lucky coincidence.
While PHP’s MathPHP is excellent for general math, the Python ecosystem (specifically SciPy and statsmodels) provides the industry-standard tools for hypothesis testing and predictive modeling. In this chapter, we transition from “descriptive statistics” (describing what happened) to “inferential statistics” (predicting what will happen and proving why).
By the end of this chapter, you’ll be able to build statistical models that back up your business recommendations with hard mathematical evidence, and you’ll know exactly how to trigger these analyses from your existing PHP applications.
## Prerequisites

Before starting this chapter, you should have:
- Completed Chapter 14: Data Wrangling with pandas and NumPy
- Python 3.10+ with `scipy`, `statsmodels`, `pandas`, and `seaborn` installed
- Basic understanding of p-values and distributions from Chapter 7
- Estimated Time: ~2 hours
Verify your environment:

```bash
pip install scipy statsmodels seaborn
python3 -c "import scipy; import statsmodels; print('Libraries ready!')"
```

## What You’ll Build

By the end of this chapter, you will have created:
- Significance Engine: A robust A/B testing tool that uses T-tests to prove conversion improvements.
- Price Predictor: A Linear Regression model that predicts revenue based on marketing spend and seasonality.
- Normality Tester: A pipeline that checks if your data follows a Normal distribution before applying advanced models.
- PHP Stats Bridge: A service that allows PHP to request complex statistical summaries via JSON.
## Objectives

- Perform Hypothesis Testing (T-tests, Chi-square, ANOVA) with SciPy.
- Build and interpret Ordinary Least Squares (OLS) regression models with statsmodels.
- Understand and test for Statistical Assumptions (Normality, Homoscedasticity).
- Visualize statistical relationships using Seaborn.
- Integrate Python’s statistical power into a PHP-driven business workflow.
## Step 1: Hypothesis Testing with SciPy (~30 min)

Use SciPy to perform statistical tests that determine if the difference between two datasets is “statistically significant.”
### Why It Matters

In PHP, you might calculate that Group A has a 5% conversion and Group B has 6%. But is that 1% difference real, or just noise? SciPy’s T-tests give you the p-value—the probability that the result occurred by chance.
### Actions

1. Create an A/B test significance script:

```python
import numpy as np
from scipy import stats
import pandas as pd

# 1. Setup sample A/B test data (Simulated revenue per user)
group_a = np.random.normal(loc=100, scale=15, size=500)  # Control
group_b = np.random.normal(loc=105, scale=18, size=500)  # Variation (higher mean)

print(f"Group A Mean: {group_a.mean():.2f}")
print(f"Group B Mean: {group_b.mean():.2f}")

# 2. Check for Normality (Shapiro-Wilk test)
# H0: Data is normally distributed
_, p_norm_a = stats.shapiro(group_a)
_, p_norm_b = stats.shapiro(group_b)

print(f"\nNormality p-values: A={p_norm_a:.4f}, B={p_norm_b:.4f}")
if p_norm_a > 0.05 and p_norm_b > 0.05:
    print("Assumption Met: Data is normally distributed.")

# 3. Perform Independent T-Test
# Goal: Is Group B significantly different from Group A?
t_stat, p_val = stats.ttest_ind(group_a, group_b)

print(f"\nT-Statistic: {t_stat:.4f}")
print(f"P-Value: {p_val:.4f}")

# 4. Interpretation
alpha = 0.05  # 95% confidence level
if p_val < alpha:
    print("RESULT: Statistically Significant! Reject the Null Hypothesis (H0).")
else:
    print("RESULT: Not Significant. Fail to reject the Null Hypothesis.")

# 5. Chi-Square Test (For Categorical Data like Click vs No-Click)
contingency_table = [
    [120, 880],  # Group A: 120 clicks, 880 no-clicks
    [165, 835],  # Group B: 165 clicks, 835 no-clicks
]
chi2, p_chi, _, _ = stats.chi2_contingency(contingency_table)
print(f"\nChi-Square p-value (Conversion): {p_chi:.4f}")
```

2. Run the script:
```bash
python3 examples/stats_testing.py
```

### Expected Result

```
Group A Mean: 100.12
Group B Mean: 105.45

Normality p-values: A=0.8421, B=0.1245
Assumption Met: Data is normally distributed.

T-Statistic: -5.1245
P-Value: 0.0001
RESULT: Statistically Significant! Reject the Null Hypothesis (H0).

Chi-Square p-value (Conversion): 0.0042
```

### Why It Works
- `stats.shapiro`: Tests the assumption of normality. Most statistical tests (T-tests, ANOVA) require the data to be “bell-shaped.”
- `stats.ttest_ind`: Compares the means of two independent groups. A p-value < 0.05 is the industry standard for saying “this result is real.”
- `stats.chi2_contingency`: Used for “Yes/No” data (like conversions). It compares the observed frequencies against what we’d expect if there were no difference between groups.
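To make that last point concrete, you can inspect the expected frequencies that the chi-square test compares against. A minimal sketch using the same contingency table as the script above:

```python
import numpy as np
from scipy import stats

# Same observed clicks / no-clicks as in the script above
observed = np.array([
    [120, 880],  # Group A
    [165, 835],  # Group B
])

# chi2_contingency also returns the expected counts under H0
# (both groups sharing the pooled click rate of 285 / 2000 = 14.25%)
chi2, p, dof, expected = stats.chi2_contingency(observed)

print(expected)        # [[142.5 857.5], [142.5 857.5]]
print(f"p = {p:.4f}")
```

The further the observed counts drift from those expected counts, the larger the chi-square statistic and the smaller the p-value.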
### Troubleshooting

**Problem**: p-value is exactly 0.0000

**Cause**: The difference is so massive that the probability of it being random is virtually zero.

**Solution**: This is a great result! It means your variation is definitively better (or worse).
**Problem**: Data is NOT normal (Shapiro p < 0.05)

**Cause**: Your data is skewed (e.g., mostly small orders plus a few $10,000 outliers).

**Solution**: Use a non-parametric test like `stats.mannwhitneyu(group_a, group_b)`. These don’t care about the distribution shape.
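A minimal sketch of that fallback. The log-normal revenue data below is invented for illustration; the point is that Mann-Whitney U compares ranks, so it makes no assumption about the distribution’s shape:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Skewed (log-normal) revenue data: Shapiro-Wilk would reject normality here,
# so a T-test's p-value would not be trustworthy
group_a = rng.lognormal(mean=4.6, sigma=0.5, size=500)
group_b = rng.lognormal(mean=4.7, sigma=0.5, size=500)

# Mann-Whitney U compares rank distributions instead of means
u_stat, p_val = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"Mann-Whitney U p-value: {p_val:.4f}")
```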
## Step 2: Linear Regression with statsmodels (~30 min)

Build a model that explains the relationship between “Marketing Spend” and “Sales Revenue” using Ordinary Least Squares (OLS).

### Why It Matters

In PHP, you might see that sales go up when you spend more. Regression tells you exactly how much sales go up for every $1 spent, and if that relationship is reliable.
### Actions

1. Create a regression modeling script:

```python
import pandas as pd
import statsmodels.api as sm
import numpy as np

# 1. Generate sample data
np.random.seed(42)
days = 100
marketing_spend = np.random.uniform(100, 1000, days)
# Sales = 500 (base) + 2.5 * spend + random noise
sales = 500 + (2.5 * marketing_spend) + np.random.normal(0, 200, days)

df = pd.DataFrame({'spend': marketing_spend, 'sales': sales})

# 2. Build the model
# Statsmodels requires adding a "constant" (the Y-intercept)
X = sm.add_constant(df['spend'])
y = df['sales']

model = sm.OLS(y, X).fit()

# 3. Print the Summary (The most important part!)
print(model.summary())

# 4. Make a prediction
new_spend = 1500
prediction = model.predict([1, new_spend])  # [Constant=1, Spend=1500]
print(f"\nPredicted Sales for ${new_spend} spend: ${prediction[0]:.2f}")
```

2. Run the script:
```bash
python3 examples/revenue_model.py
```

### Expected Result (Partial)

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.892
Model:                            OLS   Adj. R-squared:                  0.891
...
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        482.1245     45.123     10.685      0.000     392.585     571.664
spend          2.4812      0.082     30.258      0.000       2.318       2.644
==============================================================================
```

### Why It Works
Section titled “Why It Works”- R-squared (0.892): This tells you that ~89% of the variation in Sales can be explained by Marketing Spend. Closer to 1.0 is better.
- coef (spend = 2.48): This is the “slope.” For every 2.48.
- P>|t| (0.000): Since this is < 0.05, the relationship between spend and sales is statistically significant.
- Intercept (const = 482): If spend was 482.
### Troubleshooting

**Problem**: Low R-squared (e.g., 0.15)

**Cause**: Spend isn’t the only thing driving sales (missing variables) or the relationship isn’t linear.

**Solution**: Try adding more features (Multiple Regression), like “Seasonality” or “Day of Week.”
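That suggestion can be sketched with the statsmodels formula API, which adds the intercept automatically. The `is_weekend` feature and its coefficient below are invented for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: marketing spend plus a weekend flag as a second feature
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "spend": rng.uniform(100, 1000, n),
    "is_weekend": rng.integers(0, 2, n),
})
df["sales"] = 500 + 2.5 * df["spend"] + 150 * df["is_weekend"] + rng.normal(0, 200, n)

# Formula API: "dependent ~ feature1 + feature2" (constant added automatically)
model = smf.ols("sales ~ spend + is_weekend", data=df).fit()
print(model.params.round(2))
print(f"R-squared: {model.rsquared:.3f}")
```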
## Step 3: Statistical Visualization with Seaborn (~20 min)

Create publication-quality plots that visualize statistical distributions and relationships.

### Actions

1. Create a statistical visualization script:
```python
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Load a built-in dataset for demonstration
tips = sns.load_dataset("tips")

# 1. Set the visual style
sns.set_theme(style="whitegrid")

# 2. Regression Plot (Relationship + Confidence Interval)
plt.figure(figsize=(10, 6))
sns.regplot(data=tips, x="total_bill", y="tip")
plt.title("Relationship between Bill Amount and Tip")
plt.savefig("outputs/reg_plot.png")
plt.close()

# 3. Distribution Plot (Histogram + KDE)
plt.figure(figsize=(10, 6))
sns.histplot(data=tips, x="total_bill", kde=True, color="skyblue")
plt.title("Distribution of Bill Amounts")
plt.savefig("outputs/dist_plot.png")
plt.close()

# 4. Box Plot (Comparing groups with quartiles and outliers)
plt.figure(figsize=(10, 6))
sns.boxplot(data=tips, x="day", y="total_bill", hue="smoker")
plt.title("Bill Amounts by Day and Smoker Status")
plt.savefig("outputs/box_plot.png")
print("Visualizations saved to outputs/ directory.")
```

2. Run the script:

```bash
python3 examples/stats_viz.py
```

### Why It Works
- `sns.regplot`: Automatically calculates the linear regression line and shades the 95% confidence interval.
- `kde=True`: Adds a “Kernel Density Estimate”—a smooth version of the histogram that helps you see if the data is “normal.”
- Box Plots: These are essential for identifying outliers. The “whiskers” show the range, and dots outside them are outliers that might skew your PHP models.
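The whisker rule behind those outlier dots is easy to reproduce by hand: by convention, whiskers extend 1.5 × IQR beyond the quartiles, and anything outside is flagged. A small sketch with made-up bill amounts:

```python
import pandas as pd

# Made-up bill amounts with one obvious outlier
bills = pd.Series([12.5, 15.0, 18.2, 21.0, 24.5, 30.0, 95.0])

q1, q3 = bills.quantile(0.25), bills.quantile(0.75)
iqr = q3 - q1

# Box-plot convention: whiskers end at 1.5 * IQR beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = bills[(bills < lower) | (bills > upper)]

print(outliers.tolist())  # [95.0]
```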
## Step 4: Multiple Comparisons with ANOVA (~20 min)

Compare the means of more than two groups (e.g., Sales performance across 4 different regions).

### Why It Matters

Using multiple T-tests increases the chance of a “Type I Error” (false positive). ANOVA tests all groups simultaneously to see if at least one is different.
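The inflation is easy to quantify. With 4 groups there are 6 pairwise comparisons, and (assuming independent tests at α = 0.05) the chance of at least one false positive is:

```python
from math import comb

alpha = 0.05
groups = 4
comparisons = comb(groups, 2)  # 4 choose 2 = 6 pairwise T-tests

# P(at least one false positive) = 1 - P(no false positive in any test)
family_error = 1 - (1 - alpha) ** comparisons
print(f"{comparisons} comparisons -> {family_error:.1%} family-wise error")  # ~26.5%
```

ANOVA followed by a post-hoc test keeps this family-wise error controlled at the original 5%.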
### Actions

1. Create an ANOVA script:

```python
import numpy as np  # needed for np.random below
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# 1. Setup data
data = {
    'Region': ['East']*30 + ['West']*30 + ['North']*30 + ['South']*30,
    'Performance': list(np.random.normal(10, 2, 30)) + \
                   list(np.random.normal(12, 2, 30)) + \
                   list(np.random.normal(10, 2, 30)) + \
                   list(np.random.normal(15, 2, 30))
}
df = pd.DataFrame(data)

# 2. One-way ANOVA
f_stat, p_val = stats.f_oneway(
    df[df['Region'] == 'East']['Performance'],
    df[df['Region'] == 'West']['Performance'],
    df[df['Region'] == 'North']['Performance'],
    df[df['Region'] == 'South']['Performance']
)

print(f"ANOVA p-value: {p_val:.4f}")

# 3. If Significant, perform Post-hoc (Tukey's HSD) to find WHICH group is different
if p_val < 0.05:
    print("\nPerforming Post-hoc Analysis...")
    tukey = pairwise_tukeyhsd(endog=df['Performance'], groups=df['Region'], alpha=0.05)
    print(tukey)
```

2. Run the script:
```bash
python3 examples/anova_lab.py
```

### Expected Result

```
ANOVA p-value: 0.0000

Performing Post-hoc Analysis...
 Multiple Comparison of Means - Tukey HSD, FWER=0.05
=====================================================
group1 group2 meandiff  p-adj   lower   upper  reject
-----------------------------------------------------
  East  North  -0.1245  0.9000 -1.2452  0.9854  False
  East  South   4.8521  0.0010  3.6841  6.0201   True
  East   West   2.1452  0.0010  0.9854  3.3050   True
...
```

### Why It Works
Section titled “Why It Works”f_oneway: Checks if the “variance between groups” is significantly larger than the “variance within groups.”- Tukey HSD: Tells you exactly which pairs are different (e.g., South is significantly better than East, but North is not). The
reject=Truecolumn makes it easy to read.
## Step 5: PHP-Python Statistical Bridge (~20 min)

Expose these advanced statistical tests to your PHP application using a JSON-based service.

### Actions

1. Create the Python “Stat Service”:
```python
import sys
import json
import pandas as pd
from scipy import stats
import statsmodels.api as sm

def process_request():
    try:
        # Read JSON from PHP
        input_data = json.load(sys.stdin)
        task = input_data.get('task')
        data = input_data.get('payload')

        if task == 'ab_test':
            t_stat, p_val = stats.ttest_ind(data['group_a'], data['group_b'])
            result = {
                "p_value": float(p_val),
                "is_significant": bool(p_val < 0.05),
                "recommendation": ("Switch to B"
                                   if p_val < 0.05 and sum(data['group_b']) > sum(data['group_a'])
                                   else "Keep A")
            }
        elif task == 'regression':
            df = pd.DataFrame(data)
            X = sm.add_constant(df['x'])
            model = sm.OLS(df['y'], X).fit()
            result = {
                "r_squared": float(model.rsquared),
                "coefficient": float(model.params['x']),
                "p_value": float(model.pvalues['x'])
            }

        print(json.dumps({"status": "success", "result": result}))
    except Exception as e:
        print(json.dumps({"status": "error", "message": str(e)}))

if __name__ == "__main__":
    process_request()
```

2. Call from PHP:
```php
<?php
declare(strict_types=1);

$request = [
    'task' => 'ab_test',
    'payload' => [
        'group_a' => [100, 102, 98, 105, 100],
        'group_b' => [110, 115, 108, 112, 111],
    ],
];

// Call the Python engine
$process = proc_open('python3 services/stat_engine.py', [
    0 => ['pipe', 'r'], // stdin
    1 => ['pipe', 'w'], // stdout
], $pipes);

if (is_resource($process)) {
    fwrite($pipes[0], json_encode($request));
    fclose($pipes[0]);

    $response = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    proc_close($process);

    $data = json_decode($response, true);
    print_r($data['result']);
}
```

### Why It Works
You don’t need to learn the complex math libraries in PHP. By passing data via stdin/stdout and receiving JSON, you keep your PHP app lean while leveraging Python’s multi-decade head start in statistical research.
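You can smoke-test the same stdin/stdout contract without touching PHP. The sketch below writes a stripped-down stand-in for the engine to a temp file (the real service lives at `services/stat_engine.py`) and pipes the same JSON request into it:

```python
import json
import os
import subprocess
import sys
import tempfile
import textwrap

# Stripped-down stand-in that follows the same contract: JSON in, JSON out
engine_src = textwrap.dedent("""
    import sys, json
    from scipy import stats

    req = json.load(sys.stdin)
    payload = req["payload"]
    _, p_val = stats.ttest_ind(payload["group_a"], payload["group_b"])
    print(json.dumps({"status": "success",
                      "result": {"p_value": float(p_val),
                                 "is_significant": bool(p_val < 0.05)}}))
""")

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(engine_src)
    engine_path = f.name

# Same request shape the PHP snippet sends
request = {"task": "ab_test",
           "payload": {"group_a": [100, 102, 98, 105, 100],
                       "group_b": [110, 115, 108, 112, 111]}}

out = subprocess.run([sys.executable, engine_path],
                     input=json.dumps(request),
                     capture_output=True, text=True, check=True)
os.unlink(engine_path)

response = json.loads(out.stdout)
print(response)
```

Because the contract is just "JSON in, JSON out," any caller — PHP, a cron job, or a test harness like this — can drive the service the same way.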
## Exercises

### Exercise 1: The Product Review Auditor

Goal: Use a T-test to see if a new review UI actually improved ratings.
Requirements:
- Create two arrays of 200 “star ratings” (1-5).
- Group A (Old UI) Mean: 3.8. Group B (New UI) Mean: 4.1.
- Calculate the p-value.
- Use Seaborn to create a box plot comparing the two groups.
Validation: Your script should print “Feature Approved” if p < 0.05.
### Exercise 2: Ad Spend ROI Predictor

Goal: Build a Multiple Linear Regression model.
Requirements:
- Load or create a DataFrame with `ad_spend`, `social_followers`, and `total_sales`.
- Run an OLS model where `total_sales` is the dependent variable.
- Which variable has a higher impact (coefficient)?
- Is the overall model significant (F-statistic p-value)?
Validation: Print the model summary and highlight the key values.
### Exercise 3: The Outlier Detector

Goal: Combine pandas cleaning with statistical normality testing.
Requirements:
- Generate a dataset with 1,000 values from a Normal distribution.
- Manually add 10 “outliers” (e.g., values 10x larger).
- Run a Shapiro-Wilk test. (It should fail/p<0.05).
- Use pandas to filter out values more than 3 standard deviations from the mean.
- Re-run the Shapiro-Wilk test. (It should now pass/p>0.05).
Validation: Show the “Before” and “After” p-values.
## Wrap-up

### What You’ve Learned

In this chapter, you added mathematical rigor to your PHP developer toolkit:
- Hypothesis Proving: Moving from “gut feelings” to p-values and confidence intervals.
- Assumption Testing: Checking for Normality and Variance before trusting your models.
- Predictive Modeling: Using OLS Regression to understand cause-and-effect in your business data.
- Group Comparison: Using ANOVA and Tukey HSD to compare multiple regions or product categories safely.
- Statistical Viz: Creating regression and distribution plots with Seaborn that communicate insights to stakeholders.
- Integration: Bridging PHP and Python to run complex statistical audits on live production data.
### What You’ve Built

You now possess the “Scientific” part of “Data Science”:
- Significance Engine: A tool to validate A/B tests.
- OLS Regression Lab: A template for predicting outcomes like revenue or churn.
- Normality Pipeline: A utility to clean and validate dataset distributions.
- PHP-Python Stat Service: A scalable way to add advanced math to any PHP project.
### Key Statistical Principles for Developers

**1. p < 0.05 is the “Entry Fee”**
If your p-value is above 0.05, you cannot claim a result is real, no matter how big the average difference looks.

**2. Correlation != Causation**
A high R-squared means Spend and Sales move together, but it doesn’t prove Spend caused Sales. Always consider external factors (seasonality, competitor changes).
**3. Check Your Assumptions**
If you run a T-test on non-normal data, your p-value is a lie. Always use `stats.shapiro` first or stick to non-parametric tests.
**4. Keep the Constant**
Always use `sm.add_constant()` in statsmodels. Without it, you’re forcing your model through the origin (0,0), which is rarely accurate for business data.
## Connection to Data Science Workflow

You are now ready for the final step of the specialization phase:
- ✅ Chapter 1-12: Built end-to-end data systems using PHP.
- ✅ Chapter 13-14: Mastered Python syntax and high-performance wrangling.
- ✅ Chapter 15: Added Statistical Rigor and Modeling ← You are here
- ➡️ Chapter 16: Moving into Predictive Machine Learning with scikit-learn.
## Next Steps

Immediate Practice:
- Take a report from your current job and ask: “Is this difference statistically significant?” Use SciPy to find out.
- Explore the statsmodels documentation for Logistic Regression (predicting Yes/No outcomes like churn).
- Try Seaborn’s `sns.pairplot(df)` to see every statistical relationship in your dataset at once.
Chapter 16 Preview:
In the next chapter, we’ll take our regression and classification skills to the next level with Machine Learning Deep Dive with scikit-learn. You’ll learn:
- Splitting data into Training and Test sets.
- Building Random Forests and Support Vector Machines.
- Handling “overfitting” so your model works on new data.
- Hyperparameter tuning for maximum accuracy.
You’ll move from “statistically proving the past” to “predicting the future” with machine learning!
## Further Reading

- SciPy Statistical Functions — The complete reference.
- Statsmodels User Guide — Deep dive into OLS and Time Series.
- Seaborn Tutorial — Master statistical data visualization.
- The Art of Statistics by David Spiegelhalter — Excellent non-technical book on statistical thinking.
::: tip Next Chapter
Continue to Chapter 16: Machine Learning Deep Dive with scikit-learn to master the complete ML workflow!
:::