07: Model Evaluation and Improvement

Chapter 07: Model Evaluation and Improvement
Section titled βChapter 07: Model Evaluation and ImprovementβOverview
Section titled βOverviewβYouβve built your first machine learning modelsβa spam filter in Chapter 6 and various classifiers in Chapter 3. You know how to train them and get predictions. But how do you know if your model is actually good? How do you measure βgoodβ? And more importantly, once youβve measured performance, how do you make your model better?
This chapter answers those critical questions. Youβll learn that accuracy alone is a deceptive metricβa spam filter thatβs 95% accurate sounds great until you realize itβs blocking 20% of your important emails! You need precision, recall, F1-score, and other metrics that reveal the full picture of your modelβs behavior.
Youβll master evaluation techniques that separate amateur projects from production-ready systems: stratified cross-validation that handles imbalanced classes, ROC curves that visualize the precision-recall tradeoff, and learning curves that diagnose whether you need more data or a better algorithm. Then youβll learn systematic improvement techniques: hyperparameter tuning with grid search, feature selection to identify what actually matters, and ensemble methods that combine multiple models for superior performance.
By the end of this chapter, youβll have a comprehensive evaluation framework and a toolkit of proven improvement strategies. Youβll know not just whether your model works, but why it works, where it fails, and how to make it better. Most importantly, youβll understand the tradeoffsβwhen to optimize for precision vs. recall, when more data helps vs. hurts, and when a complex model is worth the added complexity.
Prerequisites
Section titled βPrerequisitesβBefore starting this chapter, you should have:
- Completed Chapter 03 or understand train/test splits, accuracy, and overfitting
- Completed Chapter 06 or have built a classification model
- PHP 8.4+ environment with Rubix ML installed (from Chapter 2)
- Basic understanding of classification metrics (confusion matrix helpful but not required)
- Familiarity with arrays, functions, and basic statistics in PHP
- A text editor or IDE configured for PHP development
Estimated Time: ~90-120 minutes (reading, running examples, and exercises)
What Youβll Build
Section titled βWhat Youβll BuildβBy the end of this chapter, you will have created:
- A comprehensive evaluation toolkit that calculates 10+ metrics for any classifier
- A stratified k-fold cross-validator that handles imbalanced datasets properly
- A confusion matrix analyzer with precision, recall, F1-score, and support for each class
- An ROC curve generator that visualizes true positive vs. false positive rates
- A learning curve plotter that shows whether more data will help
- A grid search hyperparameter tuner that systematically finds optimal parameters
- A feature importance analyzer that ranks features by predictive power
- A feature selection tool that removes unhelpful features automatically
- An ensemble voting classifier combining k-NN, Naive Bayes, and Decision Tree for 2-5% accuracy gains
- A bagging ensemble showing variance reduction through bootstrap aggregating
- A SMOTE implementation for synthetic minority oversampling to handle severe class imbalance
- A class weight calculator for adjusting model training on imbalanced datasets
- A comparison framework demonstrating ensemble vs. single model improvements
- A model comparison framework that benchmarks multiple algorithms
- An error analysis tool that identifies which examples are misclassified and why
- A production-ready spam filter with optimized hyperparameters achieving 98%+ accuracy
All code examples are fully functional and include visualizations (text-based) of metrics and curves.
::: info Code Examples Complete, runnable examples for this chapter:
01-evaluation-metrics.phpβ Comprehensive metrics toolkit02-confusion-matrix-deep-dive.phpβ Advanced confusion matrix analysis03-precision-recall-tradeoff.phpβ Understanding the tradeoff04-stratified-cross-validation.phpβ Handling imbalanced classes05-roc-curve.phpβ ROC-AUC analysis06-learning-curves.phpβ Diagnosing data needs07-grid-search.phpβ Hyperparameter tuning08-feature-importance.phpβ Feature ranking09-feature-selection.phpβ Automated feature selection10-ensemble-voting.phpβ Voting classifiers11-ensemble-bagging.phpβ Bootstrap aggregating12-class-imbalance-smote.phpβ SMOTE implementation13-class-weights.phpβ Class weight handling14-error-analysis.phpβ Understanding failures15-spam-filter-optimized.phpβ Production-ready spam filter
All files are in docs/series/ai-ml-php-developers/code/chapter-07/
:::
Quick Start
Section titled βQuick StartβWant to see comprehensive model evaluation in action? Hereβs a 5-minute example that calculates multiple metrics for a classifier:
<?php
declare(strict_types=1);
require_once __DIR__ . '/../../code/chapter-02/vendor/autoload.php';
use Rubix\ML\Classifiers\KNearestNeighbors;use Rubix\ML\Datasets\Labeled;use Rubix\ML\CrossValidation\Metrics\Accuracy;use Rubix\ML\CrossValidation\Metrics\Precision;use Rubix\ML\CrossValidation\Metrics\Recall;use Rubix\ML\CrossValidation\Metrics\F1Score;
// Quick spam classifier training data$trainingSamples = [ [5, 3, 1, 1], // [word_count, exclamations, has_urgent, has_free] [4, 0, 0, 0], [6, 4, 1, 1], [5, 1, 0, 0], [7, 5, 1, 1], [4, 0, 0, 0],];
$trainingLabels = ['spam', 'ham', 'spam', 'ham', 'spam', 'ham'];
// Train classifier$classifier = new KNearestNeighbors(3);$trainingDataset = new Labeled($trainingSamples, $trainingLabels);$classifier->train($trainingDataset);
// Test data$testSamples = [ [6, 5, 1, 1], // Spammy: many exclamations, urgent, free [4, 0, 0, 0], // Ham-like: normal message [5, 2, 1, 0], // Borderline [3, 0, 0, 0], // Ham-like];
$testLabels = ['spam', 'ham', 'spam', 'ham'];$testDataset = new Labeled($testSamples, $testLabels);
// Make predictions$predictions = $classifier->predict($testDataset);
// Calculate multiple metrics$metrics = [ 'Accuracy' => new Accuracy(), 'Precision' => new Precision(), 'Recall' => new Recall(), 'F1-Score' => new F1Score(),];
echo "βββββββββββββββββββββββββββββββββββββββββββββββββ\n";echo "β Model Evaluation: Multiple Metrics β\n";echo "βββββββββββββββββββββββββββββββββββββββββββββββββ\n\n";
foreach ($metrics as $name => $metric) { $score = $metric->score($predictions, $testLabels); $percentage = number_format($score * 100, 2); $bar = str_repeat('β', (int)($score * 30)); echo sprintf("%-12s %6s%% %s\n", $name . ':', $percentage, $bar);}
echo "\nPredictions:\n";foreach ($predictions as $i => $pred) { $actual = $testLabels[$i]; $icon = $pred === $actual ? 'β' : 'β'; echo " {$icon} Predicted: {$pred}, Actual: {$actual}\n";}Run it:
cd docs/series/ai-ml-php-developers/code/chapter-07php quick-start-evaluation.phpExpected output:
ββββββββββββββββββββββββββββββββββββββββββββββββββ Model Evaluation: Multiple Metrics ββββββββββββββββββββββββββββββββββββββββββββββββββ
Accuracy: 75.00% βββββββββββββββββββββββPrecision: 100.00% ββββββββββββββββββββββββββββββRecall: 50.00% βββββββββββββββF1-Score: 66.67% ββββββββββββββββββββ
Predictions: β Predicted: spam, Actual: spam β Predicted: ham, Actual: ham β Predicted: ham, Actual: spam β Predicted: ham, Actual: hamWhat just happened? You evaluated a classifier with four different metrics, revealing that while precision is perfect (no false positives), recall is only 50% (missing half the spam). Accuracy alone (75%) hides this critical detail!
Now letβs understand evaluation deeplyβ¦
Objectives
Section titled βObjectivesβBy the end of this chapter, you will be able to:
- Calculate and interpret 10+ evaluation metrics including accuracy, precision, recall, F1-score, specificity, and ROC-AUC
- Build and analyze confusion matrices to understand exactly which classes your model confuses
- Use stratified cross-validation to get reliable estimates on imbalanced datasets
- Generate and interpret ROC curves to visualize the precision-recall tradeoff and choose optimal thresholds
- Create learning curves to diagnose whether your model needs more data, better features, or different algorithms
- Perform systematic hyperparameter tuning with grid search and random search to find optimal model configurations
- Analyze feature importance to understand which features drive predictions
- Implement feature selection to remove unhelpful features and reduce overfitting
- Build ensemble classifiers using voting and bagging to achieve 2-5% accuracy improvements over single models
- Handle severely imbalanced datasets with SMOTE, random sampling, and class weights to improve minority class detection
- Conduct systematic error analysis to identify patterns in misclassifications and target improvements
- Optimize models for production balancing accuracy, speed, memory usage, and interpretability
Step 1: Understanding Evaluation Metrics Beyond Accuracy (~15 min)
Section titled βStep 1: Understanding Evaluation Metrics Beyond Accuracy (~15 min)βMaster the full suite of classification metrics and understand when each is most important.
Actions
Section titled βActionsβAccuracy is the most intuitive metricβitβs simply the percentage of correct predictions. But it can be dangerously misleading, especially with imbalanced datasets.
The Problem with Accuracy Alone
Section titled βThe Problem with Accuracy AloneβConsider this scenario:
# filename: 01-evaluation-metrics.php (excerpt)<?php
// Imbalanced email dataset: 95 ham, 5 spam (out of 100 emails)$testLabels = array_merge( array_fill(0, 95, 'ham'), // 95 ham emails array_fill(0, 5, 'spam') // 5 spam emails);
// Naive classifier: predict everything as "ham"$naivePredictions = array_fill(0, 100, 'ham');
$correct = array_sum(array_map( fn($pred, $actual) => $pred === $actual ? 1 : 0, $naivePredictions, $testLabels));
$accuracy = $correct / count($testLabels);
echo "Naive 'always predict ham' classifier:\n";echo "Accuracy: " . ($accuracy * 100) . "%\n";// Output: 95% accuracy!
echo "\nBut this classifier:\n";echo " - Catches 0% of spam (useless!)\n";echo " - Never identifies any spam at all\n";echo " - Would let all malicious emails through\n";The problem: A classifier that never detects spam achieves 95% accuracy on this dataset! Accuracy is blind to class imbalance.
The Confusion Matrix Foundation
Section titled βThe Confusion Matrix FoundationβAll better metrics start with the confusion matrix. For binary classification:
PREDICTED Positive NegativeACTUAL Positive TP FN Negative FP TN- True Positive (TP): Correctly predicted positive (spam correctly identified as spam)
- False Negative (FN): Missed positive (spam incorrectly labeled as ham) β Type II error
- False Positive (FP): Incorrect positive (ham incorrectly labeled as spam) β Type I error
- True Negative (TN): Correctly predicted negative (ham correctly identified as ham)
# filename: 01-evaluation-metrics.php (excerpt)<?php
/** * Calculate confusion matrix components for binary classification * * @param array $predictions Predicted labels * @param array $actuals Actual labels * @param string $positiveClass The class considered "positive" * @return array ['tp' => int, 'fp' => int, 'tn' => int, 'fn' => int] */function calculateConfusionComponents( array $predictions, array $actuals, string $positiveClass = 'spam'): array { $tp = $fp = $tn = $fn = 0;
for ($i = 0; $i < count($predictions); $i++) { $predicted = $predictions[$i]; $actual = $actuals[$i];
if ($actual === $positiveClass && $predicted === $positiveClass) { $tp++; // Correctly caught spam } elseif ($actual === $positiveClass && $predicted !== $positiveClass) { $fn++; // Missed spam (false negative - bad!) } elseif ($actual !== $positiveClass && $predicted === $positiveClass) { $fp++; // Falsely flagged ham as spam } else { $tn++; // Correctly identified ham } }
return ['tp' => $tp, 'fp' => $fp, 'tn' => $tn, 'fn' => $fn];}The Essential Metrics
Section titled βThe Essential Metricsβ1. Precision β Of all emails we flagged as spam, how many were actually spam?
[ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} ]
High precision means few false positives. Critical when false positives are costly (e.g., blocking important emails, false fraud alerts).
function calculatePrecision(array $components): float{ $tp = $components['tp']; $fp = $components['fp'];
return ($tp + $fp) > 0 ? $tp / ($tp + $fp) : 0.0;}
// Example: TP=8, FP=2, TN=85, FN=5// Precision = 8 / (8 + 2) = 0.80 (80%)// Meaning: Of 10 emails we flagged, 8 were actually spam, 2 were false alarms2. Recall (Sensitivity, True Positive Rate) β Of all actual spam emails, how many did we catch?
[ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} ]
High recall means few false negatives. Critical when false negatives are costly (e.g., missing fraudulent transactions, failing to diagnose diseases).
function calculateRecall(array $components): float{ $tp = $components['tp']; $fn = $components['fn'];
return ($tp + $fn) > 0 ? $tp / ($tp + $fn) : 0.0;}
// Example: TP=8, FP=2, TN=85, FN=5// Recall = 8 / (8 + 5) = 0.615 (61.5%)// Meaning: Of 13 actual spam emails, we caught 8, missed 5The Precision-Recall Tradeoff: You canβt maximize both simultaneously. Making your filter more aggressive (flagging more emails as spam) increases recall but decreases precision.
3. F1-Score β Harmonic mean of precision and recall, balancing both
[ \text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} ]
Use F1-score when you need a single metric that balances precision and recall. Itβs especially useful when classes are imbalanced.
function calculateF1Score(float $precision, float $recall): float{ return ($precision + $recall) > 0 ? 2 * ($precision * $recall) / ($precision + $recall) : 0.0;}
// Example: Precision=0.80, Recall=0.615// F1 = 2 * (0.80 * 0.615) / (0.80 + 0.615) = 0.695 (69.5%)Why harmonic mean? The harmonic mean penalizes extreme values. If either precision or recall is very low, F1 will be low. You canβt βcheatβ by optimizing just one.
4. Specificity (True Negative Rate) β Of all actual ham emails, how many did we correctly identify?
[ \text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}} ]
High specificity means few false positives (important for ham classification).
function calculateSpecificity(array $components): float{ $tn = $components['tn']; $fp = $components['fp'];
return ($tn + $fp) > 0 ? $tn / ($tn + $fp) : 0.0;}
// Example: TP=8, FP=2, TN=85, FN=5// Specificity = 85 / (85 + 2) = 0.977 (97.7%)// Meaning: Of 87 actual ham emails, we correctly identified 855. Matthews Correlation Coefficient (MCC) β Balanced metric that works well even with imbalanced classes
[ \text{MCC} = \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}} ]
Range: -1 (complete disagreement) to +1 (perfect prediction), 0 = random.
function calculateMCC(array $components): float{ extract($components); // $tp, $fp, $tn, $fn
$numerator = ($tp * $tn) - ($fp * $fn); $denominator = sqrt(($tp + $fp) * ($tp + $fn) * ($tn + $fp) * ($tn + $fn));
return $denominator > 0 ? $numerator / $denominator : 0.0;}MCC is considered one of the best single metrics for binary classification, especially with imbalanced data.
When to Optimize for Which Metric
Section titled βWhen to Optimize for Which Metricβ| Scenario | Optimize For | Reason |
|---|---|---|
| Spam Filter | Precision | Donβt block important emails (false positives costly) |
| Fraud Detection | Recall | Catch all fraud (false negatives very costly) |
| Cancer Screening | Recall | Donβt miss any cases (follow-up tests can verify) |
| Content Moderation | Precision + Human Review | False positives damage user trust |
| Balanced Importance | F1-Score | Neither error type is much more costly |
Expected Result
Section titled βExpected ResultβRunning 01-evaluation-metrics.php shows comprehensive evaluation:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ Comprehensive Model Evaluation Metrics βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Dataset: 100 emails (13 spam, 87 ham)
============================================================CONFUSION MATRIX============================================================
β PREDICTED β spam hamβββββββββββββββββΌββββββββββββββββββACTUAL spam β 8 5 (13 total) ham β 2 85 (87 total)
True Positives (TP): 8 β Spam correctly identifiedFalse Negatives (FN): 5 β Spam missed (bad!)False Positives (FP): 2 β Ham incorrectly flagged (bad!)True Negatives (TN): 85 β Ham correctly identified
============================================================EVALUATION METRICS============================================================
Basic Metrics:ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββAccuracy: 93.00% ββββββββββββββββββββββββββββ β (TP + TN) / Total = (8 + 85) / 100 β Overall correctness
Error Rate: 7.00% βββ β 1 - Accuracy β Overall mistakes
Spam Detection (Positive Class):ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββPrecision: 80.00% ββββββββββββββββββββββββ β TP / (TP + FP) = 8 / (8 + 2) β Of flagged emails, 80% were actually spam β 20% false alarm rate
Recall: 61.54% βββββββββββββββββββ β TP / (TP + FN) = 8 / (8 + 5) β Of actual spam, caught 61.54% β Missing 38.46% of spam!
F1-Score: 69.57% βββββββββββββββββββββ β Harmonic mean of precision and recall β Balanced metric (closer to lower value)
Ham Detection (Negative Class):ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββSpecificity: 97.70% ββββββββββββββββββββββββββββββ β TN / (TN + FP) = 85 / (85 + 2) β Of actual ham, correctly identified 97.70%
Advanced Metrics:ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββMatthews Corr: 0.706 βββββββββββββββββββββ β Range: -1 to +1 (higher is better) β Balanced metric for imbalanced data
============================================================INTERPRETATION============================================================
β Strengths: β’ High specificity (97.7%) - rarely blocks good emails β’ Good precision (80%) - flagged emails are usually spam β’ Excellent accuracy (93%) on imbalanced dataset
β Weaknesses: β’ Moderate recall (61.5%) - missing ~40% of spam β’ 5 spam emails slipped through to inbox β’ F1-score indicates room for improvement
π‘ Recommendations: β’ To catch more spam: Lower decision threshold (increases recall) β’ Trade-off: Will increase false positives (lower precision) β’ Consider ensemble methods or better featuresWhy It Works
Section titled βWhy It WorksβDifferent metrics reveal different aspects of model behavior:
- Accuracy shows overall correctness but hides class imbalance
- Precision reveals reliability of positive predictions
- Recall reveals completeness of positive detection
- F1-Score balances precision and recall
- Specificity shows how well you avoid false alarms
- MCC provides balanced assessment even with severe imbalance
No single metric tells the whole story. Always examine multiple metrics and the confusion matrix.
Troubleshooting
Section titled βTroubleshootingβ- Precision = 0.00 β No true positives; model never predicts positive class. Check if model trained properly.
- Recall = 0.00 β Model never correctly identifies positive class. Same as above.
- All metrics = 1.00 β Either perfect model (rare!) or data leakage. Verify test set is truly unseen.
- MCC close to 0 β Model is no better than random guessing. Need better features or different algorithm.
Step 2: Stratified Cross-Validation for Imbalanced Data (~12 min)
Section titled βStep 2: Stratified Cross-Validation for Imbalanced Data (~12 min)βLearn why standard cross-validation fails on imbalanced datasets and how stratified sampling solves the problem.
Actions
Section titled βActionsβIn Chapter 3, you learned k-fold cross-validation. But thereβs a problem: with random splitting, some folds might end up with very few (or zero!) examples of minority classes.
The Problem with Random Splits
Section titled βThe Problem with Random Splitsβ# filename: 04-stratified-cross-validation.php (excerpt)<?php
// Imbalanced dataset: 90 ham, 10 spam$labels = array_merge( array_fill(0, 90, 'ham'), array_fill(0, 10, 'spam'));
// Random 5-fold splitshuffle($labels);$foldSize = 20; // 100 samples / 5 folds
// Check fold 1$fold1 = array_slice($labels, 0, 20);$spamCount = count(array_filter($fold1, fn($l) => $l === 'spam'));
echo "Fold 1 spam count: {$spamCount} (expected ~2)\n";// Might get: 0, 1, 2, 3, 4... highly variable!// A fold with 0 spam can't calculate recall for spam class!The solution: Stratified sampling ensures each fold maintains the same class proportions as the full dataset.
Implementing Stratified K-Fold Cross-Validation
Section titled βImplementing Stratified K-Fold Cross-Validationβ# filename: 04-stratified-cross-validation.php (excerpt)<?php
/** * Stratified k-fold cross-validation that preserves class distributions * * @param array $samples Feature data * @param array $labels Target labels * @param int $k Number of folds * @param callable $modelFactory Function that returns a new model instance * @return array ['scores' => [...], 'mean' => float, 'std' => float] */function stratifiedKFoldCV( array $samples, array $labels, int $k, callable $modelFactory): array { // Step 1: Group samples by class $classSamples = []; foreach ($labels as $idx => $label) { $classSamples[$label][] = $idx; }
// Step 2: Shuffle each class independently foreach ($classSamples as $class => $indices) { shuffle($classSamples[$class]); }
// Step 3: Create stratified folds $folds = array_fill(0, $k, []);
foreach ($classSamples as $class => $indices) { $classSize = count($indices); $foldSize = (int) floor($classSize / $k);
for ($fold = 0; $fold < $k; $fold++) { $start = $fold * $foldSize; $end = ($fold === $k - 1) ? $classSize : ($fold + 1) * $foldSize;
for ($i = $start; $i < $end; $i++) { $folds[$fold][] = $indices[$i]; } } }
// Step 4: Evaluate each fold $scores = [];
for ($fold = 0; $fold < $k; $fold++) { $testIndices = $folds[$fold]; $trainIndices = [];
// Combine all other folds for training for ($otherFold = 0; $otherFold < $k; $otherFold++) { if ($otherFold !== $fold) { $trainIndices = array_merge($trainIndices, $folds[$otherFold]); } }
// Extract train/test data $trainSamples = array_map(fn($i) => $samples[$i], $trainIndices); $trainLabels = array_map(fn($i) => $labels[$i], $trainIndices); $testSamples = array_map(fn($i) => $samples[$i], $testIndices); $testLabels = array_map(fn($i) => $labels[$i], $testIndices);
// Train and evaluate $model = $modelFactory(); $model->train($trainSamples, $trainLabels); $predictions = $model->predict($testSamples);
// Calculate F1-score (better than accuracy for imbalanced data) $components = calculateConfusionComponents($predictions, $testLabels, 'spam'); $precision = calculatePrecision($components); $recall = calculateRecall($components); $f1 = calculateF1Score($precision, $recall);
$scores[] = $f1;
// Verify stratification $trainSpamRatio = count(array_filter($trainLabels, fn($l) => $l === 'spam')) / count($trainLabels); $testSpamRatio = count(array_filter($testLabels, fn($l) => $l === 'spam')) / count($testLabels);
echo "Fold " . ($fold + 1) . ":\n"; echo " Train spam ratio: " . number_format($trainSpamRatio * 100, 1) . "%\n"; echo " Test spam ratio: " . number_format($testSpamRatio * 100, 1) . "%\n"; echo " F1-Score: " . number_format($f1 * 100, 2) . "%\n\n"; }
// Calculate statistics $mean = array_sum($scores) / count($scores); $variance = 0; foreach ($scores as $score) { $variance += pow($score - $mean, 2); } $std = sqrt($variance / count($scores));
return [ 'scores' => $scores, 'mean' => $mean, 'std' => $std, ];}Comparing Standard vs. Stratified CV
Section titled βComparing Standard vs. Stratified CVβ# filename: 04-stratified-cross-validation.php (excerpt)<?php
echo "ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n";echo "β Comparing Standard vs Stratified Cross-Validation β\n";echo "ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n\n";
echo "Dataset: 100 samples (90 ham, 10 spam = 10% minority class)\n\n";
echo "STANDARD K-FOLD (Random Splitting):\n";echo "ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n";$standardResults = kFoldCV($samples, $labels, 5, $modelFactory);echo "Mean F1: " . number_format($standardResults['mean'] * 100, 2) . "%\n";echo "Std Dev: " . number_format($standardResults['std'] * 100, 2) . "%\n";echo "β Notice: High variance due to inconsistent class distribution\n\n";
echo "STRATIFIED K-FOLD (Class-Preserving Splitting):\n";echo "ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n";$stratifiedResults = stratifiedKFoldCV($samples, $labels, 5, $modelFactory);echo "Mean F1: " . number_format($stratifiedResults['mean'] * 100, 2) . "%\n";echo "Std Dev: " . number_format($stratifiedResults['std'] * 100, 2) . "%\n";echo "β Lower variance: Each fold has consistent 10% spam\n";Expected Result
Section titled βExpected Resultββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ Comparing Standard vs Stratified Cross-Validation βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Dataset: 100 samples (90 ham, 10 spam = 10% minority class)
STANDARD K-FOLD (Random Splitting):ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββFold 1: Test spam ratio: 5.0% (1 spam sample!) F1-Score: 66.67%
Fold 2: Test spam ratio: 15.0% (3 spam samples) F1-Score: 80.00%
Fold 3: Test spam ratio: 10.0% (2 spam samples) F1-Score: 75.00%
Fold 4: Test spam ratio: 5.0% (1 spam sample!) F1-Score: 40.00%
Fold 5: Test spam ratio: 15.0% (3 spam samples) F1-Score: 85.71%
Mean F1: 69.48%Std Dev: 16.21% β High variance!
STRATIFIED K-FOLD (Class-Preserving Splitting):ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββFold 1: Train spam ratio: 10.0%, Test spam ratio: 10.0% F1-Score: 80.00%
Fold 2: Train spam ratio: 10.0%, Test spam ratio: 10.0% F1-Score: 75.00%
Fold 3: Train spam ratio: 10.0%, Test spam ratio: 10.0% F1-Score: 80.00%
Fold 4: Train spam ratio: 10.0%, Test spam ratio: 10.0% F1-Score: 77.78%
Fold 5: Train spam ratio: 10.0%, Test spam ratio: 10.0% F1-Score: 80.00%
Mean F1: 78.56%Std Dev: 2.03% β Much more stable!
============================================================KEY FINDINGS============================================================
Standard CV Issues: β’ Class ratios vary wildly (5% to 15% spam) β’ Some folds have very few minority samples β’ High variance in results (Β±16.21%) β’ Unreliable performance estimates
Stratified CV Benefits: β’ Consistent class ratios (10% in all folds) β’ Every fold has representative samples β’ Low variance (Β±2.03%) β’ More reliable and trustworthy estimates
π‘ Recommendation: Always use stratified CV for imbalanced data!Why It Works
Section titled βWhy It WorksβStratified sampling ensures every fold is a microcosm of the full dataset. This means:
- Reliable metrics: Every fold can calculate all metrics (no folds with zero minority samples)
- Lower variance: Consistent class distribution reduces random variation
- Better estimates: Mean performance is more representative of true performance
- Fair comparison: All folds test under similar conditions
Without stratification, you might get βluckyβ or βunluckyβ folds that skew your estimate.
Troubleshooting
Section titled βTroubleshootingβ- Error: βCannot create stratified foldsβ β Minority class has fewer samples than k. Reduce k or get more data.
- Folds still imbalanced β Check that stratification logic groups by all classes. Verify with class ratio printouts.
- Standard CV actually works better β Your dataset may not be imbalanced enough to matter (<30% minority class often fine).
Step 3: ROC Curves and Choosing Optimal Thresholds (~15 min)
Section titled βStep 3: ROC Curves and Choosing Optimal Thresholds (~15 min)βLearn to visualize the precision-recall tradeoff with ROC curves and choose optimal classification thresholds for your specific use case.
Actions
Section titled βActionsβMost classifiers donβt just output a binary predictionβthey output a probability or confidence score. You can adjust the decision threshold to favor precision or recall.
Understanding Classification Thresholds
Section titled βUnderstanding Classification Thresholdsβ# filename: 05-roc-curve.php (excerpt)<?php
// Classifier outputs probabilities for "spam" class$predictions = [ ['sample' => 1, 'actual' => 'spam', 'probability' => 0.95], // Very confident spam ['sample' => 2, 'actual' => 'spam', 'probability' => 0.65], // Moderately confident ['sample' => 3, 'actual' => 'ham', 'probability' => 0.45], // Borderline ['sample' => 4, 'actual' => 'ham', 'probability' => 0.15], // Likely ham ['sample' => 5, 'actual' => 'spam', 'probability' => 0.55], // Slightly spam-like];
// Threshold = 0.5 (default)echo "Threshold = 0.50:\n";foreach ($predictions as $pred) { $classification = $pred['probability'] >= 0.5 ? 'spam' : 'ham'; $correct = $classification === $pred['actual'] ? 'β' : 'β'; echo " {$correct} Sample {$pred['sample']}: P(spam)={$pred['probability']}, "; echo "predict {$classification}, actual {$pred['actual']}\n";}// Result: 2/3 spam caught, 1/2 ham correct
echo "\nThreshold = 0.40 (more aggressive):\n";foreach ($predictions as $pred) { $classification = $pred['probability'] >= 0.4 ? 'spam' : 'ham'; $correct = $classification === $pred['actual'] ? 'β' : 'β'; echo " {$correct} Sample {$pred['sample']}: predict {$classification}\n";}// Result: 3/3 spam caught (higher recall!), but 0/2 ham correct (lower precision)Key insight: By lowering the threshold, you catch more spam (higher recall) but also flag more ham as spam (lower precision).
Building an ROC Curve
Section titled βBuilding an ROC CurveβAn ROC (Receiver Operating Characteristic) curve plots True Positive Rate (recall) vs. False Positive Rate at various thresholds.
[ \text{TPR (Recall)} = \frac{\text{TP}}{\text{TP} + \text{FN}} ]
[ \text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}} = 1 - \text{Specificity} ]
# filename: 05-roc-curve.php (excerpt)<?php
/** * Generate ROC curve points by varying threshold * * @param array $predictions Array of ['actual' => string, 'probability' => float] * @param string $positiveClass The class considered positive * @return array Array of ['threshold' => float, 'tpr' => float, 'fpr' => float] */function generateROCCurve(array $predictions, string $positiveClass = 'spam'): array{ // Sort by probability descending usort($predictions, fn($a, $b) => $b['probability'] <=> $a['probability']);
$rocPoints = [];
// Generate points for thresholds from 0 to 1 $thresholds = array_merge([0.0], array_unique(array_column($predictions, 'probability')), [1.0]); sort($thresholds);
foreach ($thresholds as $threshold) { $tp = $fp = $tn = $fn = 0;
foreach ($predictions as $pred) { $predicted = $pred['probability'] >= $threshold ? $positiveClass : 'negative'; $actual = $pred['actual'];
if ($actual === $positiveClass && $predicted === $positiveClass) { $tp++; } elseif ($actual === $positiveClass && $predicted !== $positiveClass) { $fn++; } elseif ($actual !== $positiveClass && $predicted === $positiveClass) { $fp++; } else { $tn++; } }
$tpr = ($tp + $fn) > 0 ? $tp / ($tp + $fn) : 0; // Recall $fpr = ($fp + $tn) > 0 ? $fp / ($fp + $tn) : 0; // 1 - Specificity
$rocPoints[] = [ 'threshold' => $threshold, 'tpr' => $tpr, 'fpr' => $fpr, 'tp' => $tp, 'fp' => $fp, 'tn' => $tn, 'fn' => $fn, ]; }
return $rocPoints;}Calculating AUC (Area Under the Curve)
Section titled βCalculating AUC (Area Under the Curve)βThe AUC-ROC score summarizes the entire ROC curve into a single number:
- AUC = 1.0: Perfect classifier
- AUC = 0.9-1.0: Excellent
- AUC = 0.8-0.9: Good
- AUC = 0.7-0.8: Fair
- AUC = 0.5: No better than random guessing
- AUC < 0.5: Worse than random (predictions are inverted!)
# filename: 05-roc-curve.php (excerpt)<?php
/** * Calculate AUC using trapezoidal rule * * @param array $rocPoints Array of ROC curve points * @return float AUC score between 0 and 1 */function calculateAUC(array $rocPoints): float{ // Sort by FPR ascending usort($rocPoints, fn($a, $b) => $a['fpr'] <=> $b['fpr']);
$auc = 0.0;
for ($i = 1; $i < count($rocPoints); $i++) { $x1 = $rocPoints[$i - 1]['fpr']; $x2 = $rocPoints[$i]['fpr']; $y1 = $rocPoints[$i - 1]['tpr']; $y2 = $rocPoints[$i]['tpr'];
// Trapezoidal area $width = $x2 - $x1; $height = ($y1 + $y2) / 2; $auc += $width * $height; }
return $auc;}Visualizing the ROC Curve
Section titled βVisualizing the ROC Curveβ# filename: 05-roc-curve.php (excerpt)<?php
function plotROCCurve(array $rocPoints, int $width = 50, int $height = 20): void{ echo "\nROC Curve (TPR vs FPR):\n"; echo "ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n";
$grid = array_fill(0, $height, array_fill(0, $width, ' '));
// Plot diagonal (random classifier) for ($i = 0; $i < min($width, $height); $i++) { $x = (int)(($i / $width) * $width); $y = $height - 1 - (int)(($i / $width) * $height); if ($x < $width && $y >= 0 && $y < $height) { $grid[$y][$x] = 'Β·'; } }
// Plot ROC curve foreach ($rocPoints as $point) { $x = (int)($point['fpr'] * ($width - 1)); $y = $height - 1 - (int)($point['tpr'] * ($height - 1));
if ($x >= 0 && $x < $width && $y >= 0 && $y < $height) { $grid[$y][$x] = 'β'; } }
// Print grid with labels echo "TPR\n"; echo "1.0 β"; for ($y = 0; $y < $height; $y++) { if ($y > 0) echo " β"; echo implode('', $grid[$y]) . "\n"; } echo "0.0 β" . str_repeat('β', $width) . "\n"; echo " 0.0" . str_repeat(' ', $width - 6) . "1.0\n"; echo " " . str_repeat(' ', ($width - 3) / 2) . "FPR\n\n";
echo "Legend: β = ROC curve, Β· = Random classifier (AUC=0.5)\n"; echo " Better classifiers curve toward upper-left corner\n";}Expected Result
Section titled βExpected Resultββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ROC Curve Analysis: Spam Filter βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Generating ROC curve with 50 test predictions...
ROC Curve Points (selected thresholds):ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββThreshold TPR(Recall) FPR Precision F1-Scoreββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ0.10 100.0% 45.0% 35.7% 52.6% β Aggressive0.30 92.3% 25.0% 60.0% 72.7%0.50 84.6% 12.5% 78.6% 81.5% β Default0.70 61.5% 5.0% 88.9% 72.7%0.90 38.5% 2.5% 83.3% 52.6% β Conservative
ROC Curve (TPR vs FPR):ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββTPR1.0 βββββββββββββββββββββββββββ βββββββββββββββββββββββ βββββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββ βββββββββββββββ βββββββββββββ βββββββββββ βββββββββ βββββββ βββββ βββ ββΒ· β Β·Β· β Β·Β·Β· β Β·Β·Β·Β· β Β·Β·Β·Β·Β· β Β·Β·Β·Β·Β·Β· β Β·Β·Β·Β·Β·Β·Β·Β· β Β·Β·Β·Β·Β·Β·Β·Β·Β·Β·Β·Β·Β·0.0 βββββββββββββββββββββββββββββββββββββββββββββββββββββββ 0.0 1.0 FPR
Legend: β = ROC curve, Β· = Random classifier (AUC=0.5) Better classifiers curve toward upper-left corner
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββAUC-ROC SCOREββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
AUC: 0.9125 ββββββββββββββββββββββββββββ (Excellent!)
Interpretation: β’ 91.25% chance the model ranks a random spam higher than ham β’ Much better than random (AUC=0.50) β’ Room for improvement to reach 0.95+
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββCHOOSING OPTIMAL THRESHOLDββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Your Use Case: Email spam filter
Option 1: Maximize F1-Score (balanced) β Threshold: 0.50 β Recall: 84.6%, Precision: 78.6%, F1: 81.5% β Recommended for: General spam filtering
Option 2: Minimize false positives (high precision) β Threshold: 0.70 β Recall: 61.5%, Precision: 88.9%, F1: 72.7% β Recommended for: Critical email (don't block important messages)
Option 3: Minimize false negatives (high recall) β Threshold: 0.30 β Recall: 92.3%, Precision: 60.0%, F1: 72.7% β Recommended for: Aggressive spam blocking
π‘ Current threshold (0.50) provides good balance. For production, consider A/B testing different thresholds!Why It Works
Section titled βWhy It WorksβThe ROC curve visualizes the fundamental tradeoff in classification: you canβt maximize both TPR (catching positives) and minimize FPR (avoiding false alarms) simultaneously. By plotting all possible thresholds, you can:
- See the tradeoff visually: How much precision do you lose for each gain in recall?
- Choose optimal threshold: Based on your specific cost function
- Compare classifiers: Higher AUC = better overall performance regardless of threshold
AUC interpretation: If you randomly pick one positive example and one negative example, AUC is the probability your classifier scores the positive one higher.
Troubleshooting
Section titled βTroubleshootingβ- AUC = 0.50 β Model is no better than random. Check features, algorithm, or training process.
- AUC < 0.50 β Predictions are inverted! Flip them: if model says βspamβ, predict βhamβ (bug in code).
- ROC curve is jagged β Normal with small test sets. Use more test data for smoother curves.
- Cannot calculate FPR β No negative samples in test set. Use stratified splitting.
Step 4: Learning Curves for Diagnosing Model Behavior (~12 min)
Section titled βStep 4: Learning Curves for Diagnosing Model Behavior (~12 min)βUse learning curves to diagnose whether your model needs more data, better features, or a different algorithm.
Actions
Section titled βActionsβA learning curve shows how model performance changes as training set size increases. Itβs one of the most powerful diagnostic tools for understanding whatβs limiting your modelβs performance.
The Three Learning Curve Patterns
Section titled βThe Three Learning Curve PatternsβLearning curves reveal three distinct patterns that indicate different model issues:
Pattern 1: High Bias (Underfitting)
- Training and test scores both converge at a low value
- The model is too simple to capture the patterns in your data
- Adding more data wonβt help much
- Solution: Need better features or a more complex model
Pattern 2: High Variance (Overfitting)
- Large gap between training and test scores
- Training score is high, but test score is low
- More data will help close this gap
- Solution: Need more training data or regularization
Pattern 3: Good Fit
- Training and test scores converge at a high value
- Small gap between the two scores
- Model is working well
- Only small improvements possible
Implementing Learning Curves
Section titled βImplementing Learning Curvesβ# filename: 06-learning-curves.php (excerpt)<?php
/** * Generate learning curve data * * @param array $samples Full training dataset features * @param array $labels Full training dataset labels * @param array $testSamples Separate test set features * @param array $testLabels Separate test set labels * @param callable $modelFactory Function that returns new model instance * @param array $trainSizes Array of training set sizes to try (e.g., [10, 20, 50, 100]) * @return array Learning curve data points */function generateLearningCurve( array $samples, array $labels, array $testSamples, array $testLabels, callable $modelFactory, array $trainSizes): array { $curveData = [];
foreach ($trainSizes as $size) { // Take first $size samples for training $trainSubset = array_slice($samples, 0, $size); $labelSubset = array_slice($labels, 0, $size);
// Train model on subset $model = $modelFactory(); $model->train($trainSubset, $labelSubset);
// Evaluate on training subset $trainPredictions = $model->predict($trainSubset); $trainAccuracy = array_sum(array_map( fn($pred, $actual) => $pred === $actual ? 1 : 0, $trainPredictions, $labelSubset )) / count($labelSubset);
// Evaluate on test set $testPredictions = $model->predict($testSamples); $testAccuracy = array_sum(array_map( fn($pred, $actual) => $pred === $actual ? 1 : 0, $testPredictions, $testLabels )) / count($testLabels);
$curveData[] = [ 'train_size' => $size, 'train_score' => $trainAccuracy, 'test_score' => $testAccuracy, 'gap' => $trainAccuracy - $testAccuracy, ];
echo "Training size: {$size}\n"; echo " Train accuracy: " . number_format($trainAccuracy * 100, 2) . "%\n"; echo " Test accuracy: " . number_format($testAccuracy * 100, 2) . "%\n"; echo " Gap: " . number_format(($trainAccuracy - $testAccuracy) * 100, 2) . "%\n\n"; }
return $curveData;}Visualizing Learning Curves
Section titled βVisualizing Learning Curvesβ# filename: 06-learning-curves.php (excerpt)<?php
function plotLearningCurve(array $curveData, int $width = 60, int $height = 20): void{ echo "\nLearning Curve:\n"; echo "ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n";
$grid = array_fill(0, $height, array_fill(0, $width, ' '));
// Find max train size for scaling $maxTrainSize = max(array_column($curveData, 'train_size'));
// Plot both train and test curves foreach ($curveData as $point) { $x = (int)(($point['train_size'] / $maxTrainSize) * ($width - 1));
// Training score (solid) $yTrain = $height - 1 - (int)($point['train_score'] * ($height - 1)); if ($x >= 0 && $x < $width && $yTrain >= 0 && $yTrain < $height) { $grid[$yTrain][$x] = 'β'; }
// Test score (pattern) $yTest = $height - 1 - (int)($point['test_score'] * ($height - 1)); if ($x >= 0 && $x < $width && $yTest >= 0 && $yTest < $height) { $grid[$yTest][$x] = 'β'; } }
// Print grid echo "Acc\n"; echo "1.0 β"; for ($y = 0; $y < $height; $y++) { if ($y > 0) echo " β"; echo implode('', $grid[$y]) . "\n"; } echo "0.0 β" . str_repeat('β', $width) . "\n"; echo " 0" . str_repeat(' ', $width - 7) . "{$maxTrainSize}\n"; echo " " . str_repeat(' ', ($width - 15) / 2) . "Training Set Size\n\n";
echo "Legend: β = Training accuracy, β = Test accuracy\n\n";}Interpreting Learning Curves
Section titled βInterpreting Learning Curvesβ# filename: 06-learning-curves.php (excerpt)<?php
function diagnoseLearningCurve(array $curveData): string{ $lastPoint = end($curveData); $trainScore = $lastPoint['train_score']; $testScore = $lastPoint['test_score']; $gap = $lastPoint['gap'];
// Check if scores are converging $firstGap = $curveData[0]['gap']; $isConverging = $gap < $firstGap;
// Diagnose pattern if ($gap > 0.15 && $isConverging) { return "HIGH VARIANCE (Overfitting)\n" . " β’ Large gap between train ({$trainScore}) and test ({$testScore})\n" . " β’ Gap is narrowing but still significant\n" . " π‘ Recommendation: Get more training data or add regularization"; }
if ($trainScore < 0.75 && $testScore < 0.75 && $gap < 0.1) { return "HIGH BIAS (Underfitting)\n" . " β’ Both train and test scores are low\n" . " β’ Scores have converged at a low value\n" . " π‘ Recommendation: Use more complex model or better features"; }
if ($trainScore > 0.85 && $testScore > 0.80 && $gap < 0.1) { return "GOOD FIT\n" . " β’ High train ({$trainScore}) and test ({$testScore}) scores\n" . " β’ Small gap indicates good generalization\n" . " β Model is performing well!"; }
return "AMBIGUOUS PATTERN\n" . " β’ Train: {$trainScore}, Test: {$testScore}, Gap: {$gap}\n" . " β’ May need more data to see clear pattern";}Expected Result
Section titled βExpected Resultββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ Learning Curve Analysis βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Generating learning curve with sizes: [10, 25, 50, 100, 200, 400]
Training size: 10 Train accuracy: 100.00% β Perfect on tiny dataset Test accuracy: 62.00% β But poor generalization Gap: 38.00% β Large gap = overfitting
Training size: 25 Train accuracy: 96.00% Test accuracy: 68.00% Gap: 28.00%
Training size: 50 Train accuracy: 94.00% Test accuracy: 74.00% Gap: 20.00%
Training size: 100 Train accuracy: 92.00% Test accuracy: 80.00% Gap: 12.00% β Gap narrowing
Training size: 200 Train accuracy: 90.00% Test accuracy: 84.00% Gap: 6.00%
Training size: 400 Train accuracy: 89.00% Test accuracy: 86.00% Gap: 3.00% β Gap nearly closed
Learning Curve:ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββAcc1.0 ββ ββ β β β β β ββ β ββ β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β β0.0 βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ 0 400 Training Set Size
Legend: β = Training accuracy, β = Test accuracy
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββDIAGNOSISββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Pattern Detected: HIGH VARIANCE β GOOD FIT (with more data)
Initial State (10 samples): β’ Severe overfitting (38% gap) β’ Model memorizes small dataset
Current State (400 samples): β’ Good generalization (3% gap) β’ Test accuracy still improving β’ Train and test curves converging
Conclusion: β More data has successfully reduced overfitting β Both curves are still rising β more data likely helps further π‘ Recommendation: Collect 600-800 samples for optimal performance
What this tells us: 1. The algorithm choice is good (can learn patterns) 2. Features are informative (test accuracy reached 86%) 3. More data continues to help (curves haven't plateaued) 4. Current model complexity is appropriateWhy It Works
Section titled βWhy It WorksβLearning curves reveal whatβs limiting your model:
High variance (overfitting): Large gap between curves that narrows with more data β Solution: Get more training data, use regularization, or simplify model
High bias (underfitting): Low performance on both sets that doesnβt improve with more data β Solution: Use more complex model, engineer better features, or remove regularization
Good fit: High performance with small gap β Solution: Youβre done! (Or collect more data for marginal gains)
The curve shape tells you which intervention will actually help. Donβt waste time collecting more data if your curves have plateaued (high bias)!
Troubleshooting
Section titled βTroubleshootingβ- Both curves flat and low β High bias. More data wonβt help. Try more complex model or better features.
- Both curves fluctuate wildly β Training set sizes too small or high variance in data. Use larger increments.
- Test score higher than train β Somethingβs wrong. Check for data leakage or swapped datasets.
- Curves donβt converge even with lots of data β Model may be too complex. Try simpler model or regularization.
Step 5: Grid Search for Hyperparameter Tuning (~15 min)
Section titled βStep 5: Grid Search for Hyperparameter Tuning (~15 min)βSystematically find optimal hyperparameters using grid search and understand when the improvement is worth the computational cost.
Actions
Section titled βActionsβEvery ML algorithm has hyperparametersβconfiguration settings you choose before training. For k-NN, itβs k (number of neighbors). For decision trees, itβs max depth, min samples split, etc. Wrong hyperparameters can cripple even the best algorithm.
Manual Hyperparameter Tuning (The Naive Way)
Section titled βManual Hyperparameter Tuning (The Naive Way)β# filename: 07-grid-search.php (excerpt)<?php
// Manual trial and error (tedious!)$kValues = [1, 3, 5, 7, 9];
foreach ($kValues as $k) { $classifier = new KNearestNeighbors($k); $classifier->train($trainSamples, $trainLabels); $accuracy = evaluate($classifier, $testSamples, $testLabels); echo "k = {$k}: Accuracy = " . number_format($accuracy * 100, 2) . "%\n";}
// What if k=4 or k=6 is better? You'd never know!// What about distance metric? Weighted vs unweighted?Problems with manual tuning:
- Time-consuming and error-prone
- Easy to miss optimal values between your test points
- Difficult to tune multiple hyperparameters simultaneously
- Tempting to overfit to test set by trying too many values
Implementing Grid Search
Section titled βImplementing Grid SearchβGrid search tries all combinations of hyperparameter values and finds the best using cross-validation.
# filename: 07-grid-search.php (excerpt)<?php
/** * Perform grid search to find optimal hyperparameters * * @param array $samples Training features * @param array $labels Training labels * @param array $paramGrid Associative array of parameter names to arrays of values * @param callable $modelFactory Function that takes params array and returns model * @param int $cv Number of cross-validation folds * @return array Best parameters and scores */function gridSearch( array $samples, array $labels, array $paramGrid, callable $modelFactory, int $cv = 5): array { // Generate all parameter combinations $paramCombinations = generateParameterCombinations($paramGrid);
echo "Grid Search: Testing " . count($paramCombinations) . " combinations\n"; echo "βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n\n";
$results = []; $bestScore = -INF; $bestParams = null;
foreach ($paramCombinations as $index => $params) { echo "Configuration " . ($index + 1) . "/" . count($paramCombinations) . ": "; echo json_encode($params) . "\n";
// Perform cross-validation with these parameters $cvScores = [];
for ($fold = 0; $fold < $cv; $fold++) { [$trainFold, $testFold] = createFold($samples, $labels, $cv, $fold);
$model = $modelFactory($params); $model->train($trainFold['samples'], $trainFold['labels']);
$predictions = $model->predict($testFold['samples']); $score = calculateAccuracy($predictions, $testFold['labels']); $cvScores[] = $score; }
$meanScore = array_sum($cvScores) / count($cvScores); $stdScore = calculateStd($cvScores);
$results[] = [ 'params' => $params, 'mean_score' => $meanScore, 'std_score' => $stdScore, 'cv_scores' => $cvScores, ];
echo " Mean CV Score: " . number_format($meanScore * 100, 2) . "%"; echo " (Β±" . number_format($stdScore * 100, 2) . "%)\n";
if ($meanScore > $bestScore) { $bestScore = $meanScore; $bestParams = $params; echo " β New best!\n"; }
echo "\n"; }
return [ 'best_params' => $bestParams, 'best_score' => $bestScore, 'all_results' => $results, ];}
/** * Generate all combinations of parameters (Cartesian product) */function generateParameterCombinations(array $paramGrid): array{ $keys = array_keys($paramGrid); $values = array_values($paramGrid);
$combinations = [[]];
foreach ($values as $keyIndex => $valueArray) { $newCombinations = [];
foreach ($combinations as $combination) { foreach ($valueArray as $value) { $newCombination = $combination; $newCombination[$keys[$keyIndex]] = $value; $newCombinations[] = $newCombination; } }
$combinations = $newCombinations; }
return $combinations;}Example: Tuning k-NN Classifier
Section titled βExample: Tuning k-NN Classifierβ# filename: 07-grid-search.php (excerpt)<?php
use Rubix\ML\Classifiers\KNearestNeighbors;use Rubix\ML\Kernels\Distance\Euclidean;use Rubix\ML\Kernels\Distance\Manhattan;
// Define parameter grid$paramGrid = [ 'k' => [1, 3, 5, 7, 9, 11, 15], 'weighted' => [true, false], 'distance' => ['euclidean', 'manhattan'],];
// Model factory$modelFactory = function(array $params) { $distance = $params['distance'] === 'euclidean' ? new Euclidean() : new Manhattan();
return new KNearestNeighbors( k: $params['k'], weighted: $params['weighted'], kernel: $distance );};
// Perform grid search$results = gridSearch( $trainSamples, $trainLabels, $paramGrid, $modelFactory, cv: 5);
echo "ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n";echo "β Grid Search Results β\n";echo "ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n\n";
echo "Best Parameters:\n";echo " k: " . $results['best_params']['k'] . "\n";echo " weighted: " . ($results['best_params']['weighted'] ? 'true' : 'false') . "\n";echo " distance: " . $results['best_params']['distance'] . "\n\n";
echo "Best CV Score: " . number_format($results['best_score'] * 100, 2) . "%\n\n";
// Train final model with best parameters$finalModel = $modelFactory($results['best_params']);$finalModel->train($trainSamples, $trainLabels);
// Evaluate on held-out test set (used only once!)$testAccuracy = evaluate($finalModel, $testSamples, $testLabels);echo "Test Set Accuracy: " . number_format($testAccuracy * 100, 2) . "%\n";Visualizing Grid Search Results
Section titled βVisualizing Grid Search Resultsβ# filename: 07-grid-search.php (excerpt)<?php
function visualizeGridResults(array $results): void{ echo "\nTop 10 Configurations:\n"; echo "βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n";
// Sort by mean score descending usort($results, fn($a, $b) => $b['mean_score'] <=> $a['mean_score']);
for ($i = 0; $i < min(10, count($results)); $i++) { $result = $results[$i]; $rank = $i + 1; $score = $result['mean_score'] * 100; $std = $result['std_score'] * 100; $params = json_encode($result['params']);
$bar = str_repeat('β', (int)($result['mean_score'] * 40));
echo sprintf( "%2d. %5.2f%% (Β±%4.2f%%) %s\n %s\n", $rank, $score, $std, $bar, $params ); }}Expected Result
Section titled βExpected Resultββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ Grid Search: k-NN Hyperparameter Tuning βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Parameter Grid: k: [1, 3, 5, 7, 9, 11, 15] (7 values) weighted: [true, false] (2 values) distance: [euclidean, manhattan] (2 values)
Total combinations: 7 Γ 2 Γ 2 = 28
Grid Search: Testing 28 combinationsβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Configuration 1/28: {"k":1,"weighted":true,"distance":"euclidean"} Mean CV Score: 89.20% (Β±3.45%)
Configuration 2/28: {"k":1,"weighted":false,"distance":"euclidean"} Mean CV Score: 87.60% (Β±4.12%)
Configuration 3/28: {"k":3,"weighted":true,"distance":"euclidean"} Mean CV Score: 92.40% (Β±2.10%) β New best!
[... 25 more configurations ...]
Configuration 19/28: {"k":7,"weighted":true,"distance":"manhattan"} Mean CV Score: 94.80% (Β±1.85%) β New best!
[... remaining configurations ...]
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββGRID SEARCH RESULTSββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Best Parameters Found: k: 7 weighted: true distance: manhattan
Best CV Score: 94.80% (Β±1.85%)
Top 10 Configurations:βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ 1. 94.80% (Β±1.85%) ββββββββββββββββββββββββββββββββββββββ {"k":7,"weighted":true,"distance":"manhattan"}
2. 94.20% (Β±2.10%) ββββββββββββββββββββββββββββββββββββββ {"k":5,"weighted":true,"distance":"manhattan"}
3. 93.60% (Β±2.45%) ββββββββββββββββββββββββββββββββββββββ {"k":7,"weighted":true,"distance":"euclidean"}
4. 92.80% (Β±2.90%) ββββββββββββββββββββββββββββββββββββββ {"k":9,"weighted":true,"distance":"manhattan"}
5. 92.40% (Β±2.10%) βββββββββββββββββββββββββββββββββββββ {"k":3,"weighted":true,"distance":"euclidean"}
[... 5 more ...]
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββFINAL EVALUATIONββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Training final model with best parameters...β Model trained
Evaluating on held-out test set (first time we touch it)...
Test Set Performance: Accuracy: 95.20%
Comparison: CV Score (development): 94.80% Test Score (final): 95.20% Difference: +0.40% β Great generalization!
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββKEY FINDINGSββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Insights from Grid Search:
1. Optimal k value: 7 β’ k=1 was too sensitive (89.20%) β’ k=15 was too smooth (90.40%) β’ k=7 provided best balance
2. Weighted voting helps β’ Weighted: 94.80% β’ Unweighted: 91.30% β’ +3.5% improvement!
3. Manhattan distance slightly better β’ Manhattan: 94.80% β’ Euclidean: 93.60% β’ +1.2% improvement
4. Computational cost β’ 28 configurations Γ 5 folds = 140 model trainings β’ Total time: 45 seconds β’ Worth it for +7% accuracy improvement!
π‘ Takeaway: Default parameters (k=5, unweighted, euclidean) achieved 92.40%. Grid search found 94.80% (+2.4% gain). Always worth trying hyperparameter tuning!Why It Works
Section titled βWhy It WorksβGrid search exhaustively evaluates all parameter combinations using cross-validation. This ensures:
- Finds global optimum (within the grid) rather than local optimum from manual tuning
- Uses CV for evaluation so you donβt overfit to test set
- Quantifies uncertainty with CV standard deviation
- Saves time over manual trial-and-error
Trade-offs:
- Computational cost: Exponential in number of parameters (7 Γ 2 Γ 2 Γ 5 CV folds = 140 model trainings)
- Grid granularity: Might miss optimum between your grid points
- Curse of dimensionality: With many hyperparameters, grid becomes impractically large
Alternative: Random search samples random combinations instead of exhaustive grid. Often finds good parameters 10x faster.
Troubleshooting
Section titled βTroubleshootingβ- Grid search takes forever β Reduce grid size, use fewer CV folds, or try random search instead.
- All configurations perform similarly β Hyperparameters may not matter much for this dataset. Spend time on features instead.
- Best params are at grid boundary β Extend your grid. If k=15 is best and itβs your maximum, try k=20, 25, 30.
- Test score much worse than CV score β Possible overfitting to CV folds. Use more CV folds or nested CV.
Step 6: Feature Importance and Selection (~12 min)
Section titled βStep 6: Feature Importance and Selection (~12 min)βIdentify which features actually matter and remove unhelpful features to reduce overfitting and improve performance.
Actions
Section titled βActionsβNot all features are created equal. Some are highly predictive, some add noise, and some are completely irrelevant. Feature selection removes unhelpful features to:
- Reduce overfitting (fewer features = simpler model)
- Improve interpretability (easier to explain with fewer features)
- Speed up training and inference (less data to process)
- Reduce data collection costs (donβt collect useless features)
Measuring Feature Importance
Section titled βMeasuring Feature ImportanceβFor tree-based models, feature importance is built-in. For other models, we can measure importance by permutation importance: shuffle a featureβs values and see how much performance drops.
# filename: 08-feature-importance.php (excerpt)<?php
/** * Calculate permutation importance for each feature * * @param object $model Trained model * @param array $samples Test features * @param array $labels Test labels * @param int $numRepeats Number of times to shuffle each feature * @return array Feature importance scores */function calculatePermutationImportance( object $model, array $samples, array $labels, int $numRepeats = 10): array { $numFeatures = count($samples[0]);
// Baseline accuracy $baselineAccuracy = calculateAccuracy( $model->predict($samples), $labels );
echo "Baseline accuracy: " . number_format($baselineAccuracy * 100, 2) . "%\n\n";
$importances = [];
for ($featureIdx = 0; $featureIdx < $numFeatures; $featureIdx++) { $accuracyDrops = [];
for ($repeat = 0; $repeat < $numRepeats; $repeat++) { // Create copy with this feature shuffled $shuffledSamples = $samples; $featureColumn = array_column($samples, $featureIdx); shuffle($featureColumn);
// Replace feature with shuffled values foreach ($shuffledSamples as $i => $sample) { $shuffledSamples[$i][$featureIdx] = $featureColumn[$i]; }
// Measure accuracy with shuffled feature $shuffledAccuracy = calculateAccuracy( $model->predict($shuffledSamples), $labels );
$accuracyDrops[] = $baselineAccuracy - $shuffledAccuracy; }
// Average drop across repeats $meanDrop = array_sum($accuracyDrops) / count($accuracyDrops); $stdDrop = calculateStd($accuracyDrops);
$importances[] = [ 'feature_index' => $featureIdx, 'importance' => $meanDrop, 'std' => $stdDrop, ];
echo "Feature {$featureIdx}: "; echo "Importance = " . number_format($meanDrop * 100, 2) . "%"; echo " (Β±" . number_format($stdDrop * 100, 2) . "%)\n"; }
// Sort by importance descending usort($importances, fn($a, $b) => $b['importance'] <=> $a['importance']);
return $importances;}Visualizing Feature Importance
Section titled βVisualizing Feature Importanceβ# filename: 08-feature-importance.php (excerpt)<?php
function visualizeFeatureImportance( array $importances, array $featureNames): void { echo "\nββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n"; echo "β Feature Importance Rankings β\n"; echo "ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n\n";
foreach ($importances as $rank => $imp) { $featureIdx = $imp['feature_index']; $name = $featureNames[$featureIdx] ?? "Feature {$featureIdx}"; $importance = $imp['importance'] * 100; $bar = str_repeat('β', (int)($importance * 2));
echo sprintf( "%2d. %-25s %6.2f%% %s\n", $rank + 1, $name, $importance, $bar ); }
echo "\nInterpretation:\n"; echo " β’ High importance: Shuffling this feature hurts performance\n"; echo " β’ Low importance: Feature doesn't contribute to predictions\n"; echo " β’ Negative importance: Feature may be adding noise\n";}Implementing Feature Selection
Section titled βImplementing Feature Selectionβ# filename: 09-feature-selection.php (excerpt)<?php
/** * Select top k features by importance * * @param array $samples Original feature matrix * @param array $importances Feature importance scores * @param int $k Number of features to keep * @return array Selected feature matrix */function selectTopFeatures( array $samples, array $importances, int $k): array { // Get indices of top k features $topIndices = array_slice( array_column($importances, 'feature_index'), 0, $k );
// Extract only those features $selectedSamples = [];
foreach ($samples as $sample) { $selectedSample = []; foreach ($topIndices as $idx) { $selectedSample[] = $sample[$idx]; } $selectedSamples[] = $selectedSample; }
return $selectedSamples;}
// Usage: Compare performance with all features vs. selected featuresecho "βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n";echo "COMPARING: All Features vs. Top Features\n";echo "βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n\n";
// Train with all features$modelAll = new KNearestNeighbors(5);$modelAll->train($trainSamples, $trainLabels);$accuracyAll = evaluate($modelAll, $testSamples, $testLabels);
echo "All {$numFeatures} features:\n";echo " Test Accuracy: " . number_format($accuracyAll * 100, 2) . "%\n\n";
// Train with top 5 features$trainSelected = selectTopFeatures($trainSamples, $importances, 5);$testSelected = selectTopFeatures($testSamples, $importances, 5);
$modelSelected = new KNearestNeighbors(5);$modelSelected->train($trainSelected, $trainLabels);$accuracySelected = evaluate($modelSelected, $testSelected, $testLabels);
echo "Top 5 features only:\n";echo " Test Accuracy: " . number_format($accuracySelected * 100, 2) . "%\n";
$diff = $accuracySelected - $accuracyAll;$icon = $diff >= 0 ? 'β' : 'β';
echo " Difference: " . number_format($diff * 100, 2) . "% {$icon}\n";echo " Features reduced: {$numFeatures} β 5 (" . number_format((1 - 5 / $numFeatures) * 100, 1) . "% reduction)\n";Expected Result
Section titled βExpected Resultββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ Feature Importance Analysis βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Dataset: Spam filter with 10 features
Calculating baseline accuracy...Baseline accuracy: 92.40%
Calculating permutation importance (10 repeats per feature)...
Feature 0 (exclamation_count): Importance = 12.30% (Β±1.20%)Feature 1 (has_urgent): Importance = 8.50% (Β±0.80%)Feature 2 (has_free): Importance = 7.20% (Β±1.00%)Feature 3 (capital_ratio): Importance = 5.40% (Β±0.90%)Feature 4 (word_count): Importance = 2.10% (Β±0.50%)Feature 5 (has_money): Importance = 1.80% (Β±0.60%)Feature 6 (num_links): Importance = 0.90% (Β±0.40%)Feature 7 (email_length): Importance = 0.30% (Β±0.30%)Feature 8 (time_of_day): Importance = -0.10% (Β±0.20%) β Noise!Feature 9 (day_of_week): Importance = -0.20% (Β±0.25%) β Noise!
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ Feature Importance Rankings βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
1. exclamation_count 12.30% βββββββββββββββββββββββββ 2. has_urgent 8.50% βββββββββββββββββ 3. has_free 7.20% βββββββββββββββ 4. capital_ratio 5.40% βββββββββββ 5. word_count 2.10% βββββ 6. has_money 1.80% ββββ 7. num_links 0.90% ββ 8. email_length 0.30% β 9. time_of_day -0.10%10. day_of_week -0.20%
Interpretation: β’ High importance: Shuffling this feature hurts performance β’ Low importance: Feature doesn't contribute to predictions β’ Negative importance: Feature may be adding noise
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββFEATURE SELECTION: Comparing Performanceβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Baseline (All 10 features): Test Accuracy: 92.40% Training time: 45ms Inference time: 1.2ms per prediction
Selected (Top 5 features): Test Accuracy: 92.80% β Slight improvement! Training time: 28ms β 38% faster Inference time: 0.7ms β 42% faster Features reduced: 10 β 5 (50.0% reduction)
Selected (Top 3 features): Test Accuracy: 88.60% β Too aggressive Training time: 18ms Features reduced: 10 β 3 (70.0% reduction)
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββKEY FINDINGSββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Feature Importance Insights: 1. exclamation_count is by far most predictive (12.3%) 2. "Urgent" and "free" keywords are strong signals (8.5%, 7.2%) 3. Time-based features (time_of_day, day_of_week) add only noise 4. Top 4 features account for 33.4% of predictive power
β Feature Selection Benefits: β’ Removed 50% of features with NO accuracy loss β’ Actually improved accuracy by 0.4%! (less overfitting) β’ Reduced training time by 38% β’ Faster inference (42% speedup)
π‘ Recommendation: Use top 5 features for production deployment: 1. exclamation_count 2. has_urgent 3. has_free 4. capital_ratio 5. word_count
Benefits: β’ Simpler model (easier to maintain) β’ Don't need to collect time_of_day, day_of_week β’ Faster and equally accurateWhy It Works
Section titled βWhy It WorksβPermutation importance works because it directly measures each featureβs contribution to predictions:
- If shuffling a feature destroys accuracy β feature is important
- If shuffling doesnβt change accuracy β feature is useless
- If shuffling improves accuracy β feature is adding noise!
Removing low-importance features helps because:
- Less overfitting: Fewer parameters = less chance to memorize noise
- Better signal-to-noise ratio: Noise features can confuse the model
- Curse of dimensionality: With many features, distances become less meaningful
- Occamβs Razor: Simpler models generalize better
Troubleshooting
Section titled βTroubleshootingβ- All features have near-zero importance β Model may not be using features effectively. Try different algorithm.
- Negative importance for key features β Somethingβs wrong. Check feature extraction or model training.
- Performance drops significantly after selection β You removed too many features. Keep more or use different selection method.
- Importance calculation takes too long β Reduce
numRepeatsfrom 10 to 3-5 for faster (but noisier) estimates.
Step 7: Ensemble Methods for Improved Performance (~15 min)
Section titled βStep 7: Ensemble Methods for Improved Performance (~15 min)βLearn to combine multiple models using ensemble techniques to achieve 2-5% accuracy improvements over single models.
Actions
Section titled βActionsβA single model, no matter how well-tuned, represents one βperspectiveβ on the data. Ensemble learning combines multiple models to leverage their collective intelligenceβlike consulting multiple experts instead of trusting just one.
The intuition: if three classifiers each achieve 90% accuracy but make different mistakes, combining them through voting can push accuracy to 95%+ by correcting each otherβs errors.
Hereβs how ensemble voting works in practice:
- Train multiple models on the same training data (e.g., k-NN, Naive Bayes, Decision Tree)
- Feed new samples to all models simultaneously
- Collect predictions from each model (e.g., Model 1: βspamβ, Model 2: βspamβ, Model 3: βhamβ)
- Apply majority voting to determine the final prediction
- Return the winner (in this example: βspamβ wins 2-1)
This approach reduces the impact of individual model weaknesses by leveraging their collective strengths.
Technique 1: Voting Classifier
Section titled βTechnique 1: Voting ClassifierβVoting classifier trains multiple different algorithms on the same data and combines their predictions through voting.
Hard Voting: Each model casts one vote, majority wins Soft Voting: Average predicted probabilities (requires probability outputs)
# filename: 10-ensemble-voting.php (excerpt)<?php
declare(strict_types=1);
use Rubix\ML\Classifiers\KNearestNeighbors;use Rubix\ML\Classifiers\GaussianNB;use Rubix\ML\Classifiers\ClassificationTree;use Rubix\ML\Datasets\Labeled;
/** * Voting classifier combining multiple models * * @param array $models Array of trained classifier objects * @param array $testSamples Test feature data * @param string $method 'hard' for majority vote, 'soft' for probability averaging * @return array Predicted labels */function votingClassifier( array $models, array $testSamples, string $method = 'hard'): array { $numSamples = count($testSamples); $predictions = [];
if ($method === 'hard') { // Each model makes predictions $allPredictions = []; foreach ($models as $model) { $allPredictions[] = $model->predict($testSamples); }
// Majority vote for each sample for ($i = 0; $i < $numSamples; $i++) { $votes = []; foreach ($allPredictions as $modelPredictions) { $vote = $modelPredictions[$i]; $votes[$vote] = ($votes[$vote] ?? 0) + 1; }
// Get class with most votes arsort($votes); $predictions[] = array_key_first($votes); } } else { // Soft voting: average probabilities $allProbabilities = []; foreach ($models as $model) { $allProbabilities[] = $model->proba($testSamples); }
for ($i = 0; $i < $numSamples; $i++) { $avgProbabilities = [];
// Average probabilities across models foreach ($allProbabilities as $modelProbas) { foreach ($modelProbas[$i] as $class => $proba) { $avgProbabilities[$class] = ($avgProbabilities[$class] ?? 0) + $proba; } }
// Divide by number of models foreach ($avgProbabilities as $class => $sum) { $avgProbabilities[$class] = $sum / count($models); }
// Predict class with highest average probability arsort($avgProbabilities); $predictions[] = array_key_first($avgProbabilities); } }
return $predictions;}
// Usage: Train multiple diverse modelsecho "Training individual models...\n";
$knn = new KNearestNeighbors(5);$knn->train(new Labeled($trainSamples, $trainLabels));
$nb = new GaussianNB();$nb->train(new Labeled($trainSamples, $trainLabels));
$tree = new ClassificationTree(maxDepth: 10);$tree->train(new Labeled($trainSamples, $trainLabels));
$models = [$knn, $nb, $tree];
// Test individual modelsecho "\nIndividual Model Performance:\n";echo "βββββββββββββββββββββββββββββββββββββββββββββββββ\n";
foreach (['k-NN' => $knn, 'Naive Bayes' => $nb, 'Decision Tree' => $tree] as $name => $model) { $predictions = $model->predict($testSamples); $accuracy = calculateAccuracy($predictions, $testLabels); echo sprintf("%-15s: %5.2f%%\n", $name, $accuracy * 100);}
// Test ensembleecho "\nEnsemble Performance:\n";echo "βββββββββββββββββββββββββββββββββββββββββββββββββ\n";
$hardVotePreds = votingClassifier($models, $testSamples, 'hard');$hardVoteAccuracy = calculateAccuracy($hardVotePreds, $testLabels);echo sprintf("Hard Voting : %5.2f%%\n", $hardVoteAccuracy * 100);
$softVotePreds = votingClassifier($models, $testSamples, 'soft');$softVoteAccuracy = calculateAccuracy($softVotePreds, $testLabels);echo sprintf("Soft Voting : %5.2f%%\n", $softVoteAccuracy * 100);Why different algorithms? Using diverse algorithms (k-NN, Naive Bayes, Decision Tree) ensures models make different types of errors. Using three k-NN classifiers with different k values helps lessβthey all make similar mistakes.
Technique 2: Bagging (Bootstrap Aggregating)
Section titled βTechnique 2: Bagging (Bootstrap Aggregating)βBagging trains multiple instances of the same algorithm on different random subsets of training data (with replacement), then averages predictions. It reduces variance (overfitting).
# filename: 11-ensemble-bagging.php (excerpt)<?php
/** * Bagging ensemble using bootstrap sampling * * @param callable $modelFactory Function that returns new model instance * @param array $samples Training features * @param array $labels Training labels * @param int $numModels Number of models in ensemble * @return object Object with predict() method */function bagging( callable $modelFactory, array $samples, array $labels, int $numModels = 10): object { $n = count($samples); $models = [];
echo "Creating bagging ensemble with {$numModels} models...\n";
for ($i = 0; $i < $numModels; $i++) { // Bootstrap sampling: random sample with replacement $bootstrapIndices = []; for ($j = 0; $j < $n; $j++) { $bootstrapIndices[] = rand(0, $n - 1); }
// Extract bootstrap sample $bootstrapSamples = []; $bootstrapLabels = []; foreach ($bootstrapIndices as $idx) { $bootstrapSamples[] = $samples[$idx]; $bootstrapLabels[] = $labels[$idx]; }
// Train model on bootstrap sample $model = $modelFactory(); $model->train($bootstrapSamples, $bootstrapLabels); $models[] = $model;
if (($i + 1) % 5 === 0) { echo " Trained " . ($i + 1) . "/{$numModels} models\n"; } }
// Return ensemble object return new class($models) { private array $models;
public function __construct(array $models) { $this->models = $models; }
public function predict(array $samples): array { $allPredictions = []; foreach ($this->models as $model) { $allPredictions[] = $model->predict($samples); }
// Majority vote for each sample $numSamples = count($samples); $predictions = [];
for ($i = 0; $i < $numSamples; $i++) { $votes = []; foreach ($allPredictions as $modelPredictions) { $vote = $modelPredictions[$i]; $votes[$vote] = ($votes[$vote] ?? 0) + 1; } arsort($votes); $predictions[] = array_key_first($votes); }
return $predictions; } };}
// Usage$modelFactory = fn() => new ClassificationTree(maxDepth: 15);
// Single decision tree (prone to overfitting)$singleTree = $modelFactory();$singleTree->train($trainSamples, $trainLabels);$singleAccuracy = evaluate($singleTree, $testSamples, $testLabels);
// Bagged ensemble$baggedEnsemble = bagging($modelFactory, $trainSamples, $trainLabels, numModels: 20);$baggedAccuracy = evaluate($baggedEnsemble, $testSamples, $testLabels);
echo "\nPerformance Comparison:\n";echo "βββββββββββββββββββββββββββββββββββββββββββββββββ\n";echo sprintf("Single Decision Tree: %5.2f%%\n", $singleAccuracy * 100);echo sprintf("Bagged Ensemble (20): %5.2f%% (+%.2f%%)\n", $baggedAccuracy * 100, ($baggedAccuracy - $singleAccuracy) * 100);Why bagging works: Individual decision trees overfit to their training data. By training on different bootstrap samples, trees make uncorrelated errors. Averaging their predictions reduces variance without increasing bias.
When to Use Ensembles
Section titled βWhen to Use Ensemblesβ| Scenario | Best Ensemble Type | Reason |
|---|---|---|
| Different algorithms available | Voting Classifier | Leverages algorithm diversity |
| Single algorithm overfits | Bagging | Reduces variance through averaging |
| Need maximum accuracy | Both (stack them!) | 2-5% gain worth complexity |
| Need interpretability | None | Single model easier to explain |
| Need fast inference | None | Ensemble is N times slower |
| Limited training data | Bagging | Bootstrap creates diverse training sets |
Ensemble Trade-offs:
β Pros:
- 2-5% accuracy improvement (sometimes more)
- More robust to overfitting
- Reduces impact of outliers
- Often wins ML competitions
β Cons:
- Slower inference (N times slower)
- More memory (storing N models)
- Less interpretable
- Longer training time
Expected Result
Section titled βExpected Resultββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ Ensemble Methods Comparison βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Dataset: 100 emails (13 spam, 87 ham)
Training individual models...β k-NN trainedβ Naive Bayes trainedβ Decision Tree trained
Individual Model Performance:βββββββββββββββββββββββββββββββββββββββββββββββββββββββββk-NN : 92.40%Naive Bayes : 88.60%Decision Tree : 90.20%
Ensemble Performance:βββββββββββββββββββββββββββββββββββββββββββββββββββββββββHard Voting : 94.10% (+1.7% improvement)Soft Voting : 94.80% (+2.4% improvement)
β Ensemble outperforms all individual models!
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββBagging Demonstrationββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Creating bagging ensemble with 20 models... Trained 5/20 models Trained 10/20 models Trained 15/20 models Trained 20/20 modelsβ Bagging ensemble ready
Performance Comparison:βββββββββββββββββββββββββββββββββββββββββββββββββββββββββSingle Decision Tree: 90.20% (overfits to training data)Bagged Ensemble (20): 95.30% (+5.1% improvement!)
Variance Reduction: Single Tree: Β±8.2% std dev across CV folds Bagged Ensemble: Β±2.4% std dev across CV folds β 71% reduction in variance!
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββKEY INSIGHTSββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
1. Ensemble Voting: β’ Combines diverse algorithms β’ Soft voting > Hard voting (uses probability info) β’ Works best when base models disagree on errors
2. Bagging: β’ Reduces overfitting of high-variance models β’ Works best with unstable models (Decision Trees, k-NN) β’ More models = more stable (diminishing returns after 20-50)
3. When to Use: β’ Use voting when you have multiple good algorithms β’ Use bagging when single model overfits β’ Combine both for maximum performance
4. Production Considerations: β’ Ensemble adds latency (inference is NΓ slower) β’ Consider caching predictions for common inputs β’ Profile: does 2-5% accuracy justify 5-10Γ slower inference?
π‘ Recommendation: Start with single model. Add ensemble if: β’ Accuracy gain justifies complexity β’ Inference speed requirements met β’ Maintenance burden acceptableWhy It Works
Section titled βWhy It WorksβVoting Classifier works through error independence: If Model A is 90% accurate and Model B is 90% accurate but they make different mistakes, combining them can approach 95-98% accuracy.
Mathematical intuition: If three independent classifiers each have error rate Ξ΅ = 0.1, the probability all three are wrong is:
[ P(\text{all wrong}) = Ξ΅^3 = 0.1^3 = 0.001 ]
The probability at least 2 of 3 are correct (majority vote is correct):
[ P(\text{majority correct}) β 1 - Ξ΅^3 - 3Ξ΅^2(1-Ξ΅) β 0.972 ]
Bagging reduces variance through the law of large numbers. If individual models have high variance (overfitting), averaging N models reduces variance by factor of ( 1/\sqrt{N} ) while keeping bias constant.
Troubleshooting
Section titled βTroubleshootingβ- Ensemble worse than best single model β Models are too similar (correlated errors). Use more diverse algorithms or more diverse training sets.
- Bagging doesnβt help β Base model has high bias (underfitting), not high variance. Bagging only helps with overfitting. Try more complex base model first.
- Soft voting error β Not all models support probability outputs. Use hard voting or stick with models that implement
proba()method. - Ensemble too slow β Reduce number of models, or use cascade ensemble (start with fast model, use ensemble only for uncertain cases).
Step 8: Handling Severely Imbalanced Datasets (~12 min)
Section titled βStep 8: Handling Severely Imbalanced Datasets (~12 min)βMaster techniques to handle severe class imbalance (1% or less minority class) where standard approaches fail completely.
Actions
Section titled βActionsβIn Step 2, you learned stratified cross-validation for moderate imbalance. But what if your dataset is 99% ham and 1% spam? A naive classifier that predicts βhamβ for everything achieves 99% accuracy but is completely useless!
Severely imbalanced datasets appear in many real-world scenarios:
- Fraud detection: 99.5% legitimate, 0.5% fraudulent transactions
- Medical diagnosis: 95% healthy, 5% disease
- Equipment failure: 99% normal operation, 1% failure
- Click prediction: 98% no-click, 2% click
Standard machine learning algorithms will simply learn to always predict the majority class.
The Problem Illustrated
Section titled βThe Problem Illustratedβ# filename: 12-class-imbalance-smote.php (excerpt)<?php
// Severe imbalance: 990 ham, 10 spam (1% minority)$labels = array_merge( array_fill(0, 990, 'ham'), array_fill(0, 10, 'spam'));
// Train standard classifier$classifier = new KNearestNeighbors(5);$classifier->train($samples, $labels);
// Evaluate$predictions = $classifier->predict($testSamples);$accuracy = calculateAccuracy($predictions, $testLabels);// Result: 99.0% accuracy!
// But check recall for spam...$components = calculateConfusionComponents($predictions, $testLabels, 'spam');$recall = calculateRecall($components);// Result: 0.0% recall β catches ZERO spam!
echo "Accuracy: {$accuracy} (misleading!)\n";echo "Spam Recall: {$recall} (useless!)\n";The classifier learned that always predicting βhamβ minimizes error on the training set!
Technique 1: Random Oversampling
Section titled βTechnique 1: Random OversamplingβRandom oversampling duplicates minority class samples until classes are balanced.
# filename: 12-class-imbalance-smote.php (excerpt)<?php
/** * Random oversampling of minority class * * @param array $samples Feature data * @param array $labels Target labels * @param string $minorityClass Class to oversample * @return array ['samples' => array, 'labels' => array] */function randomOversample( array $samples, array $labels, string $minorityClass): array { // Separate minority and majority $minorityIndices = []; $majorityIndices = [];
foreach ($labels as $idx => $label) { if ($label === $minorityClass) { $minorityIndices[] = $idx; } else { $majorityIndices[] = $idx; } }
$minorityCount = count($minorityIndices); $majorityCount = count($majorityIndices);
echo "Original distribution:\n"; echo " Minority ({$minorityClass}): {$minorityCount}\n"; echo " Majority: {$majorityCount}\n";
// Duplicate minority samples to match majority $oversampledIndices = $minorityIndices; while (count($oversampledIndices) < $majorityCount) { // Randomly pick a minority sample to duplicate $oversampledIndices[] = $minorityIndices[array_rand($minorityIndices)]; }
// Combine $allIndices = array_merge($majorityIndices, $oversampledIndices); shuffle($allIndices);
$newSamples = []; $newLabels = []; foreach ($allIndices as $idx) { $newSamples[] = $samples[$idx]; $newLabels[] = $labels[$idx]; }
echo "After oversampling:\n"; echo " Minority ({$minorityClass}): " . count($oversampledIndices) . "\n"; echo " Majority: {$majorityCount}\n";
return ['samples' => $newSamples, 'labels' => $newLabels];}Pros: Simple, preserves minority class samples Cons: Overfitting (exact duplicates), no new information
Technique 2: SMOTE (Synthetic Minority Over-sampling)
Section titled βTechnique 2: SMOTE (Synthetic Minority Over-sampling)βSMOTE creates synthetic minority samples by interpolating between existing minority samples and their nearest neighbors.
# filename: 12-class-imbalance-smote.php (excerpt)<?php
/** * SMOTE: Synthetic Minority Over-sampling Technique * * @param array $samples Feature data * @param array $labels Target labels * @param string $minorityClass Class to oversample * @param int $k Number of nearest neighbors to consider * @param float $ratio Target minority ratio (1.0 = fully balanced) * @return array ['samples' => array, 'labels' => array] */function smote( array $samples, array $labels, string $minorityClass, int $k = 5, float $ratio = 1.0): array { // Extract minority samples $minoritySamples = []; $minorityIndices = []; $majorityCount = 0;
foreach ($labels as $idx => $label) { if ($label === $minorityClass) { $minoritySamples[] = $samples[$idx]; $minorityIndices[] = $idx; } else { $majorityCount++; } }
$minorityCount = count($minoritySamples); $targetMinorityCount = (int)($majorityCount * $ratio); $numSynthetic = $targetMinorityCount - $minorityCount;
echo "SMOTE: Creating {$numSynthetic} synthetic {$minorityClass} samples\n";
$syntheticSamples = []; $syntheticLabels = [];
for ($i = 0; $i < $numSynthetic; $i++) { // Pick random minority sample $sampleIdx = array_rand($minoritySamples); $sample = $minoritySamples[$sampleIdx];
// Find k nearest neighbors (among minority class) $distances = []; foreach ($minoritySamples as $neighborIdx => $neighbor) { if ($neighborIdx === $sampleIdx) continue; $distances[$neighborIdx] = euclideanDistance($sample, $neighbor); } asort($distances); $nearestNeighbors = array_slice(array_keys($distances), 0, $k, true);
// Pick random neighbor $neighborIdx = $nearestNeighbors[array_rand($nearestNeighbors)]; $neighbor = $minoritySamples[$neighborIdx];
// Create synthetic sample via interpolation // synthetic = sample + Ξ» Γ (neighbor - sample), where Ξ» β [0, 1] $lambda = (float)rand(0, 100) / 100; $synthetic = [];
for ($featureIdx = 0; $featureIdx < count($sample); $featureIdx++) { $value = $sample[$featureIdx] + $lambda * ($neighbor[$featureIdx] - $sample[$featureIdx]); $synthetic[] = $value; }
$syntheticSamples[] = $synthetic; $syntheticLabels[] = $minorityClass; }
// Combine original and synthetic $newSamples = array_merge($samples, $syntheticSamples); $newLabels = array_merge($labels, $syntheticLabels);
// Shuffle $indices = range(0, count($newSamples) - 1); shuffle($indices);
$shuffledSamples = []; $shuffledLabels = []; foreach ($indices as $idx) { $shuffledSamples[] = $newSamples[$idx]; $shuffledLabels[] = $newLabels[$idx]; }
return ['samples' => $shuffledSamples, 'labels' => $shuffledLabels];}
function euclideanDistance(array $a, array $b): float{ $sum = 0; for ($i = 0; $i < count($a); $i++) { $diff = $a[$i] - $b[$i]; $sum += $diff * $diff; } return sqrt($sum);}SMOTE creates new samples by drawing lines between minority class samples and sampling along those lines:
[ \text{synthetic} = xi + Ξ» Γ (x{\text{neighbor}} - x_i), \quad Ξ» β [0, 1] ]
Pros: Creates diverse synthetic samples, reduces overfitting compared to duplication Cons: Can create unrealistic samples, increases training time
Technique 3: Random Undersampling
Section titled βTechnique 3: Random UndersamplingβRandom undersampling removes majority class samples until classes are balanced.
# filename: 13-class-weights.php (excerpt)<?php
/** * Random undersampling of majority class * * @param array $samples Feature data * @param array $labels Target labels * @param string $majorityClass Class to undersample * @param float $ratio Target majority/minority ratio (1.0 = fully balanced) * @return array ['samples' => array, 'labels' => array] */function randomUndersample( array $samples, array $labels, string $majorityClass, float $ratio = 1.0): array { $minorityIndices = []; $majorityIndices = [];
foreach ($labels as $idx => $label) { if ($label === $majorityClass) { $majorityIndices[] = $idx; } else { $minorityIndices[] = $idx; } }
$minorityCount = count($minorityIndices); $targetMajorityCount = (int)($minorityCount * $ratio);
// Randomly sample majority class shuffle($majorityIndices); $sampledMajorityIndices = array_slice($majorityIndices, 0, $targetMajorityCount);
// Combine $allIndices = array_merge($minorityIndices, $sampledMajorityIndices); shuffle($allIndices);
$newSamples = []; $newLabels = []; foreach ($allIndices as $idx) { $newSamples[] = $samples[$idx]; $newLabels[] = $labels[$idx]; }
return ['samples' => $newSamples, 'labels' => $newLabels];}Pros: Fast, reduces dataset size (faster training) Cons: Throws away potentially useful data, may remove informative samples
Technique 4: Class Weights
Section titled βTechnique 4: Class WeightsβClass weights adjust the modelβs loss function to penalize minority class errors more heavily, without changing the dataset.
# filename: 13-class-weights.php (excerpt)<?php
/** * Calculate class weights inversely proportional to class frequencies * * @param array $labels Target labels * @return array Associative array of class => weight */function calculateClassWeights(array $labels): array{ $classCounts = array_count_values($labels); $totalSamples = count($labels); $numClasses = count($classCounts);
$weights = [];
foreach ($classCounts as $class => $count) { // Weight inversely proportional to frequency // weight = total_samples / (num_classes Γ class_count) $weights[$class] = $totalSamples / ($numClasses * $count); }
echo "Class Weights:\n"; foreach ($weights as $class => $weight) { echo sprintf(" %-10s: %.2f (count: %d)\n", $class, $weight, $classCounts[$class] ); }
return $weights;}
// Usage with Rubix ML (supports sample weights)use Rubix\ML\Datasets\Labeled;
$weights = calculateClassWeights($trainLabels);
// Convert to sample weights$sampleWeights = array_map(fn($label) => $weights[$label], $trainLabels);
$dataset = new Labeled($trainSamples, $trainLabels);$dataset->setWeights($sampleWeights);
$classifier = new KNearestNeighbors(5, weighted: true);$classifier->train($dataset);How it works: If spam is 1% of data, spam samples get weight = 50, ham samples get weight = 0.505. The model now βcaresβ 100Γ more about misclassifying spam.
Pros: No dataset modification, works with any algorithm that supports weights Cons: Not all algorithms support sample weights, harder to tune
Comparing Techniques
Section titled βComparing Techniquesβ# filename: 12-class-imbalance-smote.php (excerpt)<?php
echo "ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n";echo "β Handling Severe Class Imbalance (1% spam) β\n";echo "ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n\n";
echo "Original Dataset: 990 ham, 10 spam (1% minority)\n\n";
// Baseline: No handlingecho "βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n";echo "BASELINE: No Imbalance Handling\n";echo "βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n";$baselineClassifier = new KNearestNeighbors(5);$baselineClassifier->train($samples, $labels);$baselinePredictions = $baselineClassifier->predict($testSamples);evaluateImbalanced($baselinePredictions, $testLabels, 'spam');
// Technique 1: Random Oversamplingecho "\nβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n";echo "TECHNIQUE 1: Random Oversampling\n";echo "βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n";$oversampled = randomOversample($samples, $labels, 'spam');$osClassifier = new KNearestNeighbors(5);$osClassifier->train($oversampled['samples'], $oversampled['labels']);$osPredictions = $osClassifier->predict($testSamples);evaluateImbalanced($osPredictions, $testLabels, 'spam');
// Technique 2: SMOTEecho "\nβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n";echo "TECHNIQUE 2: SMOTE (Synthetic Oversampling)\n";echo "βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n";$smoted = smote($samples, $labels, 'spam', k: 5, ratio: 1.0);$smoteClassifier = new KNearestNeighbors(5);$smoteClassifier->train($smoted['samples'], $smoted['labels']);$smotePredictions = $smoteClassifier->predict($testSamples);evaluateImbalanced($smotePredictions, $testLabels, 'spam');
// Technique 3: Undersamplingecho "\nβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n";echo "TECHNIQUE 3: Random Undersampling\n";echo "βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n";$undersampled = randomUndersample($samples, $labels, 'ham', ratio: 1.0);$usClassifier = new KNearestNeighbors(5);$usClassifier->train($undersampled['samples'], $undersampled['labels']);$usPredictions = $usClassifier->predict($testSamples);evaluateImbalanced($usPredictions, $testLabels, 'spam');
// Technique 4: Class Weightsecho "\nβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n";echo "TECHNIQUE 4: Class Weights\n";echo "βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n";$weights = calculateClassWeights($labels);$sampleWeights = array_map(fn($l) => $weights[$l], $labels);$dataset = new Labeled($samples, $labels);$dataset->setWeights($sampleWeights);$cwClassifier = new KNearestNeighbors(5, weighted: true);$cwClassifier->train($dataset);$cwPredictions = $cwClassifier->predict($testSamples);evaluateImbalanced($cwPredictions, $testLabels, 'spam');
function evaluateImbalanced(array $predictions, array $actuals, string $positiveClass): void{ $components = calculateConfusionComponents($predictions, $actuals, $positiveClass); $precision = calculatePrecision($components); $recall = calculateRecall($components); $f1 = calculateF1Score($precision, $recall); $accuracy = ($components['tp'] + $components['tn']) / count($actuals);
echo sprintf("Accuracy: %5.1f%% (misleading for imbalanced data)\n", $accuracy * 100); echo sprintf("Precision: %5.1f%% (of predicted spam, how many are real)\n", $precision * 100); echo sprintf("Recall: %5.1f%% (of actual spam, how many we caught)\n", $recall * 100); echo sprintf("F1-Score: %5.1f%% (harmonic mean)\n", $f1 * 100);}Expected Result
Section titled βExpected Resultββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ Handling Severe Class Imbalance (1% spam) βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Original Dataset: 990 ham, 10 spam (1% minority)
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββBASELINE: No Imbalance HandlingβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββTraining on imbalanced data...β Model trained
Accuracy: 99.0% (misleading for imbalanced data)Precision: 0.0% (of predicted spam, how many are real)Recall: 0.0% (of actual spam, how many we caught)F1-Score: 0.0% (harmonic mean)
β οΈ Model predicts ALL samples as ham β completely useless!
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββTECHNIQUE 1: Random OversamplingβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββOriginal distribution: Minority (spam): 10 Majority: 990
After oversampling: Minority (spam): 990 Majority: 990
β Balanced dataset: 990 spam, 990 ham
Accuracy: 92.5%Precision: 45.2% (many false positives)Recall: 85.0% β Now catching spam!F1-Score: 59.1%
π‘ Improved recall but low precision (too aggressive)
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββTECHNIQUE 2: SMOTE (Synthetic Oversampling)βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββSMOTE: Creating 980 synthetic spam samplesβ Synthetic samples generated via k-NN interpolation
Accuracy: 94.2%Precision: 78.3% β Much better precision!Recall: 90.0% β Great recall!F1-Score: 83.8% β Best F1!
π‘ SMOTE provides best balance β synthetic samples reduce overfitting
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββTECHNIQUE 3: Random UndersamplingβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββOriginal distribution: Minority (spam): 10 Majority (ham): 990
After undersampling: Minority (spam): 10 Majority (ham): 10
β οΈ Dataset reduced from 1000 to 20 samples!
Accuracy: 88.0%Precision: 62.5%Recall: 75.0%F1-Score: 68.2%
π‘ Works but loses information β only use if dataset is huge
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββTECHNIQUE 4: Class WeightsβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββClass Weights: spam : 50.00 (count: 10) ham : 0.51 (count: 990)
β Spam errors now penalized 100Γ more
Accuracy: 93.8%Precision: 82.1% β Excellent precision!Recall: 80.0% β Good recallF1-Score: 81.0% β Great balance
π‘ No dataset modification needed β works if algorithm supports weights
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββCOMPARISON SUMMARYββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Technique Precision Recall F1 Dataset SizeββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββBaseline (none) 0.0% 0.0% 0.0% 1000 (orig)Random Oversample 45.2% 85.0% 59.1% 1980 (2Γ)SMOTE 78.3% 90.0% 83.8% 1980 (2Γ) βRandom Undersample 62.5% 75.0% 68.2% 20 (1/50Γ)Class Weights 82.1% 80.0% 81.0% 1000 (orig) β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββRECOMMENDATIONSββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
1. **First choice: SMOTE** β’ Best F1-score (83.8%) β’ Creates realistic synthetic samples β’ Reduces overfitting vs. simple duplication
2. **Second choice: Class Weights** β’ No dataset modification β’ Fast (no data augmentation) β’ Best precision (82.1%) β’ Requires algorithm support
3. **When to use undersampling:** β’ Dataset is very large (>100k samples) β’ Training time is critical β’ Acceptable to discard data
4. **Avoid:** β’ Doing nothing (baseline)! β’ Random oversampling (SMOTE is better)
π‘ For production: Try SMOTE + class weights together for best results!Why It Works
Section titled βWhy It WorksβRandom Oversampling works by balancing class frequencies, forcing the model to pay equal attention to both classes. But it overfits because it creates exact duplicates.
SMOTE works better because synthetic samples lie between real examples in feature space, helping the model learn a more general decision boundary rather than memorizing specific points.
Undersampling works by removing the numerical advantage of the majority class, but throws away potentially useful information.
Class Weights works by modifying the loss function:
[ \text{Loss} = \sum_{i=1}^{n} w_i \times \text{error}_i ]
Where ( w_i ) is proportional to ( 1 / \text{class frequency} ). Minority class errors get 10-100Γ higher weight, forcing the model to care about them.
Troubleshooting
Section titled βTroubleshootingβ- SMOTE creates unrealistic samples β Reduce k (fewer neighbors) or use BorderlineSMOTE (only oversample samples near decision boundary).
- Undersampling loses too much data β Only remove majority samples far from decision boundary, or use Tomek Links to remove only ambiguous samples.
- Class weights donβt help β Algorithm may not support sample weights. Try SMOTE or switch algorithms.
- Still poor minority recall β Imbalance may be too severe (0.1% or less). Treat as anomaly detection instead of classification.
- High false positive rate β Using oversampling or aggressive weights. Adjust threshold or use ROC curve to find optimal balance.
Exercises
Section titled βExercisesβTest your understanding with these comprehensive evaluation and improvement challenges:
::: tip Exercise Solutions
Sample solutions are available in code/chapter-07/solutions/. Try implementing them yourself first!
:::
Exercise 1: Comprehensive Metrics Report
Section titled βExercise 1: Comprehensive Metrics ReportβGoal: Build a complete evaluation report generator
Create exercise1.php that:
- Trains a spam filter on provided dataset
- Generates confusion matrix
- Calculates accuracy, precision, recall, F1, MCC
- Creates visualization (text-based) for each metric
- Provides interpretation and recommendations
Validation: Your report should clearly identify whether the model is biased toward precision or recall and recommend threshold adjustments.
Exercise 2: Learning Curve Diagnosis
Section titled βExercise 2: Learning Curve DiagnosisβGoal: Use learning curves to diagnose model issues
Create exercise2.php that:
- Generates learning curves for training sizes [20, 50, 100, 200, 500]
- Plots both training and test accuracy
- Calculates gap at each size
- Diagnoses whether model has high bias, high variance, or is well-fitted
- Provides specific recommendation (more data vs. better features vs. simpler model)
Expected: Script should output clear diagnosis matching one of the three patterns from Step 4.
Exercise 3: Hyperparameter Tuning Competition
Section titled βExercise 3: Hyperparameter Tuning CompetitionβGoal: Find optimal hyperparameters for multiple algorithms
Create exercise3.php that:
- Performs grid search on k-NN (varying k, weighted, distance metric)
- Performs grid search on Naive Bayes (varying smoothing parameter)
- Compares best configuration from each algorithm
- Reports computational cost (time) vs. accuracy gain
- Recommends final model with justification
Validation: Should show clear comparison table and justify choice based on accuracy-speed tradeoff.
Exercise 4: Feature Selection Pipeline
Section titled βExercise 4: Feature Selection PipelineβGoal: Build automated feature selection
Create exercise4.php that:
- Calculates permutation importance for all features
- Tries models with top-k features for k β {3, 5, 7, 10, all}
- Plots accuracy vs. number of features
- Identifies optimal feature count
- Lists selected features with importance scores
Expected: Should find the βelbow pointβ where adding more features doesnβt help.
Exercise 5: ROC Curve and Threshold Optimization
Section titled βExercise 5: ROC Curve and Threshold OptimizationβGoal: Choose optimal threshold for specific use case
Create exercise5.php that:
- Trains a classifier that outputs probabilities
- Generates full ROC curve
- Calculates AUC
- Finds three optimal thresholds:
- Maximize F1-score
- Maximize precision (β₯90%)
- Maximize recall (β₯90%)
- Visualizes ROC curve and marks optimal points
Validation: Should show different thresholds for different optimization goals with clear tradeoffs.
Exercise 6: Production-Ready Spam Filter
Section titled βExercise 6: Production-Ready Spam FilterβGoal: Build and optimize a complete spam filter
Build a production-ready spam filter that:
- Loads email dataset (use Chapter 6 data or create synthetic)
- Engineers features (word counts, special characters, keywords)
- Performs stratified cross-validation
- Tunes hyperparameters with grid search
- Selects top features
- Builds ensemble classifier (voting or bagging)
- Handles class imbalance (if present) with SMOTE or class weights
- Evaluates with multiple metrics
- Performs systematic error analysis to identify patterns in misclassifications
- Generates comprehensive report
- Saves optimized model
Step 9 Requirements (Error Analysis):
- Identify all misclassified samples
- Group errors by pattern:
- False positives: legitimate emails incorrectly marked as spam
- False negatives: spam emails that slipped through
- Borderline cases: low-confidence predictions
- Calculate per-feature statistics for errors vs. correct predictions
- Generate error report with:
- Top 5 most problematic feature patterns
- Examples of misclassified samples with feature values
- Specific, actionable recommendations for improvement
Validation: Should achieve:
- Test accuracy > 95%
- Precision and recall both > 90%
- Ensemble shows improvement over single best model
- Feature count reduced by at least 30%
- Error analysis identifies at least 3 patterns
- Recommendations are specific and actionable
- Report includes error examples with feature values
- Complete documentation of all decisions
Troubleshooting
Section titled βTroubleshootingβCommon issues when evaluating and improving models:
Metric Calculation Issues
Section titled βMetric Calculation IssuesβError: βDivision by zero in precision calculationβ
Symptoms: Warning when calculating precision or recall
Cause: No positive predictions (TP + FP = 0) or no positive samples (TP + FN = 0)
Solution:
function calculatePrecision(array $components): float{ $tp = $components['tp']; $fp = $components['fp'];
// Handle edge case if ($tp + $fp === 0) { return 0.0; // Or 1.0 if no false positives is "perfect" }
return $tp / ($tp + $fp);}Error: βAUC calculation returns NaNβ
Symptoms: AUC score is NaN or infinite
Cause: All predictions are the same class, or only one class in test set
Solution: Use stratified splitting to ensure both classes in test set. Check model is actually making varied predictions.
Cross-Validation Issues
Section titled βCross-Validation IssuesβStratified CV produces unbalanced folds
Symptoms: Some folds have different class ratios than expected
Cause: Bug in stratification logic or minority class too small
Solution: Verify with printouts:
foreach ($folds as $i => $fold) { $labels = array_map(fn($idx) => $labels[$idx], $fold); $ratio = array_count_values($labels); print_r(["Fold $i" => $ratio]);}CV takes extremely long
Symptoms: Grid search or CV running for hours
Cause: Too many parameter combinations or large dataset
Solution: Reduce grid size, use fewer CV folds (3 instead of 5), or sample data:
// Use 30% of data for faster iteration during development$sampleSize = (int)(count($samples) * 0.3);$sampledData = array_slice($samples, 0, $sampleSize);$sampledLabels = array_slice($labels, 0, $sampleSize);Grid Search Issues
Section titled βGrid Search IssuesβBest params always at grid boundary
Symptoms: Optimal k is always the maximum you tested
Cause: Grid doesnβt cover the true optimum
Solution: Extend grid beyond current boundaries:
// If k=15 was best, try higher$paramGrid = [ 'k' => [15, 20, 25, 30], // Extended range];All configurations perform identically
Symptoms: CV scores are nearly identical across all hyperparameter values
Cause: Hyperparameters may not matter for this dataset, or range is too narrow
Solution: Either accept that hyperparameters donβt matter much (focus on features instead), or try wider ranges:
// Instead of k = [3, 5, 7, 9]$paramGrid = ['k' => [1, 5, 10, 20, 50]]; // Wider rangeFeature Importance Issues
Section titled βFeature Importance IssuesβAll features show zero importance
Symptoms: Permutation importance is ~0% for all features
Cause: Model isnβt learning (accuracy is at baseline), or test set is too small
Solution: Verify model is trained and performs better than random. Use larger test set (100+ samples).
Feature importance contradicts intuition
Symptoms: Obviously important feature shows low importance
Cause: Feature is correlated with other features (importance βsharedβ), or feature scale issues
Solution: This is normal with correlated features. Importance is marginal contribution. Use SHAP values for more nuanced attribution (beyond this chapter).
Negative importance for all features
Symptoms: Every feature has negative importance
Cause: Model is overfitting severely, or bug in permutation code
Solution: Check that baseline accuracy is calculated correctly before permutations. Reduce model complexity.
Learning Curve Issues
Section titled βLearning Curve IssuesβBoth curves flat at low accuracy
Symptoms: Training and test accuracy both around 60% regardless of data size
Cause: High bias β model is too simple or features arenβt informative
Solution: Try more complex model, engineer better features, or remove excessive regularization.
Curves diverge more with more data
Symptoms: Gap between train and test widens as training size increases
Cause: Very unusual! May indicate data distribution shift or bug
Solution: Check that test set is from same distribution as training set. Verify train/test split logic.
Test score higher than training score
Symptoms: Test accuracy exceeds training accuracy
Cause: Data leakage, swapped datasets, or test set is easier
Solution: Double-check train/test split. Verify no information from test leaked into training.
ROC Curve Issues
Section titled βROC Curve IssuesβROC curve below diagonal
Symptoms: AUC < 0.5, curve in bottom-right area
Cause: Predictions are inverted β model predicts opposite of truth
Solution: Flip predictions or check if positive/negative classes are swapped:
// Invert predictions$correctedPredictions = array_map( fn($p) => $p === 'spam' ? 'ham' : 'spam', $predictions);Cannot calculate TPR or FPR
Symptoms: Division by zero in ROC calculation
Cause: Test set doesnβt contain both classes at some threshold
Solution: Use stratified splitting. Ensure test set has reasonable number of both classes (at least 10 each).
AUC = 1.0 (perfect)
Symptoms: ROC curve hugs top-left corner, AUC exactly 1.0
Cause: Either youβve built a perfect model (extremely rare!), data leakage, or test set is trivially easy
Solution: Verify no data leakage. Check that test set is representative. If legitimate, celebrate but remain skeptical!
General Performance Issues
Section titled βGeneral Performance IssuesβModel performs well in CV but fails in production
Symptoms: 95% CV accuracy but only 70% in production
Cause: Distribution shift β production data differs from training data
Solution: Collect production data and retrain. Monitor for concept drift. Use online learning if appropriate.
Improvement techniques donβt help
Symptoms: Hyperparameter tuning, feature selection, and ensemble methods all fail to improve performance
Cause: Reached the limit of whatβs possible with current features
Solution: Focus on feature engineering. Get domain expertise. Collect more diverse training data.
Wrap-up
Section titled βWrap-upβCongratulations! Youβve mastered model evaluation and improvement. Hereβs what youβve accomplished:
- β Calculated comprehensive metrics going far beyond accuracy to include precision, recall, F1-score, specificity, and MCC for balanced evaluation
- β Understood the precision-recall tradeoff and learned when to optimize for each based on business costs
- β Implemented stratified cross-validation to get reliable estimates even with imbalanced datasets
- β Generated and interpreted ROC curves to visualize classifier performance across all thresholds and calculate AUC-ROC
- β Created learning curves to diagnose whether models need more data, better features, or different algorithms
- β Performed grid search to systematically find optimal hyperparameters using cross-validation
- β Calculated feature importance using permutation importance to identify which features drive predictions
- β Implemented feature selection to remove unhelpful features and improve generalization
- β Built ensemble classifiers using voting and bagging to achieve 2-5% accuracy gains over single models
- β Handled severely imbalanced datasets with SMOTE, random sampling, and class weights to improve minority class detection
- β Performed systematic error analysis to identify patterns in misclassifications and target improvements
You now have a comprehensive evaluation framework that separates amateur models from production-ready systems. You understand:
- Why accuracy alone is dangerously misleading
- How to choose metrics that match your business goals
- When more data helps vs. when better features are needed
- How to tune hyperparameters without overfitting to test data
- Which features actually matter and which add only noise
- How to visualize and communicate model performance
Most importantly, you have a systematic improvement process:
- Evaluate comprehensively with multiple metrics and cross-validation
- Diagnose issues with learning curves (bias vs. variance)
- Handle class imbalance with SMOTE, undersampling, or class weights if needed
- Tune hyperparameters with grid search and proper validation
- Select features to reduce overfitting and improve speed
- Build ensembles if accuracy gain justifies added complexity
- Analyze errors to identify patterns and target specific weaknesses
- Iterate until performance meets requirements
- Deploy with confidence backed by rigorous evaluation
Whatβs Next
Section titled βWhatβs NextβIn Chapter 08: Leveraging PHP Machine Learning Libraries, youβll dive deep into using PHP-ML and Rubix ML to:
- Leverage 40+ pre-built algorithms without implementing from scratch
- Use advanced features like pipelines, transformers, and cross-validators
- Save and load models for production deployment
- Build real-world projects faster with production-ready libraries
- Understand when to use which library and algorithm
Youβll apply all the evaluation techniques from this chapter to compare algorithms and choose the best for your specific use case.
Further Reading
Section titled βFurther ReadingβTo deepen your understanding of model evaluation and improvement:
- Precision and Recall - Wikipedia β Comprehensive explanation with examples across multiple domains
- ROC Curve and AUC Explained β Googleβs ML crash course on ROC curves
- An Introduction to Statistical Learning β Chapter 5 covers resampling methods (CV, bootstrap) in depth; Chapter 8 covers ensemble methods
- Feature Selection Guide - scikit-learn β Comprehensive coverage of feature selection techniques
- Hyperparameter Optimization - Practical Guide β When to use grid search vs. random search
- Understanding the Bias-Variance Tradeoff β Visual explanation of learning curves and what they reveal
- Rubix ML: Cross Validation β Official docs on metrics and validation in Rubix ML
- Matthews Correlation Coefficient β Why MCC is excellent for imbalanced data
- Interpreting Learning Curves β Detailed guide to diagnosing model issues
- Permutation Importance β From βInterpretable Machine Learningβ book
- Ensemble Methods - scikit-learn β Comprehensive guide to voting, bagging, boosting, and stacking
- SMOTE: Synthetic Minority Over-sampling Technique β Original research paper introducing SMOTE
- Imbalanced-learn Documentation β Python library specializing in imbalanced datasets (reference for techniques)
- Learning from Imbalanced Data β Comprehensive survey paper on handling class imbalance
- Random Forests (Bagging Example) β Leo Breimanβs introduction to Random Forests, the most famous bagging ensemble