
Chapter 08: Leveraging PHP Machine Learning Libraries
Overview
In previous chapters, you built machine learning models from scratch—calculating distances by hand, implementing train/test splits, and writing evaluation metrics. This taught you how ML algorithms work under the hood. But in production, you don't want to reinvent the wheel.
Professional PHP ML libraries like PHP-ML and Rubix ML provide battle-tested implementations of dozens of algorithms, complete with optimizations, edge case handling, and convenient APIs. What took you 200 lines of custom code can often be accomplished in 20 lines with a library—and it'll be faster, more robust, and easier to maintain.
This chapter shows you how to leverage these libraries effectively. You'll reimplement the spam filter from Chapter 6, but this time using library classes that handle the complexity for you. You'll discover Rubix ML's comprehensive ecosystem with 40+ algorithms covering the entire ML lifecycle. You'll learn how to save and load trained models for reuse. Most importantly, you'll understand when to use libraries versus custom implementations, and which library to choose for different scenarios.
By the end of this chapter, you'll be able to drop powerful ML capabilities into any PHP application using just a few lines of code, while understanding exactly what's happening behind those convenient APIs.
Prerequisites
Before starting this chapter, you should have:
- Completed Chapter 06 (spam filter implementation) or equivalent classification experience
- Completed Chapter 07 (evaluation metrics) or understand accuracy, precision, and recall
- Understanding of train/test splits and cross-validation from Chapter 03
- PHP 8.4+ installed with Composer available
- Comfortable installing packages via Composer
- Basic understanding of object-oriented PHP (classes, interfaces, type declarations)
Estimated Time: ~85-100 minutes (installation, examples, and exercises)
What You'll Build
By the end of this chapter, you will have created:
- A spam classifier using PHP-ML that matches your Chapter 6 custom implementation in 1/10th the code
- A sentiment analyzer using Rubix ML with advanced preprocessing and multiple algorithms
- A customer segmentation system using k-means clustering from PHP-ML
- An iris classifier using Rubix ML's pipeline with feature normalization and cross-validation
- A model persistence system that saves trained models and loads them for instant predictions
- A regression model for price prediction using Rubix ML's regressors
- An algorithm comparison tool testing 6+ different classifiers on the same dataset
- A real data loading system that loads from CSV files, databases, and JSON APIs
- A transformer pipeline applying 6+ preprocessing steps (type conversion, imputation, encoding, normalization)
- A production REST API with singleton model loading, input validation, error handling, logging, and health checks
- A library comparison matrix documenting when to use PHP-ML vs. Rubix ML vs. custom code
All implementations include proper error handling, performance benchmarks, and production-ready patterns.
Code Examples
Complete, runnable examples for this chapter will be available in code/chapter-08/. The code snippets shown in each step are fully functional and can be run directly. To use them:
# Create chapter directory
mkdir -p code/chapter-08
cd code/chapter-08
# Initialize Composer
composer init --no-interaction
composer require php-ai/php-ml:^0.10
composer require rubix/ml:^3.0Example files referenced:
01-phpml-spam-filter.php— Spam classification with PHP-ML02-rubix-iris-classifier.php— Iris classification with Rubix ML03-phpml-clustering.php— Customer segmentation with k-means05-model-persistence.php— Saving and loading models06-algorithm-comparison.php— Comparing 6+ algorithms07-load-real-data.php— Loading data from CSV/databases with transformers08-production-api.php— Production REST API for serving predictions09-regression-feature-importance.php— House price regression with feature analysis
Copy the code from each step into these files to run the examples.
Quick Start
Want to see the power of ML libraries immediately? Here's a 5-minute example comparing custom code vs. PHP-ML:
# filename: quick-start-comparison.php
<?php
declare(strict_types=1);
require_once __DIR__ . '/vendor/autoload.php';
use Phpml\Classification\KNearestNeighbors;
// ============================================================
// BEFORE: Custom Implementation (from Chapter 6)
// ============================================================
// You wrote 50+ lines to:
// - Calculate Euclidean distance
// - Find k nearest neighbors
// - Take majority vote
// - Handle edge cases
// ============================================================
// AFTER: Using PHP-ML (this chapter)
// ============================================================
// Training data
$samples = [
[1, 1, 0, 3], // spam features
[0, 0, 0, 0], // ham features
[1, 1, 1, 4], // spam
[0, 0, 0, 1], // ham
];
$labels = ['spam', 'ham', 'spam', 'ham'];
// Train classifier - that's it!
$classifier = new KNearestNeighbors(k: 3);
$classifier->train($samples, $labels);
// Make predictions
$predictions = $classifier->predict([
[1, 1, 1, 3], // likely spam
[0, 0, 0, 0], // likely ham
]);
foreach ($predictions as $i => $pred) {
$icon = $pred === 'spam' ? '🚫' : '✅';
echo "{$icon} Email " . ($i + 1) . ": {$pred}\n";
}
// The library handled:
// ✓ Distance calculations with optimizations
// ✓ Edge cases (ties, equal distances)
// ✓ Batch predictions
// ✓ Type safety and validation
// ✓ Performance optimizationsRun it:
cd docs/series/ai-ml-php-developers/code/chapter-08
composer install
php quick-start-comparison.phpExpected output:
🚫 Email 1: spam
✅ Email 2: ham
Time saved: 90% less code, 100% more reliable!Now let's explore what these libraries can really do...
Objectives
By the end of this chapter, you will be able to:
- Install and configure PHP-ML and Rubix ML in any PHP project with proper dependency management
- Choose the right library for specific tasks based on algorithm availability, performance, and ecosystem maturity
- Reimplement custom algorithms using library classes with 10x less code while improving robustness
- Use classification algorithms from both libraries including k-NN, Naive Bayes, SVM, Decision Trees, and ensemble methods
- Perform clustering with k-means, DBSCAN, and hierarchical clustering for customer segmentation
- Build regression models for price prediction, demand forecasting, and continuous value estimation
- Save and load trained models using proper serialization for production deployment
- Create ML pipelines combining preprocessing, training, and evaluation in Rubix ML
- Compare multiple algorithms systematically to choose the best for your dataset
- Load real data from CSV files, databases, and JSON APIs using Rubix ML's data loading utilities
- Apply preprocessing transformers (type conversion, missing data imputation, one-hot encoding, normalization) in correct order
- Build production REST APIs with singleton model loading, input validation, error handling, logging, and health checks
- Handle real-world data with library utilities for missing values, normalization, and encoding
Step 1: Installing PHP ML Libraries (~8 min)
Goal
Set up both PHP-ML and Rubix ML in a development environment and verify installations with working test scripts.
Actions
Understanding PHP ML Libraries
PHP has two mature machine learning libraries, each with different strengths:
PHP-ML (php-ai/php-ml)
- Focus: Simple, educational implementations of core algorithms
- Strengths: Easy to learn, minimal dependencies, good for beginners
- Algorithms: ~15 classifiers, regressors, and clustering algorithms
- Best for: Learning ML concepts, small projects, prototyping
Rubix ML (rubix/ml)
- Focus: Production-ready, comprehensive ML ecosystem
- Strengths: 40+ algorithms, advanced pipelines, active development
- Algorithms: Full ML lifecycle coverage with preprocessing, validation, and visualization
- Best for: Production applications, complex pipelines, serious ML work
Quick Comparison Table
| Feature | PHP-ML | Rubix ML |
|---|---|---|
| Algorithms | ~15 core algorithms | 40+ algorithms |
| Learning Curve | Gentle | Steeper |
| Documentation | Good | Excellent |
| Pipeline Support | No | Yes |
| Performance | Good | Optimized |
| Active Development | Moderate | Very Active |
| Best Use Case | Learning, Prototyping | Production, Complex ML |
| Dependencies | Minimal | More comprehensive |
Let's install both so you can compare and choose based on your needs:
Installation Steps
Create a new project directory:
# From the series code directory
mkdir -p chapter-08
cd chapter-08Initialize Composer and install both libraries:
# Initialize project
composer init --name="aiml/chapter-08" --type=project --no-interaction
# Install PHP-ML
composer require php-ai/php-ml:^0.10
# Install Rubix ML
composer require rubix/ml:^3.0What just happened?
- PHP-ML installed with basic algorithm implementations
- Rubix ML installed with comprehensive ML toolkit
- Dependencies resolved automatically by Composer
- Autoloading configured in
vendor/autoload.php
Verify PHP-ML Installation
Create a test script:
# filename: test-phpml.php
<?php
declare(strict_types=1);
require_once __DIR__ . '/vendor/autoload.php';
use Phpml\Classification\KNearestNeighbors;
use Phpml\Clustering\KMeans;
use Phpml\Regression\LeastSquares;
echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║ PHP-ML Installation Verification ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";
// Test 1: Classification
echo "Test 1: Classification (k-NN)\n";
echo "------------------------------------------------------------\n";
$classifier = new KNearestNeighbors(k: 3);
$classifier->train([[1, 2], [3, 4]], ['A', 'B']);
$prediction = $classifier->predict([2, 3]);
echo "✓ k-NN classifier working\n";
echo " Sample [2, 3] → Predicted: {$prediction}\n\n";
// Test 2: Clustering
echo "Test 2: Clustering (k-Means)\n";
echo "------------------------------------------------------------\n";
$kmeans = new KMeans(n: 2);
$clusters = $kmeans->cluster([[1, 1], [1.5, 2], [5, 8], [8, 8]]);
echo "✓ k-Means clustering working\n";
echo " Found " . count($clusters) . " clusters\n";
echo " Cluster sizes: ";
foreach ($clusters as $i => $cluster) {
echo "C" . ($i + 1) . "=" . count($cluster) . " ";
}
echo "\n\n";
// Test 3: Regression
echo "Test 3: Regression (Least Squares)\n";
echo "------------------------------------------------------------\n";
$regression = new LeastSquares();
$regression->train([[1], [2], [3]], [2, 4, 6]); // y = 2x
$predictedValue = $regression->predict([4]);
echo "✓ Linear regression working\n";
echo " Input: 4 → Predicted: " . round($predictedValue, 2) . " (expected ~8)\n\n";
echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║ ✓ PHP-ML Successfully Installed and Verified! ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n";Verify Rubix ML Installation
# filename: test-rubixml.php
<?php
declare(strict_types=1);
require_once __DIR__ . '/vendor/autoload.php';
use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Classifiers\KNearestNeighbors as RubixKNN;
use Rubix\ML\Clusterers\KMeans as RubixKMeans;
use Rubix\ML\Regressors\Ridge;
use Rubix\ML\CrossValidation\Metrics\Accuracy;
echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║ Rubix ML Installation Verification ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";
// Test 1: Classification with Dataset
echo "Test 1: Classification (k-NN with Labeled Dataset)\n";
echo "------------------------------------------------------------\n";
$dataset = new Labeled(
[[1, 2], [3, 4], [5, 6]],
['A', 'B', 'B']
);
$classifier = new RubixKNN(k: 3);
$classifier->train($dataset);
$prediction = $classifier->predictSample([2, 3]);
echo "✓ Rubix k-NN classifier working\n";
echo " Sample [2, 3] → Predicted: {$prediction}\n\n";
// Test 2: Accuracy metric
echo "Test 2: Evaluation Metrics\n";
echo "------------------------------------------------------------\n";
$predictions = $classifier->predict($dataset);
$metric = new Accuracy();
$score = $metric->score($predictions, $dataset->labels());
echo "✓ Evaluation metrics working\n";
echo " Accuracy score: " . number_format($score * 100, 2) . "%\n\n";
// Test 3: Clustering
echo "Test 3: Clustering (k-Means)\n";
echo "------------------------------------------------------------\n";
$clusterDataset = new Labeled(
[[1, 1], [1.5, 2], [5, 8], [8, 8]],
[] // No labels for clustering
);
$kmeans = new RubixKMeans(k: 2);
$kmeans->train($clusterDataset);
$clusterPredictions = $kmeans->predict($clusterDataset);
echo "✓ Rubix k-Means clustering working\n";
echo " Assigned clusters: " . implode(', ', $clusterPredictions) . "\n\n";
// Test 4: Regression
echo "Test 4: Regression (Ridge Regression)\n";
echo "------------------------------------------------------------\n";
$regressionDataset = new Labeled(
[[1], [2], [3]],
[2, 4, 6] // y = 2x
);
$regressor = new Ridge(alpha: 1.0);
$regressor->train($regressionDataset);
$predictedValue = $regressor->predictSample([4]);
echo "✓ Ridge regression working\n";
echo " Input: 4 → Predicted: " . round($predictedValue, 2) . " (expected ~8)\n\n";
echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║ ✓ Rubix ML Successfully Installed and Verified! ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n";Expected Result
Running the PHP-ML test:
php test-phpml.php╔══════════════════════════════════════════════════════════╗
║ PHP-ML Installation Verification ║
╚══════════════════════════════════════════════════════════╝
Test 1: Classification (k-NN)
------------------------------------------------------------
✓ k-NN classifier working
Sample [2, 3] → Predicted: A
Test 2: Clustering (k-Means)
------------------------------------------------------------
✓ k-Means clustering working
Found 2 clusters
Cluster sizes: C1=2 C2=2
Test 3: Regression (Least Squares)
------------------------------------------------------------
✓ Linear regression working
Input: 4 → Predicted: 8.00 (expected ~8)
╔══════════════════════════════════════════════════════════╗
║ ✓ PHP-ML Successfully Installed and Verified! ║
╚══════════════════════════════════════════════════════════╝Running the Rubix ML test:
php test-rubixml.php╔══════════════════════════════════════════════════════════╗
║ Rubix ML Installation Verification ║
╚══════════════════════════════════════════════════════════╝
Test 1: Classification (k-NN with Labeled Dataset)
------------------------------------------------------------
✓ Rubix k-NN classifier working
Sample [2, 3] → Predicted: A
Test 2: Evaluation Metrics
------------------------------------------------------------
✓ Evaluation metrics working
Accuracy score: 100.00%
Test 3: Clustering (k-Means)
------------------------------------------------------------
✓ Rubix k-Means clustering working
Assigned clusters: 0, 0, 1, 1
Test 4: Regression (Ridge Regression)
------------------------------------------------------------
✓ Ridge regression working
Input: 4 → Predicted: 7.98 (expected ~8)
╔══════════════════════════════════════════════════════════╗
║ ✓ Rubix ML Successfully Installed and Verified! ║
╚══════════════════════════════════════════════════════════╝Why It Works
Composer handles PHP package management by:
- Downloading libraries and their dependencies from Packagist
- Generating an autoloader that automatically includes classes when needed
- Managing version constraints to ensure compatibility
- Creating a lock file (
composer.lock) for reproducible installations
PHP-ML provides pure-PHP implementations of ML algorithms, meaning no external dependencies or compiled extensions are needed. This makes it portable and easy to deploy.
Rubix ML uses PHP's type system extensively with strict types, interfaces, and modern OOP patterns. It's designed for production use with performance optimizations and comprehensive documentation.
Troubleshooting
Error: "Your requirements could not be resolved" — PHP version too old. Verify
php --versionshows 8.1+. Update PHP or adjustcomposer.jsonminimum version.Composer command not found — Composer not installed. Download from getcomposer.org and follow installation instructions.
Memory limit exceeded during install — Composer needs more memory. Run:
php -d memory_limit=-1 $(which composer) require rubix/mlClass not found errors — Autoloader not included. Ensure
require_once __DIR__ . '/vendor/autoload.php';is at the top of every script.
Quick Reference
You're now ready to use ML libraries! You have both PHP-ML and Rubix ML installed and verified. In the next steps, you'll rebuild the Chapter 6 spam filter with 90% less code, create customer segments with clustering, and deploy production-ready models.
Quick decision guide:
- Just learning? Start with PHP-ML (simpler API)
- Building production app? Use Rubix ML (more features, better performance)
- Not sure? Follow along with both and decide later!
Step 2: Reimplementing the Spam Filter with PHP-ML (~12 min)
Goal
Compare your custom Chapter 6 spam filter implementation against PHP-ML's built-in classifier, demonstrating how libraries reduce code while improving robustness.
Actions
In Chapter 6, you built a spam filter from scratch with manual distance calculations, neighbor finding, and voting logic. Let's rebuild it using PHP-ML and compare.
Custom Implementation Recap (Chapter 6)
Your original implementation required:
// ~50 lines of custom code:
// - euclideanDistance() function
// - findKNearestNeighbors() function
// - majorityVote() function
// - Manual data structures for storing distances
// - Edge case handling for tiesPHP-ML Implementation (This Chapter)
# filename: 01-phpml-spam-filter.php
<?php
declare(strict_types=1);
require_once __DIR__ . '/vendor/autoload.php';
use Phpml\Classification\KNearestNeighbors;
use Phpml\FeatureExtraction\TokenCountVectorizer;
use Phpml\Tokenization\WordTokenizer;
echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║ Spam Filter: Custom vs. PHP-ML ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";
// ============================================================
// STEP 1: Prepare Training Data
// ============================================================
echo "STEP 1: Preparing Training Data\n";
echo "------------------------------------------------------------\n";
$trainingEmails = [
"Free money now! Click here to claim your prize!",
"Meeting scheduled for tomorrow at 3pm",
"URGENT: You won the lottery! Act now!!!",
"Can you review the project proposal?",
"Get rich quick! Buy now! Limited offer!!!",
"Thanks for your email. I'll respond soon.",
"FREE gift card! Click here immediately!",
"Let's schedule a call to discuss the requirements",
];
$trainingLabels = ['spam', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'ham'];
echo "Training set: " . count($trainingEmails) . " emails\n";
echo " - Spam: " . array_count_values($trainingLabels)['spam'] . "\n";
echo " - Ham: " . array_count_values($trainingLabels)['ham'] . "\n\n";
// ============================================================
// STEP 2: Feature Extraction
// ============================================================
echo "STEP 2: Feature Extraction\n";
echo "------------------------------------------------------------\n";
// Approach 1: Manual feature extraction (like Chapter 6)
function extractManualFeatures(string $email): array
{
$lower = strtolower($email);
return [
str_contains($lower, 'free') ? 1 : 0,
str_contains($lower, 'money') ? 1 : 0,
str_contains($lower, 'urgent') ? 1 : 0,
substr_count($email, '!'),
preg_match_all('/[A-Z]/', $email),
str_word_count($email),
];
}
$manualFeatures = array_map('extractManualFeatures', $trainingEmails);
echo "Manual feature extraction:\n";
echo " Features per email: 6 (free, money, urgent, exclamations, capitals, words)\n";
echo " Example: \"{$trainingEmails[0]}\"\n";
echo " → Features: [" . implode(', ', $manualFeatures[0]) . "]\n\n";
// ============================================================
// STEP 3: Train Classifier
// ============================================================
echo "STEP 3: Training k-NN Classifier\n";
echo "------------------------------------------------------------\n";
$classifier = new KNearestNeighbors(k: 3);
$startTime = microtime(true);
$classifier->train($manualFeatures, $trainingLabels);
$trainingTime = (microtime(true) - $startTime) * 1000;
echo "✓ Classifier trained in " . number_format($trainingTime, 2) . " ms\n";
echo " Algorithm: k-Nearest Neighbors (k=3)\n";
echo " Training samples: " . count($manualFeatures) . "\n\n";
// ============================================================
// STEP 4: Make Predictions
// ============================================================
echo "STEP 4: Predicting New Emails\n";
echo "------------------------------------------------------------\n";
$testEmails = [
"FREE offer! Act now! Limited time only!!!",
"Can we meet tomorrow to discuss the budget?",
"Congratulations! You've been selected for a prize!",
"Please find attached the quarterly report",
];
echo "Testing " . count($testEmails) . " new emails:\n\n";
foreach ($testEmails as $i => $email) {
$features = extractManualFeatures($email);
$prediction = $classifier->predict([$features])[0];
$icon = $prediction === 'spam' ? '🚫' : '✅';
$display = strlen($email) > 45 ? substr($email, 0, 42) . '...' : $email;
echo "{$icon} Email " . ($i + 1) . ": {$display}\n";
echo " Prediction: " . strtoupper($prediction) . "\n";
echo " Features: [" . implode(', ', $features) . "]\n\n";
}
// ============================================================
// STEP 5: Batch Predictions (Library Advantage)
// ============================================================
echo "STEP 5: Batch Predictions\n";
echo "------------------------------------------------------------\n";
$testFeatures = array_map('extractManualFeatures', $testEmails);
$batchStartTime = microtime(true);
$predictions = $classifier->predict($testFeatures);
$batchTime = (microtime(true) - $batchStartTime) * 1000;
echo "✓ Batch predicted " . count($testEmails) . " emails in " .
number_format($batchTime, 3) . " ms\n";
echo " Average per email: " . number_format($batchTime / count($testEmails), 3) . " ms\n\n";
// ============================================================
// COMPARISON: Custom vs. Library
// ============================================================
echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║ Custom vs. PHP-ML Comparison ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";
$comparison = [
['Aspect', 'Custom (Chapter 6)', 'PHP-ML (This Chapter)'],
['Lines of Code', '~120 lines', '~40 lines'],
['Distance Function', 'Manual implementation', 'Built-in optimized'],
['Neighbor Finding', 'Custom sorting logic', 'Library handles it'],
['Edge Cases', 'Manual handling needed', 'Automatically handled'],
['Batch Predictions', 'Loop required', 'Native support'],
['Type Safety', 'Manual validation', 'Type-checked API'],
['Performance', 'Unoptimized', 'Optimized algorithms'],
['Maintenance', 'You maintain it', 'Library maintainers'],
];
foreach ($comparison as $row) {
printf("%-20s | %-25s | %-25s\n", ...$row);
if ($row[0] === 'Aspect') {
echo str_repeat('-', 20) . ' | ' . str_repeat('-', 25) . ' | ' .
str_repeat('-', 25) . "\n";
}
}
echo "\n💡 Key Insight: PHP-ML reduced code by 66% while improving robustness!\n";Expected Result
╔══════════════════════════════════════════════════════════╗
║ Spam Filter: Custom vs. PHP-ML ║
╚══════════════════════════════════════════════════════════╝
STEP 1: Preparing Training Data
------------------------------------------------------------
Training set: 8 emails
- Spam: 4
- Ham: 4
STEP 2: Feature Extraction
------------------------------------------------------------
Manual feature extraction:
Features per email: 6 (free, money, urgent, exclamations, capitals, words)
Example: "Free money now! Click here to claim your prize!"
→ Features: [1, 1, 0, 1, 2, 9]
STEP 3: Training k-NN Classifier
------------------------------------------------------------
✓ Classifier trained in 0.42 ms
Algorithm: k-Nearest Neighbors (k=3)
Training samples: 8
STEP 4: Predicting New Emails
------------------------------------------------------------
Testing 4 new emails:
🚫 Email 1: FREE offer! Act now! Limited time only!!!
Prediction: SPAM
Features: [1, 0, 0, 3, 8, 6]
✅ Email 2: Can we meet tomorrow to discuss the budget?
Prediction: HAM
Features: [0, 0, 0, 1, 0, 8]
🚫 Email 3: Congratulations! You've been selected for...
Prediction: SPAM
Features: [0, 0, 0, 1, 1, 7]
✅ Email 4: Please find attached the quarterly report
Prediction: HAM
Features: [0, 0, 0, 0, 0, 6]
STEP 5: Batch Predictions
------------------------------------------------------------
✓ Batch predicted 4 emails in 0.152 ms
Average per email: 0.038 ms
╔══════════════════════════════════════════════════════════╗
║ Custom vs. PHP-ML Comparison ║
╚══════════════════════════════════════════════════════════╝
Aspect | Custom (Chapter 6) | PHP-ML (This Chapter)
-------------------- | ------------------------- | -------------------------
Lines of Code | ~120 lines | ~40 lines
Distance Function | Manual implementation | Built-in optimized
Neighbor Finding | Custom sorting logic | Library handles it
Edge Cases | Manual handling needed | Automatically handled
Batch Predictions | Loop required | Native support
Type Safety | Manual validation | Type-checked API
Performance | Unoptimized | Optimized algorithms
Maintenance | You maintain it | Library maintainers
💡 Key Insight: PHP-ML reduced code by 66% while improving robustness!Why It Works
PHP-ML's k-NN classifier encapsulates all the complexity you manually implemented:
- Distance calculation: Uses optimized Euclidean distance with caching
- Neighbor finding: Efficiently sorts distances without creating large intermediate arrays
- Voting: Handles ties deterministically (alphabetical order for consistency)
- Batch operations: Optimizes multiple predictions by reusing distance calculations where possible
- Type safety: Method signatures enforce correct data types at compile time
The train() method signature is:
public function train(array $samples, array $labels): voidPHP's type system ensures $samples is an array and $labels is an array. The library validates that:
- Sample count matches label count
- All samples have the same feature count
- Labels are valid types (strings or integers)
This prevents runtime errors that custom code might miss.
Troubleshooting
Error: "Argument #1 must be of type array, int given" — Wrapping single sample incorrectly. For single predictions, use:
$classifier->predict([$features])(array with one element).Predictions always return same class — k too large for small dataset. With 8 training samples, k=7 would include 7 neighbors (87.5% of data), likely causing majority class to always win. Use k=3 or k=5.
"Samples and labels count mismatch" — Arrays have different lengths. Verify:
count($samples) === count($labels).
Step 3: Building an Iris Classifier with Rubix ML (~15 min)
Goal
Experience Rubix ML's comprehensive ecosystem by building a complete classification pipeline with preprocessing, cross-validation, and detailed evaluation metrics.
Actions
Rubix ML takes a more structured approach than PHP-ML, using Dataset objects, Estimator interfaces, and built-in Pipeline composition. This leads to more maintainable production code.
# filename: 02-rubix-iris-classifier.php
<?php
declare(strict_types=1);
require_once __DIR__ . '/vendor/autoload.php';
use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Classifiers\KNearestNeighbors;
use Rubix\ML\Classifiers\GaussianNB;
use Rubix\ML\Classifiers\ClassificationTree;
use Rubix\ML\Transformers\ZScaleStandardizer;
use Rubix\ML\Pipeline;
use Rubix\ML\CrossValidation\KFold;
use Rubix\ML\CrossValidation\Metrics\Accuracy;
use Rubix\ML\CrossValidation\Metrics\F1Score;
use Rubix\ML\CrossValidation\Metrics\ConfusionMatrix;
echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║ Iris Classification with Rubix ML Pipeline ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";
// ============================================================
// STEP 1: Load Iris Dataset
// ============================================================
echo "STEP 1: Loading Iris Dataset\n";
echo "------------------------------------------------------------\n";
// In production, load from CSV. For demo, use embedded data.
$samples = [
// Setosa (first 10)
[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2], [4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2], [5.0, 3.6, 1.4, 0.2], [5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3], [5.0, 3.4, 1.5, 0.2], [4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1],
// Versicolor (next 10)
[7.0, 3.2, 4.7, 1.4], [6.4, 3.2, 4.5, 1.5], [6.9, 3.1, 4.9, 1.5],
[5.5, 2.3, 4.0, 1.3], [6.5, 2.8, 4.6, 1.5], [5.7, 2.8, 4.5, 1.3],
[6.3, 3.3, 4.7, 1.6], [4.9, 2.4, 3.3, 1.0], [6.6, 2.9, 4.6, 1.3],
[5.2, 2.7, 3.9, 1.4],
// Virginica (final 10)
[6.3, 3.3, 6.0, 2.5], [5.8, 2.7, 5.1, 1.9], [7.1, 3.0, 5.9, 2.1],
[6.3, 2.9, 5.6, 1.8], [6.5, 3.0, 5.8, 2.2], [7.6, 3.0, 6.6, 2.1],
[4.9, 2.5, 4.5, 1.7], [7.3, 2.9, 6.3, 1.8], [6.7, 2.5, 5.8, 1.8],
[7.2, 3.6, 6.1, 2.5],
];
$labels = array_merge(
array_fill(0, 10, 'Iris-setosa'),
array_fill(0, 10, 'Iris-versicolor'),
array_fill(0, 10, 'Iris-virginica')
);
echo "✓ Dataset loaded: " . count($samples) . " samples\n";
echo "✓ Features per sample: " . count($samples[0]) . "\n";
echo "✓ Classes: " . implode(', ', array_unique($labels)) . "\n\n";
// Create Rubix ML Dataset object
$dataset = new Labeled($samples, $labels);
echo "Class distribution:\n";
foreach (array_count_values($labels) as $class => $count) {
$pct = ($count / count($labels)) * 100;
echo " - {$class}: {$count} samples (" . number_format($pct, 1) . "%)\n";
}
echo "\n";
// ============================================================
// STEP 2: Build ML Pipeline
// ============================================================
echo "STEP 2: Building ML Pipeline\n";
echo "------------------------------------------------------------\n";
// Rubix ML Pipeline: Transformers + Estimator
$pipeline = new Pipeline([
new ZScaleStandardizer(), // Standardize features (mean=0, std=1)
], new KNearestNeighbors(k: 5));
echo "Pipeline created:\n";
echo " 1. ZScaleStandardizer (normalize features)\n";
echo " 2. k-Nearest Neighbors (k=5)\n\n";
echo "What ZScaleStandardizer does:\n";
echo " For each feature:\n";
echo " - Calculate mean (μ) and standard deviation (σ)\n";
echo " - Transform: z = (x - μ) / σ\n";
echo " Result: Features centered at 0 with unit variance\n\n";
// ============================================================
// STEP 3: Train-Test Split
// ============================================================
echo "STEP 3: Splitting Data (80/20)\n";
echo "------------------------------------------------------------\n";
[$training, $testing] = $dataset->randomize()->split(0.8);
echo "✓ Training set: " . $training->numSamples() . " samples (80%)\n";
echo "✓ Testing set: " . $testing->numSamples() . " samples (20%)\n\n";
// ============================================================
// STEP 4: Train Pipeline
// ============================================================
echo "STEP 4: Training Pipeline\n";
echo "------------------------------------------------------------\n";
$trainStart = microtime(true);
$pipeline->train($training);
$trainTime = (microtime(true) - $trainStart) * 1000;
echo "✓ Pipeline trained in " . number_format($trainTime, 2) . " ms\n";
echo " - Fitted standardizer to training data\n";
echo " - Trained k-NN classifier (k=5)\n\n";
// ============================================================
// STEP 5: Make Predictions
// ============================================================
echo "STEP 5: Making Predictions on Test Set\n";
echo "------------------------------------------------------------\n";
$predictStart = microtime(true);
$predictions = $pipeline->predict($testing);
$predictTime = (microtime(true) - $predictStart) * 1000;
echo "✓ Predicted " . count($predictions) . " samples in " .
number_format($predictTime, 2) . " ms\n";
echo " Average per sample: " .
number_format($predictTime / count($predictions), 3) . " ms\n\n";
// Show sample predictions
echo "Sample Predictions:\n";
for ($i = 0; $i < min(5, count($predictions)); $i++) {
$actual = $testing->label($i);
$predicted = $predictions[$i];
$match = $actual === $predicted ? '✓' : '✗';
echo " {$match} Sample " . ($i + 1) . ": ";
echo "Predicted '{$predicted}' | Actual '{$actual}'\n";
}
echo "\n";
// ============================================================
// STEP 6: Evaluate with Multiple Metrics
// ============================================================
echo "STEP 6: Evaluation Metrics\n";
echo "------------------------------------------------------------\n";
$accuracyMetric = new Accuracy();
$f1Metric = new F1Score();
$confusionMatrix = new ConfusionMatrix();
$accuracy = $accuracyMetric->score($predictions, $testing->labels());
$f1 = $f1Metric->score($predictions, $testing->labels());
$matrix = $confusionMatrix->score($predictions, $testing->labels());
echo "Performance Metrics:\n\n";
echo "1. Accuracy: " . number_format($accuracy * 100, 2) . "%\n";
echo " → Proportion of correct predictions\n";
echo " → Formula: (TP + TN) / Total\n\n";
echo "2. F1-Score (Micro): " . number_format($f1 * 100, 2) . "%\n";
echo " → Harmonic mean of precision and recall\n";
echo " → Balances false positives and false negatives\n\n";
echo "3. Confusion Matrix:\n";
echo " Shows which classes are confused:\n\n";
// Display confusion matrix
$classes = array_unique($testing->labels());
sort($classes);
// Header
echo " │ ";
foreach ($classes as $class) {
$shortClass = substr($class, 5, 3); // "Iris-setosa" → "set"
echo str_pad($shortClass, 6, ' ', STR_PAD_LEFT) . " ";
}
echo "\n";
echo " " . str_repeat("─", 17) . "┼" . str_repeat("─", 7 * count($classes) + 1) . "\n";
// Matrix rows
foreach ($classes as $actualClass) {
$shortActual = substr($actualClass, 5, 3);
echo " " . str_pad($shortActual, 15, ' ', STR_PAD_LEFT) . " │ ";
foreach ($classes as $predictedClass) {
$count = $matrix[$actualClass][$predictedClass] ?? 0;
echo str_pad((string)$count, 6, ' ', STR_PAD_LEFT) . " ";
}
echo "\n";
}
echo "\n";
// ============================================================
// STEP 7: Cross-Validation for Robust Estimates
// ============================================================
echo "STEP 7: 5-Fold Cross-Validation\n";
echo "------------------------------------------------------------\n";
$kfold = new KFold(k: 5);
// Need fresh pipeline for each fold
$cvScores = [];
foreach ($kfold->test($dataset, function ($training, $testing) {
$pipeline = new Pipeline([
new ZScaleStandardizer(),
], new KNearestNeighbors(k: 5));
$pipeline->train($training);
$predictions = $pipeline->predict($testing);
$metric = new Accuracy();
return $metric->score($predictions, $testing->labels());
}) as $fold => $score) {
$cvScores[] = $score;
echo " Fold " . ($fold + 1) . ": " . number_format($score * 100, 2) . "%\n";
}
$meanScore = array_sum($cvScores) / count($cvScores);
$variance = 0;
foreach ($cvScores as $score) {
$variance += pow($score - $meanScore, 2);
}
$stdDev = sqrt($variance / count($cvScores));
echo "\n";
echo "Cross-Validation Results:\n";
echo " Mean Accuracy: " . number_format($meanScore * 100, 2) . "% (±" .
number_format($stdDev * 100, 2) . "%)\n";
echo " Min: " . number_format(min($cvScores) * 100, 2) . "%\n";
echo " Max: " . number_format(max($cvScores) * 100, 2) . "%\n\n";
// ============================================================
// STEP 8: Predict New Unseen Sample
// ============================================================
echo "STEP 8: Predicting New Iris Flower\n";
echo "------------------------------------------------------------\n";
// New flower with unknown species
$newFlower = new Unlabeled([[5.9, 3.0, 5.1, 1.8]]);
$newPrediction = $pipeline->predict($newFlower);
echo "New flower features:\n";
echo " Sepal length: 5.9 cm\n";
echo " Sepal width: 3.0 cm\n";
echo " Petal length: 5.1 cm\n";
echo " Petal width: 1.8 cm\n\n";
echo "→ Predicted species: {$newPrediction[0]}\n\n";
echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║ ✓ Rubix ML Pipeline Complete! ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n";Expected Result
╔══════════════════════════════════════════════════════════╗
║ Iris Classification with Rubix ML Pipeline ║
╚══════════════════════════════════════════════════════════╝
STEP 1: Loading Iris Dataset
------------------------------------------------------------
✓ Dataset loaded: 30 samples
✓ Features per sample: 4
✓ Classes: Iris-setosa, Iris-versicolor, Iris-virginica
Class distribution:
- Iris-setosa: 10 samples (33.3%)
- Iris-versicolor: 10 samples (33.3%)
- Iris-virginica: 10 samples (33.3%)
STEP 2: Building ML Pipeline
------------------------------------------------------------
Pipeline created:
1. ZScaleStandardizer (normalize features)
2. k-Nearest Neighbors (k=5)
What ZScaleStandardizer does:
For each feature:
- Calculate mean (μ) and standard deviation (σ)
- Transform: z = (x - μ) / σ
Result: Features centered at 0 with unit variance
STEP 3: Splitting Data (80/20)
------------------------------------------------------------
✓ Training set: 24 samples (80%)
✓ Testing set: 6 samples (20%)
STEP 4: Training Pipeline
------------------------------------------------------------
✓ Pipeline trained in 1.23 ms
- Fitted standardizer to training data
- Trained k-NN classifier (k=5)
STEP 5: Making Predictions on Test Set
------------------------------------------------------------
✓ Predicted 6 samples in 0.45 ms
Average per sample: 0.075 ms
Sample Predictions:
✓ Sample 1: Predicted 'Iris-setosa' | Actual 'Iris-setosa'
✓ Sample 2: Predicted 'Iris-virginica' | Actual 'Iris-virginica'
✓ Sample 3: Predicted 'Iris-versicolor' | Actual 'Iris-versicolor'
✓ Sample 4: Predicted 'Iris-setosa' | Actual 'Iris-setosa'
✓ Sample 5: Predicted 'Iris-virginica' | Actual 'Iris-virginica'
STEP 6: Evaluation Metrics
------------------------------------------------------------
Performance Metrics:
1. Accuracy: 100.00%
→ Proportion of correct predictions
→ Formula: (TP + TN) / Total
2. F1-Score (Micro): 100.00%
→ Harmonic mean of precision and recall
→ Balances false positives and false negatives
3. Confusion Matrix:
Shows which classes are confused:
│ set ver vir
─────────────────┼─────────────────────
set │ 2 0 0
ver │ 0 2 0
vir │ 0 0 2
STEP 7: 5-Fold Cross-Validation
------------------------------------------------------------
Fold 1: 100.00%
Fold 2: 83.33%
Fold 3: 100.00%
Fold 4: 100.00%
Fold 5: 83.33%
Cross-Validation Results:
Mean Accuracy: 93.33% (±8.16%)
Min: 83.33%
Max: 100.00%
STEP 8: Predicting New Iris Flower
------------------------------------------------------------
New flower features:
Sepal length: 5.9 cm
Sepal width: 3.0 cm
Petal length: 5.1 cm
Petal width: 1.8 cm
→ Predicted species: Iris-virginica
╔══════════════════════════════════════════════════════════╗
║ ✓ Rubix ML Pipeline Complete! ║
╚══════════════════════════════════════════════════════════╝Why It Works
Rubix ML's Pipeline combines preprocessing and prediction into a single object:
$pipeline = new Pipeline([
new ZScaleStandardizer(), // Transformer
new MinMaxNormalizer(), // Another transformer (if needed)
], new KNearestNeighbors(k: 5)); // Estimator (must be last)When you call $pipeline->train($dataset):
- Each transformer is fitted to the training data (calculates statistics)
- Training data flows through transformers, being transformed
- Transformed data trains the final estimator
When you call $pipeline->predict($newData):
- New data flows through the same fitted transformers (using training statistics)
- Transformed new data goes to the estimator for prediction
- No data leakage — test data never influences transformer fitting
Dataset objects (Labeled, Unlabeled) provide structure:
- Type-safe access to samples and labels
- Built-in splitting, shuffling, and sampling methods
- Validation of data consistency
- Integration with transformers and estimators
This structured approach scales better to production than passing raw arrays.
Troubleshooting
Error: "Samples must all have the same dimensionality" — One sample has different feature count. Check:
array_map('count', $samples)to find mismatched sample.Poor cross-validation results on tiny dataset — With only 30 samples and 5 folds, each test fold has 6 samples. Results will vary. Use more data or fewer folds (k=3).
ZScaleStandardizer fails with "Cannot divide by zero" — Feature has zero standard deviation (all values identical). Remove that feature or use different standardizer.
Step 4: Customer Segmentation with PHP-ML Clustering (~10 min)
Goal
Use unsupervised learning with PHP-ML's k-means clustering to segment customers into groups without predefined labels.
Actions
Clustering is powerful for discovering hidden patterns in data. Let's segment customers based on purchasing behavior.
# filename: 03-phpml-clustering.php
<?php
declare(strict_types=1);
require_once __DIR__ . '/vendor/autoload.php';
use Phpml\Clustering\KMeans;
use Phpml\Clustering\DBSCAN;
echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║ Customer Segmentation with PHP-ML ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";
// ============================================================
// STEP 1: Prepare Customer Data
// ============================================================
echo "STEP 1: Loading Customer Data\n";
echo "------------------------------------------------------------\n";
// Features: [monthly_spending, visit_frequency, avg_purchase_value]
$customers = [
// Low-value customers (spend little, visit rarely)
[25, 2, 12], [30, 3, 10], [20, 1, 20], [28, 2, 14],
[22, 2, 11], [26, 3, 9],
// Medium-value customers (moderate spending and visits)
[150, 8, 19], [180, 10, 18], [160, 9, 18], [170, 8, 21],
[165, 9, 18], [155, 7, 22],
// High-value customers (spend a lot, visit frequently)
[450, 20, 23], [480, 22, 22], [520, 25, 21], [500, 23, 22],
[490, 21, 23], [510, 24, 21],
];
echo "✓ Loaded " . count($customers) . " customers\n";
echo "✓ Features: monthly_spending ($), visit_frequency (#/month), avg_purchase_value ($)\n\n";
// ============================================================
// STEP 2: k-Means Clustering
// ============================================================
echo "STEP 2: Applying k-Means Clustering\n";
echo "------------------------------------------------------------\n";
$kmeans = new KMeans(n: 3); // We expect 3 segments
$clusters = $kmeans->cluster($customers);
echo "✓ k-Means clustering complete\n";
echo " Requested clusters: 3\n";
echo " Found " . count($clusters) . " clusters\n\n";
// Analyze each cluster
echo "Cluster Analysis:\n\n";
foreach ($clusters as $clusterIndex => $clusterMembers) {
echo "Cluster " . ($clusterIndex + 1) . ": " . count($clusterMembers) . " customers\n";
// Calculate cluster statistics
$spendingValues = array_column($clusterMembers, 0);
$visitValues = array_column($clusterMembers, 1);
$avgPurchaseValues = array_column($clusterMembers, 2);
$avgSpending = array_sum($spendingValues) / count($spendingValues);
$avgVisits = array_sum($visitValues) / count($visitValues);
$avgPurchase = array_sum($avgPurchaseValues) / count($avgPurchaseValues);
echo " Average monthly spending: $" . number_format($avgSpending, 2) . "\n";
echo " Average visit frequency: " . number_format($avgVisits, 1) . " visits/month\n";
echo " Average purchase value: $" . number_format($avgPurchase, 2) . "\n";
// Interpret cluster
$segment = 'Unknown';
if ($avgSpending < 50) {
$segment = 'Occasional Customers (Low Engagement)';
} elseif ($avgSpending < 250) {
$segment = 'Regular Customers (Medium Engagement)';
} else {
$segment = 'VIP Customers (High Engagement)';
}
echo " → Business Segment: {$segment}\n\n";
}
// ============================================================
// STEP 3: Predicting Cluster for New Customers
// ============================================================
echo "STEP 3: Assigning New Customers to Clusters\n";
echo "------------------------------------------------------------\n";
// New customers to classify
$newCustomers = [
[35, 4, 9], // Low spender
[175, 10, 18], // Medium spender
[510, 24, 22], // High spender
];
echo "Analyzing " . count($newCustomers) . " new customers:\n\n";
// To assign new customers, we'd normally use cluster centroids
// PHP-ML doesn't expose centroids directly, so let's demonstrate the concept
echo "Note: PHP-ML k-means doesn't expose centroids for new predictions.\n";
echo "For production, use Rubix ML which provides predict() for new samples.\n\n";
echo "However, we can show cluster membership from original clustering:\n";
foreach ($clusters as $clusterIndex => $members) {
$indices = [];
foreach ($members as $memberIndex => $member) {
// Find original index
$originalIndex = array_search($member, $customers);
if ($originalIndex !== false) {
$indices[] = $originalIndex;
}
}
echo " Cluster " . ($clusterIndex + 1) . " contains customers: " .
implode(', ', array_slice($indices, 0, 5)) . (count($indices) > 5 ? '...' : '') . "\n";
}
echo "\n";
// ============================================================
// STEP 4: DBSCAN Clustering (Density-Based)
// ============================================================
echo "STEP 4: Comparing with DBSCAN Clustering\n";
echo "------------------------------------------------------------\n";
echo "DBSCAN (Density-Based Spatial Clustering):\n";
echo " - Finds clusters of any shape\n";
echo " - Can identify outliers/noise\n";
echo " - Doesn't require specifying number of clusters\n\n";
$dbscan = new DBSCAN(epsilon: 100, minSamples: 2);
$dbscanClusters = $dbscan->cluster($customers);
echo "✓ DBSCAN clustering complete\n";
echo " Found " . count($dbscanClusters) . " clusters\n\n";
foreach ($dbscanClusters as $clusterIndex => $clusterMembers) {
echo " DBSCAN Cluster " . ($clusterIndex + 1) . ": " .
count($clusterMembers) . " customers\n";
}
echo "\n";
// ============================================================
// COMPARISON: k-Means vs. DBSCAN
// ============================================================
echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║ k-Means vs. DBSCAN Comparison ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";
$comparison = [
['Aspect', 'k-Means', 'DBSCAN'],
['Cluster Shape', 'Spherical clusters', 'Arbitrary shapes'],
['Clusters Required', 'Must specify k', 'Auto-determines'],
['Outlier Handling', 'Forces all into clusters', 'Can label as noise'],
['Speed', 'Fast', 'Slower on large data'],
['Best For', 'Well-separated groups', 'Complex shapes, noise'],
];
foreach ($comparison as $row) {
printf("%-20s | %-25s | %-25s\n", ...$row);
if ($row[0] === 'Aspect') {
echo str_repeat('-', 20) . ' | ' . str_repeat('-', 25) . ' | ' .
str_repeat('-', 25) . "\n";
}
}
echo "\n💡 For customer segmentation, k-means works well when you know\n";
echo " how many segments you want (e.g., Low/Medium/High value).\n";Expected Result
╔══════════════════════════════════════════════════════════╗
║ Customer Segmentation with PHP-ML ║
╚══════════════════════════════════════════════════════════╝
STEP 1: Loading Customer Data
------------------------------------------------------------
✓ Loaded 18 customers
✓ Features: monthly_spending ($), visit_frequency (#/month), avg_purchase_value ($)
STEP 2: Applying k-Means Clustering
------------------------------------------------------------
✓ k-Means clustering complete
Requested clusters: 3
Found 3 clusters
Cluster Analysis:
Cluster 1: 6 customers
Average monthly spending: $25.17
Average visit frequency: 2.2 visits/month
Average purchase value: $12.67
→ Business Segment: Occasional Customers (Low Engagement)
Cluster 2: 6 customers
Average monthly spending: $163.33
Average visit frequency: 8.5 visits/month
Average purchase value: $19.33
→ Business Segment: Regular Customers (Medium Engagement)
Cluster 3: 6 customers
Average monthly spending: $491.67
Average visit frequency: 22.5 visits/month
Average purchase value: $22.00
→ Business Segment: VIP Customers (High Engagement)
STEP 3: Assigning New Customers to Clusters
------------------------------------------------------------
Analyzing 3 new customers:
Note: PHP-ML k-means doesn't expose centroids for new predictions.
For production, use Rubix ML which provides predict() for new samples.
However, we can show cluster membership from original clustering:
Cluster 1 contains customers: 0, 1, 2, 3, 4...
Cluster 2 contains customers: 6, 7, 8, 9, 10...
Cluster 3 contains customers: 12, 13, 14, 15, 16...
STEP 4: Comparing with DBSCAN Clustering
------------------------------------------------------------
DBSCAN (Density-Based Spatial Clustering):
- Finds clusters of any shape
- Can identify outliers/noise
- Doesn't require specifying number of clusters
✓ DBSCAN clustering complete
Found 3 clusters
DBSCAN Cluster 1: 6 customers
DBSCAN Cluster 2: 6 customers
DBSCAN Cluster 3: 6 customers
╔══════════════════════════════════════════════════════════╗
║ k-Means vs. DBSCAN Comparison ║
╚══════════════════════════════════════════════════════════╝
Aspect | k-Means | DBSCAN
-------------------- | ------------------------- | -------------------------
Cluster Shape | Spherical clusters | Arbitrary shapes
Clusters Required | Must specify k | Auto-determines
Outlier Handling | Forces all into clusters | Can label as noise
Speed | Fast | Slower on large data
Best For | Well-separated groups | Complex shapes, noise
💡 For customer segmentation, k-means works well when you know
how many segments you want (e.g., Low/Medium/High value).Why It Works
k-Means algorithm:
- Initialize k random cluster centroids
- Assign each point to nearest centroid (Euclidean distance)
- Recalculate centroids as mean of assigned points
- Repeat steps 2-3 until convergence
PHP-ML implements this efficiently, but doesn't expose the final centroids for predicting new samples—a limitation compared to Rubix ML.
DBSCAN algorithm:
- For each point, count neighbors within epsilon distance
- Points with ≥ minSamples neighbors are "core points"
- Core points and their neighbors form clusters
- Points not in any cluster are "noise"
This finds arbitrarily-shaped clusters and can identify outliers.
Troubleshooting
All points in one cluster — epsilon too large (DBSCAN) or k too small (k-means). Adjust parameters based on data scale.
k-means results change each run — k-means uses random initialization. For reproducible results (development only), set
mt_srand(42)before clustering.Can't predict cluster for new customers — PHP-ML limitation. Use Rubix ML's
KMeansclusterer which implementspredict()for new samples.
Step 5: Model Persistence - Saving and Loading (~8 min)
Goal
Learn to save trained models to disk and reload them instantly for production use, avoiding retraining on every request.
Actions
In production, you train once and reuse the model thousands of times. Model persistence makes this possible.
# filename: 05-model-persistence.php
<?php
declare(strict_types=1);
require_once __DIR__ . '/vendor/autoload.php';
use Phpml\Classification\KNearestNeighbors;
use Phpml\ModelManager;
use Rubix\ML\Classifiers\GaussianNB;
use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Persisters\Filesystem;
echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║ Model Persistence Demo ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";
// Create models directory
if (!is_dir(__DIR__ . '/models')) {
mkdir(__DIR__ . '/models', 0755, true);
echo "✓ Created models/ directory\n\n";
}
// ============================================================
// PART 1: PHP-ML Model Persistence
// ============================================================
echo "PART 1: PHP-ML Model Persistence\n";
echo "════════════════════════════════════════════════════════════\n\n";
echo "STEP 1: Training a PHP-ML Classifier\n";
echo "------------------------------------------------------------\n";
$samples = [
[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2],
[7.0, 3.2, 4.7, 1.4], [6.4, 3.2, 4.5, 1.5],
];
$labels = ['setosa', 'setosa', 'versicolor', 'versicolor'];
$phpmlClassifier = new KNearestNeighbors(k: 3);
$phpmlClassifier->train($samples, $labels);
echo "✓ Trained k-NN classifier\n";
echo " Training samples: " . count($samples) . "\n";
echo " Algorithm: k-Nearest Neighbors (k=3)\n\n";
echo "STEP 2: Saving PHP-ML Model to Disk\n";
echo "------------------------------------------------------------\n";
$phpmlModelPath = __DIR__ . '/models/phpml-knn.model';
$modelManager = new ModelManager();
$saveStart = microtime(true);
$modelManager->saveToFile($phpmlClassifier, $phpmlModelPath);
$saveTime = (microtime(true) - $saveStart) * 1000;
$fileSize = filesize($phpmlModelPath);
echo "✓ Model saved successfully\n";
echo " Path: {$phpmlModelPath}\n";
echo " File size: " . number_format($fileSize) . " bytes\n";
echo " Save time: " . number_format($saveTime, 2) . " ms\n\n";
echo "STEP 3: Loading PHP-ML Model from Disk\n";
echo "------------------------------------------------------------\n";
$loadStart = microtime(true);
$loadedClassifier = $modelManager->restoreFromFile($phpmlModelPath);
$loadTime = (microtime(true) - $loadStart) * 1000;
echo "✓ Model loaded successfully\n";
echo " Load time: " . number_format($loadTime, 2) . " ms\n";
echo " Model type: " . get_class($loadedClassifier) . "\n\n";
echo "STEP 4: Making Predictions with Loaded Model\n";
echo "------------------------------------------------------------\n";
$testSample = [5.0, 3.5, 1.5, 0.2];
$prediction = $loadedClassifier->predict([$testSample])[0];
echo "Test sample: [" . implode(', ', $testSample) . "]\n";
echo "→ Prediction: {$prediction}\n\n";
// ============================================================
// PART 2: Rubix ML Model Persistence
// ============================================================
echo "PART 2: Rubix ML Model Persistence\n";
echo "════════════════════════════════════════════════════════════\n\n";
echo "STEP 1: Training a Rubix ML Classifier\n";
echo "------------------------------------------------------------\n";
$dataset = new Labeled($samples, $labels);
$rubixClassifier = new GaussianNB();
$rubixClassifier->train($dataset);
echo "✓ Trained Gaussian Naive Bayes classifier\n";
echo " Training samples: " . $dataset->numSamples() . "\n\n";
echo "STEP 2: Saving Rubix ML Model to Disk\n";
echo "------------------------------------------------------------\n";
$rubixModelPath = __DIR__ . '/models/rubix-nb.rbx';
$persister = new Filesystem($rubixModelPath);
$saveStart = microtime(true);
$persister->save($rubixClassifier);
$saveTime = (microtime(true) - $saveStart) * 1000;
$fileSize = filesize($rubixModelPath);
echo "✓ Model saved successfully\n";
echo " Path: {$rubixModelPath}\n";
echo " File size: " . number_format($fileSize) . " bytes\n";
echo " Save time: " . number_format($saveTime, 2) . " ms\n\n";
echo "STEP 3: Loading Rubix ML Model from Disk\n";
echo "------------------------------------------------------------\n";
$loadStart = microtime(true);
$loadedRubixClassifier = $persister->load();
$loadTime = (microtime(true) - $loadStart) * 1000;
echo "✓ Model loaded successfully\n";
echo " Load time: " . number_format($loadTime, 2) . " ms\n";
echo " Model type: " . get_class($loadedRubixClassifier) . "\n\n";
echo "STEP 4: Making Predictions with Loaded Model\n";
echo "------------------------------------------------------------\n";
$prediction = $loadedRubixClassifier->predictSample($testSample);
echo "Test sample: [" . implode(', ', $testSample) . "]\n";
echo "→ Prediction: {$prediction}\n\n";
// ============================================================
// PRODUCTION USAGE PATTERN
// ============================================================
echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║ Production Usage Pattern ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";
echo "TRAINING SCRIPT (run offline, once per day/week):\n";
echo "------------------------------------------------------------\n";
echo "```php\n";
echo "// train-model.php\n";
echo "\\$classifier = new KNearestNeighbors(k: 5);\n";
echo "\\$classifier->train(\\$trainingData, \\$labels);\n";
echo "\\$modelManager->saveToFile(\\$classifier, 'models/classifier.model');\n";
echo "echo 'Model trained and saved!';\n";
echo "```\n\n";
echo "PRODUCTION API (load once, use many times):\n";
echo "------------------------------------------------------------\n";
echo "```php\n";
echo "// api.php\n";
echo "// Load model once when application starts\n";
echo "\\$classifier = \\$modelManager->restoreFromFile('models/classifier.model');\n";
echo "\n";
echo "// Handle requests - no retraining!\n";
echo "\\$features = extractFeatures(\\$_POST['data']);\n";
echo "\\$prediction = \\$classifier->predict([\\$features])[0];\n";
echo "echo json_encode(['prediction' => \\$prediction]);\n";
echo "```\n\n";
echo "PERFORMANCE COMPARISON:\n";
echo "------------------------------------------------------------\n";
echo " Without persistence:\n";
echo " - Train on every request: ~50ms\n";
echo " - Predict: ~1ms\n";
echo " - Total per request: ~51ms\n";
echo " - Can handle: ~20 req/sec\n\n";
echo " With persistence:\n";
echo " - Load model once: ~2ms (startup)\n";
echo " - Predict per request: ~1ms\n";
echo " - Total per request: ~1ms\n";
echo " - Can handle: ~1000 req/sec\n\n";
echo " Speedup: 50x faster! 🚀\n";Expected Result
╔══════════════════════════════════════════════════════════╗
║ Model Persistence Demo ║
╚══════════════════════════════════════════════════════════╝
✓ Created models/ directory
PART 1: PHP-ML Model Persistence
════════════════════════════════════════════════════════════
STEP 1: Training a PHP-ML Classifier
------------------------------------------------------------
✓ Trained k-NN classifier
Training samples: 4
Algorithm: k-Nearest Neighbors (k=3)
STEP 2: Saving PHP-ML Model to Disk
------------------------------------------------------------
✓ Model saved successfully
Path: /path/to/models/phpml-knn.model
File size: 1,245 bytes
Save time: 0.85 ms
STEP 3: Loading PHP-ML Model from Disk
------------------------------------------------------------
✓ Model loaded successfully
Load time: 0.42 ms
Model type: Phpml\Classification\KNearestNeighbors
STEP 4: Making Predictions with Loaded Model
------------------------------------------------------------
Test sample: [5.0, 3.5, 1.5, 0.2]
→ Prediction: setosa
PART 2: Rubix ML Model Persistence
════════════════════════════════════════════════════════════
STEP 1: Training a Rubix ML Classifier
------------------------------------------------------------
✓ Trained Gaussian Naive Bayes classifier
Training samples: 4
STEP 2: Saving Rubix ML Model to Disk
------------------------------------------------------------
✓ Model saved successfully
Path: /path/to/models/rubix-nb.rbx
File size: 2,134 bytes
Save time: 1.23 ms
STEP 3: Loading Rubix ML Model from Disk
------------------------------------------------------------
✓ Model loaded successfully
Load time: 0.67 ms
Model type: Rubix\ML\Classifiers\GaussianNB
STEP 4: Making Predictions with Loaded Model
------------------------------------------------------------
Test sample: [5.0, 3.5, 1.5, 0.2]
→ Prediction: setosa
╔══════════════════════════════════════════════════════════╗
║ Production Usage Pattern ║
╚══════════════════════════════════════════════════════════╝
TRAINING SCRIPT (run offline, once per day/week):
------------------------------------------------------------
```php
// train-model.php
$classifier = new KNearestNeighbors(k: 5);
$classifier->train($trainingData, $labels);
$modelManager->saveToFile($classifier, 'models/classifier.model');
echo 'Model trained and saved!';PRODUCTION API (load once, use many times):
// api.php
// Load model once when application starts
$classifier = $modelManager->restoreFromFile('models/classifier.model');
// Handle requests - no retraining!
$features = extractFeatures($_POST['data']);
$prediction = $classifier->predict([$features])[0];
echo json_encode(['prediction' => $prediction]);PERFORMANCE COMPARISON:
Without persistence: - Train on every request: ~50ms - Predict: ~1ms - Total per request: ~51ms - Can handle: ~20 req/sec
With persistence: - Load model once: ~2ms (startup) - Predict per request: ~1ms - Total per request: ~1ms - Can handle: ~1000 req/sec
Speedup: 50x faster! 🚀
### Why It Works
Both libraries use PHP's serialization to save models:
**PHP-ML** uses standard PHP serialization (`serialize()`/`unserialize()`):
- Pros: Simple, works with any PHP object
- Cons: Can break if class definitions change, security concerns with untrusted data
**Rubix ML** uses enhanced serialization with compression:
- Pros: Optimized for ML models, better forward compatibility
- Cons: Slightly larger file sizes due to metadata
In production:
1. Train model offline (cron job, manual trigger)
2. Save to disk in accessible location
3. Load model once when application starts (use object caching if possible)
4. Reuse loaded model for all predictions
5. Retrain periodically to adapt to new data (model drift mitigation)
### Troubleshooting
- **Error: "Unserialization failed"** — Library version changed between save and load. Retrain and resave model with current library version.
- **Model file corrupted** — Check disk space and permissions. Save to temporary file first, then atomic rename: `rename($tmpPath, $finalPath)`.
- **Loaded model makes wrong predictions** — Verify feature extraction is identical during training and inference. Feature order matters!
- **Permission denied when saving** — Models directory not writable: `chmod 755 models/` or `chown www-data:www-data models/`.
## Step 6: Comparing Multiple Algorithms (~10 min)
### Goal
Systematically compare 6+ algorithms on the same dataset to identify the best performer for your specific use case.
### Actions
There's no universal "best" algorithm. Performance depends on your data, so always compare multiple options.
```php
# filename: 06-algorithm-comparison.php
<?php
declare(strict_types=1);
require_once __DIR__ . '/vendor/autoload.php';
use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Classifiers\KNearestNeighbors;
use Rubix\ML\Classifiers\GaussianNB;
use Rubix\ML\Classifiers\ClassificationTree;
use Rubix\ML\Classifiers\RandomForest;
use Rubix\ML\Classifiers\AdaBoost;
use Rubix\ML\Classifiers\MultilayerPerceptron;
use Rubix\ML\NeuralNet\Layers\Dense;
use Rubix\ML\NeuralNet\Layers\Activation;
use Rubix\ML\NeuralNet\ActivationFunctions\ReLU;
use Rubix\ML\CrossValidation\Metrics\Accuracy;
use Rubix\ML\CrossValidation\Metrics\F1Score;
echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║ Algorithm Comparison on Iris Dataset ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";
// Load iris dataset (using embedded data for demo)
// ... (iris data loading code) ...
$samples = [
// Setosa samples
[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2], [4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2], [5.0, 3.6, 1.4, 0.2], [5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3], [5.0, 3.4, 1.5, 0.2], [4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1], [5.4, 3.7, 1.5, 0.2], [4.8, 3.4, 1.6, 0.2],
// Versicolor samples
[7.0, 3.2, 4.7, 1.4], [6.4, 3.2, 4.5, 1.5], [6.9, 3.1, 4.9, 1.5],
[5.5, 2.3, 4.0, 1.3], [6.5, 2.8, 4.6, 1.5], [5.7, 2.8, 4.5, 1.3],
[6.3, 3.3, 4.7, 1.6], [4.9, 2.4, 3.3, 1.0], [6.6, 2.9, 4.6, 1.3],
[5.2, 2.7, 3.9, 1.4], [5.0, 2.0, 3.5, 1.0], [5.9, 3.0, 4.2, 1.5],
// Virginica samples
[6.3, 3.3, 6.0, 2.5], [5.8, 2.7, 5.1, 1.9], [7.1, 3.0, 5.9, 2.1],
[6.3, 2.9, 5.6, 1.8], [6.5, 3.0, 5.8, 2.2], [7.6, 3.0, 6.6, 2.1],
[4.9, 2.5, 4.5, 1.7], [7.3, 2.9, 6.3, 1.8], [6.7, 2.5, 5.8, 1.8],
[7.2, 3.6, 6.1, 2.5], [6.5, 3.2, 5.1, 2.0], [6.4, 2.7, 5.3, 1.9],
];
$labels = array_merge(
array_fill(0, 12, 'Iris-setosa'),
array_fill(0, 12, 'Iris-versicolor'),
array_fill(0, 12, 'Iris-virginica')
);
$dataset = new Labeled($samples, $labels);
[$training, $testing] = $dataset->randomize()->split(0.75);
echo "Dataset: " . $dataset->numSamples() . " iris flowers\n";
echo "Training: " . $training->numSamples() . " samples (75%)\n";
echo "Testing: " . $testing->numSamples() . " samples (25%)\n\n";
// Define algorithms to compare
$algorithms = [
[
'name' => 'k-Nearest Neighbors (k=5)',
'estimator' => new KNearestNeighbors(k: 5),
'description' => 'Classifies by majority vote of k nearest training examples',
],
[
'name' => 'Gaussian Naive Bayes',
'estimator' => new GaussianNB(),
'description' => 'Probabilistic classifier assuming feature independence',
],
[
'name' => 'Decision Tree (depth=5)',
'estimator' => new ClassificationTree(maxDepth: 5),
'description' => 'Tree of if-then rules for classification',
],
[
'name' => 'Random Forest (100 trees)',
'estimator' => new RandomForest(estimators: 100),
'description' => 'Ensemble of decision trees for robust predictions',
],
[
'name' => 'AdaBoost (50 estimators)',
'estimator' => new AdaBoost(estimators: 50),
'description' => 'Boosting ensemble focusing on difficult examples',
],
[
'name' => 'Neural Network (MLP)',
'estimator' => new MultilayerPerceptron([
new Dense(10),
new Activation(new ReLU()),
new Dense(10),
new Activation(new ReLU()),
]),
'description' => 'Multi-layer neural network with backpropagation',
],
];
$results = [];
echo "Training and evaluating " . count($algorithms) . " algorithms...\n";
echo str_repeat('=', 60) . "\n\n";
foreach ($algorithms as $algo) {
echo "Algorithm: {$algo['name']}\n";
echo str_repeat('-', 60) . "\n";
echo "Description: {$algo['description']}\n\n";
$estimator = $algo['estimator'];
// Train
$trainStart = microtime(true);
try {
$estimator->train($training);
$trainTime = (microtime(true) - $trainStart) * 1000;
// Predict
$predictStart = microtime(true);
$predictions = $estimator->predict($testing);
$predictTime = (microtime(true) - $predictStart) * 1000;
// Evaluate
$accuracyMetric = new Accuracy();
$f1Metric = new F1Score();
$accuracy = $accuracyMetric->score($predictions, $testing->labels());
$f1 = $f1Metric->score($predictions, $testing->labels());
$results[] = [
'name' => $algo['name'],
'accuracy' => $accuracy,
'f1' => $f1,
'train_time' => $trainTime,
'predict_time' => $predictTime,
'success' => true,
];
echo "✓ Success!\n";
echo " Training time: " . number_format($trainTime, 2) . " ms\n";
echo " Prediction time: " . number_format($predictTime, 2) . " ms\n";
echo " Accuracy: " . number_format($accuracy * 100, 2) . "%\n";
echo " F1-Score: " . number_format($f1 * 100, 2) . "%\n";
} catch (\Exception $e) {
$results[] = [
'name' => $algo['name'],
'success' => false,
'error' => $e->getMessage(),
];
echo "✗ Failed: " . $e->getMessage() . "\n";
}
echo "\n";
}
// Summary
echo str_repeat('=', 60) . "\n";
echo "COMPARISON SUMMARY\n";
echo str_repeat('=', 60) . "\n\n";
// Rank by accuracy
$successfulResults = array_filter($results, fn($r) => $r['success']);
usort($successfulResults, fn($a, $b) => $b['accuracy'] <=> $a['accuracy']);
echo "Ranked by Accuracy:\n";
echo str_repeat('-', 60) . "\n";
foreach ($successfulResults as $i => $result) {
$medal = ['🥇', '🥈', '🥉'][$i] ?? ' ';
echo "{$medal} " . ($i + 1) . ". {$result['name']}: " .
number_format($result['accuracy'] * 100, 2) . "%\n";
}
echo "\n\nRanked by Speed (Inference Time):\n";
echo str_repeat('-', 60) . "\n";
usort($successfulResults, fn($a, $b) => $a['predict_time'] <=> $b['predict_time']);
foreach ($successfulResults as $i => $result) {
echo " " . ($i + 1) . ". {$result['name']}: " .
number_format($result['predict_time'], 2) . " ms\n";
}
echo "\n\n╔══════════════════════════════════════════════════════════╗\n";
echo "║ Recommendation ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";
$best = $successfulResults[0];
usort($successfulResults, fn($a, $b) => $b['accuracy'] <=> $a['accuracy']);
$mostAccurate = $successfulResults[0];
usort($successfulResults, fn($a, $b) => $a['predict_time'] <=> $b['predict_time']);
$fastest = $successfulResults[0];
echo "Most Accurate: {$mostAccurate['name']} (" .
number_format($mostAccurate['accuracy'] * 100, 2) . "%)\n";
echo " → Use when: Accuracy is critical, speed is less important\n\n";
echo "Fastest: {$fastest['name']} (" .
number_format($fastest['predict_time'], 2) . " ms)\n";
echo " → Use when: Real-time predictions needed, moderate accuracy OK\n\n";
echo "Balanced: Look for high accuracy with reasonable speed\n";
echo " → Trade-offs matter for production systems\n";Expected Result
╔══════════════════════════════════════════════════════════╗
║ Algorithm Comparison on Iris Dataset ║
╚══════════════════════════════════════════════════════════╝
Dataset: 36 iris flowers
Training: 27 samples (75%)
Testing: 9 samples (25%)
Training and evaluating 6 algorithms...
============================================================
Algorithm: k-Nearest Neighbors (k=5)
------------------------------------------------------------
Description: Classifies by majority vote of k nearest training examples
✓ Success!
Training time: 0.52 ms
Prediction time: 1.23 ms
Accuracy: 100.00%
F1-Score: 100.00%
Algorithm: Gaussian Naive Bayes
------------------------------------------------------------
Description: Probabilistic classifier assuming feature independence
✓ Success!
Training time: 0.89 ms
Prediction time: 0.34 ms
Accuracy: 88.89%
F1-Score: 88.52%
Algorithm: Decision Tree (depth=5)
------------------------------------------------------------
Description: Tree of if-then rules for classification
✓ Success!
Training time: 1.45 ms
Prediction time: 0.12 ms
Accuracy: 88.89%
F1-Score: 88.52%
Algorithm: Random Forest (100 trees)
------------------------------------------------------------
Description: Ensemble of decision trees for robust predictions
✓ Success!
Training time: 145.23 ms
Prediction time: 5.67 ms
Accuracy: 100.00%
F1-Score: 100.00%
Algorithm: AdaBoost (50 estimators)
------------------------------------------------------------
Description: Boosting ensemble focusing on difficult examples
✓ Success!
Training time: 89.34 ms
Prediction time: 3.45 ms
Accuracy: 100.00%
F1-Score: 100.00%
Algorithm: Neural Network (MLP)
------------------------------------------------------------
Description: Multi-layer neural network with backpropagation
✓ Success!
Training time: 234.56 ms
Prediction time: 0.89 ms
Accuracy: 100.00%
F1-Score: 100.00%
============================================================
COMPARISON SUMMARY
============================================================
Ranked by Accuracy:
------------------------------------------------------------
🥇 1. k-Nearest Neighbors (k=5): 100.00%
🥈 2. Random Forest (100 trees): 100.00%
🥉 3. AdaBoost (50 estimators): 100.00%
4. Neural Network (MLP): 100.00%
5. Gaussian Naive Bayes: 88.89%
6. Decision Tree (depth=5): 88.89%
Ranked by Speed (Inference Time):
------------------------------------------------------------
1. Decision Tree (depth=5): 0.12 ms
2. Gaussian Naive Bayes: 0.34 ms
3. Neural Network (MLP): 0.89 ms
4. k-Nearest Neighbors (k=5): 1.23 ms
5. AdaBoost (50 estimators): 3.45 ms
6. Random Forest (100 trees): 5.67 ms
╔══════════════════════════════════════════════════════════╗
║ Recommendation ║
╚══════════════════════════════════════════════════════════╝
Most Accurate: k-Nearest Neighbors (k=5) (100.00%)
→ Use when: Accuracy is critical, speed is less important
Fastest: Decision Tree (depth=5) (0.12 ms)
→ Use when: Real-time predictions needed, moderate accuracy OK
Balanced: Look for high accuracy with reasonable speed
→ Trade-offs matter for production systemsWhy It Works
Different algorithms excel in different scenarios:
k-NN: Simple but effective for small datasets with well-separated classes. Slow inference (must check all training points).
Naive Bayes: Fast training and prediction. Works well when feature independence assumption holds (rarely perfect but often "good enough").
Decision Tree: Fast, interpretable. Prone to overfitting on training data.
Random Forest: Ensemble of trees reduces overfitting. More robust but slower inference.
AdaBoost: Focuses training on difficult examples. Often achieves high accuracy but takes longer to train and predict.
Neural Network: Can learn complex patterns but needs more data and training time than shown here. Overkill for iris dataset.
Troubleshooting
All algorithms show same accuracy — Dataset may be too easy (like iris). Try more challenging data to see differences.
Neural network training very slow — Expected for small datasets (many epochs per sample). For production neural networks, use GPU acceleration or cloud ML services.
AdaBoost or RandomForest fail — Insufficient training data. Ensemble methods need adequate samples relative to estimators count.
Step 7: Loading Real Data with Transformers (~12 min)
Goal
Master loading datasets from CSV files, databases, and JSON sources, then apply multiple preprocessing transformers in Rubix ML pipelines for production-ready data handling.
Actions
So far, we've used hardcoded PHP arrays for simplicity. In production, you'll load data from CSV files, databases, or APIs. Rubix ML provides powerful utilities for loading data and chaining multiple transformers to prepare it for machine learning.
Loading Data from CSV
Rubix ML makes CSV loading trivial with built-in methods:
# filename: 07-load-real-data.php
<?php
declare(strict_types=1);
require_once __DIR__ . '/vendor/autoload.php';
use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Datasets\Unlabeled;
echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║ Loading Real Data with Rubix ML ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";
// ============================================================
// STEP 1: Load Data from CSV
// ============================================================
echo "STEP 1: Loading Data from CSV File\n";
echo "------------------------------------------------------------\n";
// Create sample CSV for demonstration
$csvData = <<<CSV
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
CSV;
if (!is_dir(__DIR__ . '/data')) {
mkdir(__DIR__ . '/data', 0755, true);
}
file_put_contents(__DIR__ . '/data/iris_sample.csv', $csvData);
// Load CSV with Rubix ML
$dataset = Labeled::fromCSV(__DIR__ . '/data/iris_sample.csv', hasHeader: true);
echo "✓ CSV loaded successfully\n";
echo " Samples: " . $dataset->numSamples() . "\n";
echo " Features: " . $dataset->numFeatures() . "\n";
echo " Labels: " . implode(', ', array_unique($dataset->labels())) . "\n\n";
// Display first few samples
echo "First 3 samples:\n";
for ($i = 0; $i < min(3, $dataset->numSamples()); $i++) {
$sample = $dataset->sample($i);
$label = $dataset->label($i);
echo " Sample " . ($i + 1) . ": [" . implode(', ', $sample) . "] → {$label}\n";
}
echo "\n";What fromCSV() does:
- Automatically parses CSV format
- Detects delimiter (comma, semicolon, tab)
- Handles quoted fields
- Converts string numbers to floats
- Last column becomes labels (for Labeled datasets)
Comprehensive Transformer Showcase
Rubix ML provides numerous transformers for data preprocessing. Let's showcase the most important ones:
# filename: 07-load-real-data.php (continued)
use Rubix\ML\Transformers\NumericStringConverter;
use Rubix\ML\Transformers\MissingDataImputer;
use Rubix\ML\Transformers\OneHotEncoder;
use Rubix\ML\Transformers\MinMaxNormalizer;
use Rubix\ML\Transformers\ZScaleStandardizer;
use Rubix\ML\Transformers\VarianceThresholdFilter;
use Rubix\ML\Pipeline;
use Rubix\ML\Classifiers\RandomForest;
// ============================================================
// STEP 2: Demonstrate Individual Transformers
// ============================================================
echo "STEP 2: Transformer Showcase\n";
echo "------------------------------------------------------------\n\n";
// Prepare sample data with issues to demonstrate transformers
$messyData = new Labeled(
[
['5.1', '3.5', 1.4, 0.2, 'red'], // Mix of strings and numbers
[4.9, null, 1.4, 0.2, 'blue'], // Missing value
['7.0', '3.2', 4.7, 1.4, 'red'],
[6.4, 3.2, null, 1.5, 'green'], // Missing value
],
['setosa', 'setosa', 'versicolor', 'versicolor']
);
echo "Original messy data:\n";
echo " Sample types: " . gettype($messyData->sample(0)[0]) . " (should be float)\n";
echo " Has nulls: Yes (sample 2 and 4)\n";
echo " Has categorical: Yes (color column)\n\n";
// Transformer 1: NumericStringConverter
echo "1. NumericStringConverter\n";
echo " Purpose: Convert string numbers to actual floats\n";
$converter = new NumericStringConverter();
$converter->fit($messyData);
$messyData->apply($converter);
echo " ✓ Strings converted: '5.1' → 5.1\n\n";
// Transformer 2: MissingDataImputer
echo "2. MissingDataImputer\n";
echo " Purpose: Fill missing values with mean/median/mode\n";
$imputer = new MissingDataImputer('mean'); // Options: 'mean', 'median', 'mode'
$imputer->fit($messyData);
$messyData->apply($imputer);
echo " ✓ Null values filled with column means\n\n";
// Transformer 3: OneHotEncoder
echo "3. OneHotEncoder\n";
echo " Purpose: Convert categorical values to binary columns\n";
echo " Before: ['red', 'blue', 'green']\n";
echo " After: [[1,0,0], [0,1,0], [0,0,1]]\n";
$encoder = new OneHotEncoder();
$encoder->fit($messyData);
$messyData->apply($encoder);
echo " ✓ Categorical column expanded to " . ($messyData->numFeatures() - 4) . " binary columns\n\n";
// Transformer 4: MinMaxNormalizer
echo "4. MinMaxNormalizer\n";
echo " Purpose: Scale features to [0, 1] range\n";
echo " Formula: (x - min) / (max - min)\n";
$minmax = new MinMaxNormalizer();
$minmax->fit($messyData);
$messyData->apply($minmax);
echo " ✓ All features now in [0.0, 1.0] range\n\n";
// Transformer 5: ZScaleStandardizer (alternative to MinMax)
echo "5. ZScaleStandardizer (Alternative Normalization)\n";
echo " Purpose: Standardize features (mean=0, std=1)\n";
echo " Formula: (x - μ) / σ\n";
echo " Use when: Features have different scales, outliers present\n";
echo " ✓ Would center features around 0 with unit variance\n\n";
// Transformer 6: VarianceThresholdFilter
echo "6. VarianceThresholdFilter\n";
echo " Purpose: Remove low-variance (nearly constant) features\n";
echo " Use when: Dataset has useless features that don't vary\n";
echo " ✓ Automatically removes features with variance < threshold\n\n";
echo "Transformed dataset:\n";
echo " Samples: " . $messyData->numSamples() . "\n";
echo " Features: " . $messyData->numFeatures() . " (increased due to one-hot encoding)\n";
echo " All numeric: Yes\n";
echo " No missing values: Yes\n";
echo " Normalized: Yes\n\n";Complete Pipeline with Multiple Transformers
Now let's chain transformers in a production pipeline:
# filename: 07-load-real-data.php (continued)
// ============================================================
// STEP 3: Production Pipeline with Chained Transformers
// ============================================================
echo "STEP 3: Complete Preprocessing Pipeline\n";
echo "------------------------------------------------------------\n";
// Load fresh data
$rawDataset = Labeled::fromCSV(__DIR__ . '/data/iris_sample.csv', hasHeader: true);
// Create pipeline with multiple transformers
$pipeline = new Pipeline([
new NumericStringConverter(), // 1. Fix type issues
new MissingDataImputer('mean'), // 2. Handle missing values
new MinMaxNormalizer(), // 3. Normalize to [0,1]
], new RandomForest(estimators: 10));
echo "Pipeline created with 3 transformers + Random Forest classifier\n\n";
echo "Transformer order (CRITICAL):\n";
echo " 1. NumericStringConverter - Must be first (convert before math)\n";
echo " 2. MissingDataImputer - Fill nulls before normalization\n";
echo " 3. MinMaxNormalizer - Last preprocessing step\n";
echo " 4. RandomForest - Final estimator\n\n";
echo "Why order matters:\n";
echo " ✗ Normalize → Impute: NaN values break normalization\n";
echo " ✗ Impute → Convert: Can't calculate mean of strings\n";
echo " ✓ Convert → Impute → Normalize: Correct order\n\n";
// Train pipeline (all transformers auto-fit on training data)
[$training, $testing] = $rawDataset->randomize()->split(0.8);
$trainStart = microtime(true);
$pipeline->train($training);
$trainTime = (microtime(true) - $trainStart) * 1000;
echo "✓ Pipeline trained in " . number_format($trainTime, 2) . " ms\n";
echo " - Transformers fitted to training data only\n";
echo " - Test data will use same transformation parameters\n\n";
// Make predictions
$predictions = $pipeline->predict($testing);
echo "✓ Predictions made on test set\n";
echo " Test samples: " . $testing->numSamples() . "\n";
// Calculate accuracy
$correct = 0;
foreach ($predictions as $i => $pred) {
if ($pred === $testing->label($i)) {
$correct++;
}
}
$accuracy = ($correct / count($predictions)) * 100;
echo " Accuracy: " . number_format($accuracy, 1) . "%\n\n";Loading from Database
For production systems, you'll often load data from databases:
# filename: 07-load-real-data.php (continued)
// ============================================================
// STEP 4: Loading from Database
// ============================================================
echo "STEP 4: Loading Data from Database\n";
echo "------------------------------------------------------------\n";
echo "Example: MySQL/PostgreSQL Loading\n\n";
echo "```php\n";
echo "// Connect to database\n";
echo "\$pdo = new PDO(\n";
echo " 'mysql:host=localhost;dbname=ml_data',\n";
echo " 'username',\n";
echo " 'password',\n";
echo " [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]\n";
echo ");\n\n";
echo "// Query training data\n";
echo "\$stmt = \$pdo->query('\n";
echo " SELECT feature1, feature2, feature3, label\n";
echo " FROM training_data\n";
echo " WHERE created_at > DATE_SUB(NOW(), INTERVAL 30 DAY)\n";
echo "');\n\n";
echo "// Build dataset\n";
echo "\$samples = [];\n";
echo "\$labels = [];\n";
echo "while (\$row = \$stmt->fetch(PDO::FETCH_NUM)) {\n";
echo " \$samples[] = array_slice(\$row, 0, -1); // All but last column\n";
echo " \$labels[] = \$row[count(\$row) - 1]; // Last column\n";
echo "}\n\n";
echo "\$dataset = new Labeled(\$samples, \$labels);\n";
echo "echo \"Loaded \" . count(\$samples) . \" samples from database\\n\";\n";
echo "```\n\n";
echo "Benefits:\n";
echo " ✓ Load only recent data (avoid stale training data)\n";
echo " ✓ Filter by conditions (WHERE clauses)\n";
echo " ✓ Join multiple tables for features\n";
echo " ✓ Stream large datasets (fetchAll vs fetch loop)\n\n";Loading from JSON API
# filename: 07-load-real-data.php (continued)
// ============================================================
// STEP 5: Loading from JSON API
// ============================================================
echo "STEP 5: Loading Data from JSON API\n";
echo "------------------------------------------------------------\n";
echo "Example: Loading from REST API\n\n";
echo "```php\n";
echo "// Fetch data from API\n";
echo "\$response = file_get_contents('https://api.example.com/training-data');\n";
echo "\$data = json_decode(\$response, true);\n\n";
echo "// Extract features and labels\n";
echo "\$samples = [];\n";
echo "\$labels = [];\n";
echo "foreach (\$data['records'] as \$record) {\n";
echo " \$samples[] = [\n";
echo " \$record['age'],\n";
echo " \$record['income'],\n";
echo " \$record['score'],\n";
echo " ];\n";
echo " \$labels[] = \$record['category'];\n";
echo "}\n\n";
echo "\$dataset = new Labeled(\$samples, \$labels);\n";
echo "```\n\n";
echo "Use cases:\n";
echo " - Load training data from microservices\n";
echo " - Fetch labeled data from annotation services\n";
echo " - Import datasets from cloud ML platforms\n\n";
echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║ ✓ Real Data Loading Complete! ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n";Expected Result
╔══════════════════════════════════════════════════════════╗
║ Loading Real Data with Rubix ML ║
╚══════════════════════════════════════════════════════════╝
STEP 1: Loading Data from CSV File
------------------------------------------------------------
✓ CSV loaded successfully
Samples: 6
Features: 4
Labels: Iris-setosa, Iris-versicolor, Iris-virginica
First 3 samples:
Sample 1: [5.1, 3.5, 1.4, 0.2] → Iris-setosa
Sample 2: [4.9, 3.0, 1.4, 0.2] → Iris-setosa
Sample 3: [7.0, 3.2, 4.7, 1.4] → Iris-versicolor
STEP 2: Transformer Showcase
------------------------------------------------------------
Original messy data:
Sample types: string (should be float)
Has nulls: Yes (sample 2 and 4)
Has categorical: Yes (color column)
1. NumericStringConverter
Purpose: Convert string numbers to actual floats
✓ Strings converted: '5.1' → 5.1
2. MissingDataImputer
Purpose: Fill missing values with mean/median/mode
✓ Null values filled with column means
3. OneHotEncoder
Purpose: Convert categorical values to binary columns
Before: ['red', 'blue', 'green']
After: [[1,0,0], [0,1,0], [0,0,1]]
✓ Categorical column expanded to 3 binary columns
4. MinMaxNormalizer
Purpose: Scale features to [0, 1] range
Formula: (x - min) / (max - min)
✓ All features now in [0.0, 1.0] range
5. ZScaleStandardizer (Alternative Normalization)
Purpose: Standardize features (mean=0, std=1)
Formula: (x - μ) / σ
Use when: Features have different scales, outliers present
✓ Would center features around 0 with unit variance
6. VarianceThresholdFilter
Purpose: Remove low-variance (nearly constant) features
Use when: Dataset has useless features that don't vary
✓ Automatically removes features with variance < threshold
Transformed dataset:
Samples: 4
Features: 7 (increased due to one-hot encoding)
All numeric: Yes
No missing values: Yes
Normalized: Yes
STEP 3: Complete Preprocessing Pipeline
------------------------------------------------------------
Pipeline created with 3 transformers + Random Forest classifier
Transformer order (CRITICAL):
1. NumericStringConverter - Must be first (convert before math)
2. MissingDataImputer - Fill nulls before normalization
3. MinMaxNormalizer - Last preprocessing step
4. RandomForest - Final estimator
Why order matters:
✗ Normalize → Impute: NaN values break normalization
✗ Impute → Convert: Can't calculate mean of strings
✓ Convert → Impute → Normalize: Correct order
✓ Pipeline trained in 45.23 ms
- Transformers fitted to training data only
- Test data will use same transformation parameters
✓ Predictions made on test set
Test samples: 1
Accuracy: 100.0%Why It Works
Rubix ML's Data Loading:
Labeled::fromCSV()automatically handles CSV parsing, type conversion, and validation- Detects data types and converts appropriately
- Separates features from labels based on column position
Transformer Chaining:
- Each transformer is fitted on training data (calculates statistics like mean, min, max)
- Fitted transformers are applied to both training and test data using training statistics
- This prevents data leakage—test data never influences transformation parameters
Order Matters:
- Type conversion first: Can't do math on strings
- Imputation second: Need numeric values to calculate means
- Normalization last: All values must be clean and numeric
Pipeline Benefits:
- Transformers auto-fit during
train() - Same transformations apply during
predict() - No manual tracking of transformation parameters
- Guaranteed consistency between training and inference
Troubleshooting
Error: "CSV file not found" — Check file path is relative to script location. Use
__DIR__ . '/data/file.csv'for absolute paths.Warning: "Non-numeric value encountered" — CSV contains non-numeric data in feature columns. Use
NumericStringConverterfirst orOneHotEncoderfor categorical columns.Error: "Sample dimensionality mismatch" — CSV rows have different column counts. Check for:
php// Validate CSV before loading $lines = file(__DIR__ . '/data/file.csv'); $columnCounts = array_map(fn($line) => count(str_getcsv($line)), $lines); if (count(array_unique($columnCounts)) > 1) { echo "Inconsistent column counts found!\n"; }Memory exhausted on large CSV — Load data in chunks:
php$handle = fopen('large_file.csv', 'r'); $batchSize = 1000; $samples = []; $labels = []; while (($row = fgetcsv($handle)) !== false) { $samples[] = array_slice($row, 0, -1); $labels[] = $row[count($row) - 1]; if (count($samples) >= $batchSize) { // Process batch $dataset = new Labeled($samples, $labels); // ... train incrementally or save to disk $samples = []; $labels = []; } }Database connection timeout — Use connection pooling and set appropriate timeouts:
php$pdo = new PDO($dsn, $user, $pass, [ PDO::ATTR_TIMEOUT => 30, PDO::ATTR_PERSISTENT => true, // Connection pooling ]);Transformer fails on test data — Ensure transformer fitted on training data first:
php// WRONG $normalizer->apply($testData); // Not fitted! // RIGHT $normalizer->fit($trainingData); // Fit on training $normalizer->apply($testData); // Apply training stats to test
Step 8: Production REST API Implementation (~15 min)
Goal
Build a production-ready REST API that serves ML predictions with model loading optimization, input validation, error handling, logging, and health checks.
Actions
Let's create a complete production API that demonstrates all best practices for serving ML models in PHP.
# filename: 08-production-api.php
<?php
declare(strict_types=1);
require_once __DIR__ . '/vendor/autoload.php';
use Rubix\ML\Persisters\Filesystem;
use Rubix\ML\Estimator;
use Rubix\ML\Datasets\Unlabeled;
/**
* Production ML Model Server
*
* Singleton pattern ensures model loads once and serves many requests.
* Includes validation, error handling, logging, and monitoring.
*/
class ModelServer
{
private static ?Estimator $model = null;
private static array $config = [];
private static int $requestCount = 0;
private static float $totalInferenceTime = 0;
/**
* Initialize server and load model once
*/
public static function initialize(): void
{
// Load configuration
self::$config = [
'model_path' => __DIR__ . '/models/production.rbx',
'expected_features' => 4, // Iris dataset has 4 features
'log_path' => __DIR__ . '/logs/predictions.log',
'max_feature_value' => 10.0, // Sanity check
'min_feature_value' => 0.0,
];
// Create logs directory if needed
if (!is_dir(__DIR__ . '/logs')) {
mkdir(__DIR__ . '/logs', 0755, true);
}
// Load model once at startup
try {
$persister = new Filesystem(self::$config['model_path']);
self::$model = $persister->load();
error_log(sprintf(
"[ModelServer] Model loaded successfully (%s)",
get_class(self::$model)
));
} catch (Exception $e) {
error_log("[ModelServer] FATAL: Failed to load model: " . $e->getMessage());
throw $e;
}
}
/**
* Validate input features
*
* @throws InvalidArgumentException if validation fails
*/
private static function validateFeatures(array $features): void
{
// Check feature count
if (count($features) !== self::$config['expected_features']) {
throw new InvalidArgumentException(sprintf(
"Expected %d features, got %d",
self::$config['expected_features'],
count($features)
));
}
// Check all values are numeric
foreach ($features as $i => $value) {
if (!is_numeric($value)) {
throw new InvalidArgumentException(
"Feature at index {$i} is not numeric: " . var_export($value, true)
);
}
// Sanity check: reasonable range
if ($value < self::$config['min_feature_value'] ||
$value > self::$config['max_feature_value']) {
throw new InvalidArgumentException(
"Feature at index {$i} out of range [{self::$config['min_feature_value']}, {self::$config['max_feature_value']}]: {$value}"
);
}
}
}
/**
* Make prediction with timing and logging
*
* @param array $features Input feature vector
* @return array Prediction result with metadata
*/
public static function predict(array $features): array
{
// Validate input
self::validateFeatures($features);
// Make prediction with timing
$startTime = microtime(true);
$prediction = self::$model->predictSample($features);
$duration = (microtime(true) - $startTime) * 1000;
// Get probability/confidence if available
$confidence = null;
if (method_exists(self::$model, 'proba')) {
try {
$proba = self::$model->proba(new Unlabeled([$features]))[0];
$confidence = max($proba);
} catch (Exception $e) {
// Some models don't support proba - that's OK
$confidence = null;
}
}
// Update statistics
self::$requestCount++;
self::$totalInferenceTime += $duration;
// Log prediction
self::logPrediction($features, $prediction, $confidence, $duration);
return [
'prediction' => $prediction,
'confidence' => $confidence,
'processing_time_ms' => round($duration, 2),
'request_id' => uniqid('pred_', true),
];
}
/**
* Log prediction for monitoring and retraining
*/
private static function logPrediction(
array $features,
$prediction,
?float $confidence,
float $duration
): void {
$logEntry = json_encode([
'timestamp' => date('c'),
'features' => $features,
'prediction' => $prediction,
'confidence' => $confidence,
'duration_ms' => round($duration, 2),
'model_class' => get_class(self::$model),
]) . "\n";
file_put_contents(self::$config['log_path'], $logEntry, FILE_APPEND | LOCK_EX);
}
/**
* Get server statistics
*/
public static function getStats(): array
{
return [
'total_requests' => self::$requestCount,
'avg_inference_time_ms' => self::$requestCount > 0
? round(self::$totalInferenceTime / self::$requestCount, 2)
: 0,
'model_class' => self::$model ? get_class(self::$model) : null,
'uptime_seconds' => time() - $_SERVER['REQUEST_TIME'],
];
}
/**
* Health check
*/
public static function isHealthy(): bool
{
return self::$model !== null;
}
}
// ============================================================
// Initialize model server on startup (runs once)
// ============================================================
try {
ModelServer::initialize();
} catch (Exception $e) {
http_response_code(503);
header('Content-Type: application/json');
echo json_encode([
'error' => 'Service Unavailable',
'message' => 'Failed to initialize model server',
]);
exit(1);
}
// ============================================================
// Handle HTTP Requests
// ============================================================
header('Content-Type: application/json');
$method = $_SERVER['REQUEST_METHOD'];
$uri = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
// Route: POST /predict - Make prediction
if ($method === 'POST' && $uri === '/predict') {
try {
// Parse JSON input
$input = json_decode(file_get_contents('php://input'), true);
if ($input === null) {
http_response_code(400);
echo json_encode(['error' => 'Invalid JSON']);
exit;
}
if (!isset($input['features'])) {
http_response_code(400);
echo json_encode(['error' => 'Missing "features" field']);
exit;
}
// Make prediction
$result = ModelServer::predict($input['features']);
http_response_code(200);
echo json_encode($result);
} catch (InvalidArgumentException $e) {
// Client error - bad input
http_response_code(400);
echo json_encode([
'error' => 'Bad Request',
'message' => $e->getMessage(),
]);
} catch (Exception $e) {
// Server error - unexpected
http_response_code(500);
error_log("[ModelServer] Prediction error: " . $e->getMessage());
echo json_encode([
'error' => 'Internal Server Error',
'message' => 'An unexpected error occurred',
]);
}
// Route: GET /health - Health check
} elseif ($method === 'GET' && $uri === '/health') {
$healthy = ModelServer::isHealthy();
http_response_code($healthy ? 200 : 503);
echo json_encode([
'status' => $healthy ? 'healthy' : 'unhealthy',
'model_loaded' => $healthy,
'timestamp' => date('c'),
]);
// Route: GET /stats - Server statistics
} elseif ($method === 'GET' && $uri === '/stats') {
http_response_code(200);
echo json_encode(ModelServer::getStats());
// Route not found
} else {
http_response_code(404);
echo json_encode([
'error' => 'Not Found',
'message' => "Route {$method} {$uri} not found",
'available_routes' => [
'POST /predict' => 'Make prediction',
'GET /health' => 'Health check',
'GET /stats' => 'Server statistics',
],
]);
}Testing the API
Create a test script to interact with the API:
# filename: test-api.php
<?php
declare(strict_types=1);
echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║ Testing Production ML API ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";
// Start server in background
echo "Starting API server on http://localhost:8000...\n";
$serverProcess = proc_open(
'php -S localhost:8000 08-production-api.php',
[
1 => ['pipe', 'w'], // stdout
2 => ['pipe', 'w'], // stderr
],
$pipes
);
// Give server time to start
sleep(2);
// Test 1: Health Check
echo "\nTest 1: Health Check\n";
echo "------------------------------------------------------------\n";
$healthResponse = file_get_contents('http://localhost:8000/health');
$health = json_decode($healthResponse, true);
echo "✓ Health: " . $health['status'] . "\n";
echo " Model loaded: " . ($health['model_loaded'] ? 'Yes' : 'No') . "\n\n";
// Test 2: Valid Prediction
echo "Test 2: Valid Prediction Request\n";
echo "------------------------------------------------------------\n";
$validRequest = json_encode([
'features' => [5.1, 3.5, 1.4, 0.2] // Iris setosa
]);
$context = stream_context_create([
'http' => [
'method' => 'POST',
'header' => 'Content-Type: application/json',
'content' => $validRequest,
]
]);
$predictionResponse = file_get_contents('http://localhost:8000/predict', false, $context);
$prediction = json_decode($predictionResponse, true);
echo "✓ Prediction: " . $prediction['prediction'] . "\n";
echo " Confidence: " . number_format($prediction['confidence'] * 100, 1) . "%\n";
echo " Processing time: " . $prediction['processing_time_ms'] . " ms\n\n";
// Test 3: Invalid Input (wrong feature count)
echo "Test 3: Invalid Input (Wrong Feature Count)\n";
echo "------------------------------------------------------------\n";
$invalidRequest = json_encode([
'features' => [5.1, 3.5] // Only 2 features instead of 4
]);
$context = stream_context_create([
'http' => [
'method' => 'POST',
'header' => 'Content-Type: application/json',
'content' => $invalidRequest,
'ignore_errors' => true,
]
]);
$errorResponse = file_get_contents('http://localhost:8000/predict', false, $context);
$error = json_decode($errorResponse, true);
echo "✓ Error caught: " . $error['message'] . "\n";
echo " HTTP Status: 400 (Bad Request)\n\n";
// Test 4: Server Statistics
echo "Test 4: Server Statistics\n";
echo "------------------------------------------------------------\n";
$statsResponse = file_get_contents('http://localhost:8000/stats');
$stats = json_decode($statsResponse, true);
echo "✓ Total requests: " . $stats['total_requests'] . "\n";
echo " Avg inference time: " . $stats['avg_inference_time_ms'] . " ms\n";
echo " Model class: " . $stats['model_class'] . "\n\n";
// Cleanup: Stop server
proc_terminate($serverProcess);
proc_close($serverProcess);
echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║ ✓ API Testing Complete! ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n";Expected Result
Starting the API server:
php -S localhost:8000 08-production-api.php[Sun Oct 26 12:34:56 2025] PHP 8.4.0 Development Server (http://localhost:8000) started
[ModelServer] Model loaded successfully (Rubix\ML\Classifiers\KNearestNeighbors)Making requests with curl:
# Make prediction
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"features": [5.1, 3.5, 1.4, 0.2]}'{
"prediction": "Iris-setosa",
"confidence": 0.95,
"processing_time_ms": 1.23,
"request_id": "pred_67234abc12def"
}Health check:
curl http://localhost:8000/health{
"status": "healthy",
"model_loaded": true,
"timestamp": "2025-10-26T12:35:42+00:00"
}Server statistics:
curl http://localhost:8000/stats{
"total_requests": 127,
"avg_inference_time_ms": 1.18,
"model_class": "Rubix\\ML\\Classifiers\\KNearestNeighbors",
"uptime_seconds": 3842
}Error handling (wrong feature count):
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"features": [5.1, 3.5]}'{
"error": "Bad Request",
"message": "Expected 4 features, got 2"
}Why It Works
Singleton Pattern:
- Model loaded once at server startup via
ModelServer::initialize() - Shared static
$modelvariable serves all requests - 50x faster than loading per request (2ms vs 50ms)
Request Flow:
- Client sends POST to
/predictwith JSON features - API validates JSON structure and feature count
ModelServer::predict()validates feature values- Model makes prediction (~1ms)
- Result logged and returned to client
Production Benefits:
- Input Validation: Prevents crashes from bad data
- Error Handling: Proper HTTP status codes (400 for client errors, 500 for server errors)
- Logging: All predictions logged for monitoring and retraining
- Health Checks: Load balancers can verify service is healthy
- Statistics: Monitor performance over time
- Request IDs: Track individual predictions for debugging
Scalability:
- PHP-FPM handles concurrent requests via process pool
- Each worker process loads model once (not per request)
- Horizontal scaling: add more PHP-FPM workers or servers
- Stateless design: any server can handle any request
Production Deployment
Development Server:
# Built-in PHP server (development only)
php -S localhost:8000 08-production-api.phpProduction with Nginx + PHP-FPM:
# /etc/nginx/sites-available/ml-api
server {
listen 80;
server_name ml-api.example.com;
root /var/www/ml-api;
index 08-production-api.php;
# API endpoint
location /predict {
try_files $uri $uri/ /08-production-api.php?$query_string;
fastcgi_pass unix:/var/run/php/php8.4-fpm.sock;
fastcgi_param SCRIPT_FILENAME $document_root/08-production-api.php;
include fastcgi_params;
}
# Health check
location /health {
try_files $uri $uri/ /08-production-api.php?$query_string;
fastcgi_pass unix:/var/run/php/php8.4-fpm.sock;
fastcgi_param SCRIPT_FILENAME $document_root/08-production-api.php;
include fastcgi_params;
}
# Stats (restrict to internal IPs)
location /stats {
allow 10.0.0.0/8;
deny all;
try_files $uri $uri/ /08-production-api.php?$query_string;
fastcgi_pass unix:/var/run/php/php8.4-fpm.sock;
fastcgi_param SCRIPT_FILENAME $document_root/08-production-api.php;
include fastcgi_params;
}
}PHP-FPM Configuration:
; /etc/php/8.4/fpm/pool.d/ml-api.conf
[ml-api]
user = www-data
group = www-data
listen = /var/run/php/php8.4-fpm.sock
; Process management
pm = dynamic
pm.max_children = 50 ; Adjust based on RAM (each loads model)
pm.start_servers = 10
pm.min_spare_servers = 5
pm.max_spare_servers = 15
; Each child serves 1000 requests before respawning (prevent memory leaks)
pm.max_requests = 1000
; PHP ini settings
php_value[memory_limit] = 256M
php_value[opcache.enable] = 1Monitoring with Logs:
# Watch predictions in real-time
tail -f logs/predictions.log | jq .
# Analyze performance
cat logs/predictions.log | jq '.duration_ms' | \
awk '{sum+=$1; count++} END {print "Avg:", sum/count, "ms"}'
# Count predictions by class
cat logs/predictions.log | jq -r '.prediction' | sort | uniq -cTroubleshooting
Error: "Model file not found" — Model not trained/saved yet. Run training script first:
bash# Train and save model (from Step 5) php 05-model-persistence.phpSlow first request after startup — Model loading on first request. Use opcache preloading (PHP 7.4+):
ini; php.ini opcache.preload=/var/www/ml-api/preload.php opcache.preload_user=www-dataMemory leaks over time — PHP-FPM child processes accumulate memory. Set
pm.max_requeststo recycle workers:inipm.max_requests = 500 ; Restart worker after 500 requests504 Gateway Timeout on predictions — Increase PHP and Nginx timeouts:
nginx# nginx.conf fastcgi_read_timeout 60s;ini; php-fpm.conf request_terminate_timeout = 60Predictions return different results — Model uses random state (e.g., neural network). Set seed or use deterministic algorithm (Decision Tree, k-NN).
High CPU usage — k-NN has O(n) inference complexity. For large training sets (>10k samples), use faster algorithms (Random Forest, Naive Bayes) or GPU-accelerated libraries.
Step 9: Regression and Feature Importance (~10 min)
Goal
Build a house price prediction model using regression and analyze which features contribute most to predictions through feature importance analysis.
Actions
So far, we've focused on classification (predicting categories: spam/ham, iris species). Now let's tackle regression—predicting continuous numerical values like prices, temperatures, or sales figures.
Understanding Regression
Regression vs. Classification:
- Classification: Predicts discrete categories ("spam" or "ham", "setosa" or "virginica")
- Regression: Predicts continuous numbers (house price: $285,400, temperature: 72.5°F)
- Same workflow: Prepare data → Train model → Evaluate → Predict
Building a House Price Predictor
Let's predict house prices based on features like square footage, number of bedrooms, age, and location quality:
# filename: 09-regression-feature-importance.php
<?php
declare(strict_types=1);
require_once __DIR__ . '/vendor/autoload.php';
use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Regressors\RandomForestRegressor;
use Rubix\ML\CrossValidation\Metrics\RSquared;
use Rubix\ML\CrossValidation\Metrics\MeanAbsoluteError;
use Rubix\ML\CrossValidation\Metrics\MeanSquaredError;
echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║ House Price Prediction with Regression ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";
// ============================================================
// STEP 1: Prepare House Price Dataset
// ============================================================
echo "STEP 1: Preparing House Price Dataset\n";
echo "------------------------------------------------------------\n";
// Features: [sqft, bedrooms, age_years, location_score (1-10)]
$samples = [
[1200, 3, 10, 7],
[1800, 4, 5, 8],
[900, 2, 20, 5],
[2500, 5, 2, 9],
[1100, 2, 15, 6],
[2200, 4, 8, 8],
[1500, 3, 12, 7],
[1900, 4, 6, 7],
[850, 2, 25, 4],
[2800, 5, 1, 10],
[1400, 3, 10, 6],
[2000, 4, 7, 8],
[1300, 3, 14, 6],
[2400, 4, 3, 9],
[1600, 3, 9, 7],
[1750, 4, 11, 7],
[2100, 4, 5, 8],
[950, 2, 18, 5],
[2300, 5, 4, 9],
[1450, 3, 13, 6],
];
// Target: House prices in dollars
$prices = [
250000, 380000, 180000, 520000, 220000,
440000, 285000, 360000, 165000, 580000,
265000, 395000, 255000, 480000, 305000,
340000, 410000, 195000, 490000, 275000,
];
echo "Dataset prepared:\n";
echo " Samples: " . count($samples) . " houses\n";
echo " Features: 4 (sqft, bedrooms, age, location_score)\n";
echo " Target: Price (\$)\n\n";
echo "Sample data:\n";
for ($i = 0; $i < 3; $i++) {
echo sprintf(
" House %d: %d sqft, %d bed, %d yrs, loc:%d → \$%s\n",
$i + 1,
$samples[$i][0],
$samples[$i][1],
$samples[$i][2],
$samples[$i][3],
number_format($prices[$i], 0)
);
}
echo "\n";
// ============================================================
// STEP 2: Train Random Forest Regressor
// ============================================================
echo "STEP 2: Training Random Forest Regressor\n";
echo "------------------------------------------------------------\n";
// Create dataset
$dataset = new Labeled($samples, $prices);
// Split into training and testing sets (80/20)
[$training, $testing] = $dataset->randomize()->split(0.8);
echo "Data split:\n";
echo " Training: " . $training->numSamples() . " samples\n";
echo " Testing: " . $testing->numSamples() . " samples\n\n";
// Create and train regressor
$regressor = new RandomForestRegressor(
estimators: 100, // 100 decision trees
minLeafSize: 2,
);
echo "Training Random Forest Regressor (100 trees)...\n";
$trainStart = microtime(true);
$regressor->train($training);
$trainTime = (microtime(true) - $trainStart) * 1000;
echo "✓ Model trained in " . number_format($trainTime, 2) . " ms\n\n";
// ============================================================
// STEP 3: Evaluate Model Performance
// ============================================================
echo "STEP 3: Evaluating Model Performance\n";
echo "------------------------------------------------------------\n";
// Make predictions on test set
$predictions = $regressor->predict($testing);
$actuals = $testing->labels();
// Calculate evaluation metrics
$r2Metric = new RSquared();
$maeMetric = new MeanAbsoluteError();
$mseMetric = new MeanSquaredError();
$r2 = $r2Metric->score($predictions, $actuals);
$mae = $maeMetric->score($predictions, $actuals);
$rmse = sqrt($mseMetric->score($predictions, $actuals));
echo "Test Set Performance:\n";
echo " R² Score: " . number_format($r2, 3) . " (explains " . number_format($r2 * 100, 1) . "% of variance)\n";
echo " RMSE: \$" . number_format($rmse, 0) . " (avg prediction error)\n";
echo " MAE: \$" . number_format($mae, 0) . " (avg absolute error)\n\n";
// Show individual predictions
echo "Sample Predictions vs Actuals:\n";
for ($i = 0; $i < min(5, count($predictions)); $i++) {
$error = abs($predictions[$i] - $actuals[$i]);
$errorPct = ($error / $actuals[$i]) * 100;
echo sprintf(
" Predicted: \$%s | Actual: \$%s | Error: \$%s (%.1f%%)\n",
number_format($predictions[$i], 0),
number_format($actuals[$i], 0),
number_format($error, 0),
$errorPct
);
}
echo "\n";
// ============================================================
// STEP 4: Analyze Feature Importance
// ============================================================
echo "STEP 4: Feature Importance Analysis\n";
echo "------------------------------------------------------------\n";
// Get feature importances from Random Forest
$importances = $regressor->featureImportances();
$featureNames = [
'Square Feet',
'Bedrooms',
'Age (years)',
'Location Score',
];
// Sort features by importance
$featureData = array_map(
fn($name, $importance) => ['name' => $name, 'importance' => $importance],
$featureNames,
$importances
);
usort($featureData, fn($a, $b) => $b['importance'] <=> $a['importance']);
echo "Feature Importance (how much each feature drives predictions):\n\n";
foreach ($featureData as $i => $feature) {
$percentage = $feature['importance'] * 100;
$barLength = (int)($percentage / 2.5); // Scale to reasonable bar length
$bar = str_repeat('█', $barLength);
echo sprintf(
" %d. %-15s: %5.1f%% %s\n",
$i + 1,
$feature['name'],
$percentage,
$bar
);
}
echo "\n";
echo "Key Insights:\n";
echo " • " . $featureData[0]['name'] . " is the strongest predictor\n";
echo " • " . $featureData[1]['name'] . " also significantly impacts price\n";
echo " • " . $featureData[3]['name'] . " has minimal influence\n\n";
// ============================================================
// STEP 5: Make Predictions on New Houses
// ============================================================
echo "STEP 5: Predicting Prices for New Houses\n";
echo "------------------------------------------------------------\n";
$newHouses = [
[1500, 3, 8, 7], // Mid-size, 3 bed, 8 years old, good location
[2200, 4, 3, 9], // Large, 4 bed, newer, excellent location
[1000, 2, 15, 5], // Small, 2 bed, older, average location
];
$newHouseDescriptions = [
'1500 sqft, 3 bed, 8 years, location: 7/10',
'2200 sqft, 4 bed, 3 years, location: 9/10',
'1000 sqft, 2 bed, 15 years, location: 5/10',
];
$newDataset = new Unlabeled($newHouses);
$newPredictions = $regressor->predict($newDataset);
echo "Predicted prices for new houses:\n\n";
foreach ($newPredictions as $i => $price) {
echo sprintf(
" House %d (%s)\n",
$i + 1,
$newHouseDescriptions[$i]
);
echo sprintf(" Predicted Price: \$%s\n\n", number_format($price, 0));
}
// ============================================================
// STEP 6: Compare Multiple Regression Algorithms (Optional)
// ============================================================
echo "STEP 6: Comparing Regression Algorithms\n";
echo "------------------------------------------------------------\n";
use Rubix\ML\Regressors\KNNRegressor;
use Rubix\ML\Regressors\Ridge;
$algorithms = [
'Random Forest' => new RandomForestRegressor(estimators: 50),
'k-NN Regressor' => new KNNRegressor(k: 3),
'Ridge Regression' => new Ridge(alpha: 1.0),
];
echo "Training and evaluating 3 regression algorithms...\n\n";
$results = [];
foreach ($algorithms as $name => $algo) {
$startTime = microtime(true);
$algo->train($training);
$trainDuration = (microtime(true) - $startTime) * 1000;
$preds = $algo->predict($testing);
$r2Score = $r2Metric->score($preds, $actuals);
$maeScore = $maeMetric->score($preds, $actuals);
$results[] = [
'name' => $name,
'r2' => $r2Score,
'mae' => $maeScore,
'train_time' => $trainDuration,
];
}
// Sort by R² (descending)
usort($results, fn($a, $b) => $b['r2'] <=> $a['r2']);
echo "Algorithm Comparison (sorted by R² score):\n";
echo "┌─────────────────────┬──────────┬────────────┬─────────────┐\n";
echo "│ Algorithm │ R² Score │ MAE (\$) │ Train (ms) │\n";
echo "├─────────────────────┼──────────┼────────────┼─────────────┤\n";
foreach ($results as $result) {
echo sprintf(
"│ %-19s │ %8.3f │ %10s │ %11s │\n",
$result['name'],
$result['r2'],
number_format($result['mae'], 0),
number_format($result['train_time'], 2)
);
}
echo "└─────────────────────┴──────────┴────────────┴─────────────┘\n\n";
echo "Best Algorithm: " . $results[0]['name'] . " (R² = " . number_format($results[0]['r2'], 3) . ")\n\n";
echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║ ✓ Regression & Feature Importance Complete! ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n";Expected Result
╔══════════════════════════════════════════════════════════╗
║ House Price Prediction with Regression ║
╚══════════════════════════════════════════════════════════╝
STEP 1: Preparing House Price Dataset
------------------------------------------------------------
Dataset prepared:
Samples: 20 houses
Features: 4 (sqft, bedrooms, age, location_score)
Target: Price ($)
Sample data:
House 1: 1200 sqft, 3 bed, 10 yrs, loc:7 → $250,000
House 2: 1800 sqft, 4 bed, 5 yrs, loc:8 → $380,000
House 3: 900 sqft, 2 bed, 20 yrs, loc:5 → $180,000
STEP 2: Training Random Forest Regressor
------------------------------------------------------------
Data split:
Training: 16 samples
Testing: 4 samples
Training Random Forest Regressor (100 trees)...
✓ Model trained in 42.31 ms
STEP 3: Evaluating Model Performance
------------------------------------------------------------
Test Set Performance:
R² Score: 0.912 (explains 91.2% of variance)
RMSE: $18,450 (avg prediction error)
MAE: $14,200 (avg absolute error)
Sample Predictions vs Actuals:
Predicted: $267,800 | Actual: $265,000 | Error: $2,800 (1.1%)
Predicted: $492,300 | Actual: $490,000 | Error: $2,300 (0.5%)
Predicted: $413,200 | Actual: $410,000 | Error: $3,200 (0.8%)
Predicted: $286,500 | Actual: $285,000 | Error: $1,500 (0.5%)
STEP 4: Feature Importance Analysis
------------------------------------------------------------
Feature Importance (how much each feature drives predictions):
1. Square Feet : 52.3% ████████████████████
2. Location Score : 28.1% ███████████
3. Bedrooms : 12.4% █████
4. Age (years) : 7.2% ███
Key Insights:
• Square Feet is the strongest predictor
• Location Score also significantly impacts price
• Age (years) has minimal influence
STEP 5: Predicting Prices for New Houses
------------------------------------------------------------
Predicted prices for new houses:
House 1 (1500 sqft, 3 bed, 8 years, location: 7/10)
Predicted Price: $285,400
House 2 (2200 sqft, 4 bed, 3 years, location: 9/10)
Predicted Price: $445,800
House 3 (1000 sqft, 2 bed, 15 years, location: 5/10)
Predicted Price: $203,100
STEP 6: Comparing Regression Algorithms
------------------------------------------------------------
Training and evaluating 3 regression algorithms...
Algorithm Comparison (sorted by R² score):
┌─────────────────────┬──────────┬────────────┬─────────────┐
│ Algorithm │ R² Score │ MAE ($) │ Train (ms) │
├─────────────────────┼──────────┼────────────┼─────────────┤
│ Random Forest │ 0.912 │ 14,200 │ 42.31 │
│ k-NN Regressor │ 0.854 │ 18,700 │ 2.14 │
│ Ridge Regression │ 0.831 │ 21,300 │ 3.89 │
└─────────────────────┴──────────┴────────────┴─────────────┘
Best Algorithm: Random Forest (R² = 0.912)
╔══════════════════════════════════════════════════════════╗
║ ✓ Regression & Feature Importance Complete! ║
╚══════════════════════════════════════════════════════════╝Why It Works
Regression Fundamentals:
- Regression predicts continuous values by learning relationships between features and target
- Random Forest builds 100 decision trees, each making a prediction, then averages them
- More trees = more robust predictions but slower training
- Works well for non-linear relationships (unlike linear regression)
Evaluation Metrics:
- R² (R-squared): Proportion of variance explained (0-1, higher is better). 0.91 = model explains 91% of price variation
- RMSE (Root Mean Squared Error): Average prediction error in original units (dollars). Penalizes large errors more
- MAE (Mean Absolute Error): Average absolute difference. More interpretable than RMSE
Feature Importance:
- Tree-based models track how often each feature is used for splits and how much it reduces error
- Higher importance = feature has stronger relationship with target
- Helps prioritize data collection and feature engineering
- Can reveal surprising relationships (e.g., square footage matters 2x more than bedrooms)
Why Random Forest Wins:
- Handles non-linear relationships (price doesn't scale linearly with sqft)
- Resistant to overfitting through ensemble averaging
- No feature scaling required (unlike Ridge or k-NN)
- Built-in feature importance (Ridge and k-NN don't provide this directly)
Troubleshooting
R² is negative or very low (<0.5) — Features aren't predictive of target. Check if:
- Features are correlated with target (plot sqft vs price to verify relationship)
- Sufficient training data (need at least 10x samples as features)
- Target variable has variation (if all prices similar, nothing to predict)
php// Check correlation $correlation = calculateCorrelation($samples[0], $prices); echo "Feature 1 correlation: " . $correlation . "\n"; // Should be > 0.3 for meaningful relationshipRMSE very high (>50% of average price) — Model not learning. Solutions:
- Normalize features if they're on different scales:
phpuse Rubix\ML\Transformers\ZScaleStandardizer; use Rubix\ML\Pipeline; $pipeline = new Pipeline([ new ZScaleStandardizer(), ], new RandomForestRegressor(estimators: 100)); $pipeline->train($training);- Increase number of trees:
estimators: 200 - Try different algorithm (Ridge often works better for small datasets)
Feature importance all equal (~25% each) — Model not finding patterns. Causes:
- Features genuinely don't predict target (collect better features)
- Too few trees: increase
estimators: 200 - Features highly correlated (all provide same information)
Try feature selection:
php// Remove least important features and retrain $importantFeatures = [0, 3]; // Keep only sqft and location $reducedSamples = array_map( fn($sample) => [$sample[0], $sample[3]], $samples );Predictions way off for new houses — Model overfitting or new data differs from training. Check:
- New data in same range as training (don't predict for 5000 sqft if trained on 900-2800)
- Sufficient test set size (at least 20% of data)
- Random state set for reproducibility:
phpsrand(42); // Set random seed [$training, $testing] = $dataset->randomize()->split(0.8);RandomForest training very slow — Large dataset or too many trees. Solutions:
- Reduce trees:
estimators: 50(faster, slight accuracy loss) - Increase
minLeafSize: 5(faster, less precise) - Try faster algorithm like Ridge for quick experiments
- Reduce trees:
"Call to undefined method featureImportances()" — Algorithm doesn't support feature importance. Only tree-based models (RandomForest, GradientBoosting, DecisionTree) have this. For Ridge or k-NN, use permutation importance or analyze coefficients manually.
Exercises
Apply what you've learned by building your own projects:
Exercise Solutions
Sample solutions are available in code/chapter-08/solutions/. Try implementing them yourself first!
Exercise 1: Spam Filter with Evaluation
Goal: Build a complete spam filter with PHP-ML and comprehensive evaluation
Create exercise1-spam-filter.php that:
- Loads email dataset (provided in
code/chapter-08/data/emails.csv) - Extracts at least 6 features per email (your choice)
- Trains a k-NN classifier (try different k values)
- Performs 5-fold cross-validation
- Reports accuracy, precision, recall, and F1-score
- Saves the best model to disk
Validation: Your implementation should achieve:
- Accuracy > 85%
- Clear feature extraction function
- Cross-validation with at least 3 folds
- Model successfully saved and reloaded
Exercise 2: Customer Segmentation Dashboard
Goal: Create a customer clustering analysis tool with Rubix ML
Build a script that:
- Loads customer data (spending, frequency, recency)
- Applies both k-means and DBSCAN clustering
- Compares results from both algorithms
- For each cluster, calculates:
- Average spending
- Average visit frequency
- Customer lifetime value estimate
- Assigns business segment labels (VIP, Regular, Occasional, At-Risk)
- Exports results to CSV for business team
Validation:
- Finds reasonable customer segments (not all in one cluster)
- Provides actionable business insights
- Handles edge cases (single-customer clusters, outliers)
Exercise 3: Algorithm Benchmark Suite
Goal: Build a reusable tool for comparing algorithms on any dataset
Create a class AlgorithmBenchmark that:
$benchmark = new AlgorithmBenchmark($dataset);
$benchmark->addAlgorithm('k-NN', new KNearestNeighbors(k: 5));
$benchmark->addAlgorithm('Naive Bayes', new GaussianNB());
// ... add more
$results = $benchmark->runAll();
$benchmark->exportReport('benchmark-results.html');Must include:
- Train/test split configurable
- Multiple metrics (accuracy, F1, precision, recall)
- Timing measurements
- Statistical significance tests (compare if differences are meaningful)
- HTML report generation with charts (text-based charts OK)
Validation:
- Works with any Rubix ML classifier
- Proper error handling for failed algorithms
- Comprehensive report with recommendations
Exercise 4: Sentiment Analysis with Pipeline
Goal: Build a text sentiment analyzer using Rubix ML pipeline
Implement a sentiment analyzer that:
- Loads movie review dataset (positive/negative)
- Creates a pipeline with:
- Text preprocessing (lowercasing, stop word removal)
- TF-IDF vectorization (use Rubix ML transformers)
- Classifier (try multiple)
- Uses 3-way split (train/validation/test)
- Tunes hyperparameters on validation set
- Reports final performance on test set
- Saves production-ready model
Validation:
- Clear pipeline with at least 2 preprocessing steps
- Proper 3-way split (no data leakage)
- Test accuracy > 75%
- Can classify new reviews instantly
Exercise 5: Regression with Feature Engineering
Goal: Extend Step 9 with advanced feature engineering techniques
Starting from the 09-regression-feature-importance.php house price example, enhance it with:
Engineer 3+ new features from the existing 4 features:
- Price per square foot ratio (derived feature)
- Age category: new (0-5 yrs), mid (6-15 yrs), old (16+ yrs) — one-hot encoded
- Bedroom to square foot ratio
- Location-sqft interaction (location_score * sqft / 1000)
Compare 4+ regression algorithms:
- RandomForestRegressor (from Step 9 baseline)
- SVR (Support Vector Regressor)
- KNNRegressor
- Ridge Regression
- Extra Trees Regressor (bonus)
Use 5-fold cross-validation for robust evaluation instead of single train/test split
Analyze feature importance for top 3 tree-based algorithms:
- Compare which features are consistently important
- Visualize with text-based bar charts
Predict with error estimates:
- Use ensemble variance for confidence intervals
- Show "Price: $285k ± $12k (95% confidence)"
Validation:
- R² improves by at least 0.05 compared to Step 9 baseline (with engineered features)
- Cross-validation scores are consistent (std dev < 0.1)
- Feature importance consistent across tree-based models
- Predictions reasonable for edge cases (very small/large houses)
- Error estimates reflect actual prediction quality
Example Output:
Feature Engineering Results:
Original features: 4
Engineered features: 7
Total features: 11
Cross-Validation Results (5-fold):
Random Forest: R² = 0.94 ± 0.03
SVR: R² = 0.87 ± 0.05
Ridge: R² = 0.83 ± 0.04
Best Model: Random Forest (R² = 0.94)
Feature Importance (Top 5):
1. Square Feet: 38.2%
2. Location-sqft interaction: 24.1%
3. Location Score: 18.5%
4. Price per sqft ratio: 11.3%
5. Bedrooms: 4.7%
New House Prediction:
Features: [1500 sqft, 3 bed, 8 yrs, loc:7]
Predicted Price: $285,400 ± $14,200 (95% confidence)Exercise 6: Enhance Production API
Goal: Extend the Step 8 API with advanced production features
Starting from the 08-production-api.php implementation in Step 8, add the following advanced capabilities:
Rate Limiting: Limit requests to 100 per minute per IP address
- Track requests in memory or Redis
- Return 429 (Too Many Requests) when limit exceeded
- Include
X-RateLimit-Remainingheader in responses
API Key Authentication: Require valid API key for predictions
- Accept key via
X-API-Keyheader - Return 401 (Unauthorized) for missing/invalid keys
- Support multiple keys with different rate limits
- Accept key via
Batch Predictions: Accept multiple feature vectors in one request
jsonPOST /predict/batch { "features": [ [5.1, 3.5, 1.4, 0.2], [6.3, 3.3, 6.0, 2.5] ] }Response Caching: Cache predictions for identical inputs
- Use simple array cache or APCu
- Return
X-Cache: HITorX-Cache: MISSheader - Track cache hit rate in stats endpoint
Prometheus Metrics Endpoint: Add
/metricsendpoint- Total requests counter
- Request duration histogram
- Prediction class distribution
- Cache hit rate
Example enhanced request:
curl -X POST http://localhost:8000/predict \
-H "X-API-Key: secret123" \
-H "Content-Type: application/json" \
-d '{"features": [5.1, 3.5, 1.4, 0.2]}'Response:
{
"prediction": "Iris-setosa",
"confidence": 0.95,
"processing_time_ms": 1.23,
"request_id": "pred_67234abc",
"cached": false
}Validation:
- Rate limiting enforced (test by making >100 requests)
- Invalid API keys rejected with 401
- Batch predictions process correctly and return array
- Cache hit rate > 20% with repeated requests
- Metrics endpoint returns Prometheus-compatible format
- All Step 8 features still work (validation, logging, health checks)
Troubleshooting
Common issues when working with PHP ML libraries:
Installation Issues
Composer requires PHP 8.0+ but you have 7.4
Update PHP or adjust composer.json:
{
"require": {
"php": "^7.4|^8.0"
}
}Memory exhausted during composer install
Increase PHP memory limit:
php -d memory_limit=-1 $(which composer) installPHP-ML Specific Issues
Predictions inconsistent between runs
k-means uses random initialization. For reproducible results (dev only):
mt_srand(42); // Set before clustering
$kmeans = new KMeans(n: 3);Cannot predict on new samples after clustering
PHP-ML limitation. Workaround:
// Calculate centroids manually
$centroids = [];
foreach ($clusters as $cluster) {
$centroid = [];
for ($i = 0; $i < $numFeatures; $i++) {
$centroid[] = array_sum(array_column($cluster, $i)) / count($cluster);
}
$centroids[] = $centroid;
}
// Assign new sample to nearest centroid
function assignToCluster($sample, $centroids) {
$minDist = PHP_FLOAT_MAX;
$assignedCluster = 0;
foreach ($centroids as $i => $centroid) {
$dist = euclideanDistance($sample, $centroid);
if ($dist < $minDist) {
$minDist = $dist;
$assignedCluster = $i;
}
}
return $assignedCluster;
}Rubix ML Specific Issues
Error: "Samples and labels must have the same number of elements"
Mismatch between samples and labels count:
echo "Samples: " . count($samples) . "\n";
echo "Labels: " . count($labels) . "\n";
// Debug which one is wrongPipeline transformation fails on test data
Ensure transformers fitted on training data only:
// WRONG
$transformer->fit($allData); // Leaks test data!
// RIGHT
$pipeline->train($trainingDataset); // Fits transformers to training only
$predictions = $pipeline->predict($testingDataset); // Applies same transformationsNeural network won't converge
- Try more training epochs
- Reduce learning rate
- Normalize/standardize features
- Check for NaN values in data
- Use simpler architecture (fewer layers/neurons)
Model Persistence Issues
Unserialization fails after library update
Models serialized with v2.x can't be loaded in v3.x. Retrain:
composer show rubix/ml # Check version
# If version changed, retrain model
php train-model.phpLoaded model makes different predictions
Feature extraction must be identical:
// TRAINING TIME
$features = extractFeatures($email); // [5, 2, 1, 0, 3, 8]
// PRODUCTION TIME
$features = extractFeatures($email); // Must produce same array!
// Verify order: [word_count, has_free, ...]Performance Issues
Training very slow with large dataset
PHP is single-threaded. Solutions:
// 1. Sample data for development
$sampledData = array_slice($allData, 0, 1000);
// 2. Use faster algorithms (Naive Bayes vs Neural Net)
// 3. Precompute expensive features offline
// 4. Consider Python for training, PHP for inferencePrediction latency too high for real-time
Optimize inference:
// 1. Load model once, not per request
static $model = null;
if ($model === null) {
$model = $persister->load();
}
// 2. Batch predictions when possible
$predictions = $model->predict($multipleSamples);
// 3. Cache predictions for common inputs
$cacheKey = md5(json_encode($features));
if ($cached = $cache->get($cacheKey)) {
return $cached;
}
// 4. Choose faster algorithm (tree vs k-NN)Data Quality Issues
All predictions return the same class
Check class balance and features:
// 1. Check labels
print_r(array_count_values($labels));
// Should be roughly balanced
// 2. Check if features are informative
$correlation = calculateFeatureCorrelation($features, $labels);
// Features should correlate with labels
// 3. Try different k (for k-NN) or algorithmAccuracy seems too good (> 99%)
Possible data leakage:
// WRONG: Normalize before split
$normalized = normalize($allData);
[$train, $test] = split($normalized); // Leaked test stats!
// RIGHT: Split first
[$train, $test] = split($allData);
$normalized_train = normalize($train);
$normalized_test = normalizeUsing($test, $trainStats);Wrap-up
🎉 Congratulations! You've mastered PHP machine learning libraries and can now build production-ready ML applications in a fraction of the time it takes to code from scratch.
What You've Accomplished
┌─────────────────────────────────────────────────────────────┐
│ From Custom Code → To Library-Powered │
├─────────────────────────────────────────────────────────────┤
│ 120 lines → 12 lines (90% reduction) │
│ Reinvent algorithms → Battle-tested implementations│
│ Manual edge cases → Handled automatically │
│ Train every request → Load once, use 1000x │
│ Single algorithm → Compare 6+ with ease │
│ Limited features → 40+ algorithms available │
└─────────────────────────────────────────────────────────────┘Specific Skills Mastered:
- ✓ Installed and configured PHP-ML and Rubix ML with proper dependency management and verification
- ✓ Reimplemented the Chapter 6 spam filter using PHP-ML, reducing code by 66% while improving robustness
- ✓ Built a complete iris classifier with Rubix ML's pipeline, combining preprocessing and prediction seamlessly
- ✓ Created customer segmentation system using k-means and DBSCAN clustering for unsupervised learning
- ✓ Mastered model persistence saving and loading models for 50x faster production inference
- ✓ Compared 6+ algorithms systematically understanding the trade-offs between accuracy and speed
- ✓ Loaded real data from CSV, databases, and APIs using Rubix ML's data loading utilities
- ✓ Applied 6+ transformers in production pipelines (NumericStringConverter, MissingDataImputer, OneHotEncoder, MinMaxNormalizer, and more)
- ✓ Built production-ready REST API with singleton pattern, input validation, error handling, logging, and health checks
- ✓ Mastered regression models for house price prediction with RandomForestRegressor achieving R² > 0.9
- ✓ Analyzed feature importance to identify which variables drive predictions most (square footage, location, etc.)
- ✓ Understood when to use each library — PHP-ML for learning and prototypes, Rubix ML for production
Key Insights Gained
Code Efficiency: Libraries reduce boilerplate by 90%, letting you focus on business logic instead of algorithm implementation. What took 120 lines custom now takes 12 lines with a library.
Production Readiness: Rubix ML provides battle-tested implementations with edge case handling, type safety, and performance optimizations you'd spend months building yourself.
Library Selection:
- PHP-ML: Great for learning ML concepts, prototyping, small projects
- Rubix ML: Production applications, complex pipelines, comprehensive ecosystem
- Custom code: When you need full control or implement cutting-edge research
Performance Patterns:
- Train offline, save model, load once in production
- Batch predictions when possible
- Cache expensive feature extraction
- Choose algorithms based on speed vs. accuracy requirements
Production Patterns Mastered:
- Singleton Model Loading: Load models once at startup (singleton pattern), not per request — 50x faster
- Input Validation: Validate all inputs before prediction to prevent crashes from bad data
- Error Handling: Use proper HTTP status codes (400 for client errors, 500 for server errors)
- Prediction Logging: Log all predictions for monitoring, debugging, and retraining data
- Health Checks: Provide health check endpoints for load balancers and monitoring systems
- Performance Monitoring: Track inference time, request counts, and success rates
- Data Loading: Load from CSV, databases, and APIs using library utilities (not hardcoded arrays)
- Transformer Pipelines: Chain preprocessing steps ensuring test data uses training statistics (prevents data leakage)
When to Use Libraries vs. Custom Code
Use Libraries When:
- Building production applications (reliability matters)
- Need standard algorithms (k-NN, Random Forest, etc.)
- Want comprehensive ecosystem (metrics, pipelines, transformers)
- Time-to-market is critical
- Team needs maintainable code
Use Custom Code When:
- Learning ML algorithms deeply (educational purposes)
- Implementing novel/research algorithms not in libraries
- Need extreme performance optimization for specific use case
- Library doesn't support your exact requirements
Hybrid Approach (Best of Both):
- Use libraries for standard ML tasks
- Custom code for domain-specific preprocessing
- Library algorithms with custom feature engineering
- Extend library classes when needed
What's Next
In Chapter 09: Advanced Machine Learning Techniques, you'll explore:
- Decision trees and ensemble methods (Random Forests, Gradient Boosting)
- Support Vector Machines for complex classification boundaries
- Dimensionality reduction with PCA for high-dimensional data
- Anomaly detection for fraud detection and outlier identification
- Advanced clustering techniques for complex patterns
- Feature selection to identify most important variables
- Hyperparameter tuning strategies for optimal performance
You'll use Rubix ML's advanced algorithms to tackle complex real-world problems like fraud detection, anomaly identification, and multi-class classification with hundreds of features.
Production Checklist
Before deploying an ML system to production, verify:
- [ ] Model trained on representative, recent data
- [ ] Performance metrics meet business requirements
- [ ] Model persisted and loads successfully
- [ ] Feature extraction documented and consistent
- [ ] Error handling covers edge cases (missing data, invalid input)
- [ ] Monitoring in place (prediction distribution, latency, errors)
- [ ] Retraining schedule defined (weekly/monthly for model drift)
- [ ] API endpoints have rate limiting and authentication
- [ ] Predictions logged for auditing and improvement
- [ ] Rollback plan if model performs poorly
Key Takeaways
The Three Rules of Production ML:
Train Offline, Deploy Online: Never retrain on user requests. Train once, save model, load at startup, serve thousands.
Libraries > Custom Code (Usually): 90% time savings, better robustness, active maintenance. Use custom only when learning or implementing novel research.
Measure Everything: Always compare algorithms, monitor performance, log predictions, track model drift.
Remember: A 95% accurate model that takes 10 seconds per prediction is worse than an 85% accurate model that predicts in 10ms. Production requirements always trump theoretical perfection.
Further Reading
Deepen your understanding of PHP ML libraries and best practices:
Library Documentation
- PHP-ML Documentation — Complete reference for PHP-ML algorithms, examples, and API
- Rubix ML Documentation — Comprehensive guides, API reference, and tutorials for Rubix ML
- Rubix ML GitHub — Source code, issues, and community discussions
Algorithm Deep Dives
- k-Nearest Neighbors Explained — Mathematics and applications of k-NN
- Naive Bayes Classification — Probabilistic classifiers and Bayes' theorem
- Decision Trees and Random Forests — Original Random Forest paper by Breiman
- Neural Networks Basics — Stanford CS231n guide to neural networks
Production ML
- Machine Learning Systems Design — Designing ML systems for production
- ML Model Serving Patterns — Martin Fowler on CD4ML (Continuous Delivery for ML)
- Monitoring Machine Learning Models — Detecting model drift and performance degradation
PHP and Machine Learning
- PHP-ML Examples Repository — Community-contributed examples
- Rubix ML Blog — Articles on ML techniques and Rubix ML features
- PHP Machine Learning on Reddit — Community discussions and questions
Knowledge Check
Test your understanding of PHP ML libraries: