Skip to content

Leveraging PHP Machine Learning Libraries

Chapter 08: Leveraging PHP Machine Learning Libraries

Overview

In previous chapters, you built machine learning models from scratch—calculating distances by hand, implementing train/test splits, and writing evaluation metrics. This taught you how ML algorithms work under the hood. But in production, you don't want to reinvent the wheel.

Professional PHP ML libraries like PHP-ML and Rubix ML provide battle-tested implementations of dozens of algorithms, complete with optimizations, edge case handling, and convenient APIs. What took you 200 lines of custom code can often be accomplished in 20 lines with a library—and it'll be faster, more robust, and easier to maintain.

This chapter shows you how to leverage these libraries effectively. You'll reimplement the spam filter from Chapter 6, but this time using library classes that handle the complexity for you. You'll discover Rubix ML's comprehensive ecosystem with 40+ algorithms covering the entire ML lifecycle. You'll learn how to save and load trained models for reuse. Most importantly, you'll understand when to use libraries versus custom implementations, and which library to choose for different scenarios.

By the end of this chapter, you'll be able to drop powerful ML capabilities into any PHP application using just a few lines of code, while understanding exactly what's happening behind those convenient APIs.

Prerequisites

Before starting this chapter, you should have:

  • Completed Chapter 06 (spam filter implementation) or equivalent classification experience
  • Completed Chapter 07 (evaluation metrics) or understand accuracy, precision, and recall
  • Understanding of train/test splits and cross-validation from Chapter 03
  • PHP 8.4+ installed with Composer available
  • Comfortable installing packages via Composer
  • Basic understanding of object-oriented PHP (classes, interfaces, type declarations)

Estimated Time: ~85-100 minutes (installation, examples, and exercises)

What You'll Build

By the end of this chapter, you will have created:

  • A spam classifier using PHP-ML that matches your Chapter 6 custom implementation in 1/10th the code
  • A sentiment analyzer using Rubix ML with advanced preprocessing and multiple algorithms
  • A customer segmentation system using k-means clustering from PHP-ML
  • An iris classifier using Rubix ML's pipeline with feature normalization and cross-validation
  • A model persistence system that saves trained models and loads them for instant predictions
  • A regression model for price prediction using Rubix ML's regressors
  • An algorithm comparison tool testing 6+ different classifiers on the same dataset
  • A real data loading system that loads from CSV files, databases, and JSON APIs
  • A transformer pipeline applying 6+ preprocessing steps (type conversion, imputation, encoding, normalization)
  • A production REST API with singleton model loading, input validation, error handling, logging, and health checks
  • A library comparison matrix documenting when to use PHP-ML vs. Rubix ML vs. custom code

All implementations include proper error handling, performance benchmarks, and production-ready patterns.

Code Examples

Complete, runnable examples for this chapter will be available in code/chapter-08/. The code snippets shown in each step are fully functional and can be run directly. To use them:

bash
# Create chapter directory
mkdir -p code/chapter-08
cd code/chapter-08

# Initialize Composer
composer init --no-interaction
composer require php-ai/php-ml:^0.10
composer require rubix/ml:^3.0

Example files referenced:

  • 01-phpml-spam-filter.php — Spam classification with PHP-ML
  • 02-rubix-iris-classifier.php — Iris classification with Rubix ML
  • 03-phpml-clustering.php — Customer segmentation with k-means
  • 05-model-persistence.php — Saving and loading models
  • 06-algorithm-comparison.php — Comparing 6+ algorithms
  • 07-load-real-data.php — Loading data from CSV/databases with transformers
  • 08-production-api.php — Production REST API for serving predictions
  • 09-regression-feature-importance.php — House price regression with feature analysis

Copy the code from each step into these files to run the examples.

Quick Start

Want to see the power of ML libraries immediately? Here's a 5-minute example comparing custom code vs. PHP-ML:

php
# filename: quick-start-comparison.php
<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use Phpml\Classification\KNearestNeighbors;

// ============================================================
// BEFORE: Custom Implementation (from Chapter 6)
// ============================================================

// You wrote 50+ lines to:
// - Calculate Euclidean distance
// - Find k nearest neighbors
// - Take majority vote
// - Handle edge cases

// ============================================================
// AFTER: Using PHP-ML (this chapter)
// ============================================================

// Training data
$samples = [
    [1, 1, 0, 3],  // spam features
    [0, 0, 0, 0],  // ham features
    [1, 1, 1, 4],  // spam
    [0, 0, 0, 1],  // ham
];
$labels = ['spam', 'ham', 'spam', 'ham'];

// Train classifier - that's it!
$classifier = new KNearestNeighbors(k: 3);
$classifier->train($samples, $labels);

// Make predictions
$predictions = $classifier->predict([
    [1, 1, 1, 3],  // likely spam
    [0, 0, 0, 0],  // likely ham
]);

foreach ($predictions as $i => $pred) {
    $icon = $pred === 'spam' ? '🚫' : '✅';
    echo "{$icon} Email " . ($i + 1) . ": {$pred}\n";
}

// The library handled:
// ✓ Distance calculations with optimizations
// ✓ Edge cases (ties, equal distances)
// ✓ Batch predictions
// ✓ Type safety and validation
// ✓ Performance optimizations

Run it:

bash
cd docs/series/ai-ml-php-developers/code/chapter-08
composer install
php quick-start-comparison.php

Expected output:

🚫 Email 1: spam
✅ Email 2: ham

Time saved: 90% less code, 100% more reliable!

Now let's explore what these libraries can really do...

Objectives

By the end of this chapter, you will be able to:

  • Install and configure PHP-ML and Rubix ML in any PHP project with proper dependency management
  • Choose the right library for specific tasks based on algorithm availability, performance, and ecosystem maturity
  • Reimplement custom algorithms using library classes with 10x less code while improving robustness
  • Use classification algorithms from both libraries including k-NN, Naive Bayes, SVM, Decision Trees, and ensemble methods
  • Perform clustering with k-means, DBSCAN, and hierarchical clustering for customer segmentation
  • Build regression models for price prediction, demand forecasting, and continuous value estimation
  • Save and load trained models using proper serialization for production deployment
  • Create ML pipelines combining preprocessing, training, and evaluation in Rubix ML
  • Compare multiple algorithms systematically to choose the best for your dataset
  • Load real data from CSV files, databases, and JSON APIs using Rubix ML's data loading utilities
  • Apply preprocessing transformers (type conversion, missing data imputation, one-hot encoding, normalization) in correct order
  • Build production REST APIs with singleton model loading, input validation, error handling, logging, and health checks
  • Handle real-world data with library utilities for missing values, normalization, and encoding

Step 1: Installing PHP ML Libraries (~8 min)

Goal

Set up both PHP-ML and Rubix ML in a development environment and verify installations with working test scripts.

Actions

Understanding PHP ML Libraries

PHP has two mature machine learning libraries, each with different strengths:

PHP-ML (php-ai/php-ml)

  • Focus: Simple, educational implementations of core algorithms
  • Strengths: Easy to learn, minimal dependencies, good for beginners
  • Algorithms: ~15 classifiers, regressors, and clustering algorithms
  • Best for: Learning ML concepts, small projects, prototyping

Rubix ML (rubix/ml)

  • Focus: Production-ready, comprehensive ML ecosystem
  • Strengths: 40+ algorithms, advanced pipelines, active development
  • Algorithms: Full ML lifecycle coverage with preprocessing, validation, and visualization
  • Best for: Production applications, complex pipelines, serious ML work

Quick Comparison Table

FeaturePHP-MLRubix ML
Algorithms~15 core algorithms40+ algorithms
Learning CurveGentleSteeper
DocumentationGoodExcellent
Pipeline SupportNoYes
PerformanceGoodOptimized
Active DevelopmentModerateVery Active
Best Use CaseLearning, PrototypingProduction, Complex ML
DependenciesMinimalMore comprehensive

Let's install both so you can compare and choose based on your needs:

Installation Steps

Create a new project directory:

bash
# From the series code directory
mkdir -p chapter-08
cd chapter-08

Initialize Composer and install both libraries:

bash
# Initialize project
composer init --name="aiml/chapter-08" --type=project --no-interaction

# Install PHP-ML
composer require php-ai/php-ml:^0.10

# Install Rubix ML
composer require rubix/ml:^3.0

What just happened?

  • PHP-ML installed with basic algorithm implementations
  • Rubix ML installed with comprehensive ML toolkit
  • Dependencies resolved automatically by Composer
  • Autoloading configured in vendor/autoload.php

Verify PHP-ML Installation

Create a test script:

php
# filename: test-phpml.php
<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use Phpml\Classification\KNearestNeighbors;
use Phpml\Clustering\KMeans;
use Phpml\Regression\LeastSquares;

echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║           PHP-ML Installation Verification              ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";

// Test 1: Classification
echo "Test 1: Classification (k-NN)\n";
echo "------------------------------------------------------------\n";
$classifier = new KNearestNeighbors(k: 3);
$classifier->train([[1, 2], [3, 4]], ['A', 'B']);
$prediction = $classifier->predict([2, 3]);
echo "✓ k-NN classifier working\n";
echo "  Sample [2, 3] → Predicted: {$prediction}\n\n";

// Test 2: Clustering
echo "Test 2: Clustering (k-Means)\n";
echo "------------------------------------------------------------\n";
$kmeans = new KMeans(n: 2);
$clusters = $kmeans->cluster([[1, 1], [1.5, 2], [5, 8], [8, 8]]);
echo "✓ k-Means clustering working\n";
echo "  Found " . count($clusters) . " clusters\n";
echo "  Cluster sizes: ";
foreach ($clusters as $i => $cluster) {
    echo "C" . ($i + 1) . "=" . count($cluster) . " ";
}
echo "\n\n";

// Test 3: Regression
echo "Test 3: Regression (Least Squares)\n";
echo "------------------------------------------------------------\n";
$regression = new LeastSquares();
$regression->train([[1], [2], [3]], [2, 4, 6]);  // y = 2x
$predictedValue = $regression->predict([4]);
echo "✓ Linear regression working\n";
echo "  Input: 4 → Predicted: " . round($predictedValue, 2) . " (expected ~8)\n\n";

echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║  ✓ PHP-ML Successfully Installed and Verified!          ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n";

Verify Rubix ML Installation

php
# filename: test-rubixml.php
<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Classifiers\KNearestNeighbors as RubixKNN;
use Rubix\ML\Clusterers\KMeans as RubixKMeans;
use Rubix\ML\Regressors\Ridge;
use Rubix\ML\CrossValidation\Metrics\Accuracy;

echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║          Rubix ML Installation Verification             ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";

// Test 1: Classification with Dataset
echo "Test 1: Classification (k-NN with Labeled Dataset)\n";
echo "------------------------------------------------------------\n";
$dataset = new Labeled(
    [[1, 2], [3, 4], [5, 6]],
    ['A', 'B', 'B']
);
$classifier = new RubixKNN(k: 3);
$classifier->train($dataset);
$prediction = $classifier->predictSample([2, 3]);
echo "✓ Rubix k-NN classifier working\n";
echo "  Sample [2, 3] → Predicted: {$prediction}\n\n";

// Test 2: Accuracy metric
echo "Test 2: Evaluation Metrics\n";
echo "------------------------------------------------------------\n";
$predictions = $classifier->predict($dataset);
$metric = new Accuracy();
$score = $metric->score($predictions, $dataset->labels());
echo "✓ Evaluation metrics working\n";
echo "  Accuracy score: " . number_format($score * 100, 2) . "%\n\n";

// Test 3: Clustering
echo "Test 3: Clustering (k-Means)\n";
echo "------------------------------------------------------------\n";
$clusterDataset = new Labeled(
    [[1, 1], [1.5, 2], [5, 8], [8, 8]],
    [] // No labels for clustering
);
$kmeans = new RubixKMeans(k: 2);
$kmeans->train($clusterDataset);
$clusterPredictions = $kmeans->predict($clusterDataset);
echo "✓ Rubix k-Means clustering working\n";
echo "  Assigned clusters: " . implode(', ', $clusterPredictions) . "\n\n";

// Test 4: Regression
echo "Test 4: Regression (Ridge Regression)\n";
echo "------------------------------------------------------------\n";
$regressionDataset = new Labeled(
    [[1], [2], [3]],
    [2, 4, 6]  // y = 2x
);
$regressor = new Ridge(alpha: 1.0);
$regressor->train($regressionDataset);
$predictedValue = $regressor->predictSample([4]);
echo "✓ Ridge regression working\n";
echo "  Input: 4 → Predicted: " . round($predictedValue, 2) . " (expected ~8)\n\n";

echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║  ✓ Rubix ML Successfully Installed and Verified!        ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n";

Expected Result

Running the PHP-ML test:

bash
php test-phpml.php
╔══════════════════════════════════════════════════════════╗
║           PHP-ML Installation Verification              ║
╚══════════════════════════════════════════════════════════╝

Test 1: Classification (k-NN)
------------------------------------------------------------
✓ k-NN classifier working
  Sample [2, 3] → Predicted: A

Test 2: Clustering (k-Means)
------------------------------------------------------------
✓ k-Means clustering working
  Found 2 clusters
  Cluster sizes: C1=2 C2=2

Test 3: Regression (Least Squares)
------------------------------------------------------------
✓ Linear regression working
  Input: 4 → Predicted: 8.00 (expected ~8)

╔══════════════════════════════════════════════════════════╗
║  ✓ PHP-ML Successfully Installed and Verified!          ║
╚══════════════════════════════════════════════════════════╝

Running the Rubix ML test:

bash
php test-rubixml.php
╔══════════════════════════════════════════════════════════╗
║          Rubix ML Installation Verification             ║
╚══════════════════════════════════════════════════════════╝

Test 1: Classification (k-NN with Labeled Dataset)
------------------------------------------------------------
✓ Rubix k-NN classifier working
  Sample [2, 3] → Predicted: A

Test 2: Evaluation Metrics
------------------------------------------------------------
✓ Evaluation metrics working
  Accuracy score: 100.00%

Test 3: Clustering (k-Means)
------------------------------------------------------------
✓ Rubix k-Means clustering working
  Assigned clusters: 0, 0, 1, 1

Test 4: Regression (Ridge Regression)
------------------------------------------------------------
✓ Ridge regression working
  Input: 4 → Predicted: 7.98 (expected ~8)

╔══════════════════════════════════════════════════════════╗
║  ✓ Rubix ML Successfully Installed and Verified!        ║
╚══════════════════════════════════════════════════════════╝

Why It Works

Composer handles PHP package management by:

  1. Downloading libraries and their dependencies from Packagist
  2. Generating an autoloader that automatically includes classes when needed
  3. Managing version constraints to ensure compatibility
  4. Creating a lock file (composer.lock) for reproducible installations

PHP-ML provides pure-PHP implementations of ML algorithms, meaning no external dependencies or compiled extensions are needed. This makes it portable and easy to deploy.

Rubix ML uses PHP's type system extensively with strict types, interfaces, and modern OOP patterns. It's designed for production use with performance optimizations and comprehensive documentation.

Troubleshooting

  • Error: "Your requirements could not be resolved" — PHP version too old. Verify php --version shows 8.1+. Update PHP or adjust composer.json minimum version.

  • Composer command not found — Composer not installed. Download from getcomposer.org and follow installation instructions.

  • Memory limit exceeded during install — Composer needs more memory. Run: php -d memory_limit=-1 $(which composer) require rubix/ml

  • Class not found errors — Autoloader not included. Ensure require_once __DIR__ . '/vendor/autoload.php'; is at the top of every script.

Quick Reference

You're now ready to use ML libraries! You have both PHP-ML and Rubix ML installed and verified. In the next steps, you'll rebuild the Chapter 6 spam filter with 90% less code, create customer segments with clustering, and deploy production-ready models.

Quick decision guide:

  • Just learning? Start with PHP-ML (simpler API)
  • Building production app? Use Rubix ML (more features, better performance)
  • Not sure? Follow along with both and decide later!

Step 2: Reimplementing the Spam Filter with PHP-ML (~12 min)

Goal

Compare your custom Chapter 6 spam filter implementation against PHP-ML's built-in classifier, demonstrating how libraries reduce code while improving robustness.

Actions

In Chapter 6, you built a spam filter from scratch with manual distance calculations, neighbor finding, and voting logic. Let's rebuild it using PHP-ML and compare.

Custom Implementation Recap (Chapter 6)

Your original implementation required:

php
// ~50 lines of custom code:
// - euclideanDistance() function
// - findKNearestNeighbors() function
// - majorityVote() function
// - Manual data structures for storing distances
// - Edge case handling for ties

PHP-ML Implementation (This Chapter)

php
# filename: 01-phpml-spam-filter.php
<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use Phpml\Classification\KNearestNeighbors;
use Phpml\FeatureExtraction\TokenCountVectorizer;
use Phpml\Tokenization\WordTokenizer;

echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║          Spam Filter: Custom vs. PHP-ML                 ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";

// ============================================================
// STEP 1: Prepare Training Data
// ============================================================

echo "STEP 1: Preparing Training Data\n";
echo "------------------------------------------------------------\n";

$trainingEmails = [
    "Free money now! Click here to claim your prize!",
    "Meeting scheduled for tomorrow at 3pm",
    "URGENT: You won the lottery! Act now!!!",
    "Can you review the project proposal?",
    "Get rich quick! Buy now! Limited offer!!!",
    "Thanks for your email. I'll respond soon.",
    "FREE gift card! Click here immediately!",
    "Let's schedule a call to discuss the requirements",
];

$trainingLabels = ['spam', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'ham'];

echo "Training set: " . count($trainingEmails) . " emails\n";
echo "  - Spam: " . array_count_values($trainingLabels)['spam'] . "\n";
echo "  - Ham: " . array_count_values($trainingLabels)['ham'] . "\n\n";

// ============================================================
// STEP 2: Feature Extraction
// ============================================================

echo "STEP 2: Feature Extraction\n";
echo "------------------------------------------------------------\n";

// Approach 1: Manual feature extraction (like Chapter 6)
function extractManualFeatures(string $email): array
{
    $lower = strtolower($email);

    return [
        str_contains($lower, 'free') ? 1 : 0,
        str_contains($lower, 'money') ? 1 : 0,
        str_contains($lower, 'urgent') ? 1 : 0,
        substr_count($email, '!'),
        preg_match_all('/[A-Z]/', $email),
        str_word_count($email),
    ];
}

$manualFeatures = array_map('extractManualFeatures', $trainingEmails);

echo "Manual feature extraction:\n";
echo "  Features per email: 6 (free, money, urgent, exclamations, capitals, words)\n";
echo "  Example: \"{$trainingEmails[0]}\"\n";
echo "  → Features: [" . implode(', ', $manualFeatures[0]) . "]\n\n";

// ============================================================
// STEP 3: Train Classifier
// ============================================================

echo "STEP 3: Training k-NN Classifier\n";
echo "------------------------------------------------------------\n";

$classifier = new KNearestNeighbors(k: 3);

$startTime = microtime(true);
$classifier->train($manualFeatures, $trainingLabels);
$trainingTime = (microtime(true) - $startTime) * 1000;

echo "✓ Classifier trained in " . number_format($trainingTime, 2) . " ms\n";
echo "  Algorithm: k-Nearest Neighbors (k=3)\n";
echo "  Training samples: " . count($manualFeatures) . "\n\n";

// ============================================================
// STEP 4: Make Predictions
// ============================================================

echo "STEP 4: Predicting New Emails\n";
echo "------------------------------------------------------------\n";

$testEmails = [
    "FREE offer! Act now! Limited time only!!!",
    "Can we meet tomorrow to discuss the budget?",
    "Congratulations! You've been selected for a prize!",
    "Please find attached the quarterly report",
];

echo "Testing " . count($testEmails) . " new emails:\n\n";

foreach ($testEmails as $i => $email) {
    $features = extractManualFeatures($email);
    $prediction = $classifier->predict([$features])[0];

    $icon = $prediction === 'spam' ? '🚫' : '✅';
    $display = strlen($email) > 45 ? substr($email, 0, 42) . '...' : $email;

    echo "{$icon} Email " . ($i + 1) . ": {$display}\n";
    echo "   Prediction: " . strtoupper($prediction) . "\n";
    echo "   Features: [" . implode(', ', $features) . "]\n\n";
}

// ============================================================
// STEP 5: Batch Predictions (Library Advantage)
// ============================================================

echo "STEP 5: Batch Predictions\n";
echo "------------------------------------------------------------\n";

$testFeatures = array_map('extractManualFeatures', $testEmails);

$batchStartTime = microtime(true);
$predictions = $classifier->predict($testFeatures);
$batchTime = (microtime(true) - $batchStartTime) * 1000;

echo "✓ Batch predicted " . count($testEmails) . " emails in " .
     number_format($batchTime, 3) . " ms\n";
echo "  Average per email: " . number_format($batchTime / count($testEmails), 3) . " ms\n\n";

// ============================================================
// COMPARISON: Custom vs. Library
// ============================================================

echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║              Custom vs. PHP-ML Comparison               ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";

$comparison = [
    ['Aspect', 'Custom (Chapter 6)', 'PHP-ML (This Chapter)'],
    ['Lines of Code', '~120 lines', '~40 lines'],
    ['Distance Function', 'Manual implementation', 'Built-in optimized'],
    ['Neighbor Finding', 'Custom sorting logic', 'Library handles it'],
    ['Edge Cases', 'Manual handling needed', 'Automatically handled'],
    ['Batch Predictions', 'Loop required', 'Native support'],
    ['Type Safety', 'Manual validation', 'Type-checked API'],
    ['Performance', 'Unoptimized', 'Optimized algorithms'],
    ['Maintenance', 'You maintain it', 'Library maintainers'],
];

foreach ($comparison as $row) {
    printf("%-20s | %-25s | %-25s\n", ...$row);
    if ($row[0] === 'Aspect') {
        echo str_repeat('-', 20) . ' | ' . str_repeat('-', 25) . ' | ' .
             str_repeat('-', 25) . "\n";
    }
}

echo "\n💡 Key Insight: PHP-ML reduced code by 66% while improving robustness!\n";

Expected Result

╔══════════════════════════════════════════════════════════╗
║          Spam Filter: Custom vs. PHP-ML                 ║
╚══════════════════════════════════════════════════════════╝

STEP 1: Preparing Training Data
------------------------------------------------------------
Training set: 8 emails
  - Spam: 4
  - Ham: 4

STEP 2: Feature Extraction
------------------------------------------------------------
Manual feature extraction:
  Features per email: 6 (free, money, urgent, exclamations, capitals, words)
  Example: "Free money now! Click here to claim your prize!"
  → Features: [1, 1, 0, 1, 2, 9]

STEP 3: Training k-NN Classifier
------------------------------------------------------------
✓ Classifier trained in 0.42 ms
  Algorithm: k-Nearest Neighbors (k=3)
  Training samples: 8

STEP 4: Predicting New Emails
------------------------------------------------------------
Testing 4 new emails:

🚫 Email 1: FREE offer! Act now! Limited time only!!!
   Prediction: SPAM
   Features: [1, 0, 0, 3, 8, 6]

✅ Email 2: Can we meet tomorrow to discuss the budget?
   Prediction: HAM
   Features: [0, 0, 0, 1, 0, 8]

🚫 Email 3: Congratulations! You've been selected for...
   Prediction: SPAM
   Features: [0, 0, 0, 1, 1, 7]

✅ Email 4: Please find attached the quarterly report
   Prediction: HAM
   Features: [0, 0, 0, 0, 0, 6]

STEP 5: Batch Predictions
------------------------------------------------------------
✓ Batch predicted 4 emails in 0.152 ms
  Average per email: 0.038 ms

╔══════════════════════════════════════════════════════════╗
║              Custom vs. PHP-ML Comparison               ║
╚══════════════════════════════════════════════════════════╝

Aspect               | Custom (Chapter 6)        | PHP-ML (This Chapter)
-------------------- | ------------------------- | -------------------------
Lines of Code        | ~120 lines                | ~40 lines
Distance Function    | Manual implementation     | Built-in optimized
Neighbor Finding     | Custom sorting logic      | Library handles it
Edge Cases           | Manual handling needed    | Automatically handled
Batch Predictions    | Loop required             | Native support
Type Safety          | Manual validation         | Type-checked API
Performance          | Unoptimized               | Optimized algorithms
Maintenance          | You maintain it           | Library maintainers

💡 Key Insight: PHP-ML reduced code by 66% while improving robustness!

Why It Works

PHP-ML's k-NN classifier encapsulates all the complexity you manually implemented:

  1. Distance calculation: Uses optimized Euclidean distance with caching
  2. Neighbor finding: Efficiently sorts distances without creating large intermediate arrays
  3. Voting: Handles ties deterministically (alphabetical order for consistency)
  4. Batch operations: Optimizes multiple predictions by reusing distance calculations where possible
  5. Type safety: Method signatures enforce correct data types at compile time

The train() method signature is:

php
public function train(array $samples, array $labels): void

PHP's type system ensures $samples is an array and $labels is an array. The library validates that:

  • Sample count matches label count
  • All samples have the same feature count
  • Labels are valid types (strings or integers)

This prevents runtime errors that custom code might miss.

Troubleshooting

  • Error: "Argument #1 must be of type array, int given" — Wrapping single sample incorrectly. For single predictions, use: $classifier->predict([$features]) (array with one element).

  • Predictions always return same class — k too large for small dataset. With 8 training samples, k=7 would include 7 neighbors (87.5% of data), likely causing majority class to always win. Use k=3 or k=5.

  • "Samples and labels count mismatch" — Arrays have different lengths. Verify: count($samples) === count($labels).

Step 3: Building an Iris Classifier with Rubix ML (~15 min)

Goal

Experience Rubix ML's comprehensive ecosystem by building a complete classification pipeline with preprocessing, cross-validation, and detailed evaluation metrics.

Actions

Rubix ML takes a more structured approach than PHP-ML, using Dataset objects, Estimator interfaces, and built-in Pipeline composition. This leads to more maintainable production code.

php
# filename: 02-rubix-iris-classifier.php
<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Classifiers\KNearestNeighbors;
use Rubix\ML\Classifiers\GaussianNB;
use Rubix\ML\Classifiers\ClassificationTree;
use Rubix\ML\Transformers\ZScaleStandardizer;
use Rubix\ML\Pipeline;
use Rubix\ML\CrossValidation\KFold;
use Rubix\ML\CrossValidation\Metrics\Accuracy;
use Rubix\ML\CrossValidation\Metrics\F1Score;
use Rubix\ML\CrossValidation\Metrics\ConfusionMatrix;

echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║       Iris Classification with Rubix ML Pipeline        ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";

// ============================================================
// STEP 1: Load Iris Dataset
// ============================================================

echo "STEP 1: Loading Iris Dataset\n";
echo "------------------------------------------------------------\n";

// In production, load from CSV. For demo, use embedded data.
$samples = [
    // Setosa (first 10)
    [5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2], [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2], [5.0, 3.6, 1.4, 0.2], [5.4, 3.9, 1.7, 0.4],
    [4.6, 3.4, 1.4, 0.3], [5.0, 3.4, 1.5, 0.2], [4.4, 2.9, 1.4, 0.2],
    [4.9, 3.1, 1.5, 0.1],

    // Versicolor (next 10)
    [7.0, 3.2, 4.7, 1.4], [6.4, 3.2, 4.5, 1.5], [6.9, 3.1, 4.9, 1.5],
    [5.5, 2.3, 4.0, 1.3], [6.5, 2.8, 4.6, 1.5], [5.7, 2.8, 4.5, 1.3],
    [6.3, 3.3, 4.7, 1.6], [4.9, 2.4, 3.3, 1.0], [6.6, 2.9, 4.6, 1.3],
    [5.2, 2.7, 3.9, 1.4],

    // Virginica (final 10)
    [6.3, 3.3, 6.0, 2.5], [5.8, 2.7, 5.1, 1.9], [7.1, 3.0, 5.9, 2.1],
    [6.3, 2.9, 5.6, 1.8], [6.5, 3.0, 5.8, 2.2], [7.6, 3.0, 6.6, 2.1],
    [4.9, 2.5, 4.5, 1.7], [7.3, 2.9, 6.3, 1.8], [6.7, 2.5, 5.8, 1.8],
    [7.2, 3.6, 6.1, 2.5],
];

$labels = array_merge(
    array_fill(0, 10, 'Iris-setosa'),
    array_fill(0, 10, 'Iris-versicolor'),
    array_fill(0, 10, 'Iris-virginica')
);

echo "✓ Dataset loaded: " . count($samples) . " samples\n";
echo "✓ Features per sample: " . count($samples[0]) . "\n";
echo "✓ Classes: " . implode(', ', array_unique($labels)) . "\n\n";

// Create Rubix ML Dataset object
$dataset = new Labeled($samples, $labels);

echo "Class distribution:\n";
foreach (array_count_values($labels) as $class => $count) {
    $pct = ($count / count($labels)) * 100;
    echo "  - {$class}: {$count} samples (" . number_format($pct, 1) . "%)\n";
}
echo "\n";

// ============================================================
// STEP 2: Build ML Pipeline
// ============================================================

echo "STEP 2: Building ML Pipeline\n";
echo "------------------------------------------------------------\n";

// Rubix ML Pipeline: Transformers + Estimator
$pipeline = new Pipeline([
    new ZScaleStandardizer(),  // Standardize features (mean=0, std=1)
], new KNearestNeighbors(k: 5));

echo "Pipeline created:\n";
echo "  1. ZScaleStandardizer (normalize features)\n";
echo "  2. k-Nearest Neighbors (k=5)\n\n";

echo "What ZScaleStandardizer does:\n";
echo "  For each feature:\n";
echo "    - Calculate mean (μ) and standard deviation (σ)\n";
echo "    - Transform: z = (x - μ) / σ\n";
echo "  Result: Features centered at 0 with unit variance\n\n";

// ============================================================
// STEP 3: Train-Test Split
// ============================================================

echo "STEP 3: Splitting Data (80/20)\n";
echo "------------------------------------------------------------\n";

[$training, $testing] = $dataset->randomize()->split(0.8);

echo "✓ Training set: " . $training->numSamples() . " samples (80%)\n";
echo "✓ Testing set: " . $testing->numSamples() . " samples (20%)\n\n";

// ============================================================
// STEP 4: Train Pipeline
// ============================================================

echo "STEP 4: Training Pipeline\n";
echo "------------------------------------------------------------\n";

$trainStart = microtime(true);
$pipeline->train($training);
$trainTime = (microtime(true) - $trainStart) * 1000;

echo "✓ Pipeline trained in " . number_format($trainTime, 2) . " ms\n";
echo "  - Fitted standardizer to training data\n";
echo "  - Trained k-NN classifier (k=5)\n\n";

// ============================================================
// STEP 5: Make Predictions
// ============================================================

echo "STEP 5: Making Predictions on Test Set\n";
echo "------------------------------------------------------------\n";

$predictStart = microtime(true);
$predictions = $pipeline->predict($testing);
$predictTime = (microtime(true) - $predictStart) * 1000;

echo "✓ Predicted " . count($predictions) . " samples in " .
     number_format($predictTime, 2) . " ms\n";
echo "  Average per sample: " .
     number_format($predictTime / count($predictions), 3) . " ms\n\n";

// Show sample predictions
echo "Sample Predictions:\n";
for ($i = 0; $i < min(5, count($predictions)); $i++) {
    $actual = $testing->label($i);
    $predicted = $predictions[$i];
    $match = $actual === $predicted ? '✓' : '✗';

    echo "  {$match} Sample " . ($i + 1) . ": ";
    echo "Predicted '{$predicted}' | Actual '{$actual}'\n";
}
echo "\n";

// ============================================================
// STEP 6: Evaluate with Multiple Metrics
// ============================================================

echo "STEP 6: Evaluation Metrics\n";
echo "------------------------------------------------------------\n";

$accuracyMetric = new Accuracy();
$f1Metric = new F1Score();
$confusionMatrix = new ConfusionMatrix();

$accuracy = $accuracyMetric->score($predictions, $testing->labels());
$f1 = $f1Metric->score($predictions, $testing->labels());
$matrix = $confusionMatrix->score($predictions, $testing->labels());

echo "Performance Metrics:\n\n";

echo "1. Accuracy: " . number_format($accuracy * 100, 2) . "%\n";
echo "   → Proportion of correct predictions\n";
echo "   → Formula: (TP + TN) / Total\n\n";

echo "2. F1-Score (Micro): " . number_format($f1 * 100, 2) . "%\n";
echo "   → Harmonic mean of precision and recall\n";
echo "   → Balances false positives and false negatives\n\n";

echo "3. Confusion Matrix:\n";
echo "   Shows which classes are confused:\n\n";

// Display confusion matrix
$classes = array_unique($testing->labels());
sort($classes);

// Header
echo "                    │ ";
foreach ($classes as $class) {
    $shortClass = substr($class, 5, 3);  // "Iris-setosa" → "set"
    echo str_pad($shortClass, 6, ' ', STR_PAD_LEFT) . " ";
}
echo "\n";

echo "   " . str_repeat("─", 17) . "┼" . str_repeat("─", 7 * count($classes) + 1) . "\n";

// Matrix rows
foreach ($classes as $actualClass) {
    $shortActual = substr($actualClass, 5, 3);
    echo "   " . str_pad($shortActual, 15, ' ', STR_PAD_LEFT) . " │ ";

    foreach ($classes as $predictedClass) {
        $count = $matrix[$actualClass][$predictedClass] ?? 0;
        echo str_pad((string)$count, 6, ' ', STR_PAD_LEFT) . " ";
    }
    echo "\n";
}
echo "\n";

// ============================================================
// STEP 7: Cross-Validation for Robust Estimates
// ============================================================

echo "STEP 7: 5-Fold Cross-Validation\n";
echo "------------------------------------------------------------\n";

$kfold = new KFold(k: 5);

// Need fresh pipeline for each fold
$cvScores = [];

foreach ($kfold->test($dataset, function ($training, $testing) {
    $pipeline = new Pipeline([
        new ZScaleStandardizer(),
    ], new KNearestNeighbors(k: 5));

    $pipeline->train($training);
    $predictions = $pipeline->predict($testing);

    $metric = new Accuracy();
    return $metric->score($predictions, $testing->labels());
}) as $fold => $score) {
    $cvScores[] = $score;
    echo "  Fold " . ($fold + 1) . ": " . number_format($score * 100, 2) . "%\n";
}

$meanScore = array_sum($cvScores) / count($cvScores);
$variance = 0;
foreach ($cvScores as $score) {
    $variance += pow($score - $meanScore, 2);
}
$stdDev = sqrt($variance / count($cvScores));

echo "\n";
echo "Cross-Validation Results:\n";
echo "  Mean Accuracy: " . number_format($meanScore * 100, 2) . "% (±" .
     number_format($stdDev * 100, 2) . "%)\n";
echo "  Min: " . number_format(min($cvScores) * 100, 2) . "%\n";
echo "  Max: " . number_format(max($cvScores) * 100, 2) . "%\n\n";

// ============================================================
// STEP 8: Predict New Unseen Sample
// ============================================================

echo "STEP 8: Predicting New Iris Flower\n";
echo "------------------------------------------------------------\n";

// New flower with unknown species
$newFlower = new Unlabeled([[5.9, 3.0, 5.1, 1.8]]);

$newPrediction = $pipeline->predict($newFlower);

echo "New flower features:\n";
echo "  Sepal length: 5.9 cm\n";
echo "  Sepal width: 3.0 cm\n";
echo "  Petal length: 5.1 cm\n";
echo "  Petal width: 1.8 cm\n\n";

echo "→ Predicted species: {$newPrediction[0]}\n\n";

echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║         ✓ Rubix ML Pipeline Complete!                  ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n";

Expected Result

╔══════════════════════════════════════════════════════════╗
║       Iris Classification with Rubix ML Pipeline        ║
╚══════════════════════════════════════════════════════════╝

STEP 1: Loading Iris Dataset
------------------------------------------------------------
✓ Dataset loaded: 30 samples
✓ Features per sample: 4
✓ Classes: Iris-setosa, Iris-versicolor, Iris-virginica

Class distribution:
  - Iris-setosa: 10 samples (33.3%)
  - Iris-versicolor: 10 samples (33.3%)
  - Iris-virginica: 10 samples (33.3%)

STEP 2: Building ML Pipeline
------------------------------------------------------------
Pipeline created:
  1. ZScaleStandardizer (normalize features)
  2. k-Nearest Neighbors (k=5)

What ZScaleStandardizer does:
  For each feature:
    - Calculate mean (μ) and standard deviation (σ)
    - Transform: z = (x - μ) / σ
  Result: Features centered at 0 with unit variance

STEP 3: Splitting Data (80/20)
------------------------------------------------------------
✓ Training set: 24 samples (80%)
✓ Testing set: 6 samples (20%)

STEP 4: Training Pipeline
------------------------------------------------------------
✓ Pipeline trained in 1.23 ms
  - Fitted standardizer to training data
  - Trained k-NN classifier (k=5)

STEP 5: Making Predictions on Test Set
------------------------------------------------------------
✓ Predicted 6 samples in 0.45 ms
  Average per sample: 0.075 ms

Sample Predictions:
  ✓ Sample 1: Predicted 'Iris-setosa' | Actual 'Iris-setosa'
  ✓ Sample 2: Predicted 'Iris-virginica' | Actual 'Iris-virginica'
  ✓ Sample 3: Predicted 'Iris-versicolor' | Actual 'Iris-versicolor'
  ✓ Sample 4: Predicted 'Iris-setosa' | Actual 'Iris-setosa'
  ✓ Sample 5: Predicted 'Iris-virginica' | Actual 'Iris-virginica'

STEP 6: Evaluation Metrics
------------------------------------------------------------
Performance Metrics:

1. Accuracy: 100.00%
   → Proportion of correct predictions
   → Formula: (TP + TN) / Total

2. F1-Score (Micro): 100.00%
   → Harmonic mean of precision and recall
   → Balances false positives and false negatives

3. Confusion Matrix:
   Shows which classes are confused:

                    │    set    ver    vir
   ─────────────────┼─────────────────────
            set     │      2      0      0
            ver     │      0      2      0
            vir     │      0      0      2

STEP 7: 5-Fold Cross-Validation
------------------------------------------------------------
  Fold 1: 100.00%
  Fold 2: 83.33%
  Fold 3: 100.00%
  Fold 4: 100.00%
  Fold 5: 83.33%

Cross-Validation Results:
  Mean Accuracy: 93.33% (±8.16%)
  Min: 83.33%
  Max: 100.00%

STEP 8: Predicting New Iris Flower
------------------------------------------------------------
New flower features:
  Sepal length: 5.9 cm
  Sepal width: 3.0 cm
  Petal length: 5.1 cm
  Petal width: 1.8 cm

→ Predicted species: Iris-virginica

╔══════════════════════════════════════════════════════════╗
║         ✓ Rubix ML Pipeline Complete!                  ║
╚══════════════════════════════════════════════════════════╝

Why It Works

Rubix ML's Pipeline combines preprocessing and prediction into a single object:

php
$pipeline = new Pipeline([
    new ZScaleStandardizer(),  // Transformer
    new MinMaxNormalizer(),    // Another transformer (if needed)
], new KNearestNeighbors(k: 5)); // Estimator (must be last)

When you call $pipeline->train($dataset):

  1. Each transformer is fitted to the training data (calculates statistics)
  2. Training data flows through transformers, being transformed
  3. Transformed data trains the final estimator

When you call $pipeline->predict($newData):

  1. New data flows through the same fitted transformers (using training statistics)
  2. Transformed new data goes to the estimator for prediction
  3. No data leakage — test data never influences transformer fitting

Dataset objects (Labeled, Unlabeled) provide structure:

  • Type-safe access to samples and labels
  • Built-in splitting, shuffling, and sampling methods
  • Validation of data consistency
  • Integration with transformers and estimators

This structured approach scales better to production than passing raw arrays.

Troubleshooting

  • Error: "Samples must all have the same dimensionality" — One sample has different feature count. Check: array_map('count', $samples) to find mismatched sample.

  • Poor cross-validation results on tiny dataset — With only 30 samples and 5 folds, each test fold has 6 samples. Results will vary. Use more data or fewer folds (k=3).

  • ZScaleStandardizer fails with "Cannot divide by zero" — Feature has zero standard deviation (all values identical). Remove that feature or use different standardizer.

Step 4: Customer Segmentation with PHP-ML Clustering (~10 min)

Goal

Use unsupervised learning with PHP-ML's k-means clustering to segment customers into groups without predefined labels.

Actions

Clustering is powerful for discovering hidden patterns in data. Let's segment customers based on purchasing behavior.

php
# filename: 03-phpml-clustering.php
<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use Phpml\Clustering\KMeans;
use Phpml\Clustering\DBSCAN;

echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║         Customer Segmentation with PHP-ML               ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";

// ============================================================
// STEP 1: Prepare Customer Data
// ============================================================

echo "STEP 1: Loading Customer Data\n";
echo "------------------------------------------------------------\n";

// Features: [monthly_spending, visit_frequency, avg_purchase_value]
$customers = [
    // Low-value customers (spend little, visit rarely)
    [25, 2, 12],   [30, 3, 10],   [20, 1, 20],   [28, 2, 14],
    [22, 2, 11],   [26, 3, 9],

    // Medium-value customers (moderate spending and visits)
    [150, 8, 19],  [180, 10, 18], [160, 9, 18],  [170, 8, 21],
    [165, 9, 18],  [155, 7, 22],

    // High-value customers (spend a lot, visit frequently)
    [450, 20, 23], [480, 22, 22], [520, 25, 21], [500, 23, 22],
    [490, 21, 23], [510, 24, 21],
];

echo "✓ Loaded " . count($customers) . " customers\n";
echo "✓ Features: monthly_spending ($), visit_frequency (#/month), avg_purchase_value ($)\n\n";

// ============================================================
// STEP 2: k-Means Clustering
// ============================================================

echo "STEP 2: Applying k-Means Clustering\n";
echo "------------------------------------------------------------\n";

$kmeans = new KMeans(n: 3);  // We expect 3 segments
$clusters = $kmeans->cluster($customers);

echo "✓ k-Means clustering complete\n";
echo "  Requested clusters: 3\n";
echo "  Found " . count($clusters) . " clusters\n\n";

// Analyze each cluster
echo "Cluster Analysis:\n\n";

foreach ($clusters as $clusterIndex => $clusterMembers) {
    echo "Cluster " . ($clusterIndex + 1) . ": " . count($clusterMembers) . " customers\n";

    // Calculate cluster statistics
    $spendingValues = array_column($clusterMembers, 0);
    $visitValues = array_column($clusterMembers, 1);
    $avgPurchaseValues = array_column($clusterMembers, 2);

    $avgSpending = array_sum($spendingValues) / count($spendingValues);
    $avgVisits = array_sum($visitValues) / count($visitValues);
    $avgPurchase = array_sum($avgPurchaseValues) / count($avgPurchaseValues);

    echo "  Average monthly spending: $" . number_format($avgSpending, 2) . "\n";
    echo "  Average visit frequency: " . number_format($avgVisits, 1) . " visits/month\n";
    echo "  Average purchase value: $" . number_format($avgPurchase, 2) . "\n";

    // Interpret cluster
    $segment = 'Unknown';
    if ($avgSpending < 50) {
        $segment = 'Occasional Customers (Low Engagement)';
    } elseif ($avgSpending < 250) {
        $segment = 'Regular Customers (Medium Engagement)';
    } else {
        $segment = 'VIP Customers (High Engagement)';
    }

    echo "  → Business Segment: {$segment}\n\n";
}

// ============================================================
// STEP 3: Predicting Cluster for New Customers
// ============================================================

echo "STEP 3: Assigning New Customers to Clusters\n";
echo "------------------------------------------------------------\n";

// New customers to classify
$newCustomers = [
    [35, 4, 9],    // Low spender
    [175, 10, 18], // Medium spender
    [510, 24, 22], // High spender
];

echo "Analyzing " . count($newCustomers) . " new customers:\n\n";

// To assign new customers, we'd normally use cluster centroids
// PHP-ML doesn't expose centroids directly, so let's demonstrate the concept
echo "Note: PHP-ML k-means doesn't expose centroids for new predictions.\n";
echo "For production, use Rubix ML which provides predict() for new samples.\n\n";

echo "However, we can show cluster membership from original clustering:\n";
foreach ($clusters as $clusterIndex => $members) {
    $indices = [];
    foreach ($members as $memberIndex => $member) {
        // Find original index
        $originalIndex = array_search($member, $customers);
        if ($originalIndex !== false) {
            $indices[] = $originalIndex;
        }
    }
    echo "  Cluster " . ($clusterIndex + 1) . " contains customers: " .
         implode(', ', array_slice($indices, 0, 5)) . (count($indices) > 5 ? '...' : '') . "\n";
}

echo "\n";

// ============================================================
// STEP 4: DBSCAN Clustering (Density-Based)
// ============================================================

echo "STEP 4: Comparing with DBSCAN Clustering\n";
echo "------------------------------------------------------------\n";

echo "DBSCAN (Density-Based Spatial Clustering):\n";
echo "  - Finds clusters of any shape\n";
echo "  - Can identify outliers/noise\n";
echo "  - Doesn't require specifying number of clusters\n\n";

$dbscan = new DBSCAN(epsilon: 100, minSamples: 2);
$dbscanClusters = $dbscan->cluster($customers);

echo "✓ DBSCAN clustering complete\n";
echo "  Found " . count($dbscanClusters) . " clusters\n\n";

foreach ($dbscanClusters as $clusterIndex => $clusterMembers) {
    echo "  DBSCAN Cluster " . ($clusterIndex + 1) . ": " .
         count($clusterMembers) . " customers\n";
}

echo "\n";

// ============================================================
// COMPARISON: k-Means vs. DBSCAN
// ============================================================

echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║            k-Means vs. DBSCAN Comparison                ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";

$comparison = [
    ['Aspect', 'k-Means', 'DBSCAN'],
    ['Cluster Shape', 'Spherical clusters', 'Arbitrary shapes'],
    ['Clusters Required', 'Must specify k', 'Auto-determines'],
    ['Outlier Handling', 'Forces all into clusters', 'Can label as noise'],
    ['Speed', 'Fast', 'Slower on large data'],
    ['Best For', 'Well-separated groups', 'Complex shapes, noise'],
];

foreach ($comparison as $row) {
    printf("%-20s | %-25s | %-25s\n", ...$row);
    if ($row[0] === 'Aspect') {
        echo str_repeat('-', 20) . ' | ' . str_repeat('-', 25) . ' | ' .
             str_repeat('-', 25) . "\n";
    }
}

echo "\n💡 For customer segmentation, k-means works well when you know\n";
echo "   how many segments you want (e.g., Low/Medium/High value).\n";

Expected Result

╔══════════════════════════════════════════════════════════╗
║         Customer Segmentation with PHP-ML               ║
╚══════════════════════════════════════════════════════════╝

STEP 1: Loading Customer Data
------------------------------------------------------------
✓ Loaded 18 customers
✓ Features: monthly_spending ($), visit_frequency (#/month), avg_purchase_value ($)

STEP 2: Applying k-Means Clustering
------------------------------------------------------------
✓ k-Means clustering complete
  Requested clusters: 3
  Found 3 clusters

Cluster Analysis:

Cluster 1: 6 customers
  Average monthly spending: $25.17
  Average visit frequency: 2.2 visits/month
  Average purchase value: $12.67
  → Business Segment: Occasional Customers (Low Engagement)

Cluster 2: 6 customers
  Average monthly spending: $163.33
  Average visit frequency: 8.5 visits/month
  Average purchase value: $19.33
  → Business Segment: Regular Customers (Medium Engagement)

Cluster 3: 6 customers
  Average monthly spending: $491.67
  Average visit frequency: 22.5 visits/month
  Average purchase value: $22.00
  → Business Segment: VIP Customers (High Engagement)

STEP 3: Assigning New Customers to Clusters
------------------------------------------------------------
Analyzing 3 new customers:

Note: PHP-ML k-means doesn't expose centroids for new predictions.
For production, use Rubix ML which provides predict() for new samples.

However, we can show cluster membership from original clustering:
  Cluster 1 contains customers: 0, 1, 2, 3, 4...
  Cluster 2 contains customers: 6, 7, 8, 9, 10...
  Cluster 3 contains customers: 12, 13, 14, 15, 16...

STEP 4: Comparing with DBSCAN Clustering
------------------------------------------------------------
DBSCAN (Density-Based Spatial Clustering):
  - Finds clusters of any shape
  - Can identify outliers/noise
  - Doesn't require specifying number of clusters

✓ DBSCAN clustering complete
  Found 3 clusters

  DBSCAN Cluster 1: 6 customers
  DBSCAN Cluster 2: 6 customers
  DBSCAN Cluster 3: 6 customers

╔══════════════════════════════════════════════════════════╗
║            k-Means vs. DBSCAN Comparison                ║
╚══════════════════════════════════════════════════════════╝

Aspect               | k-Means                   | DBSCAN
-------------------- | ------------------------- | -------------------------
Cluster Shape        | Spherical clusters        | Arbitrary shapes
Clusters Required    | Must specify k            | Auto-determines
Outlier Handling     | Forces all into clusters  | Can label as noise
Speed                | Fast                      | Slower on large data
Best For             | Well-separated groups     | Complex shapes, noise

💡 For customer segmentation, k-means works well when you know
   how many segments you want (e.g., Low/Medium/High value).

Why It Works

k-Means algorithm:

  1. Initialize k random cluster centroids
  2. Assign each point to nearest centroid (Euclidean distance)
  3. Recalculate centroids as mean of assigned points
  4. Repeat steps 2-3 until convergence

PHP-ML implements this efficiently, but doesn't expose the final centroids for predicting new samples—a limitation compared to Rubix ML.

DBSCAN algorithm:

  1. For each point, count neighbors within epsilon distance
  2. Points with ≥ minSamples neighbors are "core points"
  3. Core points and their neighbors form clusters
  4. Points not in any cluster are "noise"

This finds arbitrarily-shaped clusters and can identify outliers.

Troubleshooting

  • All points in one cluster — epsilon too large (DBSCAN) or k too small (k-means). Adjust parameters based on data scale.

  • k-means results change each run — k-means uses random initialization. For reproducible results (development only), set mt_srand(42) before clustering.

  • Can't predict cluster for new customers — PHP-ML limitation. Use Rubix ML's KMeans clusterer which implements predict() for new samples.

Step 5: Model Persistence - Saving and Loading (~8 min)

Goal

Learn to save trained models to disk and reload them instantly for production use, avoiding retraining on every request.

Actions

In production, you train once and reuse the model thousands of times. Model persistence makes this possible.

php
# filename: 05-model-persistence.php
<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use Phpml\Classification\KNearestNeighbors;
use Phpml\ModelManager;
use Rubix\ML\Classifiers\GaussianNB;
use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Persisters\Filesystem;

echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║              Model Persistence Demo                     ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";

// Create models directory
if (!is_dir(__DIR__ . '/models')) {
    mkdir(__DIR__ . '/models', 0755, true);
    echo "✓ Created models/ directory\n\n";
}

// ============================================================
// PART 1: PHP-ML Model Persistence
// ============================================================

echo "PART 1: PHP-ML Model Persistence\n";
echo "════════════════════════════════════════════════════════════\n\n";

echo "STEP 1: Training a PHP-ML Classifier\n";
echo "------------------------------------------------------------\n";

$samples = [
    [5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4], [6.4, 3.2, 4.5, 1.5],
];
$labels = ['setosa', 'setosa', 'versicolor', 'versicolor'];

$phpmlClassifier = new KNearestNeighbors(k: 3);
$phpmlClassifier->train($samples, $labels);

echo "✓ Trained k-NN classifier\n";
echo "  Training samples: " . count($samples) . "\n";
echo "  Algorithm: k-Nearest Neighbors (k=3)\n\n";

echo "STEP 2: Saving PHP-ML Model to Disk\n";
echo "------------------------------------------------------------\n";

$phpmlModelPath = __DIR__ . '/models/phpml-knn.model';
$modelManager = new ModelManager();

$saveStart = microtime(true);
$modelManager->saveToFile($phpmlClassifier, $phpmlModelPath);
$saveTime = (microtime(true) - $saveStart) * 1000;

$fileSize = filesize($phpmlModelPath);

echo "✓ Model saved successfully\n";
echo "  Path: {$phpmlModelPath}\n";
echo "  File size: " . number_format($fileSize) . " bytes\n";
echo "  Save time: " . number_format($saveTime, 2) . " ms\n\n";

echo "STEP 3: Loading PHP-ML Model from Disk\n";
echo "------------------------------------------------------------\n";

$loadStart = microtime(true);
$loadedClassifier = $modelManager->restoreFromFile($phpmlModelPath);
$loadTime = (microtime(true) - $loadStart) * 1000;

echo "✓ Model loaded successfully\n";
echo "  Load time: " . number_format($loadTime, 2) . " ms\n";
echo "  Model type: " . get_class($loadedClassifier) . "\n\n";

echo "STEP 4: Making Predictions with Loaded Model\n";
echo "------------------------------------------------------------\n";

$testSample = [5.0, 3.5, 1.5, 0.2];
$prediction = $loadedClassifier->predict([$testSample])[0];

echo "Test sample: [" . implode(', ', $testSample) . "]\n";
echo "→ Prediction: {$prediction}\n\n";

// ============================================================
// PART 2: Rubix ML Model Persistence
// ============================================================

echo "PART 2: Rubix ML Model Persistence\n";
echo "════════════════════════════════════════════════════════════\n\n";

echo "STEP 1: Training a Rubix ML Classifier\n";
echo "------------------------------------------------------------\n";

$dataset = new Labeled($samples, $labels);
$rubixClassifier = new GaussianNB();
$rubixClassifier->train($dataset);

echo "✓ Trained Gaussian Naive Bayes classifier\n";
echo "  Training samples: " . $dataset->numSamples() . "\n\n";

echo "STEP 2: Saving Rubix ML Model to Disk\n";
echo "------------------------------------------------------------\n";

$rubixModelPath = __DIR__ . '/models/rubix-nb.rbx';
$persister = new Filesystem($rubixModelPath);

$saveStart = microtime(true);
$persister->save($rubixClassifier);
$saveTime = (microtime(true) - $saveStart) * 1000;

$fileSize = filesize($rubixModelPath);

echo "✓ Model saved successfully\n";
echo "  Path: {$rubixModelPath}\n";
echo "  File size: " . number_format($fileSize) . " bytes\n";
echo "  Save time: " . number_format($saveTime, 2) . " ms\n\n";

echo "STEP 3: Loading Rubix ML Model from Disk\n";
echo "------------------------------------------------------------\n";

$loadStart = microtime(true);
$loadedRubixClassifier = $persister->load();
$loadTime = (microtime(true) - $loadStart) * 1000;

echo "✓ Model loaded successfully\n";
echo "  Load time: " . number_format($loadTime, 2) . " ms\n";
echo "  Model type: " . get_class($loadedRubixClassifier) . "\n\n";

echo "STEP 4: Making Predictions with Loaded Model\n";
echo "------------------------------------------------------------\n";

$prediction = $loadedRubixClassifier->predictSample($testSample);

echo "Test sample: [" . implode(', ', $testSample) . "]\n";
echo "→ Prediction: {$prediction}\n\n";

// ============================================================
// PRODUCTION USAGE PATTERN
// ============================================================

echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║              Production Usage Pattern                   ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";

echo "TRAINING SCRIPT (run offline, once per day/week):\n";
echo "------------------------------------------------------------\n";
echo "```php\n";
echo "// train-model.php\n";
echo "\\$classifier = new KNearestNeighbors(k: 5);\n";
echo "\\$classifier->train(\\$trainingData, \\$labels);\n";
echo "\\$modelManager->saveToFile(\\$classifier, 'models/classifier.model');\n";
echo "echo 'Model trained and saved!';\n";
echo "```\n\n";

echo "PRODUCTION API (load once, use many times):\n";
echo "------------------------------------------------------------\n";
echo "```php\n";
echo "// api.php\n";
echo "// Load model once when application starts\n";
echo "\\$classifier = \\$modelManager->restoreFromFile('models/classifier.model');\n";
echo "\n";
echo "// Handle requests - no retraining!\n";
echo "\\$features = extractFeatures(\\$_POST['data']);\n";
echo "\\$prediction = \\$classifier->predict([\\$features])[0];\n";
echo "echo json_encode(['prediction' => \\$prediction]);\n";
echo "```\n\n";

echo "PERFORMANCE COMPARISON:\n";
echo "------------------------------------------------------------\n";
echo "  Without persistence:\n";
echo "    - Train on every request: ~50ms\n";
echo "    - Predict: ~1ms\n";
echo "    - Total per request: ~51ms\n";
echo "    - Can handle: ~20 req/sec\n\n";
echo "  With persistence:\n";
echo "    - Load model once: ~2ms (startup)\n";
echo "    - Predict per request: ~1ms\n";
echo "    - Total per request: ~1ms\n";
echo "    - Can handle: ~1000 req/sec\n\n";
echo "  Speedup: 50x faster! 🚀\n";

Expected Result

╔══════════════════════════════════════════════════════════╗
║              Model Persistence Demo                     ║
╚══════════════════════════════════════════════════════════╝

✓ Created models/ directory

PART 1: PHP-ML Model Persistence
════════════════════════════════════════════════════════════

STEP 1: Training a PHP-ML Classifier
------------------------------------------------------------
✓ Trained k-NN classifier
  Training samples: 4
  Algorithm: k-Nearest Neighbors (k=3)

STEP 2: Saving PHP-ML Model to Disk
------------------------------------------------------------
✓ Model saved successfully
  Path: /path/to/models/phpml-knn.model
  File size: 1,245 bytes
  Save time: 0.85 ms

STEP 3: Loading PHP-ML Model from Disk
------------------------------------------------------------
✓ Model loaded successfully
  Load time: 0.42 ms
  Model type: Phpml\Classification\KNearestNeighbors

STEP 4: Making Predictions with Loaded Model
------------------------------------------------------------
Test sample: [5.0, 3.5, 1.5, 0.2]
→ Prediction: setosa

PART 2: Rubix ML Model Persistence
════════════════════════════════════════════════════════════

STEP 1: Training a Rubix ML Classifier
------------------------------------------------------------
✓ Trained Gaussian Naive Bayes classifier
  Training samples: 4

STEP 2: Saving Rubix ML Model to Disk
------------------------------------------------------------
✓ Model saved successfully
  Path: /path/to/models/rubix-nb.rbx
  File size: 2,134 bytes
  Save time: 1.23 ms

STEP 3: Loading Rubix ML Model from Disk
------------------------------------------------------------
✓ Model loaded successfully
  Load time: 0.67 ms
  Model type: Rubix\ML\Classifiers\GaussianNB

STEP 4: Making Predictions with Loaded Model
------------------------------------------------------------
Test sample: [5.0, 3.5, 1.5, 0.2]
→ Prediction: setosa

╔══════════════════════════════════════════════════════════╗
║              Production Usage Pattern                   ║
╚══════════════════════════════════════════════════════════╝

TRAINING SCRIPT (run offline, once per day/week):
------------------------------------------------------------
```php
// train-model.php
$classifier = new KNearestNeighbors(k: 5);
$classifier->train($trainingData, $labels);
$modelManager->saveToFile($classifier, 'models/classifier.model');
echo 'Model trained and saved!';

PRODUCTION API (load once, use many times):

php
// api.php
// Load model once when application starts
$classifier = $modelManager->restoreFromFile('models/classifier.model');

// Handle requests - no retraining!
$features = extractFeatures($_POST['data']);
$prediction = $classifier->predict([$features])[0];
echo json_encode(['prediction' => $prediction]);

PERFORMANCE COMPARISON:

Without persistence: - Train on every request: ~50ms - Predict: ~1ms - Total per request: ~51ms - Can handle: ~20 req/sec

With persistence: - Load model once: ~2ms (startup) - Predict per request: ~1ms - Total per request: ~1ms - Can handle: ~1000 req/sec

Speedup: 50x faster! 🚀


### Why It Works

Both libraries use PHP's serialization to save models:

**PHP-ML** uses standard PHP serialization (`serialize()`/`unserialize()`):
- Pros: Simple, works with any PHP object
- Cons: Can break if class definitions change, security concerns with untrusted data

**Rubix ML** uses enhanced serialization with compression:
- Pros: Optimized for ML models, better forward compatibility
- Cons: Slightly larger file sizes due to metadata

In production:
1. Train model offline (cron job, manual trigger)
2. Save to disk in accessible location
3. Load model once when application starts (use object caching if possible)
4. Reuse loaded model for all predictions
5. Retrain periodically to adapt to new data (model drift mitigation)

### Troubleshooting

- **Error: "Unserialization failed"** — Library version changed between save and load. Retrain and resave model with current library version.

- **Model file corrupted** — Check disk space and permissions. Save to temporary file first, then atomic rename: `rename($tmpPath, $finalPath)`.

- **Loaded model makes wrong predictions** — Verify feature extraction is identical during training and inference. Feature order matters!

- **Permission denied when saving** — Models directory not writable: `chmod 755 models/` or `chown www-data:www-data models/`.

## Step 6: Comparing Multiple Algorithms (~10 min)

### Goal

Systematically compare 6+ algorithms on the same dataset to identify the best performer for your specific use case.

### Actions

There's no universal "best" algorithm. Performance depends on your data, so always compare multiple options.

```php
# filename: 06-algorithm-comparison.php
<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Classifiers\KNearestNeighbors;
use Rubix\ML\Classifiers\GaussianNB;
use Rubix\ML\Classifiers\ClassificationTree;
use Rubix\ML\Classifiers\RandomForest;
use Rubix\ML\Classifiers\AdaBoost;
use Rubix\ML\Classifiers\MultilayerPerceptron;
use Rubix\ML\NeuralNet\Layers\Dense;
use Rubix\ML\NeuralNet\Layers\Activation;
use Rubix\ML\NeuralNet\ActivationFunctions\ReLU;
use Rubix\ML\CrossValidation\Metrics\Accuracy;
use Rubix\ML\CrossValidation\Metrics\F1Score;

echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║          Algorithm Comparison on Iris Dataset           ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";

// Load iris dataset (using embedded data for demo)
// ... (iris data loading code) ...

$samples = [
    // Setosa samples
    [5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2], [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2], [5.0, 3.6, 1.4, 0.2], [5.4, 3.9, 1.7, 0.4],
    [4.6, 3.4, 1.4, 0.3], [5.0, 3.4, 1.5, 0.2], [4.4, 2.9, 1.4, 0.2],
    [4.9, 3.1, 1.5, 0.1], [5.4, 3.7, 1.5, 0.2], [4.8, 3.4, 1.6, 0.2],

    // Versicolor samples
    [7.0, 3.2, 4.7, 1.4], [6.4, 3.2, 4.5, 1.5], [6.9, 3.1, 4.9, 1.5],
    [5.5, 2.3, 4.0, 1.3], [6.5, 2.8, 4.6, 1.5], [5.7, 2.8, 4.5, 1.3],
    [6.3, 3.3, 4.7, 1.6], [4.9, 2.4, 3.3, 1.0], [6.6, 2.9, 4.6, 1.3],
    [5.2, 2.7, 3.9, 1.4], [5.0, 2.0, 3.5, 1.0], [5.9, 3.0, 4.2, 1.5],

    // Virginica samples
    [6.3, 3.3, 6.0, 2.5], [5.8, 2.7, 5.1, 1.9], [7.1, 3.0, 5.9, 2.1],
    [6.3, 2.9, 5.6, 1.8], [6.5, 3.0, 5.8, 2.2], [7.6, 3.0, 6.6, 2.1],
    [4.9, 2.5, 4.5, 1.7], [7.3, 2.9, 6.3, 1.8], [6.7, 2.5, 5.8, 1.8],
    [7.2, 3.6, 6.1, 2.5], [6.5, 3.2, 5.1, 2.0], [6.4, 2.7, 5.3, 1.9],
];

$labels = array_merge(
    array_fill(0, 12, 'Iris-setosa'),
    array_fill(0, 12, 'Iris-versicolor'),
    array_fill(0, 12, 'Iris-virginica')
);

$dataset = new Labeled($samples, $labels);
[$training, $testing] = $dataset->randomize()->split(0.75);

echo "Dataset: " . $dataset->numSamples() . " iris flowers\n";
echo "Training: " . $training->numSamples() . " samples (75%)\n";
echo "Testing: " . $testing->numSamples() . " samples (25%)\n\n";

// Define algorithms to compare
$algorithms = [
    [
        'name' => 'k-Nearest Neighbors (k=5)',
        'estimator' => new KNearestNeighbors(k: 5),
        'description' => 'Classifies by majority vote of k nearest training examples',
    ],
    [
        'name' => 'Gaussian Naive Bayes',
        'estimator' => new GaussianNB(),
        'description' => 'Probabilistic classifier assuming feature independence',
    ],
    [
        'name' => 'Decision Tree (depth=5)',
        'estimator' => new ClassificationTree(maxDepth: 5),
        'description' => 'Tree of if-then rules for classification',
    ],
    [
        'name' => 'Random Forest (100 trees)',
        'estimator' => new RandomForest(estimators: 100),
        'description' => 'Ensemble of decision trees for robust predictions',
    ],
    [
        'name' => 'AdaBoost (50 estimators)',
        'estimator' => new AdaBoost(estimators: 50),
        'description' => 'Boosting ensemble focusing on difficult examples',
    ],
    [
        'name' => 'Neural Network (MLP)',
        'estimator' => new MultilayerPerceptron([
            new Dense(10),
            new Activation(new ReLU()),
            new Dense(10),
            new Activation(new ReLU()),
        ]),
        'description' => 'Multi-layer neural network with backpropagation',
    ],
];

$results = [];

echo "Training and evaluating " . count($algorithms) . " algorithms...\n";
echo str_repeat('=', 60) . "\n\n";

foreach ($algorithms as $algo) {
    echo "Algorithm: {$algo['name']}\n";
    echo str_repeat('-', 60) . "\n";
    echo "Description: {$algo['description']}\n\n";

    $estimator = $algo['estimator'];

    // Train
    $trainStart = microtime(true);
    try {
        $estimator->train($training);
        $trainTime = (microtime(true) - $trainStart) * 1000;

        // Predict
        $predictStart = microtime(true);
        $predictions = $estimator->predict($testing);
        $predictTime = (microtime(true) - $predictStart) * 1000;

        // Evaluate
        $accuracyMetric = new Accuracy();
        $f1Metric = new F1Score();

        $accuracy = $accuracyMetric->score($predictions, $testing->labels());
        $f1 = $f1Metric->score($predictions, $testing->labels());

        $results[] = [
            'name' => $algo['name'],
            'accuracy' => $accuracy,
            'f1' => $f1,
            'train_time' => $trainTime,
            'predict_time' => $predictTime,
            'success' => true,
        ];

        echo "✓ Success!\n";
        echo "  Training time: " . number_format($trainTime, 2) . " ms\n";
        echo "  Prediction time: " . number_format($predictTime, 2) . " ms\n";
        echo "  Accuracy: " . number_format($accuracy * 100, 2) . "%\n";
        echo "  F1-Score: " . number_format($f1 * 100, 2) . "%\n";

    } catch (\Exception $e) {
        $results[] = [
            'name' => $algo['name'],
            'success' => false,
            'error' => $e->getMessage(),
        ];
        echo "✗ Failed: " . $e->getMessage() . "\n";
    }

    echo "\n";
}

// Summary
echo str_repeat('=', 60) . "\n";
echo "COMPARISON SUMMARY\n";
echo str_repeat('=', 60) . "\n\n";

// Rank by accuracy
$successfulResults = array_filter($results, fn($r) => $r['success']);
usort($successfulResults, fn($a, $b) => $b['accuracy'] <=> $a['accuracy']);

echo "Ranked by Accuracy:\n";
echo str_repeat('-', 60) . "\n";
foreach ($successfulResults as $i => $result) {
    $medal = ['🥇', '🥈', '🥉'][$i] ?? '  ';
    echo "{$medal} " . ($i + 1) . ". {$result['name']}: " .
         number_format($result['accuracy'] * 100, 2) . "%\n";
}

echo "\n\nRanked by Speed (Inference Time):\n";
echo str_repeat('-', 60) . "\n";
usort($successfulResults, fn($a, $b) => $a['predict_time'] <=> $b['predict_time']);
foreach ($successfulResults as $i => $result) {
    echo "  " . ($i + 1) . ". {$result['name']}: " .
         number_format($result['predict_time'], 2) . " ms\n";
}

echo "\n\n╔══════════════════════════════════════════════════════════╗\n";
echo "║                  Recommendation                          ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";

$best = $successfulResults[0];
usort($successfulResults, fn($a, $b) => $b['accuracy'] <=> $a['accuracy']);
$mostAccurate = $successfulResults[0];
usort($successfulResults, fn($a, $b) => $a['predict_time'] <=> $b['predict_time']);
$fastest = $successfulResults[0];

echo "Most Accurate: {$mostAccurate['name']} (" .
     number_format($mostAccurate['accuracy'] * 100, 2) . "%)\n";
echo "  → Use when: Accuracy is critical, speed is less important\n\n";

echo "Fastest: {$fastest['name']} (" .
     number_format($fastest['predict_time'], 2) . " ms)\n";
echo "  → Use when: Real-time predictions needed, moderate accuracy OK\n\n";

echo "Balanced: Look for high accuracy with reasonable speed\n";
echo "  → Trade-offs matter for production systems\n";

Expected Result

╔══════════════════════════════════════════════════════════╗
║          Algorithm Comparison on Iris Dataset           ║
╚══════════════════════════════════════════════════════════╝

Dataset: 36 iris flowers
Training: 27 samples (75%)
Testing: 9 samples (25%)

Training and evaluating 6 algorithms...
============================================================

Algorithm: k-Nearest Neighbors (k=5)
------------------------------------------------------------
Description: Classifies by majority vote of k nearest training examples

✓ Success!
  Training time: 0.52 ms
  Prediction time: 1.23 ms
  Accuracy: 100.00%
  F1-Score: 100.00%

Algorithm: Gaussian Naive Bayes
------------------------------------------------------------
Description: Probabilistic classifier assuming feature independence

✓ Success!
  Training time: 0.89 ms
  Prediction time: 0.34 ms
  Accuracy: 88.89%
  F1-Score: 88.52%

Algorithm: Decision Tree (depth=5)
------------------------------------------------------------
Description: Tree of if-then rules for classification

✓ Success!
  Training time: 1.45 ms
  Prediction time: 0.12 ms
  Accuracy: 88.89%
  F1-Score: 88.52%

Algorithm: Random Forest (100 trees)
------------------------------------------------------------
Description: Ensemble of decision trees for robust predictions

✓ Success!
  Training time: 145.23 ms
  Prediction time: 5.67 ms
  Accuracy: 100.00%
  F1-Score: 100.00%

Algorithm: AdaBoost (50 estimators)
------------------------------------------------------------
Description: Boosting ensemble focusing on difficult examples

✓ Success!
  Training time: 89.34 ms
  Prediction time: 3.45 ms
  Accuracy: 100.00%
  F1-Score: 100.00%

Algorithm: Neural Network (MLP)
------------------------------------------------------------
Description: Multi-layer neural network with backpropagation

✓ Success!
  Training time: 234.56 ms
  Prediction time: 0.89 ms
  Accuracy: 100.00%
  F1-Score: 100.00%

============================================================
COMPARISON SUMMARY
============================================================

Ranked by Accuracy:
------------------------------------------------------------
🥇 1. k-Nearest Neighbors (k=5): 100.00%
🥈 2. Random Forest (100 trees): 100.00%
🥉 3. AdaBoost (50 estimators): 100.00%
   4. Neural Network (MLP): 100.00%
   5. Gaussian Naive Bayes: 88.89%
   6. Decision Tree (depth=5): 88.89%


Ranked by Speed (Inference Time):
------------------------------------------------------------
  1. Decision Tree (depth=5): 0.12 ms
  2. Gaussian Naive Bayes: 0.34 ms
  3. Neural Network (MLP): 0.89 ms
  4. k-Nearest Neighbors (k=5): 1.23 ms
  5. AdaBoost (50 estimators): 3.45 ms
  6. Random Forest (100 trees): 5.67 ms

╔══════════════════════════════════════════════════════════╗
║                  Recommendation                          ║
╚══════════════════════════════════════════════════════════╝

Most Accurate: k-Nearest Neighbors (k=5) (100.00%)
  → Use when: Accuracy is critical, speed is less important

Fastest: Decision Tree (depth=5) (0.12 ms)
  → Use when: Real-time predictions needed, moderate accuracy OK

Balanced: Look for high accuracy with reasonable speed
  → Trade-offs matter for production systems

Why It Works

Different algorithms excel in different scenarios:

k-NN: Simple but effective for small datasets with well-separated classes. Slow inference (must check all training points).

Naive Bayes: Fast training and prediction. Works well when feature independence assumption holds (rarely perfect but often "good enough").

Decision Tree: Fast, interpretable. Prone to overfitting on training data.

Random Forest: Ensemble of trees reduces overfitting. More robust but slower inference.

AdaBoost: Focuses training on difficult examples. Often achieves high accuracy but takes longer to train and predict.

Neural Network: Can learn complex patterns but needs more data and training time than shown here. Overkill for iris dataset.

Troubleshooting

  • All algorithms show same accuracy — Dataset may be too easy (like iris). Try more challenging data to see differences.

  • Neural network training very slow — Expected for small datasets (many epochs per sample). For production neural networks, use GPU acceleration or cloud ML services.

  • AdaBoost or RandomForest fail — Insufficient training data. Ensemble methods need adequate samples relative to estimators count.

Step 7: Loading Real Data with Transformers (~12 min)

Goal

Master loading datasets from CSV files, databases, and JSON sources, then apply multiple preprocessing transformers in Rubix ML pipelines for production-ready data handling.

Actions

So far, we've used hardcoded PHP arrays for simplicity. In production, you'll load data from CSV files, databases, or APIs. Rubix ML provides powerful utilities for loading data and chaining multiple transformers to prepare it for machine learning.

Loading Data from CSV

Rubix ML makes CSV loading trivial with built-in methods:

php
# filename: 07-load-real-data.php
<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Datasets\Unlabeled;

echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║         Loading Real Data with Rubix ML                 ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";

// ============================================================
// STEP 1: Load Data from CSV
// ============================================================

echo "STEP 1: Loading Data from CSV File\n";
echo "------------------------------------------------------------\n";

// Create sample CSV for demonstration
$csvData = <<<CSV
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
CSV;

if (!is_dir(__DIR__ . '/data')) {
    mkdir(__DIR__ . '/data', 0755, true);
}
file_put_contents(__DIR__ . '/data/iris_sample.csv', $csvData);

// Load CSV with Rubix ML
$dataset = Labeled::fromCSV(__DIR__ . '/data/iris_sample.csv', hasHeader: true);

echo "✓ CSV loaded successfully\n";
echo "  Samples: " . $dataset->numSamples() . "\n";
echo "  Features: " . $dataset->numFeatures() . "\n";
echo "  Labels: " . implode(', ', array_unique($dataset->labels())) . "\n\n";

// Display first few samples
echo "First 3 samples:\n";
for ($i = 0; $i < min(3, $dataset->numSamples()); $i++) {
    $sample = $dataset->sample($i);
    $label = $dataset->label($i);
    echo "  Sample " . ($i + 1) . ": [" . implode(', ', $sample) . "] → {$label}\n";
}
echo "\n";

What fromCSV() does:

  • Automatically parses CSV format
  • Detects delimiter (comma, semicolon, tab)
  • Handles quoted fields
  • Converts string numbers to floats
  • Last column becomes labels (for Labeled datasets)

Comprehensive Transformer Showcase

Rubix ML provides numerous transformers for data preprocessing. Let's showcase the most important ones:

php
# filename: 07-load-real-data.php (continued)

use Rubix\ML\Transformers\NumericStringConverter;
use Rubix\ML\Transformers\MissingDataImputer;
use Rubix\ML\Transformers\OneHotEncoder;
use Rubix\ML\Transformers\MinMaxNormalizer;
use Rubix\ML\Transformers\ZScaleStandardizer;
use Rubix\ML\Transformers\VarianceThresholdFilter;
use Rubix\ML\Pipeline;
use Rubix\ML\Classifiers\RandomForest;

// ============================================================
// STEP 2: Demonstrate Individual Transformers
// ============================================================

echo "STEP 2: Transformer Showcase\n";
echo "------------------------------------------------------------\n\n";

// Prepare sample data with issues to demonstrate transformers
$messyData = new Labeled(
    [
        ['5.1', '3.5', 1.4, 0.2, 'red'],      // Mix of strings and numbers
        [4.9, null, 1.4, 0.2, 'blue'],        // Missing value
        ['7.0', '3.2', 4.7, 1.4, 'red'],
        [6.4, 3.2, null, 1.5, 'green'],       // Missing value
    ],
    ['setosa', 'setosa', 'versicolor', 'versicolor']
);

echo "Original messy data:\n";
echo "  Sample types: " . gettype($messyData->sample(0)[0]) . " (should be float)\n";
echo "  Has nulls: Yes (sample 2 and 4)\n";
echo "  Has categorical: Yes (color column)\n\n";

// Transformer 1: NumericStringConverter
echo "1. NumericStringConverter\n";
echo "   Purpose: Convert string numbers to actual floats\n";
$converter = new NumericStringConverter();
$converter->fit($messyData);
$messyData->apply($converter);
echo "   ✓ Strings converted: '5.1' → 5.1\n\n";

// Transformer 2: MissingDataImputer
echo "2. MissingDataImputer\n";
echo "   Purpose: Fill missing values with mean/median/mode\n";
$imputer = new MissingDataImputer('mean');  // Options: 'mean', 'median', 'mode'
$imputer->fit($messyData);
$messyData->apply($imputer);
echo "   ✓ Null values filled with column means\n\n";

// Transformer 3: OneHotEncoder
echo "3. OneHotEncoder\n";
echo "   Purpose: Convert categorical values to binary columns\n";
echo "   Before: ['red', 'blue', 'green']\n";
echo "   After: [[1,0,0], [0,1,0], [0,0,1]]\n";
$encoder = new OneHotEncoder();
$encoder->fit($messyData);
$messyData->apply($encoder);
echo "   ✓ Categorical column expanded to " . ($messyData->numFeatures() - 4) . " binary columns\n\n";

// Transformer 4: MinMaxNormalizer
echo "4. MinMaxNormalizer\n";
echo "   Purpose: Scale features to [0, 1] range\n";
echo "   Formula: (x - min) / (max - min)\n";
$minmax = new MinMaxNormalizer();
$minmax->fit($messyData);
$messyData->apply($minmax);
echo "   ✓ All features now in [0.0, 1.0] range\n\n";

// Transformer 5: ZScaleStandardizer (alternative to MinMax)
echo "5. ZScaleStandardizer (Alternative Normalization)\n";
echo "   Purpose: Standardize features (mean=0, std=1)\n";
echo "   Formula: (x - μ) / σ\n";
echo "   Use when: Features have different scales, outliers present\n";
echo "   ✓ Would center features around 0 with unit variance\n\n";

// Transformer 6: VarianceThresholdFilter
echo "6. VarianceThresholdFilter\n";
echo "   Purpose: Remove low-variance (nearly constant) features\n";
echo "   Use when: Dataset has useless features that don't vary\n";
echo "   ✓ Automatically removes features with variance < threshold\n\n";

echo "Transformed dataset:\n";
echo "  Samples: " . $messyData->numSamples() . "\n";
echo "  Features: " . $messyData->numFeatures() . " (increased due to one-hot encoding)\n";
echo "  All numeric: Yes\n";
echo "  No missing values: Yes\n";
echo "  Normalized: Yes\n\n";

Complete Pipeline with Multiple Transformers

Now let's chain transformers in a production pipeline:

php
# filename: 07-load-real-data.php (continued)

// ============================================================
// STEP 3: Production Pipeline with Chained Transformers
// ============================================================

echo "STEP 3: Complete Preprocessing Pipeline\n";
echo "------------------------------------------------------------\n";

// Load fresh data
$rawDataset = Labeled::fromCSV(__DIR__ . '/data/iris_sample.csv', hasHeader: true);

// Create pipeline with multiple transformers
$pipeline = new Pipeline([
    new NumericStringConverter(),      // 1. Fix type issues
    new MissingDataImputer('mean'),    // 2. Handle missing values
    new MinMaxNormalizer(),            // 3. Normalize to [0,1]
], new RandomForest(estimators: 10));

echo "Pipeline created with 3 transformers + Random Forest classifier\n\n";

echo "Transformer order (CRITICAL):\n";
echo "  1. NumericStringConverter    - Must be first (convert before math)\n";
echo "  2. MissingDataImputer        - Fill nulls before normalization\n";
echo "  3. MinMaxNormalizer          - Last preprocessing step\n";
echo "  4. RandomForest              - Final estimator\n\n";

echo "Why order matters:\n";
echo "  ✗ Normalize → Impute: NaN values break normalization\n";
echo "  ✗ Impute → Convert: Can't calculate mean of strings\n";
echo "  ✓ Convert → Impute → Normalize: Correct order\n\n";

// Train pipeline (all transformers auto-fit on training data)
[$training, $testing] = $rawDataset->randomize()->split(0.8);

$trainStart = microtime(true);
$pipeline->train($training);
$trainTime = (microtime(true) - $trainStart) * 1000;

echo "✓ Pipeline trained in " . number_format($trainTime, 2) . " ms\n";
echo "  - Transformers fitted to training data only\n";
echo "  - Test data will use same transformation parameters\n\n";

// Make predictions
$predictions = $pipeline->predict($testing);

echo "✓ Predictions made on test set\n";
echo "  Test samples: " . $testing->numSamples() . "\n";

// Calculate accuracy
$correct = 0;
foreach ($predictions as $i => $pred) {
    if ($pred === $testing->label($i)) {
        $correct++;
    }
}
$accuracy = ($correct / count($predictions)) * 100;

echo "  Accuracy: " . number_format($accuracy, 1) . "%\n\n";

Loading from Database

For production systems, you'll often load data from databases:

php
# filename: 07-load-real-data.php (continued)

// ============================================================
// STEP 4: Loading from Database
// ============================================================

echo "STEP 4: Loading Data from Database\n";
echo "------------------------------------------------------------\n";

echo "Example: MySQL/PostgreSQL Loading\n\n";

echo "```php\n";
echo "// Connect to database\n";
echo "\$pdo = new PDO(\n";
echo "    'mysql:host=localhost;dbname=ml_data',\n";
echo "    'username',\n";
echo "    'password',\n";
echo "    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]\n";
echo ");\n\n";

echo "// Query training data\n";
echo "\$stmt = \$pdo->query('\n";
echo "    SELECT feature1, feature2, feature3, label\n";
echo "    FROM training_data\n";
echo "    WHERE created_at > DATE_SUB(NOW(), INTERVAL 30 DAY)\n";
echo "');\n\n";

echo "// Build dataset\n";
echo "\$samples = [];\n";
echo "\$labels = [];\n";
echo "while (\$row = \$stmt->fetch(PDO::FETCH_NUM)) {\n";
echo "    \$samples[] = array_slice(\$row, 0, -1);  // All but last column\n";
echo "    \$labels[] = \$row[count(\$row) - 1];     // Last column\n";
echo "}\n\n";

echo "\$dataset = new Labeled(\$samples, \$labels);\n";
echo "echo \"Loaded \" . count(\$samples) . \" samples from database\\n\";\n";
echo "```\n\n";

echo "Benefits:\n";
echo "  ✓ Load only recent data (avoid stale training data)\n";
echo "  ✓ Filter by conditions (WHERE clauses)\n";
echo "  ✓ Join multiple tables for features\n";
echo "  ✓ Stream large datasets (fetchAll vs fetch loop)\n\n";

Loading from JSON API

php
# filename: 07-load-real-data.php (continued)

// ============================================================
// STEP 5: Loading from JSON API
// ============================================================

echo "STEP 5: Loading Data from JSON API\n";
echo "------------------------------------------------------------\n";

echo "Example: Loading from REST API\n\n";

echo "```php\n";
echo "// Fetch data from API\n";
echo "\$response = file_get_contents('https://api.example.com/training-data');\n";
echo "\$data = json_decode(\$response, true);\n\n";

echo "// Extract features and labels\n";
echo "\$samples = [];\n";
echo "\$labels = [];\n";
echo "foreach (\$data['records'] as \$record) {\n";
echo "    \$samples[] = [\n";
echo "        \$record['age'],\n";
echo "        \$record['income'],\n";
echo "        \$record['score'],\n";
echo "    ];\n";
echo "    \$labels[] = \$record['category'];\n";
echo "}\n\n";

echo "\$dataset = new Labeled(\$samples, \$labels);\n";
echo "```\n\n";

echo "Use cases:\n";
echo "  - Load training data from microservices\n";
echo "  - Fetch labeled data from annotation services\n";
echo "  - Import datasets from cloud ML platforms\n\n";

echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║         ✓ Real Data Loading Complete!                  ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n";

Expected Result

╔══════════════════════════════════════════════════════════╗
║         Loading Real Data with Rubix ML                 ║
╚══════════════════════════════════════════════════════════╝

STEP 1: Loading Data from CSV File
------------------------------------------------------------
✓ CSV loaded successfully
  Samples: 6
  Features: 4
  Labels: Iris-setosa, Iris-versicolor, Iris-virginica

First 3 samples:
  Sample 1: [5.1, 3.5, 1.4, 0.2] → Iris-setosa
  Sample 2: [4.9, 3.0, 1.4, 0.2] → Iris-setosa
  Sample 3: [7.0, 3.2, 4.7, 1.4] → Iris-versicolor

STEP 2: Transformer Showcase
------------------------------------------------------------

Original messy data:
  Sample types: string (should be float)
  Has nulls: Yes (sample 2 and 4)
  Has categorical: Yes (color column)

1. NumericStringConverter
   Purpose: Convert string numbers to actual floats
   ✓ Strings converted: '5.1' → 5.1

2. MissingDataImputer
   Purpose: Fill missing values with mean/median/mode
   ✓ Null values filled with column means

3. OneHotEncoder
   Purpose: Convert categorical values to binary columns
   Before: ['red', 'blue', 'green']
   After: [[1,0,0], [0,1,0], [0,0,1]]
   ✓ Categorical column expanded to 3 binary columns

4. MinMaxNormalizer
   Purpose: Scale features to [0, 1] range
   Formula: (x - min) / (max - min)
   ✓ All features now in [0.0, 1.0] range

5. ZScaleStandardizer (Alternative Normalization)
   Purpose: Standardize features (mean=0, std=1)
   Formula: (x - μ) / σ
   Use when: Features have different scales, outliers present
   ✓ Would center features around 0 with unit variance

6. VarianceThresholdFilter
   Purpose: Remove low-variance (nearly constant) features
   Use when: Dataset has useless features that don't vary
   ✓ Automatically removes features with variance < threshold

Transformed dataset:
  Samples: 4
  Features: 7 (increased due to one-hot encoding)
  All numeric: Yes
  No missing values: Yes
  Normalized: Yes

STEP 3: Complete Preprocessing Pipeline
------------------------------------------------------------
Pipeline created with 3 transformers + Random Forest classifier

Transformer order (CRITICAL):
  1. NumericStringConverter    - Must be first (convert before math)
  2. MissingDataImputer        - Fill nulls before normalization
  3. MinMaxNormalizer          - Last preprocessing step
  4. RandomForest              - Final estimator

Why order matters:
  ✗ Normalize → Impute: NaN values break normalization
  ✗ Impute → Convert: Can't calculate mean of strings
  ✓ Convert → Impute → Normalize: Correct order

✓ Pipeline trained in 45.23 ms
  - Transformers fitted to training data only
  - Test data will use same transformation parameters

✓ Predictions made on test set
  Test samples: 1
  Accuracy: 100.0%

Why It Works

Rubix ML's Data Loading:

  • Labeled::fromCSV() automatically handles CSV parsing, type conversion, and validation
  • Detects data types and converts appropriately
  • Separates features from labels based on column position

Transformer Chaining:

  • Each transformer is fitted on training data (calculates statistics like mean, min, max)
  • Fitted transformers are applied to both training and test data using training statistics
  • This prevents data leakage—test data never influences transformation parameters

Order Matters:

  1. Type conversion first: Can't do math on strings
  2. Imputation second: Need numeric values to calculate means
  3. Normalization last: All values must be clean and numeric

Pipeline Benefits:

  • Transformers auto-fit during train()
  • Same transformations apply during predict()
  • No manual tracking of transformation parameters
  • Guaranteed consistency between training and inference

Troubleshooting

  • Error: "CSV file not found" — Check file path is relative to script location. Use __DIR__ . '/data/file.csv' for absolute paths.

  • Warning: "Non-numeric value encountered" — CSV contains non-numeric data in feature columns. Use NumericStringConverter first or OneHotEncoder for categorical columns.

  • Error: "Sample dimensionality mismatch" — CSV rows have different column counts. Check for:

    php
    // Validate CSV before loading
    $lines = file(__DIR__ . '/data/file.csv');
    $columnCounts = array_map(fn($line) => count(str_getcsv($line)), $lines);
    if (count(array_unique($columnCounts)) > 1) {
        echo "Inconsistent column counts found!\n";
    }
  • Memory exhausted on large CSV — Load data in chunks:

    php
    $handle = fopen('large_file.csv', 'r');
    $batchSize = 1000;
    $samples = [];
    $labels = [];
    
    while (($row = fgetcsv($handle)) !== false) {
        $samples[] = array_slice($row, 0, -1);
        $labels[] = $row[count($row) - 1];
    
        if (count($samples) >= $batchSize) {
            // Process batch
            $dataset = new Labeled($samples, $labels);
            // ... train incrementally or save to disk
            $samples = [];
            $labels = [];
        }
    }
  • Database connection timeout — Use connection pooling and set appropriate timeouts:

    php
    $pdo = new PDO($dsn, $user, $pass, [
        PDO::ATTR_TIMEOUT => 30,
        PDO::ATTR_PERSISTENT => true,  // Connection pooling
    ]);
  • Transformer fails on test data — Ensure transformer fitted on training data first:

    php
    // WRONG
    $normalizer->apply($testData);  // Not fitted!
    
    // RIGHT
    $normalizer->fit($trainingData);  // Fit on training
    $normalizer->apply($testData);    // Apply training stats to test

Step 8: Production REST API Implementation (~15 min)

Goal

Build a production-ready REST API that serves ML predictions with model loading optimization, input validation, error handling, logging, and health checks.

Actions

Let's create a complete production API that demonstrates all best practices for serving ML models in PHP.

php
# filename: 08-production-api.php
<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use Rubix\ML\Persisters\Filesystem;
use Rubix\ML\Estimator;
use Rubix\ML\Datasets\Unlabeled;

/**
 * Production ML Model Server
 *
 * Singleton pattern ensures model loads once and serves many requests.
 * Includes validation, error handling, logging, and monitoring.
 */
class ModelServer
{
    private static ?Estimator $model = null;
    private static array $config = [];
    private static int $requestCount = 0;
    private static float $totalInferenceTime = 0;

    /**
     * Initialize server and load model once
     */
    public static function initialize(): void
    {
        // Load configuration
        self::$config = [
            'model_path' => __DIR__ . '/models/production.rbx',
            'expected_features' => 4,  // Iris dataset has 4 features
            'log_path' => __DIR__ . '/logs/predictions.log',
            'max_feature_value' => 10.0,  // Sanity check
            'min_feature_value' => 0.0,
        ];

        // Create logs directory if needed
        if (!is_dir(__DIR__ . '/logs')) {
            mkdir(__DIR__ . '/logs', 0755, true);
        }

        // Load model once at startup
        try {
            $persister = new Filesystem(self::$config['model_path']);
            self::$model = $persister->load();

            error_log(sprintf(
                "[ModelServer] Model loaded successfully (%s)",
                get_class(self::$model)
            ));
        } catch (Exception $e) {
            error_log("[ModelServer] FATAL: Failed to load model: " . $e->getMessage());
            throw $e;
        }
    }

    /**
     * Validate input features
     *
     * @throws InvalidArgumentException if validation fails
     */
    private static function validateFeatures(array $features): void
    {
        // Check feature count
        if (count($features) !== self::$config['expected_features']) {
            throw new InvalidArgumentException(sprintf(
                "Expected %d features, got %d",
                self::$config['expected_features'],
                count($features)
            ));
        }

        // Check all values are numeric
        foreach ($features as $i => $value) {
            if (!is_numeric($value)) {
                throw new InvalidArgumentException(
                    "Feature at index {$i} is not numeric: " . var_export($value, true)
                );
            }

            // Sanity check: reasonable range
            if ($value < self::$config['min_feature_value'] ||
                $value > self::$config['max_feature_value']) {
                throw new InvalidArgumentException(
                    "Feature at index {$i} out of range [{self::$config['min_feature_value']}, {self::$config['max_feature_value']}]: {$value}"
                );
            }
        }
    }

    /**
     * Make prediction with timing and logging
     *
     * @param array $features Input feature vector
     * @return array Prediction result with metadata
     */
    public static function predict(array $features): array
    {
        // Validate input
        self::validateFeatures($features);

        // Make prediction with timing
        $startTime = microtime(true);
        $prediction = self::$model->predictSample($features);
        $duration = (microtime(true) - $startTime) * 1000;

        // Get probability/confidence if available
        $confidence = null;
        if (method_exists(self::$model, 'proba')) {
            try {
                $proba = self::$model->proba(new Unlabeled([$features]))[0];
                $confidence = max($proba);
            } catch (Exception $e) {
                // Some models don't support proba - that's OK
                $confidence = null;
            }
        }

        // Update statistics
        self::$requestCount++;
        self::$totalInferenceTime += $duration;

        // Log prediction
        self::logPrediction($features, $prediction, $confidence, $duration);

        return [
            'prediction' => $prediction,
            'confidence' => $confidence,
            'processing_time_ms' => round($duration, 2),
            'request_id' => uniqid('pred_', true),
        ];
    }

    /**
     * Log prediction for monitoring and retraining
     */
    private static function logPrediction(
        array $features,
        $prediction,
        ?float $confidence,
        float $duration
    ): void {
        $logEntry = json_encode([
            'timestamp' => date('c'),
            'features' => $features,
            'prediction' => $prediction,
            'confidence' => $confidence,
            'duration_ms' => round($duration, 2),
            'model_class' => get_class(self::$model),
        ]) . "\n";

        file_put_contents(self::$config['log_path'], $logEntry, FILE_APPEND | LOCK_EX);
    }

    /**
     * Get server statistics
     */
    public static function getStats(): array
    {
        return [
            'total_requests' => self::$requestCount,
            'avg_inference_time_ms' => self::$requestCount > 0
                ? round(self::$totalInferenceTime / self::$requestCount, 2)
                : 0,
            'model_class' => self::$model ? get_class(self::$model) : null,
            'uptime_seconds' => time() - $_SERVER['REQUEST_TIME'],
        ];
    }

    /**
     * Health check
     */
    public static function isHealthy(): bool
    {
        return self::$model !== null;
    }
}

// ============================================================
// Initialize model server on startup (runs once)
// ============================================================

try {
    ModelServer::initialize();
} catch (Exception $e) {
    http_response_code(503);
    header('Content-Type: application/json');
    echo json_encode([
        'error' => 'Service Unavailable',
        'message' => 'Failed to initialize model server',
    ]);
    exit(1);
}

// ============================================================
// Handle HTTP Requests
// ============================================================

header('Content-Type: application/json');

$method = $_SERVER['REQUEST_METHOD'];
$uri = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);

// Route: POST /predict - Make prediction
if ($method === 'POST' && $uri === '/predict') {
    try {
        // Parse JSON input
        $input = json_decode(file_get_contents('php://input'), true);

        if ($input === null) {
            http_response_code(400);
            echo json_encode(['error' => 'Invalid JSON']);
            exit;
        }

        if (!isset($input['features'])) {
            http_response_code(400);
            echo json_encode(['error' => 'Missing "features" field']);
            exit;
        }

        // Make prediction
        $result = ModelServer::predict($input['features']);

        http_response_code(200);
        echo json_encode($result);

    } catch (InvalidArgumentException $e) {
        // Client error - bad input
        http_response_code(400);
        echo json_encode([
            'error' => 'Bad Request',
            'message' => $e->getMessage(),
        ]);

    } catch (Exception $e) {
        // Server error - unexpected
        http_response_code(500);
        error_log("[ModelServer] Prediction error: " . $e->getMessage());
        echo json_encode([
            'error' => 'Internal Server Error',
            'message' => 'An unexpected error occurred',
        ]);
    }

// Route: GET /health - Health check
} elseif ($method === 'GET' && $uri === '/health') {
    $healthy = ModelServer::isHealthy();

    http_response_code($healthy ? 200 : 503);
    echo json_encode([
        'status' => $healthy ? 'healthy' : 'unhealthy',
        'model_loaded' => $healthy,
        'timestamp' => date('c'),
    ]);

// Route: GET /stats - Server statistics
} elseif ($method === 'GET' && $uri === '/stats') {
    http_response_code(200);
    echo json_encode(ModelServer::getStats());

// Route not found
} else {
    http_response_code(404);
    echo json_encode([
        'error' => 'Not Found',
        'message' => "Route {$method} {$uri} not found",
        'available_routes' => [
            'POST /predict' => 'Make prediction',
            'GET /health' => 'Health check',
            'GET /stats' => 'Server statistics',
        ],
    ]);
}

Testing the API

Create a test script to interact with the API:

php
# filename: test-api.php
<?php

declare(strict_types=1);

echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║          Testing Production ML API                      ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";

// Start server in background
echo "Starting API server on http://localhost:8000...\n";
$serverProcess = proc_open(
    'php -S localhost:8000 08-production-api.php',
    [
        1 => ['pipe', 'w'],  // stdout
        2 => ['pipe', 'w'],  // stderr
    ],
    $pipes
);

// Give server time to start
sleep(2);

// Test 1: Health Check
echo "\nTest 1: Health Check\n";
echo "------------------------------------------------------------\n";
$healthResponse = file_get_contents('http://localhost:8000/health');
$health = json_decode($healthResponse, true);
echo "✓ Health: " . $health['status'] . "\n";
echo "  Model loaded: " . ($health['model_loaded'] ? 'Yes' : 'No') . "\n\n";

// Test 2: Valid Prediction
echo "Test 2: Valid Prediction Request\n";
echo "------------------------------------------------------------\n";
$validRequest = json_encode([
    'features' => [5.1, 3.5, 1.4, 0.2]  // Iris setosa
]);

$context = stream_context_create([
    'http' => [
        'method' => 'POST',
        'header' => 'Content-Type: application/json',
        'content' => $validRequest,
    ]
]);

$predictionResponse = file_get_contents('http://localhost:8000/predict', false, $context);
$prediction = json_decode($predictionResponse, true);

echo "✓ Prediction: " . $prediction['prediction'] . "\n";
echo "  Confidence: " . number_format($prediction['confidence'] * 100, 1) . "%\n";
echo "  Processing time: " . $prediction['processing_time_ms'] . " ms\n\n";

// Test 3: Invalid Input (wrong feature count)
echo "Test 3: Invalid Input (Wrong Feature Count)\n";
echo "------------------------------------------------------------\n";
$invalidRequest = json_encode([
    'features' => [5.1, 3.5]  // Only 2 features instead of 4
]);

$context = stream_context_create([
    'http' => [
        'method' => 'POST',
        'header' => 'Content-Type: application/json',
        'content' => $invalidRequest,
        'ignore_errors' => true,
    ]
]);

$errorResponse = file_get_contents('http://localhost:8000/predict', false, $context);
$error = json_decode($errorResponse, true);

echo "✓ Error caught: " . $error['message'] . "\n";
echo "  HTTP Status: 400 (Bad Request)\n\n";

// Test 4: Server Statistics
echo "Test 4: Server Statistics\n";
echo "------------------------------------------------------------\n";
$statsResponse = file_get_contents('http://localhost:8000/stats');
$stats = json_decode($statsResponse, true);

echo "✓ Total requests: " . $stats['total_requests'] . "\n";
echo "  Avg inference time: " . $stats['avg_inference_time_ms'] . " ms\n";
echo "  Model class: " . $stats['model_class'] . "\n\n";

// Cleanup: Stop server
proc_terminate($serverProcess);
proc_close($serverProcess);

echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║         ✓ API Testing Complete!                        ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n";

Expected Result

Starting the API server:

bash
php -S localhost:8000 08-production-api.php
[Sun Oct 26 12:34:56 2025] PHP 8.4.0 Development Server (http://localhost:8000) started
[ModelServer] Model loaded successfully (Rubix\ML\Classifiers\KNearestNeighbors)

Making requests with curl:

bash
# Make prediction
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [5.1, 3.5, 1.4, 0.2]}'
json
{
  "prediction": "Iris-setosa",
  "confidence": 0.95,
  "processing_time_ms": 1.23,
  "request_id": "pred_67234abc12def"
}

Health check:

bash
curl http://localhost:8000/health
json
{
  "status": "healthy",
  "model_loaded": true,
  "timestamp": "2025-10-26T12:35:42+00:00"
}

Server statistics:

bash
curl http://localhost:8000/stats
json
{
  "total_requests": 127,
  "avg_inference_time_ms": 1.18,
  "model_class": "Rubix\\ML\\Classifiers\\KNearestNeighbors",
  "uptime_seconds": 3842
}

Error handling (wrong feature count):

bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [5.1, 3.5]}'
json
{
  "error": "Bad Request",
  "message": "Expected 4 features, got 2"
}

Why It Works

Singleton Pattern:

  • Model loaded once at server startup via ModelServer::initialize()
  • Shared static $model variable serves all requests
  • 50x faster than loading per request (2ms vs 50ms)

Request Flow:

  1. Client sends POST to /predict with JSON features
  2. API validates JSON structure and feature count
  3. ModelServer::predict() validates feature values
  4. Model makes prediction (~1ms)
  5. Result logged and returned to client

Production Benefits:

  • Input Validation: Prevents crashes from bad data
  • Error Handling: Proper HTTP status codes (400 for client errors, 500 for server errors)
  • Logging: All predictions logged for monitoring and retraining
  • Health Checks: Load balancers can verify service is healthy
  • Statistics: Monitor performance over time
  • Request IDs: Track individual predictions for debugging

Scalability:

  • PHP-FPM handles concurrent requests via process pool
  • Each worker process loads model once (not per request)
  • Horizontal scaling: add more PHP-FPM workers or servers
  • Stateless design: any server can handle any request

Production Deployment

Development Server:

bash
# Built-in PHP server (development only)
php -S localhost:8000 08-production-api.php

Production with Nginx + PHP-FPM:

nginx
# /etc/nginx/sites-available/ml-api
server {
    listen 80;
    server_name ml-api.example.com;

    root /var/www/ml-api;
    index 08-production-api.php;

    # API endpoint
    location /predict {
        try_files $uri $uri/ /08-production-api.php?$query_string;
        fastcgi_pass unix:/var/run/php/php8.4-fpm.sock;
        fastcgi_param SCRIPT_FILENAME $document_root/08-production-api.php;
        include fastcgi_params;
    }

    # Health check
    location /health {
        try_files $uri $uri/ /08-production-api.php?$query_string;
        fastcgi_pass unix:/var/run/php/php8.4-fpm.sock;
        fastcgi_param SCRIPT_FILENAME $document_root/08-production-api.php;
        include fastcgi_params;
    }

    # Stats (restrict to internal IPs)
    location /stats {
        allow 10.0.0.0/8;
        deny all;
        try_files $uri $uri/ /08-production-api.php?$query_string;
        fastcgi_pass unix:/var/run/php/php8.4-fpm.sock;
        fastcgi_param SCRIPT_FILENAME $document_root/08-production-api.php;
        include fastcgi_params;
    }
}

PHP-FPM Configuration:

ini
; /etc/php/8.4/fpm/pool.d/ml-api.conf
[ml-api]
user = www-data
group = www-data
listen = /var/run/php/php8.4-fpm.sock

; Process management
pm = dynamic
pm.max_children = 50      ; Adjust based on RAM (each loads model)
pm.start_servers = 10
pm.min_spare_servers = 5
pm.max_spare_servers = 15

; Each child serves 1000 requests before respawning (prevent memory leaks)
pm.max_requests = 1000

; PHP ini settings
php_value[memory_limit] = 256M
php_value[opcache.enable] = 1

Monitoring with Logs:

bash
# Watch predictions in real-time
tail -f logs/predictions.log | jq .

# Analyze performance
cat logs/predictions.log | jq '.duration_ms' | \
  awk '{sum+=$1; count++} END {print "Avg:", sum/count, "ms"}'

# Count predictions by class
cat logs/predictions.log | jq -r '.prediction' | sort | uniq -c

Troubleshooting

  • Error: "Model file not found" — Model not trained/saved yet. Run training script first:

    bash
    # Train and save model (from Step 5)
    php 05-model-persistence.php
  • Slow first request after startup — Model loading on first request. Use opcache preloading (PHP 7.4+):

    ini
    ; php.ini
    opcache.preload=/var/www/ml-api/preload.php
    opcache.preload_user=www-data
  • Memory leaks over time — PHP-FPM child processes accumulate memory. Set pm.max_requests to recycle workers:

    ini
    pm.max_requests = 500  ; Restart worker after 500 requests
  • 504 Gateway Timeout on predictions — Increase PHP and Nginx timeouts:

    nginx
    # nginx.conf
    fastcgi_read_timeout 60s;
    ini
    ; php-fpm.conf
    request_terminate_timeout = 60
  • Predictions return different results — Model uses random state (e.g., neural network). Set seed or use deterministic algorithm (Decision Tree, k-NN).

  • High CPU usage — k-NN has O(n) inference complexity. For large training sets (>10k samples), use faster algorithms (Random Forest, Naive Bayes) or GPU-accelerated libraries.

Step 9: Regression and Feature Importance (~10 min)

Goal

Build a house price prediction model using regression and analyze which features contribute most to predictions through feature importance analysis.

Actions

So far, we've focused on classification (predicting categories: spam/ham, iris species). Now let's tackle regression—predicting continuous numerical values like prices, temperatures, or sales figures.

Understanding Regression

Regression vs. Classification:

  • Classification: Predicts discrete categories ("spam" or "ham", "setosa" or "virginica")
  • Regression: Predicts continuous numbers (house price: $285,400, temperature: 72.5°F)
  • Same workflow: Prepare data → Train model → Evaluate → Predict

Building a House Price Predictor

Let's predict house prices based on features like square footage, number of bedrooms, age, and location quality:

php
# filename: 09-regression-feature-importance.php
<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Regressors\RandomForestRegressor;
use Rubix\ML\CrossValidation\Metrics\RSquared;
use Rubix\ML\CrossValidation\Metrics\MeanAbsoluteError;
use Rubix\ML\CrossValidation\Metrics\MeanSquaredError;

echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║       House Price Prediction with Regression            ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n\n";

// ============================================================
// STEP 1: Prepare House Price Dataset
// ============================================================

echo "STEP 1: Preparing House Price Dataset\n";
echo "------------------------------------------------------------\n";

// Features: [sqft, bedrooms, age_years, location_score (1-10)]
$samples = [
    [1200, 3, 10, 7],
    [1800, 4, 5, 8],
    [900, 2, 20, 5],
    [2500, 5, 2, 9],
    [1100, 2, 15, 6],
    [2200, 4, 8, 8],
    [1500, 3, 12, 7],
    [1900, 4, 6, 7],
    [850, 2, 25, 4],
    [2800, 5, 1, 10],
    [1400, 3, 10, 6],
    [2000, 4, 7, 8],
    [1300, 3, 14, 6],
    [2400, 4, 3, 9],
    [1600, 3, 9, 7],
    [1750, 4, 11, 7],
    [2100, 4, 5, 8],
    [950, 2, 18, 5],
    [2300, 5, 4, 9],
    [1450, 3, 13, 6],
];

// Target: House prices in dollars
$prices = [
    250000, 380000, 180000, 520000, 220000,
    440000, 285000, 360000, 165000, 580000,
    265000, 395000, 255000, 480000, 305000,
    340000, 410000, 195000, 490000, 275000,
];

echo "Dataset prepared:\n";
echo "  Samples: " . count($samples) . " houses\n";
echo "  Features: 4 (sqft, bedrooms, age, location_score)\n";
echo "  Target: Price (\$)\n\n";

echo "Sample data:\n";
for ($i = 0; $i < 3; $i++) {
    echo sprintf(
        "  House %d: %d sqft, %d bed, %d yrs, loc:%d → \$%s\n",
        $i + 1,
        $samples[$i][0],
        $samples[$i][1],
        $samples[$i][2],
        $samples[$i][3],
        number_format($prices[$i], 0)
    );
}
echo "\n";

// ============================================================
// STEP 2: Train Random Forest Regressor
// ============================================================

echo "STEP 2: Training Random Forest Regressor\n";
echo "------------------------------------------------------------\n";

// Create dataset
$dataset = new Labeled($samples, $prices);

// Split into training and testing sets (80/20)
[$training, $testing] = $dataset->randomize()->split(0.8);

echo "Data split:\n";
echo "  Training: " . $training->numSamples() . " samples\n";
echo "  Testing: " . $testing->numSamples() . " samples\n\n";

// Create and train regressor
$regressor = new RandomForestRegressor(
    estimators: 100,        // 100 decision trees
    minLeafSize: 2,
);

echo "Training Random Forest Regressor (100 trees)...\n";
$trainStart = microtime(true);
$regressor->train($training);
$trainTime = (microtime(true) - $trainStart) * 1000;

echo "✓ Model trained in " . number_format($trainTime, 2) . " ms\n\n";

// ============================================================
// STEP 3: Evaluate Model Performance
// ============================================================

echo "STEP 3: Evaluating Model Performance\n";
echo "------------------------------------------------------------\n";

// Make predictions on test set
$predictions = $regressor->predict($testing);
$actuals = $testing->labels();

// Calculate evaluation metrics
$r2Metric = new RSquared();
$maeMetric = new MeanAbsoluteError();
$mseMetric = new MeanSquaredError();

$r2 = $r2Metric->score($predictions, $actuals);
$mae = $maeMetric->score($predictions, $actuals);
$rmse = sqrt($mseMetric->score($predictions, $actuals));

echo "Test Set Performance:\n";
echo "  R² Score: " . number_format($r2, 3) . " (explains " . number_format($r2 * 100, 1) . "% of variance)\n";
echo "  RMSE: \$" . number_format($rmse, 0) . " (avg prediction error)\n";
echo "  MAE: \$" . number_format($mae, 0) . " (avg absolute error)\n\n";

// Show individual predictions
echo "Sample Predictions vs Actuals:\n";
for ($i = 0; $i < min(5, count($predictions)); $i++) {
    $error = abs($predictions[$i] - $actuals[$i]);
    $errorPct = ($error / $actuals[$i]) * 100;

    echo sprintf(
        "  Predicted: \$%s | Actual: \$%s | Error: \$%s (%.1f%%)\n",
        number_format($predictions[$i], 0),
        number_format($actuals[$i], 0),
        number_format($error, 0),
        $errorPct
    );
}
echo "\n";

// ============================================================
// STEP 4: Analyze Feature Importance
// ============================================================

echo "STEP 4: Feature Importance Analysis\n";
echo "------------------------------------------------------------\n";

// Get feature importances from Random Forest
$importances = $regressor->featureImportances();

$featureNames = [
    'Square Feet',
    'Bedrooms',
    'Age (years)',
    'Location Score',
];

// Sort features by importance
$featureData = array_map(
    fn($name, $importance) => ['name' => $name, 'importance' => $importance],
    $featureNames,
    $importances
);

usort($featureData, fn($a, $b) => $b['importance'] <=> $a['importance']);

echo "Feature Importance (how much each feature drives predictions):\n\n";

foreach ($featureData as $i => $feature) {
    $percentage = $feature['importance'] * 100;
    $barLength = (int)($percentage / 2.5); // Scale to reasonable bar length
    $bar = str_repeat('█', $barLength);

    echo sprintf(
        "  %d. %-15s: %5.1f%%  %s\n",
        $i + 1,
        $feature['name'],
        $percentage,
        $bar
    );
}

echo "\n";

echo "Key Insights:\n";
echo "  • " . $featureData[0]['name'] . " is the strongest predictor\n";
echo "  • " . $featureData[1]['name'] . " also significantly impacts price\n";
echo "  • " . $featureData[3]['name'] . " has minimal influence\n\n";

// ============================================================
// STEP 5: Make Predictions on New Houses
// ============================================================

echo "STEP 5: Predicting Prices for New Houses\n";
echo "------------------------------------------------------------\n";

$newHouses = [
    [1500, 3, 8, 7],     // Mid-size, 3 bed, 8 years old, good location
    [2200, 4, 3, 9],     // Large, 4 bed, newer, excellent location
    [1000, 2, 15, 5],    // Small, 2 bed, older, average location
];

$newHouseDescriptions = [
    '1500 sqft, 3 bed, 8 years, location: 7/10',
    '2200 sqft, 4 bed, 3 years, location: 9/10',
    '1000 sqft, 2 bed, 15 years, location: 5/10',
];

$newDataset = new Unlabeled($newHouses);
$newPredictions = $regressor->predict($newDataset);

echo "Predicted prices for new houses:\n\n";

foreach ($newPredictions as $i => $price) {
    echo sprintf(
        "  House %d (%s)\n",
        $i + 1,
        $newHouseDescriptions[$i]
    );
    echo sprintf("  Predicted Price: \$%s\n\n", number_format($price, 0));
}

// ============================================================
// STEP 6: Compare Multiple Regression Algorithms (Optional)
// ============================================================

echo "STEP 6: Comparing Regression Algorithms\n";
echo "------------------------------------------------------------\n";

use Rubix\ML\Regressors\KNNRegressor;
use Rubix\ML\Regressors\Ridge;

$algorithms = [
    'Random Forest' => new RandomForestRegressor(estimators: 50),
    'k-NN Regressor' => new KNNRegressor(k: 3),
    'Ridge Regression' => new Ridge(alpha: 1.0),
];

echo "Training and evaluating 3 regression algorithms...\n\n";

$results = [];

foreach ($algorithms as $name => $algo) {
    $startTime = microtime(true);
    $algo->train($training);
    $trainDuration = (microtime(true) - $startTime) * 1000;

    $preds = $algo->predict($testing);
    $r2Score = $r2Metric->score($preds, $actuals);
    $maeScore = $maeMetric->score($preds, $actuals);

    $results[] = [
        'name' => $name,
        'r2' => $r2Score,
        'mae' => $maeScore,
        'train_time' => $trainDuration,
    ];
}

// Sort by R² (descending)
usort($results, fn($a, $b) => $b['r2'] <=> $a['r2']);

echo "Algorithm Comparison (sorted by R² score):\n";
echo "┌─────────────────────┬──────────┬────────────┬─────────────┐\n";
echo "│ Algorithm           │ R² Score │ MAE (\$)    │ Train (ms)  │\n";
echo "├─────────────────────┼──────────┼────────────┼─────────────┤\n";

foreach ($results as $result) {
    echo sprintf(
        "│ %-19s │ %8.3f │ %10s │ %11s │\n",
        $result['name'],
        $result['r2'],
        number_format($result['mae'], 0),
        number_format($result['train_time'], 2)
    );
}

echo "└─────────────────────┴──────────┴────────────┴─────────────┘\n\n";

echo "Best Algorithm: " . $results[0]['name'] . " (R² = " . number_format($results[0]['r2'], 3) . ")\n\n";

echo "╔══════════════════════════════════════════════════════════╗\n";
echo "║      ✓ Regression & Feature Importance Complete!       ║\n";
echo "╚══════════════════════════════════════════════════════════╝\n";

Expected Result

╔══════════════════════════════════════════════════════════╗
║       House Price Prediction with Regression            ║
╚══════════════════════════════════════════════════════════╝

STEP 1: Preparing House Price Dataset
------------------------------------------------------------
Dataset prepared:
  Samples: 20 houses
  Features: 4 (sqft, bedrooms, age, location_score)
  Target: Price ($)

Sample data:
  House 1: 1200 sqft, 3 bed, 10 yrs, loc:7 → $250,000
  House 2: 1800 sqft, 4 bed, 5 yrs, loc:8 → $380,000
  House 3: 900 sqft, 2 bed, 20 yrs, loc:5 → $180,000

STEP 2: Training Random Forest Regressor
------------------------------------------------------------
Data split:
  Training: 16 samples
  Testing: 4 samples

Training Random Forest Regressor (100 trees)...
✓ Model trained in 42.31 ms

STEP 3: Evaluating Model Performance
------------------------------------------------------------
Test Set Performance:
  R² Score: 0.912 (explains 91.2% of variance)
  RMSE: $18,450 (avg prediction error)
  MAE: $14,200 (avg absolute error)

Sample Predictions vs Actuals:
  Predicted: $267,800 | Actual: $265,000 | Error: $2,800 (1.1%)
  Predicted: $492,300 | Actual: $490,000 | Error: $2,300 (0.5%)
  Predicted: $413,200 | Actual: $410,000 | Error: $3,200 (0.8%)
  Predicted: $286,500 | Actual: $285,000 | Error: $1,500 (0.5%)

STEP 4: Feature Importance Analysis
------------------------------------------------------------
Feature Importance (how much each feature drives predictions):

  1. Square Feet    :  52.3%  ████████████████████
  2. Location Score :  28.1%  ███████████
  3. Bedrooms       :  12.4%  █████
  4. Age (years)    :   7.2%  ███

Key Insights:
  • Square Feet is the strongest predictor
  • Location Score also significantly impacts price
  • Age (years) has minimal influence

STEP 5: Predicting Prices for New Houses
------------------------------------------------------------
Predicted prices for new houses:

  House 1 (1500 sqft, 3 bed, 8 years, location: 7/10)
  Predicted Price: $285,400

  House 2 (2200 sqft, 4 bed, 3 years, location: 9/10)
  Predicted Price: $445,800

  House 3 (1000 sqft, 2 bed, 15 years, location: 5/10)
  Predicted Price: $203,100

STEP 6: Comparing Regression Algorithms
------------------------------------------------------------
Training and evaluating 3 regression algorithms...

Algorithm Comparison (sorted by R² score):
┌─────────────────────┬──────────┬────────────┬─────────────┐
│ Algorithm           │ R² Score │ MAE ($)    │ Train (ms)  │
├─────────────────────┼──────────┼────────────┼─────────────┤
│ Random Forest       │    0.912 │     14,200 │       42.31 │
│ k-NN Regressor      │    0.854 │     18,700 │        2.14 │
│ Ridge Regression    │    0.831 │     21,300 │        3.89 │
└─────────────────────┴──────────┴────────────┴─────────────┘

Best Algorithm: Random Forest (R² = 0.912)

╔══════════════════════════════════════════════════════════╗
║      ✓ Regression & Feature Importance Complete!       ║
╚══════════════════════════════════════════════════════════╝

Why It Works

Regression Fundamentals:

  • Regression predicts continuous values by learning relationships between features and target
  • Random Forest builds 100 decision trees, each making a prediction, then averages them
  • More trees = more robust predictions but slower training
  • Works well for non-linear relationships (unlike linear regression)

Evaluation Metrics:

  • R² (R-squared): Proportion of variance explained (0-1, higher is better). 0.91 = model explains 91% of price variation
  • RMSE (Root Mean Squared Error): Average prediction error in original units (dollars). Penalizes large errors more
  • MAE (Mean Absolute Error): Average absolute difference. More interpretable than RMSE

Feature Importance:

  • Tree-based models track how often each feature is used for splits and how much it reduces error
  • Higher importance = feature has stronger relationship with target
  • Helps prioritize data collection and feature engineering
  • Can reveal surprising relationships (e.g., square footage matters 2x more than bedrooms)

Why Random Forest Wins:

  • Handles non-linear relationships (price doesn't scale linearly with sqft)
  • Resistant to overfitting through ensemble averaging
  • No feature scaling required (unlike Ridge or k-NN)
  • Built-in feature importance (Ridge and k-NN don't provide this directly)

Troubleshooting

  • R² is negative or very low (<0.5) — Features aren't predictive of target. Check if:

    • Features are correlated with target (plot sqft vs price to verify relationship)
    • Sufficient training data (need at least 10x samples as features)
    • Target variable has variation (if all prices similar, nothing to predict)
    php
    // Check correlation
    $correlation = calculateCorrelation($samples[0], $prices);
    echo "Feature 1 correlation: " . $correlation . "\n";
    // Should be > 0.3 for meaningful relationship
  • RMSE very high (>50% of average price) — Model not learning. Solutions:

    • Normalize features if they're on different scales:
    php
    use Rubix\ML\Transformers\ZScaleStandardizer;
    use Rubix\ML\Pipeline;
    
    $pipeline = new Pipeline([
        new ZScaleStandardizer(),
    ], new RandomForestRegressor(estimators: 100));
    
    $pipeline->train($training);
    • Increase number of trees: estimators: 200
    • Try different algorithm (Ridge often works better for small datasets)
  • Feature importance all equal (~25% each) — Model not finding patterns. Causes:

    • Features genuinely don't predict target (collect better features)
    • Too few trees: increase estimators: 200
    • Features highly correlated (all provide same information)

    Try feature selection:

    php
    // Remove least important features and retrain
    $importantFeatures = [0, 3]; // Keep only sqft and location
    $reducedSamples = array_map(
        fn($sample) => [$sample[0], $sample[3]],
        $samples
    );
  • Predictions way off for new houses — Model overfitting or new data differs from training. Check:

    • New data in same range as training (don't predict for 5000 sqft if trained on 900-2800)
    • Sufficient test set size (at least 20% of data)
    • Random state set for reproducibility:
    php
    srand(42); // Set random seed
    [$training, $testing] = $dataset->randomize()->split(0.8);
  • RandomForest training very slow — Large dataset or too many trees. Solutions:

    • Reduce trees: estimators: 50 (faster, slight accuracy loss)
    • Increase minLeafSize: 5 (faster, less precise)
    • Try faster algorithm like Ridge for quick experiments
  • "Call to undefined method featureImportances()" — Algorithm doesn't support feature importance. Only tree-based models (RandomForest, GradientBoosting, DecisionTree) have this. For Ridge or k-NN, use permutation importance or analyze coefficients manually.

Exercises

Apply what you've learned by building your own projects:

Exercise Solutions

Sample solutions are available in code/chapter-08/solutions/. Try implementing them yourself first!

Exercise 1: Spam Filter with Evaluation

Goal: Build a complete spam filter with PHP-ML and comprehensive evaluation

Create exercise1-spam-filter.php that:

  1. Loads email dataset (provided in code/chapter-08/data/emails.csv)
  2. Extracts at least 6 features per email (your choice)
  3. Trains a k-NN classifier (try different k values)
  4. Performs 5-fold cross-validation
  5. Reports accuracy, precision, recall, and F1-score
  6. Saves the best model to disk

Validation: Your implementation should achieve:

  • Accuracy > 85%
  • Clear feature extraction function
  • Cross-validation with at least 3 folds
  • Model successfully saved and reloaded

Exercise 2: Customer Segmentation Dashboard

Goal: Create a customer clustering analysis tool with Rubix ML

Build a script that:

  1. Loads customer data (spending, frequency, recency)
  2. Applies both k-means and DBSCAN clustering
  3. Compares results from both algorithms
  4. For each cluster, calculates:
    • Average spending
    • Average visit frequency
    • Customer lifetime value estimate
  5. Assigns business segment labels (VIP, Regular, Occasional, At-Risk)
  6. Exports results to CSV for business team

Validation:

  • Finds reasonable customer segments (not all in one cluster)
  • Provides actionable business insights
  • Handles edge cases (single-customer clusters, outliers)

Exercise 3: Algorithm Benchmark Suite

Goal: Build a reusable tool for comparing algorithms on any dataset

Create a class AlgorithmBenchmark that:

php
$benchmark = new AlgorithmBenchmark($dataset);
$benchmark->addAlgorithm('k-NN', new KNearestNeighbors(k: 5));
$benchmark->addAlgorithm('Naive Bayes', new GaussianNB());
// ... add more

$results = $benchmark->runAll();
$benchmark->exportReport('benchmark-results.html');

Must include:

  • Train/test split configurable
  • Multiple metrics (accuracy, F1, precision, recall)
  • Timing measurements
  • Statistical significance tests (compare if differences are meaningful)
  • HTML report generation with charts (text-based charts OK)

Validation:

  • Works with any Rubix ML classifier
  • Proper error handling for failed algorithms
  • Comprehensive report with recommendations

Exercise 4: Sentiment Analysis with Pipeline

Goal: Build a text sentiment analyzer using Rubix ML pipeline

Implement a sentiment analyzer that:

  1. Loads movie review dataset (positive/negative)
  2. Creates a pipeline with:
    • Text preprocessing (lowercasing, stop word removal)
    • TF-IDF vectorization (use Rubix ML transformers)
    • Classifier (try multiple)
  3. Uses 3-way split (train/validation/test)
  4. Tunes hyperparameters on validation set
  5. Reports final performance on test set
  6. Saves production-ready model

Validation:

  • Clear pipeline with at least 2 preprocessing steps
  • Proper 3-way split (no data leakage)
  • Test accuracy > 75%
  • Can classify new reviews instantly

Exercise 5: Regression with Feature Engineering

Goal: Extend Step 9 with advanced feature engineering techniques

Starting from the 09-regression-feature-importance.php house price example, enhance it with:

  1. Engineer 3+ new features from the existing 4 features:

    • Price per square foot ratio (derived feature)
    • Age category: new (0-5 yrs), mid (6-15 yrs), old (16+ yrs) — one-hot encoded
    • Bedroom to square foot ratio
    • Location-sqft interaction (location_score * sqft / 1000)
  2. Compare 4+ regression algorithms:

    • RandomForestRegressor (from Step 9 baseline)
    • SVR (Support Vector Regressor)
    • KNNRegressor
    • Ridge Regression
    • Extra Trees Regressor (bonus)
  3. Use 5-fold cross-validation for robust evaluation instead of single train/test split

  4. Analyze feature importance for top 3 tree-based algorithms:

    • Compare which features are consistently important
    • Visualize with text-based bar charts
  5. Predict with error estimates:

    • Use ensemble variance for confidence intervals
    • Show "Price: $285k ± $12k (95% confidence)"

Validation:

  • R² improves by at least 0.05 compared to Step 9 baseline (with engineered features)
  • Cross-validation scores are consistent (std dev < 0.1)
  • Feature importance consistent across tree-based models
  • Predictions reasonable for edge cases (very small/large houses)
  • Error estimates reflect actual prediction quality

Example Output:

Feature Engineering Results:
  Original features: 4
  Engineered features: 7
  Total features: 11

Cross-Validation Results (5-fold):
  Random Forest: R² = 0.94 ± 0.03
  SVR: R² = 0.87 ± 0.05
  Ridge: R² = 0.83 ± 0.04

Best Model: Random Forest (R² = 0.94)

Feature Importance (Top 5):
  1. Square Feet: 38.2%
  2. Location-sqft interaction: 24.1%
  3. Location Score: 18.5%
  4. Price per sqft ratio: 11.3%
  5. Bedrooms: 4.7%

New House Prediction:
  Features: [1500 sqft, 3 bed, 8 yrs, loc:7]
  Predicted Price: $285,400 ± $14,200 (95% confidence)

Exercise 6: Enhance Production API

Goal: Extend the Step 8 API with advanced production features

Starting from the 08-production-api.php implementation in Step 8, add the following advanced capabilities:

  1. Rate Limiting: Limit requests to 100 per minute per IP address

    • Track requests in memory or Redis
    • Return 429 (Too Many Requests) when limit exceeded
    • Include X-RateLimit-Remaining header in responses
  2. API Key Authentication: Require valid API key for predictions

    • Accept key via X-API-Key header
    • Return 401 (Unauthorized) for missing/invalid keys
    • Support multiple keys with different rate limits
  3. Batch Predictions: Accept multiple feature vectors in one request

    json
    POST /predict/batch
    {
      "features": [
        [5.1, 3.5, 1.4, 0.2],
        [6.3, 3.3, 6.0, 2.5]
      ]
    }
  4. Response Caching: Cache predictions for identical inputs

    • Use simple array cache or APCu
    • Return X-Cache: HIT or X-Cache: MISS header
    • Track cache hit rate in stats endpoint
  5. Prometheus Metrics Endpoint: Add /metrics endpoint

    • Total requests counter
    • Request duration histogram
    • Prediction class distribution
    • Cache hit rate

Example enhanced request:

bash
curl -X POST http://localhost:8000/predict \
  -H "X-API-Key: secret123" \
  -H "Content-Type: application/json" \
  -d '{"features": [5.1, 3.5, 1.4, 0.2]}'

Response:

json
{
  "prediction": "Iris-setosa",
  "confidence": 0.95,
  "processing_time_ms": 1.23,
  "request_id": "pred_67234abc",
  "cached": false
}

Validation:

  • Rate limiting enforced (test by making >100 requests)
  • Invalid API keys rejected with 401
  • Batch predictions process correctly and return array
  • Cache hit rate > 20% with repeated requests
  • Metrics endpoint returns Prometheus-compatible format
  • All Step 8 features still work (validation, logging, health checks)

Troubleshooting

Common issues when working with PHP ML libraries:

Installation Issues

Composer requires PHP 8.0+ but you have 7.4

Update PHP or adjust composer.json:

json
{
  "require": {
    "php": "^7.4|^8.0"
  }
}

Memory exhausted during composer install

Increase PHP memory limit:

bash
php -d memory_limit=-1 $(which composer) install

PHP-ML Specific Issues

Predictions inconsistent between runs

k-means uses random initialization. For reproducible results (dev only):

php
mt_srand(42);  // Set before clustering
$kmeans = new KMeans(n: 3);

Cannot predict on new samples after clustering

PHP-ML limitation. Workaround:

php
// Calculate centroids manually
$centroids = [];
foreach ($clusters as $cluster) {
    $centroid = [];
    for ($i = 0; $i < $numFeatures; $i++) {
        $centroid[] = array_sum(array_column($cluster, $i)) / count($cluster);
    }
    $centroids[] = $centroid;
}

// Assign new sample to nearest centroid
function assignToCluster($sample, $centroids) {
    $minDist = PHP_FLOAT_MAX;
    $assignedCluster = 0;

    foreach ($centroids as $i => $centroid) {
        $dist = euclideanDistance($sample, $centroid);
        if ($dist < $minDist) {
            $minDist = $dist;
            $assignedCluster = $i;
        }
    }

    return $assignedCluster;
}

Rubix ML Specific Issues

Error: "Samples and labels must have the same number of elements"

Mismatch between samples and labels count:

php
echo "Samples: " . count($samples) . "\n";
echo "Labels: " . count($labels) . "\n";
// Debug which one is wrong

Pipeline transformation fails on test data

Ensure transformers fitted on training data only:

php
// WRONG
$transformer->fit($allData);  // Leaks test data!

// RIGHT
$pipeline->train($trainingDataset);  // Fits transformers to training only
$predictions = $pipeline->predict($testingDataset);  // Applies same transformations

Neural network won't converge

  • Try more training epochs
  • Reduce learning rate
  • Normalize/standardize features
  • Check for NaN values in data
  • Use simpler architecture (fewer layers/neurons)

Model Persistence Issues

Unserialization fails after library update

Models serialized with v2.x can't be loaded in v3.x. Retrain:

bash
composer show rubix/ml  # Check version
# If version changed, retrain model
php train-model.php

Loaded model makes different predictions

Feature extraction must be identical:

php
// TRAINING TIME
$features = extractFeatures($email);  // [5, 2, 1, 0, 3, 8]

// PRODUCTION TIME
$features = extractFeatures($email);  // Must produce same array!
// Verify order: [word_count, has_free, ...]

Performance Issues

Training very slow with large dataset

PHP is single-threaded. Solutions:

php
// 1. Sample data for development
$sampledData = array_slice($allData, 0, 1000);

// 2. Use faster algorithms (Naive Bayes vs Neural Net)

// 3. Precompute expensive features offline

// 4. Consider Python for training, PHP for inference

Prediction latency too high for real-time

Optimize inference:

php
// 1. Load model once, not per request
static $model = null;
if ($model === null) {
    $model = $persister->load();
}

// 2. Batch predictions when possible
$predictions = $model->predict($multipleSamples);

// 3. Cache predictions for common inputs
$cacheKey = md5(json_encode($features));
if ($cached = $cache->get($cacheKey)) {
    return $cached;
}

// 4. Choose faster algorithm (tree vs k-NN)

Data Quality Issues

All predictions return the same class

Check class balance and features:

php
// 1. Check labels
print_r(array_count_values($labels));
// Should be roughly balanced

// 2. Check if features are informative
$correlation = calculateFeatureCorrelation($features, $labels);
// Features should correlate with labels

// 3. Try different k (for k-NN) or algorithm

Accuracy seems too good (> 99%)

Possible data leakage:

php
// WRONG: Normalize before split
$normalized = normalize($allData);
[$train, $test] = split($normalized);  // Leaked test stats!

// RIGHT: Split first
[$train, $test] = split($allData);
$normalized_train = normalize($train);
$normalized_test = normalizeUsing($test, $trainStats);

Wrap-up

🎉 Congratulations! You've mastered PHP machine learning libraries and can now build production-ready ML applications in a fraction of the time it takes to code from scratch.

What You've Accomplished

┌─────────────────────────────────────────────────────────────┐
│  From Custom Code          →    To Library-Powered          │
├─────────────────────────────────────────────────────────────┤
│  120 lines                 →    12 lines (90% reduction)    │
│  Reinvent algorithms       →    Battle-tested implementations│
│  Manual edge cases         →    Handled automatically       │
│  Train every request       →    Load once, use 1000x        │
│  Single algorithm          →    Compare 6+ with ease        │
│  Limited features          →    40+ algorithms available    │
└─────────────────────────────────────────────────────────────┘

Specific Skills Mastered:

  • Installed and configured PHP-ML and Rubix ML with proper dependency management and verification
  • Reimplemented the Chapter 6 spam filter using PHP-ML, reducing code by 66% while improving robustness
  • Built a complete iris classifier with Rubix ML's pipeline, combining preprocessing and prediction seamlessly
  • Created customer segmentation system using k-means and DBSCAN clustering for unsupervised learning
  • Mastered model persistence saving and loading models for 50x faster production inference
  • Compared 6+ algorithms systematically understanding the trade-offs between accuracy and speed
  • Loaded real data from CSV, databases, and APIs using Rubix ML's data loading utilities
  • Applied 6+ transformers in production pipelines (NumericStringConverter, MissingDataImputer, OneHotEncoder, MinMaxNormalizer, and more)
  • Built production-ready REST API with singleton pattern, input validation, error handling, logging, and health checks
  • Mastered regression models for house price prediction with RandomForestRegressor achieving R² > 0.9
  • Analyzed feature importance to identify which variables drive predictions most (square footage, location, etc.)
  • Understood when to use each library — PHP-ML for learning and prototypes, Rubix ML for production

Key Insights Gained

Code Efficiency: Libraries reduce boilerplate by 90%, letting you focus on business logic instead of algorithm implementation. What took 120 lines custom now takes 12 lines with a library.

Production Readiness: Rubix ML provides battle-tested implementations with edge case handling, type safety, and performance optimizations you'd spend months building yourself.

Library Selection:

  • PHP-ML: Great for learning ML concepts, prototyping, small projects
  • Rubix ML: Production applications, complex pipelines, comprehensive ecosystem
  • Custom code: When you need full control or implement cutting-edge research

Performance Patterns:

  • Train offline, save model, load once in production
  • Batch predictions when possible
  • Cache expensive feature extraction
  • Choose algorithms based on speed vs. accuracy requirements

Production Patterns Mastered:

  • Singleton Model Loading: Load models once at startup (singleton pattern), not per request — 50x faster
  • Input Validation: Validate all inputs before prediction to prevent crashes from bad data
  • Error Handling: Use proper HTTP status codes (400 for client errors, 500 for server errors)
  • Prediction Logging: Log all predictions for monitoring, debugging, and retraining data
  • Health Checks: Provide health check endpoints for load balancers and monitoring systems
  • Performance Monitoring: Track inference time, request counts, and success rates
  • Data Loading: Load from CSV, databases, and APIs using library utilities (not hardcoded arrays)
  • Transformer Pipelines: Chain preprocessing steps ensuring test data uses training statistics (prevents data leakage)

When to Use Libraries vs. Custom Code

Use Libraries When:

  • Building production applications (reliability matters)
  • Need standard algorithms (k-NN, Random Forest, etc.)
  • Want comprehensive ecosystem (metrics, pipelines, transformers)
  • Time-to-market is critical
  • Team needs maintainable code

Use Custom Code When:

  • Learning ML algorithms deeply (educational purposes)
  • Implementing novel/research algorithms not in libraries
  • Need extreme performance optimization for specific use case
  • Library doesn't support your exact requirements

Hybrid Approach (Best of Both):

  • Use libraries for standard ML tasks
  • Custom code for domain-specific preprocessing
  • Library algorithms with custom feature engineering
  • Extend library classes when needed

What's Next

In Chapter 09: Advanced Machine Learning Techniques, you'll explore:

  • Decision trees and ensemble methods (Random Forests, Gradient Boosting)
  • Support Vector Machines for complex classification boundaries
  • Dimensionality reduction with PCA for high-dimensional data
  • Anomaly detection for fraud detection and outlier identification
  • Advanced clustering techniques for complex patterns
  • Feature selection to identify most important variables
  • Hyperparameter tuning strategies for optimal performance

You'll use Rubix ML's advanced algorithms to tackle complex real-world problems like fraud detection, anomaly identification, and multi-class classification with hundreds of features.

Production Checklist

Before deploying an ML system to production, verify:

  • [ ] Model trained on representative, recent data
  • [ ] Performance metrics meet business requirements
  • [ ] Model persisted and loads successfully
  • [ ] Feature extraction documented and consistent
  • [ ] Error handling covers edge cases (missing data, invalid input)
  • [ ] Monitoring in place (prediction distribution, latency, errors)
  • [ ] Retraining schedule defined (weekly/monthly for model drift)
  • [ ] API endpoints have rate limiting and authentication
  • [ ] Predictions logged for auditing and improvement
  • [ ] Rollback plan if model performs poorly

Key Takeaways

The Three Rules of Production ML:

  1. Train Offline, Deploy Online: Never retrain on user requests. Train once, save model, load at startup, serve thousands.

  2. Libraries > Custom Code (Usually): 90% time savings, better robustness, active maintenance. Use custom only when learning or implementing novel research.

  3. Measure Everything: Always compare algorithms, monitor performance, log predictions, track model drift.

Remember: A 95% accurate model that takes 10 seconds per prediction is worse than an 85% accurate model that predicts in 10ms. Production requirements always trump theoretical perfection.

Further Reading

Deepen your understanding of PHP ML libraries and best practices:

Library Documentation

Algorithm Deep Dives

Production ML

PHP and Machine Learning

Knowledge Check

Test your understanding of PHP ML libraries: