Chapter 14: Document Processing and PDF Analysis

Overview

Claude excels at understanding and extracting information from complex documents - contracts, invoices, resumes, reports, legal documents, and more. Combined with PHP's document processing capabilities, you can build powerful automation systems that understand document structure, extract key data, and generate insights.

In this chapter, you'll learn to process PDFs, extract structured data, analyze contracts, automate invoice processing, and build intelligent document workflows that save hours of manual work.

What You'll Build: An automated document processing system that handles invoices, contracts, and resumes with intelligent extraction and validation.

Prerequisites

Before starting, ensure you have:

✓ Completed Chapter 13: Vision - Working with Images (Vision capabilities)
✓ PDF processing knowledge (basic understanding of PDF structure)
✓ Document structure understanding (headers, tables, sections)
✓ Composer installed for PDF libraries
✓ Imagick PHP extension installed (for PDF to image conversion)

Estimated Time: 60-75 minutes

Objectives

By the end of this chapter, you will be able to:

Convert PDF documents to images for Claude vision processing
Extract structured data from invoices with automatic validation
Analyze legal contracts and identify risks and key clauses
Process resumes and match them against job descriptions
Build an automated document processing pipeline
Optimize document processing with caching strategies
Compare approaches: vision API, Files API, and Batch API
Scale document processing for large volumes cost-effectively

Important: Choose Your Approach

This chapter demonstrates the vision API approach (converting PDFs to images). However, you should be aware of these alternative approaches for different scenarios:

Three Ways to Process Documents

Approach	Best For	Cost	Speed	Setup
Vision API (Chapter 14)	Single/few documents, formatting matters	Medium	Fast	Simple
Files API (Beta)	Persistent storage, reusable documents	Low	Medium	Moderate
Batch API	High volume, cost optimization	Very Low (50% off)	Slow	Complex

Recommended Path:

Start with Vision API for small-scale processing (this chapter)
Add Files API for persistent document storage
Use Batch API for 1000+ document jobs
Combine with RAG (Chapter 31) for knowledge bases

Required Libraries

Install necessary PHP libraries for PDF processing and the Claude SDK:

bash

composer require claude-php/sdk vlucas/phpdotenv
composer require smalot/pdfparser
composer require setasign/fpdf
composer require tecnickcom/tcpdf

Step 1: PDF to Image Conversion (~10 min)

Goal

Convert PDF documents to images so Claude can analyze them using vision capabilities. This is the foundation for all document processing workflows.

Actions

Install required libraries using Composer
Create PDFProcessor class with image conversion methods
Handle multi-page PDFs by converting each page separately
Extract text as fallback for documents that don't need vision

Expected Result

A working PDFProcessor class that can convert any PDF to PNG images and extract text when needed.

Why It Works

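Image content blocks carry images (JPEG, PNG, GIF, or WebP), so each PDF page is rendered to a PNG before it is sent. Rendering preserves formatting, tables, and visual elements that plain text extraction can miss. The Messages API can also accept a PDF directly as a document content block; this chapter uses page images because it builds on the vision techniques from Chapter 13. The Imagick extension handles the conversion, with a configurable DPI setting.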
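The later classes call ImageHelper::prepareImage() from Chapter 13, which is not reproduced in this chapter. As a rough sketch of what such a helper presumably returns, the hypothetical pageImageBlock() function below wraps one rendered PNG page in a base64 image content block. The function name is an assumption for illustration, not part of the chapter's code.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: wrap one rendered page in a base64 image content block.
 */
function pageImageBlock(string $imagePath): array
{
    $bytes = @file_get_contents($imagePath);
    if ($bytes === false) {
        throw new RuntimeException("Cannot read image: {$imagePath}");
    }

    return [
        'type' => 'image',
        'source' => [
            'type' => 'base64',
            'media_type' => 'image/png', // convertToImages() writes PNG files
            'data' => base64_encode($bytes),
        ],
    ];
}
```

Each page becomes one such block in the content array, followed by a text block that carries the prompt.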

Troubleshooting

Error: "Imagick extension required" — Install Imagick: sudo apt-get install php-imagick (Linux) or brew install imagemagick && pecl install imagick (macOS)
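PDF conversion fails — Ensure the PDF is not password-protected or corrupted. Imagick also needs Ghostscript installed to read PDFs, and some distributions ship an ImageMagick policy.xml that blocks the PDF coder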
Low quality images — Increase DPI in setResolution() (150-300 recommended)
Memory errors — Process large PDFs page-by-page instead of loading entire document
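To see why DPI and memory pull against each other, it helps to estimate the raster size of a page before converting it. The sketch below is plain arithmetic: PDF page sizes are given in points (72 per inch), so pixels = points / 72 × DPI. The estimateRasterSize() name and the 4-bytes-per-pixel figure are assumptions for illustration; ImageMagick builds with 16-bit channels use more.

```php
<?php
declare(strict_types=1);

/**
 * Estimate the raster size of one PDF page at a given DPI.
 *
 * @return array{width:int, height:int, approx_bytes:int}
 */
function estimateRasterSize(float $widthPt, float $heightPt, int $dpi): array
{
    $width = (int) round($widthPt / 72 * $dpi);
    $height = (int) round($heightPt / 72 * $dpi);

    return [
        'width' => $width,
        'height' => $height,
        // Assumption: 4 bytes per pixel (8-bit RGBA), uncompressed.
        'approx_bytes' => $width * $height * 4,
    ];
}

// US Letter is 612 x 792 points.
$at150 = estimateRasterSize(612, 792, 150); // 1275 x 1650 pixels
$at300 = estimateRasterSize(612, 792, 300); // 2550 x 3300 pixels
```

Doubling the DPI quadruples the pixel count, which is why a 300 DPI conversion of a long PDF can exhaust memory when every page is loaded at once.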

The class below converts PDF pages to PNG images and provides text extraction as a fallback:

php

<?php
# filename: src/Document/PDFProcessor.php
declare(strict_types=1);

namespace App\Document;

use Smalot\PdfParser\Parser as PdfParser;

class PDFProcessor
{
    /**
     * Convert PDF to images using Imagick
     */
    public static function convertToImages(string $pdfPath, ?string $outputDir = null): array
    {
        if (!extension_loaded('imagick')) {
            throw new \RuntimeException('Imagick extension required for PDF conversion');
        }

        $outputDir = $outputDir ?? sys_get_temp_dir();
        $baseFilename = pathinfo($pdfPath, PATHINFO_FILENAME);

        $imagick = new \Imagick();
        $imagick->setResolution(150, 150); // DPI for quality
        $imagick->readImage($pdfPath);

        $imagick->setImageFormat('png');
        $imagick->setImageCompressionQuality(90);

        $images = [];

        foreach ($imagick as $pageIndex => $page) {
            $pageNumber = $pageIndex + 1;
            $imagePath = "{$outputDir}/{$baseFilename}_page_{$pageNumber}.png";

            $page->setImageFormat('png');
            $page->writeImage($imagePath);

            $images[] = [
                'page' => $pageNumber,
                'path' => $imagePath,
                'size' => filesize($imagePath)
            ];
        }

        // clear() releases the image resources; destroy() is a deprecated alias
        $imagick->clear();

        return $images;
    }

    /**
     * Extract text from PDF (fallback method)
     */
    public static function extractText(string $pdfPath): array
    {
        $parser = new PdfParser();
        $pdf = $parser->parseFile($pdfPath);

        $pages = [];
        foreach ($pdf->getPages() as $pageNumber => $page) {
            $pages[$pageNumber + 1] = $page->getText();
        }

        return [
            'total_pages' => count($pages),
            'pages' => $pages,
            'full_text' => implode("\n\n", $pages),
            'metadata' => $pdf->getDetails()
        ];
    }

    /**
     * Get PDF metadata
     */
    public static function getMetadata(string $pdfPath): array
    {
        $parser = new PdfParser();
        $pdf = $parser->parseFile($pdfPath);
        $details = $pdf->getDetails();

        return [
            'title' => $details['Title'] ?? null,
            'author' => $details['Author'] ?? null,
            'subject' => $details['Subject'] ?? null,
            'keywords' => $details['Keywords'] ?? null,
            'creator' => $details['Creator'] ?? null,
            'producer' => $details['Producer'] ?? null,
            'creation_date' => $details['CreationDate'] ?? null,
            'modification_date' => $details['ModDate'] ?? null,
            'page_count' => count($pdf->getPages())
        ];
    }
}

Step 2: Invoice Processing (~15 min)

Goal

Build an automated invoice processor that extracts structured financial data, validates accuracy, and handles multi-page invoices.

Actions

Create InvoiceProcessor class that uses vision to analyze invoice images
Design JSON schema for invoice data extraction
Implement validation logic to verify extracted data accuracy
Handle multi-page invoices by combining data across pages

Expected Result

A complete invoice processing system that extracts vendor info, line items, totals, and payment terms with automatic validation.

Why It Works

Claude's vision capabilities excel at reading structured documents like invoices. By providing a clear JSON schema in the prompt, we guide Claude to extract data consistently. Validation ensures the extracted data matches the invoice totals and required fields.
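One check that follows naturally from "validation ensures the extracted data matches the invoice totals" is reconciling the stated total against subtotal + tax + shipping. The sketch below is one possible way to do that, assuming the field names from the JSON schema in this step; checkInvoiceTotals() is a hypothetical helper, not part of InvoiceProcessor.

```php
<?php
declare(strict_types=1);

/**
 * Check that subtotal + tax + shipping equals the stated total.
 * Amounts are compared in integer cents to avoid float drift.
 *
 * @return string[] list of problems (empty when the totals reconcile)
 */
function checkInvoiceTotals(array $invoice): array
{
    foreach (['subtotal', 'total'] as $field) {
        if (!isset($invoice[$field]) || !is_numeric($invoice[$field])) {
            return ["Cannot reconcile totals: {$field} is missing or not numeric"];
        }
    }

    $toCents = static fn ($value): int => (int) round((float) $value * 100);

    // Missing tax or shipping is treated as zero.
    $expected = $toCents($invoice['subtotal'])
        + $toCents($invoice['tax'] ?? 0)
        + $toCents($invoice['shipping'] ?? 0);
    $stated = $toCents($invoice['total']);

    if (abs($expected - $stated) > 1) { // tolerate one cent of rounding
        return [sprintf(
            'Total %.2f does not equal subtotal + tax + shipping (%.2f)',
            $stated / 100,
            $expected / 100
        )];
    }

    return [];
}
```

Working in integer cents keeps the comparison exact, whereas subtracting floats can report a spurious difference slightly above the tolerance.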

Troubleshooting

Missing line items — Check if invoice spans multiple pages; ensure all pages are processed
Invalid JSON response — Add JSON extraction logic to handle code blocks or extra text
Date format errors — Normalize date formats in validation (accept multiple formats)
Currency detection fails — Explicitly request currency symbol extraction in prompt
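For the "Invalid JSON response" case, a minimal sketch of tolerant extraction is shown below. It assumes the reply contains a single JSON object, possibly wrapped in a Markdown code fence or surrounded by commentary. decodeJsonReply() is a hypothetical name; the fence marker is built with str_repeat() only to keep literal backticks out of the example.

```php
<?php
declare(strict_types=1);

/**
 * Pull the JSON object out of a model reply that may wrap it in a
 * Markdown code fence or surround it with explanatory text.
 *
 * @throws JsonException when no decodable JSON object is present
 */
function decodeJsonReply(string $reply): array
{
    $fence = str_repeat('`', 3);

    // Prefer the contents of a fenced block when one exists.
    if (preg_match('/' . $fence . '(?:json)?\s*(.*?)' . $fence . '/s', $reply, $m)) {
        $reply = $m[1];
    }

    // Take everything from the first "{" to the last "}" so that
    // nested objects are kept whole.
    $start = strpos($reply, '{');
    $end = strrpos($reply, '}');
    if ($start === false || $end === false || $end < $start) {
        throw new JsonException('No JSON object found in reply');
    }

    $json = substr($reply, $start, $end - $start + 1);
    $data = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
    if (!is_array($data)) {
        throw new JsonException('Reply did not decode to an object');
    }

    return $data;
}
```

Throwing on failure, rather than returning an empty array, makes a malformed reply visible to the caller instead of silently producing an empty invoice.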

Extract structured data from invoices:

php

<?php
# filename: src/Document/InvoiceProcessor.php
declare(strict_types=1);

namespace App\Document;

use ClaudePhp\ClaudePhp;
use App\Vision\ImageHelper;

class InvoiceProcessor
{
    public function __construct(
        private ClaudePhp $client
    ) {}

    public function processInvoice(string $pdfPath): array
    {
        // Convert PDF to images
        $images = PDFProcessor::convertToImages($pdfPath);

        // Process each page
        $allData = [];
        foreach ($images as $imageInfo) {
            $pageData = $this->processInvoicePage($imageInfo['path']);
            $allData[] = $pageData;
        }

        // Combine multi-page data
        return $this->combineInvoiceData($allData);
    }

    private function processInvoicePage(string $imagePath): array
    {
        $imageContent = ImageHelper::prepareImage($imagePath);

        $response = $this->client->messages()->create([
            'model' => 'claude-sonnet-4-5',
            'max_tokens' => 4096,
            'messages' => [
                [
                    'role' => 'user',
                    'content' => [
                        $imageContent,
                        [
                            'type' => 'text',
                            'text' => <<<PROMPT
Extract all information from this invoice and return as JSON:

{
  "invoice_number": "string",
  "invoice_date": "YYYY-MM-DD",
  "due_date": "YYYY-MM-DD",
  "vendor": {
    "name": "string",
    "address": "string",
    "tax_id": "string",
    "contact": "string"
  },
  "customer": {
    "name": "string",
    "address": "string",
    "tax_id": "string"
  },
  "line_items": [
    {
      "description": "string",
      "quantity": number,
      "unit_price": number,
      "amount": number,
      "tax_rate": number
    }
  ],
  "subtotal": number,
  "tax": number,
  "shipping": number,
  "total": number,
  "currency": "string",
  "payment_terms": "string",
  "notes": "string"
}

Extract all visible data. Use null for missing fields. Ensure numbers are numeric, not strings.
Return only valid JSON, no explanation.
PROMPT
                        ]
                    ]
                ]
            ]
        ]);

        $jsonText = $response->content[0]->text;

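        // Extract JSON from response: prefer a fenced block, otherwise
        // take everything from the first "{" to the last "}"
        if (preg_match('/```json\s*(\{.*?\})\s*```/s', $jsonText, $matches)) {
            $jsonText = $matches[1];
        } elseif (preg_match('/\{.*\}/s', $jsonText, $matches)) {
            $jsonText = $matches[0];
        }

        $data = json_decode($jsonText, true);

        if (!is_array($data)) {
            throw new \RuntimeException('Failed to parse invoice JSON: ' . json_last_error_msg());
        }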

        return $data;
    }

    private function combineInvoiceData(array $pageData): array
    {
        // If single page, return as-is
        if (count($pageData) === 1) {
            return $pageData[0];
        }

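        // Multi-page: merge line items and fill top-level fields that the
        // first page left empty (totals usually appear on the last page)
        $combined = $pageData[0];
        for ($i = 1; $i < count($pageData); $i++) {
            foreach ($pageData[$i] as $field => $value) {
                if ($field === 'line_items') {
                    $combined['line_items'] = array_merge(
                        $combined['line_items'] ?? [],
                        is_array($value) ? $value : []
                    );
                } elseif (($combined[$field] ?? null) === null) {
                    $combined[$field] = $value;
                }
            }
        }

        return $combined;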
    }

    public function validateInvoice(array $invoiceData): array
    {
        $errors = [];

        // Required field validation
        $required = ['invoice_number', 'invoice_date', 'vendor', 'total'];
        foreach ($required as $field) {
            if (empty($invoiceData[$field])) {
                $errors[] = "Missing required field: {$field}";
            }
        }

        // Numeric validation
        if (isset($invoiceData['total']) && !is_numeric($invoiceData['total'])) {
            $errors[] = "Total must be numeric";
        }

        // Date validation (round-trip check rejects overflow dates such as 2024-02-31)
        if (isset($invoiceData['invoice_date'])) {
            $raw = $invoiceData['invoice_date'];
            $date = is_string($raw) ? \DateTime::createFromFormat('!Y-m-d', $raw) : false;
            if (!$date || $date->format('Y-m-d') !== $raw) {
                $errors[] = "Invalid invoice date format";
            }
        }

        // Line items total check (skipped unless both sides are usable)
        if (is_array($invoiceData['line_items'] ?? null) && is_numeric($invoiceData['subtotal'] ?? null)) {
            $calculatedSubtotal = array_sum(array_column($invoiceData['line_items'], 'amount'));
            $difference = abs($calculatedSubtotal - (float) $invoiceData['subtotal']);

            if (round($difference, 2) > 0.01) { // Allow 1 cent rounding difference
                $errors[] = "Line items total ({$calculatedSubtotal}) doesn't match subtotal ({$invoiceData['subtotal']})";
            }
        }

        return [
            'valid' => empty($errors),
            'errors' => $errors,
            'invoice_number' => $invoiceData['invoice_number'] ?? 'UNKNOWN'
        ];
    }
}

Step 3: Contract Analysis (~20 min)

Goal

Create a contract analyzer that extracts key clauses, assesses risks, and provides recommendations for legal document review.

Actions

Build ContractAnalyzer class with multi-pass analysis
Extract contract overview from first page (parties, dates, type)
Identify specific clauses (termination, payment, liability, IP)
Perform risk assessment with red flags and recommendations
Compare contracts side-by-side for differences

Expected Result

A comprehensive contract analysis system that provides structured insights, risk levels, and actionable recommendations.

Why It Works

Legal contracts require multiple analysis passes: first to understand structure, then to extract specific clauses, and finally to assess risks. By processing all pages together, Claude maintains context across the entire document. The structured JSON output enables programmatic risk assessment and comparison.
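As an illustration of what "programmatic risk assessment" can mean once the JSON is decoded, the sketch below counts red flags by severity and reports the highest level present. It assumes the red_flags structure requested in the risk-assessment prompt; summarizeRedFlags() is a hypothetical helper.

```php
<?php
declare(strict_types=1);

/**
 * Count red flags by severity and derive the highest severity present.
 *
 * @return array{counts: array<string,int>, highest: ?string}
 */
function summarizeRedFlags(array $riskAssessment): array
{
    $counts = ['low' => 0, 'medium' => 0, 'high' => 0];

    foreach ($riskAssessment['red_flags'] ?? [] as $flag) {
        $severity = strtolower((string) ($flag['severity'] ?? ''));
        if (isset($counts[$severity])) {
            $counts[$severity]++; // unknown severities are ignored
        }
    }

    $highest = null;
    foreach (['high', 'medium', 'low'] as $level) {
        if ($counts[$level] > 0) {
            $highest = $level;
            break;
        }
    }

    return ['counts' => $counts, 'highest' => $highest];
}
```

A summary like this can drive routing, for example sending any contract whose highest severity is "high" to a human reviewer.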

Troubleshooting

Missing clauses detected — Some contracts may not have standard clauses; handle null values gracefully
Risk assessment too generic — Provide more specific examples in the prompt about what constitutes high risk
Multi-page context lost — Ensure all pages are sent in a single request for full document context
Comparison results unclear — Request structured comparison format (differences, similarities, recommendations)
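Handling null clauses gracefully can be as simple as listing which expected clauses came back empty. The sketch below assumes the top-level keys from the clause-extraction prompt in this step; findMissingClauses() is a hypothetical helper.

```php
<?php
declare(strict_types=1);

/**
 * Return the names of expected clauses that are absent, null, or empty.
 *
 * @param string[] $expected clause keys to look for
 * @return string[]
 */
function findMissingClauses(
    array $clauses,
    array $expected = [
        'termination_clause',
        'payment_terms',
        'confidentiality',
        'liability',
        'intellectual_property',
        'dispute_resolution',
    ]
): array {
    return array_values(array_filter(
        $expected,
        static fn (string $name): bool => empty($clauses[$name])
    ));
}
```

The result can be reported next to the model's own missing_clauses list, so that a clause the model silently skipped still shows up.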

Analyze legal contracts and extract key terms:

php

<?php
# filename: src/Document/ContractAnalyzer.php
declare(strict_types=1);

namespace App\Document;

use ClaudePhp\ClaudePhp;
use App\Vision\ImageHelper;

class ContractAnalyzer
{
    public function __construct(
        private ClaudePhp $client
    ) {}

    public function analyzeContract(string $pdfPath): array
    {
        // Convert PDF pages to images
        $images = PDFProcessor::convertToImages($pdfPath);

        // First pass: Get overview
        $overview = $this->getContractOverview($images[0]['path']);

        // Second pass: Extract specific clauses
        $clauses = $this->extractClauses($images);

        // Third pass: Risk assessment
        $risks = $this->assessRisks($images);

        return [
            'overview' => $overview,
            'clauses' => $clauses,
            'risk_assessment' => $risks,
            'page_count' => count($images),
            'processed_at' => date('Y-m-d H:i:s')
        ];
    }

    private function getContractOverview(string $firstPageImage): array
    {
        $imageContent = ImageHelper::prepareImage($firstPageImage);

        $response = $this->client->messages()->create([
            'model' => 'claude-sonnet-4-5',
            'max_tokens' => 2048,
            'messages' => [
                [
                    'role' => 'user',
                    'content' => [
                        $imageContent,
                        [
                            'type' => 'text',
                            'text' => <<<PROMPT
Analyze this contract (first page) and provide:

{
  "contract_type": "string (e.g., Employment, NDA, Service Agreement)",
  "parties": [
    {"name": "string", "role": "string"}
  ],
  "effective_date": "YYYY-MM-DD or null",
  "expiration_date": "YYYY-MM-DD or null",
  "governing_law": "string (jurisdiction)",
  "contract_value": "string (if mentioned)",
  "key_obligations": ["array of main obligations"]
}

Return only JSON.
PROMPT
                        ]
                    ]
                ]
            ]
        ]);

        return json_decode($this->extractJSON($response->content[0]->text), true) ?? [];
    }

    private function extractClauses(array $images): array
    {
        // Prepare all pages
        $content = [];
        foreach ($images as $imageInfo) {
            $content[] = ImageHelper::prepareImage($imageInfo['path']);
        }

        $content[] = [
            'type' => 'text',
            'text' => <<<PROMPT
Extract these key clauses from the contract:

{
  "termination_clause": {
    "notice_period": "string",
    "conditions": ["array"],
    "penalties": "string"
  },
  "payment_terms": {
    "amount": "string",
    "frequency": "string",
    "payment_method": "string",
    "late_fees": "string"
  },
  "confidentiality": {
    "scope": "string",
    "duration": "string",
    "exceptions": ["array"]
  },
  "liability": {
    "limitations": "string",
    "indemnification": "string",
    "insurance_required": "boolean"
  },
  "intellectual_property": {
    "ownership": "string",
    "license_grants": ["array"]
  },
  "dispute_resolution": {
    "method": "string (litigation/arbitration/mediation)",
    "venue": "string"
  }
}

Return only JSON. Use null for clauses not found.
PROMPT
        ];

        $response = $this->client->messages()->create([
            'model' => 'claude-sonnet-4-5',
            'max_tokens' => 4096,
            'messages' => [
                ['role' => 'user', 'content' => $content]
            ]
        ]);

        return json_decode($this->extractJSON($response->content[0]->text), true) ?? [];
    }

    private function assessRisks(array $images): array
    {
        $content = [];
        foreach ($images as $imageInfo) {
            $content[] = ImageHelper::prepareImage($imageInfo['path']);
        }

        $content[] = [
            'type' => 'text',
            'text' => <<<PROMPT
Assess the risks in this contract and provide:

{
  "risk_level": "low/medium/high",
  "red_flags": [
    {
      "category": "string",
      "issue": "string",
      "severity": "low/medium/high",
      "recommendation": "string"
    }
  ],
  "missing_clauses": ["array of important missing protections"],
  "unfavorable_terms": ["array of potentially unfavorable terms"],
  "recommendations": ["array of suggested changes"],
  "overall_assessment": "string (2-3 sentences)"
}

Return only JSON.
PROMPT
        ];

        $response = $this->client->messages()->create([
            'model' => 'claude-sonnet-4-5',
            'max_tokens' => 4096,
            'messages' => [
                ['role' => 'user', 'content' => $content]
            ]
        ]);

        return json_decode($this->extractJSON($response->content[0]->text), true) ?? [];
    }

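    private function extractJSON(string $text): string
    {
        if (preg_match('/```json\s*(\{.*?\})\s*```/s', $text, $matches)) {
            return $matches[1];
        }
        // Greedy match: the schemas are nested, so capture through the last "}"
        // (a non-greedy match would stop at the first closing brace)
        if (preg_match('/\{.*\}/s', $text, $matches)) {
            return $matches[0];
        }
        return $text;
    }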

    public function compareContracts(string $pdfPath1, string $pdfPath2): string
    {
        $images1 = PDFProcessor::convertToImages($pdfPath1);
        $images2 = PDFProcessor::convertToImages($pdfPath2);

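        // Send every page of both contracts so the comparison sees the full text
        // (for long contracts, check request size and image-count limits first)
        $content = [['type' => 'text', 'text' => "Contract 1:\n"]];
        foreach ($images1 as $imageInfo) {
            $content[] = ImageHelper::prepareImage($imageInfo['path']);
        }
        $content[] = ['type' => 'text', 'text' => "\nContract 2:\n"];
        foreach ($images2 as $imageInfo) {
            $content[] = ImageHelper::prepareImage($imageInfo['path']);
        }
        $content[] = [
            'type' => 'text',
            'text' => 'Compare these contracts. Identify key differences in terms, obligations, and risk. Which is more favorable and why?'
        ];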

        $response = $this->client->messages()->create([
            'model' => 'claude-sonnet-4-5',
            'max_tokens' => 4096,
            'messages' => [
                ['role' => 'user', 'content' => $content]
            ]
        ]);

        return $response->content[0]->text;
    }
}

Step 4: Resume/CV Processing (~15 min)

Goal

Build a resume processor that extracts candidate information, matches resumes to job descriptions, and generates interview questions.

Actions

Create ResumeProcessor class for structured resume extraction
Extract comprehensive candidate data (experience, education, skills)
Implement job matching algorithm with scoring
Generate interview questions based on resume content

Expected Result

A complete resume processing system that can parse resumes, match candidates to jobs, and assist with interview preparation.

Why It Works

Resumes have varied formats but consistent information types. Claude's vision can understand different layouts and extract structured data. By providing a comprehensive JSON schema, we ensure all relevant information is captured. The matching algorithm uses Claude's understanding of job requirements to score candidates.

Troubleshooting

Skills not extracted — Resumes may list skills differently; use flexible extraction (keywords, sections, bullets)
Date parsing errors — Accept multiple date formats (MM/YYYY, Month YYYY, etc.)
Match score inconsistent — Provide clear scoring criteria in the prompt (skills weight, experience weight)
Missing work experience — Some resumes use non-standard formats; request extraction of all employment history
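For the date parsing item, one possible approach is to try a short list of known formats and normalize whatever matches to YYYY-MM. The sketch below assumes English month names and treats anything unparseable (including "Present") as null; normalizeResumeDate() and the format list are illustrative assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Normalize a resume date such as "03/2021" or "March 2021" to "YYYY-MM".
 * Returns null when none of the known formats match.
 */
function normalizeResumeDate(string $raw): ?string
{
    $raw = trim($raw);

    // The leading "!" resets unparsed fields, so the day defaults to 1
    // instead of today's day of the month.
    foreach (['!Y-m', '!m/Y', '!F Y', '!M Y'] as $format) {
        $date = DateTimeImmutable::createFromFormat($format, $raw);
        $errors = DateTimeImmutable::getLastErrors();
        $clean = $errors === false
            || ($errors['warning_count'] === 0 && $errors['error_count'] === 0);

        if ($date !== false && $clean) {
            return $date->format('Y-m');
        }
    }

    return null;
}
```

Callers can keep the original string alongside the normalized value, so that entries like "Present" are still available for display.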

Extract structured data from resumes:

php

<?php
# filename: src/Document/ResumeProcessor.php
declare(strict_types=1);

namespace App\Document;

use ClaudePhp\ClaudePhp;
use App\Vision\ImageHelper;

class ResumeProcessor
{
    public function __construct(
        private ClaudePhp $client
    ) {}

    public function processResume(string $pdfPath): array
    {
        $images = PDFProcessor::convertToImages($pdfPath);

        // Process all pages together for context
        $content = [];
        foreach ($images as $imageInfo) {
            $content[] = ImageHelper::prepareImage($imageInfo['path']);
        }

        $content[] = [
            'type' => 'text',
            'text' => <<<PROMPT
Extract comprehensive information from this resume/CV:

{
  "personal_info": {
    "name": "string",
    "email": "string",
    "phone": "string",
    "location": "string",
    "linkedin": "string",
    "website": "string"
  },
  "summary": "string (professional summary/objective)",
  "work_experience": [
    {
      "company": "string",
      "title": "string",
      "start_date": "YYYY-MM or string",
      "end_date": "YYYY-MM or 'Present'",
      "duration": "string",
      "responsibilities": ["array of key responsibilities"],
      "achievements": ["array of quantifiable achievements"]
    }
  ],
  "education": [
    {
      "institution": "string",
      "degree": "string",
      "field": "string",
      "graduation_date": "string",
      "gpa": "string or null",
      "honors": "string or null"
    }
  ],
  "skills": {
    "technical": ["array"],
    "languages": ["array"],
    "soft_skills": ["array"]
  },
  "certifications": [
    {
      "name": "string",
      "issuer": "string",
      "date": "string"
    }
  ],
  "projects": [
    {
      "name": "string",
      "description": "string",
      "technologies": ["array"],
      "url": "string or null"
    }
  ],
  "years_of_experience": number,
  "seniority_level": "entry/mid/senior/lead/executive"
}

Return only valid JSON.
PROMPT
        ];

        $response = $this->client->messages()->create([
            'model' => 'claude-sonnet-4-5',
            'max_tokens' => 4096,
            'messages' => [
                ['role' => 'user', 'content' => $content]
            ]
        ]);

        $jsonText = $this->extractJSON($response->content[0]->text);
        return json_decode($jsonText, true) ?? [];
    }

    public function matchJobDescription(array $resumeData, string $jobDescription): array
    {
        $resumeJson = json_encode($resumeData, JSON_PRETTY_PRINT);

        $response = $this->client->messages()->create([
            'model' => 'claude-sonnet-4-5',
            'max_tokens' => 2048,
            'messages' => [
                [
                    'role' => 'user',
                    'content' => <<<PROMPT
Resume Data:
{$resumeJson}

Job Description:
{$jobDescription}

Analyze the match between this resume and job description:

{
  "match_score": number (0-100),
  "matching_skills": ["array of skills that match"],
  "missing_skills": ["array of required skills not in resume"],
  "relevant_experience": ["array of relevant work experiences"],
  "strengths": ["why this candidate is a good fit"],
  "gaps": ["potential concerns or missing qualifications"],
  "recommendation": "strong_match/good_match/partial_match/poor_match",
  "summary": "string (2-3 sentence assessment)"
}

Return only JSON.
PROMPT
                ]
            ]
        ]);

        return json_decode($this->extractJSON($response->content[0]->text), true) ?? [];
    }

    public function generateInterviewQuestions(array $resumeData): array
    {
        $resumeJson = json_encode($resumeData, JSON_PRETTY_PRINT);

        $response = $this->client->messages()->create([
            'model' => 'claude-sonnet-4-5',
            'max_tokens' => 2048,
            'messages' => [
                [
                    'role' => 'user',
                    'content' => <<<PROMPT
Based on this resume, generate targeted interview questions:

{$resumeJson}

Provide:
{
  "technical_questions": ["5-7 technical questions based on their skills"],
  "experience_questions": ["5-7 behavioral questions about their work history"],
  "project_deep_dives": ["3-5 questions about specific projects"],
  "cultural_fit": ["3-5 questions to assess team fit"],
  "areas_to_probe": ["topics that need clarification"]
}

Return only JSON.
PROMPT
                ]
            ]
        ]);

        return json_decode($this->extractJSON($response->content[0]->text), true) ?? [];
    }

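    private function extractJSON(string $text): string
    {
        if (preg_match('/```json\s*(\{.*?\})\s*```/s', $text, $matches)) {
            return $matches[1];
        }
        // Greedy match: the schemas are nested, so capture through the last "}"
        // (a non-greedy match would stop at the first closing brace)
        if (preg_match('/\{.*\}/s', $text, $matches)) {
            return $matches[0];
        }
        return $text;
    }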
}

Step 5: Complete Document Processing Pipeline (~10 min)

Goal

Create an intelligent document pipeline that automatically detects document types and routes them to appropriate processors.

Actions

Build DocumentPipeline class with type detection
Implement automatic routing based on document type
Handle multiple document types (invoice, contract, resume, generic)
Process batch documents efficiently

Expected Result

A unified pipeline that can process any document type automatically without manual classification.

Why It Works

By using Claude's vision to detect document type first, we can route documents to specialized processors. This approach is more efficient than trying to process all document types with a single generic processor. The pipeline pattern makes it easy to add new document types in the future.
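Routing only works when the detection reply matches one of the expected labels exactly, and a reply such as "Invoice." or "This is a contract" would otherwise fall through to the generic branch. A minimal sketch of normalizing the reply first, assuming the label set from the detection prompt (normalizeDocumentType() is a hypothetical helper):

```php
<?php
declare(strict_types=1);

/**
 * Map a free-form detection reply to one of the known document labels.
 * Anything that does not mention a known label becomes "other".
 */
function normalizeDocumentType(string $reply): string
{
    $pattern = '/\b(invoice|contract|resume|report|letter)\b/';

    if (preg_match($pattern, strtolower($reply), $matches)) {
        return $matches[1];
    }

    return 'other';
}
```

The returned label can then be fed to the match expression, whose default arm already handles "report", "letter", and "other".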

Troubleshooting

Wrong document type detected — Improve detection prompt with examples of each document type
Processing fails for unknown types — Always have a generic fallback processor
Batch processing slow — Implement parallel processing or queue system for large batches
Memory issues with large batches — Process documents sequentially and clear resources between files
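The sequential approach from the last two items can be sketched with a generator that processes one document at a time and records failures instead of aborting the batch. processBatch() and its callable parameter are assumptions for illustration; in this chapter the callable would wrap DocumentPipeline::processDocument().

```php
<?php
declare(strict_types=1);

/**
 * Process documents one at a time, yielding a result per path.
 * A failure on one document is captured and does not stop the batch.
 *
 * @param iterable<string> $paths
 * @param callable(string): array $processOne
 */
function processBatch(iterable $paths, callable $processOne): Generator
{
    foreach ($paths as $path) {
        try {
            yield $path => ['ok' => true, 'result' => $processOne($path)];
        } catch (Throwable $e) {
            yield $path => ['ok' => false, 'error' => $e->getMessage()];
        }

        // Release cyclic references before moving to the next document.
        gc_collect_cycles();
    }
}
```

Because results are yielded lazily, the caller can write each one to storage as it arrives and never holds the whole batch in memory.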

php

<?php
# filename: examples/01-document-pipeline.php
declare(strict_types=1);

require __DIR__ . '/../vendor/autoload.php';

use ClaudePhp\ClaudePhp;
use App\Document\InvoiceProcessor;
use App\Document\ContractAnalyzer;
use App\Document\ResumeProcessor;
use App\Document\PDFProcessor;

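// Load ANTHROPIC_API_KEY from a .env file in the project root, if one exists
\Dotenv\Dotenv::createImmutable(__DIR__ . '/..')->safeLoad();

$client = new ClaudePhp(apiKey: $_ENV['ANTHROPIC_API_KEY']);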

class DocumentPipeline
{
    public function __construct(
        private ClaudePhp $client
    ) {}

    public function processDocument(string $pdfPath): array
    {
        // Detect document type
        $documentType = $this->detectDocumentType($pdfPath);

        echo "Detected document type: {$documentType}\n\n";

        // Process based on type
        return match($documentType) {
            'invoice' => $this->processAsInvoice($pdfPath),
            'contract' => $this->processAsContract($pdfPath),
            'resume' => $this->processAsResume($pdfPath),
            default => $this->processAsGeneric($pdfPath)
        };
    }

    private function detectDocumentType(string $pdfPath): string
    {
        $images = PDFProcessor::convertToImages($pdfPath);
        $firstPage = $images[0]['path'];

        $imageContent = \App\Vision\ImageHelper::prepareImage($firstPage);

        $response = $this->client->messages()->create([
            'model' => 'claude-sonnet-4-5',
            'max_tokens' => 200,
            'messages' => [
                [
                    'role' => 'user',
                    'content' => [
                        $imageContent,
                        [
                            'type' => 'text',
                            'text' => 'What type of document is this? Reply with one word: invoice, contract, resume, report, letter, or other.'
                        ]
                    ]
                ]
            ]
        ]);

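        // Normalize replies such as "Invoice." or "This is a contract"
        $reply = strtolower($response->content[0]->text);
        if (preg_match('/\b(invoice|contract|resume|report|letter)\b/', $reply, $matches)) {
            return $matches[1];
        }

        return 'other';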
    }

    private function processAsInvoice(string $pdfPath): array
    {
        $processor = new InvoiceProcessor($this->client);
        $data = $processor->processInvoice($pdfPath);
        $validation = $processor->validateInvoice($data);

        return [
            'type' => 'invoice',
            'data' => $data,
            'validation' => $validation
        ];
    }

    private function processAsContract(string $pdfPath): array
    {
        $analyzer = new ContractAnalyzer($this->client);
        $analysis = $analyzer->analyzeContract($pdfPath);

        return [
            'type' => 'contract',
            'analysis' => $analysis
        ];
    }

    private function processAsResume(string $pdfPath): array
    {
        $processor = new ResumeProcessor($this->client);
        $data = $processor->processResume($pdfPath);

        return [
            'type' => 'resume',
            'data' => $data
        ];
    }

    private function processAsGeneric(string $pdfPath): array
    {
        $textData = PDFProcessor::extractText($pdfPath);

        $response = $this->client->messages()->create([
            'model' => 'claude-sonnet-4-5',
            'max_tokens' => 2048,
            'messages' => [
                [
                    'role' => 'user',
                    'content' => "Summarize this document and extract key information:\n\n" . $textData['full_text']
                ]
            ]
        ]);

        return [
            'type' => 'generic',
            'summary' => $response->content[0]->text,
            'metadata' => PDFProcessor::getMetadata($pdfPath)
        ];
    }
}

// Example usage
$pipeline = new DocumentPipeline($client);

$documents = [
    __DIR__ . '/documents/invoice-001.pdf',
    __DIR__ . '/documents/employment-contract.pdf',
    __DIR__ . '/documents/resume-john-doe.pdf'
];

foreach ($documents as $docPath) {
    if (!file_exists($docPath)) {
        echo "Skipping missing file: {$docPath}\n";
        continue;
    }

    echo "Processing: " . basename($docPath) . "\n";
    echo str_repeat('=', 50) . "\n";

    $result = $pipeline->processDocument($docPath);

    echo json_encode($result, JSON_PRETTY_PRINT) . "\n\n";
}

Step 6: Performance Optimization (~5 min)

Goal

Implement caching to avoid redundant API calls and improve processing speed for previously analyzed documents.

Actions

Create DocumentCache class for storing processed results
Implement cache invalidation based on file modification time
Use file-based caching for simplicity and portability

Expected Result

A caching system that reduces API costs and speeds up document processing for repeated documents.

Why It Works

Document processing is expensive (multiple API calls per document). By caching results under a key derived from the file's path, size, and modification time, we can skip reprocessing unchanged documents. File-based caching is simple and doesn't require additional infrastructure.
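
The DocumentCache class in this step builds its key from the path, size, and modification time. Where the same file can arrive under different names (re-downloads, renamed uploads), a key derived from the file's bytes is one alternative. The sketch below is an illustration only; `contentCacheKey` is a hypothetical helper, not part of the chapter's code.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: derive a cache key from the PDF's bytes rather
 * than its path, so renamed or re-downloaded copies share one entry.
 */
function contentCacheKey(string $pdfPath): string
{
    // hash_file() streams the file, so large PDFs are not loaded into memory
    $hash = hash_file('sha256', $pdfPath);
    if ($hash === false) {
        throw new RuntimeException("Cannot read file for hashing: {$pdfPath}");
    }

    return $hash;
}
```

The trade-off is that hashing reads the whole file on every lookup, while the path/size/mtime key only needs a stat call.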

Troubleshooting

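Cache not invalidating — filemtime() returns a Unix timestamp, so timezones are not a factor; PHP caches stat results within a request, so call clearstatcache() if a file may change while the script is running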
Cache directory permissions — Set proper permissions (0755) and ensure writable
Cache growing too large — Implement cache size limits or TTL-based expiration
Stale cache data — Always check file modification time before using cached data
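
The TTL-based expiration mentioned above can be sketched as a small pruning function. `pruneDocumentCache` is a hypothetical helper and the one-hour TTL in the usage note is an arbitrary example value; it assumes the cache directory holds only `.json` entries, as DocumentCache writes them.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: delete cache entries older than $ttlSeconds.
 * Returns the number of files removed.
 */
function pruneDocumentCache(string $cacheDir, int $ttlSeconds, ?int $now = null): int
{
    $now ??= time();
    $removed = 0;

    // glob() returns false on error, so fall back to an empty list
    foreach (glob("{$cacheDir}/*.json") ?: [] as $file) {
        $modifiedAt = filemtime($file);
        if ($modifiedAt !== false && ($now - $modifiedAt) > $ttlSeconds && unlink($file)) {
            $removed++;
        }
    }

    return $removed;
}
```

Calling `pruneDocumentCache($cacheDir, 3600)` from a scheduled job would drop entries that have not been rewritten in the last hour.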

php

<?php
# filename: src/Document/DocumentCache.php
declare(strict_types=1);

namespace App\Document;

class DocumentCache
{
    private string $cacheDir;

    public function __construct(?string $cacheDir = null)
    {
        $this->cacheDir = $cacheDir ?? sys_get_temp_dir() . '/document_cache';
        if (!is_dir($this->cacheDir)) {
            mkdir($this->cacheDir, 0755, true);
        }
    }

    public function get(string $pdfPath): ?array
    {
        $cacheKey = $this->getCacheKey($pdfPath);
        $cachePath = "{$this->cacheDir}/{$cacheKey}.json";

        if (!file_exists($cachePath)) {
            return null;
        }

        // Check if PDF has been modified since cache
        if (filemtime($pdfPath) > filemtime($cachePath)) {
            unlink($cachePath);
            return null;
        }

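        $data = file_get_contents($cachePath);
        if ($data === false) {
            return null;
        }

        // Treat unreadable or corrupt cache entries as a cache miss
        $decoded = json_decode($data, true);
        return is_array($decoded) ? $decoded : null;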
    }

    public function set(string $pdfPath, array $data): void
    {
        $cacheKey = $this->getCacheKey($pdfPath);
        $cachePath = "{$this->cacheDir}/{$cacheKey}.json";

        file_put_contents($cachePath, json_encode($data, JSON_PRETTY_PRINT));
    }

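    private function getCacheKey(string $pdfPath): string
    {
        // The key changes whenever the path, size, or mtime changes,
        // so a modified PDF never matches an older cache entry
        return md5($pdfPath . filesize($pdfPath) . filemtime($pdfPath));
    }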

    public function clear(): void
    {
        // glob() returns false on error, so fall back to an empty list
        $files = glob("{$this->cacheDir}/*.json") ?: [];
        foreach ($files as $file) {
            unlink($file);
        }
    }
}

Step 7: Files API Alternative (~5 min)

Goal

Learn when and how to use the Files API (Beta) as an alternative to image conversion for persistent document storage and reuse.

When to Use Files API

Use the Files API instead of vision when:

You process the same documents multiple times
You need persistent storage across sessions
You want to reduce bandwidth for large files
File size exceeds practical base64 limits
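
To make the base64 point concrete: encoding turns every 3 raw bytes into 4 output characters, so a payload grows by roughly a third before it is sent. The sketch below computes the encoded size; `base64EncodedSize` is a hypothetical helper, and it says nothing about where the API's actual request limits lie.

```php
<?php
declare(strict_types=1);

/** Size in bytes of a payload after base64 encoding (with padding). */
function base64EncodedSize(int $rawBytes): int
{
    // Every group of up to 3 input bytes becomes 4 output characters
    return 4 * intdiv($rawBytes + 2, 3);
}

// Example: a 3 MB page image grows to about 4 MB once encoded
$raw = 3 * 1024 * 1024;
printf("%d bytes raw -> %d bytes encoded\n", $raw, base64EncodedSize($raw));
```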

Cost Comparison

php

<?php
# Illustrative figures only; check current pricing before budgeting.

# Vision API approach (current chapter)
$base64_overhead = 1.33; // base64 encoding
$vision_cost_per_page = 0.00075; // Page (low-res)
$total_images = 1000 * 10; // 1000 docs, 10 pages each
$vision_total = ($total_images * $vision_cost_per_page) * $base64_overhead;
echo "Vision API: \${$vision_total}\n"; // ~$10

# Files API approach
$files_cost_per_upload = 0.02; // File upload cost (one-time)
$files_cost_per_use = 0.00075; // Image cost when used
$files_total = (1000 * $files_cost_per_upload) + ($total_images * $files_cost_per_use);
echo "Files API: \${$files_total}\n"; // ~$27.50 (but reusable)

# Batch API approach (with Files API)
$batch_discount = 0.5; // 50% off
$batch_cost_per_page = $vision_cost_per_page * $batch_discount;
// Apply the discount once (it is already in $batch_cost_per_page), then the encoding overhead
$batch_total = ($total_images * $batch_cost_per_page) * $base64_overhead;
echo "Batch API: \${$batch_total}\n"; // ~$5

Files API Example

php

<?php
# filename: examples/06-files-api-approach.php
declare(strict_types=1);

require __DIR__ . '/../vendor/autoload.php';

use ClaudePhp\ClaudePhp;

$client = new ClaudePhp(apiKey: $_ENV['ANTHROPIC_API_KEY']);

// Upload PDF file once
$pdfPath = __DIR__ . '/documents/invoice.pdf';
$fileHandle = fopen($pdfPath, 'r');

// Upload file (Beta feature - check docs for availability)
// $fileResponse = $client->beta()->files()->upload([
//     'file' => $fileHandle,
//     'mime_type' => 'application/pdf'
// ]);
// $fileId = $fileResponse->id;

// Reuse file multiple times without re-uploading
// $response = $client->messages()->create([
//     'model' => 'claude-sonnet-4-5',
//     'max_tokens' => 4096,
//     'messages' => [
//         [
//             'role' => 'user',
//             'content' => [
//                 [
//                     'type' => 'document',
//                     'source' => [
//                         'type' => 'file',
//                         'file_id' => $fileId
//                     ]
//                 ],
//                 [
//                     'type' => 'text',
//                     'text' => 'Extract invoice data...'
//                 ]
//             ]
//         ]
//     ]
// ]);

echo "Note: Files API is in Beta. See latest Claude API docs for current availability.\n";
echo "For production use, check: https://docs.claude.com/en/docs/capabilities/files-api\n";

Why Choose Each Approach

Feature	Vision API	Files API	Batch API
Real-time processing	✅ Yes	✅ Yes	❌ Async only
Persistent storage	❌ No	✅ Yes	✅ Yes
Setup complexity	✅ Simple	⚠️ Moderate	❌ Complex
Cost per document	Medium	Lower (reuse)	Lowest (50% off)
Processing speed	Fast	Fast	Slow (async)
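
The comparison above can be read as a simple decision rule. The sketch below encodes it; `chooseProcessingApproach` is a hypothetical helper, and the 1000-document threshold is the chapter's rule of thumb rather than an API limit.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical decision helper mirroring the comparison table.
 * Returns 'batch', 'files', or 'vision'.
 */
function chooseProcessingApproach(int $documentCount, bool $needsRealtime, bool $reusesDocuments): string
{
    if (!$needsRealtime && $documentCount >= 1000) {
        return 'batch';  // Lowest cost, asynchronous results
    }
    if ($reusesDocuments) {
        return 'files';  // Upload once, reference by file ID
    }

    return 'vision';     // Simple setup, immediate results
}
```

For example, `chooseProcessingApproach(5000, false, false)` selects the batch route, while a single invoice that needs an immediate answer falls through to the vision approach.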

Step 8: Batch Processing for Scale (~10 min)

Goal

Learn to use the Batch API for cost-effective processing of large document volumes (1000+).

When Batch Processing Makes Sense

Use Batch API when:

Processing 1000+ documents
Cost savings (50% discount) matter more than speed
Processing can happen asynchronously
You can wait for results (most batches finish within an hour, but processing can take up to 24 hours)

Batch Processing Example

php

<?php
# filename: examples/07-batch-processing.php
declare(strict_types=1);

require __DIR__ . '/../vendor/autoload.php';

use ClaudePhp\ClaudePhp;

$client = new ClaudePhp(apiKey: $_ENV['ANTHROPIC_API_KEY']);

class BatchDocumentProcessor
{
    public function __construct(
        private ClaudePhp $client
    ) {}

    /**
     * Create batch requests for multiple documents
     */
    public function createBatch(array $documents): array
    {
        $requests = [];

        foreach ($documents as $index => $docPath) {
            $imageData = base64_encode(file_get_contents($docPath));

            $requests[] = [
                'custom_id' => "doc-{$index}",
                'params' => [
                    'model' => 'claude-sonnet-4-5',
                    'max_tokens' => 4096,
                    'messages' => [
                        [
                            'role' => 'user',
                            'content' => [
                                [
                                    'type' => 'image',
                                    'source' => [
                                        'type' => 'base64',
                                        'media_type' => 'image/png',
                                        'data' => $imageData
                                    ]
                                ],
                                [
                                    'type' => 'text',
                                    'text' => 'Extract invoice data as JSON...'
                                ]
                            ]
                        ]
                    ]
                ]
            ];
        }

        return $requests;
    }

    /**
     * Submit batch for processing
     * Note: Actual Batch API implementation depends on SDK version
     */
    public function submitBatch(array $requests): string
    {
        // Batch API integration pattern (check latest SDK docs)
        // $response = $this->client->batch()->create([
        //     'requests' => $requests
        // ]);
        // return $response->id;

        echo "Batch API requires Anthropic SDK v0.7+\n";
        echo "Savings: 50% off API costs for batch processing\n";
        echo "Processing time: 1+ hours\n";
        return "batch_example_id";
    }

    /**
     * Poll for batch completion
     */
    public function waitForCompletion(string $batchId): array
    {
        // Poll status until processing has ended
        // (the Message Batches API reports 'in_progress', 'canceling', or 'ended')
        // $batch = $this->client->batch()->retrieve($batchId);
        // while ($batch->processing_status !== 'ended') {
        //     sleep(30);
        //     $batch = $this->client->batch()->retrieve($batchId);
        // }
        // return $this->processBatchResults($batch->request_counts);

        // Placeholder result for illustration only
        return [
            'processed' => 1000,
            'succeeded' => 995,
            'failed' => 5,
            'cost_savings' => '50%'
        ];
    }
}

// Usage
$processor = new BatchDocumentProcessor($client);

$documents = [
    'invoice-001.png',
    'invoice-002.png',
    // ... 998 more documents
];

$requests = $processor->createBatch($documents);
echo "Created " . count($requests) . " batch requests\n";
echo "Each request costs 50% less than standard API\n";
echo "Check Anthropic documentation for Batch API integration\n";
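
Because each request embeds a base64 image, a list of 1000 requests can become a large payload. One way to handle that is to split the list into several batches. The sketch below is an assumption-laden illustration: `chunkBatchRequests` is a hypothetical helper, and the count and size limits are parameters you would set from the current Batch API documentation.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: split batch requests into chunks that stay under
 * a request-count limit and an approximate payload-size limit.
 * A single request larger than $maxBytes still gets a chunk of its own.
 *
 * @param array<int, array> $requests
 * @return array<int, array<int, array>>
 */
function chunkBatchRequests(array $requests, int $maxCount, int $maxBytes): array
{
    $chunks = [];
    $current = [];
    $currentBytes = 0;

    foreach ($requests as $request) {
        // Approximate the request's size by its JSON-encoded length
        $size = strlen(json_encode($request, JSON_THROW_ON_ERROR));
        $full = count($current) >= $maxCount || $currentBytes + $size > $maxBytes;

        if ($current !== [] && $full) {
            $chunks[] = $current;
            $current = [];
            $currentBytes = 0;
        }

        $current[] = $request;
        $currentBytes += $size;
    }

    if ($current !== []) {
        $chunks[] = $current;
    }

    return $chunks;
}
```

Each chunk could then be passed to `submitBatch()` separately, with the returned batch IDs tracked together.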

Cost Savings Calculation

php

<?php
// Illustrative figures; assumes one low-res page image per invoice
$cost_per_doc = 0.00075; // Same per-page cost used in the Step 7 comparison
$doc_count = 1000;

// Standard API: 1000 invoices
$standard_cost = $doc_count * $cost_per_doc; // $0.75 in total
echo "Standard API (1000 docs): \$" . $standard_cost . "\n";

// Batch API: Same 1000 invoices
$batch_cost = $standard_cost * 0.5; // 50% discount
echo "Batch API (1000 docs): \$" . $batch_cost . "\n";

// Savings
$savings = $standard_cost - $batch_cost;
echo "You save: \$" . $savings . "\n";

Best Practices

1. Image Quality Optimization

php

// Use appropriate DPI for document type (pick one)
// Call setResolution() before readImage(); afterwards it has no effect on a PDF
$imagick->setResolution(150, 150); // Standard documents
// $imagick->setResolution(300, 300); // High-quality scans or small text

2. Error Handling

php

try {
    $images = PDFProcessor::convertToImages($pdfPath);
} catch (\RuntimeException $e) {
    error_log("PDF conversion failed: " . $e->getMessage());
    // Fallback to text extraction
    $textData = PDFProcessor::extractText($pdfPath);
}

3. Caching Strategy

php

$cache = new DocumentCache();
if (($cached = $cache->get($pdfPath)) !== null) { // An empty cached array is still a hit
    return $cached;
}

$result = $processor->processInvoice($pdfPath);
$cache->set($pdfPath, $result);
return $result;

4. Multi-Page Document Handling

php

// Process all pages together for context
$allPages = [];
foreach ($images as $imageInfo) {
    $allPages[] = ImageHelper::prepareImage($imageInfo['path']);
}
// Send all pages in single request for better context

5. Validation and Error Recovery

php

$validation = $processor->validateInvoice($data);
if (!$validation['valid']) {
    // Log errors and attempt correction
    foreach ($validation['errors'] as $error) {
        error_log("Invoice validation error: {$error}");
    }
    // Optionally request Claude to fix errors
}
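
One way to approach the optional correction step is to feed the validation errors back in a follow-up prompt. The sketch below only builds that prompt text; `buildCorrectionPrompt` is a hypothetical helper, and the wording of the prompt is an example rather than a tested template.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: turn validation errors into a follow-up prompt
 * asking for a corrected JSON object.
 *
 * @param array<string, mixed> $data   Previously extracted invoice data
 * @param string[]             $errors Messages from the validator
 */
function buildCorrectionPrompt(array $data, array $errors): string
{
    $errorList = implode("\n", array_map(
        static fn (string $error): string => "- {$error}",
        $errors
    ));

    return "The extracted invoice data below failed validation.\n\n"
        . "Errors:\n{$errorList}\n\n"
        . "Extracted data:\n"
        . json_encode($data, JSON_PRETTY_PRINT | JSON_THROW_ON_ERROR) . "\n\n"
        . "Re-check the invoice and return only the corrected JSON object.";
}
```

The prompt would be sent together with the original page images, and the corrected result should be validated again before it is trusted.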

Troubleshooting

PDF Conversion Issues

Problem: Imagick extension not found

bash

# Ubuntu/Debian
sudo apt-get install php-imagick

# macOS
brew install imagemagick
pecl install imagick

# Verify installation
php -m | grep imagick

Problem: PDF conversion produces blank images

Check if PDF is password-protected
Verify PDF is not corrupted: file document.pdf
Try increasing DPI: setResolution(300, 300)

Data Extraction Issues

Problem: Claude returns invalid JSON

php

// Add robust JSON extraction
private function extractJSON(string $text): string
{
    // Try code block extraction first
    if (preg_match('/```json\s*(\{.*?\})\s*```/s', $text, $matches)) {
        return $matches[1];
    }
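    // Try plain JSON object: greedy match from the first "{" to the last "}"
    // so nested objects are not cut off at the first closing brace
    if (preg_match('/\{.*\}/s', $text, $matches)) {
        return $matches[0];
    }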
    // Last resort: return as-is and let json_decode handle it
    return $text;
}
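
Once the JSON text has been isolated, decoding can still fail. A minimal sketch of a defensive decode step, assuming the caller wants `null` rather than an exception on bad input (`decodeJsonObject` is a hypothetical helper):

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: decode JSON text, returning null instead of
 * throwing when the text is invalid or does not decode to an array.
 */
function decodeJsonObject(string $json): ?array
{
    try {
        $decoded = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
    } catch (JsonException $e) {
        error_log('JSON decode failed: ' . $e->getMessage());
        return null;
    }

    // Scalars such as "42" are valid JSON but not usable as extracted data
    return is_array($decoded) ? $decoded : null;
}
```

A `null` result is a natural trigger for retrying the extraction with a stricter prompt.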

Problem: Missing data in extracted results

Increase max_tokens for complex documents
Break complex extractions into multiple passes
Provide more specific examples in prompts

Performance Issues

Problem: Processing is too slow

Implement caching for repeated documents
Process pages in parallel where possible
Use lower DPI for faster conversion (150 vs 300)
Batch similar documents together

Problem: High API costs

Cache all processed documents
Use text extraction for simple documents (no vision needed)
Combine multiple analyses into single requests
Implement rate limiting and queuing
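
As a rough illustration of the rate limiting mentioned above, the sketch below enforces a minimum gap between calls within a single PHP process. `CallThrottle` is a hypothetical class; it does not coordinate across workers, which would need shared state such as a queue or cache.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical throttle: enforces a minimum gap between API calls.
 */
final class CallThrottle
{
    private ?float $lastCallAt = null;

    public function __construct(private float $minIntervalSeconds) {}

    /** Seconds still to wait at time $now before the next call is allowed. */
    public function secondsToWait(float $now): float
    {
        if ($this->lastCallAt === null) {
            return 0.0;
        }

        return max(0.0, $this->minIntervalSeconds - ($now - $this->lastCallAt));
    }

    /** Block until a call is allowed, then record it. */
    public function acquire(): void
    {
        $wait = $this->secondsToWait(microtime(true));
        if ($wait > 0) {
            usleep((int) round($wait * 1_000_000));
        }
        $this->lastCallAt = microtime(true);
    }
}
```

Calling `$throttle->acquire()` immediately before each API request keeps a loop over many documents from sending requests faster than the chosen interval.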

Memory Issues

Problem: Out of memory errors with large PDFs

php

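// Read one page at a time instead of loading the whole PDF into memory
$probe = new \Imagick();
$probe->pingImage($pdfPath); // Reads metadata only, no pixel data
$pageCount = $probe->getNumberImages();

for ($pageIndex = 0; $pageIndex < $pageCount; $pageIndex++) {
    $page = new \Imagick();
    $page->readImage("{$pdfPath}[{$pageIndex}]"); // ImageMagick page selector
    $page->writeImage($imagePath); // Per-page output path
    // Process immediately, then clear
    $page->clear();
}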

Key Takeaways

Core Techniques

✓ Convert PDFs to images for best results with Claude
✓ Invoice processing extracts structured financial data automatically
✓ Contract analysis identifies risks, terms, and missing clauses
✓ Resume processing enables automated candidate screening
✓ Multi-page documents require combining data across pages
✓ Always validate extracted data for accuracy
✓ Cache processed results to avoid redundant API calls
✓ Combine text extraction with vision for comprehensive analysis
✓ Use specific prompts and schemas for reliable structured output
✓ Document classification enables smart routing to specialized processors

Scaling and Optimization

✓ Vision API (this chapter) — Best for real-time processing with formatting preservation
✓ Files API — Use for persistent storage when processing same documents multiple times
✓ Batch API — Use for 1000+ documents to save 50% on costs (async processing)
✓ Choose your approach based on volume, speed requirements, and budget
✓ Combine with Chapter 31 (RAG) for building document knowledge bases
✓ See Chapter 39 for cost optimization strategies

Production Ready

✓ Implement robust error handling for PDF conversion and extraction
✓ Use queue systems (Chapter 19) for asynchronous processing
✓ Monitor and log all document processing operations
✓ Implement rate limiting for API calls
✓ Secure sensitive document data (Chapter 36)
✓ Plan for scaling as document volume grows (Chapter 38)

💻 Code Samples

All code examples from this chapter are available in the GitHub repository:

View Chapter 14 Code Samples

Clone and run locally:

bash

git clone https://github.com/dalehurley/codewithphp.git
cd codewithphp/code/claude-php/chapter-14
composer install
export ANTHROPIC_API_KEY="sk-ant-your-key-here"
php examples/01-document-pipeline.php
