Mass Document Processing for Daimler
An AI-Driven Solution for Vendor Management
Project Introduction
Daimler, one of the world's leading automotive manufacturers, operates multiple assembly lines for car and truck manufacturing across its facilities. Building and maintaining these assembly lines requires coordination with numerous vendors who supply essential materials and equipment, including aluminum, iron rods, steel components, paint, specialized machinery, and other equipment necessary for vehicle production.
The procurement process involved working with a diverse network of vendors, many of them small to medium-sized businesses. These vendors would finalize contracts that included pricing structures for their materials and services. However, market conditions frequently fluctuated, leading to regular price adjustments that needed to be documented and tracked. The department responsible for vendor management was creating PDF documents to record these contracts, price changes, and vendor details.
Over time, as the volume of documentation grew, the department lost track of critical information: which vendor supplied which materials, at what prices, how frequently prices changed, and how prices had trended historically. This lack of visibility made it difficult for management to make informed procurement decisions and negotiate favorable terms with vendors.
To address this challenge, Daimler decided to leverage artificial intelligence to automatically read all scanned PDF documents, extract relevant information, build a comprehensive database, and create an analytical dashboard on the frontend. The vision was to create a system where entering a company number would instantly display a beautiful, interactive dashboard with graphs, statistics, and other relevant information.
The Challenge
This project, undertaken during a period when AI and computer vision technologies were still in their relative infancy compared to today's standards (around 2017-2018), presented several significant challenges:
Volume and Format Diversity
Hundreds to thousands of scanned documents in various image formats (JPEG, PNG, TIFF). Each vendor used their own invoice and contract format with no standardization. Some documents were professionally formatted, while others were handwritten or used unconventional layouts.
Technology Limitations
AI and OCR technologies were not as advanced as today. Deep learning models for document understanding were in early stages, and pre-trained models for German language document processing were limited.
Document Quality Issues
Poor scanning resolution, skewed or rotated pages, background noise and artifacts, faded text, varying lighting conditions, mixed languages (German primary with English technical terms), and complex table structures.
Complex Information Extraction
The system needed to extract vendor names, identification numbers, item descriptions, unit specifications, prices, tax information, dates, application information, and tabular data while preserving structure.
Accuracy Requirements
Given the financial nature of the data and its importance for management decision-making, the system required high accuracy to be reliable for business operations. Target: 99% OCR accuracy and 90% information extraction accuracy.
Table Digitization Challenge
Beyond simple text extraction, the system needed to digitize tables from scanned documents while maintaining their structure, which is crucial for preserving the relationships between items, quantities, unit prices, and totals.
The Solution: How I Solved the Challenge
Phase 1: Document Preprocessing Pipeline
Optimizing scanned images for maximum OCR accuracy
▸ Noise Reduction
Implemented advanced image filtering techniques to remove scanning artifacts, dust, and background noise while preserving text clarity.
▸ Image Straightening
Developed algorithms to detect page orientation and automatically rotate skewed documents to ensure text was horizontally aligned.
▸ Image Enhancement
Applied contrast adjustment, brightness normalization, and sharpening filters to improve text visibility, especially for faded or low-quality scans.
▸ Resolution Optimization
Standardized image resolution to optimal levels for OCR processing, upscaling low-resolution images where necessary.
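As an illustration of the noise reduction step, here is a minimal sketch of a median filter, one common way to remove isolated specks from a scan. The original filters are not specified, so the choice of a median filter, the function name `median_filter`, and the tiny list-of-lists image are assumptions made for this example.

```python
from statistics import median


def median_filter(image, size=3):
    """Return a copy of a grayscale image (list of rows) where each pixel is
    replaced by the median of its size x size neighbourhood.
    Border pixels use whatever neighbours exist inside the image."""
    height, width = len(image), len(image[0])
    radius = size // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(int(median(window)))
        result.append(row)
    return result


# Hypothetical input: a white page (255) with one dark speck of scanner noise.
page = [[255] * 5 for _ in range(5)]
page[2][2] = 0
cleaned = median_filter(page)
```

A median filter suits this kind of noise because a single outlier pixel never becomes the median of its neighbourhood, while larger dark shapes such as text strokes are largely preserved.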
Phase 2: Custom YOLO Model for Layout Detection
Training a custom object detection model for document structure understanding
At this time, YOLO (You Only Look Once) models were just emerging as powerful object detection tools. I leveraged an early version of YOLO with a critical innovation: the model was trained not just to detect regions, but also to extract text along with its precise position within the document.
Custom Training Dataset Creation
Manually labeled hundreds of existing invoices and contracts to create a comprehensive training dataset. Each label marked a key region of interest in the document.
Text Position Detection
The model was specifically trained to recognize not only what text was present but also the exact coordinates (x, y position) of each text element. This positional information was crucial for:
- ✓ Maintaining table structure during digitization
- ✓ Understanding relationships between related fields (item → quantity → price → total)
- ✓ Reconstructing the logical flow of information from the original document
- ✓ Preserving hierarchical relationships in nested data
ML Training Pipeline on Local GPU Infrastructure
Hardware & Framework:
- • NVIDIA GPU workstations (Tesla/RTX series) for accelerated model training
- • TensorFlow/PyTorch with CUDA support
- • Jenkins pipeline to automate the training process
Model Outputs:
- • Bounding boxes around each relevant section
- • Text content extracted from within those boxes
- • X, Y coordinates of each text element
- • Confidence scores for each detection
- • Relationship mapping between text elements
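The model outputs listed above could be carried through the pipeline in a small record type. The sketch below is one possible shape, not the project's actual data model: the class name `Detection`, its field names, the example labels, and the 0.5 threshold are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str         # hypothetical region class, e.g. "vendor_name"
    text: str          # text content read from inside the box
    x: float           # left edge of the bounding box, in pixels
    y: float           # top edge of the bounding box, in pixels
    width: float
    height: float
    confidence: float  # detection confidence between 0 and 1

    @property
    def center(self):
        """Centre point of the box, useful for ordering and grouping."""
        return (self.x + self.width / 2, self.y + self.height / 2)


def keep_confident(detections, threshold=0.5):
    """Drop detections whose confidence is below the threshold."""
    return [d for d in detections if d.confidence >= threshold]


# Hypothetical detections for one page.
dets = [
    Detection("vendor_name", "Example GmbH", 40, 30, 200, 24, 0.97),
    Detection("price", "2,50", 410, 300, 60, 20, 0.42),
]
confident = keep_confident(dets)
```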
Phase 3: Image Enhancement for Extracted Regions
Post-extraction optimization for maximum OCR accuracy
Background Removal
- • Adaptive thresholding to separate text from background
- • Morphological operations to clean up noise
- • Background color normalization
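To show what adaptive thresholding does, here is a minimal pure-Python sketch of the mean-based variant: each pixel is compared against the average of its own neighbourhood rather than a single global cut-off. The function name, the window size, the `offset` value, and the sample pixel values are assumptions; a real pipeline would more likely use an image library than nested lists.

```python
def adaptive_threshold(image, window=3, offset=10):
    """Binarise a grayscale image (list of rows). A pixel becomes black (0)
    when it is darker than its local neighbourhood mean minus `offset`,
    otherwise white (255)."""
    height, width = len(image), len(image[0])
    radius = window // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            neighbours = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(neighbours) / len(neighbours)
            row.append(0 if image[y][x] < local_mean - offset else 255)
        out.append(row)
    return out


# Hypothetical scan with uneven lighting (darker on the left) and a dark
# text stroke in the middle column.
scan = [
    [100, 110, 40, 180, 200],
    [100, 110, 40, 180, 200],
    [100, 110, 40, 180, 200],
]
binary = adaptive_threshold(scan)
```

With a single global threshold of 128, the dim left-hand background (100, 110) would be turned black along with the text. The local comparison keeps it white and isolates only the stroke.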
Text Clarity Enhancement
- • Contrast enhancement using CLAHE
- • Sharpening filters for distinct text edges
- • Binarization for pure black/white conversion
Resolution Upscaling
- • Super-resolution techniques for low-quality extractions
- • Text quality improvement before OCR
- • Intelligent image reconstruction
Phase 4: OCR with Tesseract Customization
Fine-tuned OCR for German industrial documents
German Language Optimization
Tesseract ships with pre-trained language models, including German, but these required significant customization for the specific vocabulary and formatting used in German industrial contracts and invoices.
Custom Dictionary
Built a specialized dictionary containing technical terms specific to automotive manufacturing, vendor names and abbreviations, material specifications, and German compound words.
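One way a domain dictionary can help is as a post-processing pass over OCR output. The sketch below does that with Python's standard `difflib`. Note that this is a different technique from loading a word list into Tesseract itself, and it is shown only to illustrate the idea. The vocabulary entries, function names, and the 0.8 similarity cutoff are assumptions.

```python
from difflib import get_close_matches

# Hypothetical domain vocabulary; the real dictionary is not given in the text.
DOMAIN_TERMS = [
    "Aluminiumblech",
    "Stahlträger",
    "Lackierung",
    "Stückpreis",
    "Rechnungsnummer",
]


def correct_token(token, vocabulary=DOMAIN_TERMS, cutoff=0.8):
    """Replace a token with its closest dictionary term if it is similar
    enough; otherwise return the token unchanged."""
    matches = get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_line(line):
    """Apply dictionary correction to each whitespace-separated token."""
    return " ".join(correct_token(t) for t in line.split())


# "rn" misread in place of "m" is a typical OCR confusion.
fixed = correct_line("Alurniniumblech 500 kg")
```

The cutoff is a trade-off: set too low, it rewrites legitimate words that merely resemble a dictionary term; set too high, it misses real OCR errors.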
Character Recognition Training
Fine-tuned Tesseract for common fonts used in vendor documents, varying text sizes and styles, handwritten annotations, and special characters and currency symbols.
Position-Aware OCR
Leveraged positional information from YOLO to process text in correct reading order, maintain spatial relationships, and preserve table structure by assigning text to correct cells.
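A minimal sketch of how positions can restore reading order: boxes are grouped into lines by vertical proximity, then each line is sorted left to right. The `(text, x, y)` tuple layout, the function name, and the 10-pixel tolerance are assumptions for this example.

```python
def reading_order(boxes, line_tolerance=10):
    """Group text boxes into lines and return the lines top to bottom.

    boxes: list of (text, x, y) tuples, where x is the left edge and y the
    vertical centre of the box. A box joins the current line when its y is
    within `line_tolerance` pixels of the first box on that line."""
    lines = []
    for text, x, y in sorted(boxes, key=lambda b: b[2]):
        if lines and abs(y - lines[-1]["y"]) <= line_tolerance:
            lines[-1]["items"].append((x, text))
        else:
            lines.append({"y": y, "items": [(x, text)]})
    # Within each line, order the fragments from left to right.
    return [" ".join(t for _, t in sorted(line["items"])) for line in lines]


# Hypothetical OCR fragments, deliberately out of order.
fragments = [
    ("2,50", 300, 102),
    ("Rechnung", 20, 20),
    ("Aluminiumblech", 20, 100),
    ("Nr.", 120, 22),
    ("500", 200, 98),
]
ordered = reading_order(fragments)
```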
Phase 5: Table Digitization and Structure Preservation
One of the most critical innovations in this project
Table Structure Recognition
Using the positional data from the YOLO model and OCR output:
- ▸ Identified table boundaries
- ▸ Detected row and column separators (even when lines were faint or missing)
- ▸ Recognized merged cells and complex table layouts
- ▸ Determined header rows and data rows
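When ruling lines are faint or missing, rows and columns can still be inferred from where the text sits. The sketch below clusters word positions along each axis and assigns every word to the nearest row and column. It is a simplified illustration that ignores merged cells; the function names and gap thresholds are assumptions.

```python
def cluster(values, gap):
    """Group 1-D coordinates: a jump larger than `gap` starts a new group.
    Returns the mean coordinate of each group, in ascending order."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= gap:
            groups[-1].append(v)
        else:
            groups.append([v])
    return [sum(g) / len(g) for g in groups]


def nearest(centres, value):
    """Index of the cluster centre closest to `value`."""
    return min(range(len(centres)), key=lambda i: abs(centres[i] - value))


def build_grid(words, col_gap=40, row_gap=10):
    """words: list of (text, x, y). Returns a list of rows; cells with no
    text are empty strings."""
    cols = cluster([x for _, x, _ in words], col_gap)
    rows = cluster([y for _, _, y in words], row_gap)
    grid = [["" for _ in cols] for _ in rows]
    for text, x, y in words:
        r, c = nearest(rows, y), nearest(cols, x)
        grid[r][c] = (grid[r][c] + " " + text).strip()
    return grid


# Hypothetical word positions; the last row has no quantity cell.
words = [
    ("Item", 20, 10), ("Qty", 200, 11), ("Price", 300, 9),
    ("Paint", 22, 40), ("12", 203, 41), ("9.90", 298, 39),
    ("Steel", 21, 70), ("7.10", 301, 71),
]
grid = build_grid(words)
```

Because columns are derived from all rows together, a missing cell shows up as an empty string in the right position instead of shifting later values to the left.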
Structured Data Generation
Converted visual table structure into structured data format:
{
  "table_id": "invoice_123_table_1",
  "headers": ["Item Description", "Quantity", "Unit", "Unit Price (€)", "Total (€)"],
  "rows": [
    {
      "item_description": "Aluminum Sheets 2mm",
      "quantity": "500",
      "unit": "kg",
      "unit_price": "2.50",
      "total": "1250.00"
    },
    ...
  ]
}
This structured format allowed for easy database insertion and querying, making the data immediately useful for analytics and reporting.
Phase 6: Information Extraction and Validation
Ensuring data quality and accuracy
Structured Data Extraction
- • Regular expressions for price patterns (€, EUR formats)
- • Date parsing with multiple format support
- • Number extraction and normalization
- • Text classification for item descriptions
- • Vendor identification using pattern matching
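The price and date rules above can be sketched with the standard library. This example assumes German number formatting (period as thousands separator, comma as decimal separator) and a small set of date formats. The actual patterns used in the project are not given, so the regular expression, the format list, and the function names are assumptions.

```python
import re
from datetime import datetime
from decimal import Decimal

# German-style amount: optional thousands groups with ".", then ",dd".
AMOUNT = r"\d+(?:\.\d{3})*,\d{2}"
# The currency marker may come before or after the amount.
PRICE_RE = re.compile(rf"(?:€|EUR)\s*({AMOUNT})|({AMOUNT})\s*(?:€|EUR)")

# Assumed date formats, tried in order.
DATE_FORMATS = ("%d.%m.%Y", "%d.%m.%y", "%Y-%m-%d", "%d/%m/%Y")


def parse_prices(text):
    """Return every price found in the text as a normalised Decimal."""
    prices = []
    for match in PRICE_RE.finditer(text):
        raw = match.group(1) or match.group(2)
        prices.append(Decimal(raw.replace(".", "").replace(",", ".")))
    return prices


def parse_date(raw):
    """Parse a date string using the first matching format, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


prices = parse_prices("Stückpreis: 2,50 € Gesamt EUR 1.250,00")
```

Using `Decimal` rather than `float` avoids rounding surprises when the values are later compared or summed.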
Validation Rules
- • Price reasonableness checks (flagging anomalies)
- • Date sequence validation
- • Required field completeness
- • Cross-reference validation between fields
- • Tax calculation verification
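Several of these rules reduce to simple arithmetic checks. The sketch below covers required-field completeness, cross-field validation (quantity times unit price against the total), and tax verification. The field names follow the structured table format shown earlier; the one-cent tolerance and the default 19% rate (the German standard VAT rate) are assumptions about what applied to a given document.

```python
from decimal import Decimal

REQUIRED = ("item_description", "quantity", "unit_price", "total")


def validate_row(row, tolerance=Decimal("0.01")):
    """Return a list of problems found in one extracted table row.
    An empty list means the row passed every check."""
    missing = [f for f in REQUIRED if not row.get(f)]
    if missing:
        return [f"missing field: {f}" for f in missing]

    quantity = Decimal(row["quantity"])
    unit_price = Decimal(row["unit_price"])
    total = Decimal(row["total"])

    issues = []
    if unit_price <= 0:
        issues.append("non-positive unit price")
    if abs(quantity * unit_price - total) > tolerance:
        issues.append("total does not match quantity x unit price")
    return issues


def validate_tax(net, tax, rate=Decimal("0.19"), tolerance=Decimal("0.01")):
    """Check that the stated tax equals net * rate within the tolerance."""
    return abs(net * rate - tax) <= tolerance


good_row = {
    "item_description": "Aluminum Sheets 2mm",
    "quantity": "500",
    "unit": "kg",
    "unit_price": "2.50",
    "total": "1250.00",
}
issues = validate_row(good_row)
```

A mismatch between the computed and stated total is often a sign of an OCR digit error, which makes this check a cheap way to catch misreads.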
Confidence Scoring: Each extracted piece of information was assigned a confidence score. Documents with low overall confidence scores were flagged for manual review, ensuring high data quality.
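A minimal sketch of how per-field scores might be combined into a review decision. The aggregation rule (mean plus a floor on the weakest field) and both thresholds are assumptions; the text does not say how the overall score was computed.

```python
def needs_review(field_scores, min_mean=0.85, min_field=0.6):
    """Decide whether a document should go to manual review.

    field_scores: dict mapping field name to a confidence between 0 and 1.
    A document is flagged when its average confidence is low, when any
    single field is very uncertain, or when nothing was extracted."""
    if not field_scores:
        return True
    scores = list(field_scores.values())
    mean_score = sum(scores) / len(scores)
    return mean_score < min_mean or min(scores) < min_field


# Hypothetical scores for two documents.
clean_doc = {"vendor": 0.98, "price": 0.95, "date": 0.91}
doubtful_doc = {"vendor": 0.99, "price": 0.55, "date": 0.99, "tax": 0.99}
```

Checking the weakest field separately matters for financial data: `doubtful_doc` has a healthy average, yet its price is the one value that cannot be trusted.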
Phase 7: Database Architecture and Frontend Dashboard
Making the data accessible and actionable
PostgreSQL Database Design
Created a normalized relational database schema to store extracted information efficiently with proper indexing for fast queries on vendor IDs, dates, and pricing data.
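The actual schema is not given, so the following is only a sketch of what a normalized layout with vendor and date indexing could look like. To keep the example runnable without a database server, it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL. All table names, column names, and sample rows are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE vendor (
    vendor_id      INTEGER PRIMARY KEY,
    company_number TEXT NOT NULL UNIQUE,
    name           TEXT NOT NULL
);
CREATE TABLE price_record (
    record_id        INTEGER PRIMARY KEY,
    vendor_id        INTEGER NOT NULL REFERENCES vendor(vendor_id),
    item_description TEXT NOT NULL,
    unit             TEXT,
    unit_price_cents INTEGER NOT NULL,  -- integer cents avoid float rounding
    valid_from       TEXT NOT NULL      -- ISO 8601 date
);
CREATE INDEX idx_price_vendor_date ON price_record (vendor_id, valid_from);
"""


def price_history(conn, company_number, item):
    """Return (date, price in cents) pairs for one vendor item, oldest first."""
    return conn.execute(
        """
        SELECT p.valid_from, p.unit_price_cents
        FROM price_record AS p
        JOIN vendor AS v ON v.vendor_id = p.vendor_id
        WHERE v.company_number = ? AND p.item_description = ?
        ORDER BY p.valid_from
        """,
        (company_number, item),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO vendor (company_number, name) VALUES (?, ?)",
    ("DE-0001", "Example Metall GmbH"),
)
conn.executemany(
    "INSERT INTO price_record (vendor_id, item_description, unit,"
    " unit_price_cents, valid_from) VALUES (1, ?, 'kg', ?, ?)",
    [
        ("Aluminum Sheets 2mm", 265, "2018-01-15"),
        ("Aluminum Sheets 2mm", 250, "2017-03-01"),
    ],
)
history = price_history(conn, "DE-0001", "Aluminum Sheets 2mm")
```

Keeping one row per price change, rather than overwriting the current price, is what makes historical trend queries by company number possible.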
Frontend Dashboard (Brief Overview)
The frontend dashboard was designed to provide instant access to vendor analytics and historical data:
- • Interactive Search: Enter a company number to instantly retrieve all vendor information and historical data
- • Visual Analytics: Beautiful graphs showing price trends over time, contract modifications, and spending patterns
- • Statistics Dashboard: Key metrics including total spending, number of contracts, average prices, and price change frequency
- • Document Viewer: Access to original scanned documents with extracted data overlay for verification
Complete Technology Stack
Computer Vision
OCR
Infrastructure
Database
Languages
ML Tools
DevOps
Frontend
Project Impact
Transformed Daimler's vendor management operations by automatically digitizing thousands of scanned documents. The system achieved 99% OCR accuracy and 90% information extraction accuracy for critical pricing and vendor data, and gave management an intuitive dashboard for data-driven procurement decisions, enabling better vendor negotiations and cost optimization across the supply chain.