
Mass Document Processing for Daimler

An AI-Driven Solution for Vendor Management

Project Introduction

Daimler, one of the world's leading automotive manufacturers, operates multiple assembly lines for car and truck manufacturing across their facilities. The construction and maintenance of these assembly lines require coordination with numerous vendors who supply essential materials and equipment including aluminum, iron rods, steel components, paint, specialized machinery, and various other equipment necessary for vehicle production.

The procurement process involved working with a diverse network of vendors, many of them small to medium-sized businesses. These vendors would finalize contracts that included pricing structures for their materials and services. However, market conditions frequently fluctuated, leading to regular price adjustments that needed to be documented and tracked. The department responsible for vendor management was creating PDF documents to record these contracts, price changes, and vendor details.

Over time, as the volume of documentation grew exponentially, the department lost track of critical information: which vendor supplied what materials, at what prices, how frequently prices changed, and the historical pricing trends. This lack of visibility created significant challenges for management in making informed procurement decisions and negotiating favorable terms with vendors.

To address this challenge, Daimler decided to leverage artificial intelligence to automatically read all scanned PDF documents, extract relevant information, build a comprehensive database, and create an analytical dashboard on the frontend. The vision was to create a system where entering a company number would instantly display a beautiful, interactive dashboard with graphs, statistics, and other relevant information.

  • Thousands of documents processed
  • 90% information extraction accuracy
  • 18B tokens processed
  • 99% OCR accuracy

The Challenge

This project was undertaken around 2017-2018, when AI and computer vision technologies were still in their relative infancy compared to today's standards, and it presented several significant challenges:

📚

Volume and Format Diversity

Hundreds to thousands of scanned documents in various image formats (JPEG, PNG, TIFF). Each vendor used their own invoice and contract format with no standardization. Some documents were professionally formatted, while others were handwritten or used unconventional layouts.

⚙️

Technology Limitations

AI and OCR technologies were not as advanced as today. Deep learning models for document understanding were in early stages, and pre-trained models for German language document processing were limited.

📄

Document Quality Issues

Poor scanning resolution, skewed or rotated pages, background noise and artifacts, faded text, varying lighting conditions, mixed languages (German primary with English technical terms), and complex table structures.

🔍

Complex Information Extraction

The system needed to extract vendor names, identification numbers, item descriptions, unit specifications, prices, tax information, dates, application information, and tabular data while preserving structure.

🎯

Accuracy Requirements

Given the financial nature of the data and its importance for management decision-making, the system required high accuracy to be reliable for business operations. Target: 99% OCR accuracy and 90% information extraction accuracy.

📊

Table Digitization Challenge

Beyond simple text extraction, the system needed to digitize tables from scanned documents while maintaining their structure—crucial for preserving relationships between items, quantities, unit prices, and totals.

The Solution: How I Solved the Challenge

Phase 1: Document Preprocessing Pipeline

Optimizing scanned images for maximum OCR accuracy

Noise Reduction

Implemented advanced image filtering techniques to remove scanning artifacts, dust, and background noise while preserving text clarity.

Image Straightening

Developed algorithms to detect page orientation and automatically rotate skewed documents to ensure text was horizontally aligned.

Image Enhancement

Applied contrast adjustment, brightness normalization, and sharpening filters to improve text visibility, especially for faded or low-quality scans.

Resolution Optimization

Standardized image resolution to optimal levels for OCR processing, upscaling low-resolution images where necessary.
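
A minimal sketch of this preprocessing stage, assuming OpenCV and the classic minimum-area-rectangle deskewing approach (function names, thresholds, and the target width are illustrative, not the exact production code):

import cv2
import numpy as np

def preprocess_scan(path, target_width=2480):
    """Denoise, deskew, and enhance a scanned page before OCR."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # Noise reduction: remove scanning artifacts while preserving text edges
    img = cv2.fastNlMeansDenoising(img, h=10)

    # Deskew: estimate the page rotation from the minimum-area rectangle
    # around the dark (text) pixels, then rotate the page back to horizontal.
    # Note: the angle convention of minAreaRect changed in OpenCV >= 4.5.
    coords = np.column_stack(np.where(img < 128)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    angle = -(90 + angle) if angle < -45 else -angle
    h, w = img.shape
    M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    img = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC,
                         borderMode=cv2.BORDER_REPLICATE)

    # Contrast and brightness normalization for faded scans
    img = cv2.normalize(img, None, 0, 255, cv2.NORM_MINMAX)

    # Resolution optimization: upscale narrow scans to a consistent width
    if w < target_width:
        scale = target_width / w
        img = cv2.resize(img, None, fx=scale, fy=scale,
                         interpolation=cv2.INTER_CUBIC)
    return img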

Phase 2: Custom YOLO Model for Layout Detection

Training a custom object detection model for document structure understanding

At the time, YOLO (You Only Look Once) models were just emerging as powerful object detection tools. I leveraged an early version of YOLO with a critical innovation: the model was trained not just to detect regions, but also to extract text along with its precise position within the document.

Custom Training Dataset Creation

Manually labeled hundreds of existing invoices and contracts to create a comprehensive training dataset. Each label identified key regions of interest:

Vendor information section
Item description area
Pricing table structure
Tax information
Total amount
Date fields
Usage notes
Table cells
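
Each labeled region was stored in the standard YOLO annotation format: one line per bounding box, containing a class id followed by the normalized center x, center y, width, and height. The class ids and coordinates below are purely illustrative (for example, class 0 for the vendor information section and class 2 for the pricing table):

0 0.18 0.09 0.30 0.07
2 0.50 0.55 0.92 0.38
5 0.82 0.12 0.20 0.04
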
Text Position Detection

The model was specifically trained to recognize not only what text was present but also the exact coordinates (x, y position) of each text element. This positional information was crucial for:

  • Maintaining table structure during digitization
  • Understanding relationships between related fields (item → quantity → price → total)
  • Reconstructing the logical flow of information from the original document
  • Preserving hierarchical relationships in nested data

ML Training Pipeline on Local GPU Infrastructure

Hardware & Framework:

  • NVIDIA GPU workstations (Tesla/RTX series) for accelerated model training
  • TensorFlow/PyTorch with CUDA support
  • Jenkins pipeline to automate the training process

Model Outputs:

  • Bounding boxes around each relevant section
  • Text content extracted from within those boxes
  • X, Y coordinates of each text element
  • Confidence scores for each detection
  • Relationship mapping between text elements
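
To make the hand-off to the later stages concrete, here is a minimal sketch of how those outputs can be represented and brought into reading order downstream (the Region structure and helper functions are illustrative, not the production API):

from dataclasses import dataclass

@dataclass
class Region:
    label: str          # e.g. "vendor_info", "pricing_table", "table_cell"
    x: int              # top-left corner in pixels
    y: int
    w: int
    h: int
    confidence: float

def to_reading_order(regions, row_tolerance=15):
    """Sort detected regions top-to-bottom, then left-to-right.

    Regions whose vertical centers fall within row_tolerance pixels are
    treated as one line, which keeps table cells in their row order.
    """
    def key(r):
        return (round((r.y + r.h / 2) / row_tolerance), r.x)
    return sorted(regions, key=key)

def group_by_label(regions):
    """Bucket regions by class so each extractor only sees its sections."""
    grouped = {}
    for r in regions:
        grouped.setdefault(r.label, []).append(r)
    return grouped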

Phase 3: Image Enhancement for Extracted Regions

Post-extraction optimization for maximum OCR accuracy

Background Removal
  • Adaptive thresholding to separate text from background
  • Morphological operations to clean up noise
  • Background color normalization
Text Clarity Enhancement
  • Contrast enhancement using CLAHE
  • Sharpening filters for distinct text edges
  • Binarization for pure black/white conversion
Resolution Upscaling
  • Super-resolution techniques for low-quality extractions
  • Text quality improvement before OCR
  • Intelligent image reconstruction
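
A rough OpenCV sketch of these enhancement steps for a single cropped region (the parameter values are indicative only):

import cv2

def enhance_region(crop):
    """Clean up a cropped grayscale region before it is passed to OCR."""
    # Contrast enhancement using CLAHE (adaptive histogram equalization)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    crop = clahe.apply(crop)

    # Unsharp masking gives characters crisper edges
    blurred = cv2.GaussianBlur(crop, (0, 0), sigmaX=3)
    crop = cv2.addWeighted(crop, 1.5, blurred, -0.5, 0)

    # Adaptive thresholding separates text from uneven backgrounds
    binary = cv2.adaptiveThreshold(crop, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 31, 15)

    # Morphological opening removes leftover speckle noise
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

    # Upscale very small crops so the OCR engine sees enough pixels per glyph
    h, w = binary.shape
    if h < 40:
        binary = cv2.resize(binary, None, fx=2, fy=2,
                            interpolation=cv2.INTER_CUBIC)
    return binary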

Phase 4: OCR with Tesseract Customization

Fine-tuned OCR for German industrial documents

German Language Optimization

Tesseract ships with pre-trained language models, but these required significant customization for the specific vocabulary and formatting used in German industrial contracts and invoices.

Custom Dictionary

Built a specialized dictionary containing technical terms specific to automotive manufacturing, vendor names and abbreviations, material specifications, and German compound words.

Character Recognition Training

Fine-tuned Tesseract for common fonts used in vendor documents, varying text sizes and styles, handwritten annotations, and special characters and currency symbols.

Position-Aware OCR

Leveraged positional information from YOLO to process text in correct reading order, maintain spatial relationships, and preserve table structure by assigning text to correct cells.
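
A minimal sketch of position-aware OCR with pytesseract, assuming a German language pack and a user-words file built from the custom dictionary (the file name and config values are illustrative):

import pytesseract

# German language model, a custom word list with automotive and vendor
# vocabulary, and a page segmentation mode suited to single text blocks
# cropped out by the layout detector.
TESS_CONFIG = "--psm 6 --user-words german_vendor_terms.txt"

def ocr_region(region_image, offset_x=0, offset_y=0):
    """Run OCR on one cropped region and keep word-level positions.

    Word coordinates are shifted by the region's offset so every word is
    expressed in the coordinate system of the full page.
    """
    data = pytesseract.image_to_data(
        region_image, lang="deu", config=TESS_CONFIG,
        output_type=pytesseract.Output.DICT)

    words = []
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue
        words.append({
            "text": text,
            "conf": float(data["conf"][i]),
            "x": data["left"][i] + offset_x,
            "y": data["top"][i] + offset_y,
            "w": data["width"][i],
            "h": data["height"][i],
        })
    return words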

Phase 5: Table Digitization and Structure Preservation

One of the most critical innovations in this project

Table Structure Recognition

Using the positional data from the YOLO model and OCR output:

  • Identified table boundaries
  • Detected row and column separators (even when lines were faint or missing)
  • Recognized merged cells and complex table layouts
  • Determined header rows and data rows
Structured Data Generation

Converted visual table structure into structured data format:

{
  "table_id": "invoice_123_table_1",
  "headers": ["Item Description", "Quantity", "Unit", "Unit Price (€)", "Total (€)"],
  "rows": [
    {
      "item_description": "Aluminum Sheets 2mm",
      "quantity": "500",
      "unit": "kg",
      "unit_price": "2.50",
      "total": "1250.00"
    },
    ...
  ]
}

This structured format allowed for easy database insertion and querying, making the data immediately useful for analytics and reporting.
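
As an illustration of how the word-level positions can be turned into that structure, words can be clustered into rows by their vertical centers and then assigned to columns using the x-ranges of the detected header cells (a simplified sketch, not the exact production logic):

def words_to_rows(words, row_tolerance=12):
    """Group word boxes (dicts with x, y, w, h, text) into table rows."""
    rows = {}
    for word in sorted(words, key=lambda w: (w["y"], w["x"])):
        center = word["y"] + word["h"] / 2
        row_key = round(center / row_tolerance)
        rows.setdefault(row_key, []).append(word)
    return [sorted(row, key=lambda w: w["x"]) for _, row in sorted(rows.items())]

def assign_columns(row, column_spans):
    """Assign each word in a row to the column whose x-range contains it.

    column_spans maps a column name to its (x_start, x_end) range,
    taken from the detected header cells.
    """
    cells = {name: [] for name in column_spans}
    for word in row:
        for name, (x0, x1) in column_spans.items():
            if x0 <= word["x"] < x1:
                cells[name].append(word["text"])
                break
    return {name: " ".join(parts) for name, parts in cells.items()}

Each assembled row can then be mapped directly onto the JSON structure shown above before it is written to the database.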

Phase 6: Information Extraction and Validation

Ensuring data quality and accuracy

Structured Data Extraction
  • Regular expressions for price patterns (€, EUR formats)
  • Date parsing with multiple format support
  • Number extraction and normalization
  • Text classification for item descriptions
  • Vendor identification using pattern matching
Validation Rules
  • Price reasonableness checks (flagging anomalies)
  • Date sequence validation
  • Required field completeness
  • Cross-reference validation between fields
  • Tax calculation verification

Confidence Scoring: Each extracted piece of information was assigned a confidence score. Documents with low overall confidence scores were flagged for manual review, ensuring high data quality.
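
A condensed sketch of these extraction and validation helpers (the patterns, thresholds, and field names are indicative; the production rules were considerably more extensive):

import re
from datetime import datetime

# Matches prices such as "1.250,00 €", "EUR 2,50" or "2.50"
# (German and English decimal conventions).
PRICE_RE = re.compile(
    r"(?:€|EUR)?\s*(\d{1,3}(?:[.\s]\d{3})*,\d{2}|\d+(?:\.\d{2})?)\s*(?:€|EUR)?")

DATE_FORMATS = ("%d.%m.%Y", "%d.%m.%y", "%Y-%m-%d", "%d/%m/%Y")

def parse_price(text):
    """Extract a price and normalize it to a float value in euros."""
    match = PRICE_RE.search(text)
    if not match:
        return None
    value = match.group(1).replace(" ", "")
    if "," in value:                      # German format, e.g. 1.250,00
        value = value.replace(".", "").replace(",", ".")
    return float(value)

def parse_date(text):
    """Try the date formats that commonly appeared in the documents."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None

def validate_line_item(item):
    """Basic plausibility checks; any failure flags the document for review."""
    issues = []
    if item["unit_price"] is None or item["unit_price"] <= 0:
        issues.append("missing or non-positive unit price")
    if item.get("quantity") and item.get("unit_price") and item.get("total"):
        expected = round(item["quantity"] * item["unit_price"], 2)
        if abs(expected - item["total"]) > 0.01:
            issues.append("quantity x unit price does not match total")
    return issues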

Phase 7: Database Architecture and Frontend Dashboard

Making the data accessible and actionable

PostgreSQL Database Design

Created a normalized relational database schema to store extracted information efficiently with proper indexing for fast queries on vendor IDs, dates, and pricing data.
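
A simplified sketch of what such a schema might look like, expressed as DDL executed from Python with psycopg2 (the table and column names here are illustrative, not the actual production schema):

import psycopg2

SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS vendors (
    vendor_id      SERIAL PRIMARY KEY,
    company_number TEXT UNIQUE NOT NULL,
    name           TEXT NOT NULL
);

CREATE TABLE IF NOT EXISTS documents (
    document_id    SERIAL PRIMARY KEY,
    vendor_id      INTEGER REFERENCES vendors(vendor_id),
    document_date  DATE,
    source_file    TEXT,
    ocr_confidence NUMERIC(5, 2)
);

CREATE TABLE IF NOT EXISTS line_items (
    line_item_id SERIAL PRIMARY KEY,
    document_id  INTEGER REFERENCES documents(document_id),
    description  TEXT,
    quantity     NUMERIC(12, 3),
    unit         TEXT,
    unit_price   NUMERIC(12, 2),
    total        NUMERIC(14, 2)
);

-- Indexes backing the dashboard's most common lookups
CREATE INDEX IF NOT EXISTS idx_vendors_company_number ON vendors (company_number);
CREATE INDEX IF NOT EXISTS idx_documents_vendor_date ON documents (vendor_id, document_date);
"""

def create_schema(dsn):
    """Create the tables and indexes if they do not already exist."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(SCHEMA_SQL)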

Frontend Dashboard (Brief Overview)

The frontend dashboard was designed to provide instant access to vendor analytics and historical data:

  • Interactive Search: Enter a company number to instantly retrieve all vendor information and historical data
  • Visual Analytics: Beautiful graphs showing price trends over time, contract modifications, and spending patterns
  • Statistics Dashboard: Key metrics including total spending, number of contracts, average prices, and price change frequency
  • Document Viewer: Access to original scanned documents with extracted data overlay for verification

Complete Technology Stack

Computer Vision

YOLO, TensorFlow, PyTorch, OpenCV

OCR

Tesseract, Custom Models, CLAHE, Image Processing

Infrastructure

Kubernetes, Docker, Jenkins, EKS

Database

PostgreSQL, Data Modeling, SQL, Indexing

Languages

Python, JavaScript, SQL, Bash

ML Tools

CUDA, GPU Training, NVIDIA Tesla, Model Optimization

DevOps

CI/CD, Automation, Version Control, Testing

Frontend

React, Chart.js, Dashboard, Analytics

Project Impact

  • 90% information extraction accuracy, with 99% OCR accuracy achieved
  • 1.5 years project duration, from concept to production
  • Thousands of documents successfully processed

Transformed Daimler's vendor management operations by automatically digitizing thousands of scanned documents, achieving 99% OCR accuracy and 90% information extraction accuracy for critical pricing and vendor data. Management gained an intuitive dashboard for data-driven procurement decisions, enabling better vendor negotiations and cost optimization across the entire supply chain.