
Tool 01 · Algorithm Deep Dive

Document Intelligence

LayoutLMv3 + Azure Document Intelligence

94.2% Macro-F1
91.6% Table Extraction
3,200 SAP RFP Pages
133M Parameters

🎯 Why This Algorithm

📋 Problem Statement

RFPs arrive as scanned PDFs, native DOCX, tables, and email threads — a pure OCR model misses structure, and a pure text parser misses scans. A table header on page 14 and its continuation on page 15 get processed as two separate, disconnected entities. Critical scope items get lost or misclassified.

✅ Solution

LayoutLMv3 fuses text + layout + image patches in a single multimodal transformer. It understands that text aligned in columns with borders is a table, that bold text at the top is a heading, and that indented bullet points form a list. Tables, headers, and list items land in the right spot.

🧩 What It Comprises

🤖 Core Model

LayoutLMv3 — multimodal transformer, 133M parameters. Fine-tuned on 3,200 SAP RFP pages for section/table/form-field classification.

☁️ OCR Pre-Reader

Azure Document Intelligence — pre-reads scanned pages, providing OCR confidence scores and initial text extraction.

📄 Deterministic Fallback

unstructured.io + PyMuPDF — deterministic fallback for native digital files (DOCX, text-based PDFs).

🔀 Triage Router

Lightweight classifier that routes: Scanned Image → OCR Path; Digital Text → Fast Parse; Complex Table → Table Transformer.
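The routing logic can be sketched as a simple per-page heuristic. This is a minimal illustration, not the production classifier; the `Page` fields and thresholds are assumptions chosen for the example:

```python
from dataclasses import dataclass

@dataclass
class Page:
    text_chars: int   # characters recoverable from the embedded text layer
    has_images: bool  # page contains raster content
    ruled_lines: int  # count of horizontal/vertical rules (a table hint)

def route(page: Page) -> str:
    """Toy triage heuristic: choose a processing path for one page."""
    if page.text_chars < 50 and page.has_images:
        return "ocr"         # scanned image -> OCR path (Azure Document Intelligence)
    if page.ruled_lines >= 6:
        return "table"       # dense ruling -> Table Transformer
    return "fast_parse"      # native digital text -> PyMuPDF
```

For example, a page with almost no text layer but raster content routes to `"ocr"`, while a text-rich page with many rules routes to `"table"`.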

📥 Inputs & 📤 Outputs

📥 Inputs

  • PDF, DOCX, EML, XLSX, PNG/JPG
  • Max 500-page documents
  • Multi-language (EN/DE/FR/ES/JP)

📤 Outputs

  • Structured blocks (section, heading, paragraph, table, list)
  • Per-block bounding box (x, y, w, h)
  • Per-block OCR confidence
  • Clean JSON with provenance (source page, offset)
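The output contract above can be expressed as a small schema. A sketch under assumed field names (the real JSON schema is not specified in this document beyond the fields listed):

```python
import json
from dataclasses import dataclass, asdict
from typing import Tuple

@dataclass
class Block:
    type: str                                # section | heading | paragraph | table | list
    text: str
    bbox: Tuple[float, float, float, float]  # per-block bounding box (x, y, w, h)
    confidence: float                        # per-block OCR confidence
    page: int                                # provenance: source page
    offset: int                              # provenance: character offset

blocks = [Block("heading", "4.2 Scope of Work", (72, 90, 410, 24), 0.99, 14, 0)]
payload = json.dumps([asdict(b) for b in blocks])  # clean JSON with provenance
```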

🔄 How It Runs — Step by Step

📊 Pipeline Flowchart

DOCUMENT INTELLIGENCE PIPELINE

  INPUT: PDF, DOCX, EML, XLSX, PNG/JPG (max 500 pages)
            │
            ▼
  STEP 1: TRIAGE ROUTER
    • Scanned Image → OCR Path (Azure Document Intelligence)
    • Digital Text  → Fast Parse (PyMuPDF)
    • Complex Table → Table Transformer (specialized)
            │
            ▼
  STEP 2: LAYOUT ANALYSIS (LayoutLMv3, multimodal transformer, 133M params)
    Text Embeddings + Layout Embeddings + Visual Embeddings
            │
            ▼
    Unified Multimodal Attention
            │
            ▼
    Tags: [HEADING] [PARAGRAPH] [TABLE] [LIST] [FIGURE]
            │
            ▼
  STEP 3: NORMALIZE
    • Merge fragmented text blocks
    • Reconstruct tables into JSON structure (rows, cells)
    • Preserve reading order
            │
            ▼
  STEP 4: EMIT
    {
      "page": 14,
      "blocks": [
        {"type": "heading", "text": "4.2 Scope of Work", ...},
        {"type": "table", "rows": [...], "confidence": 0.96, ...}
      ]
    }
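Step 3's "merge fragmented text blocks" can be sketched as follows. This is an illustrative heuristic, not the production normalizer; the block dict shape and the 5-pixel line tolerance are assumptions for the example:

```python
def merge_fragments(blocks):
    """Merge consecutive same-type fragments whose boxes sit on the same line.

    Each block is a dict: {"type": str, "text": str, "bbox": (x, y, w, h)}.
    """
    merged = []
    for b in blocks:
        if merged:
            prev = merged[-1]
            same_line = abs(prev["bbox"][1] - b["bbox"][1]) < 5  # y within 5 px
            if prev["type"] == b["type"] and same_line:
                prev["text"] = prev["text"].rstrip() + " " + b["text"].lstrip()
                # Widen the bbox: keep y/h, extend to the right edge of b
                x, y, w, h = prev["bbox"]
                bx, _, bw, _ = b["bbox"]
                prev["bbox"] = (x, y, (bx + bw) - x, h)
                continue
        merged.append(dict(b))
    return merged
```

Two fragments of one paragraph that an extractor split mid-line come back as a single block with a combined bounding box.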
                    

🏗️ Architecture & Integration

Where Document Intelligence Sits in A²AI

📤 RFP Upload (PDF, DOCX, Images)
        ↓
🔍 TOOL 01: Document Intelligence (LayoutLMv3 + OCR)
        ↓
📋 Structured JSON (Sections, Tables, Text)
        ↓ Feeds Into ↓
TOOL 02 Requirements Extraction · TOOL 03 Module Router · TOOL 06 RFP Chat (Index) · TOOL 11 Compliance Matcher

Tool 01 is the ENTRY POINT of the entire A²AI pipeline. No downstream tool can function without clean, structured input from Document Intelligence.

📐 Mathematical Explanation

LayoutLMv3 Unified Attention Mechanism:

Attention(Q, K, V) = softmax( QKᵀ / √dₖ ) V

Where Q, K, V are derived from three distinct embeddings:

1. Text Embedding (E_text): Word token embeddings from BPE tokenizer
2. Layout Embedding (E_layout): 2D positional embeddings (x₀, y₀, x₁, y₁) of each bounding box
3. Visual Embedding (E_visual): CNN feature map of the page image patch (ResNet backbone)
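The unified attention over the three fused embedding streams can be sketched numerically. A toy NumPy version with random stand-ins for the learned embeddings and projection matrices (the real model uses learned layers and multiple heads):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 32, 32

# Toy stand-ins for the three embedding streams
E_text   = rng.standard_normal((seq_len, d_model))
E_layout = rng.standard_normal((seq_len, d_model))
E_visual = rng.standard_normal((seq_len, d_model))
H = E_text + E_layout + E_visual  # fused input to attention

Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
Q, K, V = H @ Wq, H @ Wk, H @ Wv

# Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
out = weights @ V
```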

Pre-training Objective (Masked Visual-Language Modeling):

L_MVLM = -𝔼_{(W,I,M)} [ Σ_{i∈M} log P(w_i | W_{\M}, I) ]

Where:
• W = text sequence
• I = image patches
• M = set of masked token indices
• W_{\M} = the text sequence with the tokens at the masked positions hidden (the model conditions on the unmasked context plus the image)
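The MVLM objective can be computed directly from the definition above. A minimal NumPy sketch (toy logits standing in for the model's output given W_{\M} and I):

```python
import numpy as np

def mvlm_loss(logits, targets, masked_idx):
    """L_MVLM: negative log-likelihood of the true tokens at masked positions.

    logits: (seq_len, vocab) scores given the unmasked context and image patches;
    targets: (seq_len,) true token ids; masked_idx: the positions in M.
    """
    # Softmax over the vocabulary dimension
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # Sum -log P(w_i | W_{\M}, I) over i in M
    return -np.sum(np.log(probs[masked_idx, targets[masked_idx]]))
```

With uniform logits over a vocabulary of 4 and two masked positions, the loss is 2·log 4, matching the formula term by term.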

Fine-tuning for SAP RFP Classification:

P(class | x) = softmax( W_class · h_CLS + b )

Where h_CLS is the final hidden state of the [CLS] token after 12 transformer layers.
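The classification head is a single affine map plus softmax over h_CLS. A minimal sketch (3 hypothetical classes, toy weights; the real head's dimensions come from the fine-tuned checkpoint):

```python
import numpy as np

def classify(h_cls, W_class, b):
    """P(class | x) = softmax(W_class · h_CLS + b)."""
    logits = W_class @ h_cls + b
    z = np.exp(logits - logits.max())  # numerically stable softmax
    return z / z.sum()
```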

📊 Measured Performance

Metric                               Value   Benchmark Dataset
──────────────────────────────────────────────────────────────────────────
Macro-F1 (Overall)                   94.2%   1,200 held-out SAP RFP pages
Table Cell Extraction (Exact Match)  91.6%   Internal RFP benchmark
Section Header Classification       96.8%   Multi-level heading hierarchy test
List Item Detection                  93.2%   Bulleted and numbered lists

📚 Training & Calibration Set

  • Size: Fine-tuned on 3,200 annotated pages
  • Source: Sampled across pharma, CPG, manufacturing, and financial-services RFPs (2022–2025)
  • Annotation: Double-annotated by senior document analysts; inter-annotator agreement 0.91 Cohen's κ
  • Augmentation: Synthetic noise, rotation, scaling applied during training
  • Data Split: 80% train / 10% validation / 10% test (stratified by industry)
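An 80/10/10 split stratified by industry can be sketched with the standard library. This is an illustration of the splitting scheme, not the project's actual tooling; the page dict shape is an assumption:

```python
import random

def stratified_split(pages, key=lambda p: p["industry"], seed=7):
    """80/10/10 train/val/test split, stratified so each industry appears in every split."""
    by_group = {}
    for p in pages:
        by_group.setdefault(key(p), []).append(p)
    train, val, test = [], [], []
    rng = random.Random(seed)
    for group in by_group.values():
        rng.shuffle(group)
        n = len(group)
        n_train, n_val = int(n * 0.8), int(n * 0.1)
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test
```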

🎬 End-to-End Example

Scenario: 450-Page Pharma RFP with Complex Tables

  1. T=0s: Client uploads RFP_Pharma_S4_Migration.pdf (450 pages, mixed scanned/digital)
  2. Triage: Router identifies 312 pages as digital text, 138 as scanned images
  3. Layout Analysis: LayoutLMv3 processes scanned pages; PyMuPDF handles digital pages
  4. Table Reconstruction: Table on pages 87-89 (GxP Validation Requirements) reconstructed as structured JSON
  5. Output: 847 structured blocks emitted with confidence scores and provenance
  6. Downstream: JSON feeds directly into Tool 02 (Requirements Extraction) and Tool 06 (RFP Chat index)

Result: 40+ hours of manual document triage reduced to under 2 minutes.