
Tool 12 · Algorithm Deep Dive

Analogous Project Retrieval

Hybrid Search: BM25 + Dense Embeddings (pgvector)

Precision@10: 0.82
Indexed Projects: 42
Embedding Dims: 1,024
Metadata Filters: 7

🎯 Why This Algorithm

📋 Problem Statement

Two shapes of recall matter: keyword precision ('Ariba migration 2024') and semantic understanding ('something like ours but in CPG'). Pure keyword search misses synonyms; pure semantic search misses exact matches. Consultants need both to find relevant analogues for estimation.

✅ Solution

Hybrid Search: BM25 for keyword precision + Dense embeddings (bge-large) for semantic recall. Reciprocal Rank Fusion combines rankings. Metadata filters (industry, size, region, year) ensure relevance. Every retrieved project includes actual outcomes for benchmarking.

🧩 What It Comprises

🔤 BM25

Okapi BM25 over project briefs, scope documents, and lessons learned. Handles exact keyword matches.

🧬 Dense Embeddings

bge-large-en-v1.5 embeddings (1,024-dim) stored in pgvector. Semantic similarity for conceptual matches.
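A similarity lookup against pgvector can be expressed as a short SQL query. The sketch below is illustrative only: the `projects` table, `project_id` column, and `embedding vector(1024)` column are hypothetical names, not the actual schema. It relies on pgvector's `<=>` operator, which returns cosine distance (so similarity is 1 minus the distance).

```python
def build_similarity_query(top_k: int = 50) -> str:
    """Build a pgvector cosine-similarity query for the dense retrieval leg.

    Assumes a hypothetical `projects` table with an `embedding vector(1024)`
    column; `%(query_vec)s` is bound to the bge-large query embedding at
    execution time.
    """
    return f"""
        SELECT project_id,
               1 - (embedding <=> %(query_vec)s::vector) AS cosine_sim
        FROM projects
        ORDER BY embedding <=> %(query_vec)s::vector
        LIMIT {top_k};
    """
```

Ordering by the raw `<=>` distance (ascending) lets pgvector use its index, while the computed `cosine_sim` column surfaces the 0-1 similarity shown to consultants.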

🔀 Reciprocal Rank Fusion

Combines BM25 and dense rankings: RRF_score = 1/(k+rank₁) + 1/(k+rank₂).

🏷️ Metadata Filters

Industry, company size, region, year, SAP modules, project duration, budget range.

📥 Inputs & 📤 Outputs

📥 Inputs

  • Project description / query (natural language)
  • Optional filters (industry, size, region, year)
  • Current project metadata

📤 Outputs

  • Top-N most similar past projects
  • Similarity score (0-100)
  • Outcome data (actual vs estimated cost/timeline)
  • Lessons learned and risk patterns

📋 Example: Top 3 Analogues

Pfizer S/4HANA Finance Migration (94% match)

Industry: Pharma | Size: $8.2M | Duration: 14 months

Outcome: Delivered on budget, 2-week delay due to data quality

Key Risk: Legacy data unmapped → +15% contingency used

Merck Global FI-CO Rollout (87% match)

Industry: Pharma | Size: $6.5M | Duration: 12 months

Outcome: Under budget by 8%, on time

Key Success: Strong data governance from Day 1

J&J Supply Chain S/4 Migration (78% match)

Industry: Pharma/CPG | Size: $12.1M | Duration: 18 months

Outcome: 10% over budget, 3-month delay (scope creep)

Lesson: Freeze scope earlier; add change order buffer

🔄 How It Runs — Step by Step

┌─────────────────────────────────────────────────────────────────────────────────────────┐
│                     HISTORICAL PROJECT RETRIEVER PIPELINE                                  │
├─────────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                           │
│   ┌──────────────┐                                                                        │
│   │   INPUT:     │  "S/4HANA migration for pharma company with complex supply chain"      │
│   │    Query     │  Filters: Industry=Pharma, Size>$5M                                     │
│   └──────┬───────┘                                                                        │
│          │                                                                                 │
│          ▼                                                                                 │
│   ┌──────────────────────────────────────────────────────────────────┐                    │
│   │                    STEP 1: PARALLEL RETRIEVAL                       │                    │
│   │                                                                     │                    │
│   │   ┌─────────────────────────┐    ┌─────────────────────────────┐   │                    │
│   │   │   DENSE RETRIEVAL       │    │   BM25 KEYWORD RETRIEVAL     │   │                    │
│   │   │   (bge-large)           │    │   (Okapi BM25)               │   │                    │
│   │   │                         │    │                              │   │                    │
│   │   │  Query → Embedding      │    │  Query → Tokenize            │   │                    │
│   │   │  Cosine sim to projects │    │  IDF-weighted term matching  │   │                    │
│   │   │                         │    │                              │   │                    │
│   │   │  Top-50 candidates      │    │  Top-50 candidates           │   │                    │
│   │   │  (semantic similarity)  │    │  (keyword precision)         │   │                    │
│   │   └───────────┬─────────────┘    └───────────────┬─────────────┘   │                    │
│   │               │                                   │                  │                    │
│   │               └─────────────┬─────────────────────┘                  │                    │
│   │                             ▼                                        │                    │
│   │              Reciprocal Rank Fusion (RRF)                             │                    │
│   │              Score = Σ 1/(60 + rank_i)                               │                    │
│   │                             │                                        │                    │
│   │                             ▼                                        │                    │
│   │                        Combined Top-50                                │                    │
│   └──────────────────────────────────────────────────────────────────┘                    │
│          │                                                                                 │
│          ▼                                                                                 │
│   ┌──────────────────────────────────────────────────────────────────┐                    │
│   │                    STEP 2: METADATA FILTERING                       │                    │
│   │                                                                     │                    │
│   │   Apply strict filters:                                             │                    │
│   │   • Industry = Pharma                                               │                    │
│   │   • Budget > $5M                                                    │                    │
│   │   • Year ≥ 2022                                                     │                    │
│   │                                                                     │                    │
│   │   Soft filters (penalize, don't exclude):                           │                    │
│   │   • Region = North America (preferred)                               │                    │
│   │   • Modules = FI, CO, MM (required)                                 │                    │
│   └──────────────────────────────────────────────────────────────────┘                    │
│          │                                                                                 │
│          ▼                                                                                 │
│   ┌──────────────────────────────────────────────────────────────────┐                    │
│   │                    STEP 3: OUTCOME ENRICHMENT                       │                    │
│   │                                                                     │                    │
│   │   For each retrieved project, attach:                               │                    │
│   │   • Actual vs. estimated cost                                       │                    │
│   │   • Actual vs. estimated timeline                                   │                    │
│   │   • Risk materialization flags                                      │                    │
│   │   • Lessons learned summary                                         │                    │
│   │   • Key success factors                                             │                    │
│   └──────────────────────────────────────────────────────────────────┘                    │
│          │                                                                                 │
│          ▼                                                                                 │
│   ┌──────────────────────────────────────────────────────────────────┐                    │
│   │   OUTPUT:                                                          │                    │
│   │   {                                                                │                    │
│   │     "query": "S/4HANA migration pharma supply chain",              │                    │
│   │     "analogues": [                                                 │                    │
│   │       {"project": "Pfizer S/4HANA Finance", "similarity": 0.94,    │                    │
│   │        "outcome": {"cost_variance": "+2%", "time_variance": "+2w"}}│                    │
│   │     ],                                                             │                    │
│   │     "benchmark_summary": "Avg cost variance: +3.2%"                │                    │
│   │   }                                                                │                    │
│   └──────────────────────────────────────────────────────────────────┘                    │
│                                                                                           │
└─────────────────────────────────────────────────────────────────────────────────────────┘
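The pipeline above can be sketched in a few lines: fuse the two rank lists with RRF, exclude candidates that fail strict metadata filters, and multiply a penalty into the score for each soft-filter miss. The project IDs, filter predicates, and the 0.9 penalty factor below are illustrative assumptions, not production values.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over a list of {doc_id: 1-based rank} dicts."""
    scores = defaultdict(float)
    for ranking in rankings:
        for doc_id, rank in ranking.items():
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)

def retrieve(bm25_rank, dense_rank, meta, strict, prefer, top_n=3, penalty=0.9):
    """Fuse both rankers, drop strict-filter misses, downweight soft misses."""
    results = []
    for doc_id, score in rrf_fuse([bm25_rank, dense_rank]).items():
        m = meta[doc_id]
        if not all(pred(m) for pred in strict):  # hard filters: exclude
            continue
        for pred in prefer:                      # soft filters: penalize
            if not pred(m):
                score *= penalty
        results.append((doc_id, score))
    results.sort(key=lambda pair: -pair[1])
    return results[:top_n]
```

Outcome enrichment (Step 3) would then be a join from the surviving IDs to the actuals store, which is omitted here.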
                    

🏗️ Architecture & Integration

Where Historical Retriever Sits in A²AI

📚 Project DB (42 projects) ─┐
📊 SolMan (actuals) ─────────┤
                             ▼  (indexing)
            🕰️ TOOL 12 · Historical Project Retriever (Hybrid Search)
                             │
        ┌────────────────────┼────────────────────┐
        ▼                    ▼                    ▼
  TOOL 04               TOOL 05             Benchmarking
  Cost Forecaster       Risk Estimator      Reports
  (Prior)
Tool 12 provides the "institutional memory" for evidence-based estimation.

📐 Mathematical Explanation

BM25 Scoring:

BM25(q, d) = Σ IDF(q_i) · [f(q_i, d) · (k₁+1)] / [f(q_i, d) + k₁·(1-b+b·|d|/avgdl)]

Where k₁=1.5, b=0.75.
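A direct transcription of the scoring formula, as a sketch: each query term contributes its IDF times a saturating term-frequency factor, normalized by document length relative to the corpus average. The IDF variant below (log with a +1 floor, keeping scores non-negative) is an assumption; the text does not pin down which IDF formulation is used.

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenized document for a tokenized query.

    `corpus` is the full list of tokenized documents, needed for the
    average document length and per-term document frequencies.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        f = doc_terms.count(term)                # f(q_i, d): term freq in doc
        n = sum(1 for d in corpus if term in d)  # document frequency of term
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1)  # assumed IDF variant
        score += idf * f * (k1 + 1) / (
            f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        )
    return score
```

A document sharing no terms with the query scores exactly zero, which is why BM25 alone misses synonym matches and needs the dense leg.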

Dense Cosine Similarity:

sim(q, d) = (E_q · E_d) / (‖E_q‖ · ‖E_d‖)
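The cosine similarity above is a one-liner over the query and document embeddings:

```python
import numpy as np

def cosine_sim(e_q, e_d):
    """Cosine similarity between a query and a document embedding."""
    return float(np.dot(e_q, e_d) / (np.linalg.norm(e_q) * np.linalg.norm(e_d)))
```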

Reciprocal Rank Fusion (RRF):

RRF_score(d) = Σ_{r∈{dense, bm25}} 1 / (k + rank_r(d))

Where k=60 (smoothing constant).
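A quick worked example of why the smoothing constant matters. With k=60, a project ranked consistently mid-list by both retrievers can outscore one that tops a single list but ranks poorly on the other; the ranks below are made up for illustration.

```python
k = 60
# Project A: ranks 25 (dense) and 26 (BM25) - consistently mid-list.
consistent = 1 / (k + 25) + 1 / (k + 26)
# Project B: rank 1 on one list but 200 on the other - a one-list hit.
one_hit = 1 / (k + 1) + 1 / (k + 200)
# consistent > one_hit: k caps how much a single top rank can dominate.
```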

Precision@K:

P@K = |{relevant documents in top K}| / K
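The metric translates directly into code; `relevant` here stands in for the consultant-rated relevance judgments mentioned in the performance table.

```python
def precision_at_k(retrieved, relevant, k=10):
    """Fraction of the top-k retrieved project IDs judged relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k
```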

📊 Measured Performance

Metric                     | Value       | Benchmark
---------------------------|-------------|---------------------------------
Precision@10               | 0.82        | Consultant-rated relevance
Mean Reciprocal Rank (MRR) | 0.78        | First relevant result position
Recall@10                  | 0.71        | % of all relevant projects found
Query Latency              | 45 ms       | Hybrid search + fusion
Index Size                 | 42 projects | Growing with each delivery

📚 Training & Calibration Set

  • Corpus: 42 delivered SAP projects (2022–2025)
  • Document Types: Executive summaries, SOWs, RICEFW inventories, lessons learned, actuals
  • Embedding Model: bge-large-en-v1.5 (pre-trained)
  • BM25 Parameters: k₁=1.5, b=0.75 (standard)
  • RRF k: 60 (empirically tuned on 200 queries)
  • Update Schedule: New projects indexed within 48 hours of completion

🎬 End-to-End Example

Scenario: Benchmarking for New Pharma RFP

  1. Input: "S/4HANA greenfield for mid-size pharma with validated GxP systems"
  2. Hybrid Search: BM25 matches "GxP" and "pharma"; Dense matches "validated systems" → "CSV" concepts
  3. Metadata Filter: Industry=Pharma, Size=$5M-$15M
  4. Output: 3 highly relevant analogues with actual outcomes
  5. Downstream: Tool 04 uses analogues as Bayesian prior for cost estimation

Result: P50 estimate calibrated to $8.2M based on similar projects; actual delivered at $8.5M.
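One simple way analogues could seed a prior, sketched below: a similarity-weighted mean of the analogues' actual costs. This is purely illustrative; the section does not specify how Tool 04's Bayesian update actually works, and the function name and weighting scheme are assumptions.

```python
def similarity_weighted_prior(analogues):
    """Similarity-weighted mean of analogue actual costs (in $M).

    `analogues` is a list of (actual_cost, similarity) pairs. Illustrative
    only: not the actual Tool 04 Bayesian prior construction.
    """
    total_weight = sum(sim for _, sim in analogues)
    return sum(cost * sim for cost, sim in analogues) / total_weight
```

Applied to the three example analogues above ($8.2M at 0.94, $6.5M at 0.87, $12.1M at 0.78), this yields a prior mean between the cheapest and most expensive analogue, which the forecaster would then refine with project-specific drivers.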