Tool 04 · Algorithm Deep Dive
Quantile LightGBM + Bayesian Updating + Monte Carlo Simulation
Single-number estimates lie. This is the Flaw of Averages: if you estimate 10 tasks at 10 days each, the project will usually not take 100 days. Because the tasks are statistically dependent (one delay cascades into the next), the actual outcome is usually 140-160 days. Clients need a P50 (median) and a P90 (conservative) estimate to price contingency correctly.
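A small simulation makes the effect concrete. The sketch below assumes each task's duration is lognormal with a 10-day median, and links the tasks through one shared delay driver. The spread `sigma`, the correlation `rho`, and the helper names are illustrative assumptions, not values taken from the forecaster.

```python
import math
import random
import statistics

def simulate_totals(n_tasks=10, median_days=10.0, sigma=0.5, rho=0.0,
                    runs=10_000, seed=0):
    """Simulate the total duration of n_tasks lognormal tasks.

    rho is the correlation between the tasks' underlying normal draws
    (one shared factor); rho=0 means the tasks are independent.
    """
    rng = random.Random(seed)
    mu = math.log(median_days)  # the median of a lognormal is exp(mu)
    totals = []
    for _ in range(runs):
        shared = rng.gauss(0.0, 1.0)  # delay driver common to every task
        total = 0.0
        for _ in range(n_tasks):
            own = rng.gauss(0.0, 1.0)  # task-specific noise
            z = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * own
            total += math.exp(mu + sigma * z)
        totals.append(total)
    return sorted(totals)

def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted list, p in [0, 1]."""
    idx = min(len(sorted_values) - 1, int(p * len(sorted_values)))
    return sorted_values[idx]

independent = simulate_totals(rho=0.0)
dependent = simulate_totals(rho=0.7)

for name, totals in (("independent", independent), ("dependent", dependent)):
    print(f"{name:12s} mean={statistics.mean(totals):6.1f}  "
          f"P50={percentile(totals, 0.5):6.1f}  P90={percentile(totals, 0.9):6.1f}")
```

Under these assumptions the mean total exceeds 100 days even for independent tasks, because each 10-day figure is the median of a right-skewed distribution. Dependency leaves the mean roughly unchanged but fattens the upper tail, which is the part a P90 quote has to cover.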
Quantile LightGBM learns several quantiles of the cost distribution, not just the mean. Monte Carlo simulation (10,000 runs) propagates uncertainty through dependent workstreams. Bayesian updating takes priors from similar historical projects and refines the estimates with actual delivery data.
Four LightGBM models are trained per workstream, one for each τ ∈ {0.25, 0.50, 0.75, 0.90}. Each model predicts one specific percentile of the cost distribution.
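What makes a model predict a percentile rather than a mean is the pinball loss. The sketch below implements that loss in plain Python and checks that the constant prediction minimizing it coincides with the empirical τ-quantile. The cost sample is synthetic and the function names are hypothetical. In LightGBM the same loss is selected with `objective="quantile"` and `alpha` set to τ; the sketch leaves that dependency out and demonstrates only the loss.

```python
import random

def pinball_loss(y_true, y_pred, tau):
    """Mean pinball (quantile) loss at quantile level tau."""
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        diff = y - y_hat
        # Under-prediction costs tau per unit, over-prediction costs (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)

def best_constant(y_true, tau, candidates):
    """Return the candidate constant prediction with the lowest pinball loss."""
    n = len(y_true)
    return min(candidates, key=lambda c: pinball_loss(y_true, [c] * n, tau))

rng = random.Random(42)
# Synthetic right-skewed workstream costs in $k (illustrative, not project data).
costs = sorted(rng.lognormvariate(5.0, 0.3) for _ in range(500))

for tau in (0.25, 0.50, 0.75, 0.90):
    fitted = best_constant(costs, tau, costs)
    empirical = costs[int(tau * len(costs))]
    print(f"tau={tau:.2f}  pinball-optimal={fitted:7.1f}  "
          f"empirical quantile={empirical:7.1f}")
```

The asymmetry is the whole point: at τ = 0.90 an underestimate is penalized nine times as heavily as an overestimate of the same size, so the fitted value is pushed up until about 90% of outcomes fall below it.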
A 10,000-run simulation samples from the joint distribution of workstream costs. A Gaussian copula models the dependency structure between workstreams.
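The sampling loop can be sketched with the standard library alone. This version assumes a single correlation (ρ = 0.72) shared by every pair of workstreams, and turns each workstream's four quantile predictions into a piecewise-linear inverse CDF. Only the FI quantiles come from the pipeline diagram; the other workstream values, the tail extrapolation, and all function names are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist

TAUS = (0.25, 0.50, 0.75, 0.90)

def inverse_cdf(u, quantiles):
    """Piecewise-linear quantile function through the (tau, cost) knots.

    Outside the knots, the nearest segment is extended linearly. This is
    an assumption; the source does not say how the tails are modeled.
    """
    knots = list(zip(TAUS, quantiles))
    if u <= knots[0][0]:
        (t0, q0), (t1, q1) = knots[0], knots[1]
    elif u >= knots[-1][0]:
        (t0, q0), (t1, q1) = knots[-2], knots[-1]
    else:
        for (t0, q0), (t1, q1) in zip(knots, knots[1:]):
            if t0 <= u <= t1:
                break
    return q0 + (u - t0) * (q1 - q0) / (t1 - t0)

def simulate_totals(workstreams, rho=0.72, runs=10_000, seed=0):
    """Gaussian copula with one common correlation rho between all pairs."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    totals = []
    for _ in range(runs):
        shared = rng.gauss(0.0, 1.0)
        total = 0.0
        for quantiles in workstreams.values():
            # 1. Correlated standard normal via a shared factor.
            z = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            # 2. Map to a correlated uniform, then through the inverse CDF.
            u = std_normal.cdf(z)
            total += inverse_cdf(u, quantiles)
        # 3. and 4. Sum the workstream costs and record the total.
        totals.append(total)
    return sorted(totals)

WORKSTREAMS = {  # quantile predictions in $k at TAUS
    "FI": (120, 145, 175, 210),    # FI example from the pipeline diagram
    "CO": (90, 110, 135, 165),     # hypothetical
    "MM": (150, 180, 220, 270),    # hypothetical
    "SD": (80, 95, 115, 140),      # hypothetical
    "Data": (60, 75, 95, 125),     # hypothetical
}

totals = simulate_totals(WORKSTREAMS)
for p in (0.50, 0.75, 0.90):
    print(f"P{round(p * 100)} total: ${totals[int(p * len(totals))]:,.0f}k")
```

A full correlation matrix with different ρ per pair would need a Cholesky factor instead of the single shared draw; the one-factor form is used here only because it keeps the sketch short.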
A prior built from historical analogues (Tool 12) is updated with project-specific features via Bayes' theorem.
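As a rough illustration of the update, the sketch below swaps in the simplest conjugate case: a normal prior moment-matched to weighted analogue costs, and a normal approximation of the quantile forecast. This is a simplification chosen for readability, not the tool's documented method. The analogue costs and similarity weights are hypothetical; the quantile values are the FI example from the pipeline.

```python
from statistics import NormalDist

def normal_from_quantiles(q25, q50, q75):
    """Approximate a quantile forecast by a normal: median as mean, IQR as spread."""
    std_normal = NormalDist()
    iqr_in_sigmas = std_normal.inv_cdf(0.75) - std_normal.inv_cdf(0.25)  # about 1.349
    return q50, (q75 - q25) / iqr_in_sigmas

def weighted_prior(costs, weights):
    """Moment-match weighted analogue costs to a normal prior (mean, sd)."""
    w_sum = sum(weights)
    mean = sum(w * c for w, c in zip(weights, costs)) / w_sum
    var = sum(w * (c - mean) ** 2 for w, c in zip(weights, costs)) / w_sum
    return mean, var ** 0.5

def posterior(prior_mean, prior_sd, obs_mean, obs_sd):
    """Normal-normal update: a precision-weighted average of prior and forecast."""
    prior_prec = 1.0 / prior_sd ** 2
    obs_prec = 1.0 / obs_sd ** 2
    post_var = 1.0 / (prior_prec + obs_prec)
    post_mean = post_var * (prior_prec * prior_mean + obs_prec * obs_mean)
    return post_mean, post_var ** 0.5

# Hypothetical FI costs ($k) of the five closest analogues and their similarity weights.
analogue_costs = [130, 150, 160, 170, 190]
similarity = [0.30, 0.25, 0.20, 0.15, 0.10]

prior_mean, prior_sd = weighted_prior(analogue_costs, similarity)
model_mean, model_sd = normal_from_quantiles(120, 145, 175)  # FI quantiles, $k
post_mean, post_sd = posterior(prior_mean, prior_sd, model_mean, model_sd)

print(f"prior     mean={prior_mean:6.1f}  sd={prior_sd:5.1f}")
print(f"forecast  mean={model_mean:6.1f}  sd={model_sd:5.1f}")
print(f"posterior mean={post_mean:6.1f}  sd={post_sd:5.1f}")
```

The shrinkage behavior falls out of the precision weights: the wider the forecast's spread, the more the posterior leans on the historical analogues.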
The S-curve is the cumulative distribution function of simulated totals: it shows the probability of completing at or below any cost level.
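Reading values off the S-curve reduces to two lookups on the sorted simulation totals. The sketch below shows both directions, cost to probability and probability to cost. The totals are synthetic stand-ins and the function names are hypothetical.

```python
import bisect
import math
import random

def prob_at_or_below(sorted_totals, cost):
    """Empirical CDF: share of simulated totals at or below `cost`."""
    return bisect.bisect_right(sorted_totals, cost) / len(sorted_totals)

def cost_at_confidence(sorted_totals, p):
    """Smallest simulated total whose empirical CDF reaches p (0 < p <= 1)."""
    idx = max(0, math.ceil(p * len(sorted_totals)) - 1)
    return sorted_totals[idx]

rng = random.Random(7)
# Stand-in for the 10,000 simulated totals in $M (synthetic lognormal draws).
totals = sorted(rng.lognormvariate(math.log(4.2), 0.25) for _ in range(10_000))

for p in (0.50, 0.75, 0.90):
    print(f"P{round(p * 100)}: ${cost_at_confidence(totals, p):.2f}M")
print(f"P(total <= $5.0M) = {prob_at_or_below(totals, 5.0):.1%}")
```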
┌─────────────────────────────────────────────────────────────────────────────────────────┐
│ COST & TIMELINE FORECASTER PIPELINE │
├─────────────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────────────────┐ │
│ │ STEP 1: FEATURE ENGINEERING │ │
│ │ │ │
│ │ Inputs from Tools 02, 03, 12: │ │
│ │ • Scope vector: [FI:23, CO:18, MM:31, SD:15, PP:8] │ │
│ │ • Team: [Sr:4, Mid:6, Jr:3] │ │
│ │ • Integrations: 7 interfaces │ │
│ │ • Industry: Pharma, Region: EU │ │
│ │ • Historical analogues: 5 similar projects │ │
│ └─────────────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────────────┐ │
│ │ STEP 2: QUANTILE LIGHTGBM PREDICTION │ │
│ │ │ │
│ │ For each workstream (FI, CO, MM, SD, PP, Basis, Data): │ │
│ │ │ │
│ │ ┌───────────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ LightGBM Quantile Regressors (4 models per workstream) │ │ │
│ │ │ │ │ │
│ │ │ Input → [800 trees, max_depth=6] → Quantile Predictions │ │ │
│ │ │ │ │ │
│ │ │ τ=0.25 τ=0.50 τ=0.75 τ=0.90 │ │ │
│ │ │ ↓ ↓ ↓ ↓ │ │ │
│ │ │ $120k $145k $175k $210k (FI workstream example) │ │ │
│ │ │ │ │ │
│ │ │ Pinball Loss: L_τ(y,ŷ) = { τ(y-ŷ) if y≥ŷ; (1-τ)(ŷ-y) if y<ŷ } │ │ │
│ │ └───────────────────────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────────────┐ │
│ │ STEP 3: BAYESIAN PRIOR UPDATING │ │
│ │ │ │
│ │ Prior Distribution from Historical Analogues (Tool 12): │ │
│ │ P(θ) ~ Weighted mixture of 5 most similar projects │ │
│ │ │ │
│ │ Likelihood from LightGBM: │ │
│ │ P(Data | θ) ~ Quantile predictions transformed to distribution │ │
│ │ │ │
│ │ Posterior (Bayes' Theorem): │ │
│ │ P(θ | Data) ∝ P(Data | θ) × P(θ) │ │
│ │ │ │
│ │ → Shrinks estimates toward historical outcomes when data is sparse │ │
│ └─────────────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────────────┐ │
│ │ STEP 4: MONTE CARLO SIMULATION (10,000 RUNS) │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ COPULA STRUCTURE │ │ │
│ │ │ │ │ │
│ │ │ FI ──┐ │ │ │
│ │ │ │ │ │ │
│ │ │ CO ──┼── Gaussian Copula (ρ = 0.72 correlation) │ │ │
│ │ │ │ │ │ │
│ │ │ MM ──┤ → Joint distribution modeling │ │ │
│ │ │ │ dependency between workstreams │ │ │
│ │ │ SD ──┤ │ │ │
│ │ │ │ │ │ │
│ │ │ Data ─┘ │ │ │
│ │ └─────────────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ For run i = 1 to 10,000: │ │
│ │ 1. Sample correlated uniforms from copula │ │
│ │ 2. Transform to workstream distributions (inverse CDF) │ │
│ │ 3. Sum costs = Total_i │ │
│ │ 4. Record Total_i │ │
│ │ │ │
│ │ Results: 10,000 possible total project costs │ │
│ └─────────────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────────────┐ │
│ │ STEP 5: S-CURVE & OUTPUT │ │
│ │ │ │
│ │ Sort 10,000 totals → Cumulative Distribution Function │ │
│ │ │ │
│ │ [S-curve plot: cumulative probability (25% to 100%) against Project Cost ($3.5M to $6.0M); P50 marked at $4.2M] │ │
│ │ │ │
│ │ Output: P50 = $4.2M, P75 = $4.9M, P90 = $5.8M │ │
│ └─────────────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────────────────┘
Probability of completing at or below a given cost
Tool 04 is the star tool: its output directly feeds executive decision-making and commercial pricing.
| Metric | Value | Basis / Definition |
|---|---|---|
| MAPE (P50 Cost) | 12.3% | 42 delivered projects |
| P90 Coverage | 88.7% | Share of projects with actual cost ≤ P90 estimate (ideal: 90%) |
| P75 Coverage | 76.2% | Share of projects with actual cost ≤ P75 estimate (ideal: 75%) |
| P50 Coverage | 52.4% | Share of projects with actual cost ≤ P50 estimate (ideal: 50%) |
| Timeline MAPE | 14.8% | Project duration in months |
| Simulation Time | 2.3 s | 10,000 runs on CPU |
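For reference, the MAPE and coverage figures can be computed from a backtest as sketched below. The four projects and their numbers are made up to show the arithmetic; they are not the 42 delivered projects behind the table, and the function names are hypothetical.

```python
def mape(actuals, predictions):
    """Mean absolute percentage error, in percent."""
    errors = [abs(a - p) / a for a, p in zip(actuals, predictions)]
    return 100.0 * sum(errors) / len(errors)

def coverage(actuals, quantile_estimates):
    """Percent of projects whose actual cost came in at or below the estimate."""
    hits = sum(1 for a, q in zip(actuals, quantile_estimates) if a <= q)
    return 100.0 * hits / len(actuals)

# Hypothetical backtest, costs in $M.
actual = [4.0, 5.0, 2.0, 8.0]
p50 = [4.4, 4.5, 2.0, 10.0]
p90 = [5.5, 5.6, 2.9, 12.5]

print(f"MAPE (P50 cost): {mape(actual, p50):.2f}%")
print(f"P50 coverage:    {coverage(actual, p50):.0f}%")
print(f"P90 coverage:    {coverage(actual, p90):.0f}%")
```

A well-calibrated Pτ estimate should cover close to τ percent of projects. Coverage well below τ means the quoted range is too optimistic; coverage well above τ means contingency is being overpriced.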
Result: the project was delivered at $5.1M, above the P75 estimate ($4.9M) but within the P90 estimate ($5.8M), so the margin was protected.