The Data Fragmentation Crisis

During the onset of COVID-19, critical data (case counts, regulations, logistics) was scattered across thousands of disparate government websites, in inconsistent formats (PDFs, dashboards) and many languages. Organizations like Microsoft and the WHO needed a unified "Source of Truth" fast.

I architected a Cognitive Scraping Pipeline on Azure to normalize this chaos into structured datasets.

Architecture: The Intelligent Crawler

1. The Cloud Orchestrator

Infrastructure: Deployed a fleet of ephemeral Node.js scrapers on Azure, capable of scaling horizontally to hit thousands of endpoints simultaneously.
Resilience: Built retry mechanisms and proxy rotation to handle the instability of government servers under high load.
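The resilience layer can be sketched roughly as follows. This is a minimal illustration, not the production code: `ProxyRotator`, `backoffDelayMs`, `fetchWithRetry`, and the injected `Fetcher` type are hypothetical names, and the round-robin rotation and exponential backoff policy are assumptions about how such retries are commonly built.

```typescript
// Hypothetical sketch: all names and the backoff policy are assumptions.

// The actual HTTP call is injected so the retry logic stays transport-agnostic.
type Fetcher = (url: string, proxy: string) => Promise<string>;

// Cycles through a fixed pool of proxy addresses in round-robin order.
class ProxyRotator {
  private index = 0;
  private readonly proxies: string[];

  constructor(proxies: string[]) {
    if (proxies.length === 0) throw new Error("proxy pool must not be empty");
    this.proxies = proxies;
  }

  next(): string {
    const proxy = this.proxies[this.index];
    this.index = (this.index + 1) % this.proxies.length;
    return proxy;
  }
}

// Exponential backoff: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Tries the request up to maxAttempts times, switching to the next proxy on
// every attempt and waiting longer after each failure.
async function fetchWithRetry(
  url: string,
  fetcher: Fetcher,
  rotator: ProxyRotator,
  maxAttempts = 4,
  baseMs = 500,
): Promise<string> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const proxy = rotator.next();
    try {
      return await fetcher(url, proxy);
    } catch (err) {
      lastError = err;
      // No point sleeping after the final failed attempt.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, baseMs));
      }
    }
  }
  throw new Error(
    `giving up on ${url} after ${maxAttempts} attempts: ${String(lastError)}`,
  );
}
```

Injecting the fetcher keeps the retry policy independent of whichever HTTP client a scraper uses, and rotating the proxy on every attempt means a single blocked or overloaded route does not consume the whole retry budget.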

2. Cognitive Extraction

Challenge: Data wasn't just in HTML tables; it was locked in images and PDFs.
Solution: Integrated Azure Cognitive Services (OCR) to "read" screenshots and PDF bulletins, converting unstructured pixels into structured JSON for downstream analytics.
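To illustrate the last step, turning recognized text into structured JSON, here is a small sketch. It assumes the OCR stage has already produced a plain array of text lines; the `CaseRecord` shape, the field labels ("Confirmed cases", "Deaths", "Recovered"), and `parseBulletin` are all hypothetical and stand in for whatever each real bulletin contained. No Azure API call is shown.

```typescript
// Hypothetical sketch: record shape, labels, and function names are assumptions.

interface CaseRecord {
  region: string;
  confirmed: number | null;
  deaths: number | null;
  recovered: number | null;
}

type Field = "confirmed" | "deaths" | "recovered";

// One pattern per field: a label, an optional ":" or "-", then a number that
// may contain thousands separators such as "," "." or spaces.
const FIELD_PATTERNS: Record<Field, RegExp> = {
  confirmed: /confirmed(?:\s+cases)?\s*[:\-]?\s*(\d[\d,. ]*)/i,
  deaths: /deaths\s*[:\-]?\s*(\d[\d,. ]*)/i,
  recovered: /recovered\s*[:\-]?\s*(\d[\d,. ]*)/i,
};

// Counts are assumed to be integers, so every non-digit is treated as a
// separator and dropped.
function parseCount(raw: string): number {
  return Number(raw.replace(/\D/g, ""));
}

// Scans OCR text lines and fills in the first value found for each field.
// Fields that never appear stay null rather than being guessed.
function parseBulletin(region: string, ocrLines: string[]): CaseRecord {
  const record: CaseRecord = {
    region,
    confirmed: null,
    deaths: null,
    recovered: null,
  };
  const fields = Object.keys(FIELD_PATTERNS) as Field[];
  for (const line of ocrLines) {
    for (const field of fields) {
      if (record[field] !== null) continue;
      const match = FIELD_PATTERNS[field].exec(line);
      if (match) record[field] = parseCount(match[1]);
    }
  }
  return record;
}

const example = parseBulletin("Exampleland", [
  "Daily Bulletin",
  "Confirmed cases: 1,234",
  "Deaths - 56",
]);
console.log(JSON.stringify(example));
```

Leaving missing fields as `null` keeps gaps visible downstream instead of silently reporting zero cases for a bulletin the OCR could not fully read.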

3. Impact

Velocity: Automated 90% of the repetitive manual tasks previously required to aggregate this data.
Scale: Enabled real-time dashboards for decision-makers by reducing data lag from days to minutes.

Global Pandemic Data Pipeline
