System Design: How We Built a Real-Time Analytics Platform at IBM

At IBM Labs, I led the engineering of a customer analytics platform that processes millions of events per day across three cloud providers. This is the system design behind it.

Requirements

Ingest 10M+ user interaction events daily
Real-time dashboards with < 5 second latency
ML predictions — churn probability, next-best-action, anomaly detection
Multi-cloud — must run on AWS, Azure, and GCP simultaneously
99.9% uptime — financial services clients don’t tolerate downtime

High-Level Architecture

Clients → API Gateway → Kafka → Stream Processor → TimescaleDB
                                      ↓
                              ML Inference Service → Prediction Store
                                      ↓
                              Dashboard API → React Frontend

Event Ingestion

We chose Apache Kafka as the backbone. Every user interaction (page view, click, form submit, error) becomes an event:

{
  "event_id": "uuid-v4",
  "type": "page_view",
  "user_id": "u-123",
  "timestamp": "2025-08-10T14:30:00Z",
  "properties": {
    "page": "/dashboard",
    "duration_ms": 4500,
    "referrer": "/login"
  }
}

Kafka handles 50K events/second at peak with 3-day retention. We partition by user_id for ordering guarantees within a user’s session.

Stream Processing

A custom stream processor (built with FastAPI + asyncio) consumes from Kafka and:

Enriches events with user metadata from CosmosDB
Aggregates into 1-minute, 5-minute, and 1-hour windows
Detects anomalies using statistical models (Z-score on rolling averages)
Writes to TimescaleDB for time-series queries

ML Pipeline

Three models run in production:

Churn prediction — gradient boosting on 90-day behavioral features
Anomaly detection — isolation forests on traffic patterns
Next-best-action — collaborative filtering for feature recommendations

Models retrain weekly on the latest data. We use Hugging Face for embeddings and TensorFlow Serving for inference at < 50ms P99.

Observability Stack

You can’t operate what you can’t observe:

Prometheus + Grafana for infrastructure metrics
OpenTelemetry for distributed tracing across services
ELK Stack for centralized logging
PagerDuty integration for alerting

We track four golden signals: latency, traffic, errors, and saturation.

Lessons Learned

Kafka is not a database — don’t use it for long-term storage
Multi-cloud is hard — abstract cloud-specific APIs behind interfaces
Observability > testing — in distributed systems, you need both but observability catches what tests miss
Start with batch, graduate to streaming — prove value with batch processing before investing in real-time

This platform now serves 200+ enterprise clients and processes over 300M events monthly.