Product Case Study · Video Infrastructure

Live at Scale: Delivering to 1.8M Concurrent Viewers

How I led product for a high-stakes international live sports stream — balancing delivery reliability, real-time CDN orchestration, and cross-functional coordination under pressure.

Live Streaming · CDN Strategy · Infrastructure PM · Enterprise Reliability
1.8M Concurrent users at peak
3+ CDN providers orchestrated in real time
0 Unplanned outages during the event

The Challenge

An enterprise customer needed to stream a high-profile international soccer match — live, at a scale the platform had not previously served from a single event origin. The business risk was significant: a failure or degraded viewing experience at this scale would directly threaten the customer relationship, damage platform reputation, and expose gaps in the infrastructure product.

Why it matters: Live streaming failures are not recoverable in the moment. Unlike VOD, a viewer who hits buffering during a live event cannot "retry" the moment they missed; many leave and do not come back. At 1.8M concurrent users (CCU), even a 1% impact means 18,000 viewers experiencing failure simultaneously.

Research & Pre-Event Analysis

1. CDN Capacity Forecasting

Worked with engineering and CDN account teams to model expected traffic curves. Analyzed historical CCU data from prior high-traffic events to stress-test our multi-CDN commit tiers and identify potential saturation points per provider.
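As a rough illustration of this forecasting step, the sketch below estimates total peak egress and per-provider utilization. The 4 Mbps average bitrate, the traffic shares, the capacity figures, the 80% headroom limit, and the provider names are all hypothetical placeholders, not figures from the event.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class CdnPlan:
    name: str              # hypothetical provider label
    traffic_share: float   # assumed fraction of viewers routed to this CDN
    capacity_gbps: float   # assumed committed or reserved capacity


def required_gbps(peak_ccu: int, avg_bitrate_mbps: float) -> float:
    """Total egress needed at peak, in Gbps."""
    return peak_ccu * avg_bitrate_mbps / 1000.0


def saturation_report(
    plans: List[CdnPlan],
    peak_ccu: int,
    avg_bitrate_mbps: float,
    headroom: float = 0.8,
) -> Dict[str, Tuple[float, float, bool]]:
    """Map each CDN to (demand in Gbps, utilization, exceeds headroom limit)."""
    total = required_gbps(peak_ccu, avg_bitrate_mbps)
    report = {}
    for plan in plans:
        demand = total * plan.traffic_share
        utilization = demand / plan.capacity_gbps
        report[plan.name] = (round(demand, 1), round(utilization, 2), utilization > headroom)
    return report


# Hypothetical plan: three providers with assumed shares and capacities.
plans = [
    CdnPlan("cdn_a", 0.5, 4000.0),
    CdnPlan("cdn_b", 0.3, 3000.0),
    CdnPlan("cdn_c", 0.2, 2000.0),
]
report = saturation_report(plans, peak_ccu=1_800_000, avg_bitrate_mbps=4.0)
for name, (demand, utilization, over) in report.items():
    print(name, demand, utilization, "AT RISK" if over else "ok")
```

Under these assumed inputs, the provider carrying half the traffic would be flagged as a potential saturation point, which is the kind of finding that would prompt a conversation with that provider's account team.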

2. SLA & Baseline Benchmarking

Pulled TTFB, cache hit rates, and error rate baselines per CDN provider (Akamai, Edgio, Fastly, CloudFront) across geographies relevant to the event's audience. Established thresholds that would trigger routing adjustments.
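One way to express such thresholds is as tolerances relative to each provider's baseline. The sketch below is a minimal example under that assumption; the tolerance values (1.5x TTFB, a 5-point cache hit rate drop, 2x error rate) and the sample numbers are hypothetical and not taken from the event.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metrics:
    ttfb_ms: float         # time to first byte, milliseconds
    cache_hit_rate: float  # 0..1
    error_rate: float      # 0..1


def breaches(
    baseline: Metrics,
    observed: Metrics,
    ttfb_factor: float = 1.5,     # assumed: TTFB may grow 50% before triggering
    hit_rate_drop: float = 0.05,  # assumed: allowed absolute drop in cache hit rate
    error_factor: float = 2.0,    # assumed: error rate may double before triggering
) -> List[str]:
    """Return the names of signals that crossed their threshold."""
    crossed = []
    if observed.ttfb_ms > baseline.ttfb_ms * ttfb_factor:
        crossed.append("ttfb")
    if observed.cache_hit_rate < baseline.cache_hit_rate - hit_rate_drop:
        crossed.append("cache_hit_rate")
    if observed.error_rate > baseline.error_rate * error_factor:
        crossed.append("error_rate")
    return crossed


# Hypothetical baseline for one provider in one geography.
baseline = Metrics(ttfb_ms=80.0, cache_hit_rate=0.95, error_rate=0.002)
print(breaches(baseline, Metrics(ttfb_ms=130.0, cache_hit_rate=0.93, error_rate=0.003)))
```

Keeping the thresholds relative to a per-provider, per-geography baseline means a provider that is normally slower in one region is not penalized for its usual behavior, only for degradation.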

3. Cross-Functional Alignment

Coordinated with engineering, sales, and the customer's technical team pre-event to define escalation paths, communication cadences, and fallback playbook triggers. Ensured every team knew their role before the stream started.

4. Real-Time Monitoring Setup

Defined which signals to watch during the event: throughput per CDN, error rate spikes, TTFB degradation, and segment delivery latency. Configured dashboards so the team could make routing decisions without hunting for data.

Possible Failure Modes

CDN Saturation at Peak Traffic

Any single CDN could hit throughput limits during kickoff or high-tension moments — the moments when viewership spikes fastest.

Geographic Coverage Gaps

South Korea-focused traffic might expose PoP coverage differences between providers, leading to uneven latency across the audience.

Single CDN Dependency (Critical Focus)

Relying primarily on one provider with no real-time routing fallback would create a single point of failure with no recovery path during the live event.

Playback Errors from Segment Delivery Failures

If delivery failed for even a subset of segments, viewers would experience freezing or rebuffering — with no transparent retry to an alternate source.

Dynamic Multi-CDN Routing with Real-Time Scoring

Rather than treating CDN selection as a static configuration, I worked with engineering to define a multi-signal scoring framework that evaluated CDN health in real time and dynamically shifted traffic to the best-scoring provider. This meant no single provider failure could take down the event: the platform would be able to route around degradation.

01. Multi-Signal Health Score

Each CDN was continuously scored on real-time throughput, TTFB, cache efficiency, and SLA error rate. Scores updated dynamically to reflect current delivery health.
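A composite score of this kind could be sketched as a weighted sum of normalized signals. The weights, the budgets, and the choice to score throughput as remaining headroom are assumptions made for illustration; the source does not state how the production framework combined its signals.

```python
from dataclasses import dataclass

# Hypothetical weights; they sum to 1.0 so the score stays in the 0..1 range.
WEIGHTS = {"throughput": 0.3, "ttfb": 0.3, "cache": 0.2, "errors": 0.2}


@dataclass
class CdnSample:
    utilization: float     # current throughput / available capacity, 0..1
    ttfb_ms: float
    ttfb_budget_ms: float  # assumed TTFB at which the signal scores zero
    cache_hit_rate: float  # 0..1
    error_rate: float
    error_budget: float    # assumed SLA error rate at which the signal scores zero


def clamp01(value: float) -> float:
    """Clamp a value into the 0..1 range."""
    return max(0.0, min(1.0, value))


def health_score(sample: CdnSample) -> float:
    """Weighted sum of normalized signals; 1.0 is fully healthy, 0.0 is unusable."""
    parts = {
        "throughput": clamp01(1.0 - sample.utilization),
        "ttfb": clamp01(1.0 - sample.ttfb_ms / sample.ttfb_budget_ms),
        "cache": clamp01(sample.cache_hit_rate),
        "errors": clamp01(1.0 - sample.error_rate / sample.error_budget),
    }
    return sum(WEIGHTS[name] * value for name, value in parts.items())


# Hypothetical samples for a healthy and a degraded provider.
healthy = CdnSample(0.4, 60.0, 300.0, 0.95, 0.001, 0.01)
degraded = CdnSample(0.9, 250.0, 300.0, 0.80, 0.008, 0.01)
print(round(health_score(healthy), 2), round(health_score(degraded), 2))
```

Normalizing each signal to the same 0..1 range before weighting keeps any one metric's units from dominating the result.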

02. Traffic Rebalancing Logic

When a CDN's score dropped below threshold, traffic routing shifted toward higher-scoring providers — without viewer interruption or manual intervention.
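The rebalancing step might look like the sketch below, which assumes one simple policy: providers under the threshold receive no new traffic, and the rest share traffic in proportion to their scores. The 0.5 threshold, the proportional split, and the provider names are hypothetical; the actual policy may have shifted traffic more gradually.

```python
from typing import Dict


def rebalance(scores: Dict[str, float], threshold: float = 0.5) -> Dict[str, float]:
    """Return traffic weights summing to 1.0, proportional to score among healthy CDNs."""
    eligible = {name: score for name, score in scores.items() if score >= threshold}
    if not eligible:
        # Assumed fallback: if every CDN is below threshold, keep all of them
        # in rotation so traffic always has somewhere to go.
        eligible = dict(scores)
    total = sum(eligible.values())
    if total == 0:
        # All scores are zero: split evenly rather than divide by zero.
        return {name: 1.0 / len(scores) for name in scores}
    return {name: eligible.get(name, 0.0) / total for name in scores}


# Hypothetical scores: cdn_b has dropped below the threshold.
weights = rebalance({"cdn_a": 0.9, "cdn_b": 0.3, "cdn_c": 0.6})
print(weights)
```

In this example the degraded provider's share goes to zero and the remaining traffic is split 60/40 between the other two.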

03. Live Event Dashboard

A real-time view of per-CDN performance during the event, enabling the team to monitor, annotate, and escalate based on observed data rather than guesswork.

How We Measured Success

Success Metrics

  • Peak CCU sustained without SLA breach across all CDN providers
  • TTFB within baseline thresholds at peak traffic moments
  • Error rate below agreed SLA across geographies
  • Zero unplanned escalations during the event window
  • Post-event CDN performance summary delivered to customer

Negative Impact to Watch

  • Routing instability: frequent switching between CDNs could introduce micro-buffering for some viewers
  • Over-routing to one CDN if scoring weights aren't balanced, recreating a single-provider dependency
  • Latency in score updates causing routing decisions to lag behind real delivery degradation
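A common guard against the first of these risks is hysteresis: a provider leaves rotation at one threshold and is readmitted only at a higher one, so a score hovering near a single cutoff does not cause repeated switching. The sketch below illustrates that general technique; it is not described in the source as part of the event's framework, and the thresholds are hypothetical.

```python
class HysteresisGate:
    """Tracks whether one CDN is in rotation, using separate exit and entry thresholds."""

    def __init__(self, exit_below: float = 0.5, enter_above: float = 0.7) -> None:
        if enter_above <= exit_below:
            raise ValueError("enter_above must be greater than exit_below")
        self.exit_below = exit_below
        self.enter_above = enter_above
        self.active = True  # assume the CDN starts in rotation

    def update(self, score: float) -> bool:
        """Apply a new score and return whether the CDN is in rotation."""
        if self.active and score < self.exit_below:
            self.active = False
        elif not self.active and score > self.enter_above:
            self.active = True
        return self.active


# Hypothetical score sequence that dips and then recovers slowly.
gate = HysteresisGate()
states = [gate.update(score) for score in [0.8, 0.45, 0.6, 0.65, 0.75]]
print(states)
```

The scores 0.6 and 0.65 sit between the two thresholds, so the provider stays out of rotation until it clearly recovers, which avoids flapping at the cost of a slower return.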

Rollout & Event Strategy

Before the Event

Pre-Event Preparation

  • Traffic forecast and CDN capacity review with each provider's account team
  • Routing thresholds defined and validated in staging
  • Escalation playbook agreed with customer's technical team
  • Monitoring dashboards tested and ready

During the Event

Live Monitoring

  • Real-time CDN health tracking throughout the stream
  • Dynamic traffic rebalancing as CDN scores shifted
  • Continuous cross-team communication cadence
  • Incident triggers on standby with defined ownership

After the Event

Post-Event Review

  • CDN performance summary: TTFB, cache efficiency, error rates per provider
  • SLA adherence evaluation for each CDN
  • Delivery optimization recommendations for future events
  • Internal retrospective to improve the scoring framework

What This Built

The event completed without outages at 1.8M CCU — but the more durable outcome was the delivery framework we built around it. Dynamic CDN routing became a repeatable capability for future high-scale events. The post-event analysis model became a standard artifact in enterprise customer reviews, strengthening trusted-advisor relationships by demonstrating platform transparency.

The key learning: live event reliability is a product problem as much as an infrastructure problem. Pre-defining how the system should behave under degradation — not just how to respond after — is what makes scale achievable without heroics.