Product Case Study · Video Infrastructure
How I led product for a high-stakes international live sports stream — balancing delivery reliability, real-time CDN orchestration, and cross-functional coordination under pressure.
01 · Problem Statement
An enterprise customer needed to stream a high-profile international soccer match — live, at a scale the platform had not previously served from a single event origin. The business risk was significant: a failure or degraded viewing experience at this scale would directly threaten the customer relationship, damage platform reputation, and expose gaps in the infrastructure product.
Why it matters: Live streaming failures are not recoverable in the moment. Unlike VOD, a viewer who hits buffering during a live event cannot "retry" — they leave and don't come back. At 1.8M CCU, even a 1% impact means 18,000 viewers experiencing failure simultaneously.
02 · Approach
Worked with engineering and CDN account teams to model expected traffic curves. Analyzed historical CCU data from prior high-traffic events to stress-test our multi-CDN commit tiers and identify potential saturation points per provider.
Pulled TTFB, cache hit rates, and error rate baselines per CDN provider (Akamai, Edgio, Fastly, CloudFront) across geographies relevant to the event's audience. Established thresholds that would trigger routing adjustments.
Coordinated with engineering, sales, and the customer's technical team pre-event to define escalation paths, communication cadences, and fallback playbook triggers. Ensured every team knew their role before the stream started.
Defined which signals to watch during the event: throughput per CDN, error rate spikes, TTFB degradation, and segment delivery latency. Configured dashboards so the team could make routing decisions without hunting for data.
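The threshold idea in the steps above can be sketched in a few lines of Python. This is a minimal illustration, not the production logic: the names `Thresholds`, `Sample`, and `breaches`, the provider labels, and every numeric limit are hypothetical placeholders chosen for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical per-signal limits; real values would come from the baselines."""
    max_ttfb_ms: float
    min_cache_hit_ratio: float
    max_error_rate: float


@dataclass(frozen=True)
class Sample:
    """One measurement window for a CDN in a given region."""
    cdn: str
    region: str
    ttfb_ms: float
    cache_hit_ratio: float
    error_rate: float


def breaches(sample: Sample, limits: Thresholds) -> list[str]:
    """Return the names of the signals that are outside their limits."""
    found = []
    if sample.ttfb_ms > limits.max_ttfb_ms:
        found.append("ttfb")
    if sample.cache_hit_ratio < limits.min_cache_hit_ratio:
        found.append("cache_hit_ratio")
    if sample.error_rate > limits.max_error_rate:
        found.append("error_rate")
    return found


# Illustrative numbers only (assumptions, not measured data).
LIMITS = Thresholds(max_ttfb_ms=300.0, min_cache_hit_ratio=0.95, max_error_rate=0.01)
healthy = Sample("cdn_a", "kr", ttfb_ms=120.0, cache_hit_ratio=0.98, error_rate=0.002)
degraded = Sample("cdn_b", "kr", ttfb_ms=450.0, cache_hit_ratio=0.97, error_rate=0.03)

print(breaches(healthy, LIMITS))
print(breaches(degraded, LIMITS))
```

A non-empty result for a provider is the kind of signal that would prompt a routing adjustment or an escalation under the pre-agreed playbook.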
03 · Risk Identification
Any single CDN could hit throughput limits during kickoff or high-tension moments — the moments when viewership spikes fastest.
South Korea-focused traffic might expose PoP coverage differences between providers, leading to uneven latency across the audience.
Relying primarily on one provider with no real-time routing fallback would create a single point of failure with no recovery path during the live event.
If delivery failed for even a subset of segments, viewers would experience freezing or rebuffering — with no transparent retry to an alternate source.
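The first risk, a provider reaching its throughput limit at peak, can be checked ahead of time with simple arithmetic. The sketch below is one way to do that under stated assumptions: the 5 Mbps average bitrate, the traffic shares, and the per-provider capacities are all hypothetical, and `peak_gbps` and `headroom` are illustrative helper names.

```python
from __future__ import annotations


def peak_gbps(ccu: int, avg_bitrate_mbps: float) -> float:
    """Estimated aggregate egress at peak, in Gbps."""
    return ccu * avg_bitrate_mbps / 1000.0


def headroom(
    peak: float,
    shares: dict[str, float],
    capacity_gbps: dict[str, float],
) -> dict[str, float]:
    """Capacity left per CDN in Gbps; a negative value means the CDN would saturate."""
    return {cdn: capacity_gbps[cdn] - peak * share for cdn, share in shares.items()}


# Hypothetical inputs: only the 1.8M CCU figure comes from the case study.
peak = peak_gbps(ccu=1_800_000, avg_bitrate_mbps=5.0)
shares = {"cdn_a": 0.5, "cdn_b": 0.25, "cdn_c": 0.25}
capacity = {"cdn_a": 4000.0, "cdn_b": 3000.0, "cdn_c": 2500.0}

gaps = headroom(peak, shares, capacity)
at_risk = [cdn for cdn, gap in gaps.items() if gap < 0]

print(peak)
print(gaps)
print(at_risk)
```

With these made-up numbers, the provider carrying half the traffic would exceed its limit, which is the situation a static, single-provider configuration cannot recover from mid-event.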
04 · Solution
Rather than treating CDN selection as a static configuration, I worked with engineering to define a multi-signal scoring framework that evaluated CDN health in real time and dynamically shifted traffic to the optimal provider. This meant no single provider failure could take down the event: the platform would be able to route around degradation.
Each CDN was continuously scored on real-time throughput, TTFB, cache efficiency, and SLA error rate. Scores updated dynamically to reflect current delivery health.
When a CDN's score dropped below threshold, traffic routing shifted toward higher-scoring providers — without viewer interruption or manual intervention.
A real-time view of per-CDN performance during the event, enabling the team to monitor, annotate, and escalate based on observed data rather than guesswork.
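The scoring and routing behaviour described above could look roughly like the following. This is a simplified sketch, not the framework that ran during the event: it assumes each signal has already been normalised to a 0 to 1 range (1 is best), uses equal weights and a 0.7 threshold purely as placeholders, and removes below-threshold providers entirely rather than shifting traffic gradually. All names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Health:
    """Signals normalised to 0..1, where 1 is best (an assumption of this sketch)."""
    throughput: float
    ttfb: float
    cache: float
    errors: float


# Placeholder weights and threshold; real values would be tuned per event.
WEIGHTS = {"throughput": 0.25, "ttfb": 0.25, "cache": 0.25, "errors": 0.25}
THRESHOLD = 0.7


def score(h: Health) -> float:
    """Weighted sum of the four health signals."""
    return (
        WEIGHTS["throughput"] * h.throughput
        + WEIGHTS["ttfb"] * h.ttfb
        + WEIGHTS["cache"] * h.cache
        + WEIGHTS["errors"] * h.errors
    )


def routing_weights(
    health: dict[str, Health], threshold: float = THRESHOLD
) -> dict[str, float]:
    """Split traffic across CDNs in proportion to score, skipping unhealthy ones."""
    scores = {cdn: score(h) for cdn, h in health.items()}
    eligible = {cdn: s for cdn, s in scores.items() if s >= threshold}
    if not eligible:
        # Every provider is degraded: keep all of them rather than drop traffic.
        eligible = scores
    total = sum(eligible.values())
    if total == 0:
        # No usable signal at all: fall back to an even split.
        return {cdn: 1.0 / len(scores) for cdn in scores}
    return {cdn: eligible.get(cdn, 0.0) / total for cdn in scores}


current = {
    "cdn_a": Health(1.0, 1.0, 1.0, 1.0),
    "cdn_b": Health(1.0, 1.0, 1.0, 1.0),
    "cdn_c": Health(0.5, 0.5, 0.5, 0.5),
}
weights = routing_weights(current)
print(weights)
```

Recomputing the weights on every measurement window is what makes the routing dynamic: a provider that recovers above the threshold starts receiving traffic again without manual intervention.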
05 · Metrics
06 · Execution
07 · Outcome & Learnings
The event completed without outages at 1.8M CCU — but the more durable outcome was the delivery framework we built around it. Dynamic CDN routing became a repeatable capability for future high-scale events. The post-event analysis model became a standard artifact in enterprise customer reviews, strengthening trusted-advisor relationships by demonstrating platform transparency.
The key learning: live event reliability is a product problem as much as an infrastructure problem. Pre-defining how the system should behave under degradation — not just how to respond after — is what makes scale achievable without heroics.