Live at Scale — Viviana Palacios

01 · Problem Statement

The challenge

An enterprise customer needed to stream a high-profile international soccer match — live, at a scale the platform had not previously served from a single event origin. The business risk was significant: a failure or degraded viewing experience at this scale would directly threaten the customer relationship, damage platform reputation, and expose gaps in the infrastructure product.

Why it matters: Live streaming failures are not recoverable in the moment. Unlike VOD, a viewer who hits buffering during a live event cannot "retry" — they leave and don't come back. At 1.8M CCU, even a 1% impact means 18,000 viewers experiencing failure simultaneously.

02 · Approach

Research & pre-event analysis

1

CDN capacity forecasting

Worked with engineering and CDN account teams to model expected traffic curves. Analyzed historical CCU data from prior high-traffic events to stress-test our multi-CDN commit tiers and identify potential saturation points per provider.
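A back-of-the-envelope version of this saturation check can be sketched as follows. The average bitrate, per-provider traffic shares, and commit figures are illustrative assumptions, not values from the event, and the function names are hypothetical.

```python
def required_gbps(peak_ccu: int, avg_bitrate_mbps: float) -> float:
    """Aggregate egress in Gbps if every concurrent viewer pulls avg_bitrate_mbps."""
    return peak_ccu * avg_bitrate_mbps / 1000.0


def saturation_report(
    peak_ccu: int,
    avg_bitrate_mbps: float,
    traffic_share: dict[str, float],
    commit_gbps: dict[str, float],
) -> dict[str, dict[str, float]]:
    """Compare each provider's forecast demand with its committed capacity.

    traffic_share maps provider name -> fraction of total traffic (assumed).
    commit_gbps maps provider name -> committed capacity in Gbps (assumed).
    A utilization above 1.0 marks a potential saturation point.
    """
    total = required_gbps(peak_ccu, avg_bitrate_mbps)
    report = {}
    for cdn, share in traffic_share.items():
        demand = total * share
        report[cdn] = {
            "demand_gbps": demand,
            "utilization": demand / commit_gbps[cdn],
        }
    return report
```

Running the report against several forecast curves (expected, optimistic, worst case) shows which provider would exceed its commit first and by how much.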

2

SLA & baseline benchmarking

Pulled TTFB, cache hit rates, and error rate baselines per CDN provider (Akamai, Edgio, Fastly, CloudFront) across geographies relevant to the event's audience. Established thresholds that would trigger routing adjustments.
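One way such thresholds might be expressed is as tolerances around each provider's baseline. This is a minimal sketch: the `Baseline` type, the `breaches` helper, and the tolerance values are all hypothetical and would be tuned per provider and geography.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    ttfb_ms: float          # time to first byte, milliseconds
    cache_hit_ratio: float  # 0.0 to 1.0
    error_rate: float       # fraction of failed requests, 0.0 to 1.0


def breaches(
    baseline: Baseline,
    observed: Baseline,
    ttfb_slack: float = 1.5,   # assumed: alert when TTFB exceeds 1.5x baseline
    hit_drop: float = 0.05,    # assumed: alert when hit ratio falls 5 points
    error_slack: float = 2.0,  # assumed: alert when error rate doubles
) -> list[str]:
    """Return the names of the signals that crossed their threshold."""
    crossed = []
    if observed.ttfb_ms > baseline.ttfb_ms * ttfb_slack:
        crossed.append("ttfb")
    if observed.cache_hit_ratio < baseline.cache_hit_ratio - hit_drop:
        crossed.append("cache_hit_ratio")
    if observed.error_rate > baseline.error_rate * error_slack:
        crossed.append("error_rate")
    return crossed
```

A non-empty result for a provider in a given geography is the kind of condition that would prompt a routing adjustment.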

3

Cross-functional alignment

Coordinated with engineering, sales, and the customer's technical team pre-event to define escalation paths, communication cadences, and fallback playbook triggers. Ensured every team knew their role before the stream started.

4

Real-time monitoring setup

Defined which signals to watch during the event: throughput per CDN, error rate spikes, TTFB degradation, and segment delivery latency. Configured dashboards so the team could make routing decisions without hunting for data.

03 · Risk Identification

Possible failure modes

CDN saturation at peak traffic

Any single CDN could hit throughput limits during kickoff or high-tension moments — the moments when viewership spikes fastest.

Geographic coverage gaps

South Korea-focused traffic might expose PoP coverage differences between providers, leading to uneven latency across the audience.

Critical Focus

Single CDN Dependency

Relying primarily on one provider with no real-time routing fallback would create a single point of failure with no recovery path during the live event.

Playback errors from segment delivery failures

If delivery failed for even a subset of segments, viewers would experience freezing or rebuffering — with no transparent retry to an alternate source.

04 · Solution

Dynamic multi-CDN routing with real-time scoring

Rather than treating CDN selection as a static configuration, I worked with engineering to define a multi-signal scoring framework that evaluated CDN health in real time and dynamically shifted traffic to the optimal provider. This meant no single provider failure could take down the event: the platform would be able to route around degradation.

01

Multi-signal health score

Each CDN was continuously scored on real-time throughput, TTFB, cache efficiency, and SLA error rate. Scores updated dynamically to reflect current delivery health.
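A minimal sketch of how four such signals could be combined into one score, assuming equal weights and simple linear normalization. The field names, budgets, and weights are hypothetical and do not describe the production scoring model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CdnSample:
    throughput_gbps: float  # current egress
    capacity_gbps: float    # assumed usable capacity for this provider
    ttfb_ms: float
    cache_hit_ratio: float  # 0.0 to 1.0
    error_rate: float       # 0.0 to 1.0

# Assumed equal weighting; unbalanced weights risk over-routing to one CDN.
WEIGHTS = {"headroom": 0.25, "ttfb": 0.25, "cache": 0.25, "errors": 0.25}


def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def health_score(
    s: CdnSample,
    ttfb_budget_ms: float = 500.0,  # assumed: TTFB at or above this scores 0
    error_budget: float = 0.01,     # assumed: error rate at or above this scores 0
    weights: dict[str, float] = WEIGHTS,
) -> float:
    """Return a health score in [0, 1]; higher means healthier."""
    if s.capacity_gbps > 0:
        headroom = clamp01(1.0 - s.throughput_gbps / s.capacity_gbps)
    else:
        headroom = 0.0
    parts = {
        "headroom": headroom,
        "ttfb": clamp01(1.0 - s.ttfb_ms / ttfb_budget_ms),
        "cache": clamp01(s.cache_hit_ratio),
        "errors": clamp01(1.0 - s.error_rate / error_budget),
    }
    return sum(weights[k] * parts[k] for k in weights)
```

Normalizing every signal to the same 0 to 1 range before weighting keeps one noisy metric from dominating the score.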

02

Traffic rebalancing logic

When a CDN's score dropped below threshold, traffic routing shifted toward higher-scoring providers — without viewer interruption or manual intervention.
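The rebalancing step can be illustrated with one simple policy: drop providers below the threshold and split traffic among the rest in proportion to their scores. This is a sketch under assumptions; the `rebalance` function and the 0.6 threshold are hypothetical, and a production policy might shift traffic more gradually.

```python
def rebalance(scores: dict[str, float], threshold: float = 0.6) -> dict[str, float]:
    """Map provider -> traffic fraction, favoring higher-scoring providers.

    Providers scoring below `threshold` receive no traffic unless every
    provider is below it, in which case all stay in rotation.
    """
    eligible = {cdn: s for cdn, s in scores.items() if s >= threshold}
    if not eligible:
        # Every provider is degraded: keep all of them rather than
        # routing traffic nowhere.
        eligible = dict(scores)
    total = sum(eligible.values())
    if total <= 0:
        # All scores are zero: fall back to an even split.
        return {cdn: 1.0 / len(scores) for cdn in scores}
    return {cdn: eligible.get(cdn, 0.0) / total for cdn in scores}
```

For example, with scores of 0.9, 0.3, and 0.9, the middle provider is removed from rotation and the other two each carry half of the traffic.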

03

Live event dashboard

A real-time view of per-CDN performance during the event, enabling the team to monitor, annotate, and escalate based on observed data rather than guesswork.

05 · Metrics

How we measured success

Success Metrics

Peak CCU sustained without SLA breach across all CDN providers

TTFB within baseline thresholds at peak traffic moments

Error rate below agreed SLA across geographies

Zero unplanned escalations during the event window

Post-event CDN performance summary delivered to customer

Negative Impacts to Watch

Routing instability — frequent switching between CDNs could introduce micro-buffering for some viewers

Over-routing to one CDN if scoring weights aren't balanced, recreating a single-provider dependency

Latency in score updates causing routing decisions to lag behind real delivery degradation
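Each of these three risks can be turned into a simple check on routing data. The helpers below are a hypothetical sketch; the function names and the 10-second staleness limit are assumptions, not part of the event tooling.

```python
def switch_count(assignments: list[str]) -> int:
    """Count how often consecutive routing decisions picked a different CDN.

    A high count within a short window suggests routing instability.
    """
    return sum(1 for prev, cur in zip(assignments, assignments[1:]) if prev != cur)


def max_share(weights: dict[str, float]) -> float:
    """Largest traffic fraction held by any one provider.

    A value near 1.0 means a single-provider dependency has reappeared.
    """
    return max(weights.values(), default=0.0)


def score_is_stale(score_ts: float, now: float, max_age_s: float = 10.0) -> bool:
    """True when a score is older than max_age_s seconds (assumed limit)."""
    return now - score_ts > max_age_s
```

Tracking these alongside the success metrics makes it visible when the routing layer itself is the source of degradation.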

06 · Execution

Rollout & event strategy

Before the Event

Pre-event preparation

Traffic forecast and CDN capacity review with each provider's account team
Routing thresholds defined and validated in staging
Escalation playbook agreed with customer's technical team
Monitoring dashboards tested and ready

During the Event

Live monitoring

Real-time CDN health tracking throughout the stream
Dynamic traffic rebalancing as CDN scores shifted
Continuous cross-team communication cadence
Incident triggers on standby with defined ownership

After the Event

Post-event review

CDN performance summary: TTFB, cache efficiency, error rates per provider
SLA adherence evaluation for each CDN
Delivery optimization recommendations for future events
Internal retrospective to improve the scoring framework

07 · Outcome & Learnings

What this built

The event completed without outages at 1.8M CCU — but the more durable outcome was the delivery framework we built around it. Dynamic CDN routing became a repeatable capability for future high-scale events. The post-event analysis model became a standard artifact in enterprise customer reviews, strengthening trusted-advisor relationships by demonstrating platform transparency.

The key learning: live event reliability is a product problem as much as an infrastructure problem. Pre-defining how the system should behave under degradation — not just how to respond after — is what makes scale achievable without heroics.

Live at Scale: Delivering to 1.8M Concurrent Viewers