Observability in Live Broadcasts and Quality of Experience
In our recent webinar on Multi-CDN lessons learned, Dan Rayburn and I spent a fair amount of time discussing the need to qualify what “performance” means for your streaming service, and to accurately monitor, measure, and observe a viewer’s quality of experience (QoE).
Over the past decade, we’ve come to rely on client-based analytics from Conviva and other vendors to identify areas where QoE might be improved. While client-based analytics provide valuable insights into viewership and critical last-mile performance data, they don’t provide a complete picture.
To truly see what’s going on and ensure a great QoE, you need observability across your streaming stack, integrated into one view — a single pane of glass from which you can monitor, identify issues, and get more information in order to react faster during a live broadcast.
Some common “performance” issues in QoE include the following (a short sketch after the list shows how each maps to a measurable quantity):
Rebuffering occurs when some or all of a video segment needs to be reloaded due to slow encode, transcode, or delivery. According to Mux’s research, rebuffering reduces viewing time by as much as 40%.
Jitter occurs when parts of the stream data are lost in encode, transcode, or transmission. It often leads to rebuffering.
Screen tearing occurs when video data is missing from a stream, leaving a fragmented frame displayed to the viewer. It can be a consequence of jitter.
Slow time to start is traditionally defined as exceeding the two-second start time benchmark, often resulting in viewers leaving a broadcast.
Disconnects occur when some part of the stack fails to deliver expected signals to the client, involuntarily dropping the viewer from the stream. The failure could be in the manifest, a segment, or the network connection.
Network congestion can reduce delivery throughput. It can occur between any of the networked systems in a live-stream delivery stack (e.g. between source and origin, or between edge delivery and the client), and it can manifest as “lag,” rebuffering, and jitter.
“Lag” is a viewer complaint that can mean any number of issues, but most commonly refers to rebuffering or slow start times.
Time-behind-live is the latency, typically measured in milliseconds, between the actual live event at the venue and the video displayed on the client device or player, also called “glass-to-glass” (G2G) latency.
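To make these definitions concrete, here is a minimal Python sketch of how a player’s heartbeat beacons might be reduced to the indicators above. The PlayerEvent shape, its field names, and the thresholds are illustrative assumptions, not any vendor’s actual API:

```python
from dataclasses import dataclass

@dataclass
class PlayerEvent:
    """Hypothetical beacon a player might emit every few seconds."""
    ts_ms: int           # client wall-clock timestamp (ms)
    playhead_ms: int     # stream position currently being rendered (ms)
    buffering_ms: int    # time spent rebuffering since the last beacon (ms)
    startup_ms: int      # time from play request to first frame (ms)

START_TIME_BUDGET_MS = 2_000  # the two-second start-time benchmark

def qoe_summary(events: list, broadcast_start_ms: int) -> dict:
    """Reduce one session's beacons to the QoE indicators above."""
    watch_ms = max(events[-1].ts_ms - events[0].ts_ms, 1)
    buffer_ms = sum(e.buffering_ms for e in events)
    last = events[-1]
    return {
        # share of the session lost to rebuffering
        "rebuffer_ratio": buffer_ms / watch_ms,
        # did the first frame miss the two-second benchmark?
        "slow_start": events[0].startup_ms > START_TIME_BUDGET_MS,
        # time-behind-live: wall clock elapsed vs. playhead progress (a G2G proxy)
        "time_behind_live_ms": (last.ts_ms - broadcast_start_ms) - last.playhead_ms,
    }
```

In practice a client analytics vendor collects these signals for you; the point is that every viewer complaint above maps to a measurable quantity.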
Lifting the veil on observability
One of the most significant challenges in live broadcasting is that a typical streaming stack comprises a variety of technologies. Traditionally, very few of these technologies exposed logging or tracing, and only very limited monitoring; in essence, these stacks were closed, black-box systems. If the encoding frame rate was falling behind the live signal, or there was congestion on the signal transport from the origin to your edges, it could surface as stuttering or rebuffering in the viewing experience. On the other side of the glass, however, it was difficult for a broadcaster to identify where the problem actually occurred: it required looking at each system separately and correlating issues across disconnected metrics and logs.
With traditional content delivery networks, logging data was often delayed by several hours, and potentially longer for large-scale events. This made troubleshooting issues during a live broadcast nearly impossible.
In its simplest form, the problem can be broken down into breadth vs. depth. When you’re operating a live event broadcast, you need to be able to see the health of the overall broadcast (breadth), but in order to actually identify, troubleshoot, and fix an issue, you need detailed tracing and logging (depth).
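As a toy illustration of breadth vs. depth, using made-up edge log records: a single roll-up number tells you the broadcast is unhealthy (breadth), and filtering the same records localizes the fault (depth).

```python
from collections import Counter

# Hypothetical edge log records; real logs carry far more fields.
logs = [
    {"pop": "AMS", "status": 200, "ttfb_ms": 45},
    {"pop": "AMS", "status": 503, "ttfb_ms": 0},
    {"pop": "LHR", "status": 200, "ttfb_ms": 38},
]

# Breadth: one number describing overall broadcast health.
error_rate = sum(rec["status"] >= 500 for rec in logs) / len(logs)
print(f"broadcast error rate: {error_rate:.1%}")

# Depth: drill into the records behind that number to localize the fault.
errors_by_pop = Counter(rec["pop"] for rec in logs if rec["status"] >= 500)
print(f"5xx errors by POP: {errors_by_pop.most_common()}")
```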
Letting performance inform your single, observable framework
One of the most important concepts to keep in mind is how your viewers’ QoE shapes which performance criteria matter to your business. If you’re broadcasting a live esports event with heavy social interaction, time-behind-live latency might be your most important performance metric. A premium live event broadcast with a subscription fee might require you to focus on rebuffering and video quality. These business considerations will drive your observability strategy.
Cindy Sridharan’s book, Distributed Systems Observability, breaks down how monitoring, logging, and tracing work together to provide a complete observability picture.
What we mean by monitoring, logging, and tracing (a code sketch follows the list):
Monitoring and/or metrics provide snapshots based on predefined criteria. You might capture CPU usage and outbound data-transfer rates for your encoders at fixed intervals. At some threshold, you might alert to indicate a problem, one that could cause tearing or image-quality degradation.
Logging provides a record of time-stamped, discrete events. You might log errors seen at the edge (e.g. HTTP 5xx errors), video player or client latency, and the associated downstream network path (ASN) to determine potential sources of jitter or rebuffering. With Fastly Real-Time Logs, you can completely customize the variables you capture and the format you send them in, whether it’s Syslog format, CSV, or JSON.
Tracing provides a view of the end-to-end flow of a request through your entire application stack. In a live event broadcast, you might trace a request through the HLS manifest > HLS segment > origin > transcode/encode > source signal.
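Here is the sketch promised above, showing what each of the three signal types might look like for a single segment request. The emit() helper and all field names are assumptions standing in for whatever metrics, logging, and tracing backends you use, not any product’s schema:

```python
import json
import time
import uuid

def emit(channel: str, payload: dict) -> None:
    """Stand-in for shipping data to a metrics, logging, or tracing backend."""
    print(channel, json.dumps(payload))

trace_id = uuid.uuid4().hex  # ties all three signal types to one request

# Monitoring: a sampled gauge checked against a predefined threshold.
encoder_cpu = 0.93
emit("metric", {"name": "encoder.cpu", "value": encoder_cpu})
if encoder_cpu > 0.90:
    emit("alert", {"name": "encoder.cpu.high", "trace_id": trace_id})

# Logging: a time-stamped discrete event, e.g. a 5xx seen at the edge.
emit("log", {
    "ts": time.time(),
    "trace_id": trace_id,
    "status": 503,
    "path": "/live/seg_1042.ts",
    "client_asn": 64496,  # documentation ASN (RFC 5398), illustrative only
})

# Tracing: one span per hop of the manifest > segment > origin > encode chain.
for hop in ("edge", "origin", "transcode", "source"):
    emit("span", {"trace_id": trace_id, "hop": hop, "start": time.time()})
```

Because every record carries the same trace_id, an alert on the metric can be tied directly to the log events and spans that explain it.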
Individually, each of these components provides a point-in-time view of an issue in your stack. Collectively, they allow you to observe a problem as it’s happening, right where it’s happening. As Fastly’s technical co-founder, Simon Wistow, points out, “monitoring and logging are no longer enough.” In a live broadcast, you must go beyond monitoring or logging and tie these systems together into an observable framework that lets you solve problems as they happen.
As you think about what performance means for your viewer’s experience, Sridharan’s framework is a very useful foundation for a strong observability strategy.
For many broadcasters and content publishers, a typical streaming stack looks something like this: source signal > encode/transcode > origin > edge delivery > client player.
Wowza, Fastly and observability at IBC 2019
To demonstrate these ideas, Wowza will be showcasing their Stream Health dashboard at the 2019 International Broadcasting Convention (IBC) in Amsterdam. This dashboard leverages real-time logging from Fastly’s edge cloud platform, and illustrates the power of taking a breadth and depth approach to your observability strategy.
The demo highlights the tremendous observability boost a streaming stack gains from customized, real-time logs from your edge. Wowza’s products expose a wealth of metrics in their existing dashboards, allowing their customers to see things like source latency in their Clearcaster product, or origin performance in their Wowza Cloud products. Fastly customers like Vimeo, FourSquare, and Taboola have leveraged our real-time log streams for a variety of use cases. In a single second, Fastly sends over 250,000 log lines to Taboola (as of June 18, 2019), each one customized with the observability data they need to make business decisions for their service.
Wowza & Fastly Demos @ IBC 2019
Wowza Stand, Hall 14, booth E08
Microsoft Pod, Hall 1, booth C27
Google Stand, Hall 14, booth E01
We’ll continue working with Wowza and others in the community to expose endpoints where customers can poll critical metrics. An open data-access ecosystem will allow publishers and broadcasters to clearly observe how quality of experience is impacted by performance, and it will allow us all to leverage data warehouses like Google BigQuery and Azure Data Explorer to build a single-pane-of-glass dashboard across multi-vendor technology stacks.
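As a sketch of what that single pane of glass could look like once edge logs and client QoE metrics land in a shared warehouse: the dataset, table, and column names below are assumptions, but the cross-vendor join is the whole idea.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Hypothetical tables: edge_logs streamed in via real-time logging,
# player_qoe exported by a client analytics SDK. Names are illustrative.
sql = """
SELECT
  e.pop,
  COUNTIF(e.status >= 500) / COUNT(*) AS edge_error_rate,
  AVG(p.rebuffer_ratio)              AS avg_rebuffer_ratio
FROM `broadcast.edge_logs` AS e
JOIN `broadcast.player_qoe` AS p
  ON e.session_id = p.session_id
WHERE e.ts > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 MINUTE)
GROUP BY e.pop
ORDER BY edge_error_rate DESC
"""

# One query answers both "is the broadcast healthy?" (breadth)
# and "where is it not?" (depth), across vendors.
for row in client.query(sql).result():
    print(row["pop"], row["edge_error_rate"], row["avg_rebuffer_ratio"])
```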