How to build a fast and reliable OTT ecosystem
This article is about the overall landscape beyond and around video delivery for an OTT service, and about the fundamental role the CDN plays in this ecosystem. Most of the CDN examples use Akamai, but other vendors in this space offer similar services and capabilities.
Often, when it comes to internet video services, the focus on the CDN is limited to the actual video delivery, because that is where the high traffic volumes are; but CDN services are required for much more than video delivery alone.
Let's start with an overview of the ecosystem that composes an OTT solution.
Delivering high-quality video is not enough, because users must be able to access that video. You can stream incredibly high-quality content over the internet, but if people cannot find out that the video exists, it is as if you were not transmitting it at all. Users access videos through applications, whether web, mobile or connected-TV apps, and applications rely on APIs to identify the user, check the user's rights, and provide the list of content and the URLs to retrieve it. If any of this is broken or not performant enough, the quality of the video distribution will never be perceived, because users will sit in front of their devices unable to reach the video.
This schema represents a simplified high-level view of the components of an OTT solution. The frontend layer is delivered to multiple devices, each with its own specific capabilities. Multiple services compose the solution, often managed by different providers and vendors. The business rules and the application logic are therefore distributed and need to be coordinated to ensure a consistent experience.
The user journey to access content involves multiple services and layers: user data (to ensure the user is correctly identified); subscription data (to determine which content packages that user is allowed to access under the specific access conditions, potentially considering geographical location, device type, day of the week and time of day); video metadata (to know which categorisation or classification applies to the content the user is requesting), which is matched against the subscription data; and finally the video content itself, which is delivered to the end user (or not, depending on the outcome of that matching). It is easy to see how large the attack surface is, given the number of services involved and the data flowing in and out of each of them.
The system is only as strong as the weakest link in the chain. The effort on video-based solutions often goes into delivery capacity and security for the video content, but if access to the APIs is poorly managed, insecure or slow, the overall security and performance of the solution will not be acceptable.
We'll now look at three aspects of web content and API delivery (performance, resiliency and security) and see how the CDN plays a fundamental role in each of them.
We start with performance.
Web delivery performance is often associated with caching, and for a large part this is right. Serving cached content from the Edge means managing the HTTP request-response transaction from a point very close to the client; but the closest point to the client is the client itself, so it is key to properly manage client-side caching rules, so that the client avoids initiating an HTTP transaction whenever one is not required. These rules can be set on the backend application serving as the origin to the CDN, but it is also possible to define client caching rules at the Edge, centralising the management of content cache TTLs both at the Edge and on the client side, and assigning the responsibility for both to one specific person or team.
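As a minimal sketch of this idea, the following Python snippet shows how an origin (or an edge rule) could express different client-side and CDN TTLs per content type, using the standard `max-age` and `s-maxage` Cache-Control directives. The paths and TTL values are illustrative assumptions, not recommendations or Akamai defaults.

```python
# Hypothetical cache policies: client TTL and CDN (edge) TTL per path.
CACHE_POLICIES = {
    "/api/catalog": {"client_ttl": 60, "edge_ttl": 300},        # video lists: short client cache
    "/static/app.js": {"client_ttl": 86400, "edge_ttl": 86400}, # versioned assets: long cache
    "/api/profile": None,                                       # per-user data: never cached
}

def cache_headers(path: str) -> dict:
    """Build Cache-Control headers that set the client TTL (max-age)
    and the CDN TTL (s-maxage) independently."""
    policy = CACHE_POLICIES.get(path)
    if policy is None:
        # Per-user or unknown content: forbid any caching.
        return {"Cache-Control": "private, no-store"}
    return {
        "Cache-Control": f"public, max-age={policy['client_ttl']}, "
                         f"s-maxage={policy['edge_ttl']}"
    }
```

Keeping this mapping in one place is the point the paragraph makes: a single team owns both the client-side and the edge TTLs.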
There is a common misconception that if you are publishing APIs you do not need a CDN because you are fully dynamic. Besides the fact that even dynamic, uncacheable content benefits from a CDN thanks to TCP optimisation, content protection, request validation, filtering, routing and distribution, consider that not all API requests are uncacheable, and the cache hit ratio can even be very high. While the response for a user token or profile is intended to be unique per request, the video list varies during the day but is the same for each cluster of users (if not for all of them).
Akamai Web delivery services offer the ability to customise how the cache key is defined. The cache key is the unique identifier of an object in cache, and rules can be defined to translate a specific HTTP request into a corresponding cache key. The more similar HTTP requests you can map to the same response, the higher the cache hit ratio and the more performant the content browsing sessions can be. Case sensitivity, query string ordering and parameters, and specific request headers or cookies can be granularly tuned to be considered part of the cache key or not: for example, you can provide different responses for different user agents, or the same response regardless of which query parameters and values are in the request. It is also possible to cache POST requests by inserting a hash of the request body, or to use variables defined at the Edge, for example based on geographical detection, to discriminate between two requests.
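To make the cache key idea concrete, here is a small sketch of request normalisation: lowercasing the path, dropping tracking parameters and sorting the rest so that equivalent requests collapse onto the same key, with optional extra dimensions taken from selected headers. The list of ignored parameters is an illustrative assumption, and this is not Akamai's actual key format.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Hypothetical parameters that never affect the response and are
# excluded from the cache key.
IGNORED_PARAMS = {"utm_source", "utm_medium", "session_id"}

def cache_key(url: str, vary_headers=None) -> str:
    parts = urlsplit(url)
    # Ignore case differences in the path.
    path = parts.path.lower()
    # Drop tracking parameters and sort the rest so ordering is irrelevant.
    params = sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS
    )
    key = f"{path}?{urlencode(params)}"
    # Optionally vary on selected values (e.g. device class or edge-detected geo).
    for name in sorted(vary_headers or {}):
        key += f"|{name}={vary_headers[name]}"
    return key
```

With this normalisation, `/API/List?b=2&a=1&utm_source=tw` and `/api/list?a=1&b=2` produce the same cached object, which is exactly how the hit ratio goes up.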
Akamai Web delivery services also allow you to construct responses for specific requests directly at the Edge, without the need to forward the request to your origin. This can be used, for example, to handle the OPTIONS method and provide CORS headers in the response, or to set specific cookies or headers to be served to the client as part of the response or added to the request forwarded to the origin (for example to send details about the user's geographical location or connectivity).
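A sketch of the CORS preflight case mentioned above: the edge can answer an OPTIONS request entirely on its own, and only other methods continue to the origin. The allowed-origins list and the return convention (`None` meaning "forward to origin") are illustrative assumptions.

```python
# Hypothetical set of frontends allowed to call the APIs.
ALLOWED_ORIGINS = {"https://app.example.com", "https://tv.example.com"}

def handle_request(method: str, headers: dict):
    """Answer CORS preflights at the edge; return None for anything
    that should be forwarded to the origin."""
    if method == "OPTIONS":
        origin = headers.get("Origin", "")
        if origin in ALLOWED_ORIGINS:
            return 204, {
                "Access-Control-Allow-Origin": origin,
                "Access-Control-Allow-Methods": "GET, POST, OPTIONS",
                "Access-Control-Allow-Headers": "Authorization, Content-Type",
                "Access-Control-Max-Age": "600",
            }
        return 403, {}
    # Other methods go forward, possibly enriched with edge-detected
    # data (geo, connectivity) as extra request headers.
    return None
```

The benefit is that preflight traffic, which can be a large share of API requests, never consumes origin capacity.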
The previous paragraphs aimed to reinforce the concept that web content delivery services do not provide only static caching or TCP optimisation: they must be considered part of the overall solution architecture, because specific business rules apply to delivery, and those rules can be implemented at the Edge to improve the perceived (and measured) performance.
Akamai services allow you to validate requests and to implement not only caching rules but also routing rules, simplifying the management of multiple origins (as in a microservices scenario). Simplifying rule evaluation at the origin makes the backend more performant: for example, different origins behind the same edge domain can serve the end user based on the preferred language, so that the origin does not have to compute which language to select per request, and each of several specialised origins has a language statically assigned.
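The language-routing example can be sketched as a simple lookup on the `Accept-Language` header. The origin hostnames are illustrative assumptions; a real edge rule would be configuration, not code, but the logic is the same.

```python
# Hypothetical specialised origins, one per language.
ORIGINS = {
    "it": "origin-it.example.com",
    "fr": "origin-fr.example.com",
}
DEFAULT_ORIGIN = "origin-en.example.com"

def select_origin(accept_language: str) -> str:
    """Pick the origin from the first (highest-preference) language tag,
    e.g. 'it-IT,it;q=0.9,en;q=0.8' -> 'it'."""
    first = accept_language.split(",")[0].strip()
    primary = first.split("-")[0].split(";")[0].lower()
    return ORIGINS.get(primary, DEFAULT_ORIGIN)
```

Each origin then serves exactly one language and never evaluates the header itself.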
The second area to consider is resiliency.
Resiliency is the ability to deal with failures. A resilient solution starts from the design phase: during every phase of the project, and in every element of the solution, the assumption that something will eventually fail must be made explicitly.
Resiliency is about how the software solution is designed, implemented and deployed, but it is also very much about the mindset and procedures of the organisation and the people involved in the service.
In software development and architecture the focus is often on finding root causes and fixing them. This is definitely good practice in problem management, but a solution must also have a strong incident management procedure, whose focus, more linked to operations than to coding, is to ensure the smoothest possible user experience even when components fail.
When a solution goes live and enters operation, it is key to have clearly defined responsibilities for business continuity decisions, a streamlined communication path, and an agreed map of possible countermeasures for each kind of failure.
All this is very difficult when working in silos, which is why the cultural movement called DevOps is becoming more and more common. DevOps is not about tools or procedures: it is about building a unified team from design through implementation to operation, so that a common business context underpins every decision, and so that this context, the values the solution wants to deliver, and the architectural and technological choices made from coding to deployment are well known by every member of the team, whether their skillset is analysis, development, QA, sysadmin or operations.
The main prerequisite to be able to deal with incidents is to be aware of them.
The Akamai platform has an extensive, and often little-known and underused, alerting solution. Alerts can be configured for rising thresholds of specific HTTP response codes, for increases in the overall volume of requests hitting the origin, for variations (increases or decreases) of Edge traffic, or for specific errors (DNS, SSL transaction or origin availability). Adaptive alerting is also available, raising an alert when the usage pattern of a specific service does not match the predictive model Akamai built from historical data for that configuration or property.
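As a toy illustration of the simplest kind of alert described above, here is a threshold check on the share of 5xx responses in a window of traffic. The 5% threshold is an illustrative assumption; real platforms evaluate this over rolling time windows.

```python
def error_rate_alert(responses, threshold: float = 0.05) -> bool:
    """Return True when the share of 5xx response codes in the window
    exceeds the threshold (hypothetical 5% default)."""
    if not responses:
        return False
    errors = sum(1 for code in responses if code >= 500)
    return errors / len(responses) > threshold
```

Adaptive alerting replaces the fixed `threshold` with a value predicted from the historical pattern of that property.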
Edge services offer a lot for resiliency. Multiple origins can be configured, with or without the Site Failover functionality, to route traffic from a failing origin to a working one. The secondary location can be an exact copy of the main one in a different hosting provider, to cope with hosting failures or connectivity-related issues. But it can also be a stub providing static responses to API calls to unblock a process (for example assuring that a subscription check passes, if the business continuity rules define an "Open Gate" policy so as not to deny access to content when rights cannot be properly checked). Or it can be an origin hosting a different version of the application, to correct a specific bug or to roll back from a buggy deployment.
It is also possible to create rules at the Edge to deal with errors: in case of origin unavailability, a stale version of the content can be served, or a specific response can be constructed to drive the user journey during origin failures.
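The failure-handling options from the last two paragraphs can be sketched as a single fallback chain: try the primary origin, then the secondary, then stale cache, then a static "Open Gate" stub. The function names, the exception used to model an origin outage, and the stub body are all illustrative assumptions.

```python
# Hypothetical static stub honouring an "Open Gate" business continuity
# policy: if rights cannot be checked, do not deny access.
OPEN_GATE_STUB = {"entitled": True, "note": "fallback: open gate policy"}

def fetch_with_failover(fetch_primary, fetch_secondary, stale_cache: dict, key: str):
    """Return the first successful response along the fallback chain."""
    for fetch in (fetch_primary, fetch_secondary):
        try:
            return fetch(key)
        except ConnectionError:
            continue  # this origin is unavailable, try the next option
    if key in stale_cache:
        return stale_cache[key]  # serve stale rather than fail
    return OPEN_GATE_STUB  # last resort: keep the user journey alive
```

Each step trades freshness or accuracy for availability, which is exactly the business continuity decision that must be agreed before going live.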
The third area is security, an area where awareness has been constantly growing in recent years.
We mentioned in the introduction that the level of security of a solution equals the level of security of its weakest module. We also mentioned that access to video happens through your web application and APIs; therefore the security of your video rests on the security of your web application and APIs.
Akamai services provide various capabilities to manage security, and not all of them are strictly linked to the Kona suite of products: some security mechanisms are already available in the delivery products.
With the web delivery configurations it is possible to act in detail on every element of the HTTP request-response transaction. You can validate and filter, for each path, filename or file extension, which HTTP verbs you want to allow, and analyse and act on both request and response headers and cookies. You can validate the format and values of query strings and request bodies. All these features, available in standard delivery configurations, can massively reduce the attack surface and stop disclosing precious information to attackers (for example by removing the Edge debug headers, or headers inserted by your OS or application server that would allow an attacker to identify the technical stack you are using).
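Two of those filters can be sketched in a few lines: a per-path HTTP verb allow-list, and the scrubbing of stack-revealing response headers. The paths, verbs and header names are illustrative assumptions.

```python
# Hypothetical per-path verb allow-lists; unknown paths default to GET only.
ALLOWED_METHODS = {
    "/api/profile": {"GET", "PUT"},
    "/api/catalog": {"GET"},
}

# Headers that disclose the technical stack and should never reach the client.
SENSITIVE_HEADERS = {"server", "x-powered-by", "x-aspnet-version"}

def validate_request(path: str, method: str) -> bool:
    """Reject any verb not explicitly allowed for the path."""
    return method in ALLOWED_METHODS.get(path, {"GET"})

def scrub_response_headers(headers: dict) -> dict:
    """Drop headers that would fingerprint the OS or application server."""
    return {k: v for k, v in headers.items() if k.lower() not in SENSITIVE_HEADERS}
```

Rejected verbs never reach the origin, and the scrubbed response gives an attacker nothing to fingerprint.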
Other standard ways of protecting access to content are Geo Blocking rules and Tokenization (URL signing).
More advanced services, such as the Web Application Firewall, the API Gateway and Bot Manager, are available to operate fine-grained access and content protection.
We could talk about many tools and services available to automatically identify and block attacks, but I can never stress enough the important role of visibility for the team operating your solution.
One of the main benefits we experienced when enabling a WAF solution was access to near real-time data about requests and request analysis. You can also create custom rules to filter specific request patterns and have them highlighted in the reporting, in order to make informed decisions.
What is important is to consider all sources of information and the overall context. Traffic data (consumed bandwidth) is, per se, a non-critical piece of information; but if you combine your current consumption with information from analytics tools (such as Akamai Media Analytics QoS or Conviva for video), and you know how many real concurrent users you have and the average bitrate they consume, you can easily verify whether your traffic includes illegitimate consumption of your video. The same applies to API requests, where you can combine analytics about the distribution of users across devices and locations and verify whether it matches the distribution of requests you received. You may also want to compare your CDN reports for Edge and origin requests with the traffic and request logs you collect at the origin, to identify illegitimate access happening directly against your origin.
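The bandwidth cross-check described above is simple arithmetic: expected bandwidth is roughly concurrent users times average bitrate, and measured bandwidth well above that is a red flag. The 20% tolerance factor is an illustrative assumption to absorb measurement noise.

```python
def traffic_is_plausible(measured_gbps: float, concurrent_users: int,
                         avg_bitrate_mbps: float, tolerance: float = 1.2) -> bool:
    """Compare CDN-measured bandwidth against what the analytics-reported
    audience should consume: users x average bitrate (Mbps -> Gbps)."""
    expected_gbps = concurrent_users * avg_bitrate_mbps / 1000
    return measured_gbps <= expected_gbps * tolerance
```

For example, 2,000 concurrent users at 5 Mbps should consume about 10 Gbps; measuring 20 Gbps suggests the stream is being consumed outside your applications.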
Now that we have reviewed considerations for the three areas (performance, resiliency and security), let's quickly move our attention to the relationship between the Cloud and the CDN.
Cloud services offer the ability to create elastic and geographically distributed deployments, yet they still represent deployments into specific locations and datacentres. Edge services are instead by nature distributed across many locations and datacentres, with the locations in use not selected by the deployment team but activated directly by proximity to the interacting users.
While a complex multi-tier application is not suitable for an Edge deployment, there are specific sets of business rules that can be implemented at the Edge, reducing the capacity required at the origin (the origin being a Cloud service) and optimising the round trip between the user request and the response delivery.
Besides the request routing, distribution, filtering and validation mentioned before, it is possible to run more complex activities to process a request directly at the Edge, such as constructing a computed (therefore variable) response to specific requests. We have used this, together with Akamai Professional Services, to implement a simple entitlement service that validates a user's request for a video and, only when the request is validated, returns as part of the response the video URL with a one-time token computed for that specific request. In another case we used geographical information to populate variables, and built a specific logic to vary the response based on the values of those variables.
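The entitlement flow described above can be sketched as: check the subscription, and only on success mint a URL with a per-request nonce signed into the token. This is a hypothetical reconstruction of the pattern, not the actual implementation; the subscription store, secret and parameter names are assumptions.

```python
import hashlib
import hmac
import secrets

# Hypothetical entitlement data and signing secret held at the edge.
SECRET = b"edge-entitlement-secret"
SUBSCRIPTIONS = {"user-1": {"sports", "movies"}}

def entitle(user_id: str, package: str, video_path: str):
    """Return a one-time signed video URL, or None if the user
    is not entitled (so the URL is never disclosed)."""
    if package not in SUBSCRIPTIONS.get(user_id, set()):
        return None
    nonce = secrets.token_hex(8)  # unique per request: makes the token one-time
    token = hmac.new(SECRET, f"{video_path}:{nonce}".encode(),
                     hashlib.sha256).hexdigest()
    return f"{video_path}?nonce={nonce}&token={token}"
```

Because the nonce changes on every call, a captured URL cannot be replayed as a stable, shareable link.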
The Edge is the natural extension of the Cloud, and the two work perfectly together.
CDN services can integrate with Cloud providers, optimising network consumption and implementing not only distribution but also security services as a "wrapper" around the Cloud.
The integration between Azure Media Services and Akamai AMD is a perfect example of how the two kinds of deployment can be glued together into an optimal end-to-end solution. While Azure Media Services provides a complete set of functionalities to ingest, transform, store and publish video content, Akamai AMD offers optimal content distribution and protection, also acting as a shield for the Cloud origin without the need to strictly limit the number of accessing nodes through static IP filtering. Access filtering is in fact operated via request signing between the Edge and the origin, using a dedicated custom request header added by the Akamai nodes and validated by Azure.
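The edge-to-origin request signing can be sketched generically as a shared-secret HMAC carried in a custom header: the edge signs each request, and the origin accepts only requests carrying a valid signature, regardless of source IP. The header name and scheme are illustrative assumptions, not the actual Akamai/Azure mechanism.

```python
import hashlib
import hmac

# Hypothetical secret shared between the edge network and the origin.
EDGE_SECRET = b"secret-shared-with-edge"

def edge_sign(path: str) -> dict:
    """Header added by the edge node to each forwarded request."""
    sig = hmac.new(EDGE_SECRET, path.encode(), hashlib.sha256).hexdigest()
    return {"X-Edge-Auth": sig}

def origin_accepts(path: str, headers: dict) -> bool:
    """Origin-side check: reject any request without a valid signature,
    replacing brittle static IP filtering."""
    expected = hmac.new(EDGE_SECRET, path.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(headers.get("X-Edge-Auth", ""), expected)
```

Direct requests that bypass the CDN lack the header and are refused, so the origin stays shielded even as edge node IPs change.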
Quote from Grady Booch, one of the fathers of UML. Reported here as a reminder that the overall solution has to be part of the design, not only the software modules: infrastructure services such as Cloud hosting and CDN offer capabilities that must be part of the design phase, just as the operating model of the solution has to be part of the project from its initial stages.
Quote from George Box, British statistician. Reported as a reminder that reality is more complex than any design, but a good model is required to drive the design. The solution will be subject to continuous adaptation once live, to respond to real-life variation.
Quote from Renzo Piano, architect. Reported as a reminder that the context is constantly evolving, so we need to always be ready to reinvent ourselves and our solutions, and never assume we have the right solution to be implemented again and again as-is.
The content of this article was part of a presentation given at the Akamai Media Tech Day in Milan on 15th March 2019. The slides are available on the Newesis YouTube channel at https://www.youtube.com/watch?v=7eHYZgPqvL0, while the video of the session is available at https://www.youtube.com/watch?v=ErkeHC-HISQ (audio in Italian).