If you are a cricket fan, you surely remember the IND vs. NZ 2019 World Cup semi-final, where Hotstar set a world record of 25M+ concurrent live viewers watching the match with a smooth experience and almost zero downtime. Sports like football and cricket are streamed by millions of users, and during peak events like the IPL, the Football World Cup, or an India vs. Pakistan match, these numbers can be far higher than expected, and OTT platforms have to handle that traffic without any downtime for a seamless user experience.
Have you ever wondered how OTT platforms like Jio and Hotstar manage scalability, databases, load balancing, and live streaming without any downtime while serving millions of users at a single instant? Let's understand this in this article.
Summary of Live Users in INDIA vs. New Zealand
Let's walk through the graph, which represents the number of users watching over the course of the match, breaking it into a few points for clarity.
Day 1
On the first day, New Zealand won the toss and chose to bat; rain then stopped play, and the match continued on Day 2.
1. During Toss
The first small spike occurs at Point 1, during the toss.
2. NZ Chooses to Bat
Between Point 1 and Point 2, the user base suddenly jumps from roughly 1.5M to 10M+ live viewers as New Zealand starts batting.
3. During NZ Batting
Between Point 2 and Point 3, the number of live viewers hovers around 10 million, but you can also see some dips and rises. This is completely normal and is caused by drinks breaks and strategic timeouts: users leave the platform during the break and return when the match resumes.
4. Rain Started
Due to rain, the platform sees a sudden drop from 13.9M live users to nearly 3M between Point 3 and Point 4.
Day 2
The next day, when the match resumed, New Zealand still had overs left to bat, which led to the rise in users from Point 4 to Point 5.
1. NZ Batting Ends
Between Point 5 and Point 6, users start leaving the app because New Zealand's innings was over; after Point 6, the numbers start rising again because India's batting begins.
2. During India's Batting
At Point 7, there were around 16 million concurrent users, as India was losing wickets regularly. Then Dhoni came out to bat, which pushed the platform to 25M+ live users, a growth rate of over 1M users per minute.
3. When Dhoni Got Out
At Point 8, two things happened: Dhoni got out, and Hotstar set a world record for the highest number of users ever watching a live stream.
But this point is also the most dangerous for the platform: there is a sudden drop from 25M+ users to nearly 1M. This is where most platforms would not survive, and it demonstrates Hotstar's stability.
Let’s Understand It More Clearly
When people watch a video on Hotstar, they request the video playback files; in the case of a live cricket match, these are just video API calls. The problem arises when people suddenly stop watching the live match and hit the back button. There are two cases: either they exit the app, or the back button lands them on the home page, which issues a homepage request that also includes personalized recommendations. If 25 million users hit the back button at the same time, the servers face a dangerous load of these expensive requests, and the application may crash. The application must be able to handle this, because traffic can suddenly rise or drop at any time, so you have to be prepared accordingly.
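A common defence against this kind of thundering-herd problem is load shedding: when exits spike, serve a pre-rendered generic homepage instead of computing personalized recommendations for every user. The article does not describe Hotstar's exact implementation; the following is a minimal Python sketch of the idea, with hypothetical names (`CACHED_HOMEPAGE`, `panic_threshold`):

```python
CACHED_HOMEPAGE = {"trays": ["Top Picks", "Live Now"]}  # pre-rendered, generic page

def homepage(user_id, concurrent_exits, panic_threshold=1_000_000,
             fetch_personalized=None):
    """Load-shedding sketch: when too many users hit Back at once,
    skip the expensive recommendation call and serve a cached page."""
    if concurrent_exits > panic_threshold or fetch_personalized is None:
        return CACHED_HOMEPAGE                # cheap, no backend fan-out
    return fetch_personalized(user_id)        # normal personalized path

print(homepage("u42", concurrent_exits=5_000_000))
# {'trays': ['Top Picks', 'Live Now']}
```

The cached page is slightly worse for the user, but it keeps the recommendation backend alive through the spike.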
Facing the Mock Game Before Actual Game Starts
See, cricket is India's favourite sport, and people celebrate it like a festival. So if it is a cricket match, especially the Indian Premier League, the traffic will definitely hit millions, but the exact number cannot be predicted in advance. To check how much load the application can handle, large-scale load testing is conducted against the platform.
In the case of IND vs. NZ, the Hotstar team created an in-house project, "Project HULK", to load-test their platform. The infrastructure used is massive, as you can see in the image above.
Look at the load-generation infrastructure in the image above, which generates the test load. The c5.9xlarge machines were distributed across 8 different regions; the generated traffic travels over the internet to the CDN, then to the load balancers, which autoscale, and finally to the application. The machines are spread across 8 regions because generating this much load from a single region of a shared cloud could affect the cloud provider's other customers in that region. Every load balancer has a peak value up to which it can handle load, so each Hotstar application sits behind 3-4 load balancers to distribute the load, which enables scalability.
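Project HULK's internals are not public, but the core job of a load generator, fanning many concurrent requests out from a worker pool and tallying the responses, can be sketched in a few lines of Python. `fire_request` below is a stand-in for a real HTTP call (e.g. via `urllib.request`); it is simulated here so the sketch is self-contained:

```python
import concurrent.futures

def fire_request(session_id):
    # Stand-in for an HTTP call; a real generator would hit the
    # CDN/load-balancer endpoint and return the response status.
    return {"session": session_id, "status": 200}

def generate_load(total_requests, workers=50):
    """Fan `total_requests` out across a thread pool and tally status codes."""
    tally = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        for result in pool.map(fire_request, range(total_requests)):
            tally[result["status"]] = tally.get(result["status"], 0) + 1
    return tally

print(generate_load(1000))  # {200: 1000}
```

The real system runs thousands of such generators across regions precisely so that the aggregate load resembles a match-day spike rather than traffic from one data centre.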
Scaling Numbers
If you look at the very first graph, 10M+ users joined the platform within about 10 minutes, which is roughly 1M per minute. To handle this traffic, the team has to scale up in advance, because by the time an EC2 instance is provisioned, boots up, and becomes healthy behind a load balancer, 5-7 minutes are gone; in a live match that risk cannot be taken, because traffic can grow by another 5-7 million in that window. Hotstar uses FULLY BAKED AMIs (machine images with the application pre-installed) because they cannot tolerate any startup delay and want the application to run smoothly for a good user experience.
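Scaling "in advance" boils down to provisioning for the traffic you expect by the time new instances become healthy, not the traffic you see now. The article does not give Hotstar's formula, so the sketch below is an assumption-labelled illustration; the growth rate, boot time, per-instance capacity, and headroom factor are all hypothetical inputs:

```python
import math

def instances_needed_ahead(current_users, growth_per_min, boot_minutes,
                           users_per_instance, headroom=1.25):
    """Proactive scaling sketch: provision for the users expected once new
    instances finally become healthy, plus a safety headroom."""
    projected = current_users + growth_per_min * boot_minutes
    return math.ceil(projected * headroom / users_per_instance)

# Hypothetical numbers: 15M users, growing 1M/min, 6-minute boot-to-healthy
# time, 20k users served per instance.
print(instances_needed_ahead(15_000_000, 1_000_000, 6, 20_000))  # 1313
```

Fully baked AMIs shrink `boot_minutes`, which directly shrinks how far ahead of current traffic you must provision.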
Autoscaling
AWS provides an autoscaling feature, but instead of using it, Hotstar built their own. Let's first understand why AWS autoscaling does not work for Hotstar in this case.
To understand AWS autoscaling, refer to this article: Amazon Web Services – Auto Scaling Amazon EC2
1. Why Hotstar Does Not Use Existing Autoscaling?
Because scaling from 15 million to 25 million users in just 10 minutes requires a lot of servers from AWS, and you may simply not get the desired capacity on demand. Even if you do get the servers, there is another issue: STEP SIZE autoscaling. When you request an increase in desired capacity, AWS adds servers in steps of 10 or 20, which makes the process very slow; that is not acceptable when you are running a live match.
You can also ask the cloud provider to raise your service limits and allow scaling in step sizes of 100 or 200, but at this scale that can cause more damage to the system, so it is not feasible either. Since Hotstar built their own autoscaling system, let's understand how it works.
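To see why step size matters, count how many scaling actions (each with its own provisioning round-trip) are needed to add a given number of instances. A quick back-of-the-envelope calculation, with hypothetical instance counts:

```python
def scale_up_actions(instances_needed, step_size):
    """Number of scaling actions (and hence provisioning rounds) needed
    to add `instances_needed` servers at a fixed step size."""
    return -(-instances_needed // step_size)  # ceiling division

# Hypothetical: adding 600 instances during a traffic ramp.
print(scale_up_actions(600, 10))   # 60 actions at step size 10
print(scale_up_actions(600, 100))  # 6 actions at step size 100
```

Sixty sequential scaling rounds is far too slow for a 10-minute, 10-million-user ramp, which is the core of Hotstar's objection.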
2. Self Auto-Scaling Mechanism of Hotstar
The general mechanism of autoscaling is based on metrics like CPU and network usage, but Hotstar created their own auto-scaling tool, which scales the system based on request rate and concurrency.
They don't use CPU metrics; instead, they have benchmarked the usage of each server and container. The decision depends on the rated RPM (requests per minute) that each container can serve. Since the scaling metric differs per application, when the request count for an application is high, the system scales using request count as the metric, and when concurrency is high, it scales on concurrency accordingly.
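The scaling decision described above, benchmark each container's rated capacity, then size the fleet by whichever metric (request rate or concurrency) demands more, can be sketched as follows. The rated numbers are hypothetical, since the article does not publish Hotstar's benchmarks:

```python
import math

def desired_containers(request_rate_rpm, concurrency,
                       rated_rpm, rated_concurrency):
    """Size the fleet by whichever benchmarked metric demands more capacity."""
    by_rpm = math.ceil(request_rate_rpm / rated_rpm)
    by_concurrency = math.ceil(concurrency / rated_concurrency)
    return max(by_rpm, by_concurrency)

# Hypothetical: 2.4M RPM against containers rated for 60k RPM each, and
# 500k concurrent streams against containers rated for 10k each.
print(desired_containers(2_400_000, 500_000, 60_000, 10_000))  # 50
```

Because the benchmark is per application, a video API might scale on concurrency while the homepage API scales on request count, each using its own rated numbers.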
CHAOS Engineering
Chaos Engineering is a practice in which you deliberately find the breaking points of a system: you identify at which point the system will fail, and learn how to overcome that failure without impacting users.
Now, in a live match with 25 million users streaming, there are multiple ways the platform can break into failure. Let's understand some of them.
1. Increased Latency
If any API's latency increases, it has a cascading effect on the other services of the entire application. Say the content platform APIs become slower: other services consume these APIs to show content on the homepage, so the personalization engine (the recommendation engine built on your content consumption history) slows down, the homepage loads slowly, and ultimately the entire application slows down.
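A standard way to stop one slow API from cascading is a circuit breaker: after repeated failures or timeouts, callers stop waiting on the sick dependency and serve a degraded response instead. The article does not say which mechanism Hotstar uses; this is a minimal, generic sketch of the pattern:

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: after `max_failures` consecutive
    failures, short-circuit calls for `reset_after` seconds so one slow
    API cannot drag down every service that depends on it."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker tripped open

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()      # open: fail fast with degraded response
            self.opened_at = None      # half-open: try the real call again
        try:
            result = fn()
            self.failures = 0          # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()

# Usage: serve a generic content tray while recommendations are down.
# breaker.call(fetch_recommendations, fallback=lambda: ["Top Picks"])
```

For the homepage example above, the fallback would be a non-personalized tray: slightly worse content, but the page still loads fast.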
2. Network Failure (CDN)
Here the platform depends on its CDN providers: if the edge location near the viewer's home goes down, all requests fall back to the mid layer or the origin endpoint, and if the platform cannot handle that many requests, it may bring down the application, ultimately leading to a bad user experience.
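The usual mitigation is a failover chain: prefer the nearest healthy edge, fall back to farther edges, and only hit the origin as a last resort. Hotstar's actual CDN routing is not described in the article; this is a toy sketch of the selection logic, with hypothetical host names:

```python
def pick_cdn(edges, health):
    """Pick the first healthy edge from an ordered preference list;
    fall back to the origin only as a last resort.
    `health` maps host -> bool (from active health checks)."""
    for host in edges:
        if health.get(host, False):
            return host
    return "origin.example.com"  # hypothetical origin endpoint

print(pick_cdn(["edge-mumbai", "edge-chennai"],
               {"edge-mumbai": False, "edge-chennai": True}))  # edge-chennai
```

The danger the article points out is precisely the last line: if every edge in the chain is unhealthy, the whole viewer population lands on the origin at once.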
3. Delayed ScaleUp
This is the reason, discussed above, that Hotstar uses its own auto-scaling tool: if the infrastructure is not scaled up in time to handle the load, users get a bad experience, because AWS autoscaling takes time to scale, which is not acceptable in a live game.
4. Bandwidth Constraints
The problem arises when a large user base arrives to watch the live stream: the application consumes a lot of video bandwidth. In this match, more than 10 Tbps of bandwidth was consumed, which is nearly 70% of India's total internet capacity. Hence, there is limited room to operate when users are increasing at this pace.
Conclusion
This is how Hotstar and JioCinema handle millions of users so smoothly at peak time, holding the record for the most concurrent live viewers with zero downtime while their other services keep working fine. The major role is played by load balancing and autoscaling, which they handle themselves with the custom tools discussed above. They also test the system before the actual stream and, with the help of advance scaling, maintain a good user experience.