When AI architects think about ML Serving, they focus primarily on speeding up the inference function in the Serving layer. Worried about performance, they optimize towards overcapacity, leading to an expensive end-to-end solution. When the solution is deployed, the cost of serving alarms those responsible for budgets, which can lead to the solution being abandoned. Keeping costs down is an important goal for a practical architecture and a successful AI solution, and asynchronous architectures can help achieve it.
The default architecture that architects come up with is a synchronous one. A simplified version of this architecture is provided here:
An ML Service API, typically a REST API, sits in front of the serving layer. It takes care of standard API functions like authentication and load balancing. A cluster of ML Serving nodes is set up behind this Service API. Each node provides an inference function that takes input feature variables and returns the required prediction. The Service API layer load-balances requests across the Serving nodes.
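As a rough illustration, a synchronous serving node could look like the minimal sketch below. It assumes a Flask-based REST endpoint and a placeholder `predict_fn`; the route, port, and payload shape are hypothetical, not tied to any particular platform.

```python
# Minimal sketch of a synchronous serving node (illustrative only).
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict_fn(features):
    # Placeholder for the real inference function; in practice this would
    # call the trained model, e.g. model.predict(features).
    return {"score": 0.0}

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()   # input feature variables from the client
    result = predict_fn(features)   # blocking inference call on the serving node
    return jsonify(result)          # response returned on the same connection

if __name__ == "__main__":
    app.run(port=8080)
```

The client blocks on this call, so every request occupies a serving node for the full duration of the inference.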
What are some of the problems of this synchronous architecture? Typically, serving load fluctuates across time intervals. If the solution needs to handle the maximum load, it must be provisioned to maximum capacity. When the system is not utilized at its peak, the provisioned resources sit idle. ML Serving may use expensive resources like GPUs, and keeping them idle is a waste of money. Alternatively, if we provision the solution to handle only average or above-average loads, the Service API layer needs to reject requests beyond the provisioned capacity to prevent back-pressure on the serving nodes. This amounts to a denial of service. Neither scenario is desirable. While new developments in elastic scaling in the cloud help alleviate some of these concerns, they introduce other issues.
A good alternative to synchronous serving is asynchronous serving, which helps optimize ML Serving node resources and enables additional capabilities. Here is how a typical asynchronous serving architecture would work:
When the ML Service API receives a request, instead of sending it directly to an ML Serving node, it places the request in a Request Queue, which is a publish-subscribe queue; Apache Kafka is a great technology for this role. ML Serving nodes subscribe to this queue: they listen for new requests and pull them in when they have free capacity. The results are then pushed to a Response Queue, another publish-subscribe queue, to which the Service API subscribes. As responses arrive, the ML Service API returns them to the clients. Clients can either wait for the response on the same request connection or provide a callback endpoint to receive the responses.
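For illustration, the serving-node side of this flow might look like the sketch below. It assumes the kafka-python client, hypothetical topic names (`ml-requests`, `ml-responses`), a placeholder `predict_fn`, and a simple JSON payload carrying a request ID; a real deployment would have its own schema and broker configuration.

```python
# Illustrative sketch of an asynchronous ML Serving node.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "ml-requests",                               # hypothetical request topic
    bootstrap_servers="localhost:9092",
    group_id="ml-serving-nodes",                 # serving nodes share one consumer group
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def predict_fn(features):
    # Placeholder for the real inference function.
    return {"score": 0.0}

for message in consumer:                         # pull requests as capacity allows
    req = message.value
    prediction = predict_fn(req["features"])
    producer.send("ml-responses", {
        "request_id": req["request_id"],         # lets the Service API match responses to clients
        "prediction": prediction,
    })
```

The Service API side is the mirror image: it produces each incoming request, tagged with a request ID, to the request topic and consumes the response topic, matching responses back to waiting connections or callback endpoints by that ID.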
There are multiple advantages to the asynchronous architecture approach:
- The ML Serving nodes can be provisioned for average loads. When there is a sudden spike in load, the queues absorb the back-pressure temporarily. Care should be taken to understand load patterns and provision enough resources so that catch-up happens within acceptable thresholds.
- Any pre-processing needed for requests can be handled by streaming jobs on the request queue, which again ensures even distribution and scaling; the same applies to post-processing on the response queue (see the sketch after this list).
- Message Queue technologies provide capabilities like persistence and fault tolerance, so these don’t have to be built out separately.
- Message queues can also provide the same input to other subscribers, like reporting and analytics.
- This architecture provides loose coupling between services, allowing them to evolve independently and enabling Agile development and deployment.
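A streaming pre-processing job, as mentioned in the list above, could be sketched as follows. It again assumes kafka-python and hypothetical `ml-requests-raw` and `ml-requests-preprocessed` topics; the transformation itself is a stand-in for whatever feature preparation the model needs.

```python
# Illustrative sketch of a streaming pre-processing job on the request queue.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "ml-requests-raw",                           # hypothetical raw request topic
    bootstrap_servers="localhost:9092",
    group_id="preprocessing-jobs",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def preprocess(features):
    # Placeholder transformation, e.g. cleaning or feature encoding.
    return {k: str(v).strip().lower() for k, v in features.items()}

for message in consumer:
    req = message.value
    req["features"] = preprocess(req["features"])
    producer.send("ml-requests-preprocessed", req)   # serving nodes subscribe to this topic
```

Because the job is just another consumer group, it scales independently of the serving nodes, and other subscribers (such as reporting and analytics) can read the same topics without affecting the serving path.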
An immediate concern raised by this approach is response latency. In theory, the asynchronous hops introduce additional latency, but in most practical cases this additional latency remains within the response-time limits the solution requires. Set up correctly, an asynchronous solution can provide the same average response times as its synchronous equivalent.
While architecting the solution, the suitability of the asynchronous architecture should be evaluated against business needs. If the end-user experience is blocked until a prediction is made, and the prediction is critical, then synchronous may be the way to go despite the higher costs; an example would be operating critical machinery based on predictions. On the other hand, if the user experience can proceed and a delay in predictions is not critical, asynchronous is the way to go; an example would be providing recommendations to a user on an e-commerce website while the user is browsing the catalog.
Asynchronous pipelines are a great tool in the architect’s toolset. It is still the architect’s call whether they are appropriate to the use case in question. Do catch my talk about experiences in building an ML platform at ODSC West (https://odsc.com/speakers/building-a-ml-serving-platform-at-scale-for-natural-language-processing/), where I will discuss similar interesting asynchronous architecture options.
Kumaran Ponnambalam is an AI and Big Data leader with 15+ years of experience. He is currently the Senior AI Architect for Webex Contact Center at Cisco. He focuses on creating robust, scalable AI platforms and models to drive effective customer engagements. In his current and previous roles, he has built data pipelines, ML models, analytics, and integrations around customer engagement. He has also authored several courses on the LinkedIn Learning platform in the Machine Learning and Big Data areas. He holds an MS in Information Technology and advanced certificates in Deep Learning and Data Science.