Förderjahr 2021 / Stipendien Call #16 / ProjektID: 5884 / Projekt: Efficient and Transparent Model Selection for Serverless Machine Learning Platforms
How are machine learning models currently deployed and delivered on existing platforms? Let's find out. (This time in English, as my diploma thesis will be written in English)
As I mentioned in my first blog post, serverless computing is a good candidate for delivering machine learning models. However, other paradigms are also suitable in some cases and are currently offered in the ML industry. In this post, I present and compare existing machine learning platforms from industry and academia, covering their deployment offerings, including serverless computing and others.
SageMaker is a prominent example in the space of cloud machine-learning platforms - not only because it is part of the largest public-cloud vendor (AWS). SageMaker's offering is broad and extensive and addresses a diverse range of users with different specializations. The platform provides both graphical interfaces for business analysts and technical interfaces for data scientists to develop and train new models. The most interesting aspect for my work, however, is its offering for ML engineers, which focuses on MLOps - the deployment and management of models.
The graphic shows the optimized SageMaker MLOps pipeline that enables developers to quickly and efficiently get a working model deployed and available for inference. Many parts of this pipeline are relevant to my work as well. More concretely, I will go into more detail about the model registry, model deployment, and monitoring steps of the pipeline.
The model registry is how SageMaker stores models for further development and deployment. For each ML problem (i.e. application), the registry creates a "model group", with each new training of a model becoming a versioned "model package" inside that group. Versions start at 1 and increase by 1 with each training of the model. The registry exists as an alternative to the AWS Marketplace, where pre-trained models can be selected instead of custom-trained ones. The registry also features an approval mechanism, so models can be tested before they are deployed for inference.
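To make the registry semantics concrete, here is a minimal sketch in plain Python (not the real SageMaker API - the class and method names are mine) that mimics how model groups, auto-incrementing package versions, and the approval gate interact:

```python
# Illustrative sketch of SageMaker model-registry semantics; this is NOT the
# real API, just a toy model of groups, versioning, and approval.
class ModelRegistry:
    def __init__(self):
        self.groups = {}  # model group name -> list of model packages

    def register(self, group, artifact, approved=False):
        """Store a new training run as the next version in its model group."""
        packages = self.groups.setdefault(group, [])
        version = len(packages) + 1  # versions start at 1, +1 per training
        package = {"version": version, "artifact": artifact, "approved": approved}
        packages.append(package)
        return package

    def latest_approved(self, group):
        """Deployment only considers packages that passed the approval step."""
        approved = [p for p in self.groups.get(group, []) if p["approved"]]
        return approved[-1] if approved else None


registry = ModelRegistry()
registry.register("churn-prediction", "s3://bucket/model-v1.tar.gz")
registry.register("churn-prediction", "s3://bucket/model-v2.tar.gz", approved=True)
print(registry.latest_approved("churn-prediction")["version"])  # prints 2
```

The group name and S3 paths are hypothetical; the point is only that deployment pulls the newest *approved* version, not simply the newest training run.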
When deploying models, there are multiple options that can be tweaked, with the main distinction being the inference type:
- Real-time inference is perhaps the most "basic" type, as it involves hosting an endpoint continuously and scaling it as needed. Amazon recommends this type when low latency is required.
- Serverless inference is the most interesting type in relation to my thesis. Amazon describes its benefits as less management time spent on choosing instances and scaling options. They recommend this type for use cases with long idle periods that are less affected by container cold starts.
- Asynchronous inference is perhaps the antithesis of serverless inference, as it is useful for requests with large bodies (at gigabyte scale) and long processing times.
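For the serverless type, a rough sketch of what the endpoint configuration looks like when built for boto3 is shown below. The model name, config name, and the concrete memory/concurrency values are hypothetical placeholders, and the actual API call is shown only as a comment since it requires AWS credentials:

```python
# Sketch of a SageMaker serverless endpoint configuration as one would pass it
# to boto3. Model name and numeric values are made-up examples.
serverless_variant = {
    "ModelName": "churn-model-v2",   # hypothetical model from the registry
    "VariantName": "AllTraffic",
    "ServerlessConfig": {
        "MemorySizeInMB": 2048,      # memory allocated to each container
        "MaxConcurrency": 5,         # cap on concurrent invocations
    },
}

# With credentials in place, the config would be created roughly like this:
# import boto3
# client = boto3.client("sagemaker")
# client.create_endpoint_config(
#     EndpointConfigName="churn-serverless-config",
#     ProductionVariants=[serverless_variant],
# )
```

The notable difference from real-time inference is what is absent: there is no instance type and no autoscaling policy to choose, only memory and a concurrency cap.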
To maintain the reliability of the deployed ML services, AWS provides tools for monitoring the deployed models. The offering is based on a range of common AWS services: CloudWatch for tracking metrics and creating dashboards, CloudWatch Logs for storing the logs of model runs, CloudTrail for capturing API calls and subsequent events, and CloudWatch Events for keeping track of status changes in training jobs.
MArk is a proposed academic solution which, in a similar fashion to my work, explores the operationalization of serving ML models and applications. The authors identify the central challenge for MLOps as compliance with "response time Service Level Objectives (SLOs)", since ML applications are used in time-critical fields and selecting the correct cloud provider services for the task is hard. Their research found that a combination of Infrastructure-as-a-Service (IaaS) and Functions-as-a-Service (FaaS, i.e. serverless) yields low latency, high throughput, and low cost if the strategy is switched at the correct time. In this scheme, IaaS is used for general serving, whereas FaaS is used to absorb increased demand while the infrastructure scales up. MArk decides on the scaling based on load predictions, which can be quite costly. Therefore, I looked into BATCH, which is presented next.
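The core of MArk's hybrid strategy can be sketched in a few lines. This is a deliberately simplified toy, not MArk's actual prediction model: baseline demand goes to provisioned IaaS instances, and any predicted excess spills over to FaaS. All numbers and names here are invented for illustration:

```python
# Toy sketch of MArk's IaaS/FaaS split: serve baseline load on provisioned
# instances, spill predicted bursts to serverless. Values are illustrative.
def route_requests(predicted_rps, iaas_capacity_rps):
    """Split predicted demand (requests/s) between IaaS and FaaS."""
    iaas_share = min(predicted_rps, iaas_capacity_rps)
    faas_share = max(0, predicted_rps - iaas_capacity_rps)  # burst overflow
    return {"iaas": iaas_share, "faas": faas_share}


print(route_requests(predicted_rps=150, iaas_capacity_rps=100))
# {'iaas': 100, 'faas': 50}
```

The expensive part in the real system is producing `predicted_rps` accurately and continuously - which is exactly the cost BATCH tries to avoid.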
BATCH is another project resulting from an academic paper and, like MArk and the goal of my thesis, it explores the potential of serverless in ML applications, especially for "bursty" workloads. The authors show that current industry-leading platforms, i.e. AWS Lambda and AWS SageMaker, have difficulties dealing with ML workloads that arrive in short bursts of many requests. Lambda has the general problem of not being able to handle such data-intensive workloads properly (as discussed in my previous blog post). At the time the paper was published, SageMaker only offered standard elastic scaling of resources, which lags behind when facing sudden explosions of requests. Presumably, the serverless endpoints in SageMaker can help with this, which is something I will look into further in my future prototype comparisons.
Nevertheless, the paper's approach to fixing the shortcomings of Lambda has merit regardless of any serverless optimizations in SageMaker. The authors recognized (similarly to MArk) four main challenges currently faced by serverless platforms when serving ML workloads:
- The stateless design of serverless platforms does not allow for batching of requests, as there is no way to keep their state together. The authors tackle this challenge by introducing an extra buffer before the serverless execution itself, which gathers requests before dispatching them to the execution environment.
- Due to the relative simplicity of scheduling serverless functions (one request starts one function execution), there is no notion of latency requirements to optimize towards. In many ML workloads latency is a very important issue, so there needs to be some kind of optimization towards reducing it. Adapting the batching window of the aforementioned buffer offers an additional parameter for optimization.
- Existing platform parameters are not workload-aware and do not scale according to latency requirements. The introduced buffer parameters help solve this challenge, as they can intelligently adapt to workloads.
- Solutions such as MArk's are too work-intensive to run in a real-world environment, as the benefits of the optimization are outweighed by its cost. The authors instead use simpler analytical methods (based on request arrival distributions) rather than simulations or ML approaches to optimize the parameters.
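The buffering idea from the first two challenges can be sketched as follows. This is a simplified, offline simulation of the mechanism - requests accumulate until either a maximum batch size or a time window is exceeded, then they are dispatched together. The parameter values are illustrative, not values from the paper:

```python
# Simplified sketch of BATCH-style buffering: group timestamped requests into
# batches capped by size and by a time window. Parameters are illustrative.
def batch_requests(arrival_times, max_batch_size, window_seconds):
    """Group request arrival timestamps (seconds) into dispatch batches."""
    batches, current = [], []
    for t in arrival_times:
        # Dispatch the current batch if it is full or its window has expired.
        if current and (len(current) >= max_batch_size
                        or t - current[0] > window_seconds):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


# A bursty arrival pattern: five quick requests, a pause, then two more.
arrivals = [0.0, 0.1, 0.2, 0.3, 0.4, 5.0, 5.1]
print([len(b) for b in batch_requests(arrivals, max_batch_size=4, window_seconds=1.0)])
# [4, 1, 2]
```

In the real system the two knobs (`max_batch_size` and the window) are exactly what BATCH tunes adaptively from the arrival distribution; here they are fixed to keep the sketch minimal.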
Through their approach, Ali et al. could reduce the overall costs of operating ML services while still keeping the desired latency. Using this kind of request batching with elastic batch sizes is something I will explore in my prototypical implementation.
BATCH: Machine Learning Inference Serving on Serverless Platforms with Adaptive Batching. Ahsan Ali, Riccardo Pinciroli, Feng Yan and Evgenia Smirni.
MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving. Chengliang Zhang, Minchen Yu, and Wei Wang.