Amazon SageMaker: The Ultimate Guide


AWS is a leader when it comes to services provided in the cloud. Learning all of them is a time-consuming and tedious exercise, so I put together a brief overview of Amazon SageMaker.

I have excluded all marketing content and concentrated on providing only a short description. In the end, I’m not the one selling the service :). My goal is to help machine learning consultants get a well-rounded overview of Amazon SageMaker’s functionality. You will find a reference link for every feature, so depending on the project at hand you can dive deep into a specific one.

AWS Machine Learning Specialty

This post is also part of a series about the AWS Machine Learning Specialty (MLS-C01) exam. I have structured the content into seven separate posts, which can be consumed as stand-alone material. If you are not preparing for the MLS-C01 exam, you may still find the topics interesting:

This post covers one of the domains of the MLS-C01 exam, Domain 4: Machine Learning Implementation and Operations:

  • 4.1 Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.
  • 4.2 Recommend and implement the appropriate machine learning services and features for a given problem.
  • 4.3 Apply basic AWS security practices to machine learning solutions.
  • 4.4 Deploy and operationalize machine learning solutions.

AWS SageMaker Introduction

If you’d rather watch a video, here is the official SageMaker overview from 2017:

AWS SageMaker provides an end-to-end machine learning framework to help you:

  • explore and load the data,
  • run preprocessing jobs,
  • choose the optimal model from a pool of built-in models or build your own,
  • train a model on massive datasets,
  • tune the model’s hyperparameters,
  • deploy the final model for inference,
  • monitor the model and evaluate different versions in production.

All aspects will be covered in this post to give you a complete understanding of AWS SageMaker and to prepare you for the exam.

Data exploration

Amazon SageMaker provides a fully managed Jupyter Notebook environment. Two types of notebooks are available:

  • SageMaker notebooks: the older solution, a single notebook instance running in the cloud, managed by Amazon.
  • SageMaker Studio notebooks: the new integrated development environment. It extends basic notebook functionality with a pre-installed Amazon SageMaker Python SDK, integration with many SageMaker features, AWS SSO and much more.

Tip: SageMaker Jupyter instances run under an AWS service account, hence they are not visible on your EC2 page.

AWS Ground Truth

In real life, what often happens is that the data is available but the labels or classes are missing. Think about classifying social media posts and images as offensive vs. neutral. The input data is there, but a human has to score it against a set of rules to decide on the class before there are enough samples to develop an automatic model.

Ground Truth helps set up a web-based human labeling workflow, where the workforce can be one of:

  • Amazon Mechanical Turk: 500k independent contractors
  • Private workforce from your organization
  • Vendor company from AWS Marketplace

Optionally, automated labeling can be used, where AWS automatically builds a model to label the data and improves that model using the labels created by humans.

Networking

Notebooks by default have direct internet access via a service VPC, which might make the system less secure.

Optionally, notebooks can be launched with a custom VPC attached and without direct internet access. This provides more security; however, notebooks still need internet access to train models, which means the custom VPC must have a NAT gateway.
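To illustrate the custom-VPC option, here is a minimal sketch using the boto3 SageMaker client; the notebook name, subnet, security group and role ARN are placeholders you would replace with your own values.

import boto3

sm = boto3.client('sagemaker')

# Launch a notebook instance into a custom VPC with direct internet access disabled
# (placeholder subnet, security group and IAM role ARN values)
sm.create_notebook_instance(
    NotebookInstanceName='my-private-notebook',
    InstanceType='ml.t2.medium',
    RoleArn='arn:aws:iam::123456789012:role/SageMakerRole',
    SubnetId='subnet-0123456789abcdef0',
    SecurityGroupIds=['sg-0123456789abcdef0'],
    DirectInternetAccess='Disabled'  # traffic then has to flow through your VPC, e.g., a NAT gateway
)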

S3

Amazon SageMaker notebooks by default have access to S3 buckets whose name contains the string “sagemaker”. Raw or processed data is typically stored in S3 buckets. Unlike S3, notebook instances have fixed-size storage attached to them, so preparing a small data sample is the usual way to work around this limit during data exploration.

S3 can furthermore be used to store model inference results, as well as the models themselves in serialized form.
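For example, a quick way to explore a manageable sample inside a notebook is to read only part of a large CSV straight from S3. The bucket and key below are made up, and this assumes the s3fs package is available (it is pre-installed on SageMaker notebook instances):

import pandas as pd

# Read only the first 100,000 rows of a large CSV stored in S3 (hypothetical bucket/key)
sample = pd.read_csv('s3://my-sagemaker-bucket/raw/events.csv', nrows=100_000)

# Basic exploration on the small sample
print(sample.shape)
print(sample.describe())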

Data Preprocessing

Amazon SageMaker provides a framework to assist with feature engineering, data validation, model evaluation and model interpretation tasks. The processing-job functionality uses Docker images as computation nodes. Coordinated by SageMaker API calls, the Docker container reads its input from and writes its output to S3. Two execution modes exist: script outside of the container or script within the container.

Script Outside of Container

A generic Docker image is created with only the execution environment; the script itself is provided via the SageMaker API. This gives flexibility, which is especially useful during development.

Docker file:

FROM python:3.7-slim-buster
RUN pip3 install pandas==0.25.3 scikit-learn==0.21.3
ENV PYTHONUNBUFFERED=TRUE
ENTRYPOINT ["python3"]

SageMaker API calls:

from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

script_processor = ScriptProcessor(command=['python3'],
                                   image_uri='',
                                   role=role,
                                   instance_count=1,
                                   instance_type='')

script_processor.run(code='preprocessing.py',
  inputs=[ProcessingInput(source='s3://path/to/my/input-data.csv', destination='/opt/ml/processing/input')],
  outputs=[ProcessingOutput(source='/opt/ml/processing/output/train'),
           ProcessingOutput(source='/opt/ml/processing/output/validation'),
           ProcessingOutput(source='/opt/ml/processing/output/test')])

See the complete example here.

Script inside Docker

The components are similar to the previous example. The difference is that the script is copied into the container at Docker build time. Stable pipelines benefit from this approach because it makes the execution less error prone.

Docker file:

FROM python:3.7-slim-buster

# Install scikit-learn and pandas
RUN pip3 install pandas==0.25.3 scikit-learn==0.21.3

# Add a Python script and configure Docker to run it
ADD processing_script.py /
ENTRYPOINT ["python3", "/processing_script.py"]

SageMaker API calls:

from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput

processor = Processor(image_uri='',
                     role=role,
                     instance_count=1,
                     instance_type="ml.m5.xlarge")
                     
processor.run(inputs=[ProcessingInput(
                        source='',
                        destination='/opt/ml/processing/input_data')],
              outputs=[ProcessingOutput(
                        source='/opt/ml/processing/processed_data',
                        destination='')])

See the complete example here.

Model Building

SageMaker provides two ways to build models:

  • use built-in models
  • build your own model

The built-in models are the subject of a separate blog post. Here I’ll present how to bring your own model into SageMaker.

Custom model

SageMaker uses Docker containers to run all of its models, both for training and inference.

Docker images provide the portability, isolation and security which is required for a modern machine learning solution.

There are three ways to use custom models:

Regardless of your preferred method, the AWS Machine Learning Specialty exam will test you on Docker image usage and structure.

Directory structure

Docker container’s directory structure for SageMaker training and inference:

Training: 
/opt/ml
├── input
│   ├── config
│   │   ├── hyperparameters.json
│   │   └── resourceConfig.json
│   └── data
│       └── <channel_name>
│           └── <input data>
├── model
├── code
└── output
    └── failure

Inference:
/opt/ml/model
└── <model files>

Image creation steps

A simple Dockerfile is enough to create a new image containing any custom model. This example extends a TensorFlow Docker image. Here is a full list of available images.

The Python code implementing the model is saved as train.py. The code should use the directory structure above. SageMaker provides many environment variables; use them whenever possible.

Tip: a handful of environment variables are important for the exam: SAGEMAKER_PROGRAM, SM_MODEL_DIR, SM_CHANNELS, SM_CHANNEL_{channel_name}, SM_HPS, SM_HP_{hyperparameter_name}, SM_CURRENT_HOST, SM_HOSTS, SM_NUM_GPUS
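As an illustration, a minimal train.py could read these values as shown below; the epochs hyperparameter and the train channel are just assumed examples.

import argparse
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # SageMaker injects these values into the container as environment variables
    parser.add_argument('--model-dir', default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    parser.add_argument('--train', default=os.environ.get('SM_CHANNEL_TRAIN', '/opt/ml/input/data/train'))
    parser.add_argument('--epochs', type=int, default=int(os.environ.get('SM_HP_EPOCHS', 10)))
    args = parser.parse_args()

    # ... load the data from args.train, train the model,
    # then save the artifacts to args.model_dir so SageMaker uploads them to S3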

Docker file:
FROM tensorflow/tensorflow:2.0.0a0
RUN pip install sagemaker-containers
# Copies the training code inside the container
COPY train.py /opt/ml/code/train.py
# Defines train.py as script entry point
ENV SAGEMAKER_PROGRAM train.py


Build Docker image: docker build -t tf-2.0 .

Tip: “RUN pip install sagemaker-containers” is the key line here: it adds SageMaker capabilities to the container and scores points on the exam.

Model Training

SageMaker provides an API to train and deploy Docker image based models. The most common use case is to have a control notebook and call SageMaker via the python API.

from sagemaker.estimator import Estimator
estimator = Estimator(image_name='tf-2.0',
                      role='SageMakerRole',
                      train_instance_count=1,
                      train_instance_type='local') # or e.g., ml.m4.xlarge
# training
estimator.fit('s3://my_bucket/my_training_data/')

Calling the fit method makes SageMaker execute the training on the specified data. Through the instance type parameter, the training can be started either locally or in the cloud.

SageMaker API

SageMaker has two Python APIs: the SageMaker Python SDK and Boto3.

Use the Python SDK from notebooks, and reserve Boto3 for advanced use cases. The complete language-independent API docs can be found here.
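To show the difference, here is a short sketch inspecting the same training job with both APIs; the job name 'my-training-job' is a placeholder for an existing job.

import boto3
from sagemaker.estimator import Estimator

# Boto3: low-level call returning the raw API response as a dictionary
sm_client = boto3.client('sagemaker')
description = sm_client.describe_training_job(TrainingJobName='my-training-job')
print(description['TrainingJobStatus'])

# Python SDK: higher-level object attached to the same job
estimator = Estimator.attach('my-training-job')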

Important actions to recognise in the exam: 

Apache Spark integration

Spark is often used for developing data processing pipelines at scale, and Spark itself has extensive machine learning capabilities. SageMaker provides a Spark integration library (PySpark and Scala) that lets Spark applications train, host and monitor models on SageMaker from within the Spark application.

SageMakerEstimator: triggers a SageMaker training job from Spark and returns a SageMakerModel. It has multiple inherited classes to support the K-Means, PCA and XGBoost algorithms.

SageMakerModel: a Spark Model that calls out to SageMaker upon construction (CreateModel, CreateEndpointConfig, CreateEndpoint).
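A hedged PySpark sketch of this flow, assuming the sagemaker_pyspark library is installed and a Spark DataFrame named training_df with a features column already exists; the role ARN and instance types are placeholders.

from sagemaker_pyspark import IAMRole
from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator

# Train a K-Means model on SageMaker directly from a Spark DataFrame
estimator = KMeansSageMakerEstimator(
    sagemakerRole=IAMRole('arn:aws:iam::123456789012:role/SageMakerRole'),
    trainingInstanceType='ml.m4.xlarge',
    trainingInstanceCount=1,
    endpointInstanceType='ml.m4.xlarge',
    endpointInitialInstanceCount=1)
estimator.setK(10)
estimator.setFeatureDim(784)

# fit() runs a SageMaker training job and returns a SageMakerModel backed by an endpoint
model = estimator.fit(training_df)
predictions = model.transform(training_df)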

Reinforcement Learning

Reinforcement learning (RL) is a machine learning technique that attempts to learn a strategy, called a policy, that optimizes an objective for an agent acting in an environment. RL is based on the Markov Decision Process (MDP), a time-step-based model with the following elements: environment, state, action, reward, observation. A simple code example can be found here. It solves the “Hello World” problem of RL: balancing a pole on a small cart that can move in one dimension.
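A hedged sketch of how such a job can be launched with the Python SDK’s RLEstimator; the entry-point script name and the Coach toolkit version are assumptions for illustration.

from sagemaker.rl import RLEstimator, RLToolkit, RLFramework

# Launch a reinforcement learning training job using the Coach toolkit on TensorFlow
estimator = RLEstimator(entry_point='train_cartpole.py',      # hypothetical training script
                        toolkit=RLToolkit.COACH,
                        toolkit_version='0.11.1',             # assumed supported Coach version
                        framework=RLFramework.TENSORFLOW,
                        role='SageMakerRole',
                        train_instance_type='ml.m4.xlarge',
                        train_instance_count=1)
estimator.fit()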

SageMaker Autopilot

SageMaker Autopilot is a higher-level service that automates a basic machine learning workflow (data exploration, model selection and training). Currently it supports regression and classification (binary and multiclass) with CSV input.
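A hedged sketch of launching such a job from the Python SDK, assuming a CSV in S3 whose target column is named 'label'; the bucket paths and candidate limit are placeholders.

from sagemaker.automl.automl import AutoML

# Autopilot explores the data, selects algorithms and trains candidate models automatically
automl = AutoML(role='SageMakerRole',
                target_attribute_name='label',             # column to predict in the CSV
                output_path='s3://my-bucket/autopilot-output/',
                max_candidates=10)                          # cap the number of candidate models

automl.fit(inputs='s3://my-bucket/autopilot-input/train.csv')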

Hyperparameter tuning

Hyperparameter tuning is the art of finding the best possible parameters for a given model and dataset while keeping the generalization power of the model. It is a nice marriage of AWS compute scaling and the trial-and-error process of tuning.

SageMaker provides the tools to run jobs with many different parameter permutations. Overfitting the train data set is a real danger. In order to prevent this a validation data set has to be provided to the job as well. See a complete python code example here and a web console example here.

The tuning is executed more intelligently than a completely random search across all possible parameter permutations: SageMaker uses the results of previous runs to exclude underperforming permutations and to further optimize the better performing ones.

Tip: use logarithmic scale for parameters with a wide range of values.

Tip: run only one job at a time, because SageMaker performs a Bayesian search (strategy parameter) that uses previous results to reduce the number of required job runs, and with that the cost as well.
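A hedged sketch with the Python SDK, assuming a previously defined built-in XGBoost estimator whose validation:auc metric is being maximized; the ranges, job counts and S3 paths are illustrative.

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

# Search the hyperparameter space with Bayesian optimization
tuner = HyperparameterTuner(estimator=estimator,                  # a previously defined Estimator
                            objective_metric_name='validation:auc',
                            hyperparameter_ranges={
                                'eta': ContinuousParameter(0.001, 0.3),
                                'max_depth': IntegerParameter(3, 10)},
                            objective_type='Maximize',
                            strategy='Bayesian',
                            max_jobs=20,
                            max_parallel_jobs=1)

# Separate train and validation channels guard against overfitting during tuning
tuner.fit({'train': 's3://my-bucket/train/',
           'validation': 's3://my-bucket/validation/'})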

Model Inference

Production deployment is the next step after model training is completed. SageMaker provides two ways to get predictions:

  • real-time inference via a persistent, hosted HTTPS endpoint,
  • batch transform jobs for offline predictions over a whole dataset stored in S3.
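As a quick, hedged sketch of the batch option, a transformer can be created from a trained estimator and run over an entire S3 prefix; the paths and instance type below are placeholders.

# Create a batch transform job from a previously trained estimator
transformer = estimator.transformer(instance_count=1,
                                    instance_type='ml.m4.xlarge',
                                    output_path='s3://my-bucket/batch-predictions/')

transformer.transform(data='s3://my-bucket/batch-input/',
                      content_type='text/csv',
                      split_type='Line')   # send the CSV to the model line by line
transformer.wait()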

A Docker image containing the tuned model is the prerequisite for inference. It is recommended that the Dockerfile run the model directly:

ENTRYPOINT ["python", "k_means_inference.py"]

SageMaker inference containers need to implement a web server that responds to /invocations and /ping on port 8080. The Dockerfile should contain an entry point that starts serving the model.
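A minimal, hedged sketch of such a server using Flask; the model loading and prediction logic is only indicated with placeholders.

import flask

app = flask.Flask(__name__)

@app.route('/ping', methods=['GET'])
def ping():
    # Health check: return 200 when the container is ready to serve
    return flask.Response(status=200)

@app.route('/invocations', methods=['POST'])
def invocations():
    payload = flask.request.data.decode('utf-8')
    # ... parse the payload, run the loaded model, format the prediction (placeholder)
    result = 'prediction-placeholder'
    return flask.Response(response=result, status=200, mimetype='text/csv')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)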

Tip: try to remember these numbers for the exam: 

– containers must accept socket connection requests within 250 ms,

– containers must respond to requests within 60 seconds.

Tip: for GPU enabled inference make sure that your containers are nvidia-docker compatible. Don’t bundle NVIDIA drivers with the image.

Elastic Inference

Elastic inference accelerates EC2 CPU instances for deep learning workloads. Usage:

predictor = estimator.deploy(instance_type='ml.m4.xlarge',
                             initial_instance_count=1,
                             accelerator_type='ml.eia1.xlarge') # Elastic Inference
# prediction
response = predictor.predict(data)
# clean-up
predictor.delete_endpoint()

SageMaker Neo

SageMaker Neo is a powerful compiler which transforms SageMaker models into a binary representation for the selected target hardware (Intel, ARM or NVIDIA processors). It effectively allows running models on edge IoT devices, enabling local predictions without network latency.
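A hedged sketch using the Python SDK’s compile_model method on a trained estimator; the target family, input shape, framework version and output path are illustrative assumptions.

# Compile the trained model for a specific hardware target with SageMaker Neo
compiled_model = estimator.compile_model(target_instance_family='ml_c5',
                                         input_shape={'data': [1, 3, 224, 224]},  # example input tensor
                                         output_path='s3://my-bucket/neo-output/',
                                         framework='tensorflow',
                                         framework_version='1.15')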

Monitoring

Amazon SageMaker integrates with CloudWatch. Invocation and computing resource logs are sent to CloudWatch and are available under the /aws/sagemaker/ namespace, e.g., /aws/sagemaker/ProcessingJobs, /aws/sagemaker/TrainingJobs, /aws/sagemaker/TransformJobs and /aws/sagemaker/Endpoints.

Metrics are available at a 1-minute frequency.
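For example, endpoint invocation metrics can be pulled with the CloudWatch API; the endpoint name below is a placeholder, and endpoint metrics are reported in the AWS/SageMaker namespace.

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

# Sum of invocations per minute for a given endpoint over the last hour
stats = cloudwatch.get_metric_statistics(
    Namespace='AWS/SageMaker',
    MetricName='Invocations',
    Dimensions=[{'Name': 'EndpointName', 'Value': 'my-endpoint'},
                {'Name': 'VariantName', 'Value': 'AllTraffic'}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=['Sum'])
print(stats['Datapoints'])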

SageMaker produces events which can be integrated with CloudWatch events to execute any operation supported by CloudWatch. See a full list of SageMaker events here.

All SageMaker API invocations are sent to CloudTrail for audit logging. Inference API calls are not included, since those are considered to be operational (CloudWatch) API invocations.

Conclusion

This post was a summary of the Amazon SageMaker service. For a final overview, please check out the two notebooks: one contains a data preprocessing example and the other a model creation and deployment example.

In the previous posts I have introduced the AWS Machine Learning Specialty certificate for AWS Certified Machine Learning Consultants, AWS Data Engineering services for The AWS Data Engineering Consultants, the Top 7 Explorative Data Analysis Methods, Machine Learning Algorithms Overview and AWS AI Services Overview. In the next post I will present the [LINK coming soon]. See you there!

Commenting is disabled on this site. You will find my contact details here: About me.

Photo by mahdis mousavi on Unsplash