Amazon SageMaker Built-in Methods Reference Guide


AWS is a leader in cloud services, and learning all of them is a time-consuming, tedious exercise. Amazon's machine learning platform is called SageMaker. After a general introduction in a previous post, I will cover the built-in methods of SageMaker here.

I have excluded all marketing content and concentrated on short descriptions; in the end, I'm not the one selling the service :). My goal is to help machine learning consultants get a well-rounded overview of the available methods. You will find a reference link for every feature, so depending on the project at hand you can dive deep into a specific one.

AWS Machine Learning Specialty

This post is also part of a series about the AWS Machine Learning Specialty (MLS-C01) exam. I have structured the content into seven separate posts, which can be consumed as stand-alone material. If you are not preparing for the MLS-C01 exam, you may still find the topics interesting.

This post covers one of the domains in the MLS-C01 exam, Domain 4: Machine Learning Implementation and Operations:

  • 4.1 Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.
  • 4.2 Recommend and implement the appropriate machine learning services and features for a given problem.
  • 4.3 Apply basic AWS security practices to machine learning solutions.
  • 4.4 Deploy and operationalize machine learning solutions.

Amazon SageMaker

Amazon SageMaker is a fully managed service from Amazon that supports all aspects of machine learning model development. It has an integrated Jupyter notebook environment (R, Python) for data preparation and exploration, and with its pre-built images for training and deploying models it is extremely fast to create a scalable solution.
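
To make the reference tables below concrete, here is a minimal sketch of the typical built-in-algorithm workflow with the SageMaker Python SDK, using K-means as an example; the role ARN, bucket names, and hyperparameter values are placeholders.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical execution role

# Every built-in method ships as a pre-built Docker image in ECR.
image = image_uris.retrieve(framework="kmeans", region=session.boto_region_name)

estimator = Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output",  # hypothetical bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(k=10, feature_dim=784)

# Channels map named inputs to S3 prefixes.
estimator.fit({"train": "s3://my-bucket/train"})

# One call turns the trained model into a real-time endpoint.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.c5.large")
```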

AWS SageMaker built-in methods

The structure of this post is somewhat unusual: instead of writing long text for every method, I will structure the content as one table per method. First, I will give a general introduction to the rows of these tables, so I can save space by eliminating repetition.

The basis of the content was this overview table from Amazon, which I have extended with a few other aspects.

Model Aspects

Hyperparameters: lists the most important parameters; for the complete lists, please see the links. Neural-net-based models always have batch_size, learning_rate, and epochs as parameters, which are shortened to BS, LR, E.

Channels: named input data feeds, typically data from S3 for train, validation, and test.

File types

  • CSV – text/csv; label_size=n. The data must have no header row, and the labels are assumed to be the first column(s). By default n=1, which means only the first column is the label; n=0 means no label, for unsupervised learning.
  • Text file – text/plain – only used for algorithms where a comma cannot serve as the separator, like BlazingText, where a comma has a meaning.
  • Protobuf RecordIO – application/x-recordio-protobuf: preferred whenever possible (see the sketch after this list). https://mxnet.apache.org/api/architecture/note_data_loading#data-format
  • Parquet file – https://parquet.apache.org/
  • Augmented manifest file – application/jsonlines: one JSON object per line. Typically used with images to connect the image reference, the label, and any other input data. This format can be used for both training and batch inference.
  • JSON request – application/json: typically only used for single inference.
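
A minimal sketch of serializing a NumPy dataset to RecordIO-protobuf with a helper shipped in the SageMaker Python SDK; the bucket and key are placeholders and the data is random:

```python
import io

import boto3
import numpy as np
from sagemaker.amazon.common import write_numpy_to_dense_tensor

features = np.random.rand(100, 10).astype("float32")
labels = np.random.randint(0, 2, size=100).astype("float32")

buf = io.BytesIO()
write_numpy_to_dense_tensor(buf, features, labels)  # protobuf RecordIO serialization
buf.seek(0)

boto3.resource("s3").Bucket("my-bucket").Object("train/data.rec").upload_fileobj(buf)
```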

Training input mode: File or Pipe (stream). Pipe mode usually only works with protobuf RecordIO or augmented manifest files.

Tip: use Pipe mode whenever possible to reduce the required disk space and the training start-up time.
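
A minimal sketch of requesting Pipe mode through the SageMaker Python SDK (the S3 path is a placeholder):

```python
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data="s3://my-bucket/train",
    content_type="application/x-recordio-protobuf",
    input_mode="Pipe",  # stream from S3 instead of copying everything to the training volume
)
# estimator.fit({"train": train_input})
```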

Parallelizable: means that the method can be deployed on multiple compute instances for distributed training.

Instance types: typical instance types used for SageMaker models are CPU (e.g. C4), memory-optimized (e.g. M4), or GPU (e.g. P2). GPU-enabled instances are typically used for neural-net-based algorithms. If Parallelizable = No, only a single instance can be used.

Inference is typically done on a “smaller” instance, like the C types. Elastic Inference is a resource that can be attached to CPU-only instances to speed up neural-network-based inference.
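
A minimal sketch of attaching an Elastic Inference accelerator at deploy time, reusing the estimator from the first sketch (the instance and accelerator types are examples):

```python
# Deploy on a CPU instance and attach an Elastic Inference accelerator.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.large",        # CPU-only inference instance
    accelerator_type="ml.eia2.medium",  # Elastic Inference accelerator
)
```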

Amazon SageMaker Built-in Methods Reference Guide

Algorithm name: BlazingText – Word2vec – Unsupervised
Description: Creates similar vectors (embeddings) for similar words; typically run before NLP tasks. Unsupervised.
Most important hyperparameters (reported metric: train:mean_rho):
buckets – n-gram features are hashed into buckets; tunes memory consumption
BS, LR, E
mode – batch_skipgram, skipgram, or cbow
vector_dim – output dimension
window_size – number of words before/after to consider
Channels: Train
Training input: File or Pipe
File type: Text file (one sentence per line with space-separated tokens), e.g.: linux ready for prime time , intel says
Instance class: single CPU instance: cbow, skip-gram, batch skip-gram; single GPU instance (with 1 or more GPUs): cbow, skip-gram; multiple CPU instances: batch skip-gram
Parallelizable: No (in general, see above)

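A minimal sketch of configuring BlazingText in batch_skipgram mode, reusing the imports and the role from the first sketch; the region, bucket, and hyperparameter values are placeholders:

```python
bt_image = image_uris.retrieve(framework="blazingtext", region="eu-west-1")

bt = Estimator(
    image_uri=bt_image,
    role=role,
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    output_path="s3://my-bucket/blazingtext-output",
)
bt.set_hyperparameters(mode="batch_skipgram", vector_dim=100, window_size=5, epochs=5)

# The train channel points to text files: one sentence per line, space-separated tokens.
bt.fit({"train": "s3://my-bucket/text/train"})
```
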
Algorithm name: BlazingText – Text Classification – Supervised
Description: Word-embedding-based text classification method. It extends the fastText algorithm from Facebook.
Most important hyperparameters (reported metric: validation:accuracy):
buckets – n-gram features are hashed into buckets; tunes memory consumption
BS, LR, E
mode – supervised
vector_dim – output dimension
window_size – number of words before/after to consider
Channels: Train
Training input: File or Pipe
File type: Text file (one sentence per line with space-separated tokens), e.g.: linux ready for prime time , intel says
Instance class: single CPU instance or single GPU instance
Parallelizable: No (in general, see above)

Algorithm name: DeepAR Forecasting
Description: Forecasts one-dimensional time series using an RNN.
Supports:
– Categorical features (cat): help to distinguish different time series, e.g. products. Binary coded, represented by an array: [0, 1, 0].
– Dynamic features (dynamic_feat): meta time series of float or int values representing additional information (e.g. promotion, rainfall, sunny hours).
Most important hyperparameters:
context_length – time points to consider for a prediction
prediction_length – time points to predict
time_freq – e.g. 5min, or H for hourly
num_layers – number of hidden layers
num_cells – number of cells in the RNN hidden layer
Channels: Train and (optionally) test
Training input: File
File type: JSON Lines or Parquet file with fields:
– start (timestamp)
– target (array(len x))
– cat (array(len y))
– dynamic_feat (array(len z) of arrays(len x))
Instance class: CPU or GPU
Parallelizable: Yes

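A minimal sketch of writing a DeepAR training file in the JSON Lines format described above (all values are made up for illustration):

```python
import json

# Two toy time series sharing the same schema: start, target, cat, dynamic_feat.
series = [
    {"start": "2024-01-01 00:00:00", "target": [5.0, 7.2, 6.1, 8.3],
     "cat": [0], "dynamic_feat": [[1, 0, 0, 1]]},
    {"start": "2024-01-01 00:00:00", "target": [1.1, 0.9, 1.4, 1.0],
     "cat": [1], "dynamic_feat": [[0, 0, 1, 0]]},
]

with open("train.json", "w") as f:
    for s in series:
        f.write(json.dumps(s) + "\n")  # one JSON object per line
```
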
Algorithm name: Factorization Machines
Description: Regression or classification on sparse datasets, e.g., user-click prediction or user-item recommendation.
Most important hyperparameters:
feature_dim – usually a high number due to the sparse data use case
num_factors – “internal precision / complexity”
predictor_type – binary_classifier or regressor
bias_init_method – uniform, normal, or constant
factors_init_method – uniform, normal, or constant
linear_init_method – uniform, normal, or constant
Channels: Train and (optionally) test
Training input: File or Pipe
File type: RecordIO-protobuf
Instance class: CPU (GPU for dense data)
Parallelizable: Yes

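A minimal sketch of serializing a sparse matrix for Factorization Machines, using a helper from the SageMaker Python SDK (the data is random, for illustration only):

```python
import io

import numpy as np
import scipy.sparse
from sagemaker.amazon.common import write_spmatrix_to_sparse_tensor

# A very sparse feature matrix, as typical for click or recommendation data.
x = scipy.sparse.random(1000, 50000, density=0.001, format="csr", dtype="float32")
y = np.random.randint(0, 2, size=1000).astype("float32")

buf = io.BytesIO()
write_spmatrix_to_sparse_tensor(buf, x, y)  # sparse RecordIO-protobuf
buf.seek(0)
```
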
Algorithm name: Image Classification
Description: CNN (ResNet) based supervised multi-label image classification. Supports full training or transfer learning (pretrained on ImageNet).
Most important hyperparameters:
num_classes – number of output classes
num_training_samples – for learning rate optimization
augmentation_type – crop, crop_color, or crop_color_transform
use_pretrained_model – 1: only the top FC layer is trained; 0: full retrain
BS, LR, E
Channels: train and validation, (optionally) train_lst, validation_lst, and model
Training input: File or Pipe
File type: RecordIO or image files (.jpg or .png) using the augmented manifest image format
Instance class: GPU
Parallelizable: Yes

Algorithm name: IP Insights
Description: Unsupervised method to detect potential threats in real time based on historical data of (entity, IPv4) pairs. It can be used to trigger multi-factor authentication or other measures.
Most important hyperparameters:
num_entity_vectors – e.g. the number of users
vector_dim – if too large, it overfits the IPs of the user
BS, LR, E
Channels: Train and (optionally) validation
Training input: File
File type: CSV
Instance class: CPU or GPU
Parallelizable: Yes

Algorithm name: K-means
Description: Unsupervised clustering algorithm to group similar observations. Inference outputs are: closest_cluster and distance_to_cluster.
Most important hyperparameters:
feature_dim – features of the input
k – number of clusters to create
init_method – random or kmeans++ (~equally distributed initial centroids)
Channels: Train and (optionally) test
Training input: File or Pipe
File type: RecordIO-protobuf or CSV
Instance class: CPU or GPU (only a single GPU device per instance is used)
Parallelizable: No

Algorithm name: k-nearest neighbors (k-NN)
Description: Supervised method that builds an index during training, which is used to look up nearest neighbors during inference. For classification, the most common label of the neighbors is returned. For regression, the average value of the neighbors is returned.
Most important hyperparameters:
feature_dim – features of the input
k – number of nearest neighbors
predictor_type – classifier or regressor
sample_size – observations used to build the index (the model is designed for large-scale data)
dimension_reduction_target – 0 < x < feature_dim
Channels: Train and (optionally) test
Training input: File or Pipe
File type: RecordIO-protobuf or CSV
Instance class: CPU or GPU (single GPU device on one or more instances)
Parallelizable: Yes

Algorithm name: LDA (Latent Dirichlet Allocation)
Description: Unsupervised method to discover a specified number of topics among documents. The goal is the same as for the Neural Topic Model, but the results may differ.
Most important hyperparameters:
num_topics
feature_dim – size of the vocabulary
mini_batch_size – total number of documents
Channels: Train and (optionally) test
Training input: File or Pipe
File type: RecordIO-protobuf or CSV
Instance class: CPU (single instance only)
Parallelizable: No

Algorithm name: Linear Learner
Description: Supervised method for solving regression and classification problems.
Most important hyperparameters:
feature_dim – features of the input
num_classes – classes are integers
predictor_type – binary_classifier, multiclass_classifier, or regressor
binary_classifier_model_selection_criteria – accuracy, f_beta (F1), etc.
loss – defaults: regressor – squared_loss; binary_classifier – logistic; multiclass_classifier – softmax_loss
num_models – how many models to build in parallel
Channels: Train and (optionally) validation, test, or both
Training input: File or Pipe
File type: RecordIO-protobuf or CSV
Instance class: CPU or GPU
Parallelizable: Yes

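A minimal sketch of training several Linear Learner model variants in parallel via num_models, reusing the imports and the role from the first sketch (all values are illustrative):

```python
ll = Estimator(
    image_uri=image_uris.retrieve(framework="linear-learner", region="eu-west-1"),
    role=role,
    instance_count=2,  # Linear Learner is parallelizable
    instance_type="ml.c5.xlarge",
    output_path="s3://my-bucket/ll-output",
)
ll.set_hyperparameters(
    feature_dim=50,
    predictor_type="binary_classifier",
    binary_classifier_model_selection_criteria="f_beta",
    num_models=32,  # train 32 variants in parallel and keep the best one
)
```
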
Algorithm name: Neural Topic Model (NTM)
Description: Unsupervised method to discover a specified number of topics among documents. The goal is the same as for LDA, but the results may differ.
Most important hyperparameters:
num_topics
feature_dim – size of the vocabulary
Channels: Train and (optionally) validation, test, or both
Training input: File or Pipe
File type: RecordIO-protobuf or CSV
Instance class: GPU or CPU
Parallelizable: Yes

Algorithm name: Object2Vec
Description: Supervised learning method to learn low-dimensional embeddings of high-dimensional objects. E.g., label movie summaries, find similar documents, classify documents (offensive/non-offensive), recommend products to customers. More examples here.
Most important hyperparameters: nothing in particular to highlight; many encoder / neural net parameters
Channels: Train and (optionally) validation, test, or both
Training input: File
File type: JSON Lines
Instance class: GPU or CPU (single instance only)
Parallelizable: No

Tip: BlazingText works at the word / sentence level. Object2Vec is more general and takes the whole document into account. Source.

Algorithm name: Object Detection
Description: Supervised method to classify objects (represented by rectangles) within images. Every image can have multiple objects, but every object can only have one predicted class. Uses VGG and ResNet networks. Supports full training or transfer learning (pretrained on ImageNet).
Most important hyperparameters:
num_classes – number of output classes
num_training_samples – for learning rate optimization
base_network – vgg-16 (default) or resnet-50
use_pretrained_model – 1: only the top FC layer is trained; 0: full retrain
BS, LR, E
Channels: Train and validation, (optionally) train_annotation, validation_annotation, and model
Training input: File or Pipe
File type: RecordIO or image files (.jpg or .png) using the augmented manifest image format
Instance class: GPU
Parallelizable: Yes

Algorithm name: Principal Component Analysis (PCA)
Description: Unsupervised dimensionality reduction.
Most important hyperparameters:
feature_dim – input dimension
num_components – output dimension / number of components
algorithm_mode – regular (default) or randomized (for large datasets)
Channels: Train and (optionally) test
Training input: File or Pipe
File type: RecordIO-protobuf or CSV
Instance class: GPU or CPU
Parallelizable: Yes

Algorithm name: Random Cut Forest (RCF)
Description: Unsupervised anomaly detection algorithm. It is very flexible, handling anything from one-dimensional time series to arbitrary-dimensional (up to ~10k) input. It returns an anomaly score, and the caller should set a threshold.
Most important hyperparameters:
feature_dim – input dimension
Channels: Train and (optionally) test
Training input: File or Pipe
File type: RecordIO-protobuf or CSV
Instance class: CPU
Parallelizable: Yes

Algorithm name: Semantic Segmentation
Description: Supervised pixel-level image classification. It returns the classes per segmentation mask. When a mask is applied to the original image, only the pixels of that specific class remain visible. Given an image of a street, it would return: car, tree, people, etc. Uses ResNet50 and ResNet101 networks. Supports full training or transfer learning (pretrained on ImageNet).
Most important hyperparameters:
backbone – resnet-50 or resnet-101
use_pretrained_model – True or False (yes, it is not 0 and 1 like before…)
Channels: train and validation, train_annotation, validation_annotation, and (optionally) label_map and model. The _annotation channels contain the segmentation masks for training and validation.
Training input: File or Pipe
File type: Image files
Instance class: GPU (single instance only)
Parallelizable: No

Algorithm name: Seq2Seq Modeling
Description: Supervised method to convert an input set of tokens into an output set. E.g., translation, speech-to-text, text summary generation. Uses encoder-decoder architectures with RNNs and CNNs.
Most important hyperparameters: nothing in particular to highlight; many encoder / neural net parameters
Channels: Train, validation, and vocab
Training input: File
File type: RecordIO-protobuf
Instance class: GPU (single instance only)
Parallelizable: No

Algorithm name: eXtreme Gradient Boosting (XGBoost)
Description: Supervised method with an open-source implementation of gradient-boosted trees. It can perform regression, classification, and ranking.
Most important hyperparameters (see the Amazon docs and the XGBoost docs):
num_round – training rounds
num_class – number of classes
objective – reg:squarederror (default), reg:logistic, multi:softmax, etc.
Channels: Train and (optionally) validation
Training input: File
File type: CSV or LibSVM
Instance class: CPU
Parallelizable: Yes

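A minimal sketch of training the built-in XGBoost on CSV input, reusing the imports and the role from the first sketch; remember that the label must be the first column and the file must have no header row (bucket and values are placeholders):

```python
from sagemaker.inputs import TrainingInput

xgb = Estimator(
    image_uri=image_uris.retrieve(framework="xgboost", region="eu-west-1", version="1.5-1"),
    role=role,
    instance_count=2,  # XGBoost is parallelizable
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/xgb-output",
)
xgb.set_hyperparameters(objective="binary:logistic", num_round=200)
xgb.fit({
    "train": TrainingInput("s3://my-bucket/train.csv", content_type="text/csv"),
    "validation": TrainingInput("s3://my-bucket/validation.csv", content_type="text/csv"),
})
```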

Tip: built-in Docker images stored in ECR have the :1 version tag for stable versions and :latest for non-stable versions. Source.
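
A minimal sketch of pinning the stable :1 tag when resolving a built-in algorithm image (the region is a placeholder):

```python
from sagemaker import image_uris

stable = image_uris.retrieve(framework="blazingtext", region="eu-west-1", version="1")
print(stable)  # the URI ends in ...:1, the stable tag
```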

Conclusion

This post was a structured reference summary of the Amazon SageMaker built-in methods. Machine learning consultants should understand and memorize its content for the AWS Machine Learning Specialty exam.

In the previous posts I have introduced the AWS Machine Learning Specialty certificate for AWS Certified Machine Learning Consultants, AWS Data Engineering services for AWS Data Engineering Consultants, the Top 7 Explorative Data Analysis Methods, a Machine Learning Algorithms Overview, and an AWS AI Services Overview.

The previous post and this one cover Amazon SageMaker, AWS's flagship product for machine learning, which means this post concludes my series on AWS machine learning. I hope that many of you will find this less verbose, focused content useful. The goal was to give a quick overview of what AWS has to offer for machine learning projects.
