Amazon SageMaker Built-in Methods Reference Guide


AWS is a leader in cloud services, and learning all of them is a time-consuming, tedious exercise. Amazon's machine learning platform is called SageMaker. After a general introduction in a previous post, I will cover the built-in methods of SageMaker here.

I have excluded all marketing content and concentrated on short descriptions; in the end, I'm not the one selling the service :). My goal is to help machine learning consultants get a well-rounded overview of the available methods. You will find a reference link for every feature, so depending on the project at hand you can dive deep into a specific one.

AWS Machine Learning Specialty

This post is also part of a series about the AWS Machine Learning Specialty (MLS-C01) exam. I have structured the content into seven separate posts, which can be consumed as stand-alone material. If you are not preparing for the MLS-C01 exam, you may still find the topics interesting.

This post covers one of the domains in the MLS-C01 exam, Domain 4: Machine Learning Implementation and Operations:

  • 4.1 Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.
  • 4.2 Recommend and implement the appropriate machine learning services and features for a given problem.
  • 4.3 Apply basic AWS security practices to machine learning solutions.
  • 4.4 Deploy and operationalize machine learning solutions.

Amazon SageMaker

Amazon SageMaker is a fully managed service from Amazon that supports all aspects of machine learning model development. It has an integrated Jupyter notebook environment (R, Python) for data preparation and exploration, and with its pre-built images for training and deploying models it is extremely fast to create a scalable solution.
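
To make the reference tables below concrete, here is a minimal sketch of the typical built-in-algorithm workflow with the SageMaker Python SDK, using K-means as an example; the role ARN, bucket names, and hyperparameter values are placeholders.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical execution role

# Every built-in method ships as a pre-built Docker image in ECR.
image = image_uris.retrieve(framework="kmeans", region=session.boto_region_name)

estimator = Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output",  # hypothetical bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(k=10, feature_dim=784)

# Channels map named inputs to S3 prefixes.
estimator.fit({"train": "s3://my-bucket/train"})

# One call turns the trained model into a real-time endpoint.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.c5.large")
```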

AWS SageMaker built-in methods

The structure of this post is somewhat unusual: instead of writing long text for every method, I will structure the content as one table per method. First, I will give a general introduction to the rows of these tables, so I can save space by eliminating repetition.

The basis of the content was this overview table from Amazon, which I have extended with a few other aspects.

Model Aspects

Hyperparameters: lists the most important parameters; for the complete lists, please see the links. Neural-net-based models always have batch_size, learning_rate, and epochs as parameters, which are shortened to BS, LR, E.

Channels: named input data feeds, typically data from S3 for train, validation, and test.

File types

  • CSV – text/csv; label_size=n. The data must have no header row, and the labels are assumed to be the first column(s). By default n=1, which means only the first column is the label; n=0 means no label, for unsupervised learning.
  • Text file – text/plain – only used for algorithms where a comma cannot serve as the separator, like BlazingText, where a comma has a meaning.
  • Protobuf RecordIO – application/x-recordio-protobuf: preferred whenever possible (see the sketch after this list). https://mxnet.apache.org/api/architecture/note_data_loading#data-format
  • Parquet file – https://parquet.apache.org/
  • Augmented manifest file – application/jsonlines: one JSON object per line. Typically used with images to connect the image reference, the label, and any other input data. This format can be used for both training and batch inference.
  • JSON request – application/json: typically only used for single inference.
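
A minimal sketch of serializing a NumPy dataset to RecordIO-protobuf with a helper shipped in the SageMaker Python SDK; the bucket and key are placeholders and the data is random:

```python
import io

import boto3
import numpy as np
from sagemaker.amazon.common import write_numpy_to_dense_tensor

features = np.random.rand(100, 10).astype("float32")
labels = np.random.randint(0, 2, size=100).astype("float32")

buf = io.BytesIO()
write_numpy_to_dense_tensor(buf, features, labels)  # protobuf RecordIO serialization
buf.seek(0)

boto3.resource("s3").Bucket("my-bucket").Object("train/data.rec").upload_fileobj(buf)
```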

Training input mode: File or Pipe (stream). Pipe mode usually only works with protobuf RecordIO or augmented manifest files.

Tip: use Pipe mode whenever possible to reduce the required disk space and the training start-up time.
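
A minimal sketch of requesting Pipe mode through the SageMaker Python SDK (the S3 path is a placeholder):

```python
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data="s3://my-bucket/train",
    content_type="application/x-recordio-protobuf",
    input_mode="Pipe",  # stream from S3 instead of copying everything to the training volume
)
# estimator.fit({"train": train_input})
```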

Parallelizable: means that the method can be deployed on multiple compute instances for distributed training.

Instance types: typical instance types used for SageMaker models are CPU (e.g. C4), memory-optimized (e.g. M4), or GPU (e.g. P2). GPU-enabled instances are typically used for neural-net-based algorithms. If Parallelizable = No, only a single instance can be used.

Inference is typically done on a “smaller” instance, like the C types. Elastic Inference is a resource that can be attached to CPU-only instances to speed up neural-network-based inference.
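
A minimal sketch of attaching an Elastic Inference accelerator at deploy time, reusing the estimator from the first sketch (the instance and accelerator types are examples):

```python
# Deploy on a CPU instance and attach an Elastic Inference accelerator.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.large",        # CPU-only inference instance
    accelerator_type="ml.eia2.medium",  # Elastic Inference accelerator
)
```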

Amazon SageMaker Built-in Methods Reference Guide

Algorithm name: BlazingText – Word2vec – Unsupervised
Description: Creates similar vectors (embeddings) for similar words; typically run before NLP tasks. Unsupervised.
Most important hyperparameters (reported metric: train:mean_rho):
buckets – n-gram features are hashed into buckets; tunes memory consumption
BS, LR, E
mode – batch_skipgram, skipgram, or cbow
vector_dim – output dimension
window_size – number of words before/after to consider
Channels: Train
Training input: File or Pipe
File type: Text file (one sentence per line with space-separated tokens), e.g.: linux ready for prime time , intel says
Instance class: single CPU instance: cbow, skip-gram, batch skip-gram; single GPU instance (with 1 or more GPUs): cbow, skip-gram; multiple CPU instances: batch skip-gram
Parallelizable: No (in general, see above)

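A minimal sketch of configuring BlazingText in batch_skipgram mode, reusing the imports and the role from the first sketch; the region, bucket, and hyperparameter values are placeholders:

```python
bt_image = image_uris.retrieve(framework="blazingtext", region="eu-west-1")

bt = Estimator(
    image_uri=bt_image,
    role=role,
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    output_path="s3://my-bucket/blazingtext-output",
)
bt.set_hyperparameters(mode="batch_skipgram", vector_dim=100, window_size=5, epochs=5)

# The train channel points to text files: one sentence per line, space-separated tokens.
bt.fit({"train": "s3://my-bucket/text/train"})
```
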
Algorithm name: BlazingText – Text Classification – Supervised
Description: Word-embedding-based text classification method. It extends the fastText algorithm from Facebook.
Most important hyperparameters (reported metric: validation:accuracy):
buckets – n-gram features are hashed into buckets; tunes memory consumption
BS, LR, E
mode – supervised
vector_dim – output dimension
window_size – number of words before/after to consider
Channels: Train
Training input: File or Pipe
File type: Text file (one sentence per line with space-separated tokens), e.g.: linux ready for prime time , intel says
Instance class: single CPU instance or single GPU instance
Parallelizable: No (in general, see above)

Algorithm name: DeepAR Forecasting
Description: Forecasts one-dimensional time series using an RNN.
Supports:
– Categorical features (cat): help to distinguish different time series, e.g. products. Binary coded, represented by an array: [0, 1, 0].
– Dynamic features (dynamic_feat): meta time series of float or int values representing additional information (e.g. promotion, rainfall, sunny hours).
Most important hyperparameters:
context_length – time points to consider for a prediction
prediction_length – time points to predict
time_freq – e.g. 5min, or H for hourly
num_layers – number of hidden layers
num_cells – number of cells in the RNN hidden layer
Channels: Train and (optionally) test
Training input: File
File type: JSON Lines or Parquet file with fields:
– start (timestamp)
– target (array(len x))
– cat (array(len y))
– dynamic_feat (array(len z) of arrays(len x))
Instance class: CPU or GPU
Parallelizable: Yes

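A minimal sketch of writing a DeepAR training file in the JSON Lines format described above (all values are made up for illustration):

```python
import json

# Two toy time series sharing the same schema: start, target, cat, dynamic_feat.
series = [
    {"start": "2024-01-01 00:00:00", "target": [5.0, 7.2, 6.1, 8.3],
     "cat": [0], "dynamic_feat": [[1, 0, 0, 1]]},
    {"start": "2024-01-01 00:00:00", "target": [1.1, 0.9, 1.4, 1.0],
     "cat": [1], "dynamic_feat": [[0, 0, 1, 0]]},
]

with open("train.json", "w") as f:
    for s in series:
        f.write(json.dumps(s) + "\n")  # one JSON object per line
```
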
Algorithm name: Factorization Machines
Description: Regression or classification on sparse datasets, e.g., user-click prediction or user-item recommendation.
Most important hyperparameters:
feature_dim – usually a high number due to the sparse data use case
num_factors – “internal precision / complexity”
predictor_type – binary_classifier or regressor
bias_init_method – uniform, normal, or constant
factors_init_method – uniform, normal, or constant
linear_init_method – uniform, normal, or constant
Channels: Train and (optionally) test
Training input: File or Pipe
File type: RecordIO-protobuf
Instance class: CPU (GPU for dense data)
Parallelizable: Yes

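A minimal sketch of serializing a sparse matrix for Factorization Machines, using a helper from the SageMaker Python SDK (the data is random, for illustration only):

```python
import io

import numpy as np
import scipy.sparse
from sagemaker.amazon.common import write_spmatrix_to_sparse_tensor

# A very sparse feature matrix, as typical for click or recommendation data.
x = scipy.sparse.random(1000, 50000, density=0.001, format="csr", dtype="float32")
y = np.random.randint(0, 2, size=1000).astype("float32")

buf = io.BytesIO()
write_spmatrix_to_sparse_tensor(buf, x, y)  # sparse RecordIO-protobuf
buf.seek(0)
```
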
Algorithm name: Image Classification
Description: CNN (ResNet) based supervised multi-label image classification. Supports full training or transfer learning (pretrained on ImageNet).
Most important hyperparameters:
num_classes – number of output classes
num_training_samples – for learning rate optimization
augmentation_type – crop, crop_color, or crop_color_transform
use_pretrained_model – 1: only the top FC layer is trained; 0: full retrain
BS, LR, E
Channels: train and validation, (optionally) train_lst, validation_lst, and model
Training input: File or Pipe
File type: RecordIO or image files (.jpg or .png) using the augmented manifest image format
Instance class: GPU
Parallelizable: Yes

Algorithm name: IP Insights
Description: Unsupervised method to detect potential threats in real time based on historical data of (entity, IPv4) pairs. It can be used to trigger multi-factor authentication or other measures.
Most important hyperparameters:
num_entity_vectors – e.g. the number of users
vector_dim – if too large, it overfits the IPs of the user
BS, LR, E
Channels: Train and (optionally) validation
Training input: File
File type: CSV
Instance class: CPU or GPU
Parallelizable: Yes

Algorithm name: K-means
Description: Unsupervised clustering algorithm to group similar observations. Inference outputs are: closest_cluster and distance_to_cluster.
Most important hyperparameters:
feature_dim – features of the input
k – number of clusters to create
init_method – random or kmeans++ (~equally distributed initial centroids)
Channels: Train and (optionally) test
Training input: File or Pipe
File type: RecordIO-protobuf or CSV
Instance class: CPU or GPU (only a single GPU device per instance is used)
Parallelizable: No

Algorithm name: k-nearest neighbors (k-NN)
Description: Supervised method that builds an index during training, which is used to look up nearest neighbors during inference. For classification, the most common label of the neighbors is returned. For regression, the average value of the neighbors is returned.
Most important hyperparameters:
feature_dim – features of the input
k – number of nearest neighbors
predictor_type – classifier or regressor
sample_size – observations used to build the index (the model is designed for large-scale data)
dimension_reduction_target – 0 < x < feature_dim
Channels: Train and (optionally) test
Training input: File or Pipe
File type: RecordIO-protobuf or CSV
Instance class: CPU or GPU (single GPU device on one or more instances)
Parallelizable: Yes

Algorithm name: LDA (Latent Dirichlet Allocation)
Description: Unsupervised method to discover a specified number of topics among documents. The goal is the same as for the Neural Topic Model, but the results may differ.
Most important hyperparameters:
num_topics
feature_dim – size of the vocabulary
mini_batch_size – total number of documents
Channels: Train and (optionally) test
Training input: File or Pipe
File type: RecordIO-protobuf or CSV
Instance class: CPU (single instance only)
Parallelizable: No

Algorithm name: Linear Learner
Description: Supervised method for solving regression and classification problems.
Most important hyperparameters:
feature_dim – features of the input
num_classes – classes are integers
predictor_type – binary_classifier, multiclass_classifier, or regressor
binary_classifier_model_selection_criteria – accuracy, f_beta (F1), etc.
loss – defaults: regressor – squared_loss; binary_classifier – logistic; multiclass_classifier – softmax_loss
num_models – how many models to build in parallel
Channels: Train and (optionally) validation, test, or both
Training input: File or Pipe
File type: RecordIO-protobuf or CSV
Instance class: CPU or GPU
Parallelizable: Yes

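A minimal sketch of training several Linear Learner model variants in parallel via num_models, reusing the imports and the role from the first sketch (all values are illustrative):

```python
ll = Estimator(
    image_uri=image_uris.retrieve(framework="linear-learner", region="eu-west-1"),
    role=role,
    instance_count=2,  # Linear Learner is parallelizable
    instance_type="ml.c5.xlarge",
    output_path="s3://my-bucket/ll-output",
)
ll.set_hyperparameters(
    feature_dim=50,
    predictor_type="binary_classifier",
    binary_classifier_model_selection_criteria="f_beta",
    num_models=32,  # train 32 variants in parallel and keep the best one
)
```
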
Algorithm name: Neural Topic Model (NTM)
Description: Unsupervised method to discover a specified number of topics among documents. The goal is the same as for LDA, but the results may differ.
Most important hyperparameters:
num_topics
feature_dim – size of the vocabulary
Channels: Train and (optionally) validation, test, or both
Training input: File or Pipe
File type: RecordIO-protobuf or CSV
Instance class: GPU or CPU
Parallelizable: Yes

Algorithm name: Object2Vec
Description: Supervised learning method to learn low-dimensional embeddings of high-dimensional objects. E.g., label movie summaries, find similar documents, classify documents (offensive/non-offensive), recommend products to customers. More examples here.
Most important hyperparameters: nothing in particular to highlight; many encoder / neural net parameters
Channels: Train and (optionally) validation, test, or both
Training input: File
File type: JSON Lines
Instance class: GPU or CPU (single instance only)
Parallelizable: No

Tip: BlazingText works at the word / sentence level. Object2Vec is more general and takes the whole document into account. Source.

Algorithm name: Object Detection
Description: Supervised method to classify objects (represented by rectangles) within images. Every image can have multiple objects, but every object can only have one predicted class. Uses VGG and ResNet networks. Supports full training or transfer learning (pretrained on ImageNet).
Most important hyperparameters:
num_classes – number of output classes
num_training_samples – for learning rate optimization
base_network – vgg-16 (default) or resnet-50
use_pretrained_model – 1: only the top FC layer is trained; 0: full retrain
BS, LR, E
Channels: Train and validation, (optionally) train_annotation, validation_annotation, and model
Training input: File or Pipe
File type: RecordIO or image files (.jpg or .png) using the augmented manifest image format
Instance class: GPU
Parallelizable: Yes

Algorithm name: Principal Component Analysis (PCA)
Description: Unsupervised dimensionality reduction.
Most important hyperparameters:
feature_dim – input dimension
num_components – output dimension / number of components
algorithm_mode – regular (default) or randomized (for large datasets)
Channels: Train and (optionally) test
Training input: File or Pipe
File type: RecordIO-protobuf or CSV
Instance class: GPU or CPU
Parallelizable: Yes

Algorithm name: Random Cut Forest (RCF)
Description: Unsupervised anomaly detection algorithm. It is very flexible, handling anything from one-dimensional time series to arbitrary-dimensional (up to ~10k) input. It returns an anomaly score, and the caller should set a threshold.
Most important hyperparameters:
feature_dim – input dimension
Channels: Train and (optionally) test
Training input: File or Pipe
File type: RecordIO-protobuf or CSV
Instance class: CPU
Parallelizable: Yes

Algorithm name: Semantic Segmentation
Description: Supervised pixel-level image classification. It returns the classes per segmentation mask. When a mask is applied to the original image, only the pixels of that specific class remain visible. Given an image of a street, it would return: car, tree, people, etc. Uses ResNet50 and ResNet101 networks. Supports full training or transfer learning (pretrained on ImageNet).
Most important hyperparameters:
backbone – resnet-50 or resnet-101
use_pretrained_model – True or False (yes, it is not 0 and 1 like before…)
Channels: train and validation, train_annotation, validation_annotation, and (optionally) label_map and model. The _annotation channels contain the segmentation masks for training and validation.
Training input: File or Pipe
File type: Image files
Instance class: GPU (single instance only)
Parallelizable: No

Algorithm name: Seq2Seq Modeling
Description: Supervised method to convert an input set of tokens into an output set. E.g., translation, speech-to-text, text summary generation. Uses encoder-decoder architectures with RNNs and CNNs.
Most important hyperparameters: nothing in particular to highlight; many encoder / neural net parameters
Channels: Train, validation, and vocab
Training input: File
File type: RecordIO-protobuf
Instance class: GPU (single instance only)
Parallelizable: No

Algorithm name: eXtreme Gradient Boosting (XGBoost)
Description: Supervised method with an open-source implementation of gradient-boosted trees. It can perform regression, classification, and ranking.
Most important hyperparameters (see the Amazon docs and the XGBoost docs):
num_round – training rounds
num_class – number of classes
objective – reg:squarederror (default), reg:logistic, multi:softmax, etc.
Channels: Train and (optionally) validation
Training input: File
File type: CSV or LibSVM
Instance class: CPU
Parallelizable: Yes

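A minimal sketch of training the built-in XGBoost on CSV input, reusing the imports and the role from the first sketch; remember that the label must be the first column and the file must have no header row (bucket and values are placeholders):

```python
from sagemaker.inputs import TrainingInput

xgb = Estimator(
    image_uri=image_uris.retrieve(framework="xgboost", region="eu-west-1", version="1.5-1"),
    role=role,
    instance_count=2,  # XGBoost is parallelizable
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/xgb-output",
)
xgb.set_hyperparameters(objective="binary:logistic", num_round=200)
xgb.fit({
    "train": TrainingInput("s3://my-bucket/train.csv", content_type="text/csv"),
    "validation": TrainingInput("s3://my-bucket/validation.csv", content_type="text/csv"),
})
```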

Tip: built-in Docker images stored in ECR have the :1 version tag for stable versions and :latest for non-stable versions. Source.
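
A minimal sketch of pinning the stable :1 tag when resolving a built-in algorithm image (the region is a placeholder):

```python
from sagemaker import image_uris

stable = image_uris.retrieve(framework="blazingtext", region="eu-west-1", version="1")
print(stable)  # the URI ends in ...:1, the stable tag
```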

Conclusion

This post was a structured reference summary of the Amazon SageMaker built-in methods. Machine learning consultants should understand and memorize its content for the AWS Machine Learning Specialty exam.

In the previous posts I have introduced the AWS Machine Learning Specialty certificate for AWS Certified Machine Learning Consultants, AWS Data Engineering services for AWS Data Engineering Consultants, the Top 7 Explorative Data Analysis Methods, a Machine Learning Algorithms Overview, and an AWS AI Services Overview.

The previous post and this one cover Amazon SageMaker, AWS's flagship product for machine learning, which means this post concludes my series on AWS machine learning. I hope that many of you will find this less verbose, focused content useful. The goal was to give a quick overview of what AWS has to offer for machine learning projects.
