![sagemaker builtin algorithms](http://cpros.de.www458.your-server.de/wp-content/uploads/2020/07/sagemaker_builtin-1-1024x576.jpg)
AWS is a leader in cloud services, and learning all of them is a time-consuming and tedious exercise. Amazon provides a machine learning platform called SageMaker. After a general introduction in a previous post, I will cover the built-in methods of SageMaker here.
I have excluded all marketing content and concentrated on providing only a short description. In the end, I’m not the one selling the service :). My goal is to help machine learning consultants get a well-rounded overview of the available methods. You will find a reference link for every feature, so depending on your project you can dive deep into a specific one.
AWS Machine Learning Specialty
This post is also part of a series about the AWS Machine Learning Specialty (MLS-C01) exam. I have structured the content into seven separate posts, each of which can be read as stand-alone material. If you are not preparing for the MLS-C01 exam, you may still find the topics interesting:
- Part 1: The AWS Certified Machine Learning Consultant
- Part 2: The AWS Data Engineering Consultant
- Part 3: Top 7 Explorative Data Analysis Methods
- Part 4: Machine Learning Algorithms Overview
- Part 5: AWS AI Services Overview
- Part 6: Amazon SageMaker the Ultimate Guide
- Part 7: Amazon SageMaker Built-in Methods Reference Guide
This post covers one domain of the MLS-C01 exam, Domain 4 (Machine Learning Implementation and Operations):
- 4.1 Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.
- 4.2 Recommend and implement the appropriate machine learning services and features for a given problem.
- 4.3 Apply basic AWS security practices to machine learning solutions.
- 4.4 Deploy and operationalize machine learning solutions.
Amazon SageMaker
Amazon SageMaker is a fully managed service from Amazon that supports all aspects of machine learning model development. It has an integrated Jupyter notebook (R, Python) environment for data preparation and exploration. Pre-built images for training and deploying models make it extremely fast to create a scalable solution.
AWS SageMaker built-in methods
The structure of this post is somewhat unusual. Instead of writing long text for every method, I will structure the content into one table per method. First, I will introduce the rows of the table, so I can save space by eliminating repetition.
The basis of the content was this overview table from Amazon. I have extended this information with a few other aspects.
Model Aspects
Hyperparameters: lists the most important parameters; for the complete lists please see the links. Neural-net-based models will always have batch_size, learning_rate and epochs as parameters, which will be shortened to BS,LR,E.
Channels: named input data feeds, typically data from S3 for train, validation and test.
File types:
- CSV – text/csv; label_size=n: data has no header row, and labels are assumed to be in the first column(s). By default n=1, which means only the first column is the label. n=0 means no label, for unsupervised learning.
- Text file – text/plain: only used for algorithms where a comma cannot be used as a separator, like BlazingText, where the comma has a meaning.
- Protobuf RecordIO – application/x-recordio-protobuf: preferred whenever possible. https://mxnet.apache.org/api/architecture/note_data_loading#data-format
- Parquet file – https://parquet.apache.org/
- Augmented manifest file – application/jsonlines: one JSON object per line. Typically used with images to connect the image reference, label and any other input data. This format can be used for training and batch inference as well.
- JSON request – application/json: typically only used for single inference.
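As a minimal sketch of two of the formats above, the following uses only the Python standard library. The function names, the sample values, the S3 path and the `class` label attribute are made up for illustration; `source-ref` is the key that augmented manifests use for the S3 object reference.

```python
import csv
import io
import json


def to_training_csv(features, labels=None):
    """Serialize rows as CSV without a header row, label in the first column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for i, row in enumerate(features):
        # labels=None corresponds to label_size=0 (unsupervised learning).
        label_part = [] if labels is None else [labels[i]]
        writer.writerow(label_part + list(row))
    return buffer.getvalue()


def manifest_line(source_ref, attributes):
    """Build one line of an augmented manifest: one JSON object per line."""
    record = {"source-ref": source_ref}
    record.update(attributes)
    return json.dumps(record)


csv_text = to_training_csv([[5.1, 3.5], [6.2, 2.9]], labels=[0, 1])
unlabeled_text = to_training_csv([[5.1, 3.5]])
# Hypothetical bucket, key and label attribute name.
line = manifest_line("s3://example-bucket/images/cat.jpg", {"class": 0})
print(csv_text)
print(line)
```

The content type sent along with the labeled CSV would then be `text/csv; label_size=1`, and `text/csv; label_size=0` for the unlabeled one.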
Training input mode: File or Pipe (stream). Pipe mode usually only works with protobuf recordIO or augmented manifest files.
Tip: use Pipe mode whenever possible to reduce the required disk space and the training start-up time.
Parallelizable: the method can be deployed on multiple compute instances for distributed training.
Instance types: typical instance types used for SageMaker models: CPU (e.g., C4), more memory (e.g., M4) or GPU (e.g., P2). GPU-enabled instances are typically used for neural-net-based algorithms. If parallelizable = no, then only a single instance can be used.
Inference is typically done on a “smaller” instance, like C-types. Elastic Inference is a resource that can be attached to CPU-only instances to boost neural-network-based inference.
Amazon SageMaker Built-in Methods Reference Guide
Amazon SageMaker Build-in Methods Reference Guide
Algorithm name | BlazingText – Word2vec – Unsupervised |
Description | It creates similar vectors (embeddings) for similar words. Unsupervised; typically done as a step before NLP tasks. |
Most important hyperparameters | Reported metric: train:mean_rho; buckets – ngram features are hashed into buckets, tunes memory consumption; BS,LR,E; mode – batch_skipgram, skipgram, or cbow; vector_dim – output dimension; window_size – number of words before/after to consider |
Channels | Train |
Training input | File or Pipe |
File type | Text file (one sentence per line with space-separated tokens): linux ready for prime time , intel says |
Instance class | Single CPU instance: cbow, skip-gram, batch skip-gram; single GPU instance (with 1 or more GPUs): cbow, skip-gram; multiple CPU instances: batch skip-gram |
Parallelizable | No (in general, see above) |
Algorithm name | BlazingText – Text Classification – Supervised |
Description | Text classification method based on word embeddings. It extends the fastText algorithm from Facebook. |
Most important hyperparameters | Reported metric: validation:accuracy; buckets – ngram features are hashed into buckets, tunes memory consumption; BS,LR,E; mode – supervised; vector_dim – output dimension; window_size – number of words before/after to consider |
Channels | Train |
Training input | File or Pipe |
File type | Text file (one sentence per line with space-separated tokens, prefixed by its label in the form __label__<class>): __label__4 linux ready for prime time , intel says |
Instance class | Single CPU instance or single GPU instance |
Parallelizable | No (in general, see above) |
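The two BlazingText input formats can be sketched as follows. The tokenizer is a deliberately naive assumption (lowercase, punctuation split into separate tokens) and the function names are hypothetical. The `__label__` prefix is the convention BlazingText expects in supervised mode; Word2vec mode takes the plain token line.

```python
import re


def tokenize(sentence):
    """Naive tokenizer: lowercase, words and punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())


def word2vec_line(sentence):
    """One training line for Word2vec mode: space-separated tokens."""
    return " ".join(tokenize(sentence))


def supervised_line(label, sentence):
    """One training line for supervised mode: label prefix plus tokens."""
    return f"__label__{label} {word2vec_line(sentence)}"


raw = "Linux ready for prime time, Intel says"
print(word2vec_line(raw))
print(supervised_line("4", raw))
```

A real project would probably use a proper tokenizer (for example from NLTK or spaCy); the point here is only the shape of the lines in the text file.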
Algorithm name | DeepAR Forecasting |
Description | It forecasts one-dimensional time series using an RNN. Supports: – Categorical features (cat): help to distinguish different time series, e.g., products; encoded as an array of integers, e.g., [0, 1, 0]. – Dynamic features (dynamic_feat): additional time series of float or int values that carry extra information (e.g., promotion, rainfall, sunny hours) |
Most important hyperparameters | context_length – time points to consider for a prediction; prediction_length – time points to predict; time_freq – e.g., 5min or H for hourly; num_layers – number of hidden layers; num_cells – number of cells in each RNN hidden layer |
Channels | Train and (optionally) test |
Training input | File |
File type | JSON Lines or Parquet file with fields: – start (timestamp), – target (array(len x)), – cat (array(len y)), – dynamic_feat (array(len z) of arrays(len x)) |
Instance class | CPU or GPU |
Parallelizable | Yes |
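The field layout from the File type row can be sketched as a small helper that emits one JSON Lines record per time series. The function name and the sample values are hypothetical; the check simply enforces what the table states, namely that every dynamic_feat series has the same length as target.

```python
import json


def deepar_record(start, target, cat=None, dynamic_feat=None):
    """Build one JSON Lines record with the fields start, target, cat, dynamic_feat."""
    record = {"start": start, "target": list(target)}
    if cat is not None:
        record["cat"] = list(cat)
    if dynamic_feat is not None:
        for series in dynamic_feat:
            # Each dynamic feature must cover every time point of the target.
            if len(series) != len(target):
                raise ValueError("dynamic_feat series must match target length")
        record["dynamic_feat"] = [list(series) for series in dynamic_feat]
    return json.dumps(record)


# Hypothetical daily sales of one product with a promotion flag per day.
line = deepar_record(
    start="2020-01-01 00:00:00",
    target=[12, 15, 9, 20],
    cat=[0],
    dynamic_feat=[[0, 0, 1, 1]],
)
print(line)
```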
Algorithm name | Factorization Machines |
Description | Regression or classification in sparse datasets, e.g., user-click data prediction or user-item recommendation |
Most important hyperparameters | feature_dim – usually a high number due to the sparse data use case; num_factors – “internal precision / complexity”; predictor_type – binary_classifier or regressor; bias_init_method – uniform, normal, or constant; factors_init_method – uniform, normal, or constant; linear_init_method – uniform, normal, or constant |
Channels | Train and (optionally) test |
Training input | File or Pipe |
File type | RecordIO-protobuf |
Instance class | CPU (GPU for dense data) |
Parallelizable | Yes |
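To make num_factors more tangible, here is a sketch of the standard second-order factorization machine prediction (bias, linear term, and pairwise interactions through factor vectors). It illustrates the general model equation and is not a claim about how SageMaker implements it internally; all names and numbers are made up. The length of each factor vector plays the role of num_factors.

```python
def fm_predict(x, w0, w, v):
    """Second-order factorization machine score for one feature vector.

    x: feature values, w0: bias, w: linear weights,
    v: one factor vector (length = number of factors) per feature.
    """
    linear = sum(wi * xi for wi, xi in zip(w, x))
    num_factors = len(v[0])
    pairwise = 0.0
    for f in range(num_factors):
        # Identity: sum_{i<j} v_if v_jf x_i x_j
        #   = 0.5 * ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)
        terms = [v[i][f] * x[i] for i in range(len(x))]
        pairwise += 0.5 * (sum(terms) ** 2 - sum(t * t for t in terms))
    return w0 + linear + pairwise


score = fm_predict(
    x=[1.0, 1.0, 0.0],
    w0=0.5,
    w=[1.0, 2.0, 3.0],
    v=[[1.0, 0.0], [2.0, 1.0], [3.0, 3.0]],
)
print(score)
```

Only the first two features are active here, so the single interaction term is the dot product of their factor vectors; sparse inputs keep most interaction terms at zero.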
Algorithm name | Image Classification |
Description | CNN (ResNet) based supervised multi-label image classification. Supports full training or transfer learning (pretrained on ImageNet). |
Most important hyperparameters | num_classes – number of output classes; num_training_samples – for learning rate optimization; augmentation_type – crop, crop_color, crop_color_transform; use_pretrained_model – 1: only the top FC layer is trained, 0: full retrain; BS,LR,E |
Channels | train and validation, (optionally) train_lst, validation_lst, and model |
Training input | File or Pipe |
File type | recordIO or image files (.jpg or .png) using augmented manifest image format |
Instance class | GPU |
Parallelizable | Yes |
Algorithm name | IP Insights |
Description | Unsupervised method to detect potential threats in real time based on historical data of (entity, IPv4) pairs. It can be used to trigger multi-factor authentication or other measures. |
Most important hyperparameters | num_entity_vectors – e.g., number of users; vector_dim – if too large, the model overfits the IPs of the user; BS,LR,E |
Channels | Train and (optionally) validation |
Training input | File |
File type | CSV |
Instance class | CPU or GPU |
Parallelizable | Yes |
Algorithm name | K-means |
Description | Unsupervised clustering algorithm to group similar observations. Inference outputs are: closest_cluster and distance_to_cluster. |
Most important hyperparameters | feature_dim – features of the input; k – number of clusters to create; init_method – random or kmeans++ (~equally distributed initial centroids) |
Channels | Train and (optionally) test |
Training input | File or Pipe |
File type | RecordIO-protobuf or CSV |
Instance class | CPU or GPU (single GPU device on one or more instances) |
Parallelizable | No |
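The two inference outputs named in the description can be illustrated with a few lines of plain Python. This is only a sketch of what the values mean, assuming Euclidean distance and already trained centroids; the function name and data are hypothetical.

```python
import math


def assign_cluster(point, centroids):
    """Return the nearest centroid index and the Euclidean distance to it."""
    distances = [math.dist(point, centroid) for centroid in centroids]
    closest = min(range(len(centroids)), key=distances.__getitem__)
    return {"closest_cluster": closest, "distance_to_cluster": distances[closest]}


centroids = [(3.0, 4.0), (1.0, 0.0)]  # k = 2, feature_dim = 2
result = assign_cluster((0.0, 0.0), centroids)
print(result)
```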
Algorithm name | k-nearest-neighbor (k-NN) |
Description | Supervised method that builds an index during training, which is used to look up nearest neighbors during inference. For classification the most common label of the neighbors is returned. For regression the average feature value of the neighbors is returned. |
Most important hyperparameters | feature_dim – features of the input; k – number of nearest neighbors; predictor_type – classifier or regressor; sample_size – observations to use to build the index (model is designed for large scale data); dimension_reduction_target – 0 < x < feature_dim |
Channels | Train and (optionally) test |
Training input | File or Pipe |
File type | RecordIO-protobuf or CSV |
Instance class | CPU or GPU (single GPU device on one or more instances) |
Parallelizable | Yes |
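The inference rule from the description can be sketched with a brute-force search. This skips the index that SageMaker builds during training and assumes Euclidean distance; for regression it averages the neighbors' target values, which is an interpretation of the description above. Function and variable names are hypothetical.

```python
import math
from collections import Counter


def knn_predict(query, train_points, train_labels, k, predictor_type="classifier"):
    """Brute-force k-NN: majority label for classifier, mean target for regressor."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(query, train_points[i]),
    )
    neighbor_labels = [train_labels[i] for i in order[:k]]
    if predictor_type == "classifier":
        return Counter(neighbor_labels).most_common(1)[0][0]
    return sum(neighbor_labels) / len(neighbor_labels)


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
label = knn_predict((0.5, 0.0), points, ["a", "a", "b", "b"], k=3)
value = knn_predict((0.5, 0.0), points, [1.0, 2.0, 10.0, 12.0], k=2,
                    predictor_type="regressor")
print(label, value)
```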
Algorithm name | LDA |
Description | Unsupervised method to discover a specified number of topics among documents. The goal is the same as for the Neural Topic Model, but results may differ. |
Most important hyperparameters | num_topics; feature_dim – size of the vocabulary; mini_batch_size – total number of documents |
Channels | Train and (optionally) test |
Training input | File or Pipe |
File type | RecordIO-protobuf or CSV |
Instance class | CPU (single instance only) |
Parallelizable | No |
Algorithm name | Linear Learner |
Description | Supervised method for solving regression and classification problems. |
Most important hyperparameters | feature_dim – features of the input; num_classes – classes are integers; predictor_type – binary_classifier, multiclass_classifier, or regressor; binary_classifier_model_selection_criteria – accuracy, f_beta (F1), etc.; loss – defaults: regressor – squared_loss, binary_classifier – logistic, multiclass_classifier – softmax_loss; num_models – how many models to build in parallel |
Channels | Train and (optionally) validation, test, or both |
Training input | File or Pipe |
File type | RecordIO-protobuf or CSV |
Instance class | CPU or GPU |
Parallelizable | Yes |
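Since f_beta appears as a model selection criterion, here is the textbook F-beta formula as a short sketch. It shows the general metric, not Linear Learner's internal code; the function name and the counts are made up. With beta = 1 it reduces to the F1 score.

```python
def f_beta(true_pos, false_pos, false_neg, beta=1.0):
    """F-beta score from confusion counts; beta > 1 weights recall higher."""
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


f1 = f_beta(true_pos=8, false_pos=2, false_neg=2)
f2_low_recall = f_beta(true_pos=8, false_pos=2, false_neg=8, beta=2.0)
print(f1, f2_low_recall)
```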
Algorithm name | Neural Topic Model – NTM |
Description | Unsupervised method to discover a specified number of topics among documents. The goal is the same as for LDA, but results may differ. |
Most important hyperparameters | num_topics; feature_dim – size of the vocabulary |
Channels | Train and (optionally) validation, test, or both |
Training input | File or Pipe |
File type | RecordIO-protobuf or CSV |
Instance class | GPU or CPU |
Parallelizable | Yes |
Algorithm name | Object2Vec |
Description | Supervised learning method to learn low-dimensional embeddings from high-dimensional objects. E.g., label movie summaries, find similar documents, classify documents (offensive/non-offensive), recommend products to customers. More examples here. |
Most important hyperparameters | Nothing particular to highlight; many encoder / neural net parameters. |
Channels | Train and (optionally) validation, test, or both |
Training input | File |
File type | JSON Lines |
Instance class | GPU or CPU (single instance only) |
Parallelizable | No |
Tip: BlazingText works on the word / sentence level. Object2Vec is more general and takes the whole document into account. Source.
Algorithm name | Object Detection |
Description | Supervised method to classify objects (represented by rectangles) within images. Every image can have multiple objects, but every object can only have one predicted class. Uses VGG and ResNet networks. Supports full training or transfer learning (pretrained on ImageNet). |
Most important hyperparameters | num_classes – number of output classes; num_training_samples – for learning rate optimization; base_network – vgg-16 (def) or resnet-50; use_pretrained_model – 1: only the top FC layer is trained, 0: full retrain; BS,LR,E |
Channels | Train and validation, (optionally) train_annotation, validation_annotation, and model |
Training input | File or Pipe |
File type | RecordIO or image files (.jpg or .png) using augmented manifest image format |
Instance class | GPU |
Parallelizable | Yes |
Algorithm name | Principal Component Analysis – PCA |
Description | Unsupervised dimensionality reduction |
Most important hyperparameters | feature_dim – input dimension; num_components – output dimension / components; algorithm_mode – regular (def) or randomized (for large datasets) |
Channels | Train and (optionally) test |
Training input | File or Pipe |
File type | RecordIO-protobuf or CSV |
Instance class | GPU or CPU |
Parallelizable | Yes |
Algorithm name | Random Cut Forest – RCF |
Description | Unsupervised anomaly detection algorithm. It’s very flexible: it handles anything from one-dimensional time series to arbitrary-dimensional (10k) input. It returns an anomaly score, and the caller should set a threshold. |
Most important hyperparameters | feature_dim – input dimension |
Channels | Train and (optionally) test |
Training input | File or Pipe |
File type | RecordIO-protobuf or CSV |
Instance class | CPU |
Parallelizable | Yes |
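Because the caller has to pick the threshold, one common heuristic is to flag scores that lie more than three standard deviations above the mean score. The sketch below assumes that heuristic; the function name and the scores are made up, and the right cut-off depends on the use case.

```python
import statistics


def flag_anomalies(scores, num_std=3.0):
    """Return the indices whose score exceeds mean + num_std * std."""
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)
    threshold = mean + num_std * std
    return [i for i, score in enumerate(scores) if score > threshold]


# Hypothetical anomaly scores: 19 ordinary points and one outlier.
scores = [1.0] * 19 + [10.0]
anomalies = flag_anomalies(scores)
print(anomalies)
```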
Algorithm name | Semantic Segmentation |
Description | Supervised pixel-level image classification. It returns a segmentation mask per class; when a mask is applied to the original image, only the pixels of that class are visible. Given an image of a street it would return: car, tree, people, etc. Uses ResNet50 and ResNet101 networks. Supports full training or transfer learning (pretrained on ImageNet). |
Most important hyperparameters | backbone: resnet-50, resnet-101; use_pretrained_model: True, False (yes it’s not 0 and 1 like before…) |
Channels | Train and validation, train_annotation, validation_annotation, and (optionally) label_map and model. _annotation channels contain the segmentation mask for training and validation. |
Training input | File or Pipe |
File type | Image files |
Instance class | GPU (single instance only) |
Parallelizable | No |
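Applying a mask to an image, as described above, can be sketched with nested lists standing in for image arrays. The class ids, pixel values and function name are hypothetical; in practice this would be done on NumPy arrays.

```python
def apply_class_mask(image, mask, class_id, fill=0):
    """Keep only the pixels whose mask value equals class_id."""
    return [
        [pixel if mask_value == class_id else fill
         for pixel, mask_value in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


image = [[10, 20], [30, 40]]  # grayscale pixel values
mask = [[1, 2], [2, 1]]       # predicted class id per pixel, e.g. 1 = car
car_only = apply_class_mask(image, mask, class_id=1)
print(car_only)
```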
Algorithm name | Seq2Seq Modeling |
Description | Supervised method to convert an input sequence of tokens into an output sequence. E.g., translation, speech-to-text, text summary generation. Uses encoder-decoder architectures with RNN and CNN. |
Most important hyperparameters | Nothing particular to highlight; many encoder / neural net parameters. |
Channels | Train, validation, and vocab |
Training input | File |
File type | RecordIO-protobuf |
Instance class | GPU (single instance only) |
Parallelizable | No |
Algorithm name | eXtreme Gradient Boosting – XGBoost |
Description | Supervised method based on an open-source implementation of gradient boosted trees. It’s able to perform regression, classification and ranking. |
Most important hyperparameters | Amazon docs, XGBoost docs. num_round – training rounds; num_class – number of classes; objective – reg:squarederror (def), reg:logistic, multi:softmax, etc. |
Channels | Train and (optionally) validation |
Training input | File |
File type | CSV or LibSVM |
Instance class | CPU |
Parallelizable | Yes |
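For readers who have not met the LibSVM format, a line consists of the label followed by sparse index:value pairs for the non-zero features. The sketch below assumes that layout; the function name is hypothetical, and the index base is left as a parameter because tools differ on whether feature indices start at 0 or 1, so check what your XGBoost version expects.

```python
def to_libsvm_line(label, features, index_base=1):
    """Format one row as 'label index:value ...', skipping zero features."""
    pairs = [
        f"{index}:{value}"
        for index, value in enumerate(features, start=index_base)
        if value != 0
    ]
    return " ".join([str(label)] + pairs)


row = to_libsvm_line(1, [0.0, 2.5, 0.0, 1.0])
print(row)
```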
Tip: built-in Docker images stored in ECR will have :1 version tag for stable and :latest for non-stable versions. Source.
Conclusion
This post was a structured reference summary of Amazon SageMaker built-in methods. Machine learning consultants should understand and memorize the content of this post for the AWS Machine Learning Specialty exam.
In the previous posts I introduced the AWS Machine Learning Specialty certificate for AWS Certified Machine Learning Consultants, AWS Data Engineering services for AWS Data Engineering Consultants, the Top 7 Explorative Data Analysis Methods, the Machine Learning Algorithms Overview and the AWS AI Services Overview.
The previous post and this one cover Amazon SageMaker, AWS’s flagship product for machine learning. This post therefore concludes my series on AWS machine learning. I hope many of you will find this less verbose, focused content useful. The goal was to give a quick overview of what AWS has to offer for machine learning projects.
Photo by chuttersnap on Unsplash