If you are thinking about hiring or becoming a data engineering consultant, understanding the AWS data technologies is the perfect start.
Hiring managers can use this information to screen data engineering consultants.
Consultants can use it to get a complete picture of the AWS services around data storage and processing.
This post is part of a series about the AWS Machine Learning Speciality (MLS-C01) exam. I have structured the content into eight separate posts, each of which can be consumed as stand-alone material. If you are not preparing for the MLS-C01 exam, you may still find the topics interesting:
- Part 1: The AWS Certified Machine Learning Consultant
- Part 2: The Data Engineering Consultant
- Part 3: Top 7 Explorative Data Analysis Methods
- Part 4: [Link coming soon]
AWS Data Engineering Consultant
In this post I will list the basic AWS services related to data storage and transformation. The list might be incomplete or might contain mistakes, if you have any feedback please feel free to get in touch with me. You will find my contact details here: About me.
My goal with this collection is to provide you:
- a simple self-evaluation tool to decide if you are ready for the exam,
- a quick summary of the relevant services to memorize right before the exam.
This guide is not detailed enough to be your only source of knowledge for the exam. You need deep understanding of the services and hands-on experience with many of them.
Data storage
S3 – https://aws.amazon.com/s3/faqs/
- Simple Storage Service: it’s used to store any type of objects.
- Typical place to store input, output and model code.
- Typical file formats: CSV, Parquet, ORC, protobuf.
- Data partitioning is achieved via the directory structure, e.g. /yyyy/mm/dd/.
- SageMaker by default has access to buckets containing “sagemaker” in their name. https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-ex-bucket.html
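To make the partitioning convention concrete, here is a small sketch that builds a date-partitioned S3 object key; the bucket name in the commented upload call is a made-up example, and the upload itself would require boto3 and AWS credentials:

```python
from datetime import date

def partitioned_key(prefix, day, filename):
    """Build an S3 object key following the /yyyy/mm/dd/ partitioning convention."""
    return f"{prefix}/{day:%Y/%m/%d}/{filename}"

key = partitioned_key("input", date(2020, 5, 17), "data.csv")
print(key)  # input/2020/05/17/data.csv

# The upload itself would look something like this (bucket name is hypothetical):
# import boto3
# boto3.client("s3").upload_file("data.csv", "my-sagemaker-data", key)
```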
EBS – https://aws.amazon.com/ebs/faqs/
- Elastic Block Store: basically virtual disks that you can attach to virtual machines, for example the ones running Jupyter notebooks.
- Typically used as temporary data storage during model development, holding a subset of the S3 data.
- It has a fixed, provisioned size, which is limited compared to S3: typically 5–100 GB in this context.
HDFS over EMR – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop.html
- Hadoop Distributed File System (HDFS) running on Elastic MapReduce (EMR).
- Typically data is stored here for petabyte scale processing.
- EMR File System (EMRFS) provides S3 integration: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html
Other data stores less relevant for machine learning
- Redshift – data warehouse solution.
- RDS – relational databases. Aurora is Amazon’s own MySQL- and PostgreSQL-compatible database, also available in a serverless flavor.
- DynamoDB – serverless key-value and document database (NoSQL).
- ElastiCache – in-memory data storage using Memcached or Redis.
Repositories
Glue Data Catalog – https://docs.aws.amazon.com/glue/latest/dg/components-overview.html
- Crawlers can automatically detect data schema from: files in S3, Redshift, RDS, DynamoDB.
- Fits into serverless architectures.
Tip: choosing Glue Data Catalog over Hive for storing data schemas is usually the correct answer.
ECR – https://aws.amazon.com/ecr/faqs/
- Elastic Container Registry: used to store Docker images.
- Typically used to store the container images of pre-built (from Amazon) or custom-made machine learning algorithms.
Data ingestion
Manual – https://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
- Upload data manually to S3 using the web UI, the AWS CLI, or programmatically via the SDKs.
Kinesis Streams – https://aws.amazon.com/kinesis/data-streams/faqs/
- Real-time data processing; data is stored for 1–7 days.
- Pricing is based on shards: 1 shard handles 1 MB/sec or 1,000 records/sec incoming and 2 MB/sec outgoing.
Example: 150 records/sec, each 10 KB, with 5 consumers:
Incoming: 150 * 10 KB/sec = 1.5 MB/sec; 1.5 / 1 MB/sec -> 2 shards (rounding up)
Outgoing: 5 * 1.5 MB/sec = 7.5 MB/sec; 7.5 / 2 MB/sec -> 4 shards
Total = max(2, 4) = 4 shards.
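The calculation above can be captured in a small helper so you can sanity-check other scenarios; this is a sketch of the sizing arithmetic only, not an official AWS calculator:

```python
import math

def shards_needed(records_per_sec, record_kb, consumers):
    """Estimate the Kinesis shard count from throughput requirements:
    1 shard = 1 MB/sec or 1,000 records/sec in, 2 MB/sec out."""
    incoming_mb = records_per_sec * record_kb / 1000   # MB/sec coming in
    outgoing_mb = incoming_mb * consumers              # MB/sec read by all consumers
    in_shards = math.ceil(max(incoming_mb / 1, records_per_sec / 1000))
    out_shards = math.ceil(outgoing_mb / 2)
    return max(in_shards, out_shards)

print(shards_needed(150, 10, 5))  # -> 4, matching the worked example
```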
Kinesis Firehose – https://aws.amazon.com/kinesis/data-firehose/faqs/
- Near real time data processing and data loading.
- Auto scaling and lambda integration.
- Data loading into: S3, Redshift, Elasticsearch, Splunk.
- Compression: GZIP, ZIP, and Snappy.
- File formats: CSV, JSON, Parquet, ORC.
Tip: choose Firehose when cost is a concern and there are no real-time delivery requirements.
Transformation and orchestration
Glue ETL – https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming.html
- Serverless data transformation using Spark (Python, Scala).
- Supports triggers and schedules.
- Typical file formats: Avro, CSV, JSON, ORC, Parquet, XML.
- Check out use cases here: https://aws.amazon.com/glue/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc
Tip: Glue ETL is the answer when simple data loading is the requirement.
Lambda – https://aws.amazon.com/lambda/faqs/
- Serverless function execution.
- Typically used for simpler file / record manipulations.
- Supported languages: Java, Go, PowerShell, Node.js, C#, Python, and Ruby.
- The Runtime API allows using any other language as well.
- Check out use cases here: https://aws.amazon.com/lambda/
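As an illustration of the "simpler file / record manipulations" use case, here is a minimal Python handler reacting to an S3 "object created" event; the bucket and key in the simulated event are made-up examples, and a real function would go on to fetch and transform the object with boto3:

```python
import urllib.parse

def handler(event, context):
    """Extract the bucket and key of a newly created S3 object from the event."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])
    # A real function would now fetch and transform the object, e.g. with boto3.
    return {"bucket": bucket, "key": key}

# Simulated invocation with a stripped-down S3 event payload:
event = {"Records": [{"s3": {"bucket": {"name": "my-bucket"},
                             "object": {"key": "input/2020/05/17/data.csv"}}}]}
print(handler(event, None))
```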
Kinesis Data Analytics – https://aws.amazon.com/kinesis/data-analytics/faqs/
- Serverless streaming data transformation based on SQL or Java (Flink).
- Typically used for simpler manipulations and window operations.
- Check out use cases here: https://aws.amazon.com/kinesis/data-analytics/
AWS Batch – https://aws.amazon.com/batch/faqs/
- EC2-based job execution with Docker support.
- Serverless in the sense that resources are provisioned automatically.
- Check out use cases here: https://aws.amazon.com/batch/
Step Functions – https://aws.amazon.com/step-functions/faqs/
- Generic, serverless, visual workflow design and execution.
- Supports: parallel processes, error handling, timeouts, retries and on-premise integration.
- Check out use cases here: https://aws.amazon.com/step-functions/
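To show what the retry and timeout support looks like in practice, here is a minimal Amazon States Language definition built as a Python dict; the Lambda ARN is a made-up placeholder, and creating the state machine would additionally require boto3, credentials, and an IAM role:

```python
import json

# A minimal state machine: one Lambda task with a timeout and retries.
definition = {
    "StartAt": "TransformData",
    "States": {
        "TransformData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:transform",
            "TimeoutSeconds": 300,
            "Retry": [{"ErrorEquals": ["States.ALL"],
                       "IntervalSeconds": 5,
                       "MaxAttempts": 2}],
            "End": True,
        }
    },
}
print(json.dumps(definition, indent=2))

# Creating it would then look like (role ARN omitted on purpose):
# boto3.client("stepfunctions").create_state_machine(
#     name="my-workflow", definition=json.dumps(definition), roleArn=...)
```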
Data Pipeline – https://aws.amazon.com/datapipeline/faqs/
- Complex data processing workloads that are fault tolerant, repeatable, and highly available.
- Data nodes: S3, RDS, Redshift, and SQL databases.
- Computing resources: EC2 or EMR.
Data Pipeline and Step Functions provide similar functionality; here is a quick guide to decide:
1. Is your workflow pure data transformation? Yes: go to 2; no: Step Functions.
2. Are all your data sources supported by Data Pipeline? Yes: Data Pipeline; no: Step Functions.
https://stackoverflow.com/questions/55061621/aws-data-pipeline-vs-step-functions
EC2 – https://aws.amazon.com/ec2/faqs/
- Elastic Compute Cloud – virtual machines in the cloud for varying use cases.
- Types: General purpose, compute-, memory-, storage optimized, accelerated computing (GPU)
- https://aws.amazon.com/ec2/instance-types/
- Cost options: free tier, on-demand, spot instances, reserved instances and dedicated hosts
- https://aws.amazon.com/ec2/pricing/
- Can be used stand-alone or with ECS or EKS.
Fargate – https://aws.amazon.com/fargate/faqs
- Serverless compute for containers (ECS or EKS)
- Alternative to EC2 if instance level control is not required
- Does not support GPU instances
EMR – https://aws.amazon.com/emr/faqs/
- Elastic MapReduce including Hadoop, Spark, Flink, Presto.
- Petabyte scale data processing in the cloud.
- Master node: coordination; core node: data storage + task execution; task node: task execution only.
- Computing: EC2
Tip: when cost is a concern, task nodes can be Spot Instances.
ECS – https://aws.amazon.com/ecs/faqs/
- Elastic Container Service: fully managed Docker container execution.
- Computing: EC2 or Fargate
EKS – https://aws.amazon.com/eks/faqs/
- Elastic Kubernetes Service: fully managed Kubernetes to run Docker (and other containers supported by Kubernetes)
- Computing: EC2 or Fargate
ECS vs EKS
- ECS is the Amazon-native solution and provides better AWS integration and experience.
- ECS can be simpler to learn and operate without prior knowledge.
- EKS (Kubernetes) is open source and provides easier migration to other cloud providers or to on-premise.
- EKS is more expensive: compute resources (like ECS) plus $0.10 per hour per EKS cluster.
- https://aws.amazon.com/eks/pricing/
AWS Elastic Beanstalk – https://aws.amazon.com/elasticbeanstalk/faqs/
- Automatically handles the deployment details of capacity provisioning, load balancing, auto-scaling, and application health monitoring.
- Supported languages/frameworks: Java, .NET, PHP, Node.js, Python, Ruby, Go, and Docker
- Computing: EC2
Tip: serverless technologies are the correct answers most of the time. That said, you should have a high-level understanding of the above AWS offerings.
Data exploration
Athena – https://aws.amazon.com/athena/faqs/
- Interactive, serverless, SQL based query service for S3 data.
- It can use schema information from Glue.
- Execution time and cost can be lowered by using compressed and/or columnar data formats like Parquet.
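Since Athena is billed by data scanned, queries should filter on the partition columns wherever possible. Here is a sketch that builds such a partition-pruned query; the table and column names are made-up examples, and actually running it would require boto3, credentials, and an S3 result location:

```python
def partition_pruned_query(year, month):
    """Build an Athena SQL query that scans only one month of partitioned data.
    Table and column names are hypothetical."""
    return (
        "SELECT device_id, AVG(value) AS avg_value "
        "FROM sensor_data "
        f"WHERE year = '{year}' AND month = '{month:02d}' "
        "GROUP BY device_id"
    )

sql = partition_pruned_query(2020, 5)
print(sql)

# Running it would use boto3:
# boto3.client("athena").start_query_execution(
#     QueryString=sql,
#     ResultConfiguration={"OutputLocation": "s3://my-results-bucket/"})
```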
QuickSight – https://aws.amazon.com/quicksight/resources/faqs/
- Serverless business intelligence service.
- Customizable web dashboards with a variety of charts.
- Data sources: RDS, Aurora, Redshift, Athena, S3 (e.g. CSV files), or on-premise SQL servers.
- SPICE: in-memory calculation engine to support ad-hoc queries.
Redshift Spectrum – https://aws.amazon.com/redshift/faqs/
- Redshift functionality to access data directly from S3.
Tip: S3, Glue ETL, Glue Data Catalog, Athena and QuickSight work nicely together 😉
CloudWatch – https://aws.amazon.com/cloudwatch/faqs/
- Collects, monitors, and visualizes operational data: logs, metrics, and events.
- Supports alerting and scheduling.
CloudTrail – https://aws.amazon.com/cloudtrail/
- Audit: tracking user activity and API usage.
Built-in statistics and machine learning
Kinesis Analytics – https://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/sql-reference-statistical-variance-deviation-functions.html
- Hotspots: Detects hotspots of frequently occurring data in the data stream.
- Random Cut Forest: Detects anomalies in the data stream.
- Variance and deviation statistics of the data stream.
Glue ETL – https://docs.aws.amazon.com/glue/latest/dg/machine-learning.html
- FindMatches: identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly.
QuickSight – https://docs.aws.amazon.com/quicksight/latest/user/making-data-driven-decisions-with-ml-in-quicksight.html
- Anomalies: Random Cut Forest detects outliers in the data.
- Forecasting: Random Cut Forest algorithm automatically handles complex real-world scenarios such as detecting seasonality and trends, excluding outliers, and imputing missing values.
- Autonarratives: you can build rich dashboards with embedded narratives to tell the story of your data in plain language.
Conclusion
This post introduced the AWS data engineering services. For managers it is a good instrument to screen the crowd of data engineering consultants out there; for data engineering consultants it is a good tool to make sure you have a well-rounded AWS knowledge base.
In the previous post I introduced the AWS Machine Learning Speciality certificate for AWS Certified Machine Learning Consultants. In the next post I will present the most important data analysis methods and tools: Top 7 Explorative Data Analysis Methods. See you there!
Photo by Stephen Dawson on Unsplash