The AWS Data Engineering Consultant

Data Engineering Consultant

If you are thinking about hiring or becoming a data engineering consultant, understanding the AWS data technologies is a perfect start.

Hiring managers can use this information to screen data engineering consultants.

Consultants can use this information to get a complete picture of the AWS services around data storage and processing.

This post is part of a series about the AWS Machine Learning Specialty (MLS-C01) exam. I have structured the content into eight separate posts. Each post can be read as stand-alone material, so even if you are not preparing for the MLS-C01 exam you may still find the topics interesting:

AWS Data Engineering Consultant

In this post I will list the basic AWS services related to data storage and transformation. The list might be incomplete or might contain mistakes; if you have any feedback, please feel free to get in touch with me. You will find my contact details here: About me.

My goal with this collection is to provide you with:

  • a simple self-evaluation tool to decide if you are ready for the exam,
  • a quick summary of the relevant services to memorize right before the exam.

This guide is not detailed enough to be your only source of knowledge for the exam. You need deep understanding of the services and hands-on experience with many of them.

Data storage

S3 https://aws.amazon.com/s3/faqs/

EBS https://aws.amazon.com/ebs/faqs/

  • Elastic Block Store: essentially virtual disks that you can attach to EC2 instances, for example the ones running Jupyter notebooks.
  • Typically used as temporary data storage during model development with a subset of the S3 data.
  • It has a fixed size, which is limited compared to S3, typically 5–100 GB.

HDFS over EMR https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop.html

Other data stores less relevant for machine learning

  • Redshift – data warehouse solution.
  • RDS – Relational databases – Aurora is Amazon’s own database (compatible with MySQL or PostgreSQL) with a serverless option.
  • DynamoDB – serverless key-value and document database (NoSQL).
  • ElastiCache – in-memory data store using Memcached or Redis.

Repositories

Glue Data Catalog https://docs.aws.amazon.com/glue/latest/dg/components-overview.html

  • Crawlers can automatically detect data schema from: files in S3, Redshift, RDS, DynamoDB.
  • Fits into serverless architectures.

Tip: choosing the Glue Data Catalog over Hive for storing data schemas is usually the correct answer.

ECR https://aws.amazon.com/ecr/faqs/

  • Elastic Container Registry: stores Docker images.
  • Typically used to store container images with pre-installed machine learning environments (provided by Amazon or custom made).

Data ingestion

Manual https://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html

  • Upload data manually to S3 using the web UI, the AWS CLI, or programmatically via the SDKs.

Kinesis Streams https://aws.amazon.com/kinesis/data-streams/faqs/

  • Real-time data processing; data is retained for 1–7 days.
  • Pricing is based on shards: 1 shard supports 1 MB/sec or 1,000 records/sec incoming and 2 MB/sec outgoing.

Example: 150 records/sec, each 10 KB, with 5 consumers:

Incoming shards = 150 × 10 KB/sec = 1.5 MB/sec ÷ 1 MB/sec → 2 shards (rounding up)

Outgoing shards = 5 × 150 × 10 KB/sec = 7.5 MB/sec ÷ 2 MB/sec → 4 shards (rounding up)

Total = max(2, 4) = 4 shards.
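The worked example above can be sketched as a small helper function. This is a back-of-the-envelope sizing sketch, assuming the classic per-shard limits (1 MB/sec or 1,000 records/sec incoming, 2 MB/sec outgoing); the function name and parameters are my own, not an AWS API.

```python
import math

def required_shards(records_per_sec, record_kb, consumers):
    """Estimate the shard count for a Kinesis stream.

    Assumes the classic per-shard limits: 1 MB/sec (or 1,000 records/sec)
    incoming and 2 MB/sec outgoing.
    """
    mb_in = records_per_sec * record_kb / 1000.0         # incoming MB/sec
    incoming = max(math.ceil(mb_in / 1.0),               # 1 MB/sec limit
                   math.ceil(records_per_sec / 1000.0))  # 1,000 records/sec limit
    outgoing = math.ceil(consumers * mb_in / 2.0)        # 2 MB/sec limit
    return max(incoming, outgoing)

print(required_shards(150, 10, 5))  # the worked example -> 4
```

Note that the incoming side has to satisfy both the throughput limit and the record-count limit, which is why the function takes the maximum of the two.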

Kinesis Firehose https://aws.amazon.com/kinesis/data-firehose/faqs/

  • Near-real-time data processing and data loading.
  • Auto scaling and Lambda integration.
  • Data loading targets: S3, Redshift, Elasticsearch, Splunk.
  • Compression: GZIP, ZIP, and SNAPPY.
  • File formats: CSV, JSON, Parquet, ORC.

Tip: choose Firehose if cost is a concern and there are no real-time delivery requirements.

Transformation and orchestration

Glue ETL https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming.html

Tip: Glue ETL is the answer when simple data loading is the requirement.

Lambda https://aws.amazon.com/lambda/faqs/

  • Serverless function execution.
  • Typically used for simpler file / record manipulations.
  • Supported languages: Java, Go, PowerShell, Node.js, C#, Python, and Ruby.
  • The Runtime API allows using any other language as well.
  • Check out use cases here: https://aws.amazon.com/lambda/
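A typical "simple record manipulation" Lambda might react to S3 upload events. The sketch below is illustrative only: the handler signature is the standard Lambda one, and the event shape follows the documented S3 event notification structure, but the transformation itself is a made-up example.

```python
import json

def lambda_handler(event, context):
    """Collect the bucket/key pairs from an S3 event notification.

    `event` follows the S3 event notification structure; `context` is
    the standard Lambda context object (unused here).
    """
    objects = [
        {
            "bucket": record["s3"]["bucket"]["name"],
            "key": record["s3"]["object"]["key"],
        }
        for record in event.get("Records", [])
    ]
    return {"statusCode": 200, "body": json.dumps(objects)}
```

Because the handler is a plain function, it can be unit-tested locally by calling it with a sample event before deploying.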

Kinesis Data Analytics https://aws.amazon.com/kinesis/data-analytics/faqs/

AWS Batch https://aws.amazon.com/batch/faqs/

  • EC2 based job execution with Docker support.
  • Resources are provisioned and scaled automatically; no cluster management required.
  • Check out use cases here: https://aws.amazon.com/batch/

Step Functions https://aws.amazon.com/step-functions/faqs/

  • Generic, serverless, visual workflow design and execution.
  • Supports parallel processes, error handling, timeouts, retries, and on-premises integration.
  • Check out use cases here: https://aws.amazon.com/step-functions/

Data Pipelines https://aws.amazon.com/datapipeline/faqs/

  • Complex data processing workloads that are fault tolerant, repeatable, and highly available.
  • Data nodes: S3, RDS, Redshift, SQL databases.
  • Computing resources: EC2 or EMR.

Data Pipelines and Step Functions provide similar functionality, here is a quick guide to decide:

1. Is your workflow pure data transformation? Yes: go to 2; no: Step Functions.

2. Are all your data sources supported by Data Pipelines? Yes: Data Pipelines; no: Step Functions.

https://stackoverflow.com/questions/55061621/aws-data-pipeline-vs-step-functions
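The two-question decision guide above can be encoded as a tiny function; the function and parameter names are my own shorthand, not anything AWS-defined.

```python
def choose_orchestrator(pure_data_transformation, sources_supported):
    """Encode the two-question Data Pipelines vs. Step Functions guide.

    pure_data_transformation: is the workflow pure data transformation?
    sources_supported: are all data sources supported by Data Pipelines?
    """
    if not pure_data_transformation:
        return "Step Functions"
    return "Data Pipelines" if sources_supported else "Step Functions"
```

In other words, Data Pipelines is only the answer when both questions are answered with yes.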

EC2 https://aws.amazon.com/ec2/faqs/

Fargate https://aws.amazon.com/fargate/faqs

  • Serverless compute for containers (ECS or EKS)
  • Alternative to EC2 if instance level control is not required
  • Does not support GPU instances

EMR https://aws.amazon.com/emr/faqs/

  • Elastic MapReduce including Hadoop, Spark, Flink, Presto.
  • Petabyte scale data processing in the cloud.
  • Master node: coordination; core node: data storage + task execution; task node: only task execution.
  • Computing: EC2

Tip: when cost is a concern, task nodes can run on Spot Instances.

ECS https://aws.amazon.com/ecs/faqs/

  • Elastic Container Service: serverless Docker image execution
  • Computing: EC2 or Fargate

EKS https://aws.amazon.com/eks/faqs/

  • Elastic Kubernetes Service: fully managed Kubernetes to run Docker (and other containers supported by Kubernetes)
  • Computing: EC2 or Fargate

ECS vs EKS

  • ECS is Amazon’s native solution and provides better AWS integration and experience.
  • ECS can be simpler to learn and operate without prior Kubernetes knowledge.
  • EKS (Kubernetes) is open source, which makes migration to other cloud providers or to on-premises easier.
  • EKS is more expensive: compute resources (like ECS), plus $0.10 per hour per EKS cluster.
  • https://aws.amazon.com/eks/pricing/

AWS Elastic Beanstalk https://aws.amazon.com/elasticbeanstalk/faqs/

  • Automatically handles the deployment details of capacity provisioning, load balancing, auto-scaling, and application health monitoring.
  • Supported languages/frameworks: Java, .NET, PHP, Node.js, Python, Ruby, Go, and Docker
  • Computing: EC2

Tip: serverless technologies are the correct answers most of the time. With that being said, you should have a high-level understanding of the above AWS offerings.

Data exploration

Athena https://aws.amazon.com/athena/faqs/

  • Interactive, serverless, SQL based query service for S3 data.
  • It can use schema information from Glue.
  • Execution time and cost can be lowered by using compressed and/or columnar formats like Parquet.

QuickSight https://aws.amazon.com/quicksight/resources/faqs/

  • Serverless business intelligence service.
  • Customizable web dashboards with a variety of charts.
  • Data sources: RDS, Aurora, Redshift, Athena, S3, CSV files, or on-premises SQL servers.
  • SPICE: in-memory calculation engine to support ad-hoc queries.

Redshift Spectrum https://aws.amazon.com/redshift/faqs/

  • Redshift functionality to access data directly from S3.

Tip: S3, Glue ETL, Glue Data Catalog, Athena and QuickSight work nicely together 😉

CloudWatch https://aws.amazon.com/cloudwatch/faqs/

  • Collects, monitors, and visualizes operational data: logs, metrics, and events.
  • Supports alerting and scheduling.

CloudTrail https://aws.amazon.com/cloudtrail/

  • Audit: tracking user activity and API usage.

Built-in statistics and machine learning

Kinesis Analytics https://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/sql-reference-statistical-variance-deviation-functions.html

  • Hotspots: Detects hotspots of frequently occurring data in the data stream.
  • Random Cut Forest: Detects anomalies in the data stream.
  • Variance and standard deviation functions over the data stream.

Glue ETL https://docs.aws.amazon.com/glue/latest/dg/machine-learning.html

  • FindMatches: identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly.

QuickSight https://docs.aws.amazon.com/quicksight/latest/user/making-data-driven-decisions-with-ml-in-quicksight.html

  • Anomalies: Random Cut Forest detects outliers in the data.
  • Forecasting: the Random Cut Forest algorithm automatically handles complex real-world scenarios such as detecting seasonality and trends, excluding outliers, and imputing missing values.
  • Autonarratives: you can build rich dashboards with embedded narratives to tell the story of your data in plain language.

Conclusion

This post introduced the AWS data engineering services. Hiring managers can use it to screen the crowd of data engineering consultants out there, and consultants can use it to make sure they have a well-rounded AWS knowledge base.

In the previous post I introduced the AWS Machine Learning Specialty certificate for AWS Certified Machine Learning Consultants. In the next post I will present the most important data analysis methods and tools: Top 7 Explorative Data Analysis Methods. See you there!

Commenting is disabled on this site. You will find my contact details here: About me.

Photo by Stephen Dawson on Unsplash