Getting started with distributed data processing with Step Functions
Distributed Map is a Map state of Step Functions that runs the same processing for each entry in a data set, at a maximum concurrency of 10,000. The processing is a child workflow that can be any combination of AWS SDK calls, AWS Lambda functions, or other AWS Step Functions integrations. With Distributed Map, you can iterate over millions of objects such as logs, images, or .csv files stored in Amazon Simple Storage Service (Amazon S3).
Using Amazon States Language (ASL), you set the Mode to DISTRIBUTED to tell Step Functions to run the Map state in distributed mode. You can choose to run each child workflow as a STANDARD or EXPRESS workflow using the ExecutionType field.
"Process": {
"Type": "Map",
"ItemProcessor": {
"ProcessorConfig": {
"Mode": "DISTRIBUTED",
"ExecutionType": "STANDARD"
},...
}
After defining the Distributed Map state as above, you configure where to read the data from using the ItemReader field. You provide a bucket and key if you are processing the contents of a single object from S3.
"ItemReader": {
"Resource": "arn:aws:states:::s3:getObject",
"Parameters": {
"Bucket.$": "$.bucket",
"Key.$": "$.key"
}
}
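When the object is a .csv file, the ItemReader also needs a ReaderConfig so Step Functions knows how to parse it into items; a minimal sketch, where the bucket and key paths are placeholders from the workflow input:

```json
"ItemReader": {
  "Resource": "arn:aws:states:::s3:getObject",
  "ReaderConfig": {
    "InputType": "CSV",
    "CSVHeaderLocation": "FIRST_ROW"
  },
  "Parameters": {
    "Bucket.$": "$.bucket",
    "Key.$": "$.key"
  }
}
```

With CSVHeaderLocation set to FIRST_ROW, the first line of the file is treated as column names, and each subsequent row becomes one item passed to a child workflow.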
You provide a bucket and, optionally, a prefix if you are processing a list of objects from S3.
"ItemReader": {
"Resource": "arn:aws:states:::s3:listObjectsV2",
"Parameters": {
"Bucket": "$.bucket"
}
}
Additionally, you can process data from an S3 inventory list, or a JSON array passed in from a previous step or the workflow input.
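For an S3 inventory, the ItemReader points at the inventory manifest file and sets the InputType to MANIFEST; a sketch, where the bucket and manifest key paths are placeholders:

```json
"ItemReader": {
  "Resource": "arn:aws:states:::s3:getObject",
  "ReaderConfig": {
    "InputType": "MANIFEST"
  },
  "Parameters": {
    "Bucket.$": "$.bucket",
    "Key.$": "$.manifestKey"
  }
}
```

Step Functions reads the manifest, then iterates over every object listed in the inventory files it references.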
You control how many child workflows are run concurrently using the MaxConcurrency attribute.
"MaxConcurrency": 1000
You can configure how much data is sent to each child workflow using the ItemBatcher attribute. A batch can be limited by a count of items, a size in bytes, or both.
"ItemBatcher": {
"MaxItemsPerBatch": 500
}
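Both limits can be combined, and a fixed BatchInput payload can be attached to every batch alongside its items; a sketch, where the limits and the runType value are illustrative:

```json
"ItemBatcher": {
  "MaxItemsPerBatch": 500,
  "MaxInputBytesPerBatch": 262144,
  "BatchInput": {
    "runType": "test-run"
  }
}
```

A batch closes as soon as either limit is hit, so each child workflow receives at most 500 items and at most 256 KB of input.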
Data quality is a big challenge with data processing. With Distributed Map, you can stop the workflow from continuing to process bad data if certain thresholds are reached. This saves cost and time spent processing bad quality data.
"ToleratedFailurePercentage": 10,
"ToleratedFailureCount": 100
Once all the objects have been processed, you can choose to export the results to an S3 bucket. Distributed Map aggregates the results of all successful child workflows into one JSON file and all failed child workflows into another.
"ResultWriter": {
"Resource": "arn:aws:states:::s3:putObject",
"Parameters": {
"Bucket": "processedOutput",
"Prefix": "test-run"
}
To learn more about these concepts, see the AWS Step Functions Distributed Map documentation.