Distributed data processing with Step Functions

Learning Goals

  • Introduction to Distributed Map.

Getting started with distributed data processing with Step Functions

Distributed Map is a Step Functions state that runs the same processing for each entry in a data set, at a maximum concurrency of 10,000. The processing is done by a child workflow that can be any combination of AWS SDK calls, AWS Lambda functions, or other AWS Step Functions integrations. With Distributed Map, you can iterate over millions of objects such as logs, images, or .csv files stored in Amazon Simple Storage Service (Amazon S3).

Distributed Map is a Map state. Using Amazon States Language (ASL), you define the mode as DISTRIBUTED to tell Step Functions to run the map state in distributed mode. You can choose to run the child workflow as STANDARD or EXPRESS using ExecutionType.

    "Process": {
      "Type": "Map",
      "ItemProcessor": {
        "ProcessorConfig": {
          "Mode": "DISTRIBUTED",
          "ExecutionType": "STANDARD"
        },
        ...
      }
    }

After defining the Distributed Map state as above, you configure where to read the data from using the ItemReader field. If you are processing a single object from S3 (for example, a .csv file), you provide its bucket and key.

      "ItemReader": {
        "Resource": "arn:aws:states:::s3:getObject",
        "Parameters": {
          "Bucket.$": "$.bucket",
          "Key.$": "$.key"
        }
      }

If you are processing a list of objects from S3, you provide the bucket and, optionally, a prefix.

      "ItemReader": {
        "Resource": "arn:aws:states:::s3:listObjectsV2",
        "Parameters": {
          "Bucket.$": "$.bucket"
        }
      }
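
An S3 Inventory manifest can also serve as the item source, by setting the InputType in the reader configuration. A minimal sketch, assuming an inventory manifest at the given bucket and key (both names here are placeholders):

      "ItemReader": {
        "ReaderConfig": {
          "InputType": "MANIFEST"
        },
        "Resource": "arn:aws:states:::s3:getObject",
        "Parameters": {
          "Bucket": "example-inventory-bucket",
          "Key": "inventory/manifest.json"
        }
      }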

Additionally, you can process data from an S3 Inventory list, or from a JSON array produced by a previous step or passed in the workflow input. You control how many child workflows run concurrently using the MaxConcurrency attribute.

      "MaxConcurrency": 1000
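
For a JSON array that is already part of the state input (from a previous step or the workflow input), no ItemReader is needed; you can point the Map state at the array with ItemsPath. A sketch, where the path name is an assumption:

      "ItemsPath": "$.items"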

You can configure how much data is sent to each child workflow using the ItemBatcher attribute: a maximum number of items per batch (MaxItemsPerBatch), a maximum batch size in bytes (MaxInputBytesPerBatch), or both.

      "ItemBatcher": {
        "MaxItemsPerBatch": 500
      }

Data quality is a big challenge in data processing. With Distributed Map, you can stop the workflow once failures cross a configured threshold, rather than continuing to process bad data. This saves the cost and time of processing poor-quality data.

      "ToleratedFailurePercentage": 10,
      "ToleratedFailureCount": 100

Once all the items have been processed, you can choose to export the results to an S3 bucket. Distributed Map aggregates the results of all successful child workflows into one JSON file and all failed child workflows into another JSON file.

    "ResultWriter": {
      "Resource": "arn:aws:states:::s3:putObject",
      "Parameters": {
        "Bucket": "processedOutput",
        "Prefix": "test-run"
      }
    }
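
Putting the pieces together, a complete Distributed Map state might look like the following sketch. The child workflow here invokes a Lambda function; the Lambda ARN, state names, and account details are placeholders, not values from this tutorial:

    "Process": {
      "Type": "Map",
      "ItemProcessor": {
        "ProcessorConfig": {
          "Mode": "DISTRIBUTED",
          "ExecutionType": "STANDARD"
        },
        "StartAt": "ProcessItem",
        "States": {
          "ProcessItem": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {
              "FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:process-item",
              "Payload.$": "$"
            },
            "End": true
          }
        }
      },
      "ItemReader": {
        "Resource": "arn:aws:states:::s3:listObjectsV2",
        "Parameters": {
          "Bucket.$": "$.bucket"
        }
      },
      "ItemBatcher": {
        "MaxItemsPerBatch": 500
      },
      "MaxConcurrency": 1000,
      "ToleratedFailurePercentage": 10,
      "ResultWriter": {
        "Resource": "arn:aws:states:::s3:putObject",
        "Parameters": {
          "Bucket": "processedOutput",
          "Prefix": "test-run"
        }
      },
      "End": true
    }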

To learn more about these concepts, see the AWS Step Functions Developer Guide.