Distributed Data Stream Aggregator

Aggregate data from multiple third-party locations using distributed processing with a 3-tier architecture

This workflow demonstrates large-scale data aggregation from multiple third-party locations using AWS Step Functions' distributed processing capabilities with a 3-tier architecture.
The main workflow orchestrates the entire process: it queries DynamoDB for client locations, then uses a Distributed Map state to process multiple locations in parallel. Each location is processed by a child workflow, run as a Standard execution, that handles data extraction and pagination.
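As a rough illustration of the fan-out described above, the sketch below builds an Amazon States Language Map state in Distributed mode as a Python dictionary. The state name `ProcessLocation`, the `$.locations` items path, and the concurrency limit are assumptions made for illustration; they are not taken from the repository, and the inner Pass state only stands in for the real extraction steps.

```python
import json


def build_distributed_map_state(max_concurrency: int = 10) -> dict:
    """Return a Map state that runs one Standard child execution per location."""
    return {
        "Type": "Map",
        # Assumed path: the list of locations returned by the DynamoDB query.
        "ItemsPath": "$.locations",
        # Assumed limit on how many child executions run at the same time.
        "MaxConcurrency": max_concurrency,
        "ItemProcessor": {
            # DISTRIBUTED mode runs each item as its own child workflow execution.
            "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
            "StartAt": "ProcessLocation",
            "States": {
                # Placeholder for the data extraction and pagination steps.
                "ProcessLocation": {"Type": "Pass", "End": True}
            },
        },
        "End": True,
    }


map_state = build_distributed_map_state()
print(json.dumps(map_state, indent=2))
```

Keeping the definition as a dictionary makes it easy to inspect or adjust the concurrency limit before serializing it into a state machine definition.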
A second child workflow, run as an Express execution, performs the actual API calls to third-party endpoints with query parameters and pagination support. Data is temporarily stored in S3 as JSON files organized by task ID.
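The pagination and per-task file layout can be sketched in plain Python. This is a minimal sketch, assuming a token-based pagination scheme; the `fetch_page` callable, the key format in `partial_file_key`, and the sample identifiers are hypothetical and do not reflect the actual third-party API or the bucket layout used by the workflow.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# A page fetcher takes a page token (None for the first page) and returns
# (records, next_token); next_token is None when no pages remain.
PageFetcher = Callable[[Optional[str]], Tuple[List[dict], Optional[str]]]


def partial_file_key(task_id: str, location_id: str, page_number: int) -> str:
    """Build a hypothetical S3 key that groups partial files by task ID."""
    return f"{task_id}/{location_id}/page-{page_number:05d}.json"


def iterate_pages(fetch_page: PageFetcher, max_pages: int = 1000) -> Iterator[List[dict]]:
    """Yield one list of records per page until the API reports no next token."""
    token: Optional[str] = None
    for _ in range(max_pages):
        records, token = fetch_page(token)
        yield records
        if token is None:
            return
    # Guard against an API that never stops returning a next token.
    raise RuntimeError("pagination did not finish within max_pages")


def fake_fetch(token: Optional[str]) -> Tuple[List[dict], Optional[str]]:
    """Stand-in for a third-party endpoint that serves two pages."""
    pages = {None: ([{"id": 1}, {"id": 2}], "p2"), "p2": ([{"id": 3}], None)}
    return pages[token]


keys = [
    partial_file_key("task-123", "loc-a", number)
    for number, _records in enumerate(iterate_pages(fake_fetch), start=1)
]
print(keys)
```

Passing the fetcher in as a function keeps the loop independent of any particular HTTP client, so the same logic can be exercised with a fake endpoint as shown.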
Finally, an AWS Glue job consolidates all partial files into a single output file uploaded to the destination S3 bucket, with status updates tracked in DynamoDB.
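The consolidation itself runs in AWS Glue. The sketch below only mimics its effect locally on in-memory strings, assuming each partial file holds a JSON array of records; it is not the Glue script from the repository, and the file keys are hypothetical.

```python
import json
from typing import Dict, List


def consolidate_partials(partials: Dict[str, str]) -> str:
    """Merge JSON-array partial files (key -> file body) into one JSON array."""
    merged: List[dict] = []
    # Sort by key so the output order is deterministic.
    for key in sorted(partials):
        records = json.loads(partials[key])
        if not isinstance(records, list):
            raise ValueError(f"{key} does not contain a JSON array")
        merged.extend(records)
    return json.dumps(merged)


partials = {
    "task-123/loc-b/page-00001.json": '[{"id": 3}]',
    "task-123/loc-a/page-00001.json": '[{"id": 1}, {"id": 2}]',
}
consolidated = consolidate_partials(partials)
print(consolidated)
```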




Clone repo

git clone https://github.com/aws-samples/step-functions-workflows-collection
cd step-functions-workflows-collection/distributed-data-stream-aggregator

Deploy

Follow the step-by-step deployment instructions in the README.md to create DynamoDB tables, S3 buckets, Glue job, EventBridge connections, and deploy the state machines.


Testing

1. Populate the locations DynamoDB table with test data containing task_id and location information.
2. Execute the state machine using the AWS CLI with task_id and task_sort_key as input.
3. Monitor execution progress in the Step Functions console and verify data consolidation in S3.
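For step 2, the execution input can be assembled programmatically before handing it to the AWS CLI. This is a minimal sketch: the state machine ARN and the sample `task_id` and `task_sort_key` values are placeholders, and the exact value format expected by the workflow should be taken from the README.

```python
import json
import shlex
from typing import List


def build_start_execution_command(
    state_machine_arn: str, task_id: str, task_sort_key: str
) -> List[str]:
    """Return the argument list for `aws stepfunctions start-execution`."""
    execution_input = json.dumps({"task_id": task_id, "task_sort_key": task_sort_key})
    return [
        "aws", "stepfunctions", "start-execution",
        "--state-machine-arn", state_machine_arn,
        "--input", execution_input,
    ]


# Placeholder values; substitute the ARN and keys from your own deployment.
command = build_start_execution_command(
    "arn:aws:states:us-east-1:123456789012:stateMachine:example",
    "task-123",
    "sort-key-001",
)
# shlex.join quotes the JSON input so the line can be pasted into a shell.
print(shlex.join(command))
```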

Cleanup

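1. Delete the state machines using the AWS CLI: aws stepfunctions delete-state-machine --state-machine-arn <state-machine-arn>
2. Delete the DynamoDB tables: aws dynamodb delete-table --table-name <table-name>
3. Delete the S3 buckets and their contents: aws s3 rb s3://<bucket-name> --force
4. Delete the Glue job and the EventBridge connection: aws glue delete-job --job-name <job-name> and aws events delete-connection --name <connection-name>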

Created by:

Aparna Saha


With 14 years of experience in software development, I've specialized in backend technologies such as PHP, Node.js, and Java, as well as frontend frameworks like React and Angular. I thrive on building distributed systems, crafting microservices, and leveraging platforms like IBM Cloud and AWS Cloud. Passionate about system design, I've led successful projects by guiding teams with dedication and adaptability. I take pride in mentoring interns and junior developers, helping them grow in their roles. My commitment is to deliver high-quality solutions and consistently contribute to the success of the team.
