Parallelized Document Vectorization Pipeline

A serverless pipeline that processes documents in parallel and generates vector embeddings for similarity search, built with AWS Step Functions, AWS Lambda, and Amazon Bedrock.

This sample project demonstrates how to build a production-ready document vectorization pipeline that uses AWS Step Functions to orchestrate parallel document processing. The pipeline handles several document formats and splits each document into chunks that are processed in parallel.

The state machine routes text, PDF, and Word documents to specialized Lambda functions for content extraction, then processes the extracted content in parallel to generate vector embeddings with Amazon Bedrock's Titan models. All vectors are stored in PostgreSQL with the pgvector extension for efficient similarity search.
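The chunk, embed, and search flow described above can be sketched locally in a few lines. This is a minimal illustration, not the project's implementation: `chunk_text`, `fake_embed`, `embed_chunks`, and `nearest` are hypothetical names, the chunk sizes are arbitrary, and `fake_embed` is a hash-based stand-in for a call to an embedding model. A thread pool plays the role of the workflow's parallel fan-out, and cosine distance is assumed as the similarity metric (it is one of the metrics pgvector supports; the source does not say which one this pipeline uses).

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib
import math


def chunk_text(text, chunk_size=200, overlap=20):
    """Split text into overlapping character windows (sizes are illustrative)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early enough that the last window is not just leftover overlap.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def fake_embed(chunk, dims=8):
    """Stand-in for an embedding model: hash the chunk into a unit vector."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    raw = [b - 127.5 for b in digest[:dims]]  # never zero, so the norm is > 0
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


def embed_chunks(chunks, embed=fake_embed, max_workers=4):
    """Embed chunks concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(embed, chunks))


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query_vec, stored, k=3):
    """Rank (chunk, vector) pairs by cosine distance to the query vector."""
    ranked = sorted(stored, key=lambda item: cosine_distance(query_vec, item[1]))
    return ranked[:k]


if __name__ == "__main__":
    document = " ".join(f"word{i}" for i in range(100))
    chunks = chunk_text(document)
    vectors = embed_chunks(chunks)
    store = list(zip(chunks, vectors))
    best_chunk, _ = nearest(vectors[0], store, k=1)[0]
    print(len(chunks), "chunks; closest to chunk 0:", best_chunk[:30])
```

In a deployed version, `embed` would be replaced by a function that calls the embedding model, and `nearest` by a query against the vector table; both substitutions are left out here so the sketch runs without AWS credentials.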

Key features include automatic document type detection, parallel chunk processing, serverless scaling, and comprehensive error handling - making it suitable for high-throughput production workloads.
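As an illustration of automatic document type detection with explicit error handling, the sketch below maps a file extension to an extractor name and fails loudly on anything it does not recognize. The extension table, the `detect_document_type` function, and the `UnsupportedDocumentError` class are assumptions made for this example; the actual workflow may detect types differently (for instance, by content type rather than extension).

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to the extractor that handles it.
EXTRACTORS = {
    ".txt": "text",
    ".pdf": "pdf",
    ".doc": "word",
    ".docx": "word",
}


class UnsupportedDocumentError(ValueError):
    """Raised when a document has no matching extractor."""


def detect_document_type(object_key):
    """Return the extractor name for an object key, based on its extension."""
    suffix = PurePosixPath(object_key).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise UnsupportedDocumentError(
            f"no extractor for {object_key!r} (extension {suffix!r})"
        ) from None


if __name__ == "__main__":
    for key in ["inbox/notes.txt", "inbox/Report.PDF", "inbox/photo.png"]:
        try:
            print(key, "->", detect_document_type(key))
        except UnsupportedDocumentError as err:
            print("skipped:", err)
```

Raising a dedicated exception type keeps unsupported files distinguishable from transient failures, so a caller can skip or reroute them instead of retrying.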

Type: Standard Workflow
Framework: AWS SAM

Clone repo

git clone https://github.com/solaws/step-functions-workflows-collection
cd step-functions-workflows-collection/parallelized-embedding-pipeline

Deploy

Deploy the complete pipeline with database initialization: <code>./deploy-with-db-init.sh --region us-east-1</code>

Or deploy the infrastructure only: <code>sam build && sam deploy --guided</code>

Then initialize the database: <code>./deploy-db-init.sh --region us-east-1</code>

Testing

See the GitHub repo for detailed testing instructions.

Cleanup

Delete the stack: aws cloudformation delete-stack --stack-name vectorization-pipeline --region [YOUR-REGION]

Or use SAM: sam delete

Additional resources

Amazon Bedrock

AWS Step Functions

AWS Lambda

Amazon RDS for PostgreSQL

pgvector Extension

Amazon Titan Embedding Models

AWS Serverless Application Model (SAM)

Created by:

Solomon Ojo

Solomon is a Solutions Architect supporting Federal System Integrators at AWS. He specializes in Generative AI solutions and serverless architectures, and is an active member of the AWS Machine Learning and Artificial Intelligence community. Outside of work, Solomon is passionate about serving his community and helping others leverage cloud technologies to solve complex problems.

Dave Horne

Dave is a Sr. Solutions Architect supporting Federal System Integrators at AWS. He is based in Washington, DC and has 15 years of experience building, modernizing and integrating systems for Public Sector customers. Outside of work, Dave enjoys playing with his kids and hiking.
