Document Summarization Pipeline

A serverless AWS solution that accepts documents in several formats (plain text, Markdown, PDF), generates AI-powered summaries with Claude Sonnet 4.5 on Amazon Bedrock, creates semantic embeddings with Amazon Nova Multimodal Embeddings, and stores the results in Amazon S3 Vectors for semantic search.

Architecture

The pipeline uses the AgenticDocumentProcessing construct from the CDK AppMod Catalog Blueprints library to orchestrate an AI agent that reads each uploaded document and produces its summary.

flowchart LR
    subgraph Input
        U[User] -->|Upload| S3[S3 Bucket]
    end

    subgraph Processing
        S3 -->|Trigger| SQS[SQS Queue]
        SQS -->|Invoke| Agent[AI Agent<br/>Claude Sonnet 4.5]
        Agent -->|PDF files| PDF[PDF Extractor<br/>Tool]
        PDF -->|Extracted text| Agent
        Agent -->|Summary JSON| Post[Post Processor<br/>Lambda]
    end

    subgraph Storage
        Post -->|Generate embeddings| Nova[Nova Embeddings<br/>Model]
        Nova -->|3072-dim vector| Post
        Post -->|Store| S3V[S3 Vectors<br/>Index]
    end

    S3 -.->|Processed| S3P[processed/]
    S3 -.->|Failed| S3F[failed/]

Processing Flow

1. Documents are uploaded to an S3 bucket (supported formats: .txt, .md, .pdf).
2. The QueuedS3Adapter triggers processing through an SQS queue.
3. The AI agent (Claude Sonnet 4.5) analyzes the document:
   - For text/markdown files: reads content directly
   - For PDF files: uses the custom extract_pdf_text tool to extract text
4. The agent generates a concise summary in JSON format.
5. The Post Processor Lambda:
   - Generates embeddings using Amazon Nova Multimodal Embeddings (3072 dimensions)
   - Stores the summary and embeddings in S3 Vectors with the original filename as metadata
6. Processed documents are moved to the processed/ prefix; failures go to failed/.
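
The routing in this flow (direct read for text and Markdown, tool extraction for PDF, then a move to processed/ or failed/) can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and keeping just the file's base name under the new prefix is an assumption, not something the pipeline is confirmed to do.

```python
from pathlib import PurePosixPath

# Extensions listed as supported above; the "text"/"pdf" labels are illustrative.
SUPPORTED = {".txt": "text", ".md": "text", ".pdf": "pdf"}


def classify_document(key: str) -> str:
    """Return "text" or "pdf" for a supported object key, else raise ValueError."""
    suffix = PurePosixPath(key).suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(f"unsupported document type: {key!r}")
    return SUPPORTED[suffix]


def destination_key(key: str, succeeded: bool) -> str:
    """Build the key a document would be moved to once processing ends.

    Assumption: only the base name is kept under the new prefix.
    """
    prefix = "processed/" if succeeded else "failed/"
    return prefix + PurePosixPath(key).name
```

For example, `classify_document("reports/Q3.PDF")` yields `"pdf"`, so the agent would call the extraction tool before summarizing.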

Prerequisites

Node.js 18.x or later
AWS CLI configured with appropriate credentials
AWS CDK CLI (npm install -g aws-cdk)

Project Structure

.
├── bin/                          # CDK app entry point
├── lib/                          # CDK stack definitions
│   └── document-summarization-stack.ts  # Main infrastructure stack
├── resources/                    # Lambda function code and assets
│   ├── post_processor.py         # Post-processing Lambda (embeddings + S3 Vectors storage)
│   ├── requirements.txt          # Python dependencies (pypdf, boto3)
│   ├── system_prompt.txt         # Agent system prompt for summarization
│   └── tools/                    # Custom agent tools
│       └── pdf_extractor.py      # PDF text extraction tool using pypdf
├── specs/                        # Feature specifications
├── test/                         # Test files
├── cdk.json                      # CDK configuration
├── package.json                  # Node.js dependencies
└── tsconfig.json                 # TypeScript configuration

Installation

Install dependencies:

npm install

Build

Compile TypeScript to JavaScript:

npm run build

Deployment

Deploy the stack to your AWS account:

cdk deploy

Usage

Upload documents to the created S3 bucket. Supported formats:

.txt - Plain text files
.md - Markdown files
.pdf - PDF documents

The pipeline will automatically process uploaded documents and store summaries with embeddings in S3 Vectors for semantic search.
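
As a rough sketch of the upload step, the helper below checks the extension before calling the standard boto3 `upload_file` method. The helper name, the `prefix` parameter, and the bucket name in the comment are placeholders; whether the adapter expects uploads under a particular prefix is not stated here, so treat that as something to verify against your deployed stack.

```python
from pathlib import Path

ALLOWED_SUFFIXES = {".txt", ".md", ".pdf"}


def upload_document(s3_client, bucket: str, path: str, prefix: str = "") -> str:
    """Upload one supported document and return the object key that was used."""
    file_path = Path(path)
    if file_path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported format: {file_path.suffix or '(none)'}")
    key = f"{prefix}{file_path.name}"
    # upload_file(Filename, Bucket, Key) is the boto3 S3 client call.
    s3_client.upload_file(str(file_path), bucket, key)
    return key


# Typical use (requires boto3 and AWS credentials; the bucket name is a placeholder):
#   import boto3
#   upload_document(boto3.client("s3"), "my-document-bucket", "notes.md")
```

Passing the client in as an argument keeps the helper easy to exercise with a stub when no AWS account is available.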

Useful Commands

npm run build - Compile TypeScript to JavaScript
npm run watch - Watch for changes and compile
npm run test - Run unit tests
cdk deploy - Deploy this stack to your default AWS account/region
cdk diff - Compare deployed stack with current state
cdk synth - Emit the synthesized CloudFormation template

Technology Stack

Infrastructure as Code: AWS CDK (TypeScript)
AI Models:
- Claude Sonnet 4.5 (through a US cross-region inference profile) for summarization
- Amazon Nova Multimodal Embeddings (amazon.nova-2-multimodal-embeddings-v1:0) for vector generation
Storage:
- Amazon S3 for document storage (KMS encrypted)
- Amazon S3 Vectors for semantic search (cosine distance, float32)
Compute: AWS Lambda (Python 3.13) for custom processing
Orchestration: CDK AppMod Catalog Blueprints - AgenticDocumentProcessing construct
Observability: CloudWatch Logs with structured JSON logging, custom metrics
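
To make "structured JSON logging" concrete, here is one way it could look using only the Python standard library. The formatter class, the logger name, and the `context` field are illustrative choices and are not taken from post_processor.py.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Fields passed as extra={"context": {...}} are merged into the output.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def get_logger(name: str = "post_processor") -> logging.Logger:
    """Return a logger that writes JSON lines to the default stream."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on warm Lambda invocations
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `get_logger().info("vector stored", extra={"context": {"originalFilename": "notes.md"}})` then produces one JSON line that CloudWatch Logs can filter by field.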

Key Components

Post Processor Lambda

Handles embedding generation and vector storage:

Invokes Amazon Nova Multimodal Embeddings via Bedrock
Stores vectors in S3 Vectors index with metadata
Implements retry logic with exponential backoff
Emits structured JSON logs for observability
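
The retry behavior mentioned above can be illustrated with a small generic helper. This is a sketch under assumptions: the attempt count, the delays, and the use of jitter are example values, and a real handler would likely retry only on throttling or transient service errors rather than on every exception.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The upper bound doubles each attempt (0.5 s, 1 s, 2 s, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter: sleep a random time up to that bound so concurrent
            # invocations do not all retry at the same moment.
            sleep(random.uniform(0, delay))
    raise RuntimeError("max_attempts must be at least 1")
```

Wrapping the embedding call and the vector write separately, for example `retry_with_backoff(lambda: embed(summary))` with a hypothetical `embed` function, keeps a failure in one step from repeating the other.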

PDF Extractor Tool

Custom agent tool for PDF processing:

Uses pypdf library for text extraction
Handles multi-page PDFs
Returns structured results with page count and extracted text
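
A minimal sketch of such a tool, assuming the pypdf `PdfReader` API (`reader.pages` and `page.extract_text()`), might look like the following. The function names and the `page_count`/`text` result keys are guesses at a reasonable shape, not the actual contents of pdf_extractor.py.

```python
from typing import Any


def extract_pages(reader: Any) -> dict:
    """Collect text from a pypdf-style reader exposing `.pages` and `page.extract_text()`."""
    texts = []
    for page in reader.pages:
        # Image-only pages may yield no text; normalize to an empty string.
        texts.append(page.extract_text() or "")
    return {"page_count": len(texts), "text": "\n\n".join(texts).strip()}


def extract_pdf_text(path: str) -> dict:
    """Open a PDF with pypdf and return its page count and extracted text."""
    from pypdf import PdfReader  # lazy import: extract_pages stays dependency-free

    return extract_pages(PdfReader(path))
```

Splitting the page loop from the file handling makes the multi-page logic testable with stand-in page objects, without a real PDF on disk.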

S3 Vectors Configuration

Dimension: 3072 (Nova embeddings)
Distance metric: Cosine
Data type: float32
Metadata: originalFilename, summary (non-filterable)
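
The sketch below shows how these settings might translate into code: a plain-Python cosine distance (taken here as 1 minus cosine similarity, the usual definition) and a helper that shapes one entry for the `vectors` list of an S3 Vectors `put_vectors` call. The helper names and the choice of record key are assumptions; check the payload shape against the current boto3 `s3vectors` documentation before relying on it.

```python
import math
from typing import Sequence

EMBEDDING_DIMENSION = 3072  # dimension stated in the configuration above


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def build_vector_record(key: str, embedding: Sequence[float], filename: str, summary: str) -> dict:
    """Shape one vector entry; rejects embeddings of the wrong dimension."""
    if len(embedding) != EMBEDDING_DIMENSION:
        raise ValueError(f"expected {EMBEDDING_DIMENSION} values, got {len(embedding)}")
    return {
        "key": key,  # assumption: any unique string works, such as the object key
        "data": {"float32": [float(v) for v in embedding]},
        "metadata": {"originalFilename": filename, "summary": summary},
    }
```

Validating the dimension before the write gives a clear local error rather than a rejected request when the embedding model and the index configuration drift apart.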

License

This project is licensed under the MIT License.
