PDF Analysis and Chunking Lambda

This Lambda function is the first step in the Step Functions workflow for chunked document processing. It analyzes PDFs to determine if chunking is needed, and if so, splits the PDF into chunks and uploads them to S3.

Overview

This is a single Lambda function that performs both analysis and chunking to avoid downloading the PDF twice. It:

  1. Analyzes the PDF to determine token count and page count
  2. Determines if chunking is required based on strategy and thresholds
  3. If no chunking needed: returns analysis metadata only
  4. If chunking needed: splits PDF and uploads chunks to S3

Files

  • handler.py - Main Lambda handler function
  • token_estimation.py - Token estimation module (word-based heuristic)
  • chunking_strategies.py - Chunking algorithms (fixed-pages, token-based, hybrid)
  • requirements.txt - Python dependencies (PyPDF2, boto3)
  • test_handler.py - Unit tests for handler
  • test_token_estimation.py - Unit tests for token estimation
  • test_chunking_strategies.py - Unit tests for chunking strategies
  • test_integration.py - Integration tests (requires AWS setup)

Configuration

The Lambda supports three chunking strategies:

1. Fixed-Pages Strategy (Legacy)

Simple page-based chunking. Fast but doesn't account for token density.

config = {
    'strategy': 'fixed-pages',
    'pageThreshold': 100,
    'chunkSize': 50,
    'overlapPages': 5
}
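
As a rough illustration of the fixed-pages strategy, the sketch below computes inclusive, 0-based page ranges from `chunkSize` and `overlapPages`. The function name `fixed_page_ranges` is hypothetical, and the way overlap is applied (each chunk after the first starts `overlap_pages` before the previous chunk's end) is an assumption, not necessarily what chunking_strategies.py does.

```python
from typing import List, Tuple


def fixed_page_ranges(total_pages: int, chunk_size: int = 50,
                      overlap_pages: int = 5) -> List[Tuple[int, int]]:
    """Return inclusive, 0-based (start_page, end_page) ranges.

    Hypothetical helper: the overlap handling here is an assumption.
    """
    if chunk_size <= overlap_pages:
        raise ValueError("chunk_size must be larger than overlap_pages")
    ranges = []
    start = 0
    while start < total_pages:
        end = min(start + chunk_size, total_pages) - 1
        ranges.append((start, end))
        if end == total_pages - 1:
            break
        # Step back by the overlap so consecutive chunks share pages.
        start = end + 1 - overlap_pages
    return ranges
```

With the defaults above, a 120-page document would yield three ranges, and the last one is shorter than `chunk_size`.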

2. Token-Based Strategy

Token-aware chunking that respects model limits. Ideal for documents whose token density varies from page to page.

config = {
    'strategy': 'token-based',
    'tokenThreshold': 150000,
    'maxTokensPerChunk': 100000,
    'overlapTokens': 5000
}
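
One way to picture token-based chunking is a greedy pass over per-page token estimates. The sketch below is only an assumption about how such a pass could work: `token_based_ranges` is a hypothetical name, chunks are cut on page boundaries, and overlap is approximated by walking back whole pages until about `overlap_tokens` are covered.

```python
from typing import List, Tuple


def token_based_ranges(tokens_per_page: List[int],
                       max_tokens_per_chunk: int = 100_000,
                       overlap_tokens: int = 5_000) -> List[Tuple[int, int]]:
    """Greedily pack pages into chunks under a token budget (hypothetical)."""
    ranges = []
    n = len(tokens_per_page)
    start = 0
    while start < n:
        total = 0
        end = start  # exclusive end of the current chunk
        # Always take at least one page, even if it alone exceeds the budget.
        while end < n and (end == start or
                           total + tokens_per_page[end] <= max_tokens_per_chunk):
            total += tokens_per_page[end]
            end += 1
        ranges.append((start, end - 1))
        if end >= n:
            break
        # Walk back whole pages from the chunk end to build the overlap,
        # but never back to the chunk's own start, so progress is guaranteed.
        next_start = end
        overlap = 0
        while (next_start > start + 1 and
               overlap + tokens_per_page[next_start - 1] <= overlap_tokens):
            next_start -= 1
            overlap += tokens_per_page[next_start]
        start = next_start
    return ranges
```

Because the cut points follow token counts rather than page counts, dense sections produce shorter chunks and sparse sections produce longer ones.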

3. Hybrid Strategy

Best of both worlds: targets a token count per chunk while respecting a page limit.

config = {
    'strategy': 'hybrid',
    'pageThreshold': 100,
    'tokenThreshold': 150000,
    'targetTokensPerChunk': 80000,
    'maxPagesPerChunk': 99,  # Bedrock has a hard limit of 100 pages
    'overlapTokens': 5000
}
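
The interaction between `targetTokensPerChunk` and `maxPagesPerChunk` can be sketched as a single loop that closes a chunk when either limit would be exceeded. This is an illustrative assumption rather than the actual algorithm: `hybrid_ranges` is a hypothetical name, and overlap is left out to keep the sketch short.

```python
from typing import List, Tuple


def hybrid_ranges(tokens_per_page: List[int],
                  target_tokens_per_chunk: int = 80_000,
                  max_pages_per_chunk: int = 99) -> List[Tuple[int, int]]:
    """Split on whichever limit is hit first: tokens or pages (hypothetical).

    Overlap (overlapTokens) is intentionally omitted from this sketch.
    """
    if max_pages_per_chunk < 1:
        raise ValueError("max_pages_per_chunk must be at least 1")
    ranges = []
    n = len(tokens_per_page)
    start = 0
    while start < n:
        total = 0
        end = start  # exclusive end of the current chunk
        while end < n and end - start < max_pages_per_chunk:
            # Stop before a page that would push the chunk past the target,
            # unless the chunk is still empty.
            if end > start and total + tokens_per_page[end] > target_tokens_per_chunk:
                break
            total += tokens_per_page[end]
            end += 1
        ranges.append((start, end - 1))
        start = end
    return ranges
```

For dense documents the token target decides the cut points; for sparse documents the page cap does.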

Event Format

Input Event (from SQS Consumer)

The Lambda receives events from the SQS Consumer in this exact format:

{
  "documentId": "invoice-2024-001-1705315800000",
  "contentType": "file",
  "content": {
    "location": "s3",
    "bucket": "my-document-bucket",
    "key": "raw/invoice-2024-001.pdf",
    "filename": "invoice-2024-001.pdf"
  },
  "eventTime": "2024-01-15T10:30:00.000Z",
  "eventName": "ObjectCreated:Put",
  "source": "sqs-consumer"
}

Required Fields:

  • documentId - Unique document identifier (generated by SQS consumer)
  • content.bucket - S3 bucket name
  • content.key - S3 object key (must be under the raw/ prefix)

Optional Fields:

  • contentType - Must be "file" if provided; file-based processing is the default when it is omitted
  • content.location - Always "s3" (informational)
  • content.filename - Original filename (informational)
  • eventTime - S3 event timestamp (informational)
  • eventName - S3 event name (informational)
  • source - Event source identifier (informational)
  • config - Optional chunking configuration override

Input Event (with Custom Configuration)

You can override chunking configuration per document:

{
  "documentId": "doc-123",
  "contentType": "file",
  "content": {
    "bucket": "document-bucket",
    "key": "raw/document.pdf",
    "filename": "document.pdf"
  },
  "config": {
    "strategy": "hybrid",
    "pageThreshold": 100,
    "tokenThreshold": 150000,
    "targetTokensPerChunk": 80000,
    "maxPagesPerChunk": 99
  }
}

Output (No Chunking)

{
  "documentId": "doc-123",
  "requiresChunking": false,
  "tokenAnalysis": {
    "totalTokens": 45000,
    "totalPages": 30,
    "avgTokensPerPage": 1500
  },
  "reason": "Document has 30 pages, below threshold of 100"
}
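
The `requiresChunking` flag and `reason` string could come from a decision function along the following lines. This is a sketch under several assumptions: the name `decide_chunking` is hypothetical, a document is chunked when it reaches a threshold (`>=`), and the hybrid strategy chunks when either threshold is reached. Only the "below threshold" wording is taken from the example output above.

```python
from typing import Tuple


def decide_chunking(total_pages: int, total_tokens: int,
                    strategy: str = "hybrid",
                    page_threshold: int = 100,
                    token_threshold: int = 150_000) -> Tuple[bool, str]:
    """Return (requires_chunking, reason). Threshold semantics are assumed."""
    over_pages = total_pages >= page_threshold
    over_tokens = total_tokens >= token_threshold
    pages_msg = (f"Document has {total_pages} pages, "
                 f"{'at or above' if over_pages else 'below'} "
                 f"threshold of {page_threshold}")
    tokens_msg = (f"Document has {total_tokens} tokens, "
                  f"{'at or above' if over_tokens else 'below'} "
                  f"threshold of {token_threshold}")
    if strategy == "fixed-pages":
        return over_pages, pages_msg
    if strategy == "token-based":
        return over_tokens, tokens_msg
    if strategy == "hybrid":
        # Assumption: hybrid chunks when either threshold is reached.
        if over_pages:
            return True, pages_msg
        if over_tokens:
            return True, tokens_msg
        return False, pages_msg
    raise ValueError(f"Unknown strategy: {strategy}")
```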

Output (Chunking)

{
  "documentId": "doc-456",
  "requiresChunking": true,
  "tokenAnalysis": {
    "totalTokens": 200000,
    "totalPages": 150,
    "avgTokensPerPage": 1333,
    "tokensPerPage": [...]
  },
  "strategy": "hybrid",
  "chunks": [
    {
      "chunkId": "doc-456_chunk_0",
      "chunkIndex": 0,
      "totalChunks": 2,
      "startPage": 0,
      "endPage": 74,
      "pageCount": 75,
      "estimatedTokens": 100000,
      "bucket": "document-bucket",
      "key": "chunks/doc-456_chunk_0.pdf"
    }
  ],
  "config": {
    "strategy": "hybrid",
    "totalPages": 150,
    "totalTokens": 200000,
    "targetTokensPerChunk": 80000,
    "maxPagesPerChunk": 99
  }
}
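
The entries in `chunks` follow a regular naming pattern that can be derived from the document ID and chunk index. The sketch below shows one way to assemble those records from a list of page ranges; `build_chunk_records` is a hypothetical helper, and summing per-page estimates for `estimatedTokens` is an assumption.

```python
from typing import Dict, List, Tuple


def build_chunk_records(document_id: str,
                        ranges: List[Tuple[int, int]],
                        tokens_per_page: List[int],
                        bucket: str,
                        prefix: str = "chunks") -> List[Dict]:
    """Build chunk metadata in the shape of the output above (hypothetical)."""
    records = []
    for index, (start, end) in enumerate(ranges):
        chunk_id = f"{document_id}_chunk_{index}"
        records.append({
            "chunkId": chunk_id,
            "chunkIndex": index,
            "totalChunks": len(ranges),
            "startPage": start,          # 0-based, inclusive
            "endPage": end,              # 0-based, inclusive
            "pageCount": end - start + 1,
            # Assumption: the estimate is the sum of per-page estimates.
            "estimatedTokens": sum(tokens_per_page[start:end + 1]),
            "bucket": bucket,
            "key": f"{prefix}/{chunk_id}.pdf",
        })
    return records
```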

Validation

The Lambda performs several validation checks:

1. Payload Validation

  • documentId - Must be present
  • content.bucket - Must be present
  • content.key - Must be present
  • contentType - Must be "file" if provided (only file-based processing is supported)
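
The payload checks above could look roughly like the following. The function name `validate_event` is hypothetical, and apart from the documentId message (which appears in the error response example in this README) the error message texts are assumptions.

```python
def validate_event(event: dict) -> None:
    """Raise ValueError if a required field is missing (hypothetical helper)."""
    if not event.get("documentId"):
        raise ValueError("Missing required field: documentId")
    content = event.get("content") or {}
    for field in ("bucket", "key"):
        if not content.get(field):
            # Message format is an assumption modeled on the documentId case.
            raise ValueError(f"Missing required field: content.{field}")
    # contentType is optional; only file-based processing is supported.
    content_type = event.get("contentType", "file")
    if content_type != "file":
        raise ValueError(f"Unsupported contentType: {content_type}")
```

Raising `ValueError` here matches the `"type": "ValueError"` shown in the standardized error response.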

2. File Extension Check

  • Logs a warning if the file lacks a .pdf extension
  • Still processes the file (validates using magic bytes)
  • Useful for catching misnamed files

3. PDF Magic Bytes Validation

  • Validates file starts with %PDF- before processing
  • Prevents wasting resources on non-PDF files
  • Rejects HTML, text, images, and other formats

4. PDF Format Validation

  • Uses PyPDF2 to validate PDF structure
  • Detects corrupted or invalid PDFs
  • Rejects encrypted PDFs (not supported)

Error Responses

All validation errors return a standardized error response:

{
  "documentId": "doc-123",
  "requiresChunking": false,
  "error": {
    "type": "ValueError",
    "message": "Missing required field: documentId"
  }
}
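
A response of this shape can be built directly from a caught exception. The helper name `error_response` is hypothetical; the field names come from the example above.

```python
from typing import Optional


def error_response(document_id: Optional[str], exc: Exception) -> dict:
    """Convert an exception into the standardized error payload (sketch)."""
    return {
        "documentId": document_id,
        "requiresChunking": False,
        "error": {
            "type": type(exc).__name__,  # e.g. "ValueError"
            "message": str(exc),
        },
    }
```

Returning a payload instead of re-raising lets the Step Functions workflow inspect the `error` field in a later state.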

The Lambda handles various error scenarios:

  1. Non-PDF files - Validates file starts with PDF magic bytes (%PDF-) before processing
  2. Invalid PDF format - Returns error response if PyPDF2 cannot parse the file
  3. Corrupted PDF files - Returns error response with details
  4. S3 access denied - Returns error with specific message
  5. Corrupted pages - Skips page, logs warning, continues with remaining pages
  6. S3 write failures - Retries with exponential backoff (3 attempts)
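
The retry behavior for S3 writes (item 6) can be sketched as a small wrapper. This is an assumption about the shape of the logic, not the handler's actual code: `with_retries` is a hypothetical name, the 1-second base delay is a guess, and only the three-attempt limit comes from the list above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(operation: Callable[[], T],
                 attempts: int = 3,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run operation, retrying with exponential backoff (hypothetical)."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error.
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

An upload would then be wrapped as `with_retries(lambda: s3.put_object(Bucket=bucket, Key=key, Body=data))`, where `s3` is a boto3 S3 client. The `sleep` parameter exists so tests can avoid real waiting.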

PDF Validation

Before attempting to process a file, the Lambda validates it's actually a PDF by checking the magic bytes:

  • Valid PDFs must start with %PDF- (hex: 25 50 44 46 2D)
  • Files without this signature are rejected immediately
  • This prevents wasting resources on non-PDF files (HTML, text, images, etc.)
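
The magic-bytes check amounts to comparing the first five bytes of the file against the signature. A minimal sketch, with `looks_like_pdf` as a hypothetical name:

```python
PDF_MAGIC = b"%PDF-"  # hex: 25 50 44 46 2D


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the data begins with the PDF signature."""
    return data[:len(PDF_MAGIC)] == PDF_MAGIC
```

Only the first few bytes are needed, so the check can run before any PDF parsing is attempted.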

Testing

Unit Tests

python test_handler.py

Integration Tests

Requires AWS credentials and test bucket:

export RUN_INTEGRATION_TESTS=true
export TEST_BUCKET=your-test-bucket-name
python test_integration.py

Test PDFs should be uploaded to:

  • s3://your-test-bucket/test-data/small-document.pdf
  • s3://your-test-bucket/test-data/large-document.pdf
  • s3://your-test-bucket/test-data/invalid.pdf

Performance

  • Token analysis: 2-5 seconds for 100-page PDF
  • Chunking: ~1 second per chunk
  • Memory: 2048 MB recommended
  • Timeout: 10 minutes recommended

IAM Permissions Required

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::bucket-name/raw/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::bucket-name/chunks/*"
    }
  ]
}

Environment Variables

  • CHUNKING_STRATEGY - Default strategy (default: 'hybrid')
  • PAGE_THRESHOLD - Page count threshold (default: 100)
  • TOKEN_THRESHOLD - Token count threshold (default: 150000)
  • CHUNK_SIZE - Pages per chunk for fixed-pages (default: 50)
  • OVERLAP_PAGES - Overlap pages for fixed-pages (default: 5)
  • MAX_TOKENS_PER_CHUNK - Max tokens for token-based (default: 100000)
  • OVERLAP_TOKENS - Overlap tokens (default: 5000)
  • TARGET_TOKENS_PER_CHUNK - Target tokens for hybrid (default: 80000)
  • MAX_PAGES_PER_CHUNK - Max pages for hybrid (default: 99, Bedrock limit is 100)
  • LOG_LEVEL - Logging level (default: 'INFO')
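
The environment variables above supply defaults that a per-document `config` can override. The sketch below shows one plausible way to combine the two; `load_config` is a hypothetical name, and the mapping from variable names to config keys is inferred from the configuration examples in this README rather than confirmed by the handler.

```python
import os
from typing import Mapping, Optional

# (environment variable, default, type) per config key. The pairing of
# variables to keys is an assumption based on the names used in this README.
_SETTINGS = {
    "strategy": ("CHUNKING_STRATEGY", "hybrid", str),
    "pageThreshold": ("PAGE_THRESHOLD", 100, int),
    "tokenThreshold": ("TOKEN_THRESHOLD", 150000, int),
    "chunkSize": ("CHUNK_SIZE", 50, int),
    "overlapPages": ("OVERLAP_PAGES", 5, int),
    "maxTokensPerChunk": ("MAX_TOKENS_PER_CHUNK", 100000, int),
    "overlapTokens": ("OVERLAP_TOKENS", 5000, int),
    "targetTokensPerChunk": ("TARGET_TOKENS_PER_CHUNK", 80000, int),
    "maxPagesPerChunk": ("MAX_PAGES_PER_CHUNK", 99, int),
}


def load_config(overrides: Optional[Mapping] = None,
                environ: Optional[Mapping[str, str]] = None) -> dict:
    """Merge defaults, environment variables, and event overrides."""
    env = os.environ if environ is None else environ
    config = {}
    for key, (var, default, cast) in _SETTINGS.items():
        raw = env.get(var)
        config[key] = cast(raw) if raw is not None else default
    # Per-document overrides from the event take precedence.
    config.update(overrides or {})
    return config
```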

Architecture Integration

This Lambda is invoked by Step Functions as the first step in the workflow (before Init Metadata). The SQS Consumer is unchanged: it triggers Step Functions as before.

The workflow structure:

SQS Consumer → Step Functions → PDF Analysis & Chunking Lambda → Init Metadata → ...

Token Estimation

Uses a word-based heuristic for fast estimation:

  • Count words using regex \b\w+\b
  • Apply 1.3 tokens per word multiplier
  • Accuracy: ~85-90% for English text
  • Speed: ~0.2 seconds per 100 pages

Can be upgraded to tiktoken for production if needed.
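
The heuristic described above fits in a few lines. This is a sketch rather than the contents of token_estimation.py: `estimate_tokens` is a hypothetical name, and rounding up is an assumption.

```python
import math
import re

_WORD_RE = re.compile(r"\b\w+\b")
TOKENS_PER_WORD = 1.3


def estimate_tokens(text: str) -> int:
    """Estimate tokens as word count times 1.3, rounded up (assumed rounding)."""
    word_count = len(_WORD_RE.findall(text))
    return math.ceil(word_count * TOKENS_PER_WORD)
```

Applied to the extracted text of each page, this yields the per-page estimates that the token-based and hybrid strategies work from.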