PDF Analysis and Chunking Lambda
This Lambda function is the first step in the Step Functions workflow for chunked document processing. It analyzes PDFs to determine if chunking is needed, and if so, splits the PDF into chunks and uploads them to S3.
Overview
This is a single Lambda function that performs both analysis and chunking to avoid downloading the PDF twice. It:
- Analyzes the PDF to determine token count and page count
- Determines if chunking is required based on strategy and thresholds
- If no chunking needed: returns analysis metadata only
- If chunking needed: splits PDF and uploads chunks to S3
Files
- handler.py - Main Lambda handler function
- token_estimation.py - Token estimation module (word-based heuristic)
- chunking_strategies.py - Chunking algorithms (fixed-pages, token-based, hybrid)
- requirements.txt - Python dependencies (PyPDF2, boto3)
- test_handler.py - Unit tests for handler
- test_token_estimation.py - Unit tests for token estimation
- test_chunking_strategies.py - Unit tests for chunking strategies
- test_integration.py - Integration tests (requires AWS setup)
Configuration
The Lambda supports three chunking strategies:
1. Fixed-Pages Strategy (Legacy)
Simple page-based chunking. Fast but doesn't account for token density.
config = {
'strategy': 'fixed-pages',
'pageThreshold': 100,
'chunkSize': 50,
'overlapPages': 5
}
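To illustrate the idea, here is a minimal sketch of how fixed-page chunk boundaries could be computed with this config. The function name and exact overlap semantics are illustrative, not the actual chunking_strategies.py implementation: each chunk after the first starts a few pages before the previous chunk ends, so context is shared across boundaries.

```python
def fixed_page_ranges(total_pages, chunk_size=50, overlap_pages=5):
    """Compute (start_page, end_page) ranges, 0-indexed inclusive.

    Illustrative sketch: chunks are `chunk_size` pages long, and each
    new chunk begins `overlap_pages` before the previous chunk ended.
    """
    ranges = []
    step = chunk_size - overlap_pages
    start = 0
    while start < total_pages:
        end = min(start + chunk_size - 1, total_pages - 1)
        ranges.append((start, end))
        if end == total_pages - 1:
            break
        start += step
    return ranges

# For a 120-page PDF with the config above:
# fixed_page_ranges(120) -> [(0, 49), (45, 94), (90, 119)]
```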
2. Token-Based Strategy
Token-aware chunking that respects model limits. Ideal for variable density documents.
config = {
'strategy': 'token-based',
'tokenThreshold': 150000,
'maxTokensPerChunk': 100000,
'overlapTokens': 5000
}
3. Hybrid Strategy (RECOMMENDED)
Best of both worlds - targets token count but respects page limits.
config = {
'strategy': 'hybrid',
'pageThreshold': 100,
'tokenThreshold': 150000,
'targetTokensPerChunk': 80000,
'maxPagesPerChunk': 99, # Bedrock has a hard limit of 100 pages
'overlapTokens': 5000
}
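Conceptually, the hybrid strategy is a greedy split over per-page token estimates: a chunk is closed when adding the next page would exceed the token target, or when the page cap is hit. The sketch below is an assumption about how such a split might look, not the actual implementation (it also omits overlap handling for brevity):

```python
def hybrid_ranges(tokens_per_page, target_tokens=80000, max_pages=99):
    """Greedy hybrid split (illustrative sketch, no overlap).

    Closes the current chunk when the next page would push it past
    `target_tokens`, or past the `max_pages` cap (Bedrock's 100-page
    hard limit motivates the default of 99).
    Returns (start_page, end_page) ranges, 0-indexed inclusive.
    """
    ranges, start, running = [], 0, 0
    for page, toks in enumerate(tokens_per_page):
        pages_in_chunk = page - start + 1
        if running and (running + toks > target_tokens or pages_in_chunk > max_pages):
            ranges.append((start, page - 1))
            start, running = page, 0
        running += toks
    ranges.append((start, len(tokens_per_page) - 1))
    return ranges
```

With uniform pages of 1000 tokens and a 2500-token target, every chunk ends up two pages long; with dense pages, chunks shrink automatically, which is why this strategy suits variable-density documents.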
Event Format
Input Event (from SQS Consumer)
The Lambda receives events from the SQS Consumer in this exact format:
{
"documentId": "invoice-2024-001-1705315800000",
"contentType": "file",
"content": {
"location": "s3",
"bucket": "my-document-bucket",
"key": "raw/invoice-2024-001.pdf",
"filename": "invoice-2024-001.pdf"
},
"eventTime": "2024-01-15T10:30:00.000Z",
"eventName": "ObjectCreated:Put",
"source": "sqs-consumer"
}
Required Fields:
- documentId - Unique document identifier (generated by SQS consumer)
- content.bucket - S3 bucket name
- content.key - S3 object key (must be in raw/ prefix)
Optional Fields:
- contentType - Must be "file" if provided (default behavior)
- content.location - Always "s3" (informational)
- content.filename - Original filename (informational)
- eventTime - S3 event timestamp (informational)
- eventName - S3 event name (informational)
- source - Event source identifier (informational)
- config - Optional chunking configuration override
Input Event (with Custom Configuration)
You can override chunking configuration per document:
{
"documentId": "doc-123",
"contentType": "file",
"content": {
"bucket": "document-bucket",
"key": "raw/document.pdf",
"filename": "document.pdf"
},
"config": {
"strategy": "hybrid",
"pageThreshold": 100,
"tokenThreshold": 150000,
"targetTokensPerChunk": 80000,
"maxPagesPerChunk": 99
}
}
Output (No Chunking)
{
"documentId": "doc-123",
"requiresChunking": false,
"tokenAnalysis": {
"totalTokens": 45000,
"totalPages": 30,
"avgTokensPerPage": 1500
},
"reason": "Document has 30 pages, below threshold of 100"
}
Output (Chunking)
{
"documentId": "doc-456",
"requiresChunking": true,
"tokenAnalysis": {
"totalTokens": 200000,
"totalPages": 150,
"avgTokensPerPage": 1333,
"tokensPerPage": [...]
},
"strategy": "hybrid",
"chunks": [
{
"chunkId": "doc-456_chunk_0",
"chunkIndex": 0,
"totalChunks": 2,
"startPage": 0,
"endPage": 74,
"pageCount": 75,
"estimatedTokens": 100000,
"bucket": "document-bucket",
"key": "chunks/doc-456_chunk_0.pdf"
}
],
"config": {
"strategy": "hybrid",
"totalPages": 150,
"totalTokens": 200000,
"targetTokensPerChunk": 80000,
"maxPagesPerChunk": 99
}
}
Validation
The Lambda performs several validation checks:
1. Payload Validation
- documentId - Must be present
- content.bucket - Must be present
- content.key - Must be present
- contentType - Must be "file" if provided (only file-based processing is supported)
2. File Extension Check
- Logs a warning if the file doesn't have a .pdf extension
- Still processes the file (validates using magic bytes)
- Useful for catching misnamed files
3. PDF Magic Bytes Validation
- Validates the file starts with %PDF- before processing
- Prevents wasting resources on non-PDF files
- Rejects HTML, text, images, and other formats
4. PDF Format Validation
- Uses PyPDF2 to validate PDF structure
- Detects corrupted or invalid PDFs
- Rejects encrypted PDFs (not supported)
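The cheap pre-checks (extension and magic bytes) can be sketched as below. The function name and return convention are illustrative assumptions; the structural and encryption checks that follow in the real handler go through PyPDF2's reader and are only noted in a comment here:

```python
import logging

logger = logging.getLogger(__name__)

PDF_MAGIC = b"%PDF-"  # hex: 25 50 44 46 2D

def validate_pdf_bytes(key, data):
    """Cheap pre-checks before handing the bytes to PyPDF2.

    Returns None on success, or an error message describing why the
    object was rejected. Structural validation (corruption, encryption)
    is left to the PyPDF2 parsing step and is not reproduced here.
    """
    if not key.lower().endswith(".pdf"):
        # Misnamed files are still processed; the magic bytes decide.
        logger.warning("Object %s lacks a .pdf extension", key)
    if not data.startswith(PDF_MAGIC):
        return f"Object {key} is not a PDF (missing %PDF- signature)"
    return None
```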
Error Responses
All validation errors return a standardized error response:
{
"documentId": "doc-123",
"requiresChunking": false,
"error": {
"type": "ValueError",
"message": "Missing required field: documentId"
}
}
The Lambda handles various error scenarios:
- Non-PDF files - Validates file starts with PDF magic bytes (%PDF-) before processing
- Invalid PDF format - Returns error response if PyPDF2 cannot parse the file
- Corrupted PDF files - Returns error response with details
- S3 access denied - Returns error with specific message
- Corrupted pages - Skips page, logs warning, continues with remaining pages
- S3 write failures - Retries with exponential backoff (3 attempts)
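The retry behavior for S3 writes can be sketched as a generic exponential-backoff wrapper. This is a simplified stand-in (the `put_fn` callable represents the boto3 `put_object` call, and the delay schedule is an assumption):

```python
import time

def put_with_retry(put_fn, attempts=3, base_delay=1.0):
    """Retry a flaky operation with exponential backoff (1s, 2s, 4s, ...).

    `put_fn` stands in for the boto3 s3.put_object call; the last
    failure is re-raised once all attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return put_fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```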
PDF Validation
Before attempting to process a file, the Lambda validates it's actually a PDF by checking the magic bytes:
- Valid PDFs must start with %PDF- (hex: 25 50 44 46 2D)
- Files without this signature are rejected immediately
- This prevents wasting resources on non-PDF files (HTML, text, images, etc.)
Testing
Unit Tests
python test_handler.py
Integration Tests
Requires AWS credentials and test bucket:
export RUN_INTEGRATION_TESTS=true
export TEST_BUCKET=your-test-bucket-name
python test_integration.py
Test PDFs should be uploaded to:
- s3://your-test-bucket/test-data/small-document.pdf
- s3://your-test-bucket/test-data/large-document.pdf
- s3://your-test-bucket/test-data/invalid.pdf
Performance
- Token analysis: 2-5 seconds for 100-page PDF
- Chunking: ~1 second per chunk
- Memory: 2048 MB recommended
- Timeout: 10 minutes recommended
IAM Permissions Required
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Resource": "arn:aws:s3:::bucket-name/raw/*"
},
{
"Effect": "Allow",
"Action": [
"s3:PutObject"
],
"Resource": "arn:aws:s3:::bucket-name/chunks/*"
}
]
}
Environment Variables
- CHUNKING_STRATEGY - Default strategy (default: 'hybrid')
- PAGE_THRESHOLD - Page count threshold (default: 100)
- TOKEN_THRESHOLD - Token count threshold (default: 150000)
- CHUNK_SIZE - Pages per chunk for fixed-pages (default: 50)
- OVERLAP_PAGES - Overlap pages for fixed-pages (default: 5)
- MAX_TOKENS_PER_CHUNK - Max tokens for token-based (default: 100000)
- OVERLAP_TOKENS - Overlap tokens (default: 5000)
- TARGET_TOKENS_PER_CHUNK - Target tokens for hybrid (default: 80000)
- MAX_PAGES_PER_CHUNK - Max pages for hybrid (default: 99; Bedrock limit is 100)
- LOG_LEVEL - Logging level (default: 'INFO')
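Since configuration can come from environment variables or a per-document `config` override, resolution order matters: overrides win over environment variables, which win over defaults. A minimal sketch (the function name is illustrative; only a subset of the variables is shown):

```python
import os

def load_config(overrides=None):
    """Resolve chunking config: per-document overrides beat
    environment variables, which beat the documented defaults."""
    cfg = {
        "strategy": os.environ.get("CHUNKING_STRATEGY", "hybrid"),
        "pageThreshold": int(os.environ.get("PAGE_THRESHOLD", "100")),
        "tokenThreshold": int(os.environ.get("TOKEN_THRESHOLD", "150000")),
        "targetTokensPerChunk": int(os.environ.get("TARGET_TOKENS_PER_CHUNK", "80000")),
        "maxPagesPerChunk": int(os.environ.get("MAX_PAGES_PER_CHUNK", "99")),
        "overlapTokens": int(os.environ.get("OVERLAP_TOKENS", "5000")),
    }
    cfg.update(overrides or {})
    return cfg
```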
Architecture Integration
This Lambda is invoked by Step Functions as the first step in the workflow (before Init Metadata). The SQS Consumer requires no changes; it simply triggers Step Functions as before.
The workflow structure:
SQS Consumer → Step Functions → PDF Analysis & Chunking Lambda → Init Metadata → ...
Token Estimation
Uses word-based heuristic for fast estimation:
- Count words using the regex \b\w+\b
- Apply a 1.3 tokens-per-word multiplier
- Accuracy: ~85-90% for English text
- Speed: ~0.2 seconds per 100 pages
Can be upgraded to tiktoken for production if needed.
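The heuristic above fits in a few lines; this sketch mirrors the described approach (the function name is illustrative):

```python
import re

WORD_RE = re.compile(r"\b\w+\b")
TOKENS_PER_WORD = 1.3  # empirical multiplier from the heuristic above

def estimate_tokens(text):
    """Word-count heuristic: roughly 1.3 tokens per word of English text."""
    return int(len(WORD_RE.findall(text)) * TOKENS_PER_WORD)

# estimate_tokens("one two three four") -> 5  (4 words * 1.3, truncated)
```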