PDF Analysis and Chunking Lambda

This Lambda function is the first step in the Step Functions workflow for chunked document processing. It analyzes each PDF to determine whether chunking is needed and, if so, splits the PDF into chunks and uploads them to S3.

Overview

This is a single Lambda function that performs both analysis and chunking to avoid downloading the PDF twice. It:

Analyzes the PDF to determine token count and page count
Determines if chunking is required based on strategy and thresholds
If no chunking needed: returns analysis metadata only
If chunking needed: splits PDF and uploads chunks to S3

Files

handler.py - Main Lambda handler function
token_estimation.py - Token estimation module (word-based heuristic)
chunking_strategies.py - Chunking algorithms (fixed-pages, token-based, hybrid)
requirements.txt - Python dependencies (PyPDF2, boto3)
test_handler.py - Unit tests for handler
test_token_estimation.py - Unit tests for token estimation
test_chunking_strategies.py - Unit tests for chunking strategies
test_integration.py - Integration tests (requires AWS setup)

Configuration

The Lambda supports three chunking strategies:

1. Fixed-Pages Strategy (Legacy)

Simple page-based chunking. Fast but doesn't account for token density.

config = {
    'strategy': 'fixed-pages',
    'pageThreshold': 100,
    'chunkSize': 50,
    'overlapPages': 5
}
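
As a rough illustration of how chunkSize and overlapPages could interact, the sketch below computes zero-based, inclusive page ranges. The function name fixed_page_chunks is hypothetical, and it assumes each chunk starts overlapPages before the end of the previous one; the real logic in chunking_strategies.py may differ.

```python
def fixed_page_chunks(total_pages, chunk_size=50, overlap_pages=5):
    """Return (start_page, end_page) pairs, zero-based and inclusive.

    Assumption: consecutive chunks share `overlap_pages` pages.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_pages < chunk_size:
        raise ValueError("overlap_pages must be in [0, chunk_size)")

    step = chunk_size - overlap_pages  # how far each new chunk advances
    ranges = []
    start = 0
    while start < total_pages:
        end = min(start + chunk_size, total_pages) - 1
        ranges.append((start, end))
        if end == total_pages - 1:
            break  # the last page is covered; stop before emitting a redundant tail
        start += step
    return ranges
```

With the defaults above, a 120-page document would yield three ranges, each sharing five pages with its neighbour.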

2. Token-Based Strategy

Token-aware chunking that respects model limits. Ideal for variable-density documents.

config = {
    'strategy': 'token-based',
    'tokenThreshold': 150000,
    'maxTokensPerChunk': 100000,
    'overlapTokens': 5000
}
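
One possible reading of this strategy is a greedy packer over per-page token estimates. The sketch below is an assumption, not the module's actual algorithm: token_based_chunks is a hypothetical name, overlap is approximated by re-including trailing pages of the previous chunk, and a single page larger than the budget still gets its own chunk.

```python
def token_based_chunks(tokens_per_page, max_tokens=100000, overlap_tokens=5000):
    """Greedily pack pages into chunks of at most `max_tokens` tokens.

    Returns (start_page, end_page) pairs, zero-based and inclusive.
    """
    ranges = []
    n = len(tokens_per_page)
    start = 0
    while start < n:
        end = start
        total = tokens_per_page[start]  # always take at least one page
        while end + 1 < n and total + tokens_per_page[end + 1] <= max_tokens:
            end += 1
            total += tokens_per_page[end]
        ranges.append((start, end))
        if end == n - 1:
            break

        # Walk back from the chunk end to carry up to `overlap_tokens` of
        # context forward, but only while the next new page still fits.
        next_page = tokens_per_page[end + 1]
        next_start = end + 1
        carried = 0
        while next_start - 1 > start:
            candidate = carried + tokens_per_page[next_start - 1]
            if candidate > overlap_tokens or candidate + next_page > max_tokens:
                break
            next_start -= 1
            carried = candidate
        start = next_start
    return ranges
```

The second condition in the walk-back loop guarantees that every chunk after the first contains at least one page the previous chunk did not.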

3. Hybrid Strategy (RECOMMENDED)

Combines both approaches: it targets a token count per chunk while capping the number of pages per chunk.

config = {
    'strategy': 'hybrid',
    'pageThreshold': 100,
    'tokenThreshold': 150000,
    'targetTokensPerChunk': 80000,
    'maxPagesPerChunk': 99,  # Bedrock has a hard limit of 100 pages
    'overlapTokens': 5000
}
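
To make the interaction between targetTokensPerChunk and maxPagesPerChunk concrete, here is a minimal sketch that closes a chunk as soon as either limit would be exceeded. The name hybrid_chunks is hypothetical, and overlap handling is left out for brevity, so this is simpler than whatever chunking_strategies.py actually does.

```python
def hybrid_chunks(tokens_per_page, target_tokens=80000, max_pages=99):
    """Pack pages up to `target_tokens`, never exceeding `max_pages` per chunk.

    Returns (start_page, end_page) pairs, zero-based and inclusive.
    Overlap is intentionally omitted in this sketch.
    """
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")

    ranges = []
    n = len(tokens_per_page)
    start = 0
    while start < n:
        end = start
        total = tokens_per_page[start]  # a chunk always holds at least one page
        while (
            end + 1 < n
            and end + 1 - start < max_pages  # page cap
            and total + tokens_per_page[end + 1] <= target_tokens  # token target
        ):
            end += 1
            total += tokens_per_page[end]
        ranges.append((start, end))
        start = end + 1
    return ranges
```

For sparse documents the page cap is the binding limit; for dense ones the token target closes each chunk first.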

Event Format

Input Event (from SQS Consumer)

The Lambda receives events from the SQS Consumer in this exact format:

{
  "documentId": "invoice-2024-001-1705315800000",
  "contentType": "file",
  "content": {
    "location": "s3",
    "bucket": "my-document-bucket",
    "key": "raw/invoice-2024-001.pdf",
    "filename": "invoice-2024-001.pdf"
  },
  "eventTime": "2024-01-15T10:30:00.000Z",
  "eventName": "ObjectCreated:Put",
  "source": "sqs-consumer"
}

Required Fields:

documentId - Unique document identifier (generated by SQS consumer)
content.bucket - S3 bucket name
content.key - S3 object key (must be under the raw/ prefix)

Optional Fields:

contentType - Must be "file" if provided (default behavior)
content.location - Always "s3" (informational)
content.filename - Original filename (informational)
eventTime - S3 event timestamp (informational)
eventName - S3 event name (informational)
source - Event source identifier (informational)
config - Optional chunking configuration override

Input Event (with Custom Configuration)

You can override chunking configuration per document:

{
  "documentId": "doc-123",
  "contentType": "file",
  "content": {
    "bucket": "document-bucket",
    "key": "raw/document.pdf",
    "filename": "document.pdf"
  },
  "config": {
    "strategy": "hybrid",
    "pageThreshold": 100,
    "tokenThreshold": 150000,
    "targetTokensPerChunk": 80000,
    "maxPagesPerChunk": 99
  }
}

Output (No Chunking)

{
  "documentId": "doc-123",
  "requiresChunking": false,
  "tokenAnalysis": {
    "totalTokens": 45000,
    "totalPages": 30,
    "avgTokensPerPage": 1500
  },
  "reason": "Document has 30 pages, below threshold of 100"
}
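
The requiresChunking decision might look like the sketch below. Several points are assumptions rather than documented behavior: the function name requires_chunking, the strict greater-than comparison, and the rule that the hybrid strategy chunks when either threshold is exceeded.

```python
def requires_chunking(total_pages, total_tokens, config):
    """Decide whether a document should be split, given analysis results.

    Assumptions: thresholds are exclusive, and 'hybrid' triggers on
    either the page threshold or the token threshold.
    """
    strategy = config.get("strategy", "hybrid")
    over_pages = total_pages > config.get("pageThreshold", 100)
    over_tokens = total_tokens > config.get("tokenThreshold", 150000)

    if strategy == "fixed-pages":
        return over_pages
    if strategy == "token-based":
        return over_tokens
    if strategy == "hybrid":
        return over_pages or over_tokens
    raise ValueError(f"Unknown strategy: {strategy}")
```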

Output (Chunking)

{
  "documentId": "doc-456",
  "requiresChunking": true,
  "tokenAnalysis": {
    "totalTokens": 200000,
    "totalPages": 150,
    "avgTokensPerPage": 1333,
    "tokensPerPage": [...]
  },
  "strategy": "hybrid",
  "chunks": [
    {
      "chunkId": "doc-456_chunk_0",
      "chunkIndex": 0,
      "totalChunks": 2,
      "startPage": 0,
      "endPage": 74,
      "pageCount": 75,
      "estimatedTokens": 100000,
      "bucket": "document-bucket",
      "key": "chunks/doc-456_chunk_0.pdf"
    }
  ],
  "config": {
    "strategy": "hybrid",
    "totalPages": 150,
    "totalTokens": 200000,
    "targetTokensPerChunk": 80000,
    "maxPagesPerChunk": 99
  }
}
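
The entries in chunks follow a regular pattern, which the sketch below reproduces from a list of page ranges. The helper name build_chunk_records is hypothetical; the field names and the chunks/ key layout are taken from the example output above, and page indices are assumed to be zero-based and inclusive, as that example suggests.

```python
def build_chunk_records(document_id, bucket, page_ranges, tokens_per_page):
    """Build one metadata record per chunk, mirroring the output format above."""
    total_chunks = len(page_ranges)
    records = []
    for index, (start, end) in enumerate(page_ranges):
        chunk_id = f"{document_id}_chunk_{index}"
        records.append({
            "chunkId": chunk_id,
            "chunkIndex": index,
            "totalChunks": total_chunks,
            "startPage": start,
            "endPage": end,
            "pageCount": end - start + 1,  # inclusive range
            "estimatedTokens": sum(tokens_per_page[start:end + 1]),
            "bucket": bucket,
            "key": f"chunks/{chunk_id}.pdf",
        })
    return records
```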

Validation

The Lambda performs several validation checks:

1. Payload Validation

documentId - Must be present
content.bucket - Must be present
content.key - Must be present
contentType - Must be "file" if provided (only file-based processing is supported)
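
These checks can be sketched as a small function. The name validate_payload is hypothetical, and only the "Missing required field: documentId" message is confirmed by this document; the other messages are illustrative.

```python
def validate_payload(event):
    """Validate the input event and return (document_id, bucket, key).

    Raises ValueError on a missing field or unsupported contentType.
    """
    document_id = event.get("documentId")
    if not document_id:
        raise ValueError("Missing required field: documentId")

    content = event.get("content") or {}
    for field in ("bucket", "key"):
        if not content.get(field):
            # Message wording here is an assumption.
            raise ValueError(f"Missing required field: content.{field}")

    # contentType is optional; when present it must be "file".
    content_type = event.get("contentType", "file")
    if content_type != "file":
        raise ValueError(f"Unsupported contentType: {content_type}")

    return document_id, content["bucket"], content["key"]
```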

2. File Extension Check

Logs a warning if the file does not have a .pdf extension
Still processes the file, since the magic bytes check decides whether it is a PDF
Useful for catching misnamed files
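
A minimal sketch of this non-blocking check, assuming a case-insensitive comparison; the function name has_pdf_extension and the log wording are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def has_pdf_extension(key):
    """Return True if the S3 key ends in .pdf; warn (but do not fail) otherwise."""
    matches = key.lower().endswith(".pdf")
    if not matches:
        logger.warning("Key %s has no .pdf extension; relying on magic bytes check", key)
    return matches
```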

3. PDF Magic Bytes Validation

Validates that the file starts with %PDF- before processing
Prevents wasting resources on non-PDF files
Rejects HTML, text, images, and other formats

4. PDF Format Validation

Uses PyPDF2 to validate PDF structure
Detects corrupted or invalid PDFs
Rejects encrypted PDFs (not supported)

Error Responses

All validation errors return a standardized error response:

{
  "documentId": "doc-123",
  "requiresChunking": false,
  "error": {
    "type": "ValueError",
    "message": "Missing required field: documentId"
  }
}
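
A response of this shape can be built directly from the caught exception. The helper name build_error_response is hypothetical; the field layout follows the example above.

```python
def build_error_response(document_id, exc):
    """Convert an exception into the standardized error response."""
    return {
        "documentId": document_id,
        "requiresChunking": False,
        "error": {
            "type": type(exc).__name__,  # e.g. "ValueError"
            "message": str(exc),
        },
    }
```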

The Lambda handles various error scenarios:

Non-PDF files - Rejected before processing if the file does not start with the PDF magic bytes (%PDF-)
Invalid PDF format - Returns error response if PyPDF2 cannot parse the file
Corrupted PDF files - Returns error response with details
S3 access denied - Returns error with specific message
Corrupted pages - Skips page, logs warning, continues with remaining pages
S3 write failures - Retries with exponential backoff (3 attempts)
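
The retry behavior for S3 writes could be sketched as a generic wrapper. The name retry_with_backoff, the one-second base delay, and the doubling factor are assumptions; only the three-attempt limit comes from the list above.

```python
import time


def retry_with_backoff(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `attempts` is exhausted.

    Waits base_delay, then 2 * base_delay, and so on between attempts.
    `sleep` is injectable so the delays can be skipped in tests.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
```

In the Lambda, the operation would wrap the chunk upload, for example `lambda: s3.put_object(Bucket=bucket, Key=key, Body=data)` with a boto3 S3 client.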

PDF Validation

Before attempting to process a file, the Lambda validates it's actually a PDF by checking the magic bytes:

Valid PDFs must start with %PDF- (hex: 25 50 44 46 2D)
Files without this signature are rejected immediately
This prevents wasting resources on non-PDF files (HTML, text, images, etc.)
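
The signature check itself is a prefix comparison on the raw bytes. In this sketch the names PDF_MAGIC and is_pdf are hypothetical.

```python
PDF_MAGIC = b"%PDF-"  # hex: 25 50 44 46 2D


def is_pdf(data):
    """Return True if the byte string starts with the PDF signature."""
    return data[:len(PDF_MAGIC)] == PDF_MAGIC
```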

Testing

Unit Tests

python test_handler.py

Integration Tests

Requires AWS credentials and test bucket:

export RUN_INTEGRATION_TESTS=true
export TEST_BUCKET=your-test-bucket-name
python test_integration.py

Test PDFs should be uploaded to:

s3://your-test-bucket-name/test-data/small-document.pdf
s3://your-test-bucket-name/test-data/large-document.pdf
s3://your-test-bucket-name/test-data/invalid.pdf

Performance

Token analysis: 2-5 seconds for 100-page PDF
Chunking: ~1 second per chunk
Memory: 2048 MB recommended
Timeout: 10 minutes recommended

IAM Permissions Required

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::bucket-name/raw/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::bucket-name/chunks/*"
    }
  ]
}

Environment Variables

CHUNKING_STRATEGY - Default strategy (default: 'hybrid')
PAGE_THRESHOLD - Page count threshold (default: 100)
TOKEN_THRESHOLD - Token count threshold (default: 150000)
CHUNK_SIZE - Pages per chunk for fixed-pages (default: 50)
OVERLAP_PAGES - Overlap pages for fixed-pages (default: 5)
MAX_TOKENS_PER_CHUNK - Max tokens for token-based (default: 100000)
OVERLAP_TOKENS - Overlap tokens (default: 5000)
TARGET_TOKENS_PER_CHUNK - Target tokens for hybrid (default: 80000)
MAX_PAGES_PER_CHUNK - Max pages for hybrid (default: 99, Bedrock limit is 100)
LOG_LEVEL - Logging level (default: 'INFO')
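
One way these variables could be combined with a per-document config override is sketched below. The function name load_config, the mapping from variable names to config keys, and the precedence order (event config over environment over built-in defaults) are assumptions; the defaults are the ones listed above.

```python
import os

# (environment variable, config key, default, type)
_SETTINGS = [
    ("CHUNKING_STRATEGY", "strategy", "hybrid", str),
    ("PAGE_THRESHOLD", "pageThreshold", 100, int),
    ("TOKEN_THRESHOLD", "tokenThreshold", 150000, int),
    ("CHUNK_SIZE", "chunkSize", 50, int),
    ("OVERLAP_PAGES", "overlapPages", 5, int),
    ("MAX_TOKENS_PER_CHUNK", "maxTokensPerChunk", 100000, int),
    ("OVERLAP_TOKENS", "overlapTokens", 5000, int),
    ("TARGET_TOKENS_PER_CHUNK", "targetTokensPerChunk", 80000, int),
    ("MAX_PAGES_PER_CHUNK", "maxPagesPerChunk", 99, int),
]


def load_config(event_config=None, environ=None):
    """Merge defaults, environment variables, and the event's config override."""
    environ = os.environ if environ is None else environ
    config = {}
    for env_name, key, default, cast in _SETTINGS:
        raw = environ.get(env_name)
        config[key] = cast(raw) if raw is not None else default
    config.update(event_config or {})  # per-document override wins
    return config
```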

Architecture Integration

This Lambda is invoked by Step Functions as the first step in the workflow (before Init Metadata). The SQS Consumer requires no changes; it triggers Step Functions as before.

The workflow structure:

SQS Consumer → Step Functions → PDF Analysis & Chunking Lambda → Init Metadata → ...

Token Estimation

Uses a word-based heuristic for fast estimation:

Count words using regex \b\w+\b
Apply 1.3 tokens per word multiplier
Accuracy: ~85-90% for English text
Speed: ~0.2 seconds per 100 pages

The heuristic can be replaced with tiktoken if production use needs tokenizer-based counts.
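
The heuristic above fits in a few lines. In this sketch the name estimate_tokens is hypothetical, and truncating the result to an integer is an assumption; token_estimation.py may round differently.

```python
import re

WORD_RE = re.compile(r"\b\w+\b")
TOKENS_PER_WORD = 1.3


def estimate_tokens(text):
    """Estimate the token count as 1.3 tokens per regex-matched word."""
    word_count = len(WORD_RE.findall(text))
    return int(word_count * TOKENS_PER_WORD)  # truncation is an assumption
```

Applied per page, this yields the per-page estimates that the token-based and hybrid strategies consume.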
