Lending Document Processing with Bedrock Data Automation and Knowledge Base
This walkthrough guides you through deploying and using the BDA Processor sample application, which demonstrates document processing for lending documents using Amazon Bedrock Data Automation with integrated Knowledge Base capabilities.
Overview
The BDA Lending sample showcases how to:
- Process lending documents like mortgage applications, loan agreements, and financial statements
- Extract structured data using Amazon Bedrock Data Automation
- Automatically ingest processed documents into a Knowledge Base for conversational queries
- Query processed documents using natural language through the Knowledge Base
- Track document processing status through a GraphQL API
- View and manage documents through a web interface
Architecture
The sample deploys the following components:
- S3 buckets for input documents and processing results
- Amazon Bedrock Data Automation project for document processing
- Amazon Bedrock Knowledge Base with vector search capabilities
- Automatic document ingestion pipeline for processed documents
- AWS Step Functions workflow for orchestration
- AWS Lambda functions for processing tasks and Knowledge Base ingestion
- Amazon DynamoDB tables for configuration and tracking
- Amazon AppSync GraphQL API for querying processing status and Knowledge Base
- Amazon CloudFront distribution for the web interface
Knowledge Base Integration
The solution includes a sophisticated Knowledge Base integration that:
- Automatically ingests processed documents from the output S3 bucket into a vector database
- Enables natural language queries over processed document content
- Provides conversational search capabilities through the web interface
- Maintains document traceability with source citations in query responses
- Uses Amazon Titan Embeddings for vector representation of document content
- Leverages Amazon Nova Pro for generating contextual responses to user queries
Prerequisites
Before you begin, ensure you have:
- AWS Account: With permissions to create the required resources
- AWS CLI: Configured with appropriate credentials
- Node.js: Version 18 or later (use NVM to install the version specified in .nvmrc)
- AWS CDK: Version 2.x installed globally
- Docker: For building Lambda functions
- Amazon Bedrock Access: Ensure your account has access to Amazon Bedrock and the required models:
- Amazon Titan Embed Text v2 (for document embeddings)
- Amazon Nova Pro v1 (for conversational queries)
- Models required by Bedrock Data Automation (varies by configuration)
Step 1: Clone the Repository
First, clone the GenAI IDP Accelerator repository and navigate to the sample directory:
git clone https://gitlab.aws.dev/genaiic-reusable-assets/engagement-artifacts/genaiic-idp-accelerator-cdk.git
cd genaiic-idp-accelerator-cdk/samples/sample-bda-lending
Step 2: Install Dependencies
Install the required dependencies for your chosen CDK language:
TypeScript
# Install the required Node.js version using NVM
nvm use
# Install dependencies
yarn install
Python
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
C# (.NET)
# Restore NuGet packages
dotnet restore
# Build the project
dotnet build
Step 3: Bootstrap Your AWS Environment
If you haven't already bootstrapped your AWS environment for CDK, run:
cdk bootstrap aws://ACCOUNT-NUMBER/REGION
Replace ACCOUNT-NUMBER with your AWS account number and REGION with your preferred AWS region.
Step 4: Review the Stack Configuration
The main stack is defined in src/bda-lending-stack.ts (TypeScript). Let's examine the key components:
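The sketch below is a condensed, illustrative view of what the TypeScript stack wires together: the input and output buckets, a vector Knowledge Base using Titan Text Embeddings v2, an S3 data source pointing at the output bucket, and the ingestion Lambda trigger. Construct IDs, the role and collection ARNs, and the asset path are placeholders, and the Bedrock Data Automation project, Step Functions workflow, DynamoDB tables, AppSync API, and CloudFront distribution are omitted; refer to src/bda-lending-stack.ts for the full definition.

import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as s3n from 'aws-cdk-lib/aws-s3-notifications';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as bedrock from 'aws-cdk-lib/aws-bedrock';
import { Construct } from 'constructs';

export class BdaLendingStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // S3 buckets for input documents and processing results
    const inputBucket = new s3.Bucket(this, 'InputBucket');
    const outputBucket = new s3.Bucket(this, 'OutputBucket');

    // Vector Knowledge Base using Amazon Titan Text Embeddings v2.
    // The role ARN, collection ARN, and index/field names are placeholders.
    const knowledgeBase = new bedrock.CfnKnowledgeBase(this, 'LendingKnowledgeBase', {
      name: 'lending-documents-kb',
      roleArn: '<knowledge-base-role-arn>',
      knowledgeBaseConfiguration: {
        type: 'VECTOR',
        vectorKnowledgeBaseConfiguration: {
          embeddingModelArn: `arn:aws:bedrock:${this.region}::foundation-model/amazon.titan-embed-text-v2:0`,
        },
      },
      storageConfiguration: {
        type: 'OPENSEARCH_SERVERLESS',
        opensearchServerlessConfiguration: {
          collectionArn: '<collection-arn>',
          vectorIndexName: 'lending-documents-index',
          fieldMapping: { vectorField: 'vector', textField: 'text', metadataField: 'metadata' },
        },
      },
    });

    // Data source pointing the Knowledge Base at the processed-output bucket
    const dataSource = new bedrock.CfnDataSource(this, 'OutputDataSource', {
      knowledgeBaseId: knowledgeBase.attrKnowledgeBaseId,
      name: 'processed-documents',
      dataSourceConfiguration: {
        type: 'S3',
        s3Configuration: { bucketArn: outputBucket.bucketArn },
      },
    });

    // Lambda that starts a Knowledge Base ingestion job whenever new
    // processed results land in the output bucket
    const ingestionFunction = new lambda.Function(this, 'KbIngestionFunction', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda/kb-ingestion'),
      environment: {
        KNOWLEDGE_BASE_ID: knowledgeBase.attrKnowledgeBaseId,
        DATA_SOURCE_ID: dataSource.attrDataSourceId,
      },
    });
    outputBucket.addEventNotification(
      s3.EventType.OBJECT_CREATED,
      new s3n.LambdaDestination(ingestionFunction),
    );
  }
}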
Knowledge Base Ingestion Lambda Function
The solution includes an automatic ingestion Lambda function that triggers whenever new processed documents are added to the output S3 bucket.
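The handler below is a minimal TypeScript sketch of that logic, assuming the Knowledge Base and data source IDs are supplied through environment variables (KNOWLEDGE_BASE_ID and DATA_SOURCE_ID are assumed names; the deployed function's actual code may differ):

import { BedrockAgentClient, StartIngestionJobCommand } from '@aws-sdk/client-bedrock-agent';
import type { S3Event, Context } from 'aws-lambda';

const client = new BedrockAgentClient({});

export const handler = async (event: S3Event, context: Context) => {
  // Knowledge Base and data source IDs set by the CDK stack (names assumed for this sketch)
  const knowledgeBaseId = process.env.KNOWLEDGE_BASE_ID!;
  const dataSourceId = process.env.DATA_SOURCE_ID!;

  console.log(`New objects in output bucket: ${event.Records.map((r) => r.s3.object.key).join(', ')}`);

  // Start an ingestion job to sync the new documents into the Knowledge Base.
  // The Lambda request ID is used as the client token so retries of the same
  // invocation do not start duplicate ingestion jobs (idempotency).
  const response = await client.send(
    new StartIngestionJobCommand({
      knowledgeBaseId,
      dataSourceId,
      clientToken: context.awsRequestId,
      description: 'Triggered by new processed documents in the output bucket',
    }),
  );

  // Return the ingestion job details for monitoring and logging
  return {
    ingestionJobId: response.ingestionJob?.ingestionJobId,
    status: response.ingestionJob?.status,
  };
};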
This Lambda function:
- Triggers automatically when new files are added to the output S3 bucket
- Starts an ingestion job to process the new documents into the Knowledge Base
- Uses the AWS request ID as a client token to ensure idempotency
- Returns the ingestion job details for monitoring and logging
Step 5: Deploy the Stack
Deploy the stack to your AWS account using the commands for your chosen language:
TypeScript
# Install dependencies
yarn install
# Build the project
yarn build
# Deploy the stack
cdk deploy
Python
# Install dependencies
pip install -r requirements.txt
# Deploy the stack
cdk deploy
C# (.NET)
# Build the project
dotnet build
# Deploy the stack
cdk deploy
The deployment will take several minutes. Once complete, the command will output the URL of the web application.
Step 6: Configure Bedrock Model Access
Ensure you have access to the required Bedrock models:
- Go to the Amazon Bedrock console
- Navigate to "Model access" in the left sidebar
- Request access to the models used by the Data Automation project
- Wait for access approval (usually immediate for most models)
Step 7: Configure the Bedrock Data Automation Project
After deployment, you need to configure the Bedrock Data Automation project:
- Go to the Amazon Bedrock console
- Navigate to "Data Automation" in the left sidebar
- Select the project created by the stack (it will have a name like "BdaLendingStack-LendingBda...")
- Configure the document processing workflow:
- Define document types (e.g., mortgage application, loan agreement)
- Define extraction schemas for each document type
- Configure processing options
Step 8: Test the Solution
Now you can test the solution by uploading a sample lending document and querying the Knowledge Base:
Document Processing Test
- Access the web application using the URL from the deployment output
- Sign in with the credentials provided during deployment
- Upload a sample lending document (e.g., a mortgage application)
- Monitor the processing status in the web interface
- Once processing is complete, view the extracted data
Alternatively, you can upload documents directly to the S3 input bucket:
aws s3 cp sample-mortgage-application.pdf s3://YOUR-INPUT-BUCKET-NAME/
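If you prefer to drive uploads from code (for example, from an existing system), a minimal TypeScript sketch using the AWS SDK for JavaScript is shown below; the bucket name, file path, and object key are placeholders:

import { readFile } from 'node:fs/promises';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({});

// Upload a lending document to the input bucket; replace the bucket name
// with the value from the stack outputs.
async function uploadDocument(bucket: string, filePath: string, key: string) {
  await s3.send(
    new PutObjectCommand({
      Bucket: bucket,
      Key: key,
      Body: await readFile(filePath),
      ContentType: 'application/pdf',
    }),
  );
  console.log(`Uploaded ${key}; processing starts automatically.`);
}

uploadDocument('YOUR-INPUT-BUCKET-NAME', './sample-mortgage-application.pdf', 'sample-mortgage-application.pdf')
  .catch(console.error);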
Knowledge Base Query Test
After documents have been processed and ingested into the Knowledge Base, you can test the conversational query capabilities:
- Navigate to the Knowledge Base section in the web application
- Ask natural language questions about your processed documents, such as:
- "What is the loan amount in the mortgage application?"
- "Who is the borrower in document XYZ?"
- "What are the key terms mentioned in the loan agreements?"
- "Summarize the financial information from all processed documents"
- Review the responses, which will include:
- Contextual answers generated by Amazon Nova Pro
- Source citations linking back to the original processed documents
- Confidence scores and relevance indicators
GraphQL API Testing
You can also query the Knowledge Base programmatically using the GraphQL API:
query QueryKnowledgeBase($input: QueryKnowledgeBaseInput!) {
  queryKnowledgeBase(input: $input) {
    response
    citations {
      retrievedReferences {
        content {
          text
        }
        location {
          s3Location {
            uri
          }
        }
      }
    }
    sessionId
  }
}
With variables:
{
  "input": {
    "query": "What is the loan amount mentioned in the documents?",
    "sessionId": "optional-session-id-for-conversation-continuity"
  }
}
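A query like this typically maps to Bedrock's RetrieveAndGenerate API on the resolver side. The following is a minimal TypeScript sketch of such a call, assuming the Knowledge Base ID comes from an environment variable and that Amazon Nova Pro is addressed by its foundation model ARN (some regions require a cross-region inference profile ARN instead); the sample's actual resolver may be structured differently:

import {
  BedrockAgentRuntimeClient,
  RetrieveAndGenerateCommand,
} from '@aws-sdk/client-bedrock-agent-runtime';

const client = new BedrockAgentRuntimeClient({});

// Query the Knowledge Base and generate a contextual answer with Nova Pro.
// KNOWLEDGE_BASE_ID and the model ARN below are assumptions for this sketch.
export async function queryKnowledgeBase(query: string, sessionId?: string) {
  const response = await client.send(
    new RetrieveAndGenerateCommand({
      input: { text: query },
      sessionId, // pass a previous sessionId for multi-turn conversations
      retrieveAndGenerateConfiguration: {
        type: 'KNOWLEDGE_BASE',
        knowledgeBaseConfiguration: {
          knowledgeBaseId: process.env.KNOWLEDGE_BASE_ID!,
          modelArn: 'arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-pro-v1:0',
        },
      },
    }),
  );

  return {
    response: response.output?.text,
    citations: response.citations, // includes retrievedReferences with S3 locations
    sessionId: response.sessionId,
  };
}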
Step 9: Monitor Processing
You can monitor the document processing and Knowledge Base operations in several ways:
Document Processing Monitoring
- Web Interface: View processing status and results
- AWS Step Functions Console: Monitor workflow executions
- CloudWatch Logs: View detailed processing logs
- GraphQL API: Query processing status programmatically
Knowledge Base Monitoring
- Amazon Bedrock Console:
- Monitor Knowledge Base ingestion jobs
- View data source synchronization status
- Check vector database metrics
- CloudWatch Metrics:
- Knowledge Base query latency
- Ingestion job success/failure rates
- Vector search performance metrics
- Lambda Function Logs:
- Monitor automatic ingestion triggers
- View ingestion job initiation logs
- Debug any ingestion failures
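You can also check ingestion job status programmatically. The following is a minimal TypeScript sketch using the Bedrock Agent SDK; the Knowledge Base and data source IDs are placeholders:

import {
  BedrockAgentClient,
  ListIngestionJobsCommand,
} from '@aws-sdk/client-bedrock-agent';

const client = new BedrockAgentClient({});

// List recent ingestion jobs for the Knowledge Base data source so you can
// confirm that automatic ingestion is keeping up with document processing.
async function checkIngestionJobs(knowledgeBaseId: string, dataSourceId: string) {
  const { ingestionJobSummaries } = await client.send(
    new ListIngestionJobsCommand({
      knowledgeBaseId,
      dataSourceId,
      maxResults: 10,
    }),
  );

  for (const job of ingestionJobSummaries ?? []) {
    console.log(`${job.ingestionJobId}: ${job.status} (started ${job.startedAt})`);
  }
}

checkIngestionJobs('YOUR-KB-ID', 'YOUR-DATA-SOURCE-ID').catch(console.error);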
Key Metrics to Monitor
- Document Processing Success Rate: Percentage of documents successfully processed
- Knowledge Base Ingestion Latency: Time from document processing to Knowledge Base availability
- Query Response Time: Time taken to respond to Knowledge Base queries
- Vector Search Accuracy: Relevance of retrieved document chunks
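If you want alerting on metrics like these, one option is to extend the stack with CloudWatch alarms. The construct below is an illustrative CDK sketch (not part of the sample) that alarms on failed Step Functions executions as a proxy for processing success rate; the state machine reference and thresholds are placeholders:

import * as cdk from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import { Construct } from 'constructs';

// Illustrative construct: attach an alarm to the processing state machine
// created by the stack (adapt the reference to the actual resource).
export class ProcessingAlarms extends Construct {
  constructor(scope: Construct, id: string, stateMachine: sfn.IStateMachine) {
    super(scope, id);

    // Alarm when any document processing execution fails in a 5-minute window
    new cloudwatch.Alarm(this, 'ProcessingFailureAlarm', {
      metric: stateMachine.metricFailed({ period: cdk.Duration.minutes(5) }),
      threshold: 1,
      evaluationPeriods: 1,
      treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
      alarmDescription: 'One or more lending document processing executions failed',
    });
  }
}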
Step 10: Clean Up
When you're done experimenting, you can clean up all resources to avoid incurring charges:
cdk destroy
Next Steps
Now that you have a working BDA Processor solution with Knowledge Base integration, consider:
Document Processing Enhancements
- Customizing the extraction schemas for your specific document types
- Adding custom post-processing logic for the extracted data
- Integrating the solution with your existing systems
- Enhancing the web interface for your specific requirements
Knowledge Base Optimizations
- Fine-tuning chunking strategies for better document segmentation
- Implementing custom metadata for enhanced search filtering
- Adding web crawling capabilities to ingest external reference documentation
- Configuring advanced retrieval settings for improved query accuracy
Advanced Features
- Multi-turn conversations using session IDs for context continuity
- Custom prompt engineering for domain-specific query responses
- Integration with external systems via the GraphQL API
- Automated document classification before processing
Production Considerations
- Implementing proper security controls and access management
- Setting up monitoring and alerting for production workloads
- Configuring backup and disaster recovery for the Knowledge Base
- Optimizing costs through appropriate model selection and usage patterns
For more information, refer to:
- Amazon Bedrock Data Automation documentation
- Amazon Bedrock Knowledge Bases documentation
- GenAI IDP Accelerator documentation