Don't Let Your LLM Wing It: Building a Knowledge Base That Actually Knows Things

A developer built a retrieval-augmented generation (RAG) knowledge base using Amazon Bedrock, Aurora PostgreSQL with pgvector, and Terraform, all synced automatically from a Git repository via GitHub Actions. The system uses hierarchical chunking and Titan Embed Text v2 for embeddings, with separate environments for acc and prod. The infrastructure includes KMS-encrypted S3 buckets, a dedicated Aurora cluster, and IAM roles for secure access.

Every team eventually asks the same question: "Can we make the LLM answer questions about our docs instead of hallucinating something plausible?" The answer is retrieval-augmented generation (RAG), and the unglamorous truth is that RAG is 10% prompt engineering and 90% plumbing: object storage, a vector index, an embedding model, and a pipeline that keeps all three in sync whenever someone edits a markdown file.

This is the plumbing: a Bedrock Knowledge Base backed by Aurora PostgreSQL with `pgvector`, provisioned entirely in Terraform, synced automatically from a Git repo via GitHub Actions. No notebooks, no manual "let me re-upload the docs" Tuesdays.

```mermaid
flowchart TD
    A["Git repository: knowledge-base/*.md"] -->|"push to main or workflow_dispatch"| B["GitHub Actions: sync-knowledge-base.yml"]
    B -->|uploads docs| C["S3 bucket, KMS encrypted"]
    C -->|start ingestion| D{"Bedrock ingestion job, hierarchical chunking"}
    D -->|"Titan Embed Text v2"| E["Aurora PostgreSQL with pgvector, HNSW + GIN indexes"]
    E -->|"Retrieve / RetrieveAndGenerate"| F["App / LLM proxy, scoped IAM role"]
```

Two environments, `acc` and `prod`, run this same pipeline side by side, each with its own bucket, KB, and database, gated by branch and dispatch logic in the workflow.
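For orientation, the last hop in the diagram is the only part application code ever touches. The sketch below shows roughly what a `Retrieve` call could look like from Python with `boto3`. The helper names, the `top_k` default, and the idea of returning plain chunk text are illustrative assumptions, not part of the original setup.

```python
from typing import Any


def build_retrieve_request(kb_id: str, question: str, top_k: int = 5) -> dict[str, Any]:
    """Build the keyword arguments for a bedrock-agent-runtime Retrieve call."""
    if not question.strip():
        raise ValueError("question must not be empty")
    return {
        "knowledgeBaseId": kb_id,
        "retrievalQuery": {"text": question},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    }


def retrieve_chunks(kb_id: str, question: str, top_k: int = 5) -> list[str]:
    """Call Retrieve and return the text of each matching chunk (hypothetical helper)."""
    import boto3  # imported lazily so the request builder works without AWS dependencies

    client = boto3.client("bedrock-agent-runtime")
    response = client.retrieve(**build_retrieve_request(kb_id, question, top_k))
    return [result["content"]["text"] for result in response["retrievalResults"]]
```

Keeping the request builder separate from the network call makes the request shape easy to unit-test without AWS credentials.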
Bedrock pulls documents from S3, not the other way around, so the bucket is just a private, versioned, encrypted drop zone:

```hcl
resource "aws_s3_bucket" "kb_docs" {
  bucket = "${local.cluster_name}-knowledge-base"
  tags   = local.default_tags
}

resource "aws_s3_bucket_versioning" "kb_docs" {
  bucket = aws_s3_bucket.kb_docs.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "kb_docs" {
  bucket = aws_s3_bucket.kb_docs.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "kb_docs" {
  bucket                  = aws_s3_bucket.kb_docs.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

This is a separate Aurora cluster from any application database: different lifecycle, different access pattern, and you don't want a runaway ingestion job competing for connections with your app.

```hcl
module "kb_db" {
  source  = "terraform-aws-modules/rds-aurora/aws"
  version = "~> 9.0"

  name           = "bedrock-kb"
  engine         = "aurora-postgresql"
  engine_version = "16.6"
  instances      = { 1 = {} }

  serverlessv2_scaling_configuration = {
    min_capacity = 0.5
    max_capacity = 4
  }

  database_name               = "bedrock_kb"
  master_username             = "root"
  manage_master_user_password = true
  enable_http_endpoint        = true

  tags = local.default_tags
}
```

Bedrock's Knowledge Base service expects the database credentials in Secrets Manager as JSON of the shape `{"username": ..., "password": ...}`.
If your Terraform module stores a bare password string (most do), mirror it into a second secret in the right shape rather than fighting the module:

```hcl
resource "aws_secretsmanager_secret" "kb_db_bedrock" {
  name                    = "bedrock-kb-db-credentials"
  recovery_window_in_days = 0
  tags                    = local.default_tags
}

resource "aws_secretsmanager_secret_version" "kb_db_bedrock" {
  secret_id = aws_secretsmanager_secret.kb_db_bedrock.id
  secret_string = jsonencode({
    username = "root"
    password = module.kb_db.cluster_master_password
  })
}
```

Bedrock's Knowledge Base needs to assume a role that can read from S3, describe and execute statements against Aurora via the Data API, read the credentials secret, and invoke the embedding model:

```hcl
data "aws_iam_policy_document" "bedrock_kb_assume_role" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["bedrock.amazonaws.com"]
    }
    condition {
      test     = "StringEquals"
      variable = "aws:SourceAccount"
      values   = [local.aws_account_id]
    }
  }
}

data "aws_iam_policy_document" "bedrock_kb" {
  statement {
    sid       = "S3Read"
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:ListBucket"]
    resources = [aws_s3_bucket.kb_docs.arn, "${aws_s3_bucket.kb_docs.arn}/*"]
  }
  statement {
    sid       = "RDSDataApi"
    effect    = "Allow"
    actions   = ["rds-data:BatchExecuteStatement", "rds-data:ExecuteStatement"]
    resources = [module.kb_db.cluster_arn]
  }
  statement {
    sid       = "SecretsManagerRead"
    effect    = "Allow"
    actions   = ["secretsmanager:GetSecretValue"]
    resources = [aws_secretsmanager_secret.kb_db_bedrock.arn]
  }
  statement {
    sid       = "BedrockEmbeddings"
    effect    = "Allow"
    actions   = ["bedrock:InvokeModel"]
    resources = ["arn:aws:bedrock:eu-west-1::foundation-model/amazon.titan-embed-text-v2:0"]
  }
}

resource "aws_iam_role" "bedrock_kb" {
  name               = "${local.cluster_name}-bedrock-kb"
  assume_role_policy = data.aws_iam_policy_document.bedrock_kb_assume_role.json
}

resource "aws_iam_role_policy" "bedrock_kb" {
  role   = aws_iam_role.bedrock_kb.id
  policy = data.aws_iam_policy_document.bedrock_kb.json
}
```

With the role in place, the Knowledge Base and its S3 data source look like this:

```hcl
resource "aws_bedrockagent_knowledge_base" "main" {
  name     = "${local.cluster_name}-knowledge-base"
  role_arn = aws_iam_role.bedrock_kb.arn

  knowledge_base_configuration {
    type = "VECTOR"
    vector_knowledge_base_configuration {
      embedding_model_arn = "arn:aws:bedrock:eu-west-1::foundation-model/amazon.titan-embed-text-v2:0"
    }
  }

  storage_configuration {
    type = "RDS"
    rds_configuration {
      resource_arn           = module.kb_db.cluster_arn
      credentials_secret_arn = aws_secretsmanager_secret.kb_db_bedrock.arn
      database_name          = "bedrock_kb"
      table_name             = "bedrock_integration.bedrock_kb"
      field_mapping {
        primary_key_field = "id"
        vector_field      = "embedding"
        text_field        = "chunks"
        metadata_field    = "metadata"
      }
    }
  }

  depends_on = [module.kb_db]
}

resource "aws_bedrockagent_data_source" "s3" {
  knowledge_base_id = aws_bedrockagent_knowledge_base.main.id
  name              = "${local.cluster_name}-knowledge-base-s3"

  data_source_configuration {
    type = "S3"
    s3_configuration {
      bucket_arn = aws_s3_bucket.kb_docs.arn
    }
  }

  vector_ingestion_configuration {
    chunking_configuration {
      chunking_strategy = "HIERARCHICAL"
      hierarchical_chunking_configuration {
        level_configuration {
          max_tokens = 1500
        }
        level_configuration {
          max_tokens = 300
        }
        overlap_tokens = 60
      }
    }
  }
}
```

Hierarchical chunking is worth calling out: it splits documents into large 1500-token "parent" chunks and smaller 300-token "child" chunks with a 60-token overlap. Retrieval matches on the precise child chunk but can return the broader parent context, which gives better recall on long documents than flat fixed-size chunking, at the cost of slightly more complex ingestion.

`aws_bedrockagent_knowledge_base` expects the target table to already exist with the right columns and indexes. Terraform won't create the `pgvector` extension or table for you.
Creating the extension, schema, table, and indexes is a one-time `psql` job against the Aurora endpoint:

```sql
CREATE EXTENSION IF NOT EXISTS vector;
CREATE SCHEMA IF NOT EXISTS bedrock_integration;

CREATE TABLE IF NOT EXISTS bedrock_integration.bedrock_kb (
  id        UUID PRIMARY KEY,
  embedding vector(1024),
  chunks    TEXT,
  metadata  JSON
);

CREATE INDEX IF NOT EXISTS bedrock_kb_embedding_idx
  ON bedrock_integration.bedrock_kb
  USING hnsw (embedding vector_cosine_ops);

CREATE INDEX IF NOT EXISTS bedrock_kb_chunks_idx
  ON bedrock_integration.bedrock_kb
  USING gin (to_tsvector('simple', chunks));
```

The HNSW index handles approximate nearest-neighbor search on the embedding vector; the GIN index on `chunks` enables hybrid search (keyword + semantic) if you ever want it. `vector(1024)` matches Titan Embed Text v2's output dimensionality. If you swap embedding models later, the column width has to match or ingestion fails outright.

The Knowledge Base ID and bucket name are then published to SSM Parameter Store:

```hcl
resource "aws_ssm_parameter" "kb_id" {
  name  = "/${local.env}/knowledge-base/kb_id"
  type  = "String"
  value = aws_bedrockagent_knowledge_base.main.id
}

resource "aws_ssm_parameter" "kb_bucket" {
  name  = "/${local.env}/knowledge-base/bucket"
  type  = "String"
  value = aws_s3_bucket.kb_docs.bucket
}
```

Apps and CI pipelines read these instead of hardcoding ARNs, so when you rebuild the KB in a new account or region, nothing downstream needs a code change.

Provisioning is the easy part. The actual day-to-day value comes from never having to think about ingestion again. Drop a markdown file in a `knowledge-base/` folder, push, and it's searchable within minutes.
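Because a mismatch between the table's vector column and the embedding model only surfaces at ingestion time, it can help to check the DDL mechanically before running it. The following is a minimal sketch using only the standard library. The function names and the small lookup table are hypothetical; the 1024 figure for Titan Embed Text v2 is the one this article relies on, and other models would need their own entries.

```python
import re

# Assumption: output sizes keyed by model ID; only the model used in this article is listed.
EMBEDDING_DIMENSIONS = {"amazon.titan-embed-text-v2:0": 1024}

# Matches a pgvector column type such as vector(1024); ignores "vector_cosine_ops".
VECTOR_COLUMN = re.compile(r"\bvector\s*\(\s*(\d+)\s*\)", re.IGNORECASE)


def declared_dimensions(ddl: str) -> int:
    """Return N from the first vector(N) column type found in the DDL."""
    match = VECTOR_COLUMN.search(ddl)
    if match is None:
        raise ValueError("no vector(N) column found in DDL")
    return int(match.group(1))


def check_dimensions(ddl: str, model_id: str) -> None:
    """Raise ValueError if the table's vector width differs from the model's output size."""
    expected = EMBEDDING_DIMENSIONS[model_id]
    actual = declared_dimensions(ddl)
    if actual != expected:
        raise ValueError(
            f"{model_id} emits {expected}-dimensional vectors, "
            f"but the table declares vector({actual})"
        )
```

Running a check like this in CI before applying the DDL turns a per-document ingestion failure into an immediate, readable error.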
```yaml
name: Sync Knowledge Base

on:
  workflow_dispatch:
    inputs:
      environment:
        type: choice
        description: Environment to sync knowledge base
        options:
          - acc
          - prod
        required: true
  push:
    branches:
      - main
    paths:
      - 'knowledge-base/**'

jobs:
  sync-kb-acc:
    if: (github.event_name == 'workflow_dispatch' && github.event.inputs.environment == 'acc') || github.event_name == 'push'
    secrets: inherit
    uses: your-org/your-pipelines-repo/.github/workflows/sync-bedrock-knowledge-base.yml@v2
    with:
      environment: acc
      knowledge-base-id: ${{ vars.BEDROCK_KB_ID_ACC }}
      data-source-id: ${{ vars.BEDROCK_KB_DATA_SOURCE_ID_ACC }}
      bucket-name: ${{ vars.BEDROCK_KB_BUCKET_ACC }}
      bucket-prefix: knowledge-base
      source-dir: knowledge-base

  sync-kb-prod:
    if: github.event_name == 'workflow_dispatch' && github.event.inputs.environment == 'prod'
    secrets: inherit
    uses: your-org/your-pipelines-repo/.github/workflows/sync-bedrock-knowledge-base.yml@v2
    with:
      environment: prod
      knowledge-base-id: ${{ vars.BEDROCK_KB_ID_PROD }}
      data-source-id: ${{ vars.BEDROCK_KB_DATA_SOURCE_ID_PROD }}
      bucket-name: ${{ vars.BEDROCK_KB_BUCKET_PROD }}
      bucket-prefix: knowledge-base
      source-dir: knowledge-base
```

The gating logic is the whole trick here:

- `push` to `main`, path-filtered to `knowledge-base/`: only fires the acc job. Routine doc edits land in the acc environment automatically, with zero manual steps.
- `workflow_dispatch` with `environment: acc`: a manual re-sync of acc through the same job.
- `workflow_dispatch` with `environment: prod`: the only path that fires the prod job.

Both jobs delegate to the same reusable workflow (`sync-bedrock-knowledge-base.yml@v2`), parameterized per environment. The reusable workflow does the actual work: sync the `source-dir` to the S3 `bucket-name` under `bucket-prefix`, then call `StartIngestionJob` against `data-source-id`. Centralizing that logic in one reusable workflow means every team adopting this pattern gets the same sync behavior, and a fix to the sync logic ships everywhere at once instead of needing fifteen copy-pasted workflow files updated individually.
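The reusable workflow's contents are not shown here, so the following is only a sketch of the two ideas it relies on: which job a given event fires, and what "sync, then ingest" could look like with `boto3`. All function names are hypothetical. The upload loop is a simplification that never deletes objects removed from the repo, which a real sync step would probably need to handle.

```python
from pathlib import Path
from typing import Optional


def jobs_to_run(event_name: str, environment: Optional[str] = None) -> list[str]:
    """Mirror the two `if:` conditions on sync-kb-acc and sync-kb-prod."""
    jobs = []
    if (event_name == "workflow_dispatch" and environment == "acc") or event_name == "push":
        jobs.append("sync-kb-acc")
    if event_name == "workflow_dispatch" and environment == "prod":
        jobs.append("sync-kb-prod")
    return jobs


def s3_keys(source_dir: Path, bucket_prefix: str) -> dict[Path, str]:
    """Map every file under source_dir to its object key under bucket_prefix."""
    return {
        path: f"{bucket_prefix}/{path.relative_to(source_dir).as_posix()}"
        for path in sorted(source_dir.rglob("*"))
        if path.is_file()
    }


def sync_and_ingest(source_dir: Path, bucket: str, bucket_prefix: str,
                    kb_id: str, data_source_id: str) -> str:
    """Upload the docs, start an ingestion job, and return the job ID."""
    import boto3  # imported lazily so the pure helpers above run without AWS dependencies

    s3 = boto3.client("s3")
    for path, key in s3_keys(source_dir, bucket_prefix).items():
        s3.upload_file(str(path), bucket, key)

    agent = boto3.client("bedrock-agent")
    job = agent.start_ingestion_job(knowledgeBaseId=kb_id, dataSourceId=data_source_id)
    return job["ingestionJob"]["ingestionJobId"]
```

For example, `jobs_to_run("push")` yields only the acc job, while the prod job appears only for `jobs_to_run("workflow_dispatch", "prod")`.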
A few things that bite in practice:

- Forgetting `enable_http_endpoint = true` on the Aurora cluster. Without it, `rds-data:ExecuteStatement` calls fail with a confusing connectivity error that has nothing to do with security groups, and you'll waste an hour checking VPC routing before you find this.
- If `vector(1024)` doesn't match your embedding model's output size, the table creation succeeds, the Knowledge Base resource creates fine, and ingestion just fails per-document. Check the embedding model's dimensionality before writing the DDL, not after.
- No `push`-triggered syncs touch prod. The prod job only runs from a manual `workflow_dispatch`.
- The `acc`/`prod` split needs two of everything: each environment gets its own bucket, Knowledge Base, database, and set of workflow variables.

None of this is exotic: it's a bucket, a Postgres extension, two indexes, and a YAML file with an `if:` condition. That's the whole trick to "production RAG": treat the knowledge base like any other piece of infrastructure, version it, gate promotion to prod behind a manual step, and let the boring CI pipeline do the boring sync work. Your LLM stops winging it, and you stop being the person who manually re-uploads PDFs every time someone asks why the bot doesn't know about last week's runbook update.