Data Ingestion: RSS Feeds, Knowledge Base, S3 Vectors, and Metadata Filtering

This post describes the architecture of the AWS Briefing Agent, a personalized assistant built on Amazon Bedrock that uses Retrieval-Augmented Generation (RAG) to answer natural language queries about AWS releases. The system ingests RSS feed data every six hours via a Lambda function triggered by EventBridge Scheduler, storing documents as vector embeddings in Amazon S3 Vectors for cost-effective retrieval. To avoid redundant processing, the pipeline uses truncated MD5 hashes of blog post URLs as unique S3 filenames to identify and skip previously ingested content.

This is the second in a series of posts documenting the architecture, implementation, and lessons learned from building the AWS Briefing Agent - a personalised AWS assistant deployed on Amazon Bedrock AgentCore Runtime.

- Part 1: Building a Full-Stack AI Agent on Bedrock AgentCore https://dev.to/aws-heroes/building-a-full-stack-ai-agent-on-amazon-bedrock-agentcore-2p
- Part 2: Data Ingestion: RSS Feeds, Knowledge Base, S3 Vectors, and Metadata Filtering
- Part 3: Strands Agents + AgentCore Runtime - a perfect match https://dev.to/aws-heroes/strands-agents-agentcore-runtime-a-perfect-match-3a51
- Part 4: Adding Memory to the Agent
- Part 5: Experimenting with API Gateway
- Part 6: Observability and Evaluations
- Part 7: Third Party Integrations - Identity, Gateway and Slack Notifications

When I started building the AWS Briefing Agent, the first version queried the AWS What's New RSS feed on every invocation. This was enough to show that the agent could return tailored information to the client. However, it was costly and wasteful: the same data was fetched repeatedly, which added latency to every invocation. The RSS feed also only covers recent information, and it was likely we would want to search for releases launched 6 months ago or earlier. The next step, therefore, was to separate the agent's retrieval from the ingestion.

Amazon Bedrock Knowledge Base

One of the key design goals was to allow the agent to match a natural language query such as "what's new in Bedrock this week?" against a large corpus of documents and return the most semantically similar results. This is where Amazon Bedrock Knowledge Base comes into its own. It allows the agent to use Retrieval-Augmented Generation (RAG). By querying the Knowledge Base, we can retrieve relevant documents at query time and then inject them into the prompt as context. The LLM then generates a response from this retrieved information, which we know to be factual.
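
To make the "retrieve, then inject as context" step concrete, here is a minimal sketch of the prompt assembly. It is not the agent's actual code: the function name `build_rag_prompt` and the prompt wording are hypothetical, and the result shape (a list of items with a `content.text` field) is an assumption based on how the Bedrock Retrieve API returns `retrievalResults`. The sample chunks are hand-written stand-ins, not real Knowledge Base output.

```python
from typing import Any


def build_rag_prompt(question: str, retrieval_results: list[dict[str, Any]]) -> str:
    """Inject retrieved chunks into a prompt as numbered context passages."""
    passages = [
        f"[{i}] {result['content']['text'].strip()}"
        for i, result in enumerate(retrieval_results, start=1)
    ]
    # Fall back to an explicit marker so the model is told nothing was found.
    context = "\n".join(passages) if passages else "(no documents retrieved)"
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hand-written stand-ins for retrieved chunks (assumed shape, illustrative text).
sample_results = [
    {"content": {"text": "Example announcement about Amazon Bedrock."}, "score": 0.82},
    {"content": {"text": "Example announcement about AWS Lambda."}, "score": 0.41},
]
prompt = build_rag_prompt("What's new in Bedrock this week?", sample_results)
print(prompt)
```

Numbering the passages is a design choice rather than a requirement; it gives the model something to cite when it attributes a statement to a source.
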

The Python CDK code that creates the Knowledge Base is shown below:

```python
from aws_cdk import aws_bedrock as bedrock

# Excerpt from inside the CDK stack's __init__ method.
knowledge_base = bedrock.CfnKnowledgeBase(
    self,
    "AnnouncementKnowledgeBase",
    name="aws-briefing-agent-announcements",
    # ... (other properties elided)
    knowledge_base_configuration=bedrock.CfnKnowledgeBase.KnowledgeBaseConfigurationProperty(
        type="VECTOR",
        vector_knowledge_base_configuration=bedrock.CfnKnowledgeBase.VectorKnowledgeBaseConfigurationProperty(
            embedding_model_arn=f"arn:aws:bedrock:{self.region}::foundation-model/amazon.titan-embed-text-v2:0",
        ),
    ),
    storage_configuration=bedrock.CfnKnowledgeBase.StorageConfigurationProperty(
        type="S3_VECTORS",
        s3_vectors_configuration=bedrock.CfnKnowledgeBase.S3VectorsConfigurationProperty(
            index_name="announcements",
            vector_bucket_arn=f"arn:aws:s3vectors:{self.region}:{self.account}:bucket/briefing-agent-vectors",
        ),
    ),
)
```

This declares the embeddings model as `amazon.titan-embed-text-v2:0` and the vector store as type `S3_VECTORS`. No code is required to generate or store the embeddings; Bedrock manages all of this for us.

Amazon S3 Vectors

Amazon Bedrock Knowledge Bases support several vector stores. A vector store is the retrieval engine that makes RAG work. It stores documents as numerical embeddings (vectors) that are generated by an embeddings model. At query time, the user's question is embedded, and the vector store finds documents whose embeddings are closest in meaning.

The prototype uses Amazon S3 Vectors as the underlying vector store. S3 Vectors provides cost-effective, elastic, and durable vector storage at up to 90% lower cost for uploading, storing, and querying vectors than alternatives such as OpenSearch Serverless. There is no infrastructure to manage, and it still provides sub-second query latency, which is acceptable for this use case.

Scheduling the Ingestion

The ingestion pipeline is run every 6 hours using Amazon EventBridge Scheduler.
This service provides capabilities such as built-in retry policies, time zone support, and dead-letter queues. The schedule triggers an AWS Lambda function that carries out the required processing. On each run, the function:

- Lists existing document hashes in S3
- Fetches the AWS What's New RSS feed (~100 announcements)
- Fetches 13 AWS blog RSS feeds (aws, machine-learning, compute, security, database, containers, devops, networking, storage, infrastructure-and-automation, developer, big-data, iot)
- Fetches the AWS Security Bulletins RSS feed
- For each new blog post, fetches the canonical URL and extracts the full article body using an HTML parser from the Python standard library
- Parses publication dates into YYYYMMDD integers
- Writes `.txt` and `.metadata.json` files per new item to S3
- Triggers a Bedrock KB ingestion job

Deduplication and Incremental Writes

When the ingestion pipeline runs, most of the content in the various RSS feeds is not new. It was important to avoid re-fetching and re-writing hundreds of announcements every 6 hours. To support this, we create an MD5 hash of each blog post's URL, truncated to 12 hex characters. This hash is used as the S3 filename. A sample code snippet is shown below:

```python
import hashlib


def write_to_s3(items, existing_keys=None):
    existing = existing_keys or set()
    for item in items:
        # First 12 hex characters of the MD5 digest of the post URL.
        url_hash = hashlib.md5(item["link"].encode()).hexdigest()[:12]
        if url_hash in existing:
            continue  # Already in S3, skip
        # ... write doc + metadata files
```

At startup, `get_existing_keys()` lists all the `.txt` files in S3 and extracts the hash from each filename into a set. When processing the blog posts, the Lambda function computes the URL hash and checks whether it is already in the set. If it is, the post was ingested in a previous run, and there is no need to re-fetch the page. If it is not, the function fetches the page, extracts the content, and writes it to S3. The hash gives a stable, deterministic filename derived from the URL.
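
The check described above can be sketched without any AWS calls. The helper names (`url_hash`, `hashes_from_keys`, `new_items`), the `docs/` key prefix, and the example URLs are all hypothetical and chosen only for illustration; the real pipeline lists the keys from S3 rather than from a hard-coded list.

```python
import hashlib


def url_hash(url: str) -> str:
    """Return the first 12 hex characters of the MD5 digest of a URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:12]


def hashes_from_keys(keys: list[str]) -> set[str]:
    """Extract the hash from object keys such as 'docs/0123456789ab.txt'."""
    return {
        key.rsplit("/", 1)[-1][: -len(".txt")]
        for key in keys
        if key.endswith(".txt")  # ignores the .metadata.json sidecars
    }


def new_items(items: list[dict], existing: set[str]) -> list[dict]:
    """Keep only the feed items whose URL hash has not been stored yet."""
    return [item for item in items if url_hash(item["link"]) not in existing]


# Hypothetical feed items and bucket listing, for illustration only.
feed = [
    {"link": "https://example.com/blog/post-a"},
    {"link": "https://example.com/blog/post-b"},
]
stored_keys = [
    f"docs/{url_hash('https://example.com/blog/post-a')}.txt",
    f"docs/{url_hash('https://example.com/blog/post-a')}.metadata.json",
]

existing = hashes_from_keys(stored_keys)
print([item["link"] for item in new_items(feed, existing)])
# Only post-b remains, because post-a's hash is already among the stored keys.
```

Twelve hex characters keep 48 bits of the digest, which leaves collisions very unlikely for a corpus of a few thousand URLs while keeping filenames short.
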
The same URL always produces the same hash.

Chunking Strategy

The chunking strategy is set on the Data Source resource in the CDK stack as shown below:

```python
data_source = bedrock.CfnDataSource(
    self,
    "AnnouncementDataSource",
    name="aws-announcements-s3",
    knowledge_base_id=knowledge_base.attr_knowledge_base_id,
    data_source_configuration=bedrock.CfnDataSource.DataSourceConfigurationProperty(
        type="S3",
        s3_configuration=bedrock.CfnDataSource.S3DataSourceConfigurationProperty(
            bucket_arn=data_bucket.bucket_arn,
        ),
    ),
    vector_ingestion_configuration=bedrock.CfnDataSource.VectorIngestionConfigurationProperty(
        chunking_configuration=bedrock.CfnDataSource.ChunkingConfigurationProperty(
            chunking_strategy="SEMANTIC",
            semantic_chunking_configuration=bedrock.CfnDataSource.SemanticChunkingConfigurationProperty(
                breakpoint_percentile_threshold=92,
                buffer_size=1,
                max_tokens=600,
            ),
        ),
    ),
)
```

We utilise a `SEMANTIC` chunking strategy. This uses the embedding model itself to decide where to split. The following three parameters control this behaviour:

- `breakpoint_percentile_threshold=92` - controls the percentile threshold that will result in a split. A higher threshold requires sentences to be more distinguishable before the document is split into different chunks.
- `max_tokens=600` - the maximum number of tokens that should be included in a single chunk, while honoring sentence boundaries.
- `buffer_size=1` - for a given sentence, the buffer size defines the number of surrounding sentences to be added when creating embeddings. A larger buffer size might capture more context but can also introduce noise, while a smaller buffer size might miss important context but ensures more precise chunking.

Filtering by Date

One of the goals in writing the agent was that a user could ask to constrain information by how recent it is, e.g. "what is new in the past 7 days?".
To help achieve this, at ingestion time we create an associated `.metadata.json` sidecar file for each document. The sidecar attaches structured, filterable attributes to the document so the agent can narrow search results without relying only on semantic similarity. An example companion file is shown below:

```json
{
  "metadataAttributes": {
    "published_date": 20260415,
    "service": "amazon-bedrock",
    "category": "artificial-intelligence",
    "source_type": "announcement"
  }
}
```

During the Knowledge Base sync, Bedrock reads this sidecar and attaches those attributes to every vector chunk generated from that document. At query time, the agent can combine semantic search with metadata filters:

- "What's new in Bedrock this week?" → vector similarity for "Bedrock" + `greaterThanOrEquals` filter on `published_date`
- "Show me security bulletins" → vector similarity + `equals` filter on `source_type: "security-bulletin"`
- "Lambda announcements from the last month" → vector similarity + filters on both `service` and `published_date`

Without the metadata file, the agent would get the most semantically similar results regardless of date or service, so a question about "this week" might return announcements from 3 months ago that happen to be textually similar. The metadata filters let the agent constrain results to the correct time window or service before ranking by relevance.

The `.metadata.json` suffix is a Bedrock KB naming convention: Bedrock automatically associates the sidecar with its parent document during ingestion. No code links them; the filename pattern is enough.

Bedrock Knowledge Base metadata supports four types: `STRING`, `NUMBER`, `BOOLEAN` and `STRING_LIST`. There is no native date type. The comparison operators `greaterThan`, `greaterThanOrEquals`, `lessThan` and `lessThanOrEquals` only work with `NUMBER`. Our original implementation stored `published_date` as a string (`"2026-05-14"`).
When the agent tried to filter, we got back the following exception:

```
ValidationException: The filter value type provided isn't supported for the given operation: GREATER_THAN_OR_EQUALS
```

The fix was to store dates as YYYYMMDD numbers, so using `20260514` instead of `"2026-05-14"`. We also inject today's date into the system prompt at runtime so the LLM can easily calculate relative dates.

Note that Amazon S3 Vectors has a strict 2 KB limit on filterable metadata per vector. We found that the Bedrock Knowledge Base internal metadata keys `AMAZON_BEDROCK_TEXT` and `AMAZON_BEDROCK_METADATA` were set as filterable by default, which caused frequent `ValidationException` errors. The fix was to mark both of these keys as non-filterable when creating the vector index:

```python
vector_index = s3vectors.CfnIndex(
    self,
    "AnnouncementVectorIndex",
    index_name="announcements",
    vector_bucket_name=vector_bucket.vector_bucket_name,
    dimension=1024,  # Titan Embed Text v2
    distance_metric="cosine",
    data_type="float32",
    metadata_configuration=s3vectors.CfnIndex.MetadataConfigurationProperty(
        non_filterable_metadata_keys=[
            "AMAZON_BEDROCK_TEXT",
            "AMAZON_BEDROCK_METADATA",
        ],
    ),
)
```

This means the only filterable metadata is the set of fields in the `.metadata.json` files, which are the only fields we filter on.

The next post covers how we used an agentic framework (Strands Agents SDK) in combination with AgentCore to really start bringing the briefing agent to life.
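
As a closing illustration of the date handling described in this post, the sketch below encodes a date as a YYYYMMDD integer and builds a metadata filter for "the past N days". The function names `to_yyyymmdd` and `recent_filter` are hypothetical, and the filter shape (`greaterThanOrEquals`, `equals`, `andAll` with `key`/`value` pairs) is my understanding of the Knowledge Base retrieval filter syntax rather than code taken from the agent. The metadata keys come from the sidecar example above.

```python
from datetime import date, timedelta
from typing import Optional


def to_yyyymmdd(day: date) -> int:
    """Encode a date as a YYYYMMDD integer, e.g. 2026-05-14 -> 20260514."""
    return day.year * 10000 + day.month * 100 + day.day


def recent_filter(today: date, days: int, service: Optional[str] = None) -> dict:
    """Build a metadata filter for documents published in the last `days` days."""
    # Subtract in date space first: 20260503 - 7 would not be a valid date.
    cutoff = to_yyyymmdd(today - timedelta(days=days))
    date_clause = {"greaterThanOrEquals": {"key": "published_date", "value": cutoff}}
    if service is None:
        return date_clause
    service_clause = {"equals": {"key": "service", "value": service}}
    return {"andAll": [date_clause, service_clause]}


print(recent_filter(date(2026, 5, 14), 7, service="amazon-bedrock"))
```

Because the cutoff is a plain integer, it matches the `NUMBER` type that the comparison operators require, and the encoding preserves chronological order when compared numerically.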