How to Clean Search Results Before Sending Them to an LLM

A developer created a Python cleaning layer to sanitize SERP API responses before sending them to an LLM. The approach normalizes fields like title, URL, snippet, and position, stripping out raw HTML, tracking parameters, and other noise to reduce token waste and improve answer quality. The pattern uses a simple source-numbered context format that LLMs can process efficiently.

Search results look clean when you see them in a browser. A title. A URL. A snippet. Maybe a date. Maybe a few related links.

Then you call a SERP API and look at the JSON. Suddenly your “simple search result” has ads, organic results, local packs, related questions, tracking URLs, missing snippets, duplicate domains, nested fields, weird formatting, and sometimes a small family of empty strings living under the couch.

If you are building an LLM app, do not throw that raw response into the prompt. That is how you get noisy answers, wasted tokens, weak citations, and sometimes prompt injection problems.

The better pattern is:

SERP API response → clean results → normalized fields → source-numbered context → LLM prompt

In this article, we will build a small Python cleaning layer for search results before sending them to an LLM. The goal is not to support every SERP API on earth. The goal is to create a practical pattern you can adapt.

An LLM does not need the full search response. It needs useful evidence.

For most search-grounded workflows, the model only needs:

- title
- URL
- snippet
- position
- source number

Sometimes you may also need:

- date
- domain
- result type
- location
- language

But you usually do not need:

- raw HTML
- tracking parameters
- empty fields
- duplicate links
- API metadata
- nested debug objects
- ads, unless your task needs ads
- large unrelated blocks

Every extra field costs tokens. Every noisy field makes the model work harder. Every irrelevant block is a tiny fog machine inside your prompt.
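To make the field lists above concrete, here is a minimal sketch of trimming one raw result down to the essentials. The helper name `keep_useful_fields`, the input key names, and the two noise fields in the sample are assumptions for illustration, not the shape of any particular SERP API.

```python
# Hypothetical helper: keep only the fields the model usually needs.
# The key names below are assumptions; adjust them to your provider.
ESSENTIAL_FIELDS = ("title", "link", "snippet", "position")


def keep_useful_fields(raw_result, source_number):
    """Return a small dict holding only the non-empty essential fields."""
    slim = {"source": source_number}
    for field in ESSENTIAL_FIELDS:
        value = raw_result.get(field)
        # Skip missing values and empty strings so they cost no tokens.
        if value is None or value == "":
            continue
        slim[field] = value
    return slim


raw = {
    "position": 1,
    "title": "Best SERP APIs for Developers",
    "link": "https://example.com/serp-api",
    "snippet": "",  # empty field, should be dropped
    "cached_page_link": "https://cache.example.com/abc",  # made-up noise field
    "debug": {"latency_ms": 120},  # made-up noise field
}

print(keep_useful_fields(raw, source_number=1))
```

In this sample the empty snippet and both noise fields disappear, so the model only sees the source number, title, link, and position.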
Here is a common mistake:

```
prompt = f"""
Answer the user's question using these search results:

{raw_serp_json}
"""
```

This is easy, but it has problems. The raw JSON may be huge. It may contain fields the model does not need. It may include duplicate results. It may include text that looks like instructions. It may contain messy URLs. It may push the useful snippets far away from the actual user question.

A better approach is to clean the response first. We will write a Python script that cleans the results, normalizes their fields, and builds a source-numbered context.

The final context will look like this:

```
Source 1
Title: Example Search Result
URL: https://example.com/article
Snippet: A short clean summary from the search result.

Source 2
Title: Another Result
URL: https://example.org/guide
Snippet: Another useful snippet.
```

That format is simple. Simple is good. LLMs like clean context. Developers like debuggable context. Everyone gets a tiny biscuit.

Different providers use different response shapes, but many return something like this:

```
{
  "organic_results": [
    {
      "position": 1,
      "title": "Best SERP APIs for Developers",
      "link": "https://example.com/serp-api?utm_source=google",
      "snippet": "Compare SERP APIs for SEO, AI agents, and search workflows."
    },
    {
      "position": 2,
      "title": "Search API Guide",
      "link": "https://example.org/search-api",
      "snippet": "Learn how to use search APIs in applications."
    }
  ]
}
```

Some APIs may use different keys for the result list:

- organic_results
- organic
- results

And for URLs:

- link
- url
- href

So the cleaner should be defensive.

We only need standard Python, plus beautifulsoup4 if you want to strip HTML from snippets.

```
pip install beautifulsoup4
```

You can skip BeautifulSoup if your snippets are already plain text.

Create a file called clean_search_results.py.

```python
import re
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

from bs4 import BeautifulSoup
```

Now add a text cleaner.
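Before the cleaner itself, the defensive key handling described above can be sketched in a few lines. The helper names `first_present`, `get_organic_results`, and `get_result_url` are hypothetical, and the candidate key lists are assumptions based on the key names mentioned above; a real provider may use others.

```python
# Candidate keys are assumptions based on the names mentioned in the text;
# extend them for the provider you actually use.
RESULT_LIST_KEYS = ("organic_results", "organic", "results")
URL_KEYS = ("link", "url", "href")


def first_present(mapping, keys, default=None):
    """Return the value of the first key that holds a non-empty value."""
    for key in keys:
        value = mapping.get(key)
        if value:
            return value
    return default


def get_organic_results(response):
    """Find the list of organic results, whatever the provider calls it."""
    results = first_present(response, RESULT_LIST_KEYS, default=[])
    # Guard against a dict or string sitting under one of those keys.
    return results if isinstance(results, list) else []


def get_result_url(result):
    """Find the result URL under any of the candidate key names."""
    return first_present(result, URL_KEYS, default="")


response = {
    "results": [
        {"title": "Search API Guide", "href": "https://example.org/search-api"}
    ]
}

for item in get_organic_results(response):
    print(get_result_url(item))
```

Trying a short list of candidate keys keeps the rest of the pipeline independent of any one response shape. With the lookup settled, the text cleaner is next.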
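```python
def clean_text(value):
    # Treat None, empty strings, and other falsy values as "no text".
    if not value:
        return ""

    # Some APIs return numbers or other types where text is expected.
    if not isinstance(value, str):
        value = str(value)

    # Strip HTML tags, then collapse runs of whitespace into single spaces.
    value = BeautifulSoup(value, "html.parser").get_text(" ")
    value = re.sub(r"\s+", " ", value)
    value = value.strip()

    return value
```

This removes HTML and collapses weird whitespace.

For example: Best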