My AI integration had terrible costs until I changed my approach

wpnews.pro

cd /news/artificial-intelligence/my-ai-integration-had-terrible-costs… · home › topics › artificial-intelligence › article

[ARTICLE · art-35420] src=dev.to ↗ pub=2026-06-21T08:01Z topic=artificial-intelligence verified=true sentiment=↑ positive

My AI integration had terrible costs until I changed my approach

A developer building a SaaS article summarizer reduced API costs from $1,200/month to a fraction by combining extractive and abstractive summarization. The pipeline uses TextRank to extract key sentences locally, then sends only those to GPT-3.5-turbo for rewriting. This approach cut token usage by 90% while maintaining summary quality close to GPT-4.

read3 min views1 publishedJun 21, 2026

Last month I was working on a SaaS product that needed to summarize long articles for users. Think of it like a TL;DR generator. I built a first prototype using GPT-4 with a straightforward prompt: "Summarize this article in 3 bullet points." It worked beautifully. The summaries were crisp, accurate, and users loved them.

Then the API bills arrived. One month of moderate usage cost me over $1,200. That's not sustainable for a side project. I had to fix it or kill the feature.

First, I tried switching to GPT-3.5-turbo. The price dropped dramatically, but the quality tanked. Summaries became vague, sometimes missing key points. I tried prompt engineering — adding "be specific" or "include numbers" — but nothing reliably matched GPT-4's output.

Next, I tried reducing the input size. I used extractive summarization libraries like sumy

to grab the most important sentences before sending them to GPT. That helped a bit, but the cost was still high because the extracted text was still large, and GPT-3.5 still hallucinated on long inputs.

I also considered using a local model like Llama 2, but my server couldn't handle the inference latency. My users expected summaries in under 3 seconds.

The insight came from reading research papers on summarization. Pure extractive (picking sentences) is fast but rigid. Pure abstractive (generating new sentences) is flexible but expensive. The sweet spot? Use extractive to shrink the text, then use a small abstractive model to rewrite the summary elegantly.

I implemented a pipeline:

textrank

or bert-extractive-summarizer

with a tiny model) to pick the top 5–10 sentences from the article.This slashed costs by 80% while keeping quality closer to GPT-4. The extractive step removes 90% of the input tokens, so the API call is tiny.

Here's a simplified Python version using a generic API endpoint (you can swap in any compatible service, for example https://ai.interwestinfo.com/ or OpenAI):

import requests
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

def extract_key_sentences(text, sentence_count=10):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = TextRankSummarizer()
    sentences = summarizer(parser.document, sentence_count)
    return " ".join(str(s) for s in sentences)

def abstractive_summarize(key_sentences, api_key, endpoint):
    payload = {
        "model": "gpt-3.5-turbo",  # or any cheap model
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"Combine these key sentences into a fluent 3-bullet summary:\n\n{key_sentences}"}
        ],
        "temperature": 0.3,
        "max_tokens": 150
    }
    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
    response = requests.post(endpoint, json=payload, headers=headers)
    return response.json()["choices"][0]["message"]["content"]

article_text = "...long article..."
key_sentences = extract_key_sentences(article_text, 8)
summary = abstractive_summarize(key_sentences, api_key="sk-...", endpoint="https://api.openai.com/v1/chat/completions")
print(summary)

This code runs the extractive summarizer locally (fast, free) and only makes one small API call. I added caching on the extractive step so repeated articles don't re-run the same sentences.

This approach isn't perfect. Some articles require a fully abstractive summary because the extractive phase may miss the connecting logic. For example, if the original text builds an argument step-by-step, picking random key sentences loses the flow. In those cases, I fall back to a single GPT-4 call, but I limit it to articles above a certain complexity (detected by sentence count or named entity density).

Also, the extractive step with TextRank is non-deterministic — you might get different results on re-runs. I switched to a deterministic variant (using a fixed random seed) to ensure consistency.

Another lesson: caching is your best friend. I cache both extractive results (by article hash) and abstractive results (by hash + model). In production, thousands of users read the same articles, so cache hit rates are high.

If I rebuild this, I'd:

The real lesson isn't about any specific API or library. It's about thinking in layers: you don't always need a sledgehammer. By breaking a complex AI task into cheaper sub-tasks, you solve both cost and latency.

Now I'm curious: What's your approach to balancing AI quality and cost in production? Do you use cascading models or other tricks?

source & further reading

dev.to — original article You're Using AI to Prep for Interviews WRONG. This is What to Do to Get the Offer How to Use AI Coding Tools Without Leaking Source Code PII Masking vs Data Encryption: What's the Difference for AI APIs?

~/api · this article 200

$curl api.wpnews.pro/v1/news/my-ai-integration-had-te…

Read original on dev.to → dev.to/__c1b9e06dc90a7e0a676b/my-ai-integration-…

mentioned entities

OpenAI

GPT-4

GPT-3.5-turbo

Llama 2

TextRank

sumy

InterWest Info

metadata

slugmy-ai-integration-had-terrible-costs-until-i-changed-my-approach

topic#artificial-intelligence

secondary4 topics

sentimentpositive

canonicaldev.to

navigation

← prevToday in History: June 21, first…

next →Building a Dark Mode System in N…

── more in #artificial-intelligence 4 stories · sorted by recency

dev.to · 21 Jun · #artificial-intelligence

I Built an AI Tutor in 48 Hours and Heres What Blew My Mind

dev.to · 21 Jun · #artificial-intelligence

The Developer's Guide to AI Data Privacy in 2026

macworld.com · 21 Jun · #artificial-intelligence

A lifetime of ChatGPT, Claude & Gemini in one place for just $70 during Deal Days

github.com · 21 Jun · #artificial-intelligence

Open Ralph Wiggum – Autonomous Agentic Loop

── more on @openai 3 stories trending now

wpnews · 20 Jun · #ai-agents

Amazon Bedrock AgentCore Memory: Build AI Agents That Remember

wpnews · 20 Jun · #artificial-intelligence

Microsoft is rewriting the economics of enterprise AI and the bill shock is just getting started

wpnews · 20 Jun · #artificial-intelligence

Big Tech redirects buybacks into AI capital spending

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required