# What I Learned Building an Autonomous Deal-Hunting Agent System

Over the past week I built a multi-agent AI system that autonomously scans the internet for bargains, estimates the true value of products using three different pricing techniques, and pushes a notification straight to my phone the moment it finds a deal worth acting on. Along the way I picked up a ton of practical, transferable lessons about agentic AI architecture, RAG, fine-tuning vs. prompting, tool calling, and shipping a real (if scrappy) product with a Gradio front end. This post is my write-up of the whole journey, with the code that made it work.

The system, nicknamed "The Price Is Right", is built around the idea that no single model is the best at everything. Instead of one giant prompt, the architecture splits the problem into focused agents that each do one job well, coordinated by a planning agent.

Here's how the five days of building this broke down.

The first lesson was about infrastructure: how do you run a model (especially a fine-tuned open-source LLM) without managing your own GPU server? The answer here was [Modal](https://modal.com), a serverless platform for running Python functions in the cloud, including on GPUs.

The "hello world" of Modal is refreshingly simple.
You define an `App`, an `Image` (essentially a container spec with pip dependencies), and decorate a normal Python function with `@app.function`:

```python
# hello.py
import modal
from modal import Image

# Setup
app = modal.App("hello")
image = Image.debian_slim().pip_install("requests")


# Hello
@app.function(image=image)
def hello() -> str:
    import requests

    response = requests.get("https://ipinfo.io/json")
    data = response.json()
    city, region, country = data["city"], data["region"], data["country"]
    return f"Hello from {city}, {region}, {country}"


# New - added thanks to student Tue H.
@app.function(image=image, region="eu")
def hello_europe() -> str:
    import requests

    response = requests.get("https://ipinfo.io/json")
    data = response.json()
    city, region, country = data["city"], data["region"], data["country"]
    return f"Hello from {city}, {region}, {country}"
```

What I loved here is the `region="eu"` parameter: with one keyword argument you can pin where in the world your function actually executes, which matters for latency, data residency, and sometimes cost.

Calling it locally vs. remotely is just as simple from a notebook:

```python
from hello import app, hello, hello_europe

with app.run():
    reply = hello.local()   # runs on your machine

with app.run():
    reply = hello.remote()  # runs on Modal's cloud
```

The next step up is running an actual language model. This is where Modal's GPU support and Secrets come in: you don't want to hardcode your Hugging Face token, so you register it once in Modal's dashboard under a name, e.g.
`huggingface-secret`, and reference it in code:

```python
# llama.py
import modal
from modal import Image

# Setup
app = modal.App("llama")
image = Image.debian_slim().pip_install("torch", "transformers", "accelerate")
secrets = [modal.Secret.from_name("huggingface-secret")]  # `secrets=` expects a list
GPU = "T4"
MODEL_NAME = "meta-llama/Llama-3.2-3B"


@app.function(image=image, secrets=secrets, gpu=GPU, timeout=1800)
def generate(prompt: str) -> str:
    # Heavy imports live inside the function so they run in the remote container
    from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

    set_seed(42)
    inputs = tokenizer.encode(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(inputs, max_new_tokens=5)
    return tokenizer.decode(outputs[0])
```

A few things clicked for me here:

- `gpu="T4"` is the argument that requests a T4 GPU for the function.
- `timeout=1800` matters because the first call has to download and load model weights; that cold start can take minutes.
- The heavy parts (the `transformers` imports, tokenizer, and model) are imported and loaded inside the function body, so they run in the remote container rather than on my machine.

The really important conceptual jump on Day 1 was going from an ephemeral app (`with app.run(): ...`, which spins up and tears down for a single call) to a deployed app:

```
uv run modal deploy -m pricer_service
```

Once deployed, the service runs independently of my notebook, and I can call it from anywhere just by referencing it by name:

```python
import modal

Pricer = modal.Cls.from_name("pricer-service", "Pricer")
pricer = Pricer()
reply = pricer.price.remote("Quadcast HyperX condenser mic, connects via usb-c to your computer for crystal clear audio")
print(reply)
```

This is essentially how you'd put a fine-tuned model "behind an API" for a production system, and it's the foundation for the Specialist Agent, which wraps this exact deployed pricer.

There's also a nice optimization here: by default a Modal container scales down to zero when idle, so the first call after inactivity can take ~30 seconds to wake up.
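A quick way to see that cold start is to time the first call against the second. The snippet below is a minimal sketch rather than anything from the project: `timed_call` and `FakeRemotePricer` are hypothetical names, and the fake pricer simply sleeps on its first call to imitate a container waking up.

```python
import time
from typing import Any, Callable, Tuple


def timed_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn with the given arguments and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


class FakeRemotePricer:
    """Hypothetical stand-in for a scale-to-zero service (not the real Modal class)."""

    def __init__(self, cold_start_seconds: float = 0.2) -> None:
        self.cold_start_seconds = cold_start_seconds
        self.warm = False

    def price(self, description: str) -> float:
        # The first call pays a simulated start-up delay; later calls do not.
        if not self.warm:
            time.sleep(self.cold_start_seconds)
            self.warm = True
        return 0.0  # placeholder value, not a real price estimate


pricer = FakeRemotePricer()
_, cold = timed_call(pricer.price, "USB-C condenser mic")
_, warm = timed_call(pricer.price, "USB-C condenser mic")
print(f"cold: {cold:.2f}s, warm: {warm:.2f}s")
```

Against the deployed service, the same helper should work by passing `pricer.price.remote` in place of the fake's `price` method.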
If you're willing to spend a few extra credits, you can keep a container warm:

```python
import modal

Pricer = modal.Cls.from_name("pricer-service", "Pricer")
pricer = Pricer()
pricer.update_autoscaler(scaledown_window=1200)  # stay warm for 20 minutes
```

Takeaway: Modal turns "deploy a fine-tuned model as a microservice" into a one-line decorator and a one-line CLI command. The mental model (write a normal Python function, decorate it, deploy it, call it like a remote object) is something I'll reuse for any future "specialist model as a service" project.

Day 2 was about a different way to make a frontier model (GPT-5.1) better at a narrow task, Retrieval Augmented Generation (RAG), and then about combining multiple pricing strategies into one.

The first ingredient is a local, open-source sentence embedding model, which turns text into a 384-dimensional vector capturing its meaning:

```python
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Pass in a list of texts, get back a numpy array of vectors
vector = encoder.encode(["A proficient AI engineer who has almost reached the finale of AI Engineering Core Track"])[0]
print(vector.shape)  # (384,)
```

These vectors get stored, along with the product description and metadata (category, price), in a Chroma vector database, batched 1,000 items at a time across hundreds of thousands of products:

```python
# Assumes `client` (a Chroma client), `train` (the list of items), and `tqdm`
# were set up earlier in the notebook.
collection_name = "products"
existing_collection_names = [collection.name for collection in client.list_collections()]

if collection_name not in existing_collection_names:
    collection = client.create_collection(collection_name)
    for i in tqdm(range(0, len(train), 1000)):
        documents = [item.summary for item in train[i: i+1000]]
        vectors = encoder.encode(documents).astype(float).tolist()
        metadatas = [{"category": item.category, "price": item.price} for item in train[i: i+1000]]
        ids = [f"doc_{j}" for j in range(i, i+1000)]
        ids = ids[:len(documents)]  # the last batch may be shorter than 1,000
        collection.add(ids=ids, documents=documents, embeddings=vectors, metadatas=metadatas)

collection = client.get_or_create_collection(collection_name)
```

One of the most satisfying moments was reducing those 384-dimensional vectors down to 3D with t-SNE and seeing the products cluster by category, with electronics in one corner and musical instruments in another:

python from sklearn.manifold import TSNE import plotly.graph objects as go tsne = TSNE n components=3, random state=42 reduced vectors = tsne.fit transform vectors fig = go.Figure data= go.Scatter3d x=reduced vectors :, 0 , y=reduced vectors :, 1 , z=reduced vectors :, 2 , mode='markers', marker=dict size=2, color=colors, opacity=0.7 , text= f"Category: {c}