I Built an AI Data Chat Tool in My Portfolio App Using Gemma 4, CrewAI, DuckDB, Supabase Edge Functions & Google Cloud Run 🚀

Creation of an AI-powered data chat tool that allows users to upload a file and ask questions in plain English, eliminating the need to know SQL. The backend architecture uses Google's Gemma 4 model for natural-language-to-SQL generation, orchestrated by CrewAI agents that inspect the schema and execute queries on DuckDB. The tool is deployed using a Supabase Edge Function as a proxy to a Google Cloud Run backend, providing a transparent chat interface that displays the generated SQL and results.

What If You Could Just... Ask Your Data a Question? 🤔 Most people who need insights from a data file are blocked by one simple thing: they don't know SQL. Even technically strong users often don't want to stop, inspect schema manually, write queries, debug syntax, and format results just to answer a quick question like "Which category has the highest revenue?" or "Show me null rates by column." This project removes that friction entirely. Upload your file, type your question in plain English, and let a Gemma 4-powered agentic backend inspect the schema, generate DuckDB SQL, execute it, and return the results — right inside a clean chat interface. 🎯 👉 Try it live here: https://databro.dev/backend/ai-data-chat/ 🛠️ The Full Stack at a Glance Before diving deep, here's the architecture that powers this tool: 🌟 Meet the Gemma 4 Model Family Gemma 4 is Google DeepMind's latest family of open models — and they are genuinely exciting for agentic builders. Unlike earlier iterations, Gemma 4 is built with logic-heavy, reasoning-oriented workflows in mind, which makes it a natural fit for tasks like natural-language-to-SQL generation. Here's a builder-friendly overview of the lineup: The AI Data Chat tool exposes google/gemma-4-31B-it and google/gemma-4-26B-A4B-it via Hugging Face endpoints. Once you see these models turn a fuzzy natural-language question into a working DuckDB query over your own uploaded file, it becomes very hard not to imagine ten more use cases. 🔥 🤖 A Brief Foundation on CrewAI CrewAI is an open-source framework for orchestrating autonomous AI agents in structured multi-agent workflows. Instead of a single giant prompt doing everything, CrewAI lets you break a problem into specialized, coordinated responsibilities. Three building blocks to understand: 🧑💼 Agents Specialized workers, each with a defined role. Think of them as employees on your AI team — one might be a "Schema Inspector," another a "SQL Writer," another a "Result Formatter." 📋 Tasks Units of work assigned to agents. A task has a clear description, expected output, and the agent responsible for it. Examples: - "Analyze the uploaded file and return its schema" - "Given this schema and user intent, write a valid DuckDB SQL query" - "Execute the SQL and format the result for the user" 🔧 Tools Capabilities agents can invoke to act on the world — file inspection utilities, DuckDB query executors, schema extractors, etc. This model is powerful because it creates transparent, inspectable pipelines instead of black-box AI magic. Every step has a purpose, and every output is traceable. 🏗️ How the Tool Is Built — End to End Step 1 — The Next.js Frontend The portfolio app at databro.dev/backend/ai-data-chat/ provides a split-panel chat UI: - Left panel: File upload zone drag & drop , attached file display with name and size, LLM Settings panel with Provider + Model selectors - Right panel: Chat interface with starter prompt buttons, conversation history with SQL panels and output tables The user picks a file, selects a Gemma 4 model via Hugging Face, and types a natural-language question. The frontend packages everything into FormData file + user intent + provider + model and sends it to a Supabase Edge Function. Step 2 — Supabase Edge Function A lean TypeScript Edge Function validates the multipart request and proxies it to the Google Cloud Run backend. This keeps the frontend thin, avoids exposing Cloud Run endpoints directly to the browser, and centralizes auth concerns at the edge. Step 3 — FastAPI on Google Cloud Run The Cloud Run backend receives the uploaded file and user intent. It: - Stores the file in a temporary directory - Detects file type CSV, Parquet, Arrow, JSON, XLSX - Loads the data into DuckDB as a data table - Runs DESCRIBE data to extract the full schema - Counts total rows - Passes schema context + row count + user intent to the CrewAI agent pipeline Step 4 — CrewAI Agent Pipeline The agentic pipeline runs two coordinated tasks: Task 1 — Schema Analysis An agent uses a DuckDB tool to introspect the uploaded file and returns column names, types, and a row count. This grounding step is critical — the LLM sees real column names before generating SQL, which dramatically reduces hallucination. Task 2 — SQL Generation + Execution The schema and user intent are passed to the Gemma 4 model via Hugging Face. The LLM returns a valid DuckDB SQL query. The agent then executes that SQL against the uploaded file and serializes the results. Step 5 — Response Back to the UI The backend returns a structured JSON response containing: - model — the resolved model used - sql — the exact DuckDB SQL that was generated and run - schema — column definitions - total rows — row count of the uploaded file - result — the query output rows The frontend renders the SQL in a dark code panel and the results in a scrollable table. Transparency is built in — you always see exactly what query was run. 🎬 Live Demo — Real Queries, Real Results I tested the tool with a real e-commerce sales CSV 49 rows, 18 columns covering orders, customers, products, categories, regions, sales, profit, discounts, and returns . Here's what Gemma 4 + CrewAI + DuckDB produced for 6 progressively complex natural language questions. ✅ Test Setup — File Uploaded, Gemma 4 31B Selected The file sample test file.csv is attached 8.7 KB . Provider set to Hugging Face, model set to Gemma 4 31B Instruct. The chat interface is ready with starter prompts visible. Upload & model selection — sample test file.csv loaded, Hugging Face + Gemma 4 31B Instruct selected 💡 The LLM Settings panel visible by scrolling right on the chat panel lets you switch between Gemma 4 31B Instruct andGemma 4 26B A4B Instruct mid-session. 🟢 Query 1 — "Show me the top 10 rows" What Gemma 4 generated: SELECT FROM data LIMIT 10 Result: Processed 49 CSV rows, returned 10 of 10 matching rows. Clean tabular output showing order id , customer id , customer name , customer email , customer segment and more. 🟢 Query 2 — "What are the key columns and null rates?" This is where things get interesting. Instead of a simple SELECT , Gemma 4 understood intent — it generated a full null-rate audit query covering every single column: What Gemma 4 generated: SELECT count AS total rows, count - count order id 100.0 / count AS order id null rate, count - count customer id 100.0 / count AS customer id null rate, count - count customer name 100.0 / count AS customer name null rate, count - count customer email 100.0 / count AS customer email null rate, -- ... every column covered count - count returned 100.0 / count AS returned null rate FROM data Result: 49 total rows, 0% null rate across all 18 columns — a complete data quality snapshot in one shot, without the user writing a single line of SQL. 🤯 This is the magic moment. A non-technical user typed a plain English question, and the system produced a production-quality data quality audit query autonomously. 🟡 Query 3 — "Show top 5 categories ranked by total sales and profit margin percentage" What Gemma 4 generated: SELECT category, SUM sales AS total sales, SUM profit / SUM sales 100 AS profit margin percentage FROM data GROUP BY category ORDER BY total sales DESC LIMIT 5 Result: 3 categories returned with calculated profit margins: Insight unlocked: Furniture is the highest-revenue category but is running at a loss. Technology and Office Supplies are both profitable — Office Supplies has the best margin. 🟡 Query 4 — "Show me monthly revenue trend with total orders count and average order value grouped by month" What Gemma 4 generated: SELECT date trunc 'month', order date AS month, SUM sales AS total revenue, COUNT order id AS total orders, AVG sales AS avg order value FROM data GROUP BY date trunc 'month', order date ORDER BY month Result: Gemma 4 correctly used DuckDB's native date trunc 'month', ... function — not a workaround, not a SUBSTR hack — real DuckDB date intelligence applied from natural language. 🔴 Query 5 — "Which customers have placed more than one order? Show their name, order count and total spend ranked by highest spend" What Gemma 4 generated: SELECT customer name, COUNT DISTINCT order id AS order count, SUM sales AS total spend FROM data GROUP BY customer name HAVING COUNT DISTINCT order id 1 ORDER BY total spend DESC Result: Gemma 4 correctly identified the need for HAVING COUNT DISTINCT ... 1 — a classic SQL pattern that trips up many developers, generated correctly from plain English. 🔴 Query 6 — "Find all returned orders, show total profit loss by category and ship mode, sorted by biggest loss first" What Gemma 4 generated: SELECT category, ship mode, SUM profit AS total profit loss FROM data WHERE returned = 'Yes' GROUP BY category, ship mode ORDER BY total profit loss ASC Result: Every single return came from Furniture. Second Class shipping returns caused the most financial damage. This is exactly the kind of insight a business analyst would spend hours finding — delivered in under 30 seconds via plain English. 💡 Why This Is Genuinely Useful — No SQL Expertise Needed Look at what just happened across those 6 queries: