{"slug": "i-built-an-ai-data-chat-tool-in-my-portfolio-app-using-gemma-4-crewai-duckdb-run", "title": "I Built an AI Data Chat Tool in My Portfolio App Using Gemma 4, CrewAI, DuckDB, Supabase Edge Functions & Google Cloud Run 🚀", "summary": "Creation of an AI-powered data chat tool that allows users to upload a file and ask questions in plain English, eliminating the need to know SQL. The backend architecture uses Google's Gemma 4 model for natural-language-to-SQL generation, orchestrated by CrewAI agents that inspect the schema and execute queries on DuckDB. The tool is deployed using a Supabase Edge Function as a proxy to a Google Cloud Run backend, providing a transparent chat interface that displays the generated SQL and results.", "body_md": "What If You Could Just... Ask Your Data a Question? 🤔\nMost people who need insights from a data file are blocked by one simple thing: they don't know SQL. Even technically strong users often don't want to stop, inspect schema manually, write queries, debug syntax, and format results just to answer a quick question like \"Which category has the highest revenue?\" or \"Show me null rates by column.\"\nThis project removes that friction entirely. Upload your file, type your question in plain English, and let a Gemma 4-powered agentic backend inspect the schema, generate DuckDB SQL, execute it, and return the results — right inside a clean chat interface. 🎯\n👉 Try it live here: https://databro.dev/backend/ai-data-chat/\n🛠️ The Full Stack at a Glance\nBefore diving deep, here's the architecture that powers this tool:\n🌟 Meet the Gemma 4 Model Family\nGemma 4 is Google DeepMind's latest family of open models — and they are genuinely exciting for agentic builders. 
Unlike earlier iterations, Gemma 4 is built with logic-heavy, reasoning-oriented workflows in mind, which makes it a natural fit for tasks like natural-language-to-SQL generation.\nHere's a builder-friendly overview of the lineup:\nThe AI Data Chat tool exposes `google/gemma-4-31B-it` and `google/gemma-4-26B-A4B-it` via Hugging Face endpoints. Once you see these models turn a fuzzy natural-language question into a working DuckDB query over your own uploaded file, it becomes very hard not to imagine ten more use cases. 🔥\n🤖 A Brief Foundation on CrewAI\nCrewAI is an open-source framework for orchestrating autonomous AI agents in structured multi-agent workflows. Instead of a single giant prompt doing everything, CrewAI lets you break a problem into specialized, coordinated responsibilities.\nThree building blocks to understand:\n🧑‍💼 Agents\nSpecialized workers, each with a defined role. Think of them as employees on your AI team: one might be a \"Schema Inspector,\" another a \"SQL Writer,\" another a \"Result Formatter.\"\n📋 Tasks\nUnits of work assigned to agents. A task has a clear description, an expected output, and the agent responsible for it. Examples:\n- \"Analyze the uploaded file and return its schema\"\n- \"Given this schema and user intent, write a valid DuckDB SQL query\"\n- \"Execute the SQL and format the result for the user\"\n🔧 Tools\nCapabilities agents can invoke to act on the world, such as file inspection utilities, DuckDB query executors, and schema extractors.\nThis model is powerful because it creates transparent, inspectable pipelines instead of black-box AI magic. 
Every step has a purpose, and every output is traceable.\n🏗️ How the Tool Is Built, End to End\nStep 1: The Next.js Frontend\nThe portfolio app at databro.dev/backend/ai-data-chat/ provides a split-panel chat UI:\n- Left panel: File upload zone (drag & drop), attached file display with name and size, LLM Settings panel with Provider + Model selectors\n- Right panel: Chat interface with starter prompt buttons, conversation history with SQL panels and output tables\nThe user picks a file, selects a Gemma 4 model via Hugging Face, and types a natural-language question. The frontend packages everything into `FormData` (file + user intent + provider + model) and sends it to a Supabase Edge Function.\nStep 2: Supabase Edge Function\nA lean TypeScript Edge Function validates the multipart request and proxies it to the Google Cloud Run backend. This keeps the frontend thin, avoids exposing Cloud Run endpoints directly to the browser, and centralizes auth concerns at the edge.\nStep 3: FastAPI on Google Cloud Run\nThe Cloud Run backend receives the uploaded file and user intent. It:\n- Stores the file in a temporary directory\n- Detects the file type (CSV, Parquet, Arrow, JSON, XLSX)\n- Loads the data into DuckDB as a `data` table\n- Runs `DESCRIBE data` to extract the full schema\n- Counts total rows\n- Passes schema context + row count + user intent to the CrewAI agent pipeline\nStep 4: CrewAI Agent Pipeline\nThe agentic pipeline runs two coordinated tasks:\nTask 1: Schema Analysis\nAn agent uses a DuckDB tool to introspect the uploaded file and returns column names, types, and a row count. This grounding step is critical: the LLM sees the real column names before generating SQL, which makes it far less likely to hallucinate columns that do not exist.\nTask 2: SQL Generation + Execution\nThe schema and user intent are passed to the Gemma 4 model via Hugging Face. The LLM returns a valid DuckDB SQL query. 
The agent then executes that SQL against the uploaded file and serializes the results.\nStep 5: Response Back to the UI\nThe backend returns a structured JSON response containing:\n- `model`: the resolved model used\n- `sql`: the exact DuckDB SQL that was generated and run\n- `schema`: column definitions\n- `total_rows`: row count of the uploaded file\n- `result`: the query output rows\nThe frontend renders the SQL in a dark code panel and the results in a scrollable table. Transparency is built in: you always see exactly what query was run.\n🎬 Live Demo: Real Queries, Real Results\nI tested the tool with a real e-commerce sales CSV (49 rows, 18 columns covering orders, customers, products, categories, regions, sales, profit, discounts, and returns). Here's what Gemma 4 + CrewAI + DuckDB produced for 6 progressively complex natural-language questions.\n✅ Test Setup: File Uploaded, Gemma 4 31B Selected\nThe file `sample_test_file.csv` is attached (8.7 KB). The provider is set to Hugging Face and the model to Gemma 4 31B Instruct. The chat interface is ready with starter prompts visible.\n[Screenshot: upload and model selection, with sample_test_file.csv loaded and Hugging Face + Gemma 4 31B Instruct selected]\n💡 The LLM Settings panel (visible by scrolling right on the chat panel) lets you switch between `Gemma 4 31B Instruct` and `Gemma 4 26B A4B Instruct` mid-session.\n🟢 Query 1: \"Show me the top 10 rows\"\nWhat Gemma 4 generated:\n\n```\nSELECT * FROM data LIMIT 10\n```\n\nResult: Processed 49 CSV rows, returned 10 of 10 matching rows. Clean tabular output showing `order_id`, `customer_id`, `customer_name`, `customer_email`, `customer_segment` and more.\n🟢 Query 2: \"What are the key columns and null rates?\"\nThis is where things get interesting. 
Instead of a simple `SELECT`, Gemma 4 understood the intent and generated a full null-rate audit query covering every column:\nWhat Gemma 4 generated:\n\n```\nSELECT\n    count(*) AS total_rows,\n    (count(*) - count(order_id)) * 100.0 / count(*) AS order_id_null_rate,\n    (count(*) - count(customer_id)) * 100.0 / count(*) AS customer_id_null_rate,\n    (count(*) - count(customer_name)) * 100.0 / count(*) AS customer_name_null_rate,\n    (count(*) - count(customer_email)) * 100.0 / count(*) AS customer_email_null_rate,\n    -- ... (every column covered)\n    (count(*) - count(returned)) * 100.0 / count(*) AS returned_null_rate\nFROM data\n```\n\nResult: 49 total rows and a 0% null rate across all 18 columns. That is a complete data quality snapshot in one shot, without the user writing a single line of SQL.\n🤯 This is the magic moment. A non-technical user typed a plain English question, and the system autonomously produced a production-quality data quality audit query.\n🟡 Query 3: \"Show top 5 categories ranked by total sales and profit margin percentage\"\nWhat Gemma 4 generated:\n\n```\nSELECT\n    category,\n    SUM(sales) AS total_sales,\n    (SUM(profit) / SUM(sales)) * 100 AS profit_margin_percentage\nFROM data\nGROUP BY category\nORDER BY total_sales DESC\nLIMIT 5\n```\n\nResult: 3 categories returned with calculated profit margins:\nInsight unlocked: Furniture is the highest-revenue category but is running at a loss. 
Technology and Office Supplies are both profitable, and Office Supplies has the best margin.\n🟡 Query 4: \"Show me monthly revenue trend with total orders count and average order value grouped by month\"\nWhat Gemma 4 generated:\n\n```\nSELECT\n    date_trunc('month', order_date) AS month,\n    SUM(sales) AS total_revenue,\n    COUNT(order_id) AS total_orders,\n    AVG(sales) AS avg_order_value\nFROM data\nGROUP BY date_trunc('month', order_date)\nORDER BY month\n```\n\nResult:\nGemma 4 correctly used DuckDB's native `date_trunc('month', ...)` function rather than a workaround such as a SUBSTR hack: real DuckDB date handling, applied from natural language.\n🔴 Query 5: \"Which customers have placed more than one order? Show their name, order count and total spend ranked by highest spend\"\nWhat Gemma 4 generated:\n\n```\nSELECT\n    customer_name,\n    COUNT(DISTINCT order_id) AS order_count,\n    SUM(sales) AS total_spend\nFROM data\nGROUP BY customer_name\nHAVING COUNT(DISTINCT order_id) > 1\nORDER BY total_spend DESC\n```\n\nResult:\nGemma 4 correctly identified the need for `HAVING COUNT(DISTINCT ...) > 1`, a classic SQL pattern that trips up many developers, and generated it correctly from plain English.\n🔴 Query 6: \"Find all returned orders, show total profit loss by category and ship mode, sorted by biggest loss first\"\nWhat Gemma 4 generated:\n\n```\nSELECT\n    category,\n    ship_mode,\n    SUM(profit) AS total_profit_loss\nFROM data\nWHERE returned = 'Yes'\nGROUP BY category, ship_mode\nORDER BY total_profit_loss ASC\n```\n\nResult:\nEvery single return came from Furniture. Second Class shipping returns caused the most financial damage. 
This is exactly the kind of insight a business analyst would spend hours finding — delivered in under 30 seconds via plain English.\n💡 Why This Is Genuinely Useful — No SQL Expertise Needed\nLook at what just happened across those 6 queries:", "url": "https://wpnews.pro/news/i-built-an-ai-data-chat-tool-in-my-portfolio-app-using-gemma-4-crewai-duckdb-run", "canonical_source": "https://dev.to/databro/i-built-an-ai-data-chat-tool-in-my-portfolio-app-using-gemma-4-crewai-duckdb-supabase-edge-9pn", "published_at": "2026-05-22 00:03:33+00:00", "updated_at": "2026-05-22 00:35:46.669949+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "open-source", "developer-tools", "data"], "entities": ["Gemma 4", "CrewAI", "DuckDB", "Supabase", "Google Cloud Run", "Google DeepMind", "Hugging Face", "databro.dev"], "alternates": {"html": "https://wpnews.pro/news/i-built-an-ai-data-chat-tool-in-my-portfolio-app-using-gemma-4-crewai-duckdb-run", "markdown": "https://wpnews.pro/news/i-built-an-ai-data-chat-tool-in-my-portfolio-app-using-gemma-4-crewai-duckdb-run.md", "text": "https://wpnews.pro/news/i-built-an-ai-data-chat-tool-in-my-portfolio-app-using-gemma-4-crewai-duckdb-run.txt", "jsonld": "https://wpnews.pro/news/i-built-an-ai-data-chat-tool-in-my-portfolio-app-using-gemma-4-crewai-duckdb-run.jsonld"}}