{"slug": "synthesize-the-big-picture-and-analyze-trends-with-bigquery-s-ai-agg-function", "title": "Synthesize the big picture and analyze trends with BigQuery's AI.AGG function", "summary": "Google Cloud announced the preview of BigQuery's AI.AGG() function, which enables natural-language analysis of unstructured and multimodal data across millions of rows using a single line of SQL. The function allows users to summarize, synthesize, and identify trends in logs, documents, and product reviews, and can be combined with other BigQuery AI functions for complex data analysis.", "body_md": "We recently announced the preview of the BigQuery [`AI.AGG()`](https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-ai-agg) function. With `AI.AGG()`, you can use natural-language instructions within a single line of SQL to summarize or synthesize information over millions of rows of unstructured or even multimodal data.\n\nWhile BigQuery already offers [powerful AI functions that help you analyze individual rows of data](https://medium.com/google-cloud/analyze-anything-with-ai-powered-sql-in-bigquery-80c0d3113656), analyzing unstructured data at scale requires a different approach. `AI.AGG()` lets you ask questions of unstructured data such as logs and documents, for example:\n\nWhat are the top three feature requests among the negative product reviews?\n\nWhat kind of errors are users seeing most frequently, and how should I start investigating them?\n\nIn which specific scenarios is our automated agent consistently failing to resolve customer issues?\n\nIn this post, we'll dive deeper into the `AI.AGG()` function and look at a few of the use cases that it unlocks, including how it can be used in combination with BigQuery’s other managed AI functions for complex, intelligent data analysis.\n\nA great example of the power of `AI.AGG()` is analyzing system logs. 
Log messages, warnings, errors, and stack traces can contain extremely useful information for improving your service, but it can be time- and labor-intensive to investigate them manually, especially if you operate at scale and have thousands of them to review.\n\nWith `AI.AGG()`, you can easily analyze many logs at once, grouping and prioritizing them to decide which ones to dig deeper into first. In fact, our BigQuery engineering team used this exact approach while developing `AI.AGG()`, using the function to help identify edge cases related to input handling for the feature itself!\n\nTo demonstrate this, let’s analyze a public dataset of Apache Spark standard `INFO` logs available from [Loghub](https://github.com/logpai/loghub). Often, clusters can run into issues like memory thrashing, clock drift, or broadcast bottlenecks without ever throwing a `FATAL` error. You can use `AI.AGG()` to analyze these seemingly normal logs for hidden inefficiencies. You can load [the sample data file](https://github.com/logpai/loghub/blob/master/Spark/Spark_2k.log_structured.csv) into BigQuery using [any of the supported methods, such as the UI, CLI, or client libraries](https://docs.cloud.google.com/bigquery/docs/batch-loading-data#loading_data_from_local_files). The following example assumes you’ve loaded the log file into a dataset called `bq_logs_demo` and a table named `spark_logs_unstructured`.\n\nNotice how we construct the prompt here. We explicitly give the model permission to say \"everything is fine,\" which discourages it from hallucinating errors, while instructing it to hunt for specific anomalies:\n\nYou can see in these results that `AI.AGG()` successfully acknowledges the \"operating normally\" messages while surfacing the critical diagnostic insights:\n\nNow, let’s look at some more use cases that demonstrate the flexibility of `AI.AGG()`, using one of BigQuery’s public datasets, `cymbal_pets`, a fictional pet supply shop. 
It includes a catalog of products carried by the store, with unstructured data like product names, descriptions, and images, making it a great example of the power of AI functions for handling unstructured data.\n\nFor example, let’s say you want to categorize the products in the dataset. The first hurdle in this case isn't applying labels to your products, but discovering what categories exist across the product catalog. With `AI.AGG()`, you can ask the model to analyze the raw product names and descriptions to identify the overarching categories for you.\n\nThis query returns a simple plaintext list of categories:\n\nThis initial query is great for discovery, but a simple plaintext string isn't enough to build a reliable, automated data pipeline. To actually tag your data, you need to instruct `AI.AGG()` to return a structured format, like a JSON array. Then, you can use the structured categories as a parameter within another AI function, [`AI.CLASSIFY()`](https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-ai-classify), to actually label each product with its category.\n\nThe following SQL statement completes each of these steps in one script:\n\nYou can now view the resulting table, which includes an `assigned_category` column:\n\nIf you look closely at the intermediate table, you'll notice the structured categories changed slightly from the initial plaintext results. This happens for two reasons: First, LLMs are nondeterministic, meaning that they don't always give the exact same response to the same prompt. 
Second, the prompt was adjusted to accommodate the new output structure.\n\nWith the table now labeled by category, you can group by the categories to do traditional SQL aggregation, or use `AI.AGG()` to consider each category separately.\n\nFor example, the following query fetches traditional metrics (like row counts) right alongside a synthesized AI summary of what those specific grouped products have in common:\n\nUnstructured data isn't limited to text. Because `AI.AGG()` natively supports multimodal inputs, you can return aggregated insights directly from image files.\n\nThe `cymbal_pets` Google Cloud project also contains a Cloud Storage bucket full of product photos. By creating an external object table, you can securely pass the image URIs directly into `AI.AGG()` and ask the model to summarize the visual content of the entire collection.\n\nTo use `AI.AGG()` effectively in your own environment, it helps to understand how it processes data behind the scenes. Here’s what you need to know about context windows, error handling, and optimizing your pipelines.\n\n**1. Context windows and multi-level aggregation**\n\nLLMs have a specific context window and can have a hard time handling massive amounts of input. `AI.AGG()` solves this problem by automatically dividing your input rows into batches, aggregating those batches, and then aggregating the results of those batches into a final answer. This means you don’t have to worry about manually managing the context window when passing in large numbers of rows. Note that `AI.AGG()` won’t split up a row of data across batches, so make sure that each individual row is smaller than the context window, to avoid the row being skipped. Many smaller rows will give `AI.AGG()` more flexibility with how to batch each row.\n\n**2. 
Token usage with multi-level aggregation**\n\nBecause `AI.AGG()` uses a multi-level aggregation structure, the total input tokens sent to the model may be higher than the raw tokens in your starting table (depending on how many rounds of aggregation are required). As a best practice, always reduce the number of input tokens by using `LIMIT` or pre-filtering your data upstream before passing it to `AI.AGG()`.\n\n**3. Specifying your model endpoint**\n\nIf you don’t specify a model endpoint, `AI.AGG()` will default to a recent model. However, for production pipelines, you often want explicit control:\n\n**Short-form names:** You can use a short-form endpoint (e.g., `gemini-2.5-flash`), in which case `AI.AGG()` will use that model in the query execution region:\n\n**4. Input and output modalities**\n\n**Inputs:** `AI.AGG()` supports text (via strings or references to text files) and image data. It also supports arrays of these types, though you should refer to the [known issues documentation](https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-ai-agg#known_issues) for edge cases regarding arrays of images.\n\n**Outputs:** The function **will always return a string**. While you can prompt the model in your instructions to format the output as JSON or Markdown, keep in mind that the database engine does not strictly enforce this. Multimodal output (e.g., generating an image) is not currently supported.\n\n**5. Treatment of `NULL`s**\n\n`AI.AGG()` automatically skips `NULL` input rows without processing them. However, you must be careful when passing structured data. Like other BigQuery AI functions, `AI.AGG()` concatenates `STRUCT` fields similarly to the standard `CONCAT()` function. This means if even one field within your `STRUCT` is `NULL`, the entire row is treated as `NULL` and will be skipped.\n\nLet's revisit our first categorization query. 
What if several rows of our `products` table are missing their `description`? Because of the `NULL` concatenation rule, those rows would be silently dropped from the analysis entirely. Here is how we can use `IFNULL()` to provide a fallback string, so that every product is taken into account even if its description is missing:\n\n**6. Error handling**\n\nIf `AI.AGG()` receives invalid input, or encounters an error during LLM processing, it will attempt to provide partial results. Rows that contain invalid input, or that the LLM rejects, are excluded from the final results. You can review exactly how many rows failed to process by checking your BigQuery job statistics, exactly as you would for scalar managed AI functions like `AI.IF()`.\n\nThese are just a few examples of the ways `AI.AGG()` can help analyze unstructured data. The [`AI.AGG()` function](https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-ai-agg) is in preview in BigQuery now, so it’s available to all BigQuery users. Try it out on your own use cases!\n\nYou may also be interested in checking out BigQuery's other [managed AI functions](https://docs.cloud.google.com/bigquery/docs/generative-ai-overview#managed_ai_functions), `AI.CLASSIFY()`, `AI.IF()`, and `AI.SCORE()`, as well as [general-purpose functions](https://docs.cloud.google.com/bigquery/docs/generative-ai-overview#general_purpose_ai) like `AI.GENERATE()`. 
We look forward to seeing what you build with them.", "url": "https://wpnews.pro/news/synthesize-the-big-picture-and-analyze-trends-with-bigquery-s-ai-agg-function", "canonical_source": "https://cloud.google.com/blog/products/data-analytics/deep-dive-into-bigquery-ai-agg-function/", "published_at": "2026-06-29 16:00:00+00:00", "updated_at": "2026-06-29 16:37:05.098022+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-tools", "natural-language-processing", "ai-infrastructure"], "entities": ["Google Cloud", "BigQuery", "AI.AGG", "Loghub", "Apache Spark", "cymbal_pets"], "alternates": {"html": "https://wpnews.pro/news/synthesize-the-big-picture-and-analyze-trends-with-bigquery-s-ai-agg-function", "markdown": "https://wpnews.pro/news/synthesize-the-big-picture-and-analyze-trends-with-bigquery-s-ai-agg-function.md", "text": "https://wpnews.pro/news/synthesize-the-big-picture-and-analyze-trends-with-bigquery-s-ai-agg-function.txt", "jsonld": "https://wpnews.pro/news/synthesize-the-big-picture-and-analyze-trends-with-bigquery-s-ai-agg-function.jsonld"}}