Transform Video Into Instantly Searchable, Actionable Intelligence with AI Agents and Skills

NVIDIA released a new version of its Metropolis Blueprint for video search and summarization (VSS) that transforms live and recorded video into searchable, actionable intelligence using AI agents and skills. The platform enables enterprises to monitor operations, detect trends, and make faster decisions by integrating vision-language models, large language models, and retrievers for real-time video analytics. The latest VSS update introduces a modular design, advanced fusion search, and skills that allow developers to automate deployment and integration into custom applications through a simple agentic chat interface.

In today’s data-driven world, organizations increasingly rely on video to capture critical information, yet extracting meaningful, real-time insights from massive amounts of footage remains a challenge. The NVIDIA Metropolis Blueprint for video search and summarization (VSS) https://build.nvidia.com/nvidia/video-search-and-summarization overcomes this hurdle by transforming millions of live video streams or hours of recorded video into instantly searchable, actionable intelligence.

VSS provides a reference architecture for building video analytics AI agents https://www.nvidia.com/en-us/use-cases/video-analytics-ai-agents/ that perceive, reason, and act in real time on massive volumes of live video streams and recorded data. It uses accelerated vision-based microservices, vision-language models (VLMs) https://www.nvidia.com/en-us/glossary/vision-language-models/, large language models (LLMs) https://www.nvidia.com/en-us/glossary/large-language-models/, and retrievers for real-time video intelligence, agentic search, and automated reporting. VSS helps enterprises monitor operations, detect trends, and make informed decisions faster.

The latest version of VSS brings a new modular design, an advanced fusion search capability, and a set of skills that integrate easily with autonomous agents. In this post you will learn how to use the new VSS skills https://github.com/NVIDIA-AI-Blueprints/video-search-and-summarization/tree/main/skills with coding agents to automate VSS deployment and integration into custom applications, followed by a deep dive into the technology behind VSS 3. Continue reading to learn how to build autonomous video analytics AI agents http://google.com/search?q=video+analytics+AI+agent+nvidia&rlz=1C1ONGR_enUS1050US1050&oq=video+analy&gs_lcrp=EgZjaHJvbWUqBggAEEUYOzIGCAAQRRg7MgcIARAAGIAEMg0IAhAAGIMBGLEDGIAEMgcIAxAAGIAEMgYIBBBFGDwyBggFEEUYPDIGCAYQRRg8MgYIBxBFGEHSAQgxMTE4ajBqN6gCALACAA&sourceid=chrome&ie=UTF-8 with VSS skills and coding agents.
You can also watch a recording to learn how to build a video analytics AI agent with VSS skills.

Build a video AI agent with VSS skills and coding agents

In the past, developers had to manually configure, deploy, and integrate the rich set of microservices VSS provides for video management, search, summarization, and more to build video analytics applications. Today, coding agents augmented with VSS skills can automate the deployment, usage, and integration of VSS, all through a simple agentic chat interface.

VSS skills are hosted in the VSS GitHub repository and follow the agent skills specification https://agentskills.io/specification, allowing them to be used with a wide variety of agents. To use these skills, you need a system that is set up to run VSS and a skills-compatible agent such as Codex, Claude Code, OpenClaw, or NemoClaw.

First, we will show how to add VSS skills to Codex and use it to deploy the VSS search profile. Then, we will show how to add VSS skills to OpenClaw, which lets you interact with your VSS deployment through nearly any chat interface to search and analyze large volumes of video.

Setting up the VSS prerequisites

The first step is to prepare a system to run VSS. The easiest way to do this is to use the NVIDIA Brev Launchable for VSS. Go to the VSS launchable documentation page https://docs.nvidia.com/vss/latest/cloud-brev.html, click the “Launch Blueprint” button, and then click “Deploy Launchable.” Once it is deployed, click the Open Notebook button and navigate to the /video-search-and-summarization/scripts/deploy_vss_launchable.ipynb notebook. Paste your NGC_CLI_API_KEY from NGC https://catalog.ngc.nvidia.com/ into the first cell, and then execute the entire notebook, including the tear-down section. This ensures the system is fully set up for VSS, after which you can use the deployment skill to manage your VSS deployment from your coding agent.
Once the notebook has run to completion, install the Brev CLI on your host system, launch VSCode, and remotely connect to your Brev instance by following the “Using Brev CLI SSH” section of your Launchable page, as shown in Figure 2, below. Once remote access is configured, you can install Codex through its VSCode extension to use as the coding agent.

Deploying VSS with Codex

In VSCode, use the extensions tab to search for and install Codex. Once Codex is installed, you need to install the VSS skills. You can do this by telling Codex to install the VSS skills itself and giving it the location of the VSS GitHub repository checkout, as shown in the following prompt:

    Read ~/video-search-and-summarization/skills/README.md and every SKILL.md file under ~/video-search-and-summarization/skills/. For each skill in the catalog, install it for this host so I can invoke it from a shell or chat session.

    Use the host's standard skills directory:
    Claude Code: ~/.claude/skills/<name>/
    Codex: ~/.codex/skills/<name>/
    Hosts that follow the agentskills.io universal path: ~/.agents/skills/<name>/

    Symlink each skill folder rather than copying it so a git pull here keeps every install up to date. Skip skills that are already installed and pointing at this checkout.

    When you're done, list the skills you registered and which directory you used.

Figure 3, below, shows how the agent responds, verifying that it can access the VSS skills.

Once your agent is loaded with the VSS skills, you can use it to deploy the various VSS components and profiles. For example, you can ask Codex to deploy the new VSS Search profile, as shown in Figure 4, below. Codex then plans the deployment, configures the necessary environment variables, and deploys all the containers needed to enable the VSS Search capability.

From here, you can continue using Codex to interact with VSS for searching videos, or continue to the next section to see how to use OpenClaw with VSS skills as well.
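To make the skill-installation prompt above more concrete, here is a minimal Python sketch of the symlink step it asks the agent to carry out. It is an illustration under stated assumptions, not what Codex actually executes: the function name `install_skills` is hypothetical, each skill is assumed to be a direct subfolder of the catalog that contains a SKILL.md file, and the skill names in the demo are placeholders. The demo runs in a temporary directory so it does not touch your real skills directory.

```python
import tempfile
from pathlib import Path


def install_skills(catalog: Path, host_dir: Path) -> list[str]:
    """Symlink every skill folder in `catalog` into `host_dir`.

    Assumption: a skill is a direct subfolder of `catalog` containing SKILL.md.
    Returns the names of the skills that were newly linked.
    """
    host_dir.mkdir(parents=True, exist_ok=True)
    linked = []
    skills = sorted(p for p in catalog.iterdir() if (p / "SKILL.md").is_file())
    for skill in skills:
        target = host_dir / skill.name
        if target.is_symlink() and target.resolve() == skill.resolve():
            continue  # already installed and pointing at this checkout
        if target.exists() or target.is_symlink():
            continue  # something else is installed under this name; leave it alone
        # Link instead of copying so a later `git pull` updates the install.
        target.symlink_to(skill.resolve(), target_is_directory=True)
        linked.append(skill.name)
    return linked


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Build a stand-in catalog with placeholder (hypothetical) skill names.
        catalog = Path(tmp) / "skills"
        for name in ("example-deploy", "example-search"):
            (catalog / name).mkdir(parents=True)
            (catalog / name / "SKILL.md").write_text("# placeholder\n")
        (catalog / "README.md").write_text("catalog overview\n")

        host = Path(tmp) / ".codex" / "skills"
        print(install_skills(catalog, host))  # ['example-deploy', 'example-search']
        print(install_skills(catalog, host))  # [] because both are already linked
```

To try the same logic against a real checkout, you would pass the repository's skills folder as `catalog` and the host's skills directory (for example `Path.home() / ".codex" / "skills"`) as `host_dir`. Linking rather than copying is what lets a later git pull update every install, which is the reason the prompt gives for the choice.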
Searching videos with VSS and OpenClaw

With the search profile running, you can install and configure OpenClaw to act as an autonomous agent for analyzing videos using VSS. This section shows how to set up OpenClaw on the Brev system to see what a powerful autonomous agent can do. Follow the standard OpenClaw installation instructions https://docs.openclaw.ai/install from the VSCode terminal connected to the Brev instance, and use the recommended installer script. After running through the initial configuration, you can hatch your agent, as shown in Figure 5, below, and give it the context that it will be an agent for building video analytics applications using VSS.

After the initial setup, you need to provide OpenClaw with the VSS skills. The easiest way to do this is to manually copy the skills into the OpenClaw workspace:

    mkdir ~/.openclaw/workspace/skills
    cp -r ~/video-search-and-summarization/skills/ ~/.openclaw/workspace/skills

Now, open the OpenClaw UI by running the openclaw dashboard command in the terminal, which returns a clickable link to the UI. Once it is open, you can verify that OpenClaw has access to the VSS skills.

Now you can tell OpenClaw to use the VSS search profile deployed in the previous section to start analyzing large volumes of video data. For this example, you will provide a path to three 10-minute videos captured in a warehouse that need to be analyzed for safe ladder usage. You want OpenClaw to use the search capability to find all instances of ladder usage in the videos and verify that the worker is wearing a hardhat and safety vest. For this, use the following prompt:

    I have a set of warehouse videos located at ~/warehouse_videos. I need to find any instances of a worker climbing a ladder and verify they are wearing a hardhat and safety vest. Can you do this with the VSS Search profile that is deployed?
Once prompted, OpenClaw works out which skills and associated tool calls it needs to complete the task. It uses the VSS skills to upload your video files to VIOS, ingest the videos through the embedding microservices to generate searchable indexes, and then use the fusion search capability in VSS to find the video clips where a worker wearing a hardhat and safety vest is climbing a ladder. Once it’s done, OpenClaw returns a concise report of all ladder usage seen across the videos, along with screenshots from the videos.

This section covered just one simple example of using Codex for deployment and OpenClaw for video analysis with VSS skills. Augmenting agents with VSS skills opens up many more ways to gain insights from video data and build new applications with VSS. Now you can dive deeper into the technology that powers the rich set of video analysis capabilities in VSS 3.

Smarter video: From alerts to search

Large-scale video search remains one of the most challenging frontiers in modern information retrieval. User queries are inherently complex and ambiguous. Capturing full semantic intent within a single visual embedding is fundamentally insufficient, particularly when objects and events carry multi-layered attributes that resist simple vector representation. At massive scale, locating a specific moment across millions of hours of footage becomes a true “needle in a haystack” problem, where nearest-neighbor search over a monolithic embedding space quickly degrades in both precision and recall.

Addressing these limitations requires a more sophisticated search architecture built on two core capabilities:

Multi-type embedding extraction and retrieval, combined with relevance filtering and semantic deduplication.
Search orchestration driven by agentic reasoning: decomposing complex queries into tractable sub-queries, applying reasoning-based retrieval strategies at each step, and running iterative verification and reflection loops to progressively refine results.

The search architecture first uses RTVI-CV with embedding and RTVI-embedding microservices to ingest video and extract features. The VSS agent then uses this feature data and vision-aware tools to perform a deep, iterative search on video, creating a plan and retrieving results to locate specific objects or events in the video timeline.

Modular architecture brings high flexibility and performance

VSS is designed around a modular developer profile system based on Docker Compose: a base agent deploys in under five minutes, and additional workflows are layered on top as needed.

| Workflow | Profile | Core Capability |
| --- | --- | --- |
| Alert Verification https://docs.nvidia.com/vss/latest/agent-workflow-alert-verification.html | | |
| Real-Time VLM Alerts https://docs.nvidia.com/vss/latest/agent-workflow-rt-alert.html | | |
| Search https://docs.nvidia.com/vss/latest/agent-workflow-search.html | | |
| Video Summarization https://docs.nvidia.com/vss/latest/agent-workflow-lvs.html | | |

Table 1. Available VSS developer profiles

Each workflow is supported on several types of GPUs in various configurations to meet your hardware and performance needs. Let’s look at some benchmarks for the various workflows and configurations.

The agentic search workflow can be characterized by its maximum number of concurrent input streams, the time it takes to ingest the incoming streams, and the retrieval latency to receive a search result. Table 2, below, shows these metrics on single-GPU configurations for H100 and NVIDIA RTX PRO 6000.
| GPU | Max Concurrent Streams | Max Ingestion Latency (s) | Retrieval Latency (s) |
| --- | --- | --- | --- |
| 1x H100 | 33 | 0.079 | 2.24 |
| 1x RTX PRO 6000 | 51 | 0.101 | 1.87 |

Table 2. Key performance metrics for the agentic search workflow

For the alert verification workflow, the maximum number of concurrent streams is measured along with the latency of the verification step. Table 3, below, shows these metrics measured using RT-DETR as the detector and Cosmos Reason 2 as the VLM verifier, operating on streams with an average of one alert event per minute.

| GPU | Max Concurrent Streams | Verification Latency (s) |
| --- | --- | --- |
| 1x DGX Spark / 1x AGX Thor | 14 | 0.89 |
| 1x H100 | 147 | 1.01 |
| 1x RTX PRO 6000 | 87 | 0.82 |

Table 3. Key performance metrics for the alert verification workflow

The long video summarization (LVS) microservice rapidly produces summaries of hours of video footage. The figure below shows the time it takes for a given GPU configuration to summarize an hour-long video. Scaling the LVS microservice to multiple GPUs can greatly decrease the summarization time.

Get started with VSS skills

VSS skills enable developers to transform video into searchable, meaningful data using natural language, making it easier to uncover insights, generate summaries, and build smarter applications.

To dive deeper into VSS, see the documentation https://docs.nvidia.com/vss/3.0.0/vssnext-docs/3.0.0/ . Explore all VSS skills on GitHub https://github.com/NVIDIA-AI-Blueprints/video-search-and-summarization/tree/main/skills . For technical questions, visit our forum https://forums.developer.nvidia.com/c/accelerated-computing/intelligent-video-analytics/visual-ai-agent/680 .

GTC Event: Join us at NVIDIA GTC Taipei in June, where developers, researchers, and industry leaders come together to explore the future of AI, from agentic and reasoning AI to physical AI, robotics, and beyond. Get details: https://www.nvidia.com/en-tw/gtc/taipei/