Cosmos Claw: Hack on a Boat in SF (Nvidia Cosmos Based Social Media Manager)

A team of developers built Cosmos Claw, an AI-powered social media manager that uses NVIDIA Cosmos 3 and GPT-4o to autonomously film, edit, and post promotional videos for venues, deployed on Nebius H200 GPUs during a San Francisco yacht hackathon.

It shoots, directs, and edits social videos for your venue: on autopilot, in every format, without a film crew.

Built for the Yacht Hackathon by @ComposioHQ, @nebius, @tavily-ai & @openclaw.

Watch the 60s demo on X.

Thirty seconds. That's all a viewer gives your venue before they swipe. To win those seconds you need a constant stream of video, but a human videographer is expensive, slow, and shoots one thing at a time.

Cosmos Claw is the always-on, AI-native alternative: a videographer and a marketing manager that never sleep. Point it at any venue (a short-let, a café, a bar) and it runs the whole studio on autopilot:

- Studies your space: GPT-4o vision labels every photo and learns what each room is.
- Brands it: invents the positioning and locks in the missing facts (price, story, amenities) so they stay identical across every video.
- Ideates like a manager: brainstorms a fresh campaign for each post (angle, hook, photo order, format, music, voice) that's different from everything it has shipped before.
- Films real motion: NVIDIA Cosmos 3, a world model built for robotics, doesn't pan over stills; it generates a first-person POV that physically walks into the room.
- Voices & cuts it: a unique GPT-written voiceover over a mood-matched music bed, cross-faded into the right aspect ratio.
- Delivers everywhere: ready-to-post cards (caption, hashtags, handle, recommended audio) in every feed: Reels/TikTok 9:16, IG 1:1 & 4:5, YouTube 16:9.

No crew, no call sheet, no edit bay. Just your existing photos in, and a full, on-brand social calendar out, on repeat.
As you read this, two workers are filming in parallel, pumping out a stream of ready-to-post Reels and TikToks for two San Francisco venues at once, each with its own AI voiceover, all on a Cosmos 3 model we deployed ourselves on Nebius H200s. It pauses when the network blips and resumes on its own. Truly always-on.

Looking forward to expanding this on the Yacht. SF is sooo amazing 🌉⛵️

1. Raw photos in. Drop a venue's existing images into the project. That's the only input. Here: the Alamo Square Hacker House (bedrooms, gym, coworking).
2. The marketing manager's memory. An OpenClaw-style GPT-4o manager researches the venue, locks in a consistent brand (positioning, audience, tone, pitch), writes the voiceover, and picks the assets & order: the durable memory every video is grounded on.
3. Ready-to-post cuts out. The Agent Loop streams everything the videographer does in real time, and each published cut is a ready-to-post package: video + caption + recommended audio (music & voice) + handle, ready to download or push to the channel.

| Layer | What we used |
|---|---|
| 🎥 Video model | NVIDIA Cosmos 3 Nano: a world model built for robotics/embodied POV, self-deployed by us for first-person walk-throughs |
| ⚡ Compute | Nebius: NVIDIA® H200 NVLink GPUs |
| 🧠 Manager + director | GPT-4o vision: studies the photos, brands the venue, ideates each campaign & storyboard |
| 🔎 Neighborhood research | Tavily: enriches each venue with real local context |
| 🗺️ Maps & info cards | OpenStreetMap: location, transit & nearby spots |
| 🔊 Audio | OpenAI TTS: a unique per-cut voiceover over a mood-matched music bed |
| 🧩 App | FastAPI Studio UI + an always-on marketing loop driver, FFmpeg for cutting/transitions |

We didn't just call a hosted API: we stood up Cosmos 3 Nano ourselves on Nebius H200 NVLink GPUs (vLLM-Omni, OpenAI-compatible) and drove it end-to-end.
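Driving a self-hosted endpoint over a tunnel means the link can drop mid-campaign, which is why the loop pauses on a network blip and resumes on its own. The following is only a minimal sketch of that idea, not the project's code: the function names `endpoint_is_up` and `wait_until_up`, the polling interval, and the choice of `urllib` are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


def endpoint_is_up(url, timeout=3.0):
    """Return True if the server answers at all, even with an HTTP error status."""
    try:
        urllib.request.urlopen(url, timeout=timeout).close()
        return True
    except urllib.error.HTTPError:
        return True   # the server responded, so the link itself is up
    except (urllib.error.URLError, OSError):
        return False  # refused connection, DNS failure, timeout, ...


def wait_until_up(probe, interval=5.0, sleep=time.sleep):
    """Block until probe() succeeds; return how many failed probes were seen."""
    failures = 0
    while not probe():
        failures += 1
        sleep(interval)  # pause the shoot instead of failing the campaign
    return failures
```

A worker could call `wait_until_up(lambda: endpoint_is_up(base_url))` before each shot, where `base_url` is whatever address the GPU endpoint is reachable at. Passing `probe` and `sleep` as arguments keeps the wait logic testable without a network.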
Tavily researches the surrounding neighborhood so every second of the video carries the context a viewer needs to say yes.

Shout-out to the partners: @ship_builders · @nebiusai · @nvidia · @composio · @tavilyai · @openclaw

```
venue photos + facts
        │
        ▼
GPT-4o manager ──→ brand dossier (positioning + durable assumptions)
        │
        ├─ Tavily ─→ neighborhood research
        └─ ideate ─→ one fresh campaign (angle · photos ·
        │            format · music · voice · caption · VO)
        ▼
NVIDIA Cosmos 3 Nano ──→ a short first-person POV clip per beat
(world model, self-hosted on Nebius H200)
        │
        ▼
transitions + audio ──→ cross-fade · GPT voiceover · mood music · reframe
        │
        ▼
ready-to-post cut.mp4 ──→ Agent Loop feed (caption · hashtags · audio)
        │
        └────────────── loop: next idea, next venue in parallel
```

| File | Role |
|---|---|
| `scripts/marketing_loop.py` | The always-on loop: study → ideate → film → voice → publish, per venue (parallel-safe) |
| `scripts/cosmos_montage.py` | Terminal montage: GPT vision per photo → Cosmos clips → fast transitions |
| `app/marketing_agent.py` | GPT-4o marketing manager: research → brand → brief |
| `app/brand.py` | Per-venue brand dossier (memory, durable assumptions, social posts) |
| `app/main.py` | FastAPI server + Studio UI + generation API |
| `app/trailer.py` | GPT-4o "director": storyboard, shot/motion + walk-through mode |
| `app/generation/cosmos.py` | NVIDIA Cosmos 3 image→video adapter (motion → flow-shift) |
| `app/generation/stub.py` | Free local FFmpeg fallback generator |
| `app/transitions.py` | Fast cross-fade montage into any aspect ratio (xfade) |
| `app/curation.py` | Best-of-N take scoring (motion energy + stability) |
| `app/audio.py` | TTS voiceover + mood music bed + duck-and-mux |
| `app/infocards.py` | Map / price / neighborhood cards (OpenStreetMap) |
| `app/pipeline.py` | Orchestrates a single run (best-of-N, info beats, finish) |
| `app/agent.py` | Terminal CLI to drive the manager + fire renders |
| `deploy/tunnel_keeper.sh` | Self-healing SSH tunnel to the Nebius GPU |

Cosmos Claw isn't a one-shot tool; it's a loop. A persistent brand dossier (`outputs/listing_{id}_brand.json`) is the single source of truth per venue, and an autonomous manager works against it the way a real social-media manager would, forever:

1. Study: GPT-4o vision builds an asset index: what every uploaded photo is (cached, so it's paid for once).
2. Ideate: brainstorms ONE fresh campaign (distinct from past themes): the angle, which photos to use (ordered like a story), the social format, music mood, TTS voice, a ready-to-post caption + hashtags, and a ~25s voiceover script.
3. Film: turns the chosen photos into short, first-person Cosmos clips.
4. Cut: cross-fades them into the campaign's aspect ratio and mixes the GPT voiceover over a mood-matched music bed.
5. Publish: drops a ready-to-post card into the Agent Loop feed and logs every step to the dossier timeline.

…then it does it again, with a brand-new idea.

Run one worker per venue and they generate in parallel, so multiple feeds fill at once:

```
# one always-on worker per project, running concurrently
python scripts/marketing_loop.py --projects la-house-1 --tag la --max-videos 6
python scripts/marketing_loop.py --projects hacker-house --tag hh --max-videos 6
```

It's built to run unattended: a live endpoint probe before every shot means a Wi-Fi/tunnel blip just pauses the shoot and resumes when the connection is back. No babysitting, no half-burned campaigns. A self-healing SSH tunnel keeper (`deploy/tunnel_keeper.sh`) keeps the link to the GPU alive underneath.

Consistency is the trick. Whatever the manager makes up, it makes up once: `build_brand` writes the missing facts as durable assumptions that are never overwritten, so price, amenities and host story stay identical across every cut.
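The write-once rule for durable assumptions can be sketched roughly as below. This is an illustration under stated assumptions, not the project's brand-building code: the function name `lock_assumptions`, the `assumptions` key, and the JSON layout of the dossier are all hypothetical.

```python
import json
from pathlib import Path


def lock_assumptions(dossier_path, proposed):
    """Merge proposed facts into the dossier; facts already locked are kept."""
    path = Path(dossier_path)
    # Start from the existing dossier if there is one (hypothetical layout).
    dossier = json.loads(path.read_text()) if path.exists() else {}
    locked = dossier.setdefault("assumptions", {})
    for key, value in proposed.items():
        locked.setdefault(key, value)  # first write wins, later ones are ignored
    path.write_text(json.dumps(dossier, indent=2))
    return locked
```

With this shape, a second campaign that proposes a different price leaves the first price in place, while a fact that has never been set (say, an amenity) is added and then locked in the same way.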
```
study (vision asset index) ─→ ideate (fresh campaign) ─→ film ─→ voice + cut ─→ publish ─┐
  ▲                                                                                      │
  └────────────────────── grounded on the brand dossier ◀───────────────────────────────┘
```

Prefer to drive it by hand? The same brain runs from the Agent Loop tab in the UI, or from the terminal:

```
python -m app.agent list                                  # projects + dossier status
python -m app.agent run la-house-1 --format reel          # research → brand → brief
python -m app.agent assume la-house-1 price '$245/night'  # lock a consistent fact
python -m app.agent generate la-house-1 --format youtube  # render via the live API
```

Formats: `reel`, `tiktok`, `shorts`, `story`, `snap` (9:16), `youtube` (16:9), `square` (1:1), `portrait` (4:5). The render canvas switches automatically.

Requires Python 3.9+ and FFmpeg.

```
cd LiveHere
brew install ffmpeg              # one time
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env             # add your keys (OpenAI, Tavily, Cosmos)
python -m app                    # → http://127.0.0.1:8000
```

Open http://127.0.0.1:8000, pick a listing, tweak the auto-filled details, and hit Generate. With no GPU configured it runs on the free local FFmpeg stub; point it at Cosmos for the real thing (below).

The generation backend is swapped purely via env vars, with no code change to the UI or pipeline.

```
# .env
LIVEHERE_BACKEND=cosmos
COSMOS_API_STYLE=vllm_omni
COSMOS_BASE_URL=http://
```