Cosmos Claw: Hack on a Boat in SF (Nvidia Cosmos Based Social Media Manager)

wpnews.pro

It shoots, directs, and edits social videos for your venue — on autopilot, in every format, without a film crew.

Built for the Yacht Hackathon — by @ComposioHQ, @nebius, @tavily-ai & @openclaw.

** Watch the 60s demo on X** · or play it inline below 👇

Thirty seconds. That's all a viewer gives your venue before they swipe. To win those seconds you need a constant stream of video — but a human videographer is expensive, slow, and shoots one thing at a time.

Cosmos Claw is the always-on, AI-native alternative: a videographer and a marketing manager that never sleep. Point it at any venue — a short-let, a café, a bar — and it runs the whole studio on autopilot:

Studies your space — GPT-4o (vision) labels every photo and learns what each room is.Brands it — invents the positioning andlocks inthe missing facts (price, story, amenities) so they stay identical across every video.Ideates like a manager — brainstorms a fresh campaign for each post (angle, hook, photo order, format, music, voice) that's different from everything it has shipped before.Films real motion — NVIDIA Cosmos 3, aworld model built for robotics, doesn't pan over stills; it generates a first-person POV that physicallywalks into the room.Voices & cuts it — a unique GPT-writtenvoiceover over a mood-matched music bed, cross-faded into the right aspect ratio.Delivers everywhere— ready-to-post cards (caption, hashtags, handle, recommended audio) in every feed: Reels/TikTok 9:16, IG 1:1 & 4:5, YouTube 16:9.

No crew, no call sheet, no edit bay. Just your existing photos in — a full, on-brand social calendar out, on repeat.

As you read this, two workers are filming in parallel — pumping out a stream of ready-to-post Reels and TikToks for two San Francisco venues at once, each with its own AI voiceover — all on a Cosmos 3 model we deployed ourselves on Nebius H200s. It s when the network blips and resumes on its own. Truly always-on.

Looking forward to expanding this on the Yacht — SF is

sooooamazing 🌉⛵️

1 — Raw photos in. Drop a venue's existing images into the project. That's the only input. (Here: the Alamo Square Hacker House — bedrooms, gym, coworking.)

2 — The marketing manager's memory. An OpenClaw-style GPT-4o manager researches the venue, locks in a consistent brand (positioning, audience, tone, pitch), writes the voiceover, and picks the assets & order — the durable memory every video is grounded on.

3 — Ready-to-post cuts out. The Agent Loop streams everything the videographer does in real time, and each published cut is a ready-to-post package: video + caption + recommended audio (music & voice) + handle, ready to download or push to the channel.

Layer	What we used
🎥 Video model
NVIDIA Cosmos 3 Nano — a world model (built for robotics/embodied POV), self-deployed by us for first-person walk-throughs
⚡ Compute
— NVIDIA®
H200 NVLink GPUs
🧠 Manager + director
GPT-4o (vision) — studies the photos, brands the venue, ideates each campaign & storyboard
🔎 Neighborhood research
— enriches each venue with real local context

🗺️ Maps & info cards
OpenStreetMap — location, transit & nearby spots
🔊 Audio
OpenAI TTS — a unique per-cut voiceover over a mood-matched music bed
🧩 App
FastAPI Studio UI + an always-on `marketing_loop` driver, FFmpeg for cutting/transitions

We didn't just call a hosted API — we stood up Cosmos 3 Nano ourselves on Nebius H200 NVLink GPUs (vLLM-Omni, OpenAI-compatible) and drove it end-to-end. Tavily researches the surrounding neighborhood so every second of the video carries the context a viewer needs to say yes.

Shout-out to the partners: @ship_builders · @nebiusai · @nvidia · @composio · @tavilyai · @openclaw

venue photos + facts
        │
        ▼
  GPT-4o manager  ──→  brand dossier (positioning + durable assumptions)
        │                     │
        │                     ├─ Tavily ─→ neighborhood research
        │                     └─ ideate ─→ one fresh campaign (angle · photos ·
        │                                   format · music · voice · caption · VO)
        ▼
  NVIDIA Cosmos 3 Nano  ──→  a short first-person POV clip per beat
   (world model, self-hosted on Nebius H200)
        │
        ▼
  transitions + audio  ──→  cross-fade · GPT voiceover · mood music · reframe
        │
        ▼
  ready-to-post cut.mp4  ──→  Agent Loop feed (caption · hashtags · audio)
        │
        └──────────────  loop: next idea, next venue (in parallel)

File	Role
`scripts/marketing_loop.py`
The always-on loop: study → ideate → film → voice → publish, per venue (parallel-safe)
`scripts/cosmos_montage.py`
Terminal montage: GPT vision per photo → Cosmos clips → fast transitions
`app/marketing_agent.py`
GPT-4o marketing manager: research → brand → brief
`app/brand.py`
Per-venue brand dossier (memory, durable assumptions, social posts)
`app/main.py`
FastAPI server + Studio UI + generation API
`app/trailer.py`
GPT-4o "director" — storyboard, shot/motion + walk-through mode
`app/generation/cosmos.py`
NVIDIA Cosmos 3 image→video adapter (motion → flow-shift)
`app/generation/stub.py`
Free local FFmpeg fallback generator
`app/transitions.py`
Fast cross-fade montage into any aspect ratio (xfade)
`app/curation.py`
Best-of-N take scoring (motion energy + stability)
`app/audio.py`
TTS voiceover + mood music bed + duck-and-mux
`app/infocards.py`
Map / price / neighborhood cards (OpenStreetMap)
`app/pipeline.py`
Orchestrates a single run (best-of-N, info beats, finish)
`app/agent.py`
Terminal CLI to drive the manager + fire renders
`deploy/tunnel_keeper.sh`
Self-healing SSH tunnel to the Nebius GPU

Cosmos Claw isn't a one-shot tool — it's a loop. A persistent brand dossier (outputs/listing_{id}_brand.json

) is the single source of truth per venue, and an autonomous manager works against it the way a real social-media manager would — forever:

Study— GPT-4o vision builds anasset index: what every uploaded photo is (cached, so it's paid for once).Ideate— brainstorms ONE fresh campaigndistinct from past themes: the angle, which photos to use (ordered like a story), the social format, music mood, TTS voice, a ready-to-post caption + hashtags, and a ~25svoiceover script.** Film**— turns the chosen photos into short, first-person Cosmos clips.** Cut**— cross-fades them into the campaign's aspect ratio and mixes the GPT voiceover over a mood-matched music bed.** Publish**— drops a ready-to-post card into the** Agent Loop**feed and logs every step to the dossier timeline.

…then it does it again, with a brand-new idea. Run one worker per venue and they generate in parallel, so multiple feeds fill at once:

python scripts/marketing_loop.py --projects la-house-1   --tag la --max-videos 6
python scripts/marketing_loop.py --projects hacker-house --tag hh --max-videos 6

It's built to run unattended: a live endpoint probe before every shot means a Wi-Fi/tunnel blip just s the shoot and resumes when the connection is back — no babysitting, no half-burned campaigns. A self-healing SSH tunnel keeper (deploy/tunnel_keeper.sh

) keeps the link to the GPU alive underneath.

Consistency is the trick. Whatever the manager makes up, it makes up once: build_brand

writes the missing facts as durable assumptions that are never overwritten, so price, amenities and host story stay identical across every cut.

study (vision asset index) ─→ ideate (fresh campaign) ─→ film ─→ voice + cut ─→ publish ─┐
        ▲                                                                                 │
        └───────────────────────  grounded on the brand dossier  ◀──────────────────────┘

Prefer to drive it by hand? The same brain runs from the Agent Loop tab in the UI, or from the terminal:

python -m app.agent list                                  # projects + dossier status
python -m app.agent run la-house-1 --format reel          # research → brand → brief
python -m app.agent assume la-house-1 price "$245/night"  # lock a consistent fact
python -m app.agent generate la-house-1 --format youtube  # render via the live API

Formats: reel

, tiktok

, shorts

, story

, snap

(9:16), youtube

(16:9), square

(1:1), portrait

(4:5). The render canvas switches automatically.

Requires Python 3.9+ and FFmpeg.

cd LiveHere

brew install ffmpeg                 # one time

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

cp .env.example .env                # add your keys (OpenAI, Tavily, Cosmos)

python -m app                       # → http://127.0.0.1:8000

Open http://127.0.0.1:8000, pick a listing, tweak the auto-filled details, and hit Generate. With no GPU configured it runs on the free local FFmpeg stub; point it at Cosmos for the real thing (below).

The generation backend is swapped purely via env vars — no code change to the UI or pipeline.

LIVEHERE_BACKEND=cosmos
COSMOS_API_STYLE=vllm_omni
COSMOS_BASE_URL=http://<your-gpu-host>:8000/v1
COSMOS_API_KEY=...

We self-hosted it on a Nebius H200 NVLink instance with vLLM-Omni:

vllm serve nvidia/Cosmos3-Nano --omni --host 0.0.0.0 --port 8000 --no-guardrails

Full deploy walkthrough (Nebius / Modal / RunPod) is in deploy/DEPLOY.md. Cosmos can't run on Apple Silicon — keep the GPU instance up only while generating, and tear it down when idle.

Cosmos Claw · made with ☕ for the Yacht Hackathon · Composio × Nebius × Tavily

source & further reading

github.com — original article

Cosmos Claw: Hack on a Boat in SF (Nvidia Cosmos Based Social Media Manager)

Run your AI side-project on zahid.host