Training interactive world models requires data that is notoriously hard to find: egocentric video sequences with densely aligned action signals (keyboard inputs, camera motion, and ego state), all synchronized to the visual stream.
Real-world embodied data is costly to collect, while synthetic data often lacks the visual richness or behavioral diversity needed for generalization. Counter-Strike 2 demos offer a compelling middle ground: because matches are recorded as deterministic replays, we can reconstruct clean first-person video at any point in a match and extract the precise control inputs that drove each visual change. For these reasons, Counter-Strike is fast becoming a popular substrate for embodied AI and world-model research, with recent efforts such as EgoCS-400k reflecting growing community interest in it as a rich source of egocentric training data.
Today we release CS2-10k, a large-scale egocentric gameplay dataset built from professional CS2 matches. It contains 600,000+ player-round videos spanning 10,000+ hours of first-person footage, paired with per-frame annotations covering keyboard state, mouse movement, and 3D player trajectory. Alongside this ready-to-use dataset, we are also releasing the ready-to-extend cs2-dem-renderer, the open-source pipeline used to produce it. All of this, so we can build better world models, together.
Dataset Overview
CS2-10k is built from public professional match demos sourced from HLTV. For each demo, we render clean first-person video at 720p, 48fps using the demo replay tool inside CS2, producing one video per player per round. Alongside each video, we store a parquet file containing per-frame annotations synchronized to the video timeline.
Annotation Schema
Every video clip has its corresponding annotations stored in a .parquet file:
| Field | Type | Description |
|---|---|---|
| | string | Map name (e.g. "mirage", "dust2") |
| | int | Round within the match |
| | int | 0 = Counter-Terrorist, 1 = Terrorist |
| | int | Total frames in the clip |
| | float | Video frame rate (48.0) |
| | float | Clip duration in seconds |
| | float | Camera field of view (90.0°) |
| frame_data | list[dict] | Per-frame annotation array (see below) |
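
Because the frame rate is constant, a frame index maps directly to a timestamp, and the clip-level metadata can be sanity-checked against itself. The following is a minimal sketch of that arithmetic; the dictionary keys `num_frames`, `fps`, and `duration` are hypothetical stand-ins, since the actual column names may differ.

```python
def frame_to_seconds(frame_index: int, fps: float = 48.0) -> float:
    """Timestamp in seconds of the start of a frame at a constant frame rate."""
    return frame_index / fps


def metadata_is_consistent(num_frames: int, fps: float, duration: float,
                           tolerance_frames: float = 1.0) -> bool:
    """Check that duration agrees with num_frames / fps to within a few frames."""
    return abs(num_frames / fps - duration) <= tolerance_frames / fps


# Hypothetical clip metadata: key names and values are illustrative only.
clip = {"num_frames": 4800, "fps": 48.0, "duration": 100.0}

print(frame_to_seconds(2400, clip["fps"]))  # 50.0
print(metadata_is_consistent(clip["num_frames"], clip["fps"], clip["duration"]))  # True
```

A check like this is a cheap way to catch truncated or re-encoded clips before they reach a training loader.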
Per-Frame Annotations
Each entry in frame_data contains:
| Field | Description |
|---|---|
| | Concatenated active keys |
| | Horizontal camera delta (proxy for mouse X movement) |
| | Vertical camera delta (proxy for mouse Y movement) |
| | Player world position in game units |
| | Camera yaw angle (−180° to 180°) |
| | Camera pitch angle (−90° to 90°) |
The combination of video and per-frame control signals creates a tight action-observation loop.
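
Since yaw is reported in the range −180° to 180°, any code that recomputes a horizontal camera delta from consecutive yaw values has to handle the wrap-around at the ±180° boundary. The sketch below shows one common way to do that. It is an illustration only: the function names are made up here, and the sign convention of the dataset's own camera delta field is not assumed.

```python
def wrapped_yaw_delta(prev_yaw: float, curr_yaw: float) -> float:
    """Smallest signed change in yaw, in degrees, mapped into [-180, 180)."""
    return (curr_yaw - prev_yaw + 180.0) % 360.0 - 180.0


def yaw_deltas(yaws: list[float]) -> list[float]:
    """Per-frame yaw change for a sequence of yaw angles."""
    return [wrapped_yaw_delta(a, b) for a, b in zip(yaws, yaws[1:])]


# The camera turns steadily across the +180 / -180 boundary.
yaws = [170.0, 179.0, -175.0, -170.0]
print(yaw_deltas(yaws))  # [9.0, 6.0, 5.0]
```

A naive subtraction would report a jump of −354° between the second and third frames, which a world model would see as a violent camera flick rather than a 6° turn.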
No Abrupt Visual Changes
Each clip is a contiguous segment of a single round from a single player's perspective. There are no mid-round cuts, no editing transitions, and no HUD. The camera moves through the world in a physically plausible way, and we hide the player's weapon to remove the sudden visual changes caused by recoil, reloads, and weapon switching.
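
One way to spot-check this continuity on your own copy of the data is to look at the largest frame-to-frame displacement in a clip's position track. The sketch below assumes positions are available as a list of (x, y, z) tuples in game units; the helper names and the threshold value are hypothetical and would need tuning against real clips.

```python
import math


def max_step(positions):
    """Largest frame-to-frame displacement, in game units, along a trajectory."""
    steps = (math.dist(a, b) for a, b in zip(positions, positions[1:]))
    return max(steps, default=0.0)


def has_jump(positions, max_units_per_frame):
    """True if any single frame moves the player farther than the threshold."""
    return max_step(positions) > max_units_per_frame


smooth = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (6.0, 8.0, 0.0)]
teleport = smooth + [(1000.0, 0.0, 0.0)]

# The threshold of 10 units per frame is an illustrative value, not a dataset constant.
print(max_step(smooth))          # 5.0
print(has_jump(smooth, 10.0))    # False
print(has_jump(teleport, 10.0))  # True
```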
Many Use Cases
CS2-10k is designed for training interactive world models that learn how first-person visual observations change in response to player actions. The same aligned video, control, and state signals also support a range of related research workflows:
Rendering Pipeline
If CS2-10k does not cover the scale, matches, or annotations you need, you can use our open-source pipeline at github.com/reka-ai/cs2-dem-renderer to render your own CS2 datasets. Given a .dem file, it performs a two-pass parse to extract per-player spawn/death intervals and per-frame button inputs, then drives CS2's built-in demo replay system to render first-person video for each player in each round. Frames are streamed in real time from CS2's movie output to ffmpeg (VAAPI HEVC), producing .mp4 clips alongside synchronized .parquet annotation files. A worker mode processes entire directories of demos with automatic deduplication, making it straightforward to run at the scale of CS2-10k.
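
To make the idea of spawn/death intervals concrete, here is a rough sketch of how such intervals could be derived from a stream of events and converted to video frame counts. It is not the cs2-dem-renderer implementation: the event tuple format, the function names, and the 64 ticks-per-second demo rate are all assumptions made for illustration.

```python
from collections import defaultdict

TICK_RATE = 64  # assumption: demo ticks per second
VIDEO_FPS = 48  # frame rate stated for CS2-10k


def alive_intervals(events):
    """Pair each spawn with the next death (or round end) for every player.

    `events` is an iterable of (tick, player, kind) tuples with kind in
    {"spawn", "death", "round_end"}. This event format is hypothetical.
    """
    open_spawn = {}                # player -> tick of the unmatched spawn
    intervals = defaultdict(list)  # player -> [(start_tick, end_tick), ...]
    for tick, player, kind in sorted(events, key=lambda e: e[0]):
        if kind == "spawn":
            open_spawn[player] = tick
        elif kind == "death" and player in open_spawn:
            intervals[player].append((open_spawn.pop(player), tick))
        elif kind == "round_end":
            # Survivors remain alive until the round ends.
            for survivor, start in open_spawn.items():
                intervals[survivor].append((start, tick))
            open_spawn.clear()
    return dict(intervals)


def ticks_to_frames(tick_count):
    """Number of whole video frames covered by a span of demo ticks."""
    return tick_count * VIDEO_FPS // TICK_RATE


events = [
    (0, "a", "spawn"),
    (0, "b", "spawn"),
    (640, "a", "death"),
    (1280, None, "round_end"),
]
intervals = alive_intervals(events)
print(intervals)
print({p: [ticks_to_frames(end - start) for start, end in spans]
       for p, spans in intervals.items()})
```

Under these assumptions, player "a" is alive for 640 ticks (480 frames) and player "b" for the full 1280 ticks (960 frames), so each player-round clip only covers the span in which that player is actually in the world.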
Citation
If you use CS2-10k in your work, please cite: