{"slug": "how-to-use-agenttrove-streaming-1-7m-agentic-traces-and-building-a-clean-sft-in", "title": "How to Use AgentTrove: Streaming 1.7M Agentic Traces and Building a Clean ShareGPT SFT Dataset in Python", "summary": "AgentTrove, one of the largest open-source collections of agentic interaction traces, is now available for streaming analysis without requiring full dataset downloads. Researchers can inspect conversation schemas, normalize agent turns, and parse command-style assistant outputs using Python tools from the datasets and pandas libraries. The workflow enables sampling thousands of traces, summarizing turn-level statistics, and exporting successful trajectories into a clean ShareGPT-style JSONL format for supervised fine-tuning.", "body_md": "In this tutorial, we explore [AgentTrove](https://huggingface.co/datasets/open-thoughts/AgentTrove), one of the largest open-source collections of agentic interaction traces, and learn how to work with it efficiently. Instead of downloading the full dataset, we use streaming to inspect rows, detect the conversation schema, normalize agent turns, and understand how user, assistant, system, and tool messages are structured. We also build utilities to parse command-style assistant outputs, render complete trajectories in a readable format, and study how agents interact with tools across different tasks. Finally, we build a lightweight analytical workflow that samples thousands of traces, converts them into a DataFrame, summarizes turn-level statistics, visualizes important dataset patterns, and exports successful traces into a clean ShareGPT-style JSONL format for supervised fine-tuning.\n\n```\n!pip -q install \"datasets>=2.19\" pandas matplotlib pyarrow huggingface_hub\nimport itertools, json, collections, textwrap, re, random, statistics\nimport pandas as pd\nimport matplotlib.pyplot as plt\nfrom datasets import load_dataset\nREPO = \"open-thoughts/AgentTrove\"\nrandom.seed(0)\nprint(\"✅ Imports ready. 
Target dataset:\", REPO)\nds = load_dataset(REPO, split=\"train\", streaming=True)\nprint(\"✅ Streaming dataset opened.\")\nfirst = next(iter(ds))\nprint(\"\\n🔎 Columns present in a row:\")\nfor k in first.keys():\n   v = first[k]\n   t = type(v).__name__\n   preview = (str(v)[:70] + \"…\") if v is not None and len(str(v)) > 70 else v\n   print(f\"   • {k:<18} ({t}): {preview}\")\n```\n\nWe install the required libraries and import the core tools needed for streaming, analysis, and visualization. We set the AgentTrove repository ID and open the dataset in streaming mode, which avoids downloading the full dataset locally. We then inspect the first row to understand the available columns and get an initial view of the dataset schema.\n\n``` python\ndef find_trace_key(row):\n   for cand in (\"conversations\", \"messages\"):\n       if cand in row and isinstance(row[cand], list):\n           return cand\n   for k, v in row.items():\n       if isinstance(v, list) and v and isinstance(v[0], dict) and \\\n          (\"content\" in v[0] or \"role\" in v[0] or \"value\" in v[0]):\n           return k\n   raise KeyError(\"No conversation-like column found.\")\nTRACE_KEY = find_trace_key(first)\nprint(f\"\\n✅ Trace column detected: '{TRACE_KEY}'\")\ndef normalize_turns(trace):\n   turns = []\n   for turn in trace:\n       if not isinstance(turn, dict):\n           turns.append((\"unknown\", str(turn)))\n           continue\n       role = turn.get(\"role\") or turn.get(\"from\") or \"unknown\"\n       content = turn.get(\"content\")\n       if content is None:\n           content = turn.get(\"value\", \"\")\n       turns.append((str(role), \"\" if content is None else str(content)))\n   return turns\nsample_turns = normalize_turns(first[TRACE_KEY])\nprint(f\"✅ First trace has {len(sample_turns)} turns. \"\n     f\"Roles: {collections.Counter(r for r, _ in sample_turns)}\")\n```\n\nWe create a defensive function to automatically detect the column that contains the conversation or trace data. We then normalize each turn into a consistent role-content format so that different dataset schemas can be handled smoothly. We also inspect the first trajectory to count the number of turns and understand the roles present in the conversation.\n\n``` python\ndef extract_commands(assistant_content):\n   \"\"\"Best-effort: pull shell commands out of an assistant JSON turn.\"\"\"\n   cmds = []\n   txt = re.sub(r\"```(?:json)?|```\", \"\", assistant_content).strip()\n   try:\n       obj = json.loads(txt)\n   except Exception:\n       return cmds\n   def walk(o):\n       if isinstance(o, dict):\n           for key in (\"commands\", \"command\", \"keystrokes\", \"cmd\", \"action\"):\n               if key in o:\n                   val = o[key]\n                   if isinstance(val, str):\n                       cmds.append(val.strip())\n                   elif isinstance(val, list):\n                       # Collect string items only; nested dicts/lists are visited\n                       # once by the generic recursion below, which avoids duplicates.\n                       cmds.extend(item.strip() for item in val if isinstance(item, str))\n           for v in o.values():\n               if isinstance(v, (dict, list)):\n                   walk(v)\n       elif isinstance(o, list):\n           for v in o:\n               walk(v)\n   walk(obj)\n   return [c for c in cmds if c]\n```\n\nWe define a command-extraction utility that reads assistant responses and attempts to parse shell commands from JSON-style outputs. We strip any code fences, load the content as JSON, and recursively search through common command-related fields. 
This helps us identify tool-like actions inside agent trajectories and measure how often agents issue executable commands.\n\n``` python\ndef render_trace(row, max_chars=600):\n   meta = {k: row.get(k) for k in\n           (\"original_source\", \"original_teacher\", \"model\", \"task\",\n            \"result\", \"reward\", \"model_provider\") if k in row}\n   print(\"=\" * 78)\n   print(\"📦 METADATA:\", {k: v for k, v in meta.items() if v is not None})\n   print(\"=\" * 78)\n   for i, (role, content) in enumerate(normalize_turns(row[TRACE_KEY])):\n       tag = {\"system\": \"⚙️ SYSTEM\", \"user\": \"👤 USER\",\n              \"assistant\": \"🤖 ASSISTANT\", \"tool\": \"🛠️ TOOL\"}.get(role, f\"❓ {role.upper()}\")\n       snippet = content if len(content) <= max_chars else content[:max_chars] + \" …[truncated]\"\n       print(f\"\\n[{i}] {tag}\")\n       print(textwrap.indent(snippet, \"    \"))\n       if role == \"assistant\":\n           for c in extract_commands(content)[:5]:\n               print(f\"      └─⌨️  parsed command: {c!r}\")\n   print(\"=\" * 78, \"\\n\")\nprint(\"\\n📜 EXAMPLE TRAJECTORY (first row):\")\nrender_trace(first, max_chars=400)\n```\n\nWe build a trace-rendering function that prints the metadata and the full conversation trajectory in a readable format. 
We label each turn by role, truncate long messages for clarity, and show parsed commands under assistant messages.\n\n``` python\nN = 2000\nrecords = []\nprint(f\"\\n⏳ Streaming {N} rows for analysis…\")\nfor row in itertools.islice(load_dataset(REPO, split=\"train\", streaming=True), N):\n   turns = normalize_turns(row[TRACE_KEY])\n   roles = collections.Counter(r for r, _ in turns)\n   total_chars = sum(len(c) for _, c in turns)\n   asst_cmds = sum(len(extract_commands(c)) for r, c in turns if r == \"assistant\")\n   records.append({\n       \"original_source\":  row.get(\"original_source\"),\n       \"original_teacher\": row.get(\"original_teacher\"),\n       \"model\":            row.get(\"model\"),\n       \"model_provider\":   row.get(\"model_provider\"),\n       \"result\":           row.get(\"result\"),\n       \"reward\":           row.get(\"reward\"),\n       \"n_turns\":          len(turns),\n       \"n_user\":           roles.get(\"user\", 0),\n       \"n_assistant\":      roles.get(\"assistant\", 0),\n       \"n_tool\":           roles.get(\"tool\", 0),\n       \"total_chars\":      total_chars,\n       \"n_commands\":       asst_cmds,\n   })\ndf = pd.DataFrame(records)\nprint(f\"✅ Built DataFrame: {df.shape[0]} rows × {df.shape[1]} cols\")\nprint(\"\\n📊 Numeric summary (turns / length / commands):\")\nprint(df[[\"n_turns\", \"n_assistant\", \"n_tool\", \"total_chars\", \"n_commands\"]]\n     .describe().round(1).to_string())\ndef show_dist(col, top=15):\n   if col in df and df[col].notna().any():\n       print(f\"\\n🏷️  Top values for '{col}':\")\n       print(df[col].value_counts(dropna=True).head(top).to_string())\n   else:\n       print(f\"\\n🏷️  '{col}' is empty/absent in this sample.\")\nfor c in (\"original_source\", \"original_teacher\", \"model\", \"model_provider\", \"result\"):\n   show_dist(c)\n```\n\nWe stream a sample of rows from AgentTrove and collect useful statistics, including turn counts, tool usage, total characters, and parsed command counts. We store these lightweight features in a pandas DataFrame to make the dataset easier to summarize and analyze. We also print distribution tables for fields such as source, teacher model, model provider, and result to understand where the traces originate.\n\n``` python\nfig, axes = plt.subplots(2, 2, figsize=(14, 10))\nsrc = df[\"original_source\"].value_counts().head(10)\naxes[0, 0].barh(src.index[::-1], src.values[::-1], color=\"#4C72B0\")\naxes[0, 0].set_title(\"Top 10 Task Sources\"); axes[0, 0].set_xlabel(\"traces\")\ntch = df[\"original_teacher\"].value_counts().head(10)\naxes[0, 1].barh(tch.index[::-1], tch.values[::-1], color=\"#55A868\")\naxes[0, 1].set_title(\"Teacher Models\"); axes[0, 1].set_xlabel(\"traces\")\naxes[1, 0].hist(df[\"n_turns\"].clip(upper=df[\"n_turns\"].quantile(0.97)),\n               bins=30, color=\"#C44E52\", edgecolor=\"white\")\naxes[1, 0].set_title(\"Turns per Trajectory (97th-pct clipped)\")\naxes[1, 0].set_xlabel(\"turns\"); axes[1, 0].set_ylabel(\"count\")\naxes[1, 1].scatter(df[\"n_assistant\"], df[\"n_commands\"], alpha=0.3, s=12, color=\"#8172B2\")\naxes[1, 1].set_title(\"Assistant Turns vs. Parsed Commands\")\naxes[1, 1].set_xlabel(\"assistant turns\"); axes[1, 1].set_ylabel(\"shell commands extracted\")\nplt.tight_layout(); plt.show()\n```\n\nWe create four visualizations to explore the sampled traces from different angles. We plot the top task sources, teacher models, turn-count distribution, and the relationship between assistant turns and parsed commands. 
These charts help us quickly identify patterns in the dataset and understand how agent behavior varies across sources and tasks.\n\n``` python\ndef is_success(row):\n   # Coerce to str so a non-string result value does not break .lower()\n   res = str(row.get(\"result\") or \"\").strip().lower()\n   if res in (\"resolved\", \"success\", \"pass\", \"passed\", \"correct\"):\n       return True\n   rw = row.get(\"reward\")\n   try:\n       return float(rw) >= 1.0\n   except (TypeError, ValueError):\n       return False\n# ShareGPT names the speakers \"human\" and \"gpt\"; other roles pass through unchanged\nSHAREGPT_ROLES = {\"user\": \"human\", \"assistant\": \"gpt\"}\nout_path = \"agenttrove_clean_sft.jsonl\"\nkept, scanned, SCAN, KEEP = 0, 0, 1500, 200\nprint(f\"\\n⏳ Scanning up to {SCAN} rows, keeping up to {KEEP} successful traces…\")\nwith open(out_path, \"w\", encoding=\"utf-8\") as f:\n   for row in itertools.islice(load_dataset(REPO, split=\"train\", streaming=True), SCAN):\n       scanned += 1\n       if not is_success(row):\n           continue\n       turns = normalize_turns(row[TRACE_KEY])\n       conv = [{\"from\": SHAREGPT_ROLES.get(r, r), \"value\": c} for r, c in turns if c.strip()]\n       if len(conv) < 2:\n           continue\n       f.write(json.dumps({\n           \"conversations\": conv,\n           \"source\": row.get(\"original_source\"),\n           \"teacher\": row.get(\"original_teacher\"),\n       }) + \"\\n\")\n       kept += 1\n       if kept >= KEEP:\n           break\nprint(f\"✅ Scanned {scanned} rows → wrote {kept} clean traces to '{out_path}'\")\ndef search_traces(keyword=None, source=None, limit=3, scan=3000):\n   \"\"\"Stream the dataset and print traces that match the filters.\"\"\"\n   hits = 0\n   for row in itertools.islice(load_dataset(REPO, split=\"train\", streaming=True), scan):\n       if source and row.get(\"original_source\") != source:\n           continue\n       if keyword:\n           blob = \" \".join(c for _, c in normalize_turns(row[TRACE_KEY]))\n           if keyword.lower() not in blob.lower():\n               continue\n       render_trace(row, max_chars=300)\n       hits += 1\n       if hits >= limit:\n           break\n   if hits == 0:\n       print(\"No matches in the scanned window 
- try increasing `scan`.\")\nprint(\"\\n🔍 Searching for 'nl2bash' source traces:\")\nsearch_traces(source=\"nl2bash\", limit=2, scan=4000)\nprint(\"\\n🎉 Tutorial complete! Next ideas:\")\nprint(\"   • Increase N / SCAN for bigger analyses.\")\nprint(\"   • Filter by original_source (swesmith, codeforces, r2egym…) for a domain SFT set.\")\nprint(\"   • Feed agenttrove_clean_sft.jsonl into Axolotl / LLaMA-Factory for fine-tuning.\")\n```\n\nWe define a success filter that retains traces marked as resolved, passed, or correct, or that carry a reward of at least 1.0. We then export successful trajectories into a clean ShareGPT-style JSONL file for downstream fine-tuning workflows. We also add a search utility to find traces by keyword or source, making the dataset easier to explore for specific agentic tasks.\n\nIn conclusion, we built a complete, hands-on pipeline to inspect, analyze, filter, and export data from AgentTrove in a Colab-friendly way. We started with streaming access, then progressively added schema detection, turn normalization, command extraction, trajectory rendering, statistical analysis, visualization, success-based filtering, and keyword or source-based search. This workflow helps us understand the internal structure of agentic traces and gives us a reusable foundation for preparing high-quality subsets for fine-tuning or evaluation. We also keep the process scalable by avoiding full dataset downloads and using streamed samples only when needed. Finally, we demonstrated how AgentTrove can be used as more than a static dataset: we treated it as a rich source of agent behavior, tool usage, task outcomes, and training-ready conversations that can support future experiments in agent learning, workflow analysis, and domain-specific SFT dataset creation.\n\nSana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.", "url": "https://wpnews.pro/news/how-to-use-agenttrove-streaming-1-7m-agentic-traces-and-building-a-clean-sft-in", "canonical_source": "https://www.marktechpost.com/2026/05/29/how-to-use-agenttrove-streaming-1-7m-agentic-traces-and-building-a-clean-sharegpt-sft-dataset-in-python/", "published_at": "2026-05-30 00:46:05+00:00", "updated_at": "2026-05-30 00:55:07.698043+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "ai-agents", "ai-tools"], "entities": ["AgentTrove", "Hugging Face", "ShareGPT", "Python"], "alternates": {"html": "https://wpnews.pro/news/how-to-use-agenttrove-streaming-1-7m-agentic-traces-and-building-a-clean-sft-in", "markdown": "https://wpnews.pro/news/how-to-use-agenttrove-streaming-1-7m-agentic-traces-and-building-a-clean-sft-in.md", "text": "https://wpnews.pro/news/how-to-use-agenttrove-streaming-1-7m-agentic-traces-and-building-a-clean-sft-in.txt", "jsonld": "https://wpnews.pro/news/how-to-use-agenttrove-streaming-1-7m-agentic-traces-and-building-a-clean-sft-in.jsonld"}}