How to Use AgentTrove: Streaming 1.7M Agentic Traces and Building a Clean ShareGPT SFT Dataset in Python

AgentTrove, one of the largest open-source collections of agentic interaction traces, is now available for streaming analysis without requiring full dataset downloads. Researchers can inspect conversation schemas, normalize agent turns, and parse command-style assistant outputs using Python tools from the datasets and pandas libraries. The workflow enables sampling thousands of traces, summarizing turn-level statistics, and exporting successful trajectories into a clean ShareGPT-style JSONL format for supervised fine-tuning.

In this tutorial, we explore AgentTrove https://huggingface.co/datasets/open-thoughts/AgentTrove , one of the largest open-source collections of agentic interaction traces, and learn how to work with it efficiently. Instead of downloading the full dataset, we use streaming to inspect rows, detect the conversation schema, normalize agent turns, and understand how user, assistant, system, and tool messages are structured. We also build utilities to parse command-style assistant outputs, render complete trajectories in a readable format, and study how agents interact with tools across different tasks. Also, we create a lightweight analytical workflow that samples thousands of traces, converts them into a DataFrame, summarizes turn-level statistics, visualizes important dataset patterns, and exports successful traces into a clean ShareGPT-style JSONL format for supervised fine-tuning. pip -q install "datasets =2.19" pandas matplotlib pyarrow huggingface hub import itertools, json, collections, textwrap, re, random, statistics import pandas as pd import matplotlib.pyplot as plt from datasets import load dataset REPO = "open-thoughts/AgentTrove" random.seed 0 print "✅ Imports ready. Target dataset:", REPO ds = load dataset REPO, split="train", streaming=True print "✅ Streaming dataset opened." first = next iter ds print "\n🔎 Columns present in a row:" for k in first.keys : v = first k t = type v . name preview = str v :70 + "…" if v is not None and len str v 70 else v print f" • {k:<18} {t} : {preview}" We install the required libraries and import the core tools needed for streaming, analysis, and visualization. We define the AgentTrove repository, open the dataset in streaming mode, and avoid downloading the full dataset locally. We then inspect the first row to understand the available columns and get an initial view of the dataset schema. python def find trace key row : for cand in "conversations", "messages" : if cand in row and isinstance row cand , list : return cand for k, v in row.items : if isinstance v, list and v and isinstance v 0 , dict and \ "content" in v 0 or "role" in v 0 or "value" in v 0 : return k raise KeyError "No conversation-like column found." TRACE KEY = find trace key first print f"\n✅ Trace column detected: '{TRACE KEY}'" def normalize turns trace : turns = for turn in trace: if not isinstance turn, dict : turns.append "unknown", str turn continue role = turn.get "role" or turn.get "from" or "unknown" content = turn.get "content" if content is None: content = turn.get "value", "" turns.append str role , "" if content is None else str content return turns sample turns = normalize turns first TRACE KEY print f"✅ First trace has {len sample turns } turns. " f"Roles: {collections.Counter r for r, in sample turns }" We create a defensive function to automatically detect the column that contains the conversation or trace data. We then normalize each turn into a consistent role-content format so that different dataset schemas can be handled smoothly. We also inspect the first trajectory to count the number of turns and understand the roles present in the conversation. python def extract commands assistant content : """Best-effort: pull shell commands out of an assistant JSON turn.""" cmds = txt = re.sub r" ?:json ?| ", "", assistant content .strip try: obj = json.loads txt except Exception: return cmds def walk o : if isinstance o, dict : for key in "commands", "command", "keystrokes", "cmd", "action" : if key in o: val = o key if isinstance val, str : cmds.append val.strip elif isinstance val, list : for item in val: if isinstance item, str : cmds.append item.strip elif isinstance item, dict : walk item for v in o.values : if isinstance v, dict, list : walk v elif isinstance o, list : for v in o: walk v walk obj return c for c in cmds if c We define a command-extraction utility that reads assistant responses and attempts to parse shell commands from JSON-style outputs. We clean possible code fences, load the content as JSON, and recursively search through common command-related fields. This helps us identify tool-like actions inside agent trajectories and measure how often agents issue executable commands. python def render trace row, max chars=600 : meta = {k: row.get k for k in "original source", "original teacher", "model", "task", "result", "reward", "model provider" if k in row} print "=" 78 print "📦 METADATA:", {k: v for k, v in meta.items if v is not None} print "=" 78 for i, role, content in enumerate normalize turns row TRACE KEY : tag = {"system": "⚙️ SYSTEM", "user": "👤 USER", "assistant": "🤖 ASSISTANT", "tool": "🛠️ TOOL"}.get role, f"❓ {role.upper }" snippet = content if len content <= max chars else content :max chars + " … truncated " print f"\n {i} {tag}" print textwrap.indent snippet, " " if role == "assistant": for c in extract commands content :5 : print f" └─⌨️ parsed command: {c r}" print "=" 78, "\n" print "\n📜 EXAMPLE TRAJECTORY first row :" render trace first, max chars=400 We build a trace-rendering function that prints the metadata and the full conversation trajectory in a readable format. We label each turn by role, truncate long messages for clarity, and show parsed commands under assistant messages. N = 2000 records = print f"\n⏳ Streaming {N} rows for analysis…" for row in itertools.islice load dataset REPO, split="train", streaming=True , N : turns = normalize turns row TRACE KEY roles = collections.Counter r for r, in turns total chars = sum len c for , c in turns asst cmds = sum len extract commands c for r, c in turns if r == "assistant" records.append { "original source": row.get "original source" , "original teacher": row.get "original teacher" , "model": row.get "model" , "model provider": row.get "model provider" , "result": row.get "result" , "reward": row.get "reward" , "n turns": len turns , "n user": roles.get "user", 0 , "n assistant": roles.get "assistant", 0 , "n tool": roles.get "tool", 0 , "total chars": total chars, "n commands": asst cmds, } df = pd.DataFrame records print f"✅ Built DataFrame: {df.shape 0 } rows × {df.shape 1 } cols" print "\n📊 Numeric summary turns / length / commands :" print df "n turns", "n assistant", "n tool", "total chars", "n commands" .describe .round 1 .to string def show dist col, top=15 : if col in df and df col .notna .any : print f"\n🏷️ Top values for '{col}':" print df col .value counts dropna=True .head top .to string else: print f"\n🏷️ '{col}' is empty/absent in this sample." for c in "original source", "original teacher", "model", "model provider", "result" : show dist c We stream a sample of rows from AgentTrove and collect useful statistics, including turn counts, tool usage, total characters, and parsed command counts. We store these lightweight features in a pandas DataFrame to make the dataset easier to summarize and analyze. We also print distribution tables for fields such as source, teacher model, model provider, and result to understand where the traces originate. fig, axes = plt.subplots 2, 2, figsize= 14, 10 src = df "original source" .value counts .head 10 axes 0, 0 .barh src.index ::-1 , src.values ::-1 , color=" 4C72B0" axes 0, 0 .set title "Top 10 Task Sources" ; axes 0, 0 .set xlabel "traces" tch = df "original teacher" .value counts .head 10 axes 0, 1 .barh tch.index ::-1 , tch.values ::-1 , color=" 55A868" axes 0, 1 .set title "Teacher Models" ; axes 0, 1 .set xlabel "traces" axes 1, 0 .hist df "n turns" .clip upper=df "n turns" .quantile 0.97 , bins=30, color=" C44E52", edgecolor="white" axes 1, 0 .set title "Turns per Trajectory 97th-pct clipped " axes 1, 0 .set xlabel "turns" ; axes 1, 0 .set ylabel "count" axes 1, 1 .scatter df "n assistant" , df "n commands" , alpha=0.3, s=12, color=" 8172B2" axes 1, 1 .set title "Assistant Turns vs. Parsed Commands" axes 1, 1 .set xlabel "assistant turns" ; axes 1, 1 .set ylabel "shell commands extracted" plt.tight layout ; plt.show We create four visualizations to explore the sampled traces from different angles. We plot the top task sources, teacher models, turn-count distribution, and the relationship between assistant turns and parsed commands. These charts help us quickly identify patterns in the dataset and understand how agent behavior varies across sources and tasks. python def is success row : res = row.get "result" or "" .lower if res in "resolved", "success", "pass", "passed", "correct" : return True rw = row.get "reward" try: return float rw = 1.0 except TypeError, ValueError : return False out path = "agenttrove clean sft.jsonl" kept, scanned, SCAN, KEEP = 0, 0, 1500, 200 print f"\n⏳ Scanning up to {SCAN} rows, keeping up to {KEEP} successful traces…" with open out path, "w" as f: for row in itertools.islice load dataset REPO, split="train", streaming=True , SCAN : scanned += 1 if not is success row : continue turns = normalize turns row TRACE KEY conv = {"from": r, "value": c} for r, c in turns if c.strip if len conv < 2: continue f.write json.dumps { "conversations": conv, "source": row.get "original source" , "teacher": row.get "original teacher" , } + "\n" kept += 1 if kept = KEEP: break print f"✅ Scanned {scanned} rows → wrote {kept} clean traces to '{out path}'" def search traces keyword=None, source=None, limit=3, scan=3000 : """Stream the dataset and yield-print traces matching filters.""" hits = 0 for row in itertools.islice load dataset REPO, split="train", streaming=True , scan : if source and row.get "original source" = source: continue if keyword: blob = " ".join c for , c in normalize turns row TRACE KEY if keyword.lower not in blob.lower : continue render trace row, max chars=300 hits += 1 if hits = limit: break if hits == 0: print "No matches in the scanned window — try increasing scan ." print "\n🔍 Searching for 'nl2bash' source traces:" search traces source="nl2bash", limit=2, scan=4000 print "\n🎉 Tutorial complete Next ideas:" print " • Increase N / SCAN for bigger analyses." print " • Filter by original source swesmith, codeforces, r2egym… for a domain SFT set." print " • Feed agenttrove clean sft.jsonl into Axolotl / LLaMA-Factory for fine-tuning." We define a success filter that retains traces marked as resolved, passed, correct, or positively rewarded. We then export successful trajectories into a clean ShareGPT-style JSONL file for downstream fine-tuning workflows. Also, we add a search utility to find traces by keyword or source, making the dataset easier to explore for specific agentic tasks. In conclusion, we built a complete, hands-on pipeline to inspect, analyze, filter, and export data from AgentTrove in a Colab-friendly way. We started with streaming access, then progressively added schema detection, turn normalization, command extraction, trajectory rendering, statistical analysis, visualization, success-based filtering, and keyword or source-based search. This workflow helps us understand the internal structure of agentic traces and gives us a reusable foundation for preparing high-quality subsets for fine-tuning or evaluation. We also keep the process scalable by avoiding full dataset downloads and using streamed samples only when needed. Also, we demonstrated how AgentTrove can be used as more than a static dataset: we treated it as a rich source of agent behavior, tool usage, task outcomes, and training-ready conversations that can support future experiments in agent learning, workflow analysis, and domain-specific SFT dataset creation. Check out the Full Codes with Notebook. Also, feel free to follow us on and don’t forget to join our Twitter https://x.com/intent/follow?screen name=marktechpost and Subscribe to 150k+ ML SubReddit https://www.reddit.com/r/machinelearningnews/ . Wait are you on telegram? our Newsletter https://www.aidevsignals.com/ now you can join us on telegram as well. https://t.me/machinelearningresearchnews Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us https://forms.gle/wbash1wF6efRj8G58 Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions. - Sana Hassan - Sana Hassan - Sana Hassan - Sana Hassan