04:00
2026-06-24
arxiv.org
large-language-models
Beyond Trajectory Imitation: Strategy-Guided Policy Optimization for LLM Reasoning
Researchers propose Strategy-Guided Policy Optimization (SGPO), a method that replaces trajectory-level imitation with reusable strategy distillation to improve LLM reasoning. SGPO extracts structured…