This document is aimed at senior engineers, architects, and technical decision-makers.
Open source; feel free to give it a star on GitHub: https://github.com/sopaco/repomix-rs
Although current mainstream LLMs (Deepseek, GLM) have expanded their context windows, token costs grow linearly. A medium-sized project's complete source code often exceeds 100K tokens, surpassing the comfortable processing range of most models. Traditional solutions have structural flaws:
| Solution | Problem |
|---|---|
| Manual splitting + prompt engineering | High human cost, not scalable |
| RAG (vector retrieval) | Loses global structure; depends on embedding quality |
| Copy-paste into chat | Error-prone; cannot be automated |
| git archive + compression | AI cannot directly consume it |
repomix solves a more fundamental problem: how to transmit a codebase's structure and content in an AI-readable format, precisely, completely, and reproducibly.
The core constraint of AI engineering is the token budget. repomix-rs addresses three problems in a targeted way:
- tiktoken-rs (OpenAI o200k_base), fully aligned with GPT-4o billing.
- --split-output allows splitting by tokens, ensuring the context window is never exceeded.
- --compress (Tree-sitter) saves an average of 70% tokens without losing structural information.
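To make the token-budget idea concrete, here is a minimal sketch of greedy splitting by tokens. It assumes per-file token counts are already known; `split_by_token_budget` is a hypothetical helper for illustration, not the actual `--split-output` implementation.

```rust
/// Hypothetical helper (not part of repomix-rs): greedily groups files into
/// chunks so that each chunk's token total stays within `budget`.
/// A single file larger than the budget still gets a chunk of its own.
pub fn split_by_token_budget(files: &[(String, usize)], budget: usize) -> Vec<Vec<String>> {
    let mut chunks: Vec<Vec<String>> = Vec::new();
    let mut current: Vec<String> = Vec::new();
    let mut used = 0usize;

    for (path, tokens) in files {
        // Start a new chunk when adding this file would exceed the budget.
        if !current.is_empty() && used + tokens > budget {
            chunks.push(std::mem::take(&mut current));
            used = 0;
        }
        current.push(path.clone());
        used += tokens;
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

In this sketch a file that is larger than the budget on its own cannot be made to fit; a real splitter would have to split or compress that file further.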
┌──────────────────────────────────────────────────────────────────────────┐
│                            AI Consumer Layer                             │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────────────────────────┐  │
│  │ Claude       │  │ Cursor       │  │ Hermes Agent                   │  │
│  │ Desktop      │  │ IDE          │  │ Custom Agents                  │  │
│  └──────┬───────┘  └──────┬───────┘  └───────────────┬────────────────┘  │
│         └─────────────────┴────────┬─────────────────┘                   │
│                    MCP Protocol (JSON-RPC over stdio)                    │
│                                    ▼                                     │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │                      repomix-mcp (MCP Server)                      │  │
│  │  Tools: pack_codebase | pack_remote_repository                     │  │
│  │         read_repomix_output | grep_repomix_output                  │  │
│  └─────────────────────────────────┬──────────────────────────────────┘  │
│                                    │                                     │
│  ┌─────────────┐   ┌───────────────┴────────────────┐                    │
│  │ repomix-cli │   │          repomix-core          │                    │
│  │ (clap CLI)  │   │           (Library)            │                    │
│  └──────┬──────┘   └───────────────┬────────────────┘                    │
│         │                          │   repomix-config                    │
│         └──────────────────────────┤   (Config Schema)                   │
│              ┌─────────────────────┼─────────────────────┐               │
│              ▼                     ▼                     ▼               │
│      ┌────────────────┐    ┌────────────────┐    ┌────────────────┐      │
│      │ File Collector │    │   Processor    │    │   Git Intg.    │      │
│      │  (rayon par.)  │    │ (tree-sitter)  │    │   (git CLI)    │      │
│      └───────┬────────┘    └───────┬────────┘    └───────┬────────┘      │
│              ├─────────────────────┼─────────────────────┤               │
│              ▼                     ▼                     ▼               │
│      ┌────────────────┐    ┌────────────────┐    ┌────────────────┐      │
│      │  File System   │    │   Secretlint   │    │  tiktoken-rs   │      │
│      │   (tokio fs)   │    │   (Security)   │    │   (Tokenize)   │      │
│      └────────────────┘    └────────────────┘    └────────────────┘      │
└──────────────────────────────────────────────────────────────────────────┘
repomix-rs adopts a 5-Crate Cargo Workspace architecture, aligned with Rust ecosystem best practices for layered design:
repomix-rs/
├── crates/
│   ├── repomix-core/      ← Core engine (public API)
│   ├── repomix-config/    ← Config types + default modes
│   ├── repomix-shared/    ← Cross-crate shared types
│   ├── repomix-cli/       ← CLI entry point (depends on core + config)
│   └── repomix-mcp/       ← MCP Server (depends on core + shared)
├── Cargo.toml             ← workspace root
└── README.md
repomix-core (Core Engine)
This is the sole "business logic" crate, encompassing:
| Module | Responsibility |
|---|---|
| file_collector | Recursive directory scanning; apply include/exclude rules |
| processor | File content processing (compression, comment removal, AST analysis) |
| output | Serialization for four formats (XML / MD / JSON / Plain) |
| git | Git-aware operations (change frequency analysis, diff, log) |
| metrics | Token counts, character statistics, Top-N leaderboard |
| security | Secretlint integration; suspicious file detection |
Exposed Traits:
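use std::path::Path;
use async_trait::async_trait;

// Progress reporting is synchronous, so this trait needs no attribute.
pub trait ProgressCallback: Send + Sync {
    fn on_progress(&self, msg: &str);
    fn on_complete(&self, msg: &str);
    fn on_error(&self, msg: &str);
}

// An `async fn` in a trait used as `dyn FileProcessor` needs #[async_trait].
// `Result` is assumed to be the crate's own alias.
#[async_trait]
pub trait FileProcessor: Send + Sync {
    async fn process(&self, file: &Path) -> Result<ProcessedFile>;
}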
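As an illustration of how a caller might plug into the progress trait, here is a minimal sketch of an implementation that records events in memory. The trait is repeated so the example compiles on its own; `RecordingCallback` is a hypothetical type, not something repomix-core ships.

```rust
use std::sync::Mutex;

// Mirrors the synchronous progress trait so the sketch compiles on its own.
pub trait ProgressCallback: Send + Sync {
    fn on_progress(&self, msg: &str);
    fn on_complete(&self, msg: &str);
    fn on_error(&self, msg: &str);
}

/// Hypothetical implementation that records every event in memory,
/// e.g. for tests or for a server that reports progress later.
#[derive(Default)]
pub struct RecordingCallback {
    events: Mutex<Vec<String>>,
}

impl RecordingCallback {
    /// Returns a snapshot of all events recorded so far.
    pub fn events(&self) -> Vec<String> {
        self.events.lock().unwrap().clone()
    }

    fn record(&self, kind: &str, msg: &str) {
        self.events.lock().unwrap().push(format!("{kind}: {msg}"));
    }
}

impl ProgressCallback for RecordingCallback {
    fn on_progress(&self, msg: &str) {
        self.record("progress", msg);
    }
    fn on_complete(&self, msg: &str) {
        self.record("complete", msg);
    }
    fn on_error(&self, msg: &str) {
        self.record("error", msg);
    }
}
```

The `Send + Sync` bound is what allows one callback to be shared across worker threads; the `Mutex` is the simplest way to satisfy it here.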
repomix-config (Configuration Schema)
Dedicated to type-safe configuration and default values:
- RepomixConfig: Root config struct, derives Deserialize/Serialize
- OutputConfig: Output format, path, compression options
- Default excludes: node_modules/, __pycache__/, .git/, etc.
- Global config path: ~/.repomix/repomix.config.json
repomix-shared (Cross-Crate Shared Types)
Holds type definitions shared across crates:
use std::path::PathBuf;

pub struct ProcessedFile {
    pub path: PathBuf,
    pub content: String,
    pub tokens: usize,
    pub chars: usize,
    pub is_suspicious: bool,
    pub compress_ratio: f64,
}

// FileTokenCount, SuspiciousFileResult, and SkippedFile are not shown here.
pub struct PackResult {
    pub total_files: usize,
    pub total_tokens: usize,
    pub total_characters: usize,
    pub top_files_by_tokens: Vec<FileTokenCount>,
    pub suspicious_files: Vec<SuspiciousFileResult>,
    pub skipped_files: Vec<SkippedFile>,
}
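A short sketch of how a field such as top_files_by_tokens might be derived from processed files. The struct below is a trimmed-down stand-in, and `top_files_by_tokens` is a hypothetical function; since the definition of FileTokenCount is not shown, a plain `(path, tokens)` tuple is used instead.

```rust
use std::path::PathBuf;

/// Trimmed-down stand-in for ProcessedFile (assumption: only the fields
/// needed for this sketch are kept).
#[derive(Debug, Clone)]
pub struct ProcessedFile {
    pub path: PathBuf,
    pub tokens: usize,
}

/// Hypothetical helper: returns the `n` files with the most tokens,
/// largest first; ties are broken by path so the output is reproducible.
pub fn top_files_by_tokens(files: &[ProcessedFile], n: usize) -> Vec<(PathBuf, usize)> {
    let mut ranked: Vec<(PathBuf, usize)> = files
        .iter()
        .map(|f| (f.path.clone(), f.tokens))
        .collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked.truncate(n);
    ranked
}
```

Breaking ties by path matters for the "reproducibly" goal stated earlier: two runs over the same tree should produce the same leaderboard.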
repomix-cli (CLI Layer)
- clap (derive mode) for argument parsing
- #[tokio::main] async main
repomix-mcp (MCP Server Layer)
- rmcp crate (Rust MCP SDK)
- tokio::sync::Mutex (prevents concurrent git clone conflicts)
- serde-structured parameter schema
repomix-mcp ─────────────► repomix-core
     ▲                          │
     │                          │
repomix-cli ────────────────────┤
                                │
                         repomix-config
                                ▲
                                │
                         repomix-shared
No circular dependencies; each crate is an independently testable unit.
File System (on disk)
│
│ [1] Async scan (tokio async fs + rayon par_iter)
▼
FileEntry { path, size, mtime }
│
│ [2] Include/Exclude filtering
▼
FilteredFileEntry
│
│ [3] Git info enrichment (optional, git CLI)
▼
GitEnrichedFile { change_count, last_commit }
│
│ [4] Content read
▼
RawFileContent
│
│ [5] Processing pipeline (optional)
│ ├── tree-sitter compression
│ ├── Comment removal
│ └── Empty-line removal
▼
ProcessedFile { content, tokens, chars }
│
│ [6] Secretlint scan (optional)
▼
SecureProcessedFile { is_suspicious, suspicious_patterns? }
│
│ [7] Format serialization
▼
PackOutput { xml | markdown | json | plain }
│
│ [8] Written to disk
▼
repomix-output.{xml|md|json|txt}
│
│ [9] Consumed by AI Consumer
▼
LLM Context Window
[2] → [3] Ordered dependency: Filter by include/exclude rules first, then enrich with Git info. Git operations are heavy (they spawn subprocesses), so running them only on the filtered file set is more efficient.
[5] tree-sitter pipeline: Tree-sitter supports incremental parsing. When a previously parsed file is edited, only the changed regions are re-parsed rather than the full file, which matters most for large files.
[7] Lazy format binding: The choice of output format is deferred to the last stage of the processing pipeline. This means all formats share the same intermediate representation, ProcessedFile, making it easy to extend with new formats.
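The lazy format binding described in [7] can be sketched as follows. This is a minimal illustration under stated assumptions: `FileEntry`, `OutputFormat`, and `render` are hypothetical names, the intermediate representation is reduced to two fields, and the concrete output layouts are invented for the example rather than taken from repomix-rs.

```rust
/// Simplified intermediate representation (assumption: the real
/// ProcessedFile carries more fields).
pub struct FileEntry {
    pub path: String,
    pub content: String,
}

/// The format is only consulted at render time.
pub enum OutputFormat {
    Xml,
    Markdown,
    Plain,
}

/// Hypothetical renderer: every format consumes the same `FileEntry` slice.
/// Escaping of XML special characters is omitted for brevity.
pub fn render(files: &[FileEntry], format: &OutputFormat) -> String {
    let mut out = String::new();
    for f in files {
        let block = match format {
            OutputFormat::Xml => {
                format!("<file path=\"{}\">\n{}\n</file>\n", f.path, f.content)
            }
            OutputFormat::Markdown => format!("## {}\n\n{}\n\n", f.path, f.content),
            OutputFormat::Plain => format!("==== {} ====\n{}\n", f.path, f.content),
        };
        out.push_str(&block);
    }
    out
}
```

Adding a new format in this design means adding one enum variant and one match arm; nothing upstream of `render` has to change.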
| Benefit | Cost |
|---|---|
| Speed: 10–20× | Steep learning curve |
| Memory safety | Longer compile times |
| Single binary deployment | Debug complexity |
| MCP ecosystem alignment | Ecosystem younger than JS's |
Why Rust instead of Go?
- tiktoken-rs, burn, etc.)
Tokio was chosen because:
- tokio::sync::Mutex is more controllable in MCP concurrency isolation scenarios
Uses JSON because:
- repomix.config.json
Calls the system git command instead of using git2 (libgit2 bindings):
Trade-off: Depends on git being in PATH. Without git, functionality degrades gracefully rather than failing; this is an intentional fail-soft design.
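The fail-soft behavior can be sketched with the standard library alone. This is an illustration, not the project's code: `change_count` and `parse_count` are hypothetical helpers, and `git rev-list --count HEAD -- <file>` is just one way to obtain a per-file change count from the git CLI.

```rust
use std::path::Path;
use std::process::Command;

/// Parses the output of `git rev-list --count`, e.g. "12\n".
pub fn parse_count(stdout: &str) -> Option<u32> {
    stdout.trim().parse().ok()
}

/// Hypothetical helper: number of commits touching `file` in `repo`.
/// Returns None when git is missing, the directory is not a repository,
/// or the output cannot be parsed, so callers can continue without Git data.
pub fn change_count(repo: &Path, file: &str) -> Option<u32> {
    let output = Command::new("git")
        .arg("-C")
        .arg(repo)
        .args(["rev-list", "--count", "HEAD", "--", file])
        .output()
        .ok()?; // git not on PATH: degrade instead of failing
    if !output.status.success() {
        return None; // not a repository, no commits yet, etc.
    }
    parse_count(&String::from_utf8_lossy(&output.stdout))
}
```

Returning `Option` rather than `Result` pushes the "Git data is optional" decision into the type: the packing pipeline can treat `None` as "no enrichment" and move on.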
Argues against a "one format serves all" approach:
repomix-rs's configuration system follows the Layer Cake Pattern:
┌────────────────────────────────────────────────────────────┐
│  CLI Flags (highest priority; appends, does not replace)   │
├────────────────────────────────────────────────────────────┤
│  ./repomix.config.json (project-level)                     │
├────────────────────────────────────────────────────────────┤
│  ~/.repomix/repomix.config.json (global user-level)        │
├────────────────────────────────────────────────────────────┤
│  Hardcoded Defaults (in-code defaults)                     │
└────────────────────────────────────────────────────────────┘
The layers merge using the append-override principle:
- --include appends to existing rules, does not replace
- --ignore appends to existing rules, does not replace
The rationale is "local config takes priority; global config provides the baseline", preventing global configuration from inadvertently polluting individual projects, consistent with the principle of "explicit over implicit".
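A minimal sketch of the append-override merge, assuming a simplified config shape. `ConfigLayer` and `merge_layers` are hypothetical names; the de-duplication of repeated rules is an assumption of this sketch, not documented behavior.

```rust
/// Simplified config layer (assumption: the real RepomixConfig has more fields).
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ConfigLayer {
    pub include: Vec<String>,
    pub ignore: Vec<String>,
    pub output_path: Option<String>,
}

/// Merges layers given from lowest to highest priority.
/// List-valued rules are appended; scalar values are overridden.
pub fn merge_layers(layers: &[ConfigLayer]) -> ConfigLayer {
    let mut merged = ConfigLayer::default();
    for layer in layers {
        for rule in &layer.include {
            if !merged.include.contains(rule) {
                merged.include.push(rule.clone());
            }
        }
        for rule in &layer.ignore {
            if !merged.ignore.contains(rule) {
                merged.ignore.push(rule.clone());
            }
        }
        // A higher-priority layer only overrides when it sets a value.
        if layer.output_path.is_some() {
            merged.output_path = layer.output_path.clone();
        }
    }
    merged
}
```

Called with `[defaults, global, project, cli]`, the result keeps every ignore rule from every layer while the last layer that sets `output_path` wins.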
.gitignore Design
.repomixignore syntax is fully aligned with .gitignore. This is not accidental:
- gitignore.io)
MCP is an open protocol championed by Anthropic, defining a standardized AI Agent ↔ Tool communication interface:
┌──────────────┐      stdio JSON-RPC      ┌──────────────┐
│    Client    │ ◄──────────────────────► │    Server    │
│   (Claude,   │                          │ (repomix-mcp)│
│    Cursor)   │                          │              │
└──────────────┘                          └──────────────┘
For tool use, the protocol layer has only two core primitives, tools/list and tools/call, but powerful tool compositions can be built from them.
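To show what travels over stdio, here is a small sketch that builds the two JSON-RPC 2.0 requests by hand. The function names are hypothetical, and the `directory` and `compress` argument names are taken from the pack_codebase call shown in this article; a real client would use a JSON library such as serde_json rather than string formatting.

```rust
/// Builds a JSON-RPC 2.0 `tools/call` request for the pack_codebase tool.
/// Assumption: `directory` contains no characters that need JSON escaping.
pub fn pack_codebase_request(id: u64, directory: &str, compress: bool) -> String {
    format!(
        r#"{{"jsonrpc":"2.0","id":{id},"method":"tools/call","params":{{"name":"pack_codebase","arguments":{{"directory":"{directory}","compress":{compress}}}}}}}"#
    )
}

/// Builds the matching discovery request, `tools/list`.
pub fn list_tools_request(id: u64) -> String {
    format!(r#"{{"jsonrpc":"2.0","id":{id},"method":"tools/list"}}"#)
}
```

A client typically sends `tools/list` once to discover the available tools and their parameter schemas, then issues `tools/call` requests as the model decides to use them.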
User question
▼
Claude Desktop (MCP Client)
"I need to understand this project's auth module"
▼
tools/call(pack_codebase, {directory: ".", compress: true})
▼
repomix-mcp Server
pack_directory(".") ──► repomix-core
Tree-sitter compression (retains only auth-related function signatures)
▼
Returns PackResult
▼
Claude Desktop injects result into context
▼
Claude understands project structure and answers the question
The original Repomix has no MCP, meaning it is just a CLI tool. For an AI Agent to use it, it must:
repomix-rs's MCP Server turns the pack operation into an AI-native capability:
This design upgrades repomix-rs from "a tool" to "an infrastructure component".
A single pack operation roughly has four stages:
| Stage | Compute Characteristics | repomix-rs Implementation | Original Repomix |
|---|---|---|---|
| File discovery | I/O + lightweight matching | rayon::par_iter | Single-threaded fs.scandir |
| Content reading | I/O-intensive | tokio::fs::read | async fs (libuv single-threaded) |
| AST compression | CPU-intensive | rayon parallel tree-sitter | Single-threaded JS |
| Output writing | I/O-intensive | tokio::fs::write | fs.write |
Theoretical level:
repomix-rs employs a dual-engine architecture of Rayon data parallelism + Tokio async I/O, a combination that Rust's ecosystem makes practical:
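// Illustrative sketch, not the actual implementation. Assumed to be in scope:
// `entries`, `rt` (a tokio::runtime::Handle), `result_tx` (an mpsc Sender),
// and a hypothetical `tree_sitter_compress` helper.
use rayon::prelude::*;

entries.par_iter().for_each_with(result_tx, |tx, entry| {
    // block_on parks this Rayon worker until the async read completes.
    let Ok(content) = rt.block_on(tokio::fs::read(&entry.path)) else { return };
    let compressed = tree_sitter_compress(&content);
    tx.send(ProcessedFile::from(entry, compressed)).unwrap();
});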
The key point: par_iter() lets Rayon spread the work across all available cores, while tokio::fs::read performs the file I/O through Tokio's runtime. Node.js concurrency in the original is cooperative and event-loop based, so CPU-intensive work cannot run in parallel across cores without worker threads. This is why the tree-sitter compression stage shows the largest gap (20×+).
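The data-parallel half of this design can be illustrated without external crates. The sketch below swaps Rayon for plain `std::thread::scope` so it has no dependencies; `compress` is a trivial stand-in for the tree-sitter step, and `compress_all` is a hypothetical name.

```rust
use std::thread;

/// Stand-in for the CPU-heavy compression step (assumption: the real
/// pipeline runs tree-sitter here); this one just drops blank lines.
fn compress(content: &str) -> String {
    content
        .lines()
        .filter(|l| !l.trim().is_empty())
        .collect::<Vec<_>>()
        .join("\n")
}

/// Processes `inputs` on up to `workers` scoped threads, preserving order.
/// Rayon's par_iter does this with work stealing; fixed-size chunks on
/// std threads are swapped in here to keep the sketch dependency-free.
pub fn compress_all(inputs: &[String], workers: usize) -> Vec<String> {
    let workers = workers.max(1);
    let chunk_size = ((inputs.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        // Spawn one thread per chunk, then join them in input order.
        let handles: Vec<_> = inputs
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || chunk.iter().map(|s| compress(s)).collect::<Vec<_>>())
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect()
    })
}
```

Fixed chunks are a cruder strategy than work stealing: if one chunk contains a few very large files, its thread finishes last while the others sit idle, which is the situation Rayon is designed to avoid.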
Engineering level:
| Optimization Technique | Effect |
|---|---|
| Memory-mapped I/O (mmap) | Reduces copying, especially for large files |
| Zero-copy string slicing | Tree-sitter output avoids memory allocation |
| Streaming output | No full buffering needed, O(1) memory |
| Arc shared config | Zero-copy read access to config in multi-threaded scenarios |
| Early filtering | Applies ignore rules before reading file contents |
┌──────────────────────────────────────────────────────────────────────┐
│ Layer 1: Configuration Layer                                         │
│   • .repomixignore excludes known dangerous paths                    │
│   • Default excludes (node_modules, .git, etc.)                      │
├──────────────────────────────────────────────────────────────────────┤
│ Layer 2: Scanning Layer (Secretlint)                                 │
│   • Regex matching for API Keys, Tokens, private keys                │
│   • Scan results configurable: warn / exclude / ignore               │
├──────────────────────────────────────────────────────────────────────┤
│ Layer 3: Output Layer                                                │
│   • Suspicious files flagged, with pattern description attached      │
│   • Supports --exclude-suspicious for hard filtering                 │
├──────────────────────────────────────────────────────────────────────┤
│ Layer 4: Runtime Layer (Rust memory safety)                          │
│   • No buffer overflows / Use-After-Free                             │
│   • Deterministic resource cleanup (RAII)                            │
│   • No data races (Send + Sync trait constraints)                    │
└──────────────────────────────────────────────────────────────────────┘
Defense in Depth is the core principle of security design. repomix-rs does not rely on a single security mechanism; it provides protection at every layer. Rust's inclusion transforms Layer 4 from "as safe as possible" into "compile-time guaranteed safety". For a tool that processes user code, potentially encountering sensitive content, this is a qualitative leap.
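The pattern matching in Layer 2 can be sketched in a few lines. This is a heavily simplified stand-in for Secretlint-style rules, written without a regex crate; `suspicious_patterns` and `has_aws_access_key_id` are hypothetical names, and only two well-known patterns are checked.

```rust
/// Returns true if `s` contains "AKIA" followed by 16 uppercase letters or
/// digits, which is the shape of an AWS access key ID.
fn has_aws_access_key_id(s: &str) -> bool {
    s.as_bytes().windows(20).any(|w| {
        w.starts_with(b"AKIA")
            && w[4..]
                .iter()
                .all(|b| b.is_ascii_uppercase() || b.is_ascii_digit())
    })
}

/// Hypothetical scanner: returns a description of every pattern found in
/// `content`, so the output layer can attach it to the flagged file.
pub fn suspicious_patterns(content: &str) -> Vec<&'static str> {
    let mut found = Vec::new();
    if content.contains("-----BEGIN") && content.contains("PRIVATE KEY-----") {
        found.push("PEM private key");
    }
    if has_aws_access_key_id(content) {
        found.push("AWS access key ID");
    }
    found
}
```

Returning descriptions rather than a bare boolean matches the Layer 3 behavior described above, where flagged files carry a pattern description.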
Developer workflow
├─ Code editing → IDE (VSCode / Cursor)
├─ Code review → LLM + repomix-rs output
├─ Code generation → Cursor / Copilot
├─ Code knowledge retrieval → RAG / Embedding
└─ Codebase context injection ────────────────────────────────┐
                                                              │
AI Agent capability stack                                     │
├─ Tool invocation (Function Calling) ────────────────────────┤
├─ Context management ────────────────────────────────────────┤
│  └─ repomix-rs provides structured code context             │
├─ Long-term memory (Memory / RAG)                            │
└─ Autonomous execution (Agentic Workflow)                    │
                                                              │
MCP Ecosystem                                                 │
├─ MCP Servers: filesystem, sqlite, …                         │
├─ MCP Servers: repomix-rs (code context) ────────────────────┘
└─ MCP Servers: your custom tools
repomix-rs occupies the codebase context provider niche in the AI coding toolchain. Its irreplaceability stems from:
RAG (Retrieval-Augmented Generation) addresses the problem of "knowing where to look", while repomix-rs addresses the problem of "how to transmit completely":
| Dimension | RAG | repomix-rs |
|---|---|---|
| Applicable scenario | Large knowledge base retrieval | Small-to-medium project full context |
| Accuracy | Depends on embedding quality | Precise and complete |
| Token cost | Charged by retrieved chunks | Controllable compression |
| Setup complexity | High (requires vector DB) | Low (single command) |
| Real-time | Requires index updates | Real-time pack |
Best practice: Use RAG + repomix-rs together β RAG for large knowledge bases; repomix-rs for current project context.
| Dimension | Status |
|---|---|
| Core packing | ✅ Production-ready |
| Language support | ⚠️ 10 languages (extensible) |
| MCP Server | ✅ Production-ready |
| Remote repository packing | ✅ Production-ready |
| Secretlint integration | ✅ Basic, configurable scope |
| Token calculation | ✅ Precise |
| Performance | ✅ 10–40× over original |
| Documentation | ⚠️ Moderate |
| Community contributions | Growing |
Near-term (v2.x):
Medium-term (v3.x):
- repomix-lsp: Language Server Protocol integration for real-time code context maintenance in IDEs
Long-term:
Choosing Rust to rewrite developer tools is itself a technical signal. Rolldown, the bundler being built for Vite, is written in Rust. repomix-rs stands within this trend, suggesting that tools that are performance-sensitive, security-sensitive, and tightly coupled with the AI ecosystem are entering Rust's golden age.
repomix-rs is not a simple "Rust port" of the original Repomix. It is a tool re-architected around AI code consumption scenarios:
Choosing repomix-rs means choosing architecture for the future.
npm install -g repomix-rs