{"slug": "m-m-star-a-modular-extensible-serving-system-for-multimodal-models", "title": "M* (M-Star): A Modular, Extensible, Serving System for Multimodal Models", "summary": "M* (M-Star), a new serving system for multimodal models, matches or beats specialized systems by up to 2.7× on speech and image serving and 12.5× on world-model rollouts. It uses a Walk Graph abstraction to handle composite models with non-autoregressive loops, internal parallelism, and input-dependent paths. The system generalizes prior serving stacks like vLLM and SGLang, which are modality-locked and cannot efficiently compose heterogeneous components.", "body_md": "## Inference is no longer a single loop\n\nLLM serving systems like vLLM and SGLang are built on one assumption: that inference is a single autoregressive loop, in which the system prefills the prompt and then decodes one token at a time until the model stops. The newest multimodal models break that assumption. Five families make it concrete:\n\n- **UMMs**: BAGEL\n- **SpeechLMs**: Orpheus\n- **Omni**: Qwen3-Omni\n- **VLAs**: π0.5\n- **World models**: V-JEPA 2\n\nThey are *composite*: built from structurally distinct components (vision encoders,\ntransformer backbones, diffusion and flow heads, audio codecs, action and world-model predictors)\nwired together in patterns that change with the input. They add **non-AR loops**\n(diffusion image generation, variable-horizon world-model rollouts), **internal\nparallelism** (the branches of classifier-free guidance; the pipelined Thinker–Talker of\nan omni model), and **input-dependent paths** (in BAGEL, *generating* an image and\n*understanding* one traverse different components of the *same* model).\n\nM* serves all of them from a single runtime. On the models we have benchmarked, M* matches or beats\nthe specialized system built for each, by up to **2.7×** on speech and image\nserving, and **12.5×** on world-model rollouts. 
The rest of this post shows how M*\nworks, starting with code.\n\n## Why today's serving stacks fall short\n\nComposite models pose three challenges at once: **architectural diversity** (many\npaths, non-AR loops), **performant modularity** (HuggingFace Transformers is flexible\nbut slow; vLLM and VoxServe are fast but domain-locked), and **physical topology**\n(heterogeneous components want different placement, batching, and transport).\n\n**vLLM and SGLang** are superb at autoregressive text, but they are\n**modality-locked**: built for text generation, with image (and even text) inputs\nsupported only as prefill-time encoder add-ons, and a single decode loop whose output is always text.\nThere is no first-class way to compose heterogeneous components into loops and parallel branches\n(so no fan-out for classifier-free guidance, or CFG) and no cross-component streaming. **vLLM-Omni and\nSGLang-Omni** go further, modeling a request as a flat pipeline of stages wired by explicit\ndata-transfer functions, which is enough for a Thinker–Talker–codec chain. But iteration\nstays inside a single stage and stages cannot be composed in parallel, so patterns such as diffusion\nloops or CFG fan-out must be added per-model as glue code. In vLLM-Omni,\nfor instance, BAGEL's CFG runs through a bespoke plugin built on `torch.distributed`.\n\nWe built M* because we wanted to make it easier for current and future composite models to achieve\nstate-of-the-art efficiency. We found that current systems could be generalized into the M*\n**Walk Graph**.\n\n| | vLLM-Omni | SGLang-Omni | M* (ours) |\n|---|---|---|---|\n| Graph node | Engine-instance stage | Worker-pool stage | Model component |\n| Composition | Flat DAG | Flat DAG | Seq. / Par. / Loop / Stream |\n| Paths per model | Prefill, decode | Prefill, decode | Flexible |\n| Loops | Within a stage | Within a stage | Across any subgraph |\n| Placement | Stage | Stage | Component, w/ optional Walk |\n\nTable 1. 
Each prior abstraction is a restricted subset of the Walk Graph.\n\n## The Walk Graph, by example\n\nIn M*, a model is declared as a graph of model-component **nodes** connected by tensor\n**edges**, plus a set of named **Walks**. Each Walk is a labeled subgraph for\none phase of behavior. A request is a *series of Walks*, chosen by a small state machine the\nmodel author writes. The author provides only the graph and the Walks. Everything physical\n(placement, scheduling, batching, tensor transport, streaming) is the runtime's job.\n\nFor example, BAGEL has four core components (`vit_encoder`, `vae_encoder`,\nthe `LLM`, and `vae_decoder`) and a handful of Walks. The state machine\nstrings them together differently per request:\n\n- **Generate an image** (text→image): `prefill_text → image_gen`\n- **Understand an image** (image→text): `prefill_text → prefill_vit → decode`\n- **Edit an image** (image→image): `prefill_text → prefill_vae → prefill_vit → image_gen`\n\nDefining requests as Walks means that the runtime executes *only the components a request\nneeds*. 
Image understanding never touches the diffusion loop or the `vae_decoder`;\ntext-to-image generation never runs the ViT understanding path.\n\nThe state machine the author writes picks each Walk: it builds the prefill steps from the input\nmodalities, then transitions to decode or image generation based on the requested output (*note:\nthis is a simplification of the actual M* model code*):\n\n``` python\n# Pick the next Walk based on the current phase\ndef next_walk(self, state):\n    if state.prefill_steps:                 # still consuming inputs\n        return state.prefill_steps.pop(0)   #   prefill_text / prefill_vae / prefill_vit\n    if state.target == \"image\":\n        return \"image_gen\"                  #   image_gen_cfg when CFG is configured\n    return \"decode\"                         # otherwise, autoregressive text\n```\n\nNext we'll see how the model author defines the BAGEL graph. If you would rather run it first, go to\nthe [quickstart](/mstar/quickstart.html).\n\n**Start with one node.** A node names its inputs and declares where each output goes.\nBAGEL's `vae_decoder` takes denoised latents and emits an image to the client:\n\n``` python\nfrom mstar.graph.base import GraphNode, GraphEdge\nfrom mstar.graph.special_destinations import EMIT_TO_CLIENT\n\nvae_decoder = GraphNode(\n    name=\"vae_decoder\",\n    input_names=[\"latents\"],\n    outputs=[\n        GraphEdge(next_node=EMIT_TO_CLIENT, name=\"image_output\",\n                  output_modality=\"image\"),\n    ],\n)\n```\n\nThe graph only names inputs, outputs, and wiring. The compute behind a node is a\n`torch.nn.Module`: the model author implements `prepare_inputs` and a\npure-tensor `forward`, and the runtime handles batching, KV caching, CUDA graphs, and\ntensor transport. 
Here is the Submodule behind the `vae_decoder` node:\n\n``` python\n# Imports of NodeSubmodule, NodeInputs, and unpatchify are omitted in this excerpt.\nclass VAEDecoderSubmodule(NodeSubmodule):      # NodeSubmodule is a torch.nn.Module\n    def __init__(self, vae_model):\n        # nn.Module must be initialized before submodules are assigned\n        # (assumes NodeSubmodule takes no constructor arguments)\n        super().__init__()\n        self.vae_model = vae_model\n\n    def prepare_inputs(self, graph_walk, fwd_info, inputs):\n        # gather the tensors this node consumes from its input edges\n        return NodeInputs(tensor_inputs={\"latents\": inputs[\"latents\"][0]})\n\n    def forward(self, graph_walk, engine_inputs, latents):\n        # pure tensor compute; outputs are keyed to the node's output edges\n        image = self.vae_model.decode(unpatchify(latents))\n        return {\"image_output\": [image]}\n```\n\n**Add a loop.** BAGEL generates an image by running flow-matching steps on its\n`LLM` backbone, then decoding the final latents to pixels. This can be expressed in M* with a\n`Loop`, which runs its `section` repeatedly, feeding each step's outputs back as\nthe next step's inputs. When the loop finishes, its `outputs` route forward; here, the\nlatents route to the `vae_decoder` we just built:\n\n``` python\nfrom mstar.graph.base import Sequential, Loop\n\nimage_gen = Sequential([\n    Loop(\n        section=GraphNode(\n            name=\"LLM\",\n            input_names=[\"latents\", \"time_index\"],\n            outputs=[\n                GraphEdge(next_node=\"LLM\", name=\"latents\"),\n                GraphEdge(next_node=\"LLM\", name=\"time_index\"),\n            ],\n        ),\n        max_iters=49,                       # num_timesteps - 1\n        outputs=[GraphEdge(next_node=\"vae_decoder\", name=\"latents\")],\n    ),\n    vae_decoder,                            # the node from above\n])\n```\n\nThe same `Loop` primitive covers autoregressive text decode (it stops on an\nend-of-sequence signal instead of a fixed count) and world-model rollout (it stops at the horizon).\nNothing here is special-cased to images. 
Furthermore, because Loops are generic, M* applies continuous\nbatching and CUDA-graph replay to flow steps exactly as it does to token decode.\n\n**A node is whatever compute you name.** BAGEL's diagram splits the model into a backbone, an LM head, a flow head, a time\nembedder, yet the code has a single `LLM` node. That is a design choice for\nperformance: BAGEL's flow projection and time embedder are one or two linear layers each and both run\non the same hidden states as the backbone, so M* keeps them inside the one `LLM` node;\nsplitting them out would add scheduling and input-preparation overhead on the\nimage-generation critical path, with no performance benefit. The ViT and VAE *are* separate\nnodes, because they genuinely differ in compute and placement needs.\n\n**Add parallelism.** In BAGEL, classifier-free guidance (CFG) runs three forward passes per\ndenoising step (an unconditional pass and two conditioned ones) and combines them.\nRunning these in parallel is ideal for minimizing latency. Unfortunately, this kind of pattern is hard\nto capture in the flat stage pipelines used by vLLM-Omni or SGLang-Omni. Because three-way CFG can't be\nnatively supported, it requires a bespoke per-model plugin (e.g., a `CFGParallelMixin` that calls\n`all_gather` on velocities across ranks in vLLM).\n\nMeanwhile, M* handles all parallelism in a generic way, so the user just needs to express the\nparallelism to the runtime. This is done with a `Parallel` block of three `LLM` “views” that fan into a `combine_cfg` node and loop. 
Each branch can sit on its\nown GPU; the runtime places and merges them with no per-model glue code. Three branches fire in parallel each step, then `combine_cfg` applies the CFG formula plus an Euler step and loops the latents back (listing lightly simplified):\n\n``` python\nfrom mstar.graph.base import Parallel\n\nimage_gen_cfg = Sequential([\n    Loop(\n        section=Sequential([\n            Parallel([\n                GraphNode(name=\"LLM\",          input_names=[\"latents\", \"time_index\"],\n                          outputs=[GraphEdge(next_node=\"combine_cfg\", name=\"v_main\")]),\n                GraphNode(name=\"LLM_cfg_text\", input_names=[\"latents\", \"time_index\"],\n                          outputs=[GraphEdge(next_node=\"combine_cfg\", name=\"v_cfg_text\")]),\n                GraphNode(name=\"LLM_cfg_img\",  input_names=[\"latents\", \"time_index\"],\n                          outputs=[GraphEdge(next_node=\"combine_cfg\", name=\"v_cfg_img\")]),\n            ]),  # latent-init consistency ensured via a fixed per-request seed\n            GraphNode(\n                name=\"combine_cfg\",\n                input_names=[\"v_main\", \"v_cfg_text\", \"v_cfg_img\", \"latents\", \"time_index\"],\n                outputs=[       # feed latents + time_index back to every branch\n                    GraphEdge(next_node=\"LLM\",          name=\"latents\"),\n                    GraphEdge(next_node=\"LLM\",          name=\"time_index\"),\n                    GraphEdge(next_node=\"LLM_cfg_text\", name=\"latents\"),\n                    GraphEdge(next_node=\"LLM_cfg_text\", name=\"time_index\"),\n                    GraphEdge(next_node=\"LLM_cfg_img\",  name=\"latents\"),\n                    GraphEdge(next_node=\"LLM_cfg_img\",  name=\"time_index\"),\n                ],\n            ),\n        ]),\n        max_iters=49,\n        outputs=[GraphEdge(next_node=\"vae_decoder\", name=\"latents\")],\n    ),\n    vae_decoder,  # as defined in \"Start with one node\" above\n])\n```\n\nHow do the three `LLM` views connect to the real model? Each node name maps to a Submodule,\nand the three CFG branches are the same language model wrapped under three names, differing only in\nwhich guidance cache they read and write.\n\n**Placement.** Placement is a small YAML file that maps logical nodes to physical GPU\nranks. Nothing in the model code changes when you move components around. Every mapping of nodes to GPU\nranks (disaggregating components, disaggregating prefill from decode, or using tensor-parallel\n**sharding**) uses the same placement API, so you can shard a big Qwen3-Omni\nbackbone while disaggregating its encoders and codec elsewhere.\n\nDisaggregated: each component on its own GPU(s) and scaled independently.\n\nAs an example, the *same* BAGEL graph runs on one GPU:\n\n```\n# Single GPU: everything colocated\nmodel: \"bagel\"\nnode_groups:\n  - { node_names: [vit_encoder, vae_encoder, vae_decoder, LLM], ranks: [0] }\n```\n\n...or fans the three CFG branches across three GPUs (active only during image generation) by editing the same file:\n\n```\n# Three GPUs: CFG branches on their own ranks, only during image_gen_cfg\nmodel: \"bagel\"\nnode_groups:\n  - { node_names: [vit_encoder, vae_encoder, vae_decoder], ranks: [0] }\n  - { node_names: [LLM, combine_cfg], ranks: [0] }\n  - { node_names: [LLM_cfg_text], ranks: [1], graph_walks: [image_gen_cfg] }\n  - { node_names: [LLM_cfg_img],  ranks: [2], graph_walks: [image_gen_cfg] }\n```\n\nThe `graph_walks` key lets you place a node differently *per Walk*: for\nexample, prefill for a node can happen on one GPU while decode happens on another.\n\n### Streaming, by example: Qwen3-Omni\n\nSome components have to overlap in time. 
Qwen3-Omni speaks by pipelining three components: a\n**Thinker** (the LLM that produces hidden states and text), a **Talker** (an\nautoregressive model that turns those into audio codec tokens), and **Code2Wav** (a\ncode-to-waveform codec decoder). To start playing audio before the whole response is computed, the\nThinker streams one hidden state at a time to the Talker, and the Talker streams codec frames to\nCode2Wav.\n\nIn M*, streaming is a first-class edge type: the producer just marks an output as streaming to a\ndownstream partition, and a **chunk policy** (declared once in the model's topology\nand matched to the edge by name) decides how the consumer reassembles the stream:\n\n``` python\nfrom mstar.streaming.topology import Connection, PartitionTopology, StreamingGraphEdge\nfrom mstar.streaming.chunk_policy import FixedChunkPolicy, LeftContextChunkPolicy\n\n# Inside the Thinker's walk: hidden states stream to the Talker.\nStreamingGraphEdge(next_node=\"Talker\", name=\"thinker_states\", target_partition=\"Talker\")\n\n# Inside the Talker's walk: codec frames stream to Code2Wav.\nStreamingGraphEdge(next_node=\"Code2Wav\", name=\"codec_tokens\", target_partition=\"Code2Wav\")\n\n# How each stream is reassembled is declared once, in the model's topology:\nPartitionTopology(\n    partitions=[\"Thinker\", \"Talker\", \"Code2Wav\"],\n    connections=[\n        Connection(from_partition=\"Thinker\", to_partition=\"Talker\",\n                   edge_name=\"thinker_states\",\n                   chunk_policy_factory=lambda: FixedChunkPolicy(chunk_size=1,\n                                                                 continue_after_done=True)),\n        Connection(from_partition=\"Talker\", to_partition=\"Code2Wav\",\n                   edge_name=\"codec_tokens\",\n                   chunk_policy_factory=lambda: LeftContextChunkPolicy(chunk=25, left_context=25)),\n    ],\n)\n```\n\n`FixedChunkPolicy(chunk_size=1)` feeds the Talker one Thinker state per step;\n`LeftContextChunkPolicy` hands Code2Wav 25-frame chunks plus 25 frames of left context to\nwarm up its causal convolutions. The Talker runs as an autoregressive `Loop`; Code2Wav is\nre-triggered per chunk. The result is three components on three GPUs, overlapping in time, emitting\naudio incrementally. The same small set of chunk policies (fixed, sliding-window, left-context)\ncovers every streaming edge in our models (Orpheus's SNAC decoder uses the sliding-window one),\ninstead of bespoke per-model streaming code.\n\n## What the Walk Graph unlocks\n\nDecoupling the model from the runtime is where the performance comes from.\n\n#### Modality-aware scheduling\n\nRun only the components a request needs. A Walk names exactly which parts of the model participate, so text-only responses bypass image-generation paths, and these optimizations emerge from the model executor itself, not from model-specific scheduling logic.\n\n#### Reusable systems optimizations\n\nExecution stages share a common interface, so paged attention, FlashInfer kernels, `torch.compile`, and CUDA Graphs apply across diverse components (from LLM decoding to diffusion transformers and speech modules) with no bespoke integration per model.\n\n#### Flexible parallelism\n\nExpress parallelism *within a graph stage* with `Parallel` (e.g. the three CFG branches); the runtime executes all instances of parallelism uniformly.\n\n#### Flexible placement\n\nMap each node to GPU rank(s): encoder/decoder disaggregation, prefill/decode/flow split, independent scaling, transparent multiplexing, and tensor-parallel **sharding** of one large component across GPUs.\n\n#### Loops are first-class\n\nContinuous batching and CUDA-graph replay apply to any loop, so diffusion steps, world-model rollouts, and token decode all ride the same machinery, and a rollout's KV cache persists across steps instead of being recomputed.\n\n#### Streaming is first-class\n\nOne small set of chunk policies covers every streaming edge, regardless of placement, and connections between colocated components incur no communication overhead.\n\n## Under the hood\n\nM* lowers the graph to a distributed runtime. A **Conductor** tracks each request's Walk\nand dispatches work to per-GPU **Workers** that route tensors directly to one another.\nSome key features:\n\n#### Pluggable data plane\n\nComponents exchange tensors over shared memory, RDMA, or TCP (via Mooncake), chosen by where the components are located.\n\n#### A handful of engines\n\nA modality-agnostic AR engine (it also handles any node that needs a KV cache and/or sampling) with a FlashInfer paged-attention KV cache, plus a stateless engine for encoders, decoders, and audio codecs; all support continuous batching and CUDA-graph replay.\n\n#### Overlapped scheduling\n\nWhile the current step runs on the GPU, M* prepares the next batch and its attention plan on a separate stream, and keeps loops moving by deferring each stop check one iteration. 
This is implemented generically over the `Loop` primitive (not just for text or speculative decoding), so the GPU rarely stalls on CPU scheduling.\n\n#### Sharding × disaggregation\n\nTensor-parallel sharding (parallel linears, vocab-parallel embeddings, sharded MoE and KV cache, NCCL collectives) is built in and set with a `tp_size` in the placement file, so one large component doesn't have to fit on one GPU.\n\n## Does it work? Matching or beating specialized systems\n\nWe instantiate M* on five real models and compare against the strongest specialized baseline for each.\n\n| Model · task | Baseline(s) | Setup | Speedup over baseline |\n|---|---|---|---|\n| BAGEL · text→image | vLLM-Omni | 3×H100, CFG-parallel, B=1 | ≈1.3× lower latency |\n| BAGEL · image editing | vLLM-Omni | 3×H100, CFG-parallel, B=1 | up to 2.6× lower latency |\n| BAGEL · image→text | vLLM-Omni | 1×H100, B≤16 | ≈1.6× faster first token |\n| Qwen3-Omni · TTS | vLLM-Omni, SGLang-Omni | 2×H200 | ≈2.7× throughput vs vLLM-Omni @ B=16 (≈4× vs SGLang-Omni) |\n| Qwen3-Omni · TTS (TP-2 thinker) | SGLang-Omni | 2×H200, Thinker sharded | ≈3.8× throughput @ B=16 |\n| Orpheus · TTS | VoxServe | 1×H200 | ≈1.3× throughput @ B=8 and lower RTF |\n| V-JEPA 2 · rollout | Meta native | 1×H100 | up to 12.5× faster |\n\nTable 2. Five models, five specialized baselines; M* matches or beats each. Benchmarks as of June 2026.\n\nThe wins come from the abstraction. For image generation and editing (Figure 5), M* runs BAGEL's\nthree-way classifier-free guidance as a `Parallel` block spread across three GPUs, and\nfinishes faster than every vLLM-Omni configuration: about 1.3× lower end-to-end latency on\ntext-to-image, and up to 2.6× on image editing versus vLLM-Omni's default pipeline. 
Against\nvLLM-Omni's best-tuned single-stage configuration, the editing margin is about 1.2×.\n\nWhat is vLLM-Omni's “single-stage” config?\n\nBy default, vLLM-Omni runs BAGEL as two stages — a Thinker (text and\nunderstanding, on vLLM's autoregressive engine) feeding a separate DiT stage for image generation,\nwith the conditioning KV cache shipped between them. The single-stage config collapses the whole\nmodel — LLM, ViT, VAE, and DiT — into one diffusion process, eliminating that\ncross-stage transfer: it matches the default on text-to-image (where the transferred text\nconditioning is small) but is much faster on editing (where the conditioning includes an encoded\nimage). The catch is that text and understanding then run *inside the diffusion engine*\nrather than vLLM's AR engine, giving up continuous batching, token streaming, and paged-attention\nKV management — a whole-model choice that speeds up editing at the expense of the text path.\nM* needs no such bargain: because a Walk names exactly the components a request uses,\nimage-generation and understanding requests each execute the right way, with the engine\noptimizations intact.\n\nImage understanding is more nuanced (Figures 6 to 8). Because a Walk names exactly the components a request touches, an image-to-text request never runs the diffusion path, so M* returns the first token about 1.6× faster than vLLM-Omni and holds a throughput lead that grows with batch size, reaching about 46% for short outputs. The cost is a slightly higher median inter-token latency, roughly 1 to 3 ms. M*'s advantage is therefore largest under load and for shorter responses, and narrows to near-parity for long outputs at low concurrency.\n\nSpeech and omni models follow the same pattern (Figures 9 to 11). 
On Qwen3-Omni text-to-speech, M* sustains about 2.7× the throughput of vLLM-Omni and about 4× that of SGLang-Omni, and it stays real-time through batch size 32, where SGLang-Omni's tail latency runs past the real-time threshold. Sharding the Thinker across two GPUs keeps about a 3.8× throughput lead over SGLang-Omni, an example of sharding and disaggregation working together. On Orpheus, M* posts a lower real-time factor and higher audio throughput than VoxServe at every batch size we benchmarked.\n\nWorld models show what first-class loops buy (Figure 12). M* expresses the rollout as a\n`Loop` with a persistent KV cache instead of recomputing it from scratch each step, which\nyields up to a 12.5× speedup over Meta's native rollout.\n\n## Coming soon\n\n- **More models, coming soon.** More omni models (Ming-flash-omni-2.0, Qwen2.5-Omni), world models (Cosmos 3), and more VLAs, among others. Want a model supported? [Get in touch](mailto:atindra@cs.stanford.edu) or [open a GitHub issue](https://github.com/mstar-project/mstar/issues).\n- **More parallelism, everywhere.** Tensor-parallel sharding is live and rolling out across model families; sequence/context and DiT-specific parallelism are coming soon.\n- **Unified engine plugins.** Converging the AR, encoder/decoder, and audio-codec engines behind one interface.\n\n## What's next\n\nThe bigger picture the Walk Graph opens up: three directions we are actively pursuing.\n\n- **SLO-aware placement and path-aware autoscaling.** Search automatically for a node-and-Walk → worker placement that meets an objective (throughput, latency, or cost), and rescale to the live traffic mix: scale up only the components on hot Walks, and offload cold ones to host memory off the critical path.\n- **An agentic serving layer.** An agent is itself a graph over model calls, the same shape M* runs inside one model; we are building a layer that places the inter-model agent graph and the intra-model component graph under one runtime, so calls across many agents share scheduling, placement, batching, and cached state.\n- **A compiler for the Walk Graph.** Treated as an IR, the graph enables graph-level optimization (eliminating components a request never touches, fusing operations, scheduling the overlap above) and mapping each component to the hardware it runs best on.\n\n## Get the code\n\n**Try it:** install M*, point it at a model with a placement config, and serve in one\ncommand (see the [quickstart](/mstar/quickstart.html)). We'd love your feedback: open a\n[GitHub issue](https://github.com/mstar-project/mstar/issues), or email\n[atindra@cs.stanford.edu](mailto:atindra@cs.stanford.edu). If there's a model you'd like to\nsee supported, tell us.\n\nM* is open source. If you build on this work, please cite it:\n\n### Cite\n\n```\n@article{mstar2026,\n  title     = {M*: A Modular, Extensible, Serving System for Multimodal Models},\n  author    = {Atindra Jha and Naomi Sagan and Keisuke Kamahori and Irmak Sivgin and\n               Rohan Sanda and Steven Gao and Mark Horowitz and Luke Zettlemoyer and\n               Olivia Hsu and Jure Leskovec and Baris Kasikci and Stephanie Wang},\n  year      = {2026},\n  eprint    = {2606.12688},\n  archivePrefix = {arXiv},\n  primaryClass = {cs.LG}\n}\n```\n\n", "url": "https://wpnews.pro/news/m-m-star-a-modular-extensible-serving-system-for-multimodal-models", "canonical_source": "https://mstar.stanford.edu/", "published_at": "2026-06-18 19:23:04+00:00", "updated_at": "2026-06-18 19:31:19.761957+00:00", "lang": "en", "topics": ["machine-learning", "large-language-models", "generative-ai", "ai-infrastructure", "ai-research"], "entities": ["M*", "vLLM", "SGLang", "BAGEL", "Orpheus", "Qwen3-Omni", "V-JEPA 2", 
"HuggingFace Transformers"], "alternates": {"html": "https://wpnews.pro/news/m-m-star-a-modular-extensible-serving-system-for-multimodal-models", "markdown": "https://wpnews.pro/news/m-m-star-a-modular-extensible-serving-system-for-multimodal-models.md", "text": "https://wpnews.pro/news/m-m-star-a-modular-extensible-serving-system-for-multimodal-models.txt", "jsonld": "https://wpnews.pro/news/m-m-star-a-modular-extensible-serving-system-for-multimodal-models.jsonld"}}