NAVI-Orbital: First In-Orbit Demonstration of a Zero-Shot Vision-Language Model for Autonomous Earth Observation

On April 16, 2026, NAVI-Orbital achieved what its authors believe to be the first in-orbit demonstration of a vision-language model performing autonomous multi-modal inference entirely onboard a Low Earth Orbit (LEO) spacecraft. The system uses Gemma 3 to classify scenes, generate text descriptions, and answer operator queries in natural language. In ground benchmarking it reached 88.16% accuracy on a curated AID benchmark, and in orbit it processed newly acquired, uncorrected YAM-9 imagery with hardware-accelerated GPU inference and no fine-tuning for the flight instrument. The authors present this as a shift from the conventional acquire-then-downlink approach toward in-orbit semantic compression of Earth observations.

arXiv:2606.18271v1

Abstract: As Earth Observation data generation outpaces downlink bandwidth and human-in-the-loop processing, a widening gap has emerged between onboard collection and actionable ground intelligence. This paper presents NAVI-Orbital, a software system deployed on a Low Earth Orbit (LEO) spacecraft. On April 16, 2026, NAVI-Orbital achieved what is, to the authors' knowledge, the first in-orbit demonstration of a vision-language model performing autonomous multi-modal inference entirely onboard. NAVI-Orbital uses a local vision-language model (Gemma 3) to classify each captured scene, produce a text description of its content and the relationships between its features, and respond to operator follow-up via natural-language dialogue. The system is re-tasked through plain-English prompts in place of conventional command sequences, and is orchestrated by a graph-based state machine (LangGraph) that coordinates dedicated agents for detection and dialogue. Results span ground benchmarking (88.16% accuracy on the 7,960-image curated AID benchmark), Flatsat validation, and live in-orbit captures of newly acquired, previously unseen Earth imagery, including uncorrected YAM-9 imagery processed onboard with hardware-accelerated GPU inference and no fine-tuning for the flight instrument. Together, these results demonstrate the feasibility of running foundation models on satellite-class edge computers to invert the conventional acquire-then-downlink-everything bandwidth profile through in-orbit semantic compression of Earth observations.
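The orchestration pattern the abstract describes, a graph-based state machine routing between a detection agent and a dialogue agent, can be illustrated with a minimal sketch. This is not the authors' implementation and does not use the LangGraph API: the graph runner is hand-rolled in plain Python, and every name here (`MissionState`, `stub_vlm`, `detection_agent`, `dialogue_agent`, `run_graph`) is hypothetical. The model call is a placeholder that returns a formatted string rather than invoking Gemma 3.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class MissionState:
    """Shared state passed between agents (hypothetical structure)."""
    image_id: str
    task_prompt: str                      # plain-English tasking instruction
    operator_query: Optional[str] = None  # optional follow-up question
    label: Optional[str] = None
    description: Optional[str] = None
    reply: Optional[str] = None
    trace: List[str] = field(default_factory=list)


def stub_vlm(prompt: str, image_id: str) -> str:
    """Placeholder for an onboard vision-language model call (assumption)."""
    return f"[model output for {image_id} given: {prompt}]"


def detection_agent(state: MissionState) -> str:
    """Classify the scene and describe it, then choose the next node."""
    state.trace.append("detect")
    state.label = stub_vlm(f"{state.task_prompt} Answer with one class label.",
                           state.image_id)
    state.description = stub_vlm("Describe the scene and how its features relate.",
                                 state.image_id)
    # Route to dialogue only when an operator question is pending.
    return "dialogue" if state.operator_query else "end"


def dialogue_agent(state: MissionState) -> str:
    """Answer the operator's follow-up using the stored description."""
    state.trace.append("dialogue")
    state.reply = stub_vlm(
        f"Context: {state.description} Question: {state.operator_query}",
        state.image_id,
    )
    return "end"


# Each node maps a name to a function that mutates state and returns the next node.
NODES: Dict[str, Callable[[MissionState], str]] = {
    "detect": detection_agent,
    "dialogue": dialogue_agent,
}


def run_graph(state: MissionState, entry: str = "detect",
              max_steps: int = 10) -> MissionState:
    """Walk the graph from the entry node until a node returns 'end'."""
    node = entry
    for _ in range(max_steps):
        node = NODES[node](state)
        if node == "end":
            return state
    raise RuntimeError("graph did not terminate within max_steps")


if __name__ == "__main__":
    result = run_graph(MissionState(
        image_id="capture-001",
        task_prompt="Classify this scene by land-use type.",
        operator_query="Are any vessels visible near the shoreline?",
    ))
    print(result.trace)
    print(result.reply)
```

In this sketch, re-tasking amounts to changing the `task_prompt` string; the graph and agents stay the same. Only the short text fields (`label`, `description`, `reply`) would need to be downlinked in place of the image, which is the sense in which the abstract speaks of semantic compression.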