MASt3R-Nav: WayPixel Navigation in Relative 3D Maps

Anonymous Authors
ICRA 2026
Overview of MASt3R-Nav

MASt3R-Nav constructs a pixel-level topological map from RGB sequences and generates dense WayPixel Costmaps that guide a learned controller to the goal.

Abstract

A visual navigation system's capability is strongly tied to its underlying representation of the world. Unlike classical 3D maps, which require globally consistent geometry, image- or object-relative topological graphs almost entirely do away with geometric understanding. However, this comes at the cost of navigation capability, often limiting such systems to mere teach-and-repeat. In this work, we propose a novel map representation based on pixel-relative connectivity, which is geometrically accurate yet does not require global geometric consistency.

Inspired by recent progress in 3D-grounded image matching, we construct a map from an image sequence through inter-image connectivity based on pixel correspondences in the relative 3D coordinate systems of individual image pairs. We then use this pixel-level graph to perform global path planning by approximating and sparsifying intra-image pixel connectivity. From this, we derive a "WayPixel Costmap" representation and train a controller conditioned on it to predict a trajectory rollout.

We show that this dense pixel-level costmap based on relative geometry is a more accurate conditioning variable for control prediction than its image- and object-level counterparts. This enables a highly capable navigation system, as validated on four types of navigation tasks in simulation and through real-world demonstrations.

Method

MASt3R-Nav Architecture

MASt3R-Nav Architecture. Mapping constructs a pixel-level topological graph by linking correspondences across frames and encoding traversal costs using 3D geometry from MASt3R. During execution, the agent localizes itself against the map and generates a fine-grained pixel costmap by matching its current observation and propagating the matched costs to all pixels. Planning computes shortest paths through this costmap, yielding dense pixel-wise gradients toward the goal. Finally, a neural controller consumes the pixel costmap to predict waypoints, enabling a direct comparison with object-based baselines.
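As a concrete illustration of the graph and planning steps, the sketch below builds a toy pixel-level graph from inter-image correspondences and intra-image edges, then computes a cost-to-goal value for every pixel node with Dijkstra's algorithm. All names here (`build_pixel_graph`, the `(frame, pixel)` node encoding) are illustrative assumptions, not the paper's actual implementation.

```python
import heapq
from collections import defaultdict

def build_pixel_graph(inter_edges, intra_edges):
    """Build an undirected pixel-level graph.

    inter_edges: iterable of ((frame_i, px_i), (frame_j, px_j), cost)
        correspondences from pairwise image matching.
    intra_edges: iterable of ((frame, px_a), (frame, px_b), cost)
        sparsified within-image connectivity.
    Nodes are (frame_id, pixel_index) tuples (an assumed encoding);
    edge costs stand in for traversal costs from relative 3D geometry.
    """
    graph = defaultdict(list)
    for u, v, w in list(inter_edges) + list(intra_edges):
        graph[u].append((v, w))
        graph[v].append((u, w))
    return graph

def dijkstra_costs(graph, goal):
    """Cost-to-goal for every reachable pixel node, via Dijkstra from the goal."""
    dist = {goal: 0.0}
    pq = [(0.0, goal)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist
```

Running Dijkstra from the goal (rather than the query) yields costs for all pixels at once, which is what a dense costmap needs.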


WayPixel Costmap

WayPixel Costmap generation

WayPixel Costmap generation. Given the pixel-relative map representation and a query (1), we obtain pixel-level planning costs through a series of steps that trace a path (highlighted with a white background in (2)) from the query pixel through matched and bridging map pixels to the goal. (3) shows the flow of cost gradients from each pixel to its closest least-cost matched pixel, and (4) shows the final dense WayPixel Costmap on which we condition our trained controller, PixelReact. (5) shows the query RGB; the goal position is off-screen, toward the left of the scene.
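The propagation in steps (3)-(4) can be sketched as a generalized distance transform: each pixel takes the minimum, over matched pixels, of that match's planning cost plus a distance penalty. This is a simplified, hypothetical stand-in for the paper's geometry-derived costs; `alpha` is an assumed weighting parameter, and a real implementation would vectorize the loops rather than iterate per pixel.

```python
import math

def waypixel_costmap(shape, matched_px, matched_costs, alpha=0.05):
    """Dense costmap over an H x W image.

    Each pixel inherits the cost of its least-cost matched pixel plus an
    image-plane distance penalty. `alpha` (an assumption here) trades off
    match cost against pixel distance.
    """
    H, W = shape
    costmap = [[float("inf")] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            for (my, mx), c in zip(matched_px, matched_costs):
                d = math.hypot(y - my, x - mx)
                costmap[y][x] = min(costmap[y][x], c + alpha * d)
    return costmap
```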

Mapping Visualization

This interactive view shows the map construction process in two synchronized panels. The left panel displays the sequential camera/image trajectory, while the right panel shows a side-by-side point-cloud view for the active frame pair. In both panels, we overlay inter-image and intra-image graph edges to highlight how local observations connect into a pixel-relative topological structure over time.

Note: the visualization can take roughly one minute to fully load after opening.


Results

We evaluate on the HM3D IIN-val set across four navigation tasks: Imitate, Alt Goal, Shortcut, and Reverse, reporting SPL and SSPL metrics.
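For reference, SPL (Success weighted by Path Length) follows the standard definition of Anderson et al. (2018). The sketch below also includes a soft-SPL variant that replaces the binary success indicator with goal progress; we assume this Habitat-style definition corresponds to the SSPL reported here, so treat that mapping as an assumption.

```python
def spl(successes, shortest_lens, path_lens):
    """Success weighted by Path Length: mean over episodes of
    S_i * l_i / max(p_i, l_i), where S_i is binary success, l_i the
    shortest-path length, and p_i the agent's actual path length."""
    terms = [s * l / max(p, l)
             for s, l, p in zip(successes, shortest_lens, path_lens)]
    return sum(terms) / len(terms)

def soft_spl(init_dists, final_dists, shortest_lens, path_lens):
    """Soft SPL (assumed Habitat-style): the binary success S_i is
    replaced by goal progress max(0, 1 - d_final / d_init)."""
    terms = [max(0.0, 1.0 - dT / d0) * l / max(p, l)
             for d0, dT, l, p in zip(init_dists, final_dists,
                                     shortest_lens, path_lens)]
    return sum(terms) / len(terms)
```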

Object-level vs. Pixel-level Representations

| Mapper | Localizer | Controller | SPL | SSPL |
|---|---|---|---|---|
| Object-level Representation | | | | |
| LGlue | LGlue | ObjectReact | 51.51 | 58.59 |
| MASt3R | LGlue | ObjectReact | 45.45 | 53.21 |
| LGlue | MASt3R | ObjectReact | 51.50 | 60.85 |
| MASt3R | MASt3R | ObjectReact | 51.48 | 58.64 |
| Pixel-level Representation | | | | |
| MASt3R | MASt3R | ObjectReact | 63.63 | 74.11 |
| MASt3R | MASt3R | PixelReact (Ours) | 81.77 | 84.36 |

Comparison of object-level vs. pixel-level representations for image-goal navigation. Replacing the object-level costmap with our pixel-relative WayPixel Costmap and PixelReact controller substantially improves performance.

State-of-the-Art Comparison

| Method | Type | Train Data | Imitate SPL / SSPL | Alt Goal SPL / SSPL | Shortcut SPL / SSPL | Reverse SPL / SSPL | Average SPL / SSPL |
|---|---|---|---|---|---|---|---|
| GNM | Image-Relative | Real | 78.79 / 82.95 | 8.70 / 15.44 | 15.38 / 31.74 | 3.33 / 6.11 | 26.55 / 34.56 |
| GNM (HM3D) | Image-Relative | HM3D | 81.82 / 86.38 | 0.00 / 10.91 | 15.38 / 24.57 | 13.28 / 20.77 | 27.62 / 35.66 |
| PixNav | Object-Relative | HM3D | 42.42 / 46.75 | 26.09 / 31.66 | 7.69 / 22.29 | 16.16 / 25.56 | 23.09 / 31.57 |
| RoboHop | Object-Relative | Zero Shot | 57.56 / 64.99 | 30.43 / 38.23 | 30.77 / 40.87 | 9.98 / 16.92 | 32.19 / 40.25 |
| ObjectReact | Object-Relative | HM3D | 60.60 / 68.51 | 21.74 / 26.68 | 23.08 / 39.64 | 30.00 / 42.01 | 33.36 / 44.71 |
| MASt3R-Nav (Ours) | Pixel-Relative | HM3D | 93.94 / 94.95 | 47.83 / 58.06 | 46.15 / 61.10 | 23.25 / 26.83 | 52.79 / 60.24 |

State-of-the-art comparison of different control methods on four navigation tasks. MASt3R-Nav outperforms all baselines on the Imitate, Alt Goal, and Shortcut tasks by large margins, improving Imitate SPL by over 12 absolute points over the prior best (GNM (HM3D)) and exceeding the best object-relative method by more than 15 SPL points on both Alt Goal and Shortcut.

Scalability Study

| EC | NC | Num Nodes | Intra-Frame Edges | Disk (MB) | Intra (s) | Edge-Weight (s) | Dijkstra (s) | SPL | SSPL |
|---|---|---|---|---|---|---|---|---|---|
| Exhaustive | Sub10 | 24076 | 4660494 | 90.6 | 16.7 | 50.5 | 9.0 | 74.78 | 81.41 |
| EMST | Sub10 | 24076 | 24011 | 73.9 | 8.7 | 2.1 | 1.4 | 78.62 | 82.85 |
| Delaunay 3D | Sub10 | 24076 | 167970 | 75.3 | 7.0 | 3.7 | 1.7 | 78.56 | 82.85 |
| EMST | None | 191281 | 191215 | 78.4 | 98.2 | 4.3 | 2.3 | 62.06 | 73.16 |
| Delaunay 3D | None | 191281 | 1418359 | 85.9 | 25.7 | 24.2 | 8.5 | 66.28 | 74.92 |
| ObjectReact | -- | 1676 | 22633 | 4.76 | -- | -- | 0.011 | 60.60 | 68.51 |

Scalability study comparing inter- and intra-image connectivity strategies. EMST with Sub10 subsampling reduces edges from 4.6M to 24K while maintaining navigation performance, demonstrating that sparse pixel-level connectivity suffices for robot navigation.
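To make the sparsification strategies concrete, the sketch below computes a Euclidean minimum spanning tree over a handful of 3D points using Prim's algorithm. This O(n²) toy version is for illustration only; at the scale in the table (hundreds of thousands of nodes) an EMST would typically be extracted from a Delaunay triangulation instead, which the table's "Delaunay 3D" rows also hint at.

```python
import heapq
import math

def emst_edges(points):
    """Edges of the Euclidean minimum spanning tree over 3D points,
    via Prim's algorithm on the implicit complete graph."""
    n = len(points)
    if n < 2:
        return []
    dist = lambda a, b: math.dist(points[a], points[b])
    in_tree = [False] * n
    in_tree[0] = True
    pq = [(dist(0, j), 0, j) for j in range(1, n)]
    heapq.heapify(pq)
    edges = []
    while len(edges) < n - 1:
        d, i, j = heapq.heappop(pq)
        if in_tree[j]:
            continue  # j was connected via a cheaper edge already
        in_tree[j] = True
        edges.append((i, j))
        for k in range(n):
            if not in_tree[k]:
                heapq.heappush(pq, (dist(j, k), j, k))
    return edges
```

An MST on n nodes always has n - 1 edges, which matches the table: 24076 nodes yield 24011 intra-frame EMST edges (slightly fewer than n - 1, presumably because edges span per-frame components).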


Real World Demonstration


We deployed MASt3R-Nav on a P3DX mobile robot equipped with a RealSense camera providing RGB images. We show RGB observations, their WayPixel Costmaps, and the controller's waypoints toward the goal object (shaded in blue in image 4) at four different locations along the robot's trajectory. Despite being trained exclusively on the simulated HM3D dataset, our navigation pipeline performs effectively at inference time on a real-world mobile robot in an unseen environment.

BibTeX

@inproceedings{mastrnav2026,
  title     = {MASt3R-Nav: WayPixel Navigation in Relative 3D Maps},
  author    = {Anonymous Authors},
  booktitle = {International Conference on Robotics and Automation (ICRA)},
  year      = {2026},
}