MASt3R-Nav: WayPixel Navigation in Relative 3D Maps

Abstract

A robot's visual navigation ability is strongly tied to its underlying representation of the world. Unlike classical 3D maps that require globally consistent geometry, image- or object-relative topological graphs almost entirely do away with geometric understanding. This comes at the cost of navigation capability, however, often limiting it to mere teach-and-repeat. In this work, we propose a novel map representation in the form of pixel-relative connectivity, which is geometrically accurate but does not require global geometric consistency.

Inspired by recent progress in 3D grounded image matching, we construct a map from an image sequence through inter-image connectivity based on pixel correspondences in the relative 3D coordinate systems of individual image pairs. We then use this pixel-level graph to perform global path planning by approximating and sparsifying intra-image pixel connectivity. Through this, we derive a "WayPixel Costmap" representation and train a controller conditioned on it to predict a trajectory rollout.

We show that this dense pixel-level costmap based on relative geometry is a more accurate conditioning variable for control prediction than its image- and object-level counterparts. This enables a highly capable navigation system, as validated on four types of navigation tasks in simulation and through real-world demonstrations.

Method

MASt3R-Nav Architecture. Mapping constructs a pixel-level topological graph by linking pixel correspondences across frames and encoding traversal costs using 3D geometry from MASt3R. During execution, the agent localizes itself against the map and generates a fine-grained pixel costmap by matching its current observation to the map and propagating the costs of the matched pixels to all pixels. Planning computes shortest paths through the graph, yielding dense pixel-wise gradients towards the goal. Finally, a neural controller consumes the pixel costmap to predict waypoints, enabling a direct comparison with object-based baselines.
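The planning step described above can be illustrated with a minimal sketch. It assumes the map is an undirected weighted graph whose nodes are hypothetical `(frame_index, pixel_id)` pairs, and it runs Dijkstra's algorithm from the goal so that every reachable node receives its least cost to the goal. The node naming, the toy edge weights, and the zero cost on the inter-image correspondence are illustrative assumptions, not values from the paper.

```python
import heapq


def cost_to_goal(edges, goal):
    """Dijkstra from the goal over an undirected weighted graph.

    edges: iterable of (node_a, node_b, weight) with weight >= 0.
    Returns a dict mapping each reachable node to its least cost to the goal.
    """
    adjacency = {}
    for a, b, w in edges:
        adjacency.setdefault(a, []).append((b, w))
        adjacency.setdefault(b, []).append((a, w))

    best = {goal: 0.0}
    frontier = [(0.0, goal)]
    while frontier:
        cost, node = heapq.heappop(frontier)
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for neighbor, w in adjacency.get(node, []):
            new_cost = cost + w
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(frontier, (new_cost, neighbor))
    return best


# Hypothetical toy map: nodes are (frame_index, pixel_id) pairs.
toy_edges = [
    ((0, "a"), (0, "b"), 1.0),  # intra-image edge in frame 0
    ((0, "b"), (1, "c"), 0.0),  # inter-image correspondence (assumed free)
    ((1, "c"), (1, "d"), 2.5),  # intra-image edge in frame 1
    ((0, "a"), (1, "d"), 5.0),  # a costlier direct link
]
costs = cost_to_goal(toy_edges, goal=(1, "d"))
```

In this toy graph the node `(0, "a")` reaches the goal more cheaply through the correspondence than through the direct link, which is the kind of cost that would later be attached to matched pixels in the query image.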

WayPixel Costmap

WayPixel Costmap generation. Given the pixel-relative map representation and a query (1), we obtain pixel-level planning costs in a series of steps. (2) A path, highlighted on a white background, runs from the query pixel through matched and bridging map pixels to the goal. (3) Cost gradients flow from each pixel to its closest least-cost matched pixel. (4) The final dense WayPixel Costmap is the input on which we condition our trained controller, PixelReact. (5) The query RGB image; the goal position is off-screen, toward the left of the scene.
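One way to picture the densification from matched pixels to all pixels is the sketch below. The exact propagation rule is not given here, so the code assumes a simple one: each pixel takes the minimum, over matched pixels, of the matched pixel's planning cost plus the Euclidean distance between the two pixels' 3D points. The function name, the pixel identifiers, and the coordinates are hypothetical.

```python
import math


def dense_costmap(pixels_xyz, matched_costs):
    """Assign every pixel a cost via its least-cost matched pixel.

    pixels_xyz: dict pixel_id -> (x, y, z) in the query camera frame.
    matched_costs: dict pixel_id -> planning cost, for matched pixels only.
    Assumed rule: cost(p) = min over matched m of cost(m) + ||p - m||.
    """
    dense = {}
    for pid, xyz in pixels_xyz.items():
        dense[pid] = min(
            cost + math.dist(xyz, pixels_xyz[mid])
            for mid, cost in matched_costs.items()
        )
    return dense


# Hypothetical query image with three pixels, two of them matched to the map.
pixels = {"p0": (0.0, 0.0, 1.0), "p1": (1.0, 0.0, 1.0), "p2": (3.0, 0.0, 1.0)}
matched = {"p0": 4.0, "p2": 1.0}
costmap = dense_costmap(pixels, matched)
```

Here the unmatched pixel `p1` is closer to `p0` in space but inherits its cost from `p2`, because `p2` offers the lower total cost. This brute-force version is quadratic in the number of pixels and is only meant to show the rule, not an efficient implementation.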

Mapping Visualization

This interactive view shows the map construction process in two synchronized panels. The left panel displays the sequential camera/image trajectory, while the right panel shows a side-by-side point-cloud view for the active frame pair. In both panels, we overlay inter-image and intra-image graph edges to highlight how local observations connect into a pixel-relative topological structure over time.

Note: the visualization can take about 1 minute to fully load after opening.


Results

We evaluate on the HM3D IIN-val set across four navigation tasks: Imitate, Alt Goal, Shortcut, and Reverse, reporting SPL and SSPL metrics.
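The metrics are not defined on this page, so the sketch below assumes the commonly used definitions: SPL (Success weighted by Path Length) averages `success * shortest / max(path, shortest)` over episodes, and soft SPL (taken here to be what SSPL denotes) replaces the binary success with the fraction of the initial goal distance that was covered. The function names and the episode values are hypothetical; the tables on this page report the metrics as percentages.

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes (range 0..1)."""
    total = 0.0
    for s, shortest, path in zip(successes, shortest_lengths, path_lengths):
        total += float(s) * shortest / max(path, shortest)
    return total / len(successes)


def soft_spl(start_dists, final_dists, shortest_lengths, path_lengths):
    """Soft SPL: binary success is replaced by progress towards the goal."""
    total = 0.0
    for d0, d1, shortest, path in zip(
        start_dists, final_dists, shortest_lengths, path_lengths
    ):
        progress = max(0.0, 1.0 - d1 / d0)  # fraction of distance covered
        total += progress * shortest / max(path, shortest)
    return total / len(start_dists)


# Two hypothetical episodes: one success with a detour, one failure that
# still covered half of the distance to the goal.
spl_value = spl([1, 0], [10.0, 10.0], [12.5, 20.0])
sspl_value = soft_spl([10.0, 10.0], [0.0, 5.0], [10.0, 10.0], [12.5, 20.0])
```

Under these assumed definitions the failed episode contributes nothing to SPL but still earns partial credit in soft SPL, which is why SSPL is at least as high as SPL in every row of the tables.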

Object-level vs. Pixel-level Representations

Mapper	Localizer	Controller	SPL	SSPL
Object-level Representation
LGlue	LGlue	ObjectReact	51.51	58.59
MASt3R	LGlue	ObjectReact	45.45	53.21
LGlue	MASt3R	ObjectReact	51.50	60.85
MASt3R	MASt3R	ObjectReact	51.48	58.64
Pixel-level Representation
MASt3R	MASt3R	ObjectReact	63.63	74.11
MASt3R	MASt3R	PixelReact (Ours)	81.77	84.36

Comparison of object-level vs. pixel-level representations for image-goal navigation. With MASt3R used for both mapping and localization, replacing the object-level costmap with our pixel-relative WayPixel Costmap and PixelReact controller raises SPL from 51.48 to 81.77 and SSPL from 58.64 to 84.36.

State-of-the-Art Comparison

Method	Type	Train Data	Imitate SPL	Imitate SSPL	Alt Goal SPL	Alt Goal SSPL	Shortcut SPL	Shortcut SSPL	Reverse SPL	Reverse SSPL	Average SPL	Average SSPL
Image-Relative
GNM	Image-Relative	Real	78.79	82.95	8.70	15.44	15.38	31.74	3.33	6.11	26.55	34.56
GNM (HM3D)	Image-Relative	HM3D	81.82	86.38	0.00	10.91	15.38	24.57	13.28	20.77	27.62	35.66
Object-Relative
PixNav	Object-Relative	HM3D	42.42	46.75	26.09	31.66	7.69	22.29	16.16	25.56	23.09	31.57
RoboHop	Object-Relative	Zero Shot	57.56	64.99	30.43	38.23	30.77	40.87	9.98	16.92	32.19	40.25
ObjectReact	Object-Relative	HM3D	60.60	68.51	21.74	26.68	23.08	39.64	30.00	42.01	33.36	44.71
Pixel-Relative
MASt3R-Nav (Ours)	Pixel-Relative	HM3D	93.94	94.95	47.83	58.06	46.15	61.10	23.25	26.83	52.79	60.24

State-of-the-art comparison of different control methods on four navigation tasks. MASt3R-Nav outperforms all baselines on the Imitate, Alt Goal, and Shortcut tasks by large margins: it improves Imitate SPL by about 12 points over the prior best (GNM trained on HM3D, 81.82 to 93.94) and raises SPL and SSPL on Alt Goal and Shortcut to roughly 1.5 times those of the best object-relative method (RoboHop). On Reverse, ObjectReact remains the strongest method.

Scalability Study

Connectivity		Topomap Graph Stats			Computation Time (s)			Navigation
EC	NC	Num Nodes	Intra-Frame Edges	Disk (MB)	Intra	Edge-Weight	Dijkstra	SPL	SSPL
Exhaustive	Sub10	24076	4660494	90.6	16.7	50.5	9.0	74.78	81.41
EMST	Sub10	24076	24011	73.9	8.7	2.1	1.4	78.62	82.85
Delaunay 3D	Sub10	24076	167970	75.3	7.0	3.7	1.7	78.56	82.85
EMST	None	191281	191215	78.4	98.2	4.3	2.3	62.06	73.16
Delaunay 3D	None	191281	1418359	85.9	25.7	24.2	8.5	66.28	74.92
ObjectReact		1676	22633	4.76	--	--	0.011	60.60	68.51

Scalability study comparing inter- and intra-image connectivity strategies. EMST with Sub10 subsampling reduces intra-frame edges from about 4.66M (Exhaustive) to 24K while slightly improving navigation performance (SPL 74.78 to 78.62), suggesting that sparse pixel-level connectivity suffices for robot navigation.
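To illustrate why a spanning tree sparsifies so aggressively, the sketch below builds a Euclidean minimum spanning tree over a handful of 3D points with Prim's algorithm. It assumes that "EMST" in the table stands for Euclidean minimum spanning tree, which is not spelled out here; the function name and the points are hypothetical, and the quadratic-time loop is for exposition only.

```python
import math


def emst_edges(points):
    """Prim's algorithm on the complete Euclidean graph over `points`.

    Returns the n - 1 tree edges as (i, j) index pairs.
    """
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # distance from each point to the tree
    best_parent = [-1] * n       # tree node realizing that distance
    in_tree[0] = True
    for j in range(1, n):
        best_dist[j] = math.dist(points[0], points[j])
        best_parent[j] = 0

    edges = []
    for _ in range(n - 1):
        # Attach the point outside the tree that is closest to it.
        nxt = min(
            (j for j in range(n) if not in_tree[j]),
            key=best_dist.__getitem__,
        )
        edges.append((best_parent[nxt], nxt))
        in_tree[nxt] = True
        for j in range(n):
            if not in_tree[j]:
                d = math.dist(points[nxt], points[j])
                if d < best_dist[j]:
                    best_dist[j] = d
                    best_parent[j] = nxt
    return edges


# Hypothetical 3D points back-projected from one frame.
points = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0), (10, 0, 0)]
tree = emst_edges(points)
exhaustive_count = len(points) * (len(points) - 1) // 2
```

A spanning tree over n points always has n - 1 edges, whereas exhaustive connectivity grows quadratically. This is consistent with the EMST rows of the table, where the intra-frame edge count sits just below the node count.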

Real World Demonstration

We deployed MASt3R-Nav on a P3DX mobile robot equipped with a RealSense camera for RGB images. We show RGB observations, their WayPixel costmaps, and the controller waypoints towards the goal object (shaded in blue in image 4) at four different locations along the robot trajectory. Despite being trained exclusively on the simulated HM3D dataset, our navigation pipeline performs effectively at inference time on a real-world mobile robot in an unseen environment.

BibTeX

@inproceedings{mastrnav2026,
  title     = {MASt3R-Nav: WayPixel Navigation in Relative 3D Maps},
  author    = {Anonymous Authors},
  booktitle = {International Conference on Robotics and Automation (ICRA)},
  year      = {2026},
}