AION: Aerial Indoor Object-Goal Navigation

Demo

Overview

Object-Goal Navigation (ObjectNav) requires an agent to autonomously explore an unknown environment and navigate toward target objects specified by a semantic label. While prior work has primarily studied zero-shot ObjectNav under 2D locomotion, extending it to aerial platforms with 3D locomotion capability remains underexplored. Aerial robots offer superior maneuverability and search efficiency, but they also introduce new challenges in spatial perception, dynamic control, and safety assurance. In this paper, we propose AION for vision-based aerial ObjectNav without relying on external localization or global maps. AION is an end-to-end dual-policy reinforcement learning (RL) framework that decouples exploration and goal-reaching behaviors into two specialized policies. We evaluate AION on the AI2-THOR benchmark and further assess its real-time performance in IsaacSim using high-fidelity drone models. Experimental results show that AION achieves superior performance across comprehensive evaluation metrics in exploration, navigation efficiency, and safety.

Details

1. Task

Indoor object-goal navigation for UAVs with 3D locomotion: the drone must autonomously explore an unknown environment and navigate toward a target object specified by a semantic label (e.g., “laptop”, “microwave”), without any prior map or external localization.

2. Framework

A dual-policy RL framework that switches between two modes based on target visibility:

Exploration Mode: maximize spatial coverage in unknown space
Goal-Reaching Mode: visual servoing toward the detected target object
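
The visibility-based switch between the two policies can be sketched as follows. This is a minimal illustration, not AION's implementation: the `target_visible` flag, the observation dictionary, and the policy callables are all hypothetical names chosen for the example.

```python
from enum import Enum


class Mode(Enum):
    EXPLORATION = "exploration"
    GOAL_REACHING = "goal_reaching"


def select_mode(target_visible: bool) -> Mode:
    # Switch to goal-reaching as soon as the target is visible;
    # otherwise keep exploring.
    return Mode.GOAL_REACHING if target_visible else Mode.EXPLORATION


def act(observation: dict, explore_policy, goal_policy):
    # `observation["target_visible"]` is an assumed field, e.g. set by a detector.
    mode = select_mode(observation["target_visible"])
    policy = goal_policy if mode is Mode.GOAL_REACHING else explore_policy
    return mode, policy(observation)
```

For example, `act({"target_visible": False}, explore, reach)` would dispatch to `explore`, and the same call with `True` would dispatch to `reach`.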

3. Exploration Mode

Input: Depth map + ROI (Region of Interest). The ROI identifies open, navigable areas in the depth image, mimicking how humans instinctively look toward open spaces when navigating. It is extracted using OpenCV-based methods and provides a directional cue (centroid position \((d_x, d_y)\) and mean depth \(\bar{z}\)) rather than absolute information about unknown space.
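
To make the directional cue concrete, here is a simplified sketch of how \((d_x, d_y)\) and \(\bar{z}\) could be computed from a depth image. It is a stand-in, not the OpenCV-based extraction used by AION: it treats every pixel deeper than a hypothetical `open_threshold` as open space, and it normalizes the centroid offset to \([-1, 1]\) relative to the image center, which is also an assumption.

```python
def extract_roi(depth, open_threshold=3.0):
    """Return (d_x, d_y, z_bar) for the open region of a depth image, or None.

    depth: list of rows, each a list of depth values in meters.
    open_threshold: assumed cutoff; pixels deeper than this count as open.
    """
    height = len(depth)
    width = len(depth[0])
    sum_x = sum_y = sum_z = 0.0
    count = 0
    for row_idx, row in enumerate(depth):
        for col_idx, z in enumerate(row):
            if z > open_threshold:
                # Use pixel centers so the offset is symmetric about the image center.
                sum_x += col_idx + 0.5
                sum_y += row_idx + 0.5
                sum_z += z
                count += 1
    if count == 0:
        return None  # no open region in view
    d_x = (sum_x / count - width / 2) / (width / 2)
    d_y = (sum_y / count - height / 2) / (height / 2)
    z_bar = sum_z / count
    return d_x, d_y, z_bar
```

With a 2x4 depth image whose right half is 5 m deep and whose left half is 1 m deep, the cue points halfway toward the right edge (`d_x = 0.5`, `d_y = 0.0`) with a mean depth of 5 m.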

Rewards:

\[r_t^E = R_{forward} + R_{center} + R_{safe}\]

\(R_{forward}\): reward for moving toward open space
\(R_{center}\): penalty for yaw deviation from ROI centroid
\(R_{safe}\): collision / obstacle proximity penalty
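
A minimal sketch of how the three terms might combine is shown below. The page gives only the sum and a one-line description of each term, so every functional form, weight, and threshold here (`w_forward`, `w_center`, `safe_dist`, `collision_penalty`) is a hypothetical choice. The sketch assumes the penalty terms are negative-valued, so that they can be added as in the equation above.

```python
def exploration_reward(forward_step, d_x, min_obstacle_dist, collided,
                       w_forward=1.0, w_center=0.5,
                       safe_dist=0.5, collision_penalty=10.0):
    """Illustrative r_t^E = R_forward + R_center + R_safe (all shaping assumed)."""
    # R_forward: reward proportional to progress toward open space (meters).
    r_forward = w_forward * forward_step
    # R_center: penalty growing with the horizontal offset of the ROI centroid,
    # used here as a proxy for yaw deviation.
    r_center = -w_center * abs(d_x)
    # R_safe: fixed penalty on collision, otherwise a proximity penalty
    # that activates inside the assumed safety distance.
    if collided:
        r_safe = -collision_penalty
    else:
        r_safe = -max(0.0, safe_dist - min_obstacle_dist)
    return r_forward + r_center + r_safe
```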

4. Goal-Reaching Mode

Input: RGB image + frozen CLIP text embedding (aligns text and visual features for zero-shot object recognition) + object/class bounding box.

Rewards:

\[r_t^G = R_{dist} + R_{bbox} + R_{parent} + R_{suc} - R_{collision}\]

\(R_{dist}\): reward for reducing Euclidean distance to target
\(R_{bbox}\): reward for centering and enlarging the target bounding box in the field of view (indicates approaching the object)
\(R_{parent}\): parent-class reward — e.g., reaching a desk earns partial reward if the target is a laptop on that desk
\(R_{suc}\): task success reward
\(R_{collision}\): collision penalty
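
The goal-reaching reward can be sketched the same way. Only the structure of the sum comes from the equation above; the individual shaping functions, the weights, and the bonus magnitudes are hypothetical. Here \(R_{collision}\) is treated as a non-negative magnitude, since it is subtracted.

```python
def goal_reward(prev_dist, curr_dist, bbox_center_offset, bbox_area_frac,
                parent_reached, success, collided,
                w_dist=1.0, w_bbox=1.0, parent_bonus=1.0,
                success_bonus=10.0, collision_penalty=10.0):
    """Illustrative r_t^G = R_dist + R_bbox + R_parent + R_suc - R_collision."""
    # R_dist: positive when the Euclidean distance to the target shrinks.
    r_dist = w_dist * (prev_dist - curr_dist)
    # R_bbox: larger when the box fills more of the image and sits near the center.
    # bbox_area_frac and bbox_center_offset are both assumed to lie in [0, 1].
    r_bbox = w_bbox * bbox_area_frac * (1.0 - bbox_center_offset)
    # R_parent: partial credit for reaching the parent object (e.g. the desk).
    r_parent = parent_bonus if parent_reached else 0.0
    # R_suc: terminal bonus on task success.
    r_suc = success_bonus if success else 0.0
    # R_collision: non-negative magnitude, subtracted from the total.
    r_collision = collision_penalty if collided else 0.0
    return r_dist + r_bbox + r_parent + r_suc - r_collision
```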

5. Action Space

Discrete 3D actions: forward, turn left/right, ascend, descend, etc.
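
As an illustration, the named actions can be modeled as an enum with a simple kinematic update. The sketch covers only the five actions listed above; the step length, turn angle, and climb distance are hypothetical values, and the pose convention (x, y, z in meters, yaw in degrees, counter-clockwise positive) is an assumption.

```python
import math
from enum import Enum


class Action(Enum):
    FORWARD = 0
    TURN_LEFT = 1
    TURN_RIGHT = 2
    ASCEND = 3
    DESCEND = 4


def step(pose, action, step_m=0.25, turn_deg=30.0, climb_m=0.25):
    """Apply one discrete action to pose = (x, y, z, yaw_deg)."""
    x, y, z, yaw = pose
    if action is Action.FORWARD:
        # Move along the current heading in the horizontal plane.
        x += step_m * math.cos(math.radians(yaw))
        y += step_m * math.sin(math.radians(yaw))
    elif action is Action.TURN_LEFT:
        yaw = (yaw + turn_deg) % 360.0
    elif action is Action.TURN_RIGHT:
        yaw = (yaw - turn_deg) % 360.0
    elif action is Action.ASCEND:
        z += climb_m
    elif action is Action.DESCEND:
        z -= climb_m
    return (x, y, z, yaw)
```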

6. Evaluation

Evaluated on two simulators: AI2-THOR (standard benchmark with seen/unseen object splits) and IsaacSim (larger multi-room environments where the target may be in a different room).

AI2-THOR Benchmark

Model	Split	Seen SR	Seen SPL	Unseen SR	Unseen SPL
BaseModel	18/4	76.7	39.9	81.5	36.4
Scene Prior	18/4	74.3	42.1	83.7	41.9
MJO	18/4	81.2	52.0	90.7	51.7
SSNet	18/4	72.3	50.4	77.8	50.0
Ours	18/4	88.7	57.9	95.0	55.2
BaseModel	14/8	73.3	47.3	70.8	46.6
Scene Prior	14/8	79.3	52.7	71.0	44.8
MJO	14/8	78.8	43.6	83.0	45.6
SSNet	14/8	79.2	44.3	81.8	46.4
Ours	14/8	84.7	61.2	87.0	60.5

SR = Success Rate (%), SPL = Success weighted by Path Length (%)
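
For reference, both metrics can be computed from per-episode logs as sketched below, using the standard SPL definition: the mean over episodes of \(S_i \, \ell_i / \max(p_i, \ell_i)\), where \(S_i\) is the success indicator, \(\ell_i\) the shortest-path length, and \(p_i\) the path length actually flown. The episode tuple layout is a hypothetical choice for the example, and the function returns fractions, so multiply by 100 to compare with the table.

```python
def success_rate_and_spl(episodes):
    """episodes: list of (success, shortest_path_len, taken_path_len) tuples.

    Returns (SR, SPL) as fractions in [0, 1]. Assumes shortest_path_len > 0.
    """
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    successes = 0
    spl_sum = 0.0
    for success, shortest, taken in episodes:
        if success:
            successes += 1
            # A successful episode scores 1 only if it matched the shortest path.
            spl_sum += shortest / max(taken, shortest)
    return successes / n, spl_sum / n
```

For instance, three episodes consisting of an optimal success, a success along a path twice the shortest length, and a failure give SR = 2/3 and SPL = (1 + 0.5 + 0) / 3 = 0.5.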

IsaacSim Cross-Scene

Algorithm	Object	Chem.	Beech.	Ihlen
Exp+MJO	Sofa	3/5	4/5	4/5
	Plant	2/5	5/5	5/5
	Laptop	0/5	3/5	5/5
	Microwave	2/5	5/5	2/5
Exp+SSNet	Sofa	3/5	4/5	3/5
	Plant	3/5	2/5	3/5
	Laptop	0/5	3/5	5/5
	Microwave	1/5	5/5	3/5
AION	Sofa	4/5	4/5	5/5
	Plant	5/5	5/5	4/5
	Laptop	2/5	5/5	5/5
	Microwave	3/5	5/5	5/5

SR = Success Rate (successes / 5 trials)
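
The per-cell counts above can be aggregated into an overall success count per algorithm with a short script. The numbers are copied from the table (successes per scene, in the order Chem., Beech., Ihlen); the dictionary layout and function name are just one convenient way to hold them.

```python
TRIALS_PER_CELL = 5

# Successes per object in the scene order (Chem., Beech., Ihlen), from the table.
RESULTS = {
    "Exp+MJO": {"Sofa": (3, 4, 4), "Plant": (2, 5, 5),
                "Laptop": (0, 3, 5), "Microwave": (2, 5, 2)},
    "Exp+SSNet": {"Sofa": (3, 4, 3), "Plant": (3, 2, 3),
                  "Laptop": (0, 3, 5), "Microwave": (1, 5, 3)},
    "AION": {"Sofa": (4, 4, 5), "Plant": (5, 5, 4),
             "Laptop": (2, 5, 5), "Microwave": (3, 5, 5)},
}


def overall(per_object):
    """Return (total successes, total trials) across all objects and scenes."""
    successes = sum(sum(counts) for counts in per_object.values())
    trials = sum(len(counts) for counts in per_object.values()) * TRIALS_PER_CELL
    return successes, trials


for name, per_object in RESULTS.items():
    successes, trials = overall(per_object)
    print(f"{name}: {successes}/{trials}")
```

Summed this way, the totals are 40/60 for Exp+MJO, 35/60 for Exp+SSNet, and 52/60 for AION.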