FlySearch: Exploring how vision-language models explore
Adam Pardyl, Dominik Matuszek, Mateusz Przebieracz, Marek Cygan, Bartosz Zieliński, Maciej Wołczyk
TL;DR
A benchmark for evaluating vision-language models in simulated, photorealistic 3D outdoor environments. Easy for humans, hard for state-of-the-art VLMs/MLLMs.
Leaderboard
Model | FS-1 Forest (%) | FS-1 City (%) | FS-Anomaly-1 Overall (%) | FS-2 City (%)
---|---|---|---|---
Human (untrained) | -- | 66.7 ± 4.5 | -- | 60.8 ± 6.9
Gemini 2.0 Flash | 42.5 ± 3.5 | 41.5 ± 3.5 | 35.5 ± 3.4 | 6.0 ± 1.1
Claude 3.5 Sonnet | 52.0 ± 3.5 | 30.5 ± 3.3 | 27.5 ± 3.2 | 6.5 ± 1.2
GPT-4o | 45.5 ± 3.5 | 33.5 ± 3.3 | 27.0 ± 3.1 | 3.5 ± 0.9
Pixtral-Large | 38.0 ± 3.4 | 21.5 ± 2.9 | 15.0 ± 2.5 | 3.0 ± 0.8
Qwen2-VL 72B | 16.5 ± 2.6 | 18.0 ± 2.7 | 7.5 ± 1.9 | --
LLaVA-OneVision 72B | 12.5 ± 2.3 | 6.5 ± 1.7 | 8.5 ± 2.0 | --
Qwen2.5-VL 7B | 6.0 ± 1.7 | 1.5 ± 0.9 | 2.8 ± 1.2 | 0.0 ± 0.0
InternVL2.5 8B MPO | 2.5 ± 1.1 | 1.5 ± 0.9 | 3.5 ± 1.3 | --
LLaVA-Interleave 7B | 0.0 ± 0.0 | 1.5 ± 0.9 | 0.0 ± 0.0 | --
Phi-3.5-vision | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | --
Motivation: Vision-Language Models for Embodied AI exploration
Can Vision-Language Models (VLMs) perform active, goal-driven exploration of real-world environments?
- The real world is messy and unstructured. Uncovering critical information and making decisions requires curiosity, adaptability, and a goal-oriented mindset.
- VLMs offer great zero-shot performance in many difficult tasks ranging from image captioning to robotics.
- The ability of VLMs to operate in realistic, open-ended environments remains largely untested.
Idea: Evaluate VLM exploration skills in 3D open-world scenarios
FlySearch -- a new benchmark that evaluates exploration skills using vision-language reasoning.
- Task: locate an object specified in natural language or by visual examples.
- Embodied interaction: the model controls an Unmanned Aerial Vehicle (UAV), observing images captured at successive locations and issuing text commands that describe the next move.
Realistic vision-language exploration benchmark
- High-fidelity outdoor environment built with Unreal Engine 5, enabling realistic and scalable evaluation of embodied agents in complex, unstructured settings.
- A suite of object-based exploration challenges designed to isolate and measure the exploration capabilities of VLMs and humans in open-world scenarios.
- Procedurally generated. We can generate an infinite number of vastly different test scenarios.
- Easy for a human to solve, yet challenging for popular VLMs. We identify consistent failure modes across vision, grounding, and reasoning.
Our benchmark consists of two types of evaluation environments, forest and city. For each, we can procedurally generate an unlimited number of distinct test scenarios.
- Forest environment
- City environment
Standardized evaluation set
We define three levels of difficulty for FlySearch and provide a test set of generated scenarios (a sketch of one possible scenario layout follows the list):
- FS-1: The target is visible from the starting position, but can be just a few pixels wide in the initial view. The object is described by text (e.g. "a red sports car").
- FS-Anomaly-1: The target object is an easy-to-spot anomaly (e.g. a flying saucer/UFO). The object description is not given to the model; the task is to look for an anomaly. Other settings are the same as in FS-1.
- FS-2: The object can be hidden behind buildings or be far away from the starting position. An additional visual preview of the object is given to the model at the start.
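To make the differences between the tiers concrete, here is a minimal sketch of the information a single generated scenario could carry. The class and field names are illustrative assumptions, not the benchmark's actual scenario format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical layout of one procedurally generated FlySearch scenario.
@dataclass
class Scenario:
    difficulty: str                        # "FS-1", "FS-Anomaly-1", or "FS-2"
    environment: str                       # "forest" or "city"
    target_description: Optional[str]      # e.g. "a red sports car"; None for FS-Anomaly-1
    target_preview_image: Optional[bytes]  # rendered preview of the target, given in FS-2
    target_position: Tuple[float, float, float]
    uav_start_position: Tuple[float, float, float]
```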
Evaluation pipeline
Besides the vision-language model under evaluation, FlySearch consists of three parts: the simulator, the evaluation controller, and the scenario generator.
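As a rough illustration of how these parts fit together, below is a minimal sketch of an evaluation loop. All class and method names are assumptions made for illustration and do not correspond to the actual FlySearch code.

```python
# Hypothetical wiring of the scenario generator, simulator, evaluation
# controller, and a VLM; not the benchmark's real API.

def run_episode(scenario_generator, simulator, controller, vlm, max_steps=10):
    """Generate one scenario, reset the simulator, and let the VLM explore."""
    scenario = scenario_generator.sample()            # procedural scenario (target, spawn point, ...)
    observation = simulator.reset(scenario)           # initial RGB view + altitude
    messages = [controller.initial_prompt(scenario)]  # task description and action instructions

    for _ in range(max_steps):
        messages.append(controller.format_observation(observation))
        reply = vlm.chat(messages)                    # free-form reasoning followed by an action
        action = controller.parse_action(reply)       # relative move or FOUND
        if action == "FOUND":
            return controller.check_success(simulator.state(), scenario)
        observation = simulator.step(action)          # move the UAV and render the new view
    return False                                      # ran out of steps without declaring FOUND
```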
Benchmark environment details
- Initial prompt: Describes the target, explains how to perform actions, and states the success condition.
- Observations: During exploration, the model is provided with the current camera view of the UAV (500 × 500 px RGB image, downward-facing camera) and current altitude.
- Action: Relative movement coordinates or the FOUND action (as text), preceded by an unlimited number of reasoning tokens.
- Success condition: The agent has the target's center in its view, is at most 10 meters above it, and responds with FOUND (see the sketch after this list).
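The bullet points above amount to a simple text protocol. The sketch below shows one way the action parsing and success check could look; the exact action syntax and the geometric test used by FlySearch may differ, so treat this as an illustrative approximation.

```python
import re

def parse_action(reply: str):
    """Extract the action from the model's reply text.

    Assumed format: either the word FOUND or a relative move such as
    "(10, -5, -20)" in metres; the syntax FlySearch actually expects may differ.
    """
    if "FOUND" in reply.upper():
        return "FOUND"
    match = re.search(
        r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)", reply
    )
    if match is None:
        raise ValueError("no valid action found in the model reply")
    return tuple(float(v) for v in match.groups())


def is_success(uav_xyz, target_xyz, half_footprint):
    """Approximate success test: the target's center lies inside the
    downward-facing camera footprint and the UAV is at most 10 m above it.

    half_footprint is the assumed half-width (in metres) of the ground area
    visible from the UAV's current altitude.
    """
    dx = abs(uav_xyz[0] - target_xyz[0])
    dy = abs(uav_xyz[1] - target_xyz[1])
    height_above_target = uav_xyz[2] - target_xyz[2]
    return dx <= half_footprint and dy <= half_footprint and 0.0 < height_above_target <= 10.0
```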
Example trajectory
Example of a successful trajectory in FS-1 performed by GPT-4o. The agent navigates to the red sports car by first descending and then moving to the right. The first row shows the model's visual inputs, and the second the actions it has taken. Note the grid overlay on the images.
Acknowledgements
Citation
@misc{pardyl2025flysearch,
title={FlySearch: Exploring how vision-language models explore},
author={Adam Pardyl and Dominik Matuszek and Mateusz Przebieracz and Marek Cygan and Bartosz Zieliński and Maciej Wołczyk},
year={2025},
eprint={2506.02896},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.02896},
}
Contact
For questions, please open an issue on GitHub or contact Adam Pardyl -- adam.pardyl <at> doctoral.uj.edu.pl.