Where Unmanned Aerial Vehicles Take Off and Large Language Models Unfold!
This repository accompanies the work:
UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility
This is an active repository; watch it to follow the latest advances.
If you find it useful, please star ⭐ this repo and cite the paper.
- [2025-01-07] 📃 Check out our new paper: UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility.
- [2024-12-28] This repository is newly launched to explore the synergy between Unmanned Aerial Vehicles (UAVs) and Large Language Models (LLMs). We will continually update it with fresh papers, demos, and insights.
- [2024-12-27] Yiduo Li added the dataset section.
- [2024-12-27] Fei Lin and Yonglin Tian curated this list and published the first version.
If you have any questions or suggestions, please feel free to open an issue or contact us via email.
This repository accompanies our work on "UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility".
Here, we primarily store various tables referenced in the survey/overview paper. These tables focus on:
- Summarization of typical LLMs, VLMs, and VFMs
- Awesome works on Foundation Model-based UAV Systems
- UAV-oriented Datasets across multiple application domains
Note: The goal is to provide a structured, easy-to-navigate resource for researchers interested in the intersection of UAVs and Large Language Models.
Category | Characteristics | Advantages | Disadvantages |
---|---|---|---|
Fixed-wing UAV | Fixed wings generate lift with forward motion. | High speed, long endurance, stable flight. | Cannot hover, high demands for takeoff/landing areas. |
Multirotor UAV | Multiple rotors provide lift and control. | Low cost, easy operation, capable of VTOL and hovering. | Limited flight time, low speed, small payload capacity. |
Unmanned Helicopter | Single or dual rotors allow vertical take-off and hovering. | High payload capacity, good wind resistance, long endurance, VTOL. | Complex structure, higher maintenance cost, slower than fixed-wing UAVs. |
Hybrid UAV | Combines fixed-wing and multirotor capabilities. | Flexible missions, long endurance, VTOL. | Complex mechanisms, higher cost. |
Flapping-wing UAV | Uses a clap-and-fling mechanism for flight. | Low noise, high propulsion efficiency, high maneuverability. | Complex analysis and control, limited payload capacity. |
Unmanned Airship | Aerostat that uses a gasbag for lift. | Low cost, low noise. | Low speed, low maneuverability, highly affected by wind. |
Category | Examples | References |
---|---|---|
Intelligent optimization algorithm | Ant Colony Algorithm | Ref |
Intelligent optimization algorithm | Genetic Algorithm | Ref |
Intelligent optimization algorithm | Simulated Annealing Algorithm | Ref |
Mathematical programming | Mixed Integer Linear Programming | Ref |
Mathematical programming | Nonlinear Programming | Ref |
AI-based method | Deep Learning | Ref |
AI-based method | Reinforcement Learning | Ref |
Category | Examples | References |
---|---|---|
Heuristic Algorithm | Particle Swarm Optimization Algorithm | Ref |
Heuristic Algorithm | Genetic Algorithm | Ref |
Heuristic Algorithm | Simulated Annealing Algorithm | Ref |
AI Based Algorithm | Reinforcement Learning | Ref |
AI Based Algorithm | Artificial Neural Network | Ref |
Mathematical Programming Methods | Mixed Integer Programming | Ref |
Market Mechanism Based Method | Auction Based Algorithm | Ref |
Market Mechanism Based Method | Consensus Based Bundle Algorithm | Ref |
Market Mechanism Based Method | Contract Net Protocol | Ref |
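
To make the market-mechanism row concrete, here is a minimal sketch of one auction-style allocation scheme, a sequential single-item auction. It is an illustration only, not the method of any referenced paper: the function name `sequential_auction`, the use of straight-line distance as the bid, and the example coordinates are all assumptions.

```python
from math import dist


def sequential_auction(uav_positions, task_positions):
    """Assign every task to one UAV with a sequential single-item auction.

    Assumption: a UAV's bid for a task is its straight-line distance to it.
    Each round the lowest bid over all (UAV, task) pairs wins, and the
    winning UAV is moved to the task it just won.
    """
    positions = dict(uav_positions)      # current position of each UAV
    remaining = dict(task_positions)     # tasks that are still unassigned
    assignment = {uav: [] for uav in positions}
    while remaining:
        # Collect one bid per (UAV, task) pair and keep the cheapest.
        _, uav, task = min(
            (dist(positions[u], remaining[t]), u, t)
            for u in positions
            for t in remaining
        )
        assignment[uav].append(task)
        positions[uav] = remaining.pop(task)  # winner relocates to the task
    return assignment


# Hypothetical 2-D positions in metres.
uavs = {"uav_a": (0.0, 0.0), "uav_b": (10.0, 0.0)}
tasks = {"t1": (1.0, 0.0), "t2": (9.5, 0.0), "t3": (2.0, 0.0)}
print(sequential_auction(uavs, tasks))
```

The greedy rounds keep the sketch short; they do not guarantee a globally optimal allocation, which is one reason bundle-based and contract-net variants exist.
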
Category | References |
---|---|
Infrastructure-based Architecture | Ref |
Flying Ad-hoc Network (FANET) Architecture | Ref |
Category | Example | References |
---|---|---|
Centralized Control | Virtual Structure | Ref |
Centralized Control | Leader-Follower Approaches | Ref |
Decentralized Control | Decentralized Model Prediction Method | Ref |
Distributed Control | Behavior Method | Ref |
Distributed Control | Consistency Method | Ref |
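
As a rough illustration of the leader-follower idea listed above, the sketch below moves a follower toward a slot defined as the leader position plus a fixed offset. Everything in it is an assumption made for illustration: the function name `follower_step`, the 2-D kinematic model, the proportional gain, and the example numbers. It ignores vehicle dynamics, communication delay, and collision avoidance.

```python
def follower_step(follower, leader, offset, gain=0.5):
    """Move the follower a fraction `gain` of the way to its formation slot.

    The slot is the leader position plus a fixed offset (hypothetical
    2-D kinematic model, positions in metres).
    """
    target = (leader[0] + offset[0], leader[1] + offset[1])
    return (
        follower[0] + gain * (target[0] - follower[0]),
        follower[1] + gain * (target[1] - follower[1]),
    )


leader = (0.0, 0.0)
follower = (-4.0, 4.0)
offset = (-2.0, 0.0)  # desired slot: two metres behind the leader
for _ in range(20):
    leader = (leader[0] + 1.0, leader[1])  # leader advances 1 m along +x
    follower = follower_step(follower, leader, offset)
print(leader, follower)
```

With this purely proportional rule the lateral error decays toward zero, while along the direction of travel the follower trails its slot by a roughly constant 1 m (the leader moves 1 m per step and the gain is 0.5). A practical controller would typically add a feed-forward term for the leader velocity.
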
Subcategory | Model Name | Institution / Author |
---|---|---|
General | GPT-3, GPT-3.5, GPT-4 | OpenAI |
General | Claude 2, Claude 3 | Anthropic |
General | Mistral series | Mistral AI |
General | PaLM series, Gemini series | Google Research |
General | LLaMA, LLaMA2, LLaMA3 | Meta AI |
General | Vicuna | Vicuna Team |
General | Qwen series | Qwen Team, Alibaba Group |
General | InternLM | Shanghai AI Laboratory |
General | BuboGPT | Bytedance |
General | ChatGLM | Zhipu AI |
General | DeepSeek series | DeepSeek |
Subcategory | Model Name | Institution / Author |
---|---|---|
General | GPT-4V, GPT-4o, GPT-4o mini, GPT o1-preview | OpenAI |
General | Claude 3 Opus, Claude 3.5 Sonnet | Anthropic |
General | Step-2 | Jieyue Xingchen |
General | LLaVA, LLaVA-1.5, LLaVA-NeXT | Liu et al. |
General | MoE-LLaVA | Lin et al. |
General | LLaVA-CoT | Xu et al. |
General | Flamingo | Alayrac et al. |
General | BLIP | Li et al. |
General | BLIP-2 | Li et al. |
General | InstructBLIP | Dai et al. |
Video Understanding | LLaMA-VID | Li et al. |
Video Understanding | IG-VLM | Kim et al. |
Video Understanding | Video-ChatGPT | Maaz et al. |
Video Understanding | VideoTree | Wang et al. |
Visual Reasoning | X-VLM | Zeng et al. |
Visual Reasoning | Chameleon | Lu et al. |
Visual Reasoning | HYDRA | Ke et al. |
Visual Reasoning | VISPROG | PRIOR @ Allen Institute for AI |
Subcategory | Model Name | Institution / Author |
---|---|---|
General | CLIP | OpenAI |
General | FILIP | Yao et al. |
General | RegionCLIP | Microsoft Research |
General | EVA-CLIP | Sun et al. |
Object Detection | GLIP | Microsoft Research |
Object Detection | DINO | Zhang et al. |
Object Detection | Grounding-DINO | Liu et al. |
Object Detection | DINOv2 | Meta AI Research |
Object Detection | AM-RADIO | NVIDIA |
Object Detection | DINO-WM | Zhou et al. |
Object Detection | YOLO-World | Cheng et al. |
Image Segmentation | CLIPSeg | Lüdecke and Ecker |
Image Segmentation | SAM | Meta AI Research, FAIR |
Image Segmentation | Embodied-SAM | Xu et al. |
Image Segmentation | Point-SAM | Zhou et al. |
Image Segmentation | Open-Vocabulary SAM | Yuan et al. |
Image Segmentation | TAP | Pan et al. |
Image Segmentation | EfficientSAM | Xiong et al. |
Image Segmentation | MobileSAM | Zhang et al. |
Image Segmentation | SAM 2 | Meta AI Research, FAIR |
Image Segmentation | SAMURAI | University of Washington |
Image Segmentation | SegGPT | Wang et al. |
Image Segmentation | Osprey | Yuan et al. |
Image Segmentation | SEEM | Zou et al. |
Image Segmentation | Seal | Liu et al. |
Image Segmentation | LISA | Lai et al. |
Depth Estimation | ZoeDepth | Bhat et al. |
Depth Estimation | ScaleDepth | Zhu et al. |
Depth Estimation | Depth Anything | Yang et al. |
Depth Estimation | Depth Anything V2 | Yang et al. |
Depth Estimation | Depth Pro | Apple |
Name | Year | Types | Amount |
---|---|---|---|
AirFisheye | 2024 | Fisheye image, Depth image, Point cloud, IMU | Over 26,000 fisheye images in total. Data is collected at a rate of 10 frames per second. |
SynDrone | 2023 | Image, Depth image, Point cloud | Contains 72,000 annotation samples, providing 28 types of pixel-level and object-level annotations. |
WildUAV | 2022 | Image, Video, Depth image, Metadata | Mapping images are provided as 24-bit PNG files, with the resolution of 5280x3956. Video images are provided as JPG files at a resolution of 3840x2160. There are 16 possible class labels detailed. |
Name | Year | Types | Amount |
---|---|---|---|
CapERA | 2023 | Video, Text | 2,864 videos, each with 5 descriptions, totaling 14,320 texts. Each video lasts 5 seconds and is captured at 30 frames/second with a resolution of 640 × 640 pixels. |
ERA | 2020 | Video | A total of 2,864 videos covering 25 categories, including disaster events, traffic accidents, and sports competitions. Each video lasts 5 seconds at 24 frames/second. |
VIRAT | 2016 | Video | 25 hours of static ground video and 4 hours of dynamic aerial video. There are 23 event types involved. |
Name | Year | Types | Amount |
---|---|---|---|
WebUAV-3M | 2024 | Video, Text, Audio | 4,500 videos totaling more than 3.3 million frames with 223 target categories, providing natural language and audio descriptions. |
UAVDark135 | 2022 | Video | 135 video sequences with over 125,000 manually annotated frames. |
DUT-VTUAV | 2022 | RGB-T Image | Nearly 1.7 million well-aligned visible-thermal (RGB-T) image pairs in 500 sequences for unveiling the power of RGB-T tracking. It includes 13 sub-classes and 15 scenes across 2 cities. |
TNL2K | 2022 | Video, Infrared video, Text | 2,000 video sequences, comprising 1,244,340 frames and 663 words. |
PRAI-1581 | 2020 | Image | 39,461 images of 1581 person identities. |
VOT-ST2020/VOT-RT2020 | 2020 | Video | 1,000 sequences, each varying in length, with an average length of approximately 100 frames. |
VOT-LT2020 | 2020 | Video | 50 sequences, each with a length of approximately 40,000 frames. |
VOT-RGBT2020 | 2020 | Video, Infrared video | 50 sequences, each with a length of approximately 40,000 frames. |
VOT-RGBD2020 | 2020 | Video, Depth image | 80 sequences with a total of approximately 101,956 frames. |
GOT-10K | 2019 | Image, Video | 420 video clips belonging to 84 object categories and 31 motion categories. |
DTB70 | 2017 | Video | 70 video sequences, each consisting of multiple video frames, with each frame containing an RGB image at a resolution of 1280x720 pixels. |
Stanford Drone | 2016 | Video | 19,000+ target tracks, containing 6 types of targets, about 20,000 target interactions, 40,000 target interactions with the environment, covering 100+ scenes in the university campus. |
COWC | 2016 | Image | 32,716 unique vehicles and 58,247 non-vehicle targets were labeled. Covering 6 different geographical areas. |
Name | Year | Types | Amount |
---|---|---|---|
Aeriform in-action | 2023 | Video | 32 videos, 13 types of action, 55,477 frames, 40,000 callouts. |
MEVA | 2021 | Video, Infrared video, GPS, Point cloud | 9,300 hours of video in total, 144 hours of activity annotations, 37 activity types, over 2.7 million GPS track points. |
UAV-Human | 2021 | Video, Night-vision video, Fisheye video, Depth video, Infrared video, Skeleton | 67,428 videos (155 types of actions, 119 subjects), 22,476 frames of annotated key points (17 key points), 41,290 frames for person re-identification (1,144 identities), 22,263 frames for attribute recognition (such as gender, hat, and backpack). |
MOD20 | 2020 | Video | 20 types of action, 2,324 videos, 503,086 frames. |
NEC-DRONE | 2020 | Video | 5,250 videos containing 256 minutes of action videos involving 19 actors and 16 action categories. |
Drone-Action | 2019 | Video | 240 HD videos, 66,919 frames, 13 types of action. |
UAV-GESTURE | 2019 | Video | 119 videos, 37,151 frames, 13 types of gestures, 10 actors. |
Name | Year | Types | Amount |
---|---|---|---|
CityNav | 2024 | Image, Text | 32,000 natural language descriptions and companion tracks. |
CNER-UAV | 2024 | Text | 12,000 labeled samples containing 5 types of address labels (e.g., building, unit, floor, room). |
AerialVLN | 2023 | Simulator path, Text | 25 city-level scenes, 8,446 paths, 3 natural language descriptions per path, totaling 25,338 instructions. |
DenseUAV | 2023 | Image | Training: 6,768 UAV images, 13,536 satellite images. Test: 2,331 UAV query images, 4,662 satellite images. |
map2seq | 2022 | Image, Text, Map path | 29,641 panoramic images, 7,672 navigation instruction texts. |
VIGOR | 2021 | Image | 90,618 aerial images, 238,696 street panorama images. |
University-1652 | 2020 | Image | 1,652 university buildings, 72 universities, 50,218 training images, 37,855 UAV query images, 701 satellite query images, and 21,099 ordinary & 5,580 street view images. |
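
A few of the counts in the table above can be cross-checked with simple arithmetic. The snippet below is only a sanity check on figures copied from the AerialVLN and DenseUAV rows; the variable names are illustrative, and reading the DenseUAV numbers as a fixed satellite-to-UAV ratio is an interpretation rather than something the table states.

```python
# Figures copied from the AerialVLN and DenseUAV rows of the table above.
aerialvln_paths = 8_446
descriptions_per_path = 3
aerialvln_instructions = aerialvln_paths * descriptions_per_path
print(aerialvln_instructions)  # 25338, matching the stated total

# DenseUAV: (UAV images, satellite images) per split.
denseuav = {"train": (6_768, 13_536), "test": (2_331, 4_662)}
for split, (uav_images, satellite_images) in denseuav.items():
    print(split, satellite_images / uav_images)  # 2.0 for both splits
```
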
Name | Year | Types | Amount |
---|---|---|---|
TrafficNight | 2024 | Image, Infrared Image, Video, Infrared Video, Map | The dataset consists of 2,200 pairs of annotated thermal infrared and sRGB image data, and video data from 7 traffic scenes, with a total duration of approximately 240 minutes. Each scene includes a high-precision map, providing a detailed layout and topological information. |
VisDrone | 2022 | Video, Image | 263 videos, 179,264 frames. 10,209 still images. More than 2,500,000 object instance annotations. The data spans 14 different cities and a wide range of weather and lighting conditions. |
ITCVD | 2020 | Image | A total of 173 aerial images were collected, including 135 in the training set with 23,543 vehicles and 38 in the test set with 5,545 vehicles. There is 60% regional overlap between the images, and there is no overlap between the training set and the test set. |
UAVid | 2020 | Image, Video | 30 videos, 300 images, 8 semantic category annotations. |
AU-AIR | 2020 | Video, GPS, Altitude, IMU, Speed | 32,823 frames of video, 1920x1080 resolution, 30 FPS, divided into 30,000 training validation samples and 2,823 test samples. The total duration of the 8 videos is about 2 hours, with a total of 132,034 instances, distributed in 8 categories. |
iSAID | 2020 | Image | Total images: 2,806. Total number of instances: 655,451. Test set: 935 images (labels not public; used for evaluation on the server). |
CARPK | 2018 | Image | 1,448 images, approx. 89,777 vehicles, providing box annotations. |
highD | 2018 | Video, Trajectory | 16.5 hours, 110,000 vehicles, 5,600 lane changes, 45,000 km, totaling approximately 447 hours of vehicle travel data; 4 predefined driving behavior labels. |
UAVDT | 2018 | Video, Weather, Altitude, Camera angle | 100 videos, about 80,000 frames, 30 frames per second, containing 841,500 target boxes, covering 2,700 targets. |
CADP | 2016 | Video | A total of 5.24 hours, 1,416 traffic accident clips, 205 videos with full spatio-temporal annotations. |
VEDAI | 2016 | Image | 1,210 images (1024 × 1024 and 512 × 512 pixels), 9 types of vehicles, containing about 6,650 targets in total. |
Name | Year | Types | Amount |
---|---|---|---|
RET-3 | 2024 | Image, Text | Approximately 13,000 samples. Including RSICD, RSITMD and UCM. |
DET-10 | 2024 | Image | In the object detection dataset, the number of objects per image ranges from 1 to 70, totaling about 80,000 samples. |
SEG-4 | 2024 | Image | The segmentation dataset covers different regions and resolutions, totaling about 72,000 samples. |
DIOR | 2020 | Image | 23,463 images, containing 192,472 target instances, covering 20 categories, including aircraft, vehicles, ships, bridges, etc., each category contains about 1,200 instances. |
TGRS-HRRSD | 2019 | Image | Total images: 21,761. 13 categories, including aircraft, vehicles, bridges, etc. The total number of targets is approximately 53,000. |
xView | 2018 | Image | More than 1 million objects in 60 categories, including vehicles, buildings, facilities, and boats, grouped into seven parent categories and several sub-categories. |
DOTA | 2018 | Image | 2,806 images, 188,282 targets, 15 categories. |
RSICD | 2018 | Image, Text | 10,921 images, 54,605 descriptive sentences. |
HRSC2016 | 2017 | Image | 3,433 instances, totaling 1,061 images, including 70 pure ocean images and 991 images containing mixed land-sea areas. 2,876 marked vessel targets. 610 unlabeled images. |
RSOD | 2017 | Image | Contains 4 types of targets (tank, aircraft, overpass, playground) with 12,000 positive samples and 48,000 negative samples. |
NWPU-RESISC45 | 2017 | Image | A total of 31,500 images, covering 45 scene categories, 700 images per category, resolution 256 × 256 pixels, spatial resolution from 0.2m to 30m. |
NWPU VHR-10 | 2014 | Image | 800 high-resolution images, of which 650 contain targets and 150 are background images, covering 10 categories (such as aircraft, ships, bridges, etc.), totaling more than 3,000 targets. |
Name | Year | Types | Amount |
---|---|---|---|
WEED-2C | 2024 | Image | Contains 4,129 labeled samples covering 2 weed species. |
CoFly-WeedDB | 2023 | Image, Health data | 201 aerial images showing 3 weed types that disturb a row crop (cotton), with their corresponding annotated images. |
Avo-AirDB | 2022 | Image | 984 high-resolution RGB images (5472 × 3648 pixels), 93 of which have detailed polygonal annotations, divided into 3 to 4 categories (small, medium, large, and background). |
Name | Year | Types | Amount |
---|---|---|---|
InsPLAD | 2023 | Image | 10,607 UAV images containing 17 classes of power assets with a total of 28,933 labeled instances, and defect labels for 5 assets with a total of 402 defect samples classified into 6 defect types. |
UAPD | 2021 | Image | There are 2,401 crack images in the original data and 4,479 crack images after data augmentation. |
Name | Year | Types | Amount |
---|---|---|---|
AFID | 2023 | Image | A total of 816 images with resolutions of 2720 × 1536 and 2560 × 1440. Contains 8 semantic segmentation categories. |
FloodNet | 2021 | Image, Text | The whole dataset has 2,343 images, divided into training (~60%), validation (~20%), and test (~20%) sets. The semantic segmentation labels include: Background, Building Flooded, Building Non-Flooded, Road Flooded, Road Non-Flooded, Water, Tree, Vehicle, Pool, Grass. |
Aerial SAR | 2020 | Image | 2,000 images with 30,000 action instances covering multiple human behaviors. |
Name | Year | Types | Amount |
---|---|---|---|
MOCO | 2024 | Image, Text | 7,449 images, 37,245 captions. |
Name | Year | Types | Amount |
---|---|---|---|
WAID | 2023 | Image | 14,375 UAV images covering 6 species of wildlife and multiple environment types. |
Name | Year | Types | Amount |
---|---|---|---|
DroneRFa | 2024 | RF signal | It includes 24 types of UAV signals (9 types of outdoor acquisition and 15 types of indoor acquisition) and 1 type of background signals, covering 3 ISM frequency bands. |
IDTDSAT | 2019 | Infrared image, Trajectory | Infrared image sequence of 22 segments, total number of frames 16,177, total number of targets 16,944, 30 tracks; image resolution 256 × 256 pixels. |
DTDAOTRES | 2019 | Radar | 15 segments of 8.76 GB. |
Name | Publication |
---|---|
AirSim | AirSim: High-fidelity visual and physical simulation for autonomous vehicles |
CARLA | CARLA: An open urban driving simulator |
NVIDIA Isaac Sim | |
AerialVLN Simulator | AerialVLN: Vision-and-language navigation for UAVs |
Embodied City | EmbodiedCity: A Benchmark Platform for Embodied Agent in Real-world City Environment |
Title | Type | Publication | Code |
---|---|---|---|
Li et al. (A Benchmark for UAV-View Natural Language-Guided Tracking) | VFM | MDPI | GitHub |
Ma et al. (Applying Unsupervised Semantic Segmentation to High-Resolution UAV Imagery for Enhanced Road Scene Parsing) | VFM | Arxiv | - |
Limberg et al. (Leveraging YOLO-World and GPT-4V LMMs for Zero-Shot Person Detection and Action Recognition in Drone Imagery) | VFM+VLM | Arxiv | - |
Kim et al. (Weather-Aware Drone-View Object Detection Via Environmental Context Understanding) | VLM+VFM | ICIP 2024 | - |
LGNet (Shooting condition insensitive unmanned aerial vehicle object detection) | VFM | Expert Systems with Applications | - |
Sakaino et al. (Dynamic Texts From UAV Perspective Natural Images) | VLM+VFM | ICCV 2023 | - |
COMRP (Unsupervised semantic segmentation of high-resolution UAV imagery for road scene parsing) | VFM | Arxiv | GitHub |
CrossEarth (CrossEarth: Geospatial Vision Foundation Model for Domain Generalizable Remote Sensing Semantic Segmentation) | VFM | Arxiv | GitHub |
TanDepth (TanDepth: Leveraging Global DEMs for Metric Monocular Depth Estimation in UAVs) | VFM | Arxiv | GitHub |
DroneGPT (DroneGPT: Zero-shot Video Question Answering For Drones) | VLM+LLM+VFM | CVDL 2024 | - |
de Zarzà et al. (Socratic video understanding on unmanned aerial vehicles) | LLM | Procedia Computer Science | - |
AeroAgent (Agent as Cerebrum, Controller as Cerebellum: Implementing an Embodied LMM-based Agent on Drones) | VLM | Arxiv | - |
RS-LLaVA (RS-LLaVA: A large vision-language model for joint captioning and question answering in remote sensing imagery) | VLM | MDPI | - |
GeoRSCLIP (RS5M and GeoRSCLIP: A large scale vision-language dataset and a large vision-language model for remote sensing) | VFM | IEEE Transactions on Geoscience and Remote Sensing | GitHub |
SkyEyeGPT (SkyEyeGPT: Unifying remote sensing vision-language tasks via instruction tuning with large language model) | VFM+LLM | Arxiv | GitHub |
Title | Type | Publication | Code |
---|---|---|---|
NaVid (NaVid: Video-based VLM plans the next step for vision-and-language navigation) | VFM+LLM | Arxiv | - |
VLN-MP (Why Only Text: Empowering Vision-and-Language Navigation with Multi-modal Prompts) | VFM | Arxiv | GitHub |
Gao et al. (Aerial Vision-and-Language Navigation via Semantic-Topo-Metric Representation Guided LLM Reasoning) | VFM+LLM | Arxiv | - |
MGP (CityNav: Language-Goal Aerial Navigation Dataset with Geographic Information) | LLM+VFM | Arxiv | GitHub |
UAV Navigation LLM (Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology) | LLM+VFM | Arxiv | GitHub |
GOMAA-Geo (GOMAA-Geo: GOal Modality Agnostic Active Geo-localization) | LLM+VFM | Arxiv | GitHub |
NavAgent (NavAgent: Multi-scale Urban Street View Fusion For UAV Embodied Vision-and-Language Navigation) | LLM+VFM+VLM | Arxiv | - |
ASMA (ASMA: An Adaptive Safety Margin Algorithm for Vision-Language Drone Navigation via Scene-Aware Control Barrier Functions) | LLM+VFM | Arxiv | - |
Zhang et al. (Demo Abstract: Embodied Aerial Agent for City-level Visual Language Navigation Using Large Language Model) | VFM+LLM | IPSN 2024 | - |
Chen et al. (Vision-Language Navigation for Quadcopters with Conditional Transformer and Prompt-based Text Rephraser) | LLM | MMAsia 2023 | - |
CloudTrack (CloudTrack: Scalable UAV Tracking with Cloud Semantics) | VFM+VLM | Arxiv | - |
NEUSIS (NEUSIS: A Compositional Neuro-Symbolic Framework for Autonomous Perception, Reasoning, and Planning in Complex UAV Search Missions) | VFM+VLM | Arxiv | - |
Say-REAPEx (Say-REAPEx: An LLM-Modulo UAV Online Planning Framework for Search and Rescue) | LLM | Openreview | - |
Title | Type | Publication | Code |
---|---|---|---|
TypeFly (TypeFly: Flying drones with large language model) | LLM | Arxiv | - |
SPINE (SPINE: Online Semantic Planning for Missions with Incomplete Natural Language Specifications in Unstructured Environments) | LLM+VFM+VLM | Arxiv | - |
LEVIOSA (LEVIOSA: Natural Language-Based Uncrewed Aerial Vehicle Trajectory Generation) | LLM | MDPI | GitHub |
TPML (TPML: Task Planning for Multi-UAV System with Large Language Models) | LLM | ICCA 2023 | - |
REAL (REAL: Resilience and adaptation using large language models on autonomous aerial robots) | LLM | Arxiv | - |
Liu et al. (Multi-Agent Formation Control Using Large Language Models) | LLM | Techrxiv | - |
Title | Type | Publication | Code |
---|---|---|---|
PromptCraft (ChatGPT for robotics: Design principles and model abilities) | LLM | IEEE Access | GitHub |
Zhong et al. (A safer vision-based autonomous planning system for quadrotor UAVs with dynamic obstacle trajectory prediction and its application with LLMs) | LLM | WACV 2024 | - |
Tazir et al. (From words to flight: Integrating OpenAI ChatGPT with PX4/Gazebo for natural language-based drone control) | LLM | WCSE 2023 | - |
Phadke et al. (Integrating Large Language Models for UAV Control in Simulated Environments: A Modular Interaction Approach) | LLM | Arxiv | - |
EAI-SIM (EAI-SIM: An Open-Source Embodied AI Simulation Framework with Large Language Models) | LLM | ICCA 2024 | GitHub |
TAIiST (TAIiST CPS-UAV at the SBFT Tool Competition 2024) | LLM | SBFT 2024 | GitHub |
Swarm-GPT (Swarm-GPT: Combining large language models with safe motion planning for robot choreography design) | LLM | Arxiv | - |
FlockGPT (FlockGPT: Guiding UAV Flocking with Linguistic Orchestration) | LLM | Arxiv | - |
CLIPSwarm (CLIPSwarm: Generating Drone Shows from Text Prompts with Vision-Language Models) | VFM | Arxiv | - |
Title | Type | Publication | Code |
---|---|---|---|
DTLLM-VLT (DTLLM-VLT: Diverse Text Generation for Visual Language Tracking Based on LLM) | VFM+LLM | CVPR 2024 | - |
Yao et al. (Can LLM substitute human labeling? A case study of fine-grained Chinese address entity recognition dataset for UAV delivery) | LLM | Companion Proceedings of the ACM Web Conference 2024 | GitHub |
GPG2A (Cross-View Meets Diffusion: Aerial Image Synthesis with Geometry and Text Guidance) | LLM | Arxiv | GitLab |
AeroVerse (AeroVerse: UAV-Agent Benchmark Suite for Simulating, Pre-training, Finetuning, and Evaluating Aerospace Embodied World Models) | VLM+LLM | Arxiv | - |
Tang et al. (Defining and Evaluating Physical Safety for Large Language Models) | LLM | Arxiv | Hugging Face |
Xu et al. (Emergency Networking Using UAVs: A Reinforcement Learning Approach with Large Language Model) | LLM | IPSN 2024 | - |
LLM-RS (Real-time Integration of Fine-tuned Large Language Model for Improved Decision-Making in Reinforcement Learning) | LLM | IJCNN 2024 | - |
Pineli et al. (Evaluating Voice Command Pipelines for Drone Control: From STT and LLM to Direct Classification and Siamese Networks) | LLM | Arxiv | - |
We want to thank the following contributors for creating, maintaining, and curating the tables in this repository:
- Yonglin Tian
- Fei Lin
- Yiduo Li
- Tengchao Zhang
- Xuan Fu
If you have any questions about this repository, feel free to get in touch with Yonglin Tian 📧 or Fei Lin 📧.
(If you would like to contribute to this repo, please open an Issue or Pull Request.)
If you find this repository useful, please consider citing this paper:
@misc{tian2025uavsmeetllmsoverviews,
title={UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility},
author={Yonglin Tian and Fei Lin and Yiduo Li and Tengchao Zhang and Qiyao Zhang and Xuan Fu and Jun Huang and Xingyuan Dai and Yutong Wang and Chunwei Tian and Bai Li and Yisheng Lv and Levente Kovács and Fei-Yue Wang},
year={2025},
eprint={2501.02341},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2501.02341},
}
This project is licensed under the MIT License.