
VIALM

This repository offers a survey and benchmark for Visual Impairment Assistance using Large Models (VIALM).

1. Task Illustration

The following figure illustrates a sample input and output of VIALM. The input pairs a visual display of the environment (the image on the left) with a user request in natural language (the grey box). The yellow box shows the output: guidance for VI users to complete the request within the environment (the image on the right). The output should be grounded and fine-grained so that VI users can follow it easily.

(Figure: VIALM task illustration)
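As a minimal sketch, the input/output format can be thought of as the following Python structure. The class, field names, and example content below are illustrative assumptions, not taken from the repository:

```python
from dataclasses import dataclass

@dataclass
class VIALMExample:
    """One VIALM sample: an environment image and a user request,
    paired with grounded, step-by-step guidance as the target output."""
    image_path: str    # visual display of the environment
    user_request: str  # request in natural language
    guidance: str      # grounded, fine-grained instructions a VI user can follow

# Hypothetical example for illustration only.
sample = VIALMExample(
    image_path="benchmark/environment_images/supermarket_example.jpg",
    user_request="I want to buy a bottle of milk.",
    guidance="Walk forward about three steps; the dairy shelf is on your right at waist height.",
)
```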

2. Paper Collections

For the survey part, we have collected a list of related papers covering LLMs, VLMs, and embodied agents.

2.1 Large Language Models (LLMs)

  • Tom B. Brown, Benjamin Mann, and et al. Language models are few-shot learners. In NeurIPS, 2020. link
  • Nan Du, Yanping Huang, and et al. GLaM: Efficient scaling of language models with mixture-of-experts. In ICML, 2022. link
  • Jack W. Rae, Sebastian Borgeaud, and et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv, 2112.11446, 2021. link
  • Romal Thoppilan, Daniel De Freitas, and et al. LaMDA: Language models for dialog applications. arXiv, 2201.08239, 2022. link
  • Aakanksha Chowdhery, Sharan Narang, and et al. PaLM: Scaling language modeling with pathways. J. Mach. Learn. Res., 2023. link
  • Jordan Hoffmann, Sebastian Borgeaud, and et al. Training compute-optimal large language models. arXiv, 2203.15556, 2022. link
  • Susan Zhang, Stephen Roller, and et al. OPT: open pre-trained transformer language models. arXiv, 2205.01068, 2022. link
  • Aohan Zeng, Xiao Liu, and et al. GLM-130B: an open bilingual pre-trained model. In ICLR, 2023. link
  • Teven Le Scao, Angela Fan, and et al. BLOOM: A 176B-parameter open-access multilingual language model. arXiv, 2211.05100, 2022. link
  • Ross Taylor, Marcin Kardas, and et al. Galactica: A large language model for science. arXiv, 2211.09085, 2022. link
  • Hugo Touvron, Thibaut Lavril, and et al. LLaMA: Open and efficient foundation language models. arXiv, 2302.13971, 2023. link
  • Bo Peng, Eric Alcaide, and et al. RWKV: Reinventing RNNs for the transformer era. In EMNLP (Findings), 2023. link
  • Rohan Anil, Andrew M. Dai, and et al. PaLM 2 technical report. arXiv, 2305.10403, 2023. link
  • Hugo Touvron, Louis Martin, and et al. Llama 2: Open foundation and fine-tuned chat models. arXiv, 2307.09288, 2023. link
  • Ebtesam Almazrouei, Hamza Alobeidli, and et al. The Falcon series of open language models. arXiv, 2311.16867, 2023. link

2.2 Visual-Language Models (VLMs)

  • Jean-Baptiste Alayrac, Jeff Donahue, and et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022. link
  • Xi Chen, Xiao Wang, and et al. PaLI: A jointly-scaled multilingual language-image model. In ICLR, 2023. link
  • Junnan Li, Dongxu Li, and et al. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023. link
  • OpenAI. GPT-4 technical report. arXiv, 2303.08774, 2023. link
  • Haotian Liu, Chunyuan Li, and et al. Visual instruction tuning. arXiv, 2304.08485, 2023. link
  • Deyao Zhu, Jun Chen, and et al. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv, 2304.10592, 2023. link
  • Xi Chen, Josip Djolonga, and et al. PaLI-X: On scaling up a multilingual vision and language model. arXiv, 2305.18565, 2023. link
  • Wenliang Dai, Junnan Li, and et al. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv, 2305.06500, 2023. link
  • Wenhai Wang, Zhe Chen, and et al. VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks. arXiv, 2305.11175, 2023. link
  • Wenbo Hu, Yifan Xu, and et al. BLIVA: A simple multimodal LLM for better handling of text-rich visual questions. arXiv, 2308.09936, 2023. link
  • Jinze Bai, Shuai Bai, and et al. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv, 2308.12966, 2023. link
  • Jun Chen, Deyao Zhu, and et al. MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv, 2310.09478, 2023. link
  • Weihan Wang, Qingsong Lv, and et al. CogVLM: Visual expert for pretrained language models. arXiv, 2311.03079, 2023. link

2.3 Embodied Agents

  • Wenlong Huang, Pieter Abbeel, and et al. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In ICML, 2022. link
  • Brian Ichter, Anthony Brohan, and et al. Do as I can, not as I say: Grounding language in robotic affordances. In CoRL, 2022. link
  • Jacky Liang, Wenlong Huang, and et al. Code as Policies: Language model programs for embodied control. In ICRA, 2023. link
  • Wenlong Huang, Fei Xia, and et al. Inner monologue: Embodied reasoning through planning with language models. In CoRL, 2022. link
  • Ishika Singh, Valts Blukis, and et al. ProgPrompt: Generating situated robot task plans using large language models. In ICRA, 2023. link
  • Chan Hee Song, Jiaman Wu, and et al. LLM-Planner: Few-shot grounded planning for embodied agents with large language models. arXiv, 2212.04088, 2022. link
  • Xufeng Zhao, Mengdi Li, and et al. Chat with the environment: Interactive multimodal perception using large language models. arXiv, 2303.08268, 2023. link
  • Danny Driess, Fei Xia, and et al. PaLM-E: An embodied multimodal language model. In ICML, 2023. link
  • Wenlong Huang, Fei Xia, and et al. Grounded decoding: Guiding text generation with grounded models for robot control. arXiv, 2303.00855, 2023. link
  • Guanzhi Wang, Yuqi Xie, and et al. Voyager: An open-ended embodied agent with large language models. arXiv, 2305.16291, 2023. link
  • Xizhou Zhu, Yuntao Chen, and et al. Ghost in the Minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv, 2305.17144, 2023. link
  • Anthony Brohan, Noah Brown, and et al. RT-2: vision-language-action models transfer web knowledge to robotic control. arXiv, 2307.15818, 2023. link
  • Yingdong Hu, Fanqi Lin, and et al. Look Before You Leap: Unveiling the power of GPT-4V in robotic vision-language planning. arXiv, 2311.17842, 2023. link

3. Benchmark Evaluation

Our benchmark covers two common daily-life environments: domestic home and supermarket.

(Example figure: Benchmark)

3.1 Benchmark Annotations

The annotated evaluation dataset can be found at ./benchmark/annotations. The input environment images can be found at ./benchmark/environment_images.
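A rough sketch of loading these files is shown below; it assumes the annotations are stored as JSON, which may differ from the actual release format:

```python
import json
from pathlib import Path

ANNOTATION_DIR = Path("benchmark/annotations")
IMAGE_DIR = Path("benchmark/environment_images")

# Read every annotation file (assumed to be JSON here).
for ann_file in sorted(ANNOTATION_DIR.glob("*.json")):
    with ann_file.open(encoding="utf-8") as f:
        annotation = json.load(f)
    print(ann_file.name, "->", type(annotation).__name__)

# List the environment images that serve as model inputs.
for img_file in sorted(IMAGE_DIR.iterdir()):
    print(img_file.name)
```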

3.2 LM Predictions

The predictions made by the six evaluated large models (LMs) can be found at ./benchmark/lm_predictions.

(Prediction example figure: Case)
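The sketch below shows one way to pair predictions with annotations for manual inspection. It assumes each model has its own subdirectory under ./benchmark/lm_predictions and that prediction files share names with annotation files; both are unverified assumptions about the repository layout:

```python
from pathlib import Path

PRED_DIR = Path("benchmark/lm_predictions")
ANNOTATION_DIR = Path("benchmark/annotations")

# Walk each model's prediction folder (assumed layout: one subdirectory per model)
# and look for an annotation file with the same name.
for model_dir in sorted(p for p in PRED_DIR.iterdir() if p.is_dir()):
    for pred_file in sorted(model_dir.iterdir()):
        ann_file = ANNOTATION_DIR / pred_file.name
        status = "annotation found" if ann_file.exists() else "no matching annotation"
        print(f"{model_dir.name}/{pred_file.name}: {status}")
```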
