This repository offers a survey and benchmark for Visual Impairment Assistance using Large Models (VIALM).
The following figure illustrates a sample input and output of VIALM.
The input pairs a visual display of the environment (the image on the left) with a user request in natural language (the grey box).
The yellow box shows the output: guidance that helps visually impaired (VI) users complete the request within the environment (the image on the right).
The output should be grounded and fine-grained so that VI users can follow it easily.
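As a rough illustration of the task format (the field names and example values below are hypothetical, not the benchmark's actual schema), each VIALM instance pairs an environment image with a user request and targets grounded guidance as output:

```python
from dataclasses import dataclass

@dataclass
class VIALMExample:
    """One VIALM instance: an environment image paired with a user request."""
    image_path: str    # visual display of the environment
    user_request: str  # request in natural language from the VI user
    guidance: str      # target output: grounded, fine-grained guidance

# Hypothetical example for illustration only.
example = VIALMExample(
    image_path="environment_images/supermarket_001.jpg",
    user_request="I want to season my steak.",
    guidance="Move one step to your right; the pepper is on the second shelf at waist height.",
)
```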
For the survey part, we have collected a list of related papers covering LLMs, VLMs, and embodied agents.
- Tom B. Brown, Benjamin Mann, and et al. Language models are few-shot learners. In NeurIPS, 2020. link
- Nan Du, Yanping Huang, and et al. GLaM: Efficient scaling of language models with mixture-of-experts. In ICML, 2022. link
- Jack W. Rae, Sebastian Borgeaud, and et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv, 2112.11446, 2021. link
- Romal Thoppilan, Daniel De Freitas, and et al. LaMDA: Language models for dialog applications. arXiv, 2201.08239, 2022. link
- Aakanksha Chowdhery, Sharan Narang, and et al. PaLM: Scaling language modeling with pathways. J. Mach. Learn. Res., 2023. link
- Jordan Hoffmann, Sebastian Borgeaud, and et al. Training compute-optimal large language models. arXiv, 2203.15556, 2022. link
- Susan Zhang, Stephen Roller, and et al. OPT: open pre-trained transformer language models. arXiv, 2205.01068, 2022. link
- Aohan Zeng, Xiao Liu, and et al. GLM-130B: an open bilingual pre-trained model. In ICLR, 2023. link
- Teven Le Scao, Angela Fan, and et al. BLOOM: A 176B-parameter open-access multilingual language model. arXiv, 2211.05100, 2022. link
- Ross Taylor, Marcin Kardas, and et al. Galactica: A large language model for science. arXiv, 2211.09085, 2022. link
- Hugo Touvron, Thibaut Lavril, and et al. LLaMA: Open and efficient foundation language models. arXiv, 2302.13971, 2023. link
- Bo Peng, Eric Alcaide, and et al. RWKV: Reinventing RNNs for the transformer era. In EMNLP (Findings), 2023. link
- Rohan Anil, Andrew M. Dai, and et al. PaLM 2 technical report. arXiv, 2305.10403, 2023. link
- Hugo Touvron, Louis Martin, and et al. Llama 2: Open foundation and fine-tuned chat models. arXiv, 2307.09288, 2023. link
- Ebtesam Almazrouei, Hamza Alobeidli, and et al. The falcon series of open language models. arXiv, 2311.16867, 2023. link
- Jean-Baptiste Alayrac, Jeff Donahue, and et al. Flamingo: A visual language model for few-shot learning. In NeurIPS, 2022. link
- Xi Chen, Xiao Wang, and et al. PaLI: A jointly-scaled multilingual language-image model. In ICLR, 2023. link
- Junnan Li, Dongxu Li, and et al. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023. link
- OpenAI. GPT-4 technical report. arXiv, 2303.08774, 2023. link
- Haotian Liu, Chunyuan Li, and et al. Visual instruction tuning. arXiv, 2304.08485, 2023. link
- Deyao Zhu, Jun Chen, and et al. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv, 2304.10592, 2023. link
- Xi Chen, Josip Djolonga, and et al. PaLI-X: On scaling up a multilingual vision and language model. arXiv, 2305.18565, 2023. link
- Wenliang Dai, Junnan Li, and et al. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv, 2305.06500, 2023. link
- Wenhai Wang, Zhe Chen, and et al. VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks. arXiv, 2305.11175, 2023. link
- Wenbo Hu, Yifan Xu, and et al. BLIVA: A simple multimodal LLM for better handling of text-rich visual questions. arXiv, 2308.09936, 2023. link
- Jinze Bai, Shuai Bai, and et al. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv, 2308.12966, 2023. link
- Jun Chen, Deyao Zhu, and et al. MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv, 2310.09478, 2023. link
- Weihan Wang, Qingsong Lv, and et al. CogVLM: Visual expert for pretrained language models. arXiv, 2311.03079, 2023. link
- Wenlong Huang, Pieter Abbeel, and et al. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In ICML, 2022. link
- Brian Ichter, Anthony Brohan, and et al. Do as I can, not as I say: Grounding language in robotic affordances. In CoRL, 2022. link
- Jacky Liang, Wenlong Huang, and et al. Code as Policies: Language model programs for embodied control. In ICRA, 2023. link
- Wenlong Huang, Fei Xia, and et al. Inner monologue: Embodied reasoning through planning with language models. In CoRL, 2022. link
- Ishika Singh, Valts Blukis, and et al. ProgPrompt: Generating situated robot task plans using large language models. In ICRA, 2023. link
- Chan Hee Song, Jiaman Wu, and et al. LLM-Planner: Few-shot grounded planning for embodied agents with large language models. arXiv, 2212.04088, 2022. link
- Xufeng Zhao, Mengdi Li, and et al. Chat with the environment: Interactive multimodal perception using large language models. arXiv, 2303.08268, 2023. link
- Danny Driess, Fei Xia, and et al. PaLM-E: An embodied multimodal language model. In ICML, 2023. link
- Wenlong Huang, Fei Xia, and et al. Grounded decoding: Guiding text generation with grounded models for robot control. arXiv, 2303.00855, 2023. link
- Guanzhi Wang, Yuqi Xie, and et al. Voyager: An open-ended embodied agent with large language models. arXiv, 2305.16291, 2023. link
- Xizhou Zhu, Yuntao Chen, and et al. Ghost in the Minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv, 2305.17144, 2023. link
- Anthony Brohan, Noah Brown, and et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv, 2307.15818, 2023. link
- Yingdong Hu, Fanqi Lin, and et al. Look Before You Leap: Unveiling the power of GPT-4V in robotic vision-language planning. arXiv, 2311.17842, 2023. link
Our benchmark covers two common daily-life environments: a domestic home and a supermarket.
The annotated evaluation dataset can be found at ./benchmark/annotations.
The input environment images can be found at ./benchmark/environment_images.
The predictions made by the six large models (LMs) can be found at ./benchmark/lm_predictions.
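A minimal sketch of how the benchmark files could be loaded, assuming JSON annotations and predictions (the file names and formats below are assumptions; check the files under ./benchmark/ for the actual layout):

```python
import json
from pathlib import Path

BENCHMARK_DIR = Path("./benchmark")

# Hypothetical file names; adjust to the actual files in this repository.
annotations = json.loads((BENCHMARK_DIR / "annotations" / "annotations.json").read_text())
predictions = json.loads((BENCHMARK_DIR / "lm_predictions" / "gpt4.json").read_text())

image_dir = BENCHMARK_DIR / "environment_images"
print(f"{len(list(image_dir.glob('*.jpg')))} environment images found")  # assumes .jpg files
```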
The six evaluated LMs are:
- GPT-4 link
- CogVLM Repository
- MiniGPT Repository
- Qwen-VL Repository
- LLaVA Repository
- BLIVA Repository
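For reference, here is a hedged sketch of querying one of the evaluated models (GPT-4 with vision) for VIALM-style guidance via the OpenAI Python SDK; the model name, prompt wording, and file path are illustrative and not the exact settings used to produce the released predictions:

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def vialm_guidance(image_path: str, user_request: str) -> str:
    """Ask a vision-capable model for grounded guidance given an environment image and a request."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model; substitute the variant you want to evaluate
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"You are assisting a visually impaired user. {user_request} "
                         "Give step-by-step, grounded guidance based on the image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Hypothetical usage:
# print(vialm_guidance("./benchmark/environment_images/supermarket_001.jpg", "Where is the milk?"))
```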