Subscribe to the Responsible AI newsletter for weekly updates on new papers and more.
Welcome to the Responsible AI Paper Summaries repository! Here, you'll find concise summaries of key papers in various areas of responsible AI.
This repository provides brief summaries of AI/ML papers in the following areas:
- Explainability and Interpretability
- Fairness and Biases
- Privacy
- Security
- Safety
- Accountability
- Sustainability
- Human Control and Interaction
- Legal and Ethical Guidelines
- Sabotage Evaluations of Frontier Models - Anthropic Research Report, 2024. This paper presents risk evaluations focused on sabotage capabilities in advanced AI models, assessing whether they can manipulate human decisions without detection.
- An Adversarial Perspective on Machine Unlearning for AI Safety - arXiv, September 2024. This paper examines the robustness of machine unlearning methods designed to remove hazardous knowledge from large language models, arguing that these methods may be ineffective under adversarial scrutiny.
- Taxonomy of Risks Posed by Language Models - FAccT ’22. This paper develops a comprehensive taxonomy of ethical and social risks associated with large-scale language models (LMs). It identifies twenty-one risks and categorizes them into six risk areas to guide responsible innovation and mitigation strategies.
Explainability and Interpretability
- A Survey on Knowledge Graphs: Representation, Acquisition, and Applications - IEEE Transactions on Neural Networks and Learning Systems, 2021. A comprehensive review of knowledge graph representation learning, acquisition methods, and applications.
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models - NeurIPS 2022. This paper demonstrates how chain-of-thought (CoT) prompting significantly enhances the reasoning abilities of large language models (LLMs).
- A Nutritional Label for Rankings - SIGMOD ’18. Provides a web-based application called Ranking Facts that generates a "nutritional label" for rankings to enhance transparency, fairness, and stability.
- Graph of Thoughts: Solving Elaborate Problems with Large Language Models - arXiv, 2024. This paper introduces the Graph of Thoughts (GoT) framework, enhancing the reasoning capabilities of large language models by structuring their thought processes as directed graphs.
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models - NeurIPS 2023. The paper introduces the Tree of Thoughts (ToT) framework, enhancing the problem-solving abilities of large language models by enabling exploration and evaluation of multiple reasoning paths.
- A Unified Approach to Interpreting Model Predictions - NIPS 2017. Introduces SHAP (SHapley Additive exPlanations), a unified framework for interpreting model predictions by assigning each feature an importance value for a particular prediction, integrating six existing methods into a single, cohesive approach; a toy Shapley-value sketch follows this list.
- Sparse Autoencoders Find Highly Interpretable Features in Language Models - arXiv, 2023. This paper uses sparse autoencoders to extract interpretable features from language models, addressing polysemanticity in neural networks.
- "Why Should I Trust You?": Explaining the Predictions of Any Classifier - KDD 2016. This paper introduces LIME (Local Interpretable Model-agnostic Explanations), a technique to explain the predictions of any classifier in a faithful and interpretable manner by learning an interpretable model locally around the prediction.
- Understanding the Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation - ICML 2024. The paper investigates how pre-trained language models (LMs) perform reasoning tasks by aggregating indirect reasoning paths seen during pre-training.
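Both the SHAP and LIME entries above rest on the same idea of per-prediction feature attribution. As a rough, self-contained illustration (not the papers' own implementation), the sketch below computes exact Shapley values for an invented linear `toy_model` by averaging each feature's marginal contribution over all subsets of the remaining features; `toy_model` and the baseline are assumptions made up for the example.

```python
from itertools import combinations
from math import factorial

def toy_model(x):
    # Made-up scoring function standing in for any black-box predictor.
    return 3.0 * x[0] + 2.0 * x[1] - 1.0 * x[2]

def shapley_values(model, x, baseline):
    """Exact Shapley attributions: average each feature's marginal contribution
    over all subsets, with 'absent' features replaced by baseline values."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = [x[j] if (j in subset or j == i) else baseline[j] for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                phi[i] += weight * (model(with_i) - model(without_i))
    return phi

print(shapley_values(toy_model, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]))
# For this linear toy model the attributions are simply coefficient * (x - baseline):
# [3.0, 4.0, -3.0]
```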
Fairness and Biases
- Benchmarking Cognitive Biases in Large Language Models as Evaluators - arXiv, 2023. This paper introduces COBBLER (COGNITIVE BIAS BENCHMARK FOR LLMS AS EVALUATORS), a benchmark for evaluating cognitive biases in LLMs used as evaluators; a minimal order-bias check is sketched after this list.
- Taxonomy of Risks Posed by Language Models - FAccT ’22. This paper develops a comprehensive taxonomy of ethical and social risks associated with large-scale language models (LMs). It identifies twenty-one risks and categorizes them into six risk areas to guide responsible innovation and mitigation strategies.
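One of the biases that COBBLER-style evaluations probe is order bias: an LLM judge favoring whichever response appears first. A minimal way to test for it is to ask the judge the same question with the candidates in both orders and count how often the verdict flips. The `llm_judge` callable below is hypothetical (not an API from the paper) and is assumed to return "A" or "B".

```python
from typing import Callable, List, Tuple

def order_bias_rate(llm_judge: Callable[[str, str, str], str],
                    examples: List[Tuple[str, str, str]]) -> float:
    """Fraction of (question, response_a, response_b) examples where the judge's
    preferred response changes when the two candidates are swapped."""
    flips = 0
    for question, resp_a, resp_b in examples:
        first = llm_judge(question, resp_a, resp_b)   # judge sees resp_a first
        second = llm_judge(question, resp_b, resp_a)  # same pair, order swapped
        # A consistent judge picks the same underlying response both times.
        consistent = (first, second) in {("A", "B"), ("B", "A")}
        flips += 0 if consistent else 1
    return flips / len(examples)

# Hypothetical usage with a trivially order-biased judge that always picks "A":
print(order_bias_rate(lambda q, a, b: "A", [("q1", "x", "y"), ("q2", "u", "v")]))  # 1.0
```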
Privacy
Security
Safety
- A Grading Rubric for AI Safety Frameworks - arXiv, September 2024. This paper proposes a comprehensive grading rubric for evaluating AI safety frameworks. It introduces seven evaluation criteria with 21 indicators, a six-tier grading system, and three methods for applying the rubric. The goal is to enable nuanced comparisons between frameworks, identify areas for improvement, and promote responsible AI development.
- Sabotage Evaluations of Frontier Models - Anthropic Research Report, 2024. This paper presents risk evaluations focused on sabotage capabilities in advanced AI models, assessing whether they can manipulate human decisions without detection.
- An Adversarial Perspective on Machine Unlearning for AI Safety - arXiv, September 2024. This paper examines the robustness of machine unlearning methods designed to remove hazardous knowledge from large language models, arguing that these methods may be ineffective under adversarial scrutiny.
- REFCHECKER: Reference-based Fine-grained Hallucination Checker and Benchmark for Large Language Models - arXiv, 2024. REFCHECKER is a framework designed to detect fine-grained hallucinations in large language model responses using a novel claim-triplet analysis.
- Taxonomy of Risks Posed by Language Models - FAccT ’22. This paper develops a comprehensive taxonomy of ethical and social risks associated with large-scale language models (LMs). It identifies twenty-one risks and categorizes them into six risk areas to guide responsible innovation and mitigation strategies.
- LLMs’ Classification Performance is Overclaimed - arXiv, 2024. The paper reveals the limitations of LLMs in classification tasks when gold labels are absent. It provides a new testbed for evaluating LLMs' human-level discrimination intelligence and proposes a framework for future research to enhance LLMs' robustness and reliability.
- A Comprehensive Study of the Capabilities of Large Language Models for Vulnerability Detection - . This paper investigates the effectiveness of Large Language Models (LLMs) in detecting software vulnerabilities. It evaluates 11 state-of-the-art LLMs using a variety of prompt designs and presents insights into the models' limitations in understanding code structures and reasoning about vulnerabilities.
- CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models - arXiv, 2024. CARES evaluates the trustworthiness of medical vision language models (Med-LVLMs) across five dimensions: trustfulness, fairness, safety, privacy, and robustness.
- Graph Retrieval-Augmented Generation: A Survey - arXiv, 2024. The paper surveys GraphRAG, a framework that enhances traditional RAG by incorporating graph-based retrieval for improved knowledge representation and generation.
- SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors - arXiv, 2024. This paper introduces SORRY-Bench, a benchmark for evaluating LLM safety refusal behaviors.
- Detecting Hallucinations in Large Language Models Using Semantic Entropy - Nature, 2024. This paper proposes a method to detect hallucinations in large language models using semantic entropy; a minimal sketch of the idea follows this list.
- To Believe or Not to Believe Your LLM - arXiv, 2024. This paper explores uncertainty quantification in LLMs to detect hallucinations by distinguishing epistemic from aleatoric uncertainties using an information-theoretic metric.
- Air Gap: Protecting Privacy-Conscious Conversational Agents - arXiv, 2024. This paper from Google proposes AirGapAgent to prevent data leakage from LLMs, ensuring privacy in adversarial contexts.
- The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions - arXiv, 2024. This paper from OpenAI introduces an instruction hierarchy to train LLMs to prioritize privileged instructions (system messages) over lower-level ones (user messages and third-party inputs), enhancing their robustness against adversaries.
- Characterizing Bugs in Python and R Data Analytics Programs - . This paper provides a comprehensive study of bugs in Python and R data analytics programs. It uses data from Stack Overflow posts, GitHub bug fix commits, and issues in popular libraries to explore common bug types, root causes, and effects. The study also provides a dataset of manually verified bugs.
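Several Safety entries above involve detecting hallucinations through sampling and uncertainty. The sketch below loosely follows the semantic-entropy recipe (sample several answers to the same question, group them into meaning-equivalent clusters, then compute entropy over cluster frequencies). The `semantically_equivalent` check is a naive stand-in for the paper's entailment-based clustering, and the sampled `answers` list is invented for the example.

```python
import math

def semantically_equivalent(a: str, b: str) -> bool:
    # Stand-in for bidirectional-entailment clustering (typically done with an
    # NLI model); here we simply compare normalized strings.
    return a.strip().lower() == b.strip().lower()

def semantic_entropy(answers):
    """Cluster sampled answers by meaning, then compute entropy over clusters.
    Higher entropy suggests the model is uncertain and may be confabulating."""
    clusters = []
    for ans in answers:
        for cluster in clusters:
            if semantically_equivalent(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    probs = [len(c) / len(answers) for c in clusters]
    return -sum(p * math.log(p) for p in probs)

# In practice the answers would come from sampling the LLM several times
# at non-zero temperature on the same question.
answers = ["Paris", "paris", "Lyon", "Paris", "Marseille"]
print(round(semantic_entropy(answers), 3))  # entropy over three meaning clusters
```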
Accountability
Sustainability
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling - arXiv, July 2024. Repeated sampling is an effective and cost-efficient method for scaling inference compute in LLMs, significantly improving task performance across a range of models and tasks; a small coverage (pass@k) sketch follows.
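The repeated-sampling result above is typically reported as coverage: the probability that at least one of k samples solves the task. As a small illustration, here is the standard unbiased pass@k estimator from the code-generation literature, computed from n samples of which c pass a verifier; the numbers in the usage lines are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples is correct),
    given n total samples of which c passed the verifier/checker."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up numbers: 200 samples drawn per problem, 17 passed the checker.
for k in (1, 10, 100):
    print(f"pass@{k} = {pass_at_k(200, 17, k):.3f}")
```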
Human Control and Interaction
- Dimensions underlying the representational alignment of deep neural networks with humans - 2024 Conference on Computer Vision. The paper analyzes representational alignment between humans and DNNs, highlighting divergent strategies.
- Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense - NAACL 2024. This paper examines the capabilities and limitations of large language models (LLMs) in understanding cultural commonsense.
- Towards a Science of Human-AI Decision Making: A Survey of Empirical Studies - FAccT '23. This survey reviews over 100 empirical studies to understand and improve human-AI decision-making, emphasizing the need for unified research frameworks.
Legal and Ethical Guidelines
Each summary is stored in the relevant subfolder within the summaries/ directory. You can browse through the summaries to quickly understand the main points of various papers.
We welcome contributions! Please read our CONTRIBUTING.md file for more details on how to contribute.