PyTorch implementation of our NAACL 2025 paper: FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data.
We propose a fact-checker that detects ungrounded hallucinations in large language model (LLM) outputs, together with a synthetic data generation method for collecting high-quality training data.
We leverage LLMs to convert text documents into content graphs. Graph data, being semi-structured, is easier to manipulate in both neural and symbolic ways than raw text, which lets us create high-quality, high-granularity synthetic data with fine-grained control.
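To illustrate the idea only (not the paper's actual extraction prompts or claim-generation code), the toy sketch below represents a content graph as hypothetical (subject, relation, object) triples and chains two connected edges into a two-hop path, the kind of structure from which multi-hop claims can be composed. All names and triples here are illustrative assumptions.

from collections import defaultdict

# Hypothetical triples an LLM might extract from a document (illustrative only).
triples = [
    ("Marie Curie", "born in", "Warsaw"),
    ("Warsaw", "capital of", "Poland"),
    ("Marie Curie", "won", "Nobel Prize in Physics"),
]

# Index edges by subject so hops can be chained.
graph = defaultdict(list)
for subj, rel, obj in triples:
    graph[subj].append((rel, obj))

def two_hop_paths(graph):
    """Yield (subject, rel1, bridge, rel2, object) chains that could seed multi-hop claims."""
    for subj, edges in graph.items():
        for rel1, bridge in edges:
            for rel2, obj in graph.get(bridge, []):
                yield (subj, rel1, bridge, rel2, obj)

for path in two_hop_paths(graph):
    print(path)  # e.g. ('Marie Curie', 'born in', 'Warsaw', 'capital of', 'Poland')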
Prior research on training grounded factuality classification models to detect hallucinations in large language models (LLMs) has relied on public natural language inference (NLI) data and synthetic data. However, conventional NLI datasets are not well-suited for document-level reasoning, which is critical for detecting LLM hallucinations. Recent approaches to document-level synthetic data generation involve iteratively removing sentences from documents and annotating factuality using LLM-based prompts. While effective, this method is computationally expensive for long documents and limited by the LLM's capabilities. In this work, we analyze the differences between existing synthetic training data used in state-of-the-art models and real LLM output claims. Based on our findings, we propose a novel approach for synthetic data generation, CG2C, that leverages multi-hop reasoning on context graphs extracted from documents. Our fact-checker model, FactCG, demonstrates improved performance with more connected reasoning, using the same backbone models. Experiments show it even outperforms GPT-4o on the LLM-AggreFact benchmark with a much smaller model size.
If you find the repository or FactCG helpful, please cite the following paper:
@inproceedings{lei2025factcg,
title={FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data},
author={Lei, Deren and Li, Yaxi and Li, Siyao and Hu, Mengya and Xu, Rui and Archer, Ken and Wang, Mingyu and Ching, Emily and Deng, Alex},
booktitle={NAACL},
year={2025}
}
We evaluate FactCG under dynamic-threshold and fixed-threshold settings on LLM-AggreFact (<1B models). The highest score (excluding GPT) for each dataset is highlighted in dark green and the second-highest in light green. We observe that FactCG-DBT, with only 0.4B parameters, achieves the best overall BAcc.
We tune the threshold for each LLM-AggreFact dataset to two decimal places, optimizing performance on the validation set. This lets us assess each fact-checker's optimal performance without fine-tuning for each specific scenario.
We fix the threshold to 0.5.
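For concreteness, here is a minimal sketch of the two settings, assuming scores are floats in [0, 1] and labels are binary. The function names and the grid over two-decimal thresholds are illustrative assumptions, not the benchmark script's actual code.

import numpy as np
from sklearn.metrics import balanced_accuracy_score

def tune_threshold(dev_scores, dev_labels):
    """Dynamic setting: pick the two-decimal threshold maximizing BAcc on the dev set."""
    candidates = np.round(np.arange(0.0, 1.01, 0.01), 2)
    return max(
        candidates,
        key=lambda t: balanced_accuracy_score(dev_labels, [s > t for s in dev_scores]),
    )

def evaluate(test_scores, test_labels, threshold=0.5):
    """Fixed setting: binarize scores at a constant threshold (0.5 by default)."""
    preds = [s > threshold for s in test_scores]
    return balanced_accuracy_score(test_labels, preds)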
We provide the FactCG PyTorch Lightning checkpoint with a DeBERTa-v3-large backbone (0.4B parameters) at
https://huggingface.co/derenlei/FactCG-DeBERTa-v3-Large/blob/main/factcg_dbt.ckpt
To run benchmark.sh directly, create a ckpt folder in the root directory and place the checkpoint inside.
Note: the checkpoint contains everything needed to restore a PyTorch Lightning training session. You can extract just the model parameters for lightweight inference (see the sketch below).
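For example, something along these lines should work. This is only a sketch: the "model." key prefix is an assumption about how the LightningModule wraps the backbone, so inspect the state_dict keys of the actual checkpoint before relying on it.

import torch

# Load the full Lightning checkpoint (weights, optimizer state, hyperparameters, ...).
ckpt = torch.load("./ckpt/factcg_dbt.ckpt", map_location="cpu")

# Lightning stores model weights under the "state_dict" key.
state_dict = ckpt["state_dict"]

# Keep only the backbone weights for lightweight inference; the "model." prefix
# is an assumption about the LightningModule attribute name.
model_weights = {k.removeprefix("model."): v for k, v in state_dict.items()}
torch.save(model_weights, "./ckpt/factcg_dbt_weights.pt")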
We trained and evaluated our model using torch==2.2.1 and pytorch_lightning>=2.2.0. We recommend using the same versions for reproducibility.
You can install factcg by cloning this repository and running pip install . in the root directory.
from factcg import FactCGScore
scorer = FactCGScore(model_name="microsoft/deberta-v3-large",
                     batch_size=16, ckpt_path="./ckpt/factcg_dbt.ckpt", verbose=False)
score = scorer.score(contexts=['sun raises from east'], claims=['sun raises from west'])
print(score) # example output [0.06528743356466293]
score = scorer.score(contexts=['sun raises from west'], claims=['sun raises from west'])
print(score) # example output [0.7840349674224854]
- model_name: the backbone model, e.g. 'microsoft/deberta-v3-large'.
- batch_size: batch size for inference (see the batched scoring sketch after this list).
- ckpt_path: path to the checkpoint. [download-here]
- verbose: whether to show the progress bar.
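Multiple context-claim pairs can be scored in one call, assuming contexts and claims are parallel lists as in the example above; the texts below are illustrative.

contexts = [
    "The Eiffel Tower was completed in 1889 and stands in Paris.",
    "The Eiffel Tower was completed in 1889 and stands in Paris.",
]
claims = [
    "The Eiffel Tower was completed in 1889.",  # grounded in the context
    "The Eiffel Tower was completed in 1850.",  # contradicts the context
]
scores = scorer.score(contexts=contexts, claims=claims)  # one score per pair, batched by batch_size
print(scores)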
Please note that, due to legacy issues, some previous fact-checker implementations have dependency conflicts with ours. For example, AlignScore requires pytorch-lightning<2 and >=1.7.7, but we use a higher version. Therefore, running the benchmark requires some extra dependencies beyond our factcg package. Due to time constraints, we did not explore resolving these conflicts into a single unified Python environment. Please check the following environment requirements for prior fact-checkers:
requirements for minicheck:
minicheck @ git+https://github.com/Liyan06/MiniCheck.git@main
accelerate>=0.26.0
requirements for alignscore, summa, and summac:
alignscore @ git+https://github.com/yuh-zha/AlignScore.git@main # pytorch-lightning<2 and >=1.7.7
summa==1.2.0
summac==0.0.3 #summac 0.0.3 depends on transformers==4.8.1
To reproduce FactCG-DBT with 2-stage training:
sh train.sh
- Evaluate on the LLM-AggreFact benchmark
./benchmark.sh
Note:
- you can evaluate different fact-checkers: FactCG, Minicheck, AlignScore, SummaC-ZS, and SummaC-CV
- you can set threshold-setting to tune to select the best threshold per dataset based on dev set performance on LLM-AggreFact, or to fixed to fix the threshold at 0.5
- Evaluation on Connected Reasoning
python3 wice_connected_reasoning.py --method factcg