Source code of our long paper:
Caseformer: Pre-training for Legal Case Retrieval
@article{su2023caseformer,
title={Caseformer: Pre-training for Legal Case Retrieval},
author={Su, Weihang and Ai, Qingyao and Wu, Yueyue and Ma, Yixiao and Li, Haitao and Liu, Yiqun},
journal={arXiv preprint arXiv:2311.00333},
year={2023}
}
Repository structure:
.
└── caseformer
├── data_preprocess
│ ├── crime_extraction.py
│ └── law_article_extraction.py
├── demo_data
│ ├── legal_documents
│ │ ├── file_format.txt
│ │ └── legal_documents.jsonl
│ └── preprocessed_training_data
│ ├── FDM_task.jsonl
│ ├── file_format.txt
│ └── LJP_task.jsonl
├── pre-training
│ ├── pre-train_reranker.sh
│ └── pre-train_retriever.sh
├── pre-training_data_generation
│ ├── calc_LP-ICF_score.py
│ ├── demo_data
│ │ ├── bm25_top100.jsonl
│ │ ├── extracted_crimes.jsonl
│ │ ├── extracted_law_articles.jsonl
│ │ └── LP-ICF_top100.jsonl
│ ├── generate_FDM_task_data.py
│ └── generate_LJP_task_data.py
├── README.md
└── requirements.txt
Installation:
git clone git@github.com:caseformer/caseformer.git
cd caseformer
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
Extract law articles from the legal documents:
cd caseformer
python ./data_preprocess/law_article_extraction.py \
--path_to_documents your_path \
--output_path your_path
Format of the input documents:
{"docID":string,"content":string}
{"docID":string,"content":string}
{"docID":string,"content":string}
......
{"docID":string,"content":string}
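The input format above (one JSON object per line, with string fields "docID" and "content") can be produced and checked with a few lines of Python. This is only a minimal sketch: the helper names write_documents and check_documents are hypothetical and are not part of this repository.

```python
import json
from pathlib import Path


def write_documents(records, path):
    """Write (docID, content) pairs as one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for doc_id, content in records:
            # ensure_ascii=False keeps non-ASCII legal text readable in the file.
            line = json.dumps({"docID": doc_id, "content": content}, ensure_ascii=False)
            f.write(line + "\n")


def check_documents(path):
    """Return the number of valid records; raise ValueError on the first bad line."""
    count = 0
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError(f"line {line_no}: expected a JSON object")
            for key in ("docID", "content"):
                if not isinstance(record.get(key), str):
                    raise ValueError(f"line {line_no}: '{key}' must be a string")
            count += 1
    return count
```

Running check_documents on a file before passing it to --path_to_documents catches missing or non-string fields early.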
Extract crimes from the legal documents:
cd caseformer
python ./data_preprocess/crime_extraction.py \
--path_to_documents your_path \
--output_path your_path
Format of the input documents:
{"docID":string,"content":string}
{"docID":string,"content":string}
{"docID":string,"content":string}
......
{"docID":string,"content":string}
Generate the pre-training data for the LJP task:
cd caseformer
python ./pre-training_data_generation/generate_LJP_task_data.py \
--BM25_top_100 path \
--law_articles path \
--crimes path \
--output_path your_path
Generate the pre-training data for the FDM task:
cd caseformer
python ./pre-training_data_generation/generate_FDM_task_data.py \
--LP-ICF_top_100 path \
--law_articles path \
--crimes path \
--output_path your_path
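The four commands above can be chained from Python. The sketch below only assembles the argument lists, using the script paths and flags shown in this README. It rests on several assumptions that the README does not confirm: that --output_path takes a file path, that the outputs of the two extraction scripts are what --law_articles and --crimes expect, and that the BM25 and LP-ICF ranking files already exist. The function names and output file names are hypothetical.

```python
import subprocess
import sys


def build_pipeline(documents, bm25_top100, lp_icf_top100, out_dir):
    """Return the four commands as argument lists, in execution order."""
    # Assumed intermediate and output file names; adjust to your own layout.
    law = f"{out_dir}/extracted_law_articles.jsonl"
    crimes = f"{out_dir}/extracted_crimes.jsonl"
    py = sys.executable
    return [
        [py, "./data_preprocess/law_article_extraction.py",
         "--path_to_documents", documents, "--output_path", law],
        [py, "./data_preprocess/crime_extraction.py",
         "--path_to_documents", documents, "--output_path", crimes],
        [py, "./pre-training_data_generation/generate_LJP_task_data.py",
         "--BM25_top_100", bm25_top100, "--law_articles", law,
         "--crimes", crimes, "--output_path", f"{out_dir}/LJP_task.jsonl"],
        [py, "./pre-training_data_generation/generate_FDM_task_data.py",
         "--LP-ICF_top_100", lp_icf_top100, "--law_articles", law,
         "--crimes", crimes, "--output_path", f"{out_dir}/FDM_task.jsonl"],
    ]


def run_pipeline(commands, cwd="."):
    """Run each command from the repository root, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, cwd=cwd, check=True)
```

Calling run_pipeline(build_pipeline(...), cwd="caseformer") would execute the steps in order; check=True stops the chain as soon as one script exits with an error.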
We will release the complete code and data in this repository.