Benchmark for evaluating Coq proof search tools.
Requirements:
- opam >= 2.1.2 (previous versions are untested)
- poetry >= 1.8.3 (previous versions are untested)
- python >= 3.11 (previous versions are untested)
- Clone this repository and its submodules:
git clone git@github.com:rkthomps/CoqStoq --recurse-submodules
- Build and initialize the CoqStoq python environment:
cd CoqStoq
poetry install
poetry shell
- Install the CoqStoq opam switch
opam switch import coqstoq.opam --switch=coqstoq --repos=default,coq-released=https://coq.inria.fr/opam/released
- Build the CoqStoq repositories
python3 coqstoq/build_projects.py
- Check your setup (from the project root directory)
pytest
A eval theorem is represented by the following python object:
@dataclass
class EvalTheorem:
project: Project
path: Path # relative path in the project
theorem_start_pos: Position # inclusive
theorem_end_pos: Position # inclusive line, exclusive column
proof_start_pos: Position # inclusive
proof_end_pos: Position # inclusive line, exclusive column
hash: str # Hash of file when theorem was collected
You can interact with the predefined EvalThm
s in CoqStoq in the following way.
Note that CoqStoq's theorems have been shuffled. That is, for some index
Also note, if you are going to load many theorems from CoqStoq, it is more efficient to use the get_theorems
function than the get_theorem
function.
from coqstoq import num_theorems, get_theorem, get_theorem_list, Split
# number of theorems in the testing split
print(num_theorems(Split.TEST))
# get the theorem at index 10 from the validation split
print(get_theorem(Split.VAL), 10)
# get the list of theorems in the cutoff split
print(get_theorems(Split.CUTOFF))
To add the results of a new tool to CoqStoq, we ask that the results of your tool be presented in a .json
file containing the following data structure (which has a .to_json()
)
@dataclass
class EvalResults:
hardware: str # Description of hardware used
results: list[Result]
@dataclass
class Result:
thm: EvalTheorem
proof: Optional[str] # Proof found
time: Optional[float] # Time in seconds
- We impose a 2-minute timeout when compiling files with
coqc
, if a file takes longer than 2 minutes to compile it is not included in our evaluation. In practice, this only affects 4 files from theBB5
project in the cutoff split. Namely,BB42Theorem.v, BB52Theorem.v, BB24Theorem.v, Skelet1.v
. - CoqStoq depends on coq-lsp to parse Coq Files. Sometimes,
coq-lsp
fails unexpectedly even whencoqc
successfully compiled a file. In our case,coq-lsp
failed on 49 files from the fourcolor project where all of the files were of the formtheories/job<n1>to<n2>.v
. Each of these files contain a single proof that is one tactic long:Proof. CheckReducible. Qed
.