Daily Oracle is a continuous evaluation benchmark using automatically generated QA pairs from daily news to assess how the future prediction capabilities of LLMs evolve over time.
git clone git@github.com:agentic-learning-ai-lab/daily-oracle.git
cd daily-oracle
pip install -r requirements.txt
The /qa-generation folder contains code for generating TF (True/False) and MC (Multiple Choice) QA pairs from daily news, with news articles selected either randomly or based on hot topics.
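To make the two formats concrete, here is a hypothetical example of each question type; the field names and values are illustrative, not the repo's exact schema:

```python
# Illustrative examples of the two generated question types.
# Field names are hypothetical, not the repo's exact schema.
tf_example = {
    "qtype": "tf",
    "question": "Will the Federal Reserve cut interest rates at its July 2024 meeting?",
    "answer": "No",                    # resolved by a later news article
    "resolution_date": "2024-07-31",   # date of the resolving article
}

mc_example = {
    "qtype": "mc",
    "question": "Which team will win the 2024 Copa America final?",
    "choices": ["Argentina", "Brazil", "Colombia", "Uruguay"],
    "answer": "Argentina",
    "resolution_date": "2024-07-14",
}
```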
Step 1: Select news articles from hot topics
Run the following command to select news articles from the top three hot topics of the day. Ensure the news file is located in the /news directory.
python qa-generation/hot_topic_selection.py \
--start_date 2024-07-01 \
--end_date 2024-07-04 \
--input_path ./news/<news-file-name>.csv \
--output_path ./data/raw_data
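As a rough intuition for what hot-topic selection involves, one plausible approach is to cluster a day's headlines and keep the articles in the largest clusters. This is only a sketch; hot_topic_selection.py implements the repo's actual logic, which may differ entirely:

```python
# A minimal sketch of one way to surface a day's hot topics: cluster headlines
# with TF-IDF + KMeans and keep the biggest clusters.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def top_topic_articles(day_df: pd.DataFrame, n_clusters: int = 10, n_topics: int = 3) -> pd.DataFrame:
    # Assumes a "headline" column and at least n_clusters articles for the day.
    vecs = TfidfVectorizer(stop_words="english").fit_transform(day_df["headline"])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vecs)
    day_df = day_df.assign(topic=labels)
    hot = day_df["topic"].value_counts().index[:n_topics]  # largest clusters
    return day_df[day_df["topic"].isin(hot)]
```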
Step 2: Generate TF and MC QA pairs
Run one of the following commands to generate QA pairs:
- from random news articles
python qa-generation/main.py \
--article_selection random \
--start_date 2024-07-01 \
--end_date 2024-07-04 \
--input_path ./news/<news-file-name>.csv \
--output_path ./data/raw_data
- from selected hot topic news articles (requires running Step 1 first)
python qa-generation/main.py \
--article_selection selected \
--start_date 2024-07-01 \
--end_date 2024-07-04 \
--input_path ./data/raw_data \
--output_path ./data/raw_data
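Under the hood, generation is prompt-driven. A minimal sketch of the idea, assuming an OpenAI-style client; the actual prompts and response parsing live in qa-generation/main.py:

```python
# Sketch of LLM-based TF question generation; prompt wording is illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Given the news article below, write one forward-looking true/false "
    "question that the article resolves, plus its answer.\n"
    "Article ({date}): {article}\n"
    'Respond as JSON: {{"question": "...", "answer": "Yes" or "No"}}'
)

def generate_tf_pair(article: str, date: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT.format(date=date, article=article)}],
    )
    return resp.choices[0].message.content
```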
Step 3: Clean the generated data
In Step 2, the raw QA pairs are saved as pickle files and need postprocessing to filter out invalid questions. Use the following command to clean the data:
python qa-generation/clean_raw_data.py \
--start_date 2024-07-01 \
--end_date 2024-07-04 \
--input_path ./data/raw_data \
--output_path ./data/clean_data
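A rough sketch of what this postprocessing amounts to, assuming the field names from the earlier example; the real filtering rules are in qa-generation/clean_raw_data.py:

```python
# Sketch of Step 3: load raw pickled QA pairs, drop malformed ones, save CSV.
import pickle
import pandas as pd

def is_valid(p: dict) -> bool:
    if not p.get("question"):
        return False
    if p.get("qtype") == "tf":
        return p.get("answer") in {"Yes", "No"}
    return bool(p.get("choices")) and p.get("answer") in p["choices"]

def clean_day(raw_pkl: str, out_csv: str) -> None:
    with open(raw_pkl, "rb") as f:
        pairs = pickle.load(f)
    pd.DataFrame([p for p in pairs if is_valid(p)]).to_csv(out_csv, index=False)
```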
Supported models:
gpt-35, gpt-4, gpt-4o-mini, gpt-4o, claude-3-5-sonnet, llama3, mistral-7b, mixtral-8x7b, gemma2-2b, qwen2-7b. You can also add new models in the load_model() function in /evaluation/eval.py.
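For a local Hugging Face model, a new entry might look like the sketch below. The real load_model() defines the actual interface, so treat this as a shape only; the model name here is hypothetical:

```python
# Sketch of extending load_model() with a new Hugging Face model.
from transformers import pipeline

def load_model(model_name: str):
    if model_name == "phi-3-mini":  # hypothetical new entry
        return pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct")
    raise ValueError(f"Unsupported model: {model_name}")
```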
In the closed-book setting, we ask the questions directly without providing extra information.
python evaluation/eval.py \
--eval_setting closed-book \
--model_name gpt-35 \
--input_path ./data/clean_data/<question-file-name>.csv \
--output_path ./evaluation/results
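In this setting the prompt contains nothing but the question. A sketch of the idea, with illustrative wording; the actual prompt templates are in evaluation/eval.py:

```python
# Sketch of closed-book prompts; wording is illustrative.
def closed_book_prompt(question: str, choices: list[str] | None = None) -> str:
    if choices:  # MC question: list the options
        opts = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
        return f"Question: {question}\n{opts}\nAnswer with the option letter:"
    return f"Question: {question}\nAnswer with Yes or No:"  # TF question
```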
In the constrained open-book setting, we explore how access to news articles up to different time cutoffs influences LLM performance using RAG.
Step 1: Tokenize data for RAG
python evaluation/rag_tokenize_data.py \
--news_path ./news/<news-file-name>.csv \
--question_path ./data/clean_data/<question-file-name>.csv \
--output_path ./evaluation/rag_data/input
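A minimal sketch of what this tokenization step amounts to, assuming whitespace tokenization and "date"/"text" columns; the repo's tokenizer and column names may differ:

```python
# Sketch of preparing the news corpus for BM25-style retrieval.
import pickle
import pandas as pd

def tokenize_news(news_csv: str, out_pkl: str) -> None:
    df = pd.read_csv(news_csv)
    tokens = [t.lower().split() for t in df["text"].fillna("")]
    with open(out_pkl, "wb") as f:
        pickle.dump({"dates": df["date"].tolist(), "tokens": tokens}, f)
```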
Step 2: Retrieve articles
- RAG_cutoff: the latest accessible date for retrieving articles
- The accessible date range for a question with resolution date d_res is [start_date_of_news_corpus, min(d_res − 1, RAG_cutoff)]
- For questions whose resolution date is:
  a) before RAG_cutoff, we use a dynamic cutoff, with the retriever having access to information up to d_res − 1
  b) after RAG_cutoff, we use a fixed retriever with access to information up to RAG_cutoff

For the following command, use --dynamic_cutoff for scenario (a). For scenario (b), omit the flag.
python evaluation/rag.py \
--news_path ./evaluation/rag_data/input/<news-file-name>_tokenized.pkl \
--question_path ./evaluation/rag_data/input/<question-file-name>_tokenized.pkl \
--output_path ./evaluation/rag_data/output \
--cutoff_date 2024-03-01 \
--dynamic_cutoff
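A minimal sketch of the cutoff logic above, assuming a BM25 retriever via the rank_bm25 package; the repo's actual retriever and data layout may differ:

```python
# Sketch of cutoff-aware retrieval: restrict the corpus to articles dated no
# later than min(d_res - 1, RAG_cutoff) under --dynamic_cutoff, or RAG_cutoff
# otherwise, then rank with BM25.
from datetime import date, timedelta
from rank_bm25 import BM25Okapi

def retrieve(question_tokens: list[str], d_res: date,
             article_tokens: list[list[str]], article_dates: list[date],
             rag_cutoff: date, dynamic_cutoff: bool, k: int = 5) -> list[list[str]]:
    if dynamic_cutoff:
        latest = min(d_res - timedelta(days=1), rag_cutoff)  # scenario (a)
    else:
        latest = rag_cutoff                                  # scenario (b)
    pool = [t for t, d in zip(article_tokens, article_dates) if d <= latest]
    if not pool:
        return []
    return BM25Okapi(pool).get_top_n(question_tokens, pool, n=k)
```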
Step 3: Evaluation
Note that the --input_path should be the pickle file generated in Step 2.
python evaluation/eval.py \
--eval_setting open-book \
--model_name gpt-35 \
--input_path ./evaluation/rag_data/output/cutoff_<cutoff_date>/<rag-question-file-name>.pkl \
--output_path ./evaluation/results
In the gold article setting, models are given direct access to the gold article from which the question was generated.
python evaluation/eval.py \
--eval_setting gold-article \
--model_name gpt-35 \
--input_path ./data/clean_data/<question-file-name>.csv \
--output_path ./evaluation/results
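Here the prompt simply prepends the gold article to the question; a sketch with illustrative wording, since the actual template is in evaluation/eval.py:

```python
# Sketch of a gold-article prompt; wording is illustrative.
def gold_article_prompt(question: str, gold_article: str) -> str:
    return (
        "Answer the question using the article below.\n\n"
        f"Article: {gold_article}\n\nQuestion: {question}\nAnswer:"
    )
```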
Citation:
@article{dai2024dailyoracle,
title={Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle},
author={Dai, Hui and Teehan, Ryan and Ren, Mengye},
journal={arXiv preprint arXiv:2411.08324},
year={2024}
}