Fundus News Scraper Evaluation

This repository contains the evaluation code and dataset to reproduce the results from the paper "Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions".

Fundus is a user-friendly news scraper that enables users to obtain millions of high-quality news articles with just a few lines of code.

In the following sections, we provide instructions to reproduce the comparative evaluation of Fundus against prominent scraping libraries. Our evaluation shows that Fundus yields significantly higher quality extractions (complete and artifact-free news articles) than comparable news scrapers. For a more in-depth overview of Fundus, the evaluation practices, and its results, consult the result summary and our paper.

Prerequisites

Fundus and this evaluation repository require Python 3.8 or later. The Boilerpipe scraper additionally requires Java. (Note: The evaluation was tested and performed using Python 3.8 and Java JDK 17.0.10.)

To install the fundus-evaluation Python package, including the reference scraper dependencies, clone this GitHub repository and simply install the package using pip:

git clone https://github.com/dobbersc/fundus-evaluation.git
pip install ./fundus-evaluation

This installation also contains the dataset and evaluation results. If you are only interested in the Python package (without the dataset and evaluation results), install the fundus-evaluation package directly from GitHub using pip:

pip install git+https://github.com/dobbersc/fundus-evaluation.git@master

Verify the installation by running evaluate --version, with the expected output of evaluate <version>, where <version> specifies the current version of the evaluation package.
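The same check can be sketched from Python. This is a minimal sketch, not part of the package: the helper names are hypothetical, and it assumes the distribution is registered under the name fundus-evaluation and that the CLI is exposed as evaluate, as described above.

```python
import shutil
from importlib import metadata
from typing import Optional


def installed_version(distribution: str = "fundus-evaluation") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    The default distribution name is an assumption based on this README.
    """
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def cli_on_path(command: str = "evaluate") -> bool:
    """Report whether the given command can be found on the current PATH."""
    return shutil.which(command) is not None


if __name__ == "__main__":
    print("package version:", installed_version())
    print("evaluate CLI found:", cli_on_path())
```

If the version is None or the CLI is not found, the package is most likely installed into a different Python environment than the one currently active.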

Development

For development, install the package, including the development dependencies:

git clone https://github.com/dobbersc/fundus-evaluation.git
pip install -e ./fundus-evaluation[dev]

Reproducing the Evaluation Results

In the following steps, we assume that the current working directory is the root of the repository.

To fully reproduce the evaluation results, only the dataset is required. Each step in the evaluation pipeline requires the outputs from the previous step (dataset -> scrape -> score -> analysis). To ease reproducibility, we also provide the artifacts of the intermediate steps in the dataset folder. Therefore, the pipeline may be started from any step.
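The dependency between the steps can be illustrated with a small sketch that reports which steps still have to run for a given dataset folder. It is not part of the package. The function name is hypothetical, the output directory names (extractions, scores, analysis) follow the commands used later in this README, and treating a missing or empty directory as "step not yet run" is an assumption made for illustration.

```python
from pathlib import Path
from typing import List

# (pipeline step, output directory inside the dataset folder), in execution order.
PIPELINE = [
    ("scrape", "extractions"),
    ("score", "scores"),
    ("analysis", "analysis"),
]


def remaining_steps(dataset_dir: Path) -> List[str]:
    """Return the steps to run, starting at the first step without outputs.

    Assumption: a step counts as done when its output directory exists and
    contains at least one entry.
    """
    for index, (_, output) in enumerate(PIPELINE):
        output_dir = dataset_dir / output
        if not output_dir.is_dir() or not any(output_dir.iterdir()):
            # Every later step depends on this one, so it has to run as well.
            return [step for step, _ in PIPELINE[index:]]
    return []


if __name__ == "__main__":
    print(remaining_steps(Path("dataset")))
```

Because the repository ships the intermediate artifacts, the list is expected to be empty for a fresh clone; deleting one output directory makes that step and all later steps reappear.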

Usage

The evaluation results may be reproduced using the package's command line interface (CLI), representing the evaluation pipeline steps:

$ evaluate --help
usage: evaluate [-h] [--version] {complexity,scrape,score,analysis} ...

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

Fundus News Scraper Evaluation:
  select evaluation pipeline step

  {complexity,scrape,score,analysis}
    complexity          calculate page complexity scores
    scrape              scrape extractions on the evaluation dataset
    score               calculate evaluation scores
    analysis            generate tables and plots

Each entry point also provides its help page, e.g. with evaluate scrape --help.

As an alternative to the CLI, we provide direct Python entry points in fundus_evaluation.entry_points. In the following steps, we will use the CLI.

(1) Obtaining the Evaluation Dataset

We selected the 16 English-language publishers Fundus currently supports as the data source, and retrieved five articles for each publisher from the respective RSS feeds/sitemaps. The selection process yielded an evaluation corpus of 80 news articles. From it, we manually extracted the plain text from each article and stored it together with information on the original paragraph structure.

The resulting evaluation dataset is included in this repository and consists of the (compressed) HTML article files and their ground truth extractions as JSON.

(2) Generating the Scraper Extractions

Execute the following command to let all supported scrapers extract the plain text of the evaluation dataset's articles:

evaluate scrape \
  --ground-truth-path dataset/ground_truth.json \
  --html-directory dataset/html/ \
  --output-directory dataset/extractions/

To restrict the scrapers that are part of the evaluation,

  • use the --scrapers option to explicitly specify a list of evaluation scrapers,
  • or use the --exclude-scrapers option to exclude scrapers from the evaluation.

For example, to exclude BoilerNet, which is very resource-intensive, add the --exclude-scrapers boilernet argument to the command above.
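When the scrape step is driven from a script, the command above can be assembled as an argument list. The following is a minimal sketch: the function name is hypothetical, only the flags shown in this section are used, and it passes a single scraper name to --exclude-scrapers because that is the only form shown here.

```python
from typing import List, Optional


def build_scrape_command(
    ground_truth_path: str = "dataset/ground_truth.json",
    html_directory: str = "dataset/html/",
    output_directory: str = "dataset/extractions/",
    exclude_scraper: Optional[str] = None,
) -> List[str]:
    """Assemble the `evaluate scrape` invocation as a list of arguments."""
    command = [
        "evaluate", "scrape",
        "--ground-truth-path", ground_truth_path,
        "--html-directory", html_directory,
        "--output-directory", output_directory,
    ]
    if exclude_scraper is not None:
        # Single scraper name only, mirroring the example in this README.
        command += ["--exclude-scrapers", exclude_scraper]
    return command


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(build_scrape_command(exclude_scraper="boilernet")))
```

Passing the returned list to subprocess.run(command, check=True) would execute the step, provided the evaluate CLI is installed in the active environment.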

(3) Calculating the Evaluation Scores

To evaluate the extraction results with the three supported metrics (paragraph match, ROUGE-LSum and WER), run the following command:

evaluate score \
  --ground-truth-path dataset/ground_truth.json \
  --extractions-directory dataset/extractions/ \
  --output-directory dataset/scores/
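For intuition about two of these metrics, the sketch below computes a word-level WER and a plain ROUGE-L on whitespace-separated tokens. It is not the repository's implementation: the function names are hypothetical, whitespace tokenization is an assumption, and plain ROUGE-L is used in place of ROUGE-LSum, which additionally splits the text into units before matching.

```python
from typing import Sequence, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j]: edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # reference word missing in hypothesis
                current[j - 1] + 1,      # extra word in hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(reference: str, hypothesis: str) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) based on the LCS of the token sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    lcs = lcs_length(ref, hyp)
    precision = lcs / len(hyp) if hyp else 0.0
    recall = lcs / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

In this reading, an extraction that drops half of an article but adds nothing keeps full precision and loses recall, while leftover boilerplate lowers precision and leaves recall untouched.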

Calculating the Page Complexity (Optional)

This step is not part of the evaluation in our paper and is thus optional.

Execute the following command to calculate the page complexity scores established in "An Empirical Comparison of Web Content Extraction Algorithms" (Bevendorff et al., 2023):

evaluate complexity \
  --ground-truth-path dataset/ground_truth.json \
  --html-directory dataset/html/ \
  --output-path dataset/complexity.tsv

(4) Analyzing the Data

Run the following command to produce the paper's tables and plots for the ROUGE-LSum score:

evaluate analysis --rouge-lsum-path dataset/scores/rouge_lsum.tsv --output-directory dataset/analysis/

To also produce a boxplot of the page complexity, execute:

evaluate analysis --complexity-path dataset/complexity.tsv --output-directory dataset/analysis/

Results

The following table summarizes the overall performance of Fundus and evaluated scrapers in terms of averaged ROUGE-LSum precision, recall and F1-score and their standard deviation. In addition, we provide the scrapers' versions at their evaluation time. The table is sorted in descending order over the F1-score:

Fundus-Evaluation v0.2.0

Scraper       Precision     Recall        F1-Score      Version
Fundus        99.89±0.57    96.75±12.75   97.69±9.75    0.4.1
Trafilatura   93.91±12.89   96.85±15.69   93.62±16.73   1.12.0
news-please   97.95±10.08   91.89±16.15   93.39±14.52   1.6.13
BTE           81.09±19.41   98.23±8.61    87.14±15.48   /
jusText       86.51±18.92   90.23±20.61   86.96±19.76   3.0.1
BoilerNet     85.96±18.55   91.21±19.15   86.52±18.03   /
Boilerpipe    82.89±20.65   82.11±29.99   79.90±25.86   1.3.0

Previous Results

Fundus-Evaluation v0.1.0

Scraper       Precision     Recall        F1-Score      Version
Fundus        99.89±0.57    96.75±12.75   97.69±9.75    0.2.2
Trafilatura   90.54±18.86   93.23±23.81   89.81±23.69   1.7.0
BTE           81.09±19.41   98.23±8.61    87.14±15.48   /
jusText       86.51±18.92   90.23±20.61   86.96±19.76   3.0.0
news-please   92.26±12.40   86.38±27.59   85.81±23.29   1.5.44
BoilerNet     84.73±20.82   90.66±21.05   85.77±20.28   /
Boilerpipe    82.89±20.65   82.11±29.99   79.90±25.86   1.3.0
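
The mean±standard deviation cells in the tables above can be read as an aggregation over per-article scores. The sketch below shows one way to produce such a cell and the F1-based ordering. It is not the repository's analysis code: the function names are hypothetical, scores are assumed to lie in the range 0 to 1, and the population standard deviation is an assumption, since this README does not state which estimator is used.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def summarize(scores: Sequence[float]) -> str:
    """Format per-article scores in [0, 1] as 'mean±std' in percent."""
    # pstdev (population standard deviation) is an assumption of this sketch.
    return f"{100 * mean(scores):.2f}±{100 * pstdev(scores):.2f}"


def rank_by_f1(f1_scores: Dict[str, Sequence[float]]) -> List[str]:
    """Order scraper names by their mean F1-score, best first."""
    return sorted(f1_scores, key=lambda name: mean(f1_scores[name]), reverse=True)


if __name__ == "__main__":
    # Made-up per-article F1-scores for two placeholder scrapers.
    example = {"scraper_a": [0.5, 0.5], "scraper_b": [1.0, 0.5]}
    for name in rank_by_f1(example):
        print(name, summarize(example[name]))
```

A large standard deviation next to a high mean, as in several rows above, indicates that a scraper extracts most articles well but fails badly on a few.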

Contributing

We encourage contributions, particularly those involving competitive news scrapers. For example, you can contribute by:

  • Submitting a New Scraper: Open an issue or submit a pull request to incorporate your scraper into our evaluation pipeline. We will review and integrate new submissions as appropriate.
  • Updating an Existing Scraper: Please inform us if a supported scraper has undergone significant updates. We are open to re-evaluating our results accordingly. (Previous evaluation results are available on our Release Page.)

Note: We also appreciate contributions to the Fundus library!

Questions and Support

Please open an issue for unresolved questions about our paper or the evaluation in this repository. For questions about general functionality or bug reports regarding Fundus, please refer to our main repository and submit an issue there.

Cite

Please cite the following paper when using Fundus or building upon our work:

@inproceedings{dallabetta-etal-2024-fundus,
    title = "Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions",
    author = "Dallabetta, Max  and
      Dobberstein, Conrad  and
      Breiding, Adrian  and
      Akbik, Alan",
    editor = "Cao, Yixin  and
      Feng, Yang  and
      Xiong, Deyi",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-demos.29",
    pages = "305--314",
    abstract = "This paper introduces Fundus, a user-friendly news scraper that enables users to obtain millions of high-quality news articles with just a few lines of code. Unlike existing news scrapers, we use manually crafted, bespoke content extractors that are specifically tailored to the formatting guidelines of each supported online newspaper. This allows us to optimize our scraping for quality such that retrieved news articles are textually complete and without HTML artifacts. Further, our framework combines both crawling (retrieving HTML from the web or large web archives) and content extraction into a single pipeline. By providing a unified interface for a predefined collection of newspapers, we aim to make Fundus broadly usable even for non-technical users. This paper gives an overview of the framework, discusses our design choices, and presents a comparative evaluation against other popular news scrapers. Our evaluation shows that Fundus yields significantly higher quality extractions (complete and artifact-free news articles) than prior work. The framework is available on GitHub under https://github.com/flairNLP/fundus and can be simply installed using pip.",
}

Acknowledgements

  • This repository's architecture has been inspired by the web content extraction benchmark (Bevendorff et al., 2023).
  • Since BoilerNet has no Python package on PyPI, we adopted a stripped-down version of the upstream BoilerNet provided by Bevendorff et al. from their web content extraction benchmark.
  • Similarly, BTE has no Python package on PyPI. Here, we used the implementation by Jan Pomikalek found from this and this source.