
eseBench: Benchmarking low-resource entity set expansion methods

Given a corpus and a seed set of entities of user-defined categories, entity set expansion (ESE) methods aim to find more entities that belong to the same categories. In this work, we consider user-generated text to understand the generalizability of ESE methods. We develop new benchmarks and propose more rigorous evaluation metrics for assessing the performance of ESE methods. Additionally, we identify phenomena such as non-named entities, multifaceted entities, and vague concepts, which are more prevalent in user-generated text than in well-formed text, and use them to profile ESE methods.

Running the code

Step 1: Corpus pre-processing

You first need to create a data folder in the project root if it does not already exist. Then create a new folder named after your dataset, $DATA, inside the data folder. Finally, create a $DATA/intermediate/ folder and put the intermediate files generated by AutoPhrase into it.

data/$DATA
└── intermediate
    └── AutoPhrase_single-word.txt: the sub-ranked list for single-word phrases only.
    └── AutoPhrase_multi-words.txt: the sub-ranked list for multi-word phrases only.
    └── sent_segmentation.txt: the corpus segmented into sentences, with highlighted phrases enclosed in phrase tags (e.g., <phrase>data mining</phrase>).
    └── sentences.json: documents with entity phrases and noun chunks.
    └── BERTembed+seeds.txt: embeddings for the keyphrases.

To run AutoPhrase on a new dataset, please follow the instructions provided here.
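Before moving to the next step, it can help to verify that the expected intermediate files are all in place. The following is a minimal sanity-check sketch (not part of the repository); it only assumes the directory layout shown above and that the dataset name is passed as the first command-line argument:

import os
import sys

# Dataset name (the value of $DATA), passed as the first argument.
data = sys.argv[1]
intermediate_dir = os.path.join("data", data, "intermediate")

# Intermediate files expected by the pipeline, as listed above.
expected_files = [
    "AutoPhrase_single-word.txt",
    "AutoPhrase_multi-words.txt",
    "sent_segmentation.txt",
    "sentences.json",
    "BERTembed+seeds.txt",
]

missing = [name for name in expected_files
           if not os.path.isfile(os.path.join(intermediate_dir, name))]
if missing:
    print("Missing intermediate files:", ", ".join(missing))
else:
    print("All intermediate files found in", intermediate_dir)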

Step 2: Seed Definition

You next need to create a comma-separated csv file named seed_aligned_concepts.csv under the $DATA folder to specify the seed entities.

data/$DATA
└── seed_aligned_concepts.csv

Here is what it should look like:

| alignedCategoryName  | unalignedCategoryName | generalizations | seedInstances                                                      |
|----------------------|-----------------------|-----------------|--------------------------------------------------------------------|
| technology           | technology            |                 | "['distributed systems', 'load balancing', 'network monitoring']"  |
| programming_language | programming language  |                 | "['python', 'sql', 'java', 'html', 'perl', 'javascript', 'php']"   |

alignedCategoryName is the canonical name of the category. unalignedCategoryName is the common name of the category used by the pre-trained language model. generalizations lists any generalizations for the category; this field is optional. seedInstances is a comma-separated list of seed entities; typically 5-10 entities per category are sufficient.
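If you prefer to generate the seed file programmatically, a short Python sketch like the one below will do. It is only illustrative: the dataset name my_dataset and the example category are placeholders, and the tool only requires that the resulting CSV has the four columns described above:

import csv

# Example seed concept; replace with your own categories and seed entities.
rows = [
    {
        "alignedCategoryName": "programming_language",
        "unalignedCategoryName": "programming language",
        "generalizations": "",  # optional field, may be left empty
        "seedInstances": "['python', 'sql', 'java', 'html', 'perl', 'javascript', 'php']",
    },
]

fieldnames = ["alignedCategoryName", "unalignedCategoryName",
              "generalizations", "seedInstances"]

# Write the comma-separated seed file under data/$DATA/.
with open("data/my_dataset/seed_aligned_concepts.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)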

Step 3: Run the tool

First, create a conda environment from the provided environements/conda_entity_expan.yml file. Then run the following to extract keywords as entity candidates from the corpus, learn their embeddings, and create a ranked list of entities for each category.

cd src
source activate conda_entity_expan
./expand_taxonomy.sh $DATA

This creates the following new output files under:

data/$DATA
└── intermediate
    └── ee_mrr_combine_bert_k=200.csv: top-200 predictions based on MRR over corpus embeddings and PLM rankings
    └── ee_concept_knn_k=None.csv: ranked list of entities based on corpus embeddings
    └── ee_LM_bert_k=None.csv: ranked list of entities based on PLM
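A quick way to inspect these outputs is to load them with pandas and look at the first few rows. The sketch below is only illustrative (my_dataset is a placeholder); the column layout of the CSVs is not documented here, so print the columns before relying on specific names:

import pandas as pd

# Load the combined top-200 predictions for inspection.
preds = pd.read_csv("data/my_dataset/intermediate/ee_mrr_combine_bert_k=200.csv")

# Check the actual schema first, then look at the highest-ranked predictions.
print(preds.columns.tolist())
print(preds.head(20))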

Benchmark Details

We have made four benchmark datasets public: apr, tripadvisor, wiki, and yelp. They can be accessed from the Benchmark folder under the project root. The directory structure of each dataset is as follows:

Benchmark/$DATA
└── entity_candidates.txt: frequency distribution of entity candidates
└── final_benchmark.csv: the ground truth labels of entity candidates
└── entity_properties.csv: the characteristics of positive entity candidates (multifaceted=`y/n`, vague=`y/n`, non-named=`y/n`)
└── seed.csv: user-provided entity candidates for concepts
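The benchmark files are plain text/CSV and can be explored directly. Below is a minimal sketch, using yelp as an example dataset and assuming pandas is available; check the actual column names before filtering on them:

import pandas as pd

# Ground-truth labels and entity property annotations for one benchmark dataset.
labels = pd.read_csv("Benchmark/yelp/final_benchmark.csv")
props = pd.read_csv("Benchmark/yelp/entity_properties.csv")

# Print the schemas before relying on specific column names.
print(labels.columns.tolist())
print(props.columns.tolist())
print(labels.head())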

Citation and Contact

For more details on the benchmark and experiments, please read our technical paper from NAACL 2022. Cite our work as follows:

@inproceedings{shao-etal-2022-low,
    title = "Low-resource Entity Set Expansion: A Comprehensive Study on User-generated Text",
    author = "Shao, Yutong  and
      Bhutani, Nikita  and
      Rahman, Sajjadur  and
      Hruschka, Estevam",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-naacl.100",
    pages = "1343--1353"
}

Contact

To get help with problems using eseBench or replicating our results, please submit a GitHub issue.

For personal communication related to eseBench, please contact Nikita Bhutani (nikita@megagon.ai), or Sajjadur Rahman (sajjadur@megagon.ai).

Disclosure

Embedded in, or bundled with, this product are open source software (OSS) components, datasets and other third party components identified below. The license terms respectively governing the datasets and third-party components continue to govern those portions, and you agree to those license terms, which, when applicable, specifically limit any distribution. You may receive a copy of, distribute and/or modify any open source code for the OSS component under the terms of their respective licenses. In the event of conflicts between Megagon Labs, Inc. Recruit Co., Ltd., license conditions and the Open Source Software license conditions, the Open Source Software conditions shall prevail with respect to the Open Source Software portions of the software. You agree not to, and are not permitted to, distribute actual datasets used with the OSS components listed below. You agree and are limited to distribute only links to datasets from known sources by listing them in the datasets overview table below. You are permitted to distribute derived datasets of data sets from known sources by including links to original dataset source in the datasets overview table below. You agree that any right to modify datasets originating from parties other than Megagon Labs, Inc. are governed by the respective third party’s license conditions. All OSS components and datasets are distributed WITHOUT ANY WARRANTY, without even implied warranty such as for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE, and without any liability to or claim against any Megagon Labs, Inc. entity other than as explicitly documented in this README document. You agree to cease using any part of the provided materials if you do not agree with the terms or the lack of any warranty herein. While Megagon Labs, Inc., makes commercially reasonable efforts to ensure that citations in this document are complete and accurate, errors may occur. If you see any error or omission, please help us improve this document by sending information to contact_oss@megagon.ai.

All datasets used within the product are listed below (including their copyright holders and the license conditions). For Datasets having different portions released under different licenses, please refer to the included source link specified for each of the respective datasets for identifications of dataset files released under the identified licenses.

| ID | Dataset     | Modified | Copyright Holder       | Source Link | License                                                                |
|----|-------------|----------|------------------------|-------------|------------------------------------------------------------------------|
| 1  | APR         | Yes      | University of Illinois | source      | N/A                                                                    |
| 2  | Wiki        | Yes      | University of Illinois | source      | N/A                                                                    |
| 3  | Yelp        | Yes      | University of Illinois | source      | N/A                                                                    |
| 4  | TripAdvisor | Yes      | Zenodo                 | source      | Creative Commons Attribution Non-Commercial 4.0 International License |
