This repository has been archived by the owner on Jan 10, 2023. It is now read-only.

Integrate link graph into main wiki pipeline #429

Merged 2 commits on Nov 15, 2019

ringgaard
Contributor

I have integrated the link graph into the main wiki pipeline. The fan-in values now also include counts from the basic facts of the items.

I have also updated the documentation to use the new Myelin-based parser trainer.

@ringgaard ringgaard self-assigned this Nov 15, 2019

Contributor

@anders-sandholm anders-sandholm left a comment

Great to see the handling of Wikidata links implemented.
And thanks for making the documentation more up-to-date as well.

## Preparing the training data

The LDC2013T19 OntoNotes 5 corpus is needed to produce the training data for
CASPAR. This is licensed by LDC and you need a LDC license to use the corpus:

Contributor

a -> an

Contributor Author

Done


## Pre-trained word embeddings

The CASPAR parser uses pre-trained word embeddings which can be download from

Contributor

download -> downloaded

Contributor Author

Done.

with Python 3 you can install a pre-built wheel:

```
sudo -H pip3 install http://www.jbox.dk/sling/sling-2.0.0-py3-none-linux_x86_64.whl
```
and download the pre-trained model:

You can test the installing by trying to import the `sling` package:

Contributor

installing -> installation

Contributor Author

Done
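
The installation test quoted above amounts to checking that the `sling` package is importable. A minimal sketch of that check, assuming only the Python standard library; the helper name `sling_installed` is hypothetical and not part of SLING:

```python
import importlib.util


def sling_installed() -> bool:
    """Return True if a top-level package named `sling` can be located.

    find_spec() only looks the package up; it does not import it, so this
    check is safe to run even when the wheel is missing.
    """
    return importlib.util.find_spec("sling") is not None


if __name__ == "__main__":
    print("sling importable:", sling_installed())
```

Running `import sling` directly, as the documentation suggests, is the more complete test because it also loads the package; the lookup above only confirms that Python can find it.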

graph that is stored in the `/w/item/links` property for each item. The link
graph is built over all the Wikipedias being processed. The fan-in,
i.e. the number of links to the item, is also computed and stored in the
`/w/item/popularity` property. Tge popularity count also includes the number of

Contributor

Tge -> The

Contributor Author

Done.

graph is built over all the Wikipedias being processed. The fan-in,
i.e. the number of links to the item, is also computed and stored in the
`/w/item/popularity` property. Tge popularity count also includes the number of
times the item is a fact taget in other items.

Contributor

taget -> target

Contributor Author

Done.
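
The popularity count described in the quoted documentation combines two sources: the fan-in from the link graph and the number of times an item appears as a fact target in other items. A rough sketch of that counting, not the SLING implementation: the dictionary layout, the `compute_popularity` name, and the choice to weight links by their count are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical input: item id -> outgoing links (target id -> link count)
# and a list of fact targets. None stands for a target without an id.
example_items = {
    "Q1": {"links": {"Q2": 3, "Q3": 1}, "facts": ["Q2", None]},
    "Q2": {"links": {"Q3": 2}, "facts": ["Q3", "Q1"]},
}


def compute_popularity(items):
    """Sum link fan-in and fact-target occurrences for every target id."""
    popularity = Counter()
    for item in items.values():
        # Fan-in from the link graph: every link to a target counts.
        for target, count in item.get("links", {}).items():
            popularity[target] += count
        # Fact targets: skip targets that have no id, then count one each.
        for target in item.get("facts", []):
            if not target:
                continue
            popularity[target] += 1
    return popularity


popularity = compute_popularity(example_items)
```

Skipping targets without an id mirrors the `if (id.empty()) continue;` check in the reviewed C++ code, which is the only filtering step the thread confirms.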

Text id = store->FrameId(target);
if (id.empty()) continue;

if (!store->IsFrame(target)) continue;

Contributor

Looks like this line might be redundant.

Contributor Author

Oops. Removed.

if (id.empty()) continue;

if (!store->IsFrame(target)) continue;
accumulator_.Increment(id);

Contributor

indention off by one

Contributor Author

Fixed.

Contributor

@anders-sandholm anders-sandholm left a comment

Forgot to hit approve in my previous update.
LGTM.

Contributor Author

@ringgaard ringgaard left a comment

Thanks for the review

@ringgaard ringgaard merged commit 93541e9 into google:master Nov 15, 2019
@ringgaard ringgaard deleted the wikilink branch November 15, 2019 15:36