This repository has been archived by the owner on Jan 10, 2023. It is now read-only.

Alias transfer #326

Merged
merged 2 commits into from
Jan 22, 2019

ringgaard
Contributor

We use anchors, redirects, and Wikipedia titles as alias sources for building the phrase table, but these are all noisy sources. I have implemented "alias transfer" in the phrase table builder to clean up some of these noisy aliases. The basic idea is that an item cannot have the same name as another item that it is related to, except through a small list of exempt relations (e.g. named after, different from). The aliases are divided into reliable and unreliable aliases based on their source. If an item has an unreliable alias and is related to another item that has the same alias from a reliable source, the alias count is transferred to the target.

The alias transfer procedure cleans up many cases of noisy aliases. For instance, "botanist" is removed from "botany" because [botanist (Q2374149) field of this occupation (P425): botany (Q441)].

Wikidata has finer granularity than any individual Wikipedia, so alias transfer also fixes many cases where a "sub-item" has a redirect to a broader item, e.g. a village that is not in Wikipedia redirects to the county, or a taxon redirects to its parent taxon.
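
As a rough illustration of the transfer rule described above, here is a minimal C++ sketch. It is not the code from this PR: the `Alias`, `Item`, and `Relation` types, the `TransferAliases` and `MoveUnreliable` functions, and the in-memory vectors are all hypothetical stand-ins for the frame-based phrase table builder. The sketch also assumes that a relation is checked in both directions, which is what the botanist/botany example suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory model; the real builder operates on frames.
struct Alias {
  int count;      // how often this alias was seen for the item
  bool reliable;  // true if the alias came from a trusted source
};

struct Item {
  std::map<std::string, Alias> aliases;  // alias text -> statistics
};

struct Relation {
  std::size_t source;    // index of the item holding the statement
  std::string property;  // e.g. "P425" (field of this occupation)
  std::size_t target;    // index of the item the statement points to
};

// Moves every unreliable alias of `from` that `to` holds reliably.
void MoveUnreliable(Item &from, Item &to) {
  for (auto it = from.aliases.begin(); it != from.aliases.end();) {
    auto match = to.aliases.find(it->first);
    bool transfer = !it->second.reliable &&
                    match != to.aliases.end() &&
                    match->second.reliable;
    if (transfer) {
      match->second.count += it->second.count;  // transfer the count
      it = from.aliases.erase(it);              // drop the noisy alias
    } else {
      ++it;
    }
  }
}

// Applies the transfer across all relations, skipping exempt properties
// (the "small list of exceptions" such as named after or different from).
void TransferAliases(std::vector<Item> &items,
                     const std::vector<Relation> &relations,
                     const std::set<std::string> &exempt) {
  for (const Relation &rel : relations) {
    if (exempt.count(rel.property) > 0) continue;
    assert(rel.source < items.size() && rel.target < items.size());
    if (rel.source == rel.target) continue;
    // Assumption: relatedness is symmetric for the purpose of transfer.
    MoveUnreliable(items[rel.source], items[rel.target]);
    MoveUnreliable(items[rel.target], items[rel.source]);
  }
}
```

With two items where "botanist" is a reliable alias of the occupation and an unreliable alias of the field, a single P425 relation between them is enough for the sketch to remove "botanist" from the field and add its count to the occupation.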

Removing all punctuation for alias normalization introduces too many false matches, so I have changed the default normalization for the phrase table so that only dashes and periods are normalized.
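
To make the trade-off concrete, here is a small C++ sketch of a restricted normalizer. It rests on an assumption the description does not spell out: that "normalizing" a dash or period means dropping it from the matching key. The function name `NormalizeAlias` is hypothetical, and the sketch only handles ASCII `-` and `.`, not the Unicode dash variants a real phrase table would need to cover.

```cpp
#include <string>

// Assumption: dashes and periods are dropped; all other characters,
// including other punctuation, are kept as they are.
std::string NormalizeAlias(const std::string &text) {
  std::string out;
  out.reserve(text.size());
  for (char c : text) {
    if (c == '-' || c == '.') continue;  // the only normalized characters
    out.push_back(c);
  }
  return out;
}
```

Under this assumption "U.S." and "US" share a key, as do "e-mail" and "email", while "C++" keeps its plus signs and so no longer collides with "C" the way it would if all punctuation were removed.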

Other minor changes:

  • Case form caching in tokens
  • Fix rendering of thematic frames in docviewer
  • Python methods for running workflows
  • Fix strptime multi-thread problem
  • Frame store handle overflow detection
  • Fix memory leak for DocumentNames
  • Move name and phrase tables to kb directory
  • Support for mono-lingual text in Wikidata
  • Inverse calendar mappings
  • Minor updates to default taxonomy
  • Support for time period in fact extractor
  • Use short names and country codes as aliases
  • Handle "red links" in wiki annotator
  • Stop using "intro text" as alias source since it is too noisy

@ringgaard ringgaard self-assigned this Jan 21, 2019
@ringgaard ringgaard requested a review from rahul1980 January 21, 2019 16:00
Contributor

@rahul1980 rahul1980 left a comment

Thanks, this is a simple yet elegant way to prune the phrase table. I wish we had done this 6 years ago as well :)

@@ -238,7 +258,7 @@ class ProfileAliasReducer : public task::Reducer {
merged.Add(n_alias_, a.Create());
}

- // Output alias profile.
+ // Output selected aliased.
Contributor

aliased -> aliases

Contributor Author

Fixed

@@ -66,18 +66,34 @@ class ProfileAliasExtractor : public task::FrameProcessor {
AddAlias(&a, alias.GetHandle(n_name_), SRC_WIKIDATA_FOREIGN,
alias.GetHandle(n_lang_), alias.GetInt(n_count_));
}
- } else if (s.name == n_native_name_ || s.name == n_native_label_) {
+ } else if (s.name == n_native_name_ ||
+ s.name == n_native_label_) {
Contributor

Can fit on previous line?

Contributor Author

Indeed it can.

// Prune phrase table by transfering unreliable aliases to reliable
// aliases for related items.
if (transfer_aliases_) {
LOG(INFO) << "Transfer aliases";
Contributor

Do you still need the LOG(INFO) statements in this method?

Contributor Author

I use it for verifying that alias transfer is enabled and it is also useful for timing the alias transfer.

bool reliable() const { return count_and_flags & (1 << 31); }

// Phrase form.
int form() const { return (count_and_flags << 29) & 3; }
Contributor

shouldn't this be (count_and_flags >> 29) & 3 ?

Contributor Author

Oops. That's a bug. Good catch!
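
For readers following the thread, the C++ sketch below shows why the shift direction matters. The bit layout is inferred from the two accessors in the diff (bit 31 for the reliable flag, bits 29 and 30 for the phrase form). The low 29 bits holding the count, the `std::uint32_t` storage type, and the `PackedAlias`, `Pack`, `count`, and `BuggyForm` names are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: bit 31 = reliable, bits 29-30 = form, bits 0-28 = count.
struct PackedAlias {
  std::uint32_t count_and_flags;

  // Reliable-source flag in the top bit.
  bool reliable() const { return ((count_and_flags >> 31) & 1u) != 0; }

  // Phrase form: shift the field down to bit 0, then mask two bits.
  int form() const { return static_cast<int>((count_and_flags >> 29) & 3u); }

  // Hypothetical accessor for the remaining low 29 bits.
  std::uint32_t count() const { return count_and_flags & ((1u << 29) - 1); }
};

// Hypothetical helper that packs the three fields into one word.
PackedAlias Pack(std::uint32_t count, int form, bool reliable) {
  assert(count < (1u << 29));
  assert(form >= 0 && form <= 3);
  std::uint32_t bits = count |
                       (static_cast<std::uint32_t>(form) << 29) |
                       (static_cast<std::uint32_t>(reliable) << 31);
  return PackedAlias{bits};
}

// The variant flagged in review: shifting left clears the two low bits,
// so the mask with 3 can never see the form field.
int BuggyForm(std::uint32_t count_and_flags) {
  return static_cast<int>((count_and_flags << 29) & 3u);
}
```

With this layout, `Pack(1234, 2, true).form()` yields 2, whereas the left-shift variant yields 0 for every input, because the shift zeroes the low bits before the mask is applied.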

Contributor Author

@ringgaard ringgaard left a comment

Thanks for the speedy review over the holidays.

@ringgaard ringgaard merged commit 93c2ab1 into google:master Jan 22, 2019
@ringgaard ringgaard deleted the aliasxfer branch January 22, 2019 09:35