-
Notifications
You must be signed in to change notification settings - Fork 267
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this is a simple yet elegant way to prune the phrase table. I wish we had done this 6 years ago as well :)
sling/nlp/wiki/aliases.cc
Outdated
@@ -238,7 +258,7 @@ class ProfileAliasReducer : public task::Reducer { | |||
merged.Add(n_alias_, a.Create()); | |||
} | |||
|
|||
// Output alias profile. | |||
// Output selected aliased. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aliased -> aliases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
sling/nlp/wiki/aliases.cc
Outdated
@@ -66,18 +66,34 @@ class ProfileAliasExtractor : public task::FrameProcessor { | |||
AddAlias(&a, alias.GetHandle(n_name_), SRC_WIKIDATA_FOREIGN, | |||
alias.GetHandle(n_lang_), alias.GetInt(n_count_)); | |||
} | |||
} else if (s.name == n_native_name_ || s.name == n_native_label_) { | |||
} else if (s.name == n_native_name_ || | |||
s.name == n_native_label_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can fit on previous line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed it can.
// Prune phrase table by transfering unreliable aliases to reliable | ||
// aliases for related items. | ||
if (transfer_aliases_) { | ||
LOG(INFO) << "Transfer aliases"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you still need the LOG(INFO) statements in this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use it for verifying that alias transfer is enabled and it is also useful for timing the alias transfer.
sling/nlp/kb/phrase-table-builder.cc
Outdated
bool reliable() const { return count_and_flags & (1 << 31); } | ||
|
||
// Phrase form. | ||
int form() const { return (count_and_flags << 29) & 3; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be (count_and_flags >> 29) & 3 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. That's a bug. Good catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the speedy review over the holidays.
// Prune phrase table by transfering unreliable aliases to reliable | ||
// aliases for related items. | ||
if (transfer_aliases_) { | ||
LOG(INFO) << "Transfer aliases"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use it for verifying that alias transfer is enabled and it is also useful for timing the alias transfer.
sling/nlp/kb/phrase-table-builder.cc
Outdated
bool reliable() const { return count_and_flags & (1 << 31); } | ||
|
||
// Phrase form. | ||
int form() const { return (count_and_flags << 29) & 3; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. That's a bug. Good catch!
sling/nlp/wiki/aliases.cc
Outdated
@@ -66,18 +66,34 @@ class ProfileAliasExtractor : public task::FrameProcessor { | |||
AddAlias(&a, alias.GetHandle(n_name_), SRC_WIKIDATA_FOREIGN, | |||
alias.GetHandle(n_lang_), alias.GetInt(n_count_)); | |||
} | |||
} else if (s.name == n_native_name_ || s.name == n_native_label_) { | |||
} else if (s.name == n_native_name_ || | |||
s.name == n_native_label_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed it can.
sling/nlp/wiki/aliases.cc
Outdated
@@ -238,7 +258,7 @@ class ProfileAliasReducer : public task::Reducer { | |||
merged.Add(n_alias_, a.Create()); | |||
} | |||
|
|||
// Output alias profile. | |||
// Output selected aliased. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
We use anchors, redirects, and wikipedia titles as alias sources for building the phrase table, but these are all noisy sources. I have implemented "alias transfer" in the phrase table builder to clean up some of these noisy aliases. The basic idea is that an item cannot have the same name as another item that it is related to, except for a small list of exceptions (e.g. named after, different from etc.). The aliases are divided in reliable and unreliable aliases based on the source, and if an item has an unreliable alias and is related to another item which has the same alias from a reliable source, the alias count is transferred to the target.
The alias transfer procedure cleans up many cases of noisy aliases. For instance "botanist" is remove from "botany" because
[botanist (Q2374149) field of this occupation (P425): botany (Q441)]
.Wikidata has finer granularity that any individual Wikipedia, so the alias transfer also fixes many cases where a "sub-item" has a redirect to a broader item, e.g. villages that are not in Wikipedia redirects to the county, taxon redirects to parent taxon etc.
Removing all punctuation for alias normalization introduces too many false matches, so I have changed the default normalization for the phrase table so only dashes and periods are normalized.
Other minor changes: