Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor the code handling the local vocabulary #820

Merged
merged 5 commits into from
Nov 10, 2022

Conversation

hannahbast
Copy link
Member

The LocalVocab used to be an alias for std::vectorstd::string (defined inside of the ResultTable class). New words were pushed to this vector without checking whether they were already in the vocabulary (exception: when a sequence of identical new words were encountered in a SPARQL expression result, the words was only added once per sequence).

The basic data is still a std::vectorstd::string, but there is now a phase, started by LocalVocab::startConstructionPhase() and ended by LocalVocab::endConstrutionPhase(), where new words can be added and a words-to-ID map records which words have already been added. After that phase, the map is deleted and the local vocabulary is read-only. Check test/LocalVocabTest for how it works in prinicple.

This is also used by the Values class in PR #793 (and I tested it already quite extensively and it works).

@hannahbast hannahbast marked this pull request as ready for review November 9, 2022 04:02
@hannahbast hannahbast requested a review from joka921 November 9, 2022 04:02
@hannahbast hannahbast force-pushed the refactor-local-vocab branch from 2db2c86 to 8b8aa4e Compare November 9, 2022 04:09
The LocalVocab used to be an alias for std::vector<std::string> (defined
inside of the ResultTable class). New words were pushed to this vector
without checking whether they were already in the vocabulary (exception:
when a sequence of identical new words were encountered in a SPARQL
expression result, the words was only added once per sequence).

The basic data is still a std::vector<std::string>, but there is now a
phase, started by LocalVocab::startConstructionPhase() and ended by
LocalVocab::endConstrutionPhase(), where new words can be added and a
words-to-ID map records which words have already been added. After that
phase, the map is deleted and the local vocabulary is read-only. Check
test/LocalVocabTest for how it works in prinicple.

This is also used by the Values class in PR #793 (and I tested it
already quite extensively and it works).
@hannahbast hannahbast force-pushed the refactor-local-vocab branch from 8b8aa4e to 77419df Compare November 9, 2022 04:39
Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much.
With the split up functionality it is much easier to reason about how things should REALLY be.

Some of my suggestions are not "just do this" but rather "maybe there is stuff to discuss here",
So maybe first read them, and then contact me for further details.

In particular, there is no longer a "construction" phase now, at the
cost of using a slower hash map (absl::node_hash_map), slightly more
space (the hash map and a vector to the strings stored in the hash map),
and an indirection when looking up the word for an index (we have to
follow the pointer to the actual string stored in the hash map).
Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice,
This is very close now,
I only have some small suggestions to make the architectural quality
even higher.

We should discuss:

1. Argument type taken by getIndexAndAddIfNotContained
2. Return type of getWord
Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only two small things that I find subtle but important. We are really getting there, great work!

ASSERT_EQ(localVocab.getIndexAndAddIfNotContained("bla"), makeIndex(0));
ASSERT_EQ(localVocab.getIndexAndAddIfNotContained("blu"), makeIndex(1));
ASSERT_EQ(localVocab.getIndexAndAddIfNotContained("bli"), makeIndex(2));
ASSERT_EQ(localVocab.getIndexAndAddIfNotContained("bla"), makeIndex(0));
ASSERT_EQ(localVocab.size(), 3);
ASSERT_EQ(localVocab.getWord(LocalVocabIndex::make(0)), "bla");
ASSERT_EQ(localVocab.getWord(LocalVocabIndex::make(1)), "blu");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be sure, I would add some more words in a loop (100s or 1000s) and then check for the pointers.
The goal is to make this test fail with very high probability once somebody uses a hashMap implementation that is not pointer-stable for the Keys.

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great,
I only have a very minor formatting issue left.
Feel free to merge this afterwards.

@hannahbast hannahbast changed the title Separate LocalVocab class with basic functionality Refactor the code handling the local vocabulary Nov 10, 2022
@hannahbast hannahbast merged commit 13ad3ca into master Nov 10, 2022
@hannahbast hannahbast deleted the refactor-local-vocab branch November 10, 2022 19:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants