-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor the code handling the local vocabulary #820
Conversation
2db2c86
to
8b8aa4e
Compare
The LocalVocab used to be an alias for std::vector<std::string> (defined inside of the ResultTable class). New words were pushed to this vector without checking whether they were already in the vocabulary (exception: when a sequence of identical new words were encountered in a SPARQL expression result, the words was only added once per sequence). The basic data is still a std::vector<std::string>, but there is now a phase, started by LocalVocab::startConstructionPhase() and ended by LocalVocab::endConstrutionPhase(), where new words can be added and a words-to-ID map records which words have already been added. After that phase, the map is deleted and the local vocabulary is read-only. Check test/LocalVocabTest for how it works in prinicple. This is also used by the Values class in PR #793 (and I tested it already quite extensively and it works).
8b8aa4e
to
77419df
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much.
With the split up functionality it is much easier to reason about how things should REALLY be.
Some of my suggestions are not "just do this" but rather "maybe there is stuff to discuss here",
So maybe first read them, and then contact me for further details.
In particular, there is no longer a "construction" phase now, at the cost of using a slower hash map (absl::node_hash_map), slightly more space (the hash map and a vector to the strings stored in the hash map), and an indirection when looking up the word for an index (we have to follow the pointer to the actual string stored in the hash map).
ab98844
to
de97c03
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice,
This is very close now,
I only have some small suggestions to make the architectural quality
even higher.
We should discuss: 1. Argument type taken by getIndexAndAddIfNotContained 2. Return type of getWord
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only two small things that I find subtle but important. We are really getting there, great work!
test/LocalVocabTest.cpp
Outdated
ASSERT_EQ(localVocab.getIndexAndAddIfNotContained("bla"), makeIndex(0)); | ||
ASSERT_EQ(localVocab.getIndexAndAddIfNotContained("blu"), makeIndex(1)); | ||
ASSERT_EQ(localVocab.getIndexAndAddIfNotContained("bli"), makeIndex(2)); | ||
ASSERT_EQ(localVocab.getIndexAndAddIfNotContained("bla"), makeIndex(0)); | ||
ASSERT_EQ(localVocab.size(), 3); | ||
ASSERT_EQ(localVocab.getWord(LocalVocabIndex::make(0)), "bla"); | ||
ASSERT_EQ(localVocab.getWord(LocalVocabIndex::make(1)), "blu"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be sure, I would add some more words in a loop (100s or 1000s) and then check for the pointers.
The goal is to make this test fail with very high probability once somebody uses a hashMap implementation that is not pointer-stable for the Keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great,
I only have a very minor formatting issue left.
Feel free to merge this afterwards.
The LocalVocab used to be an alias for std::vectorstd::string (defined inside of the ResultTable class). New words were pushed to this vector without checking whether they were already in the vocabulary (exception: when a sequence of identical new words were encountered in a SPARQL expression result, the words was only added once per sequence).
The basic data is still a std::vectorstd::string, but there is now a phase, started by LocalVocab::startConstructionPhase() and ended by LocalVocab::endConstrutionPhase(), where new words can be added and a words-to-ID map records which words have already been added. After that phase, the map is deleted and the local vocabulary is read-only. Check test/LocalVocabTest for how it works in prinicple.
This is also used by the Values class in PR #793 (and I tested it already quite extensively and it works).