New version based on intensive round of reviewing together with Johannes
In particular, there is no longer a "construction" phase now, at the cost of using a slower hash map (absl::node_hash_map), slightly more space (the hash map and a vector to the strings stored in the hash map), and an indirection when looking up the word for an index (we have to follow the pointer to the actual string stored in the hash map).
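
The trade-off described above can be illustrated with a minimal, self-contained sketch. It is not the code from this commit: `SimpleLocalVocab` is a hypothetical name, plain `std::size_t` IDs stand in for `Id`, and `std::unordered_map` stands in for `absl::node_hash_map` (both keep the addresses of their keys stable when the table grows).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-in for the local vocabulary.
class SimpleLocalVocab {
 public:
  SimpleLocalVocab() = default;
  // Copying would leave the copied pointers referring to the source map.
  SimpleLocalVocab(const SimpleLocalVocab&) = delete;
  SimpleLocalVocab(SimpleLocalVocab&&) = default;

  // Return the ID of `word`, adding it first if it is not yet contained.
  std::size_t getIdAndAddIfNotContained(const std::string& word) {
    // `try_emplace` hashes `word` once and inserts only if the key is new.
    auto [it, isNewWord] = wordsToIds_.try_emplace(word, idsToWords_.size());
    if (isNewWord) {
      // The key lives in its own node, so its address survives rehashing.
      idsToWords_.push_back(&it->first);
    }
    return it->second;
  }

  // Look up the word for an ID; this follows one pointer into the map.
  const std::string& getWord(std::size_t id) const {
    assert(id < idsToWords_.size());
    return *idsToWords_[id];
  }

  std::size_t size() const { return idsToWords_.size(); }

 private:
  std::unordered_map<std::string, std::size_t> wordsToIds_;
  std::vector<const std::string*> idsToWords_;
};
```

Words can be added and looked up at any time, so no separate "construction" phase is needed. The costs are the ones named in the commit message: the hash map itself, one pointer per word, and one indirection per lookup by ID.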
Hannah Bast committed Nov 10, 2022
1 parent 77419df, commit de97c03
Showing 17 changed files with 167 additions and 169 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,51 @@ | ||
// Copyright 2022, University of Freiburg | ||
// Chair of Algorithms and Data Structures | ||
// Author: Hannah Bast <bast@cs.uni-freiburg.de> | ||
|
||
#include "engine/LocalVocab.h" | ||
|
||
#include "absl/strings/str_cat.h" | ||
#include "global/Id.h" | ||
#include "global/ValueId.h" | ||
|
||
// _____________________________________________________________________________ | ||
Id LocalVocab::getIdAndAddIfNotContained(const std::string& word) { | ||
// The following code avoids computing the hash for `word` twice in case we | ||
// see it for the first time (note that hashing a string is not cheap). The | ||
// `insert` operation returns a pair, which we unpack via structured bindings: | ||
// `keyValuePair` is an iterator to the (already existing or newly inserted) | ||
// key-value pair, and `isNewWord` is a `bool`, which is `true` if and only if | ||
// the key-value pair was newly inserted. | ||
auto [keyValuePair, isNewWord] = wordsToIdsMap_.insert({word, nextFreeId_}); | ||
if (isNewWord) { | ||
idsToWordsMap_.push_back(&(keyValuePair->first)); | ||
nextFreeId_ = Id::makeFromLocalVocabIndex( | ||
LocalVocabIndex::make(idsToWordsMap_.size())); | ||
} | ||
return keyValuePair->second; | ||
} | ||
|
||
// _____________________________________________________________________________ | ||
const std::string& LocalVocab::getWord(LocalVocabIndex localVocabIndex) const { | ||
if (localVocabIndex.get() >= idsToWordsMap_.size()) { | ||
throw std::runtime_error(absl::StrCat( | ||
"LocalVocab error: request for word with local vocab index ", | ||
localVocabIndex.get(), ", but size of local vocab is only ", | ||
idsToWordsMap_.size(), ", please contact the developers")); | ||
} | ||
return *(idsToWordsMap_[localVocabIndex.get()]); | ||
} | ||
|
||
// _____________________________________________________________________________ | ||
std::shared_ptr<LocalVocab> LocalVocab::mergeLocalVocabsIfOneIsEmpty( | ||
std::shared_ptr<LocalVocab> localVocab1, | ||
std::shared_ptr<LocalVocab> localVocab2) { | ||
bool isLocalVocab1Empty = localVocab1->empty(); | ||
bool isLocalVocab2Empty = localVocab2->empty(); | ||
if (!isLocalVocab1Empty && !isLocalVocab2Empty) { | ||
throw std::runtime_error( | ||
"Merging of two non-empty local vocabularies is currently not " | ||
"supported, please contact the developers"); | ||
} | ||
return !isLocalVocab1Empty ? localVocab1 : localVocab2; | ||
} |
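
The merge rule implemented above can be tried in isolation with the following sketch. It makes two assumptions for the sake of being self-contained: `Words` is a hypothetical stand-in for `LocalVocab` (only `empty()` matters here), and `mergeIfOneIsEmpty` is an illustrative name rather than the function from this commit.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for `LocalVocab`.
using Words = std::vector<std::string>;

// Return the non-empty input, or the second input if both are empty.
// Throws if both inputs are non-empty.
std::shared_ptr<Words> mergeIfOneIsEmpty(std::shared_ptr<Words> a,
                                         std::shared_ptr<Words> b) {
  assert(a != nullptr && b != nullptr);
  if (!a->empty() && !b->empty()) {
    throw std::runtime_error(
        "Merging two non-empty vocabularies is not supported");
  }
  // No words are copied: the result shares ownership with the chosen input.
  return !a->empty() ? a : b;
}
```

Because the function returns one of its arguments, the "merge" is a pointer copy and never touches the words themselves.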
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,69 @@ | ||
// Copyright 2022, University of Freiburg | ||
// Chair of Algorithms and Data Structures | ||
// Author: Hannah Bast <bast@cs.uni-freiburg.de> | ||
|
||
#pragma once | ||
|
||
#include "absl/container/node_hash_map.h" | ||
|
||
// A class for maintaining a local vocabulary with contiguous (local) IDs. This is | ||
// meant for words that are not part of the normal vocabulary (constructed from | ||
// the input data at indexing time). | ||
// | ||
// TODO: This is a first version of this class with basic functionality. Note | ||
// that the local vocabulary used to be a simple `std::vector<std::string>` | ||
// defined inside of the `ResultTable` class. You gotta start somewhere. | ||
class LocalVocab { | ||
public: | ||
// Create a new, empty local vocabulary. | ||
LocalVocab() = default; | ||
|
||
// Prevent accidental copying of a local vocabulary (it can be quite large), | ||
// but moving it is OK. | ||
// | ||
// TODO: does the default move do the "right" thing, that is, move the hash | ||
// map instead of copying it? | ||
LocalVocab(const LocalVocab&) = delete; | ||
LocalVocab(LocalVocab&&) = default; | ||
|
||
// Get ID of a word in the local vocabulary. If the word was already | ||
// contained, return the already existing ID. If the word was not yet | ||
// contained, add it, and return the new ID. | ||
Id getIdAndAddIfNotContained(const std::string& word); | ||
|
||
// The number of words in the vocabulary. | ||
size_t size() const { return idsToWordsMap_.size(); } | ||
|
||
// Return true if and only if the local vocabulary is empty. | ||
bool empty() const { return idsToWordsMap_.empty(); } | ||
|
||
// Return a const reference to the word with the given local vocab index. | ||
// Throws an exception if the index is out of range. | ||
const std::string& getWord(LocalVocabIndex localVocabIndex) const; | ||
|
||
// Merge two local vocabularies if at least one of them is empty. If both are | ||
// non-empty, throws an exception. | ||
// | ||
// TODO: Eventually, we want to have one local vocab for the whole query to | ||
// which each operation writes (one after the other). Then we don't need a | ||
// merge function anymore. | ||
static std::shared_ptr<LocalVocab> mergeLocalVocabsIfOneIsEmpty( | ||
std::shared_ptr<LocalVocab> localVocab1, | ||
std::shared_ptr<LocalVocab> localVocab2); | ||
|
||
private: | ||
// A map of the words in the local vocabulary to their local IDs. This is a | ||
// node hash map because we need the addresses of the words (which are of type | ||
// `std::string`) to remain stable over their lifetime in the hash map because | ||
// we refer to them in `idsToWordsMap_` below. | ||
absl::node_hash_map<std::string, Id> wordsToIdsMap_; | ||
|
||
// A map of the local IDs to the words. Since the IDs are contiguous, we can | ||
// use a `std::vector`. We store pointers to the actual words in | ||
// `wordsToIdsMap_` to avoid storing every word twice. This saves space, but | ||
// costs us an indirection when looking up a word by its ID. | ||
std::vector<const std::string*> idsToWordsMap_; | ||
|
||
// The next free local ID (will be incremented by one each time we add a new | ||
// word). | ||
Id nextFreeId_ = Id::makeFromLocalVocabIndex(LocalVocabIndex::make(0)); | ||
}; |
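
The TODO about the default move constructor can be explored with a small experiment. The sketch below assumes `std::unordered_map` in place of `absl::node_hash_map` and uses the hypothetical names `MovableVocab` and `pointersSurviveMove`. It checks that the pointers stored in the vector still refer to keys owned by the moved-to map. This holds for standard containers as implemented in practice; it does not by itself establish the same for `absl::node_hash_map`, which should be checked against the Abseil documentation.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in with the same member layout as `LocalVocab`.
struct MovableVocab {
  MovableVocab() = default;
  MovableVocab(const MovableVocab&) = delete;
  MovableVocab(MovableVocab&&) = default;

  // Add `word` if it is new and remember the address of the stored key.
  void add(const std::string& word) {
    auto [it, isNew] =
        wordsToIds.try_emplace(word, static_cast<int>(idsToWords.size()));
    if (isNew) {
      idsToWords.push_back(&it->first);
    }
  }

  std::unordered_map<std::string, int> wordsToIds;
  std::vector<const std::string*> idsToWords;
};

// Return true if the pointers collected before the move still point at the
// keys owned by the moved-to map.
bool pointersSurviveMove() {
  MovableVocab source;
  source.add("alpha");
  source.add("beta");
  assert(source.idsToWords.size() == 2);
  const std::string* before = source.idsToWords[0];
  MovableVocab target(std::move(source));
  const std::string* after = &target.wordsToIds.find("alpha")->first;
  return before == after && *target.idsToWords[1] == "beta";
}
```

If the check returns `true`, the move handed over the existing nodes instead of copying the strings, which is the behaviour the TODO asks about.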