Sort Index/Docids By Field #1026

Merged on May 17, 2021 (53 commits)
521075d
sort index by field
PSeitz Apr 27, 2021
5643ee2
support docid mapping in multivalue fastfield
PSeitz Apr 28, 2021
60bf3f8
handle docid map in bytes fastfield
PSeitz Apr 28, 2021
8469223
forward docid mapping, remap postings
PSeitz May 3, 2021
3cd436e
Merge remote-tracking branch 'upstream/main' into indexmeta
PSeitz May 3, 2021
045dfee
fix merge conflicts
PSeitz May 3, 2021
e97bdc9
move test to index_sorter
PSeitz May 3, 2021
62224fb
add docid index mapping old->new
PSeitz May 3, 2021
1ec2e61
remap docid in fielnorm
PSeitz May 3, 2021
8a7dc78
resort docids in recorder, more extensive tests
PSeitz May 4, 2021
179e859
handle index sorting in docstore
PSeitz May 4, 2021
855680b
refactor
PSeitz May 5, 2021
77b1aa1
u32 to DocId
PSeitz May 5, 2021
a7766fb
better doc_id_map creation
PSeitz May 5, 2021
18ef88c
add non mut method to FastFieldWriters
PSeitz May 5, 2021
d6775cd
remove sort_index
PSeitz May 5, 2021
451479b
fix clippy issues
PSeitz May 5, 2021
0ef02cd
fix SegmentComponent iterator
PSeitz May 5, 2021
3d537bf
fix test
PSeitz May 5, 2021
b0b0129
fmt
PSeitz May 5, 2021
0bbdd42
handle indexsettings deserialize
PSeitz May 5, 2021
b954fa6
add reading, writing bytes to doc store
PSeitz May 5, 2021
8e9278d
rename index_sorter to doc_id_mapping
PSeitz May 6, 2021
aca8cb8
fix compile issue, make sort_by_field optional
PSeitz May 6, 2021
a34cd0e
fix test compile
PSeitz May 6, 2021
ba5a0e6
validate index settings on merge
PSeitz May 7, 2021
20f10e0
fix doctest
PSeitz May 7, 2021
69dab3d
add itertools, use kmerge
PSeitz May 7, 2021
38c178f
implement/test merge for fastfield
PSeitz May 7, 2021
e7468e5
Use precalculated docid mapping in merger
PSeitz May 7, 2021
9aab1b9
fix fast field reader docs
PSeitz May 10, 2021
f8a3022
add test for multifast field merge
PSeitz May 10, 2021
4f77067
add num_bytes to BytesFastFieldReader
PSeitz May 10, 2021
39bcf13
add MultiValueLength trait
PSeitz May 10, 2021
c576e88
Add ReaderWithOrdinal, fix
PSeitz May 10, 2021
1b410d4
add test for merging bytes with sorted docids
PSeitz May 10, 2021
32a3a91
Merge fieldnorm for sorted index
PSeitz May 10, 2021
76f8de9
handle posting list in merge in sorted index
PSeitz May 10, 2021
b1c1c0d
handle doc store order in merge in sorted index
PSeitz May 10, 2021
eb0357c
fix typo, cleanup
PSeitz May 11, 2021
3129885
make IndexSetting non-optional
PSeitz May 11, 2021
00aab07
fix type, rename test file
PSeitz May 11, 2021
cd2711c
remove SegmentReaderWithOrdinal accessors
PSeitz May 11, 2021
4816fc4
cargo fmt
PSeitz May 11, 2021
5fc0ac4
add index sort & merge test to include deletes
PSeitz May 11, 2021
ade0ac0
Fix posting list merge issue
PSeitz May 11, 2021
b2a7fff
performance: cache field readers, use bytes for doc store merge
PSeitz May 11, 2021
aab65f0
change facet merge test to cover index sorting
PSeitz May 12, 2021
25cb568
add RawDocument abstraction to access bytes in doc store
PSeitz May 12, 2021
ea65dc1
Merge remote-tracking branch 'upstream/main' into indexmeta
PSeitz May 12, 2021
b6a0f42
fix deserialization, update changelog
PSeitz May 12, 2021
84da0be
cache store readers to utilize lru cache (4x performance)
PSeitz May 12, 2021
de0ea84
add include_temp_doc_store flag in InnerSegmentMeta
PSeitz May 14, 2021
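
Several of the commits above revolve around a doc id mapping (old->new and the reverse) that is then applied to fast fields, fieldnorms, postings and the doc store. As a rough illustration of the idea only, and not the code from this PR, the sketch below builds both mappings from one sort value per document. The function names `new_to_old_mapping` and `old_to_new_mapping` are hypothetical, and ascending order with a stable sort is an assumption.

```rust
/// Hypothetical helper (not tantivy's API): `sort_values[old_doc_id]` holds the
/// value of the sort field for that document. Returns, for every new doc id,
/// the old doc id that moves there when sorting ascending.
fn new_to_old_mapping(sort_values: &[u64]) -> Vec<u32> {
    let mut new_to_old: Vec<u32> = (0..sort_values.len() as u32).collect();
    // `sort_by_key` is stable, so documents with equal values keep their
    // original relative order.
    new_to_old.sort_by_key(|&old| sort_values[old as usize]);
    new_to_old
}

/// Inverts a new->old mapping into an old->new mapping.
fn old_to_new_mapping(new_to_old: &[u32]) -> Vec<u32> {
    let mut old_to_new = vec![0u32; new_to_old.len()];
    for (new_id, &old_id) in new_to_old.iter().enumerate() {
        old_to_new[old_id as usize] = new_id as u32;
    }
    old_to_new
}

fn main() {
    // Four documents, in insertion order, with their sort-field values.
    let sort_values = [30u64, 10, 20, 10];
    let new_to_old = new_to_old_mapping(&sort_values);
    let old_to_new = old_to_new_mapping(&new_to_old);
    println!("new -> old: {:?}", new_to_old);
    println!("old -> new: {:?}", old_to_new);
}
```

In this sketch, a structure written in new doc id order (for example a column of values) would iterate the new->old mapping, while a structure that stores doc ids (for example a posting list) would translate each id through the old->new mapping.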
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -12,6 +12,7 @@ Tantivy 0.15.0
- Simplified positions index format (@fulmicoton) #1022
- Moved bitpacking to bitpacker subcrate and add BlockedBitpacker, which bitpacks blocks of 128 elements (@PSeitz) #1030
- Added support for more-like-this query in tantivy (@evanxg852000) #1011
- Added support for sorting an index, e.g. presorting documents in an index by a timestamp field. This can heavily improve performance for certain scenarios, by utilizing the sorted data (Top-n optimizations). #1026

Tantivy 0.14.0
=========================
1 change: 1 addition & 0 deletions Cargo.toml
@@ -51,6 +51,7 @@ smallvec = "1"
rayon = "1"
lru = "0.6"
fastdivide = "0.3"
itertools = "0.10.0"

[target.'cfg(windows)'.dependencies]
winapi = "0.3"
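
According to the commit list, `itertools` is added for its `kmerge`, which merges several already-sorted inputs into one sorted stream. The sketch below shows the same k-way merge idea using only the standard library's `BinaryHeap`. It is a stand-in for illustration, not the merger code from this PR: the function `k_way_merge` and the representation of a segment as a `Vec<u64>` of sort values indexed by doc id are assumptions.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical sketch: each inner `Vec` stands in for one segment that is
/// already sorted ascending by the sort field; the index into it is the doc id
/// inside that segment. Returns `(segment_ordinal, doc_id)` pairs in the
/// order they would appear in the merged, still sorted, segment.
fn k_way_merge(segments: &[Vec<u64>]) -> Vec<(usize, u32)> {
    // Min-heap of (sort value, segment ordinal, doc id), one entry per segment.
    let mut heap = BinaryHeap::new();
    for (ord, seg) in segments.iter().enumerate() {
        if let Some(&value) = seg.first() {
            heap.push(Reverse((value, ord, 0usize)));
        }
    }
    let mut merged = Vec::new();
    while let Some(Reverse((_value, ord, doc))) = heap.pop() {
        merged.push((ord, doc as u32));
        // Refill the heap with the next document of the same segment, if any.
        if let Some(&next_value) = segments[ord].get(doc + 1) {
            heap.push(Reverse((next_value, ord, doc + 1)));
        }
    }
    merged
}

fn main() {
    let segments = vec![vec![1u64, 5], vec![2, 3, 9]];
    for (ord, doc) in k_way_merge(&segments) {
        println!("segment {} doc {}", ord, doc);
    }
}
```

Ties between segments are broken by segment ordinal here, which is a choice made for this sketch only.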
6 changes: 5 additions & 1 deletion bitpacker/src/bitpacker.rs
@@ -4,7 +4,11 @@ pub struct BitPacker {
mini_buffer: u64,
mini_buffer_written: usize,
}

impl Default for BitPacker {
fn default() -> Self {
BitPacker::new()
}
}
impl BitPacker {
pub fn new() -> BitPacker {
BitPacker {
7 changes: 5 additions & 2 deletions bitpacker/src/blocked_bitpacker.rs
@@ -15,6 +15,11 @@ pub struct BlockedBitpacker {
buffer: Vec<u64>,
offset_and_bits: Vec<BlockedBitpackerEntryMetaData>,
}
impl Default for BlockedBitpacker {
fn default() -> Self {
BlockedBitpacker::new()
}
}

/// `BlockedBitpackerEntryMetaData` encodes the
/// offset and bit_width into a u64 bit field
@@ -115,8 +120,6 @@ impl BlockedBitpacker {
self.buffer.clear();
self.compressed_blocks
.resize(self.compressed_blocks.len() + 8, 0); // add padding for bitpacker
} else {
return;
}
}
pub fn get(&self, idx: usize) -> u64 {
87 changes: 60 additions & 27 deletions src/core/index.rs
@@ -64,31 +64,42 @@ fn load_metas(
///
/// ```
/// use tantivy::schema::*;
/// use tantivy::{Index, IndexSettings};
/// use tantivy::{Index, IndexSettings, IndexSortByField, Order};
///
/// let mut schema_builder = Schema::builder();
/// let id_field = schema_builder.add_text_field("id", STRING);
/// let title_field = schema_builder.add_text_field("title", TEXT);
/// let body_field = schema_builder.add_text_field("body", TEXT);
/// let number_field = schema_builder.add_u64_field(
/// "number",
/// IntOptions::default().set_fast(Cardinality::SingleValue),
/// );
///
/// let schema = schema_builder.build();
/// let settings = IndexSettings::default();
/// let settings = IndexSettings{sort_by_field: Some(IndexSortByField{field:"number".to_string(), order:Order::Asc})};
/// let index = Index::builder().schema(schema).settings(settings).create_in_ram();
///
/// ```
pub struct IndexBuilder {
schema: Option<Schema>,
index_settings: Option<IndexSettings>,
index_settings: IndexSettings,
}
impl Default for IndexBuilder {
fn default() -> Self {
IndexBuilder::new()
}
}
impl IndexBuilder {
/// Creates a new `IndexBuilder`
pub fn new() -> Self {
Self {
schema: None,
index_settings: None,
index_settings: IndexSettings::default(),
}
}
/// Set the settings
pub fn settings(mut self, settings: IndexSettings) -> Self {
self.index_settings = Some(settings);
self.index_settings = settings;
self
}
/// Set the schema
@@ -131,15 +142,11 @@ impl IndexBuilder {
let mmap_directory = MmapDirectory::create_from_tempdir()?;
self.create(mmap_directory)
}
fn get_settings_or_default(&self) -> Option<IndexSettings> {
self.index_settings.as_ref().cloned()
}
fn get_expect_schema(&self) -> crate::Result<Schema> {
Ok(self
.schema
self.schema
.as_ref()
.cloned()
.ok_or_else(|| TantivyError::IndexBuilderMissingArgument("schema"))?)
.ok_or(TantivyError::IndexBuilderMissingArgument("schema"))
}
/// Opens or creates a new index in the provided directory
pub fn open_or_create<Dir: Directory>(self, dir: Dir) -> crate::Result<Index> {
@@ -162,11 +169,11 @@ impl IndexBuilder {
let directory = ManagedDirectory::wrap(dir)?;
save_new_metas(
self.get_expect_schema()?,
self.get_settings_or_default(),
self.index_settings.clone(),
&directory,
)?;
let mut metas = IndexMeta::with_schema(self.get_expect_schema()?);
metas.index_settings = self.get_settings_or_default();
metas.index_settings = self.index_settings.clone();
let index = Index::open_from_metas(directory, &metas, SegmentMetaInventory::default());
Ok(index)
}
@@ -177,7 +184,7 @@ impl IndexBuilder {
pub struct Index {
directory: ManagedDirectory,
schema: Schema,
settings: Option<IndexSettings>,
settings: IndexSettings,
executor: Arc<Executor>,
tokenizers: TokenizerManager,
inventory: SegmentMetaInventory,
@@ -265,12 +272,10 @@ impl Index {
pub fn create<Dir: Directory>(
dir: Dir,
schema: Schema,
settings: Option<IndexSettings>,
settings: IndexSettings,
) -> crate::Result<Index> {
let mut builder = IndexBuilder::new().schema(schema);
if let Some(settings) = settings {
builder = builder.settings(settings);
}
builder = builder.settings(settings);
builder.create(dir)
}

@@ -423,7 +428,7 @@ impl Index {

/// Helper to create an index writer for tests.
///
/// That index writer only simply has a single thread and a heap of 5 MB.
/// That index writer only simply has a single thread and a heap of 10 MB.
/// Using a single thread gives us a deterministic allocation of DocId.
#[cfg(test)]
pub fn writer_for_tests(&self) -> crate::Result<IndexWriter> {
@@ -452,7 +457,7 @@ impl Index {

/// Accessor to the index settings
///
pub fn settings(&self) -> &Option<IndexSettings> {
pub fn settings(&self) -> &IndexSettings {
&self.settings
}
/// Accessor to the index schema
@@ -523,11 +528,14 @@ impl fmt::Debug for Index {

#[cfg(test)]
mod tests {
use crate::directory::{RamDirectory, WatchCallback};
use crate::schema::Field;
use crate::schema::{Schema, INDEXED, TEXT};
use crate::IndexReader;
use crate::ReloadPolicy;
use crate::{
directory::{RamDirectory, WatchCallback},
IndexSettings,
};
use crate::{Directory, Index};

#[test]
@@ -548,7 +556,12 @@ mod tests {
fn test_index_exists() {
let directory = RamDirectory::create();
assert!(!Index::exists(&directory).unwrap());
assert!(Index::create(directory.clone(), throw_away_schema(), None).is_ok());
assert!(Index::create(
directory.clone(),
throw_away_schema(),
IndexSettings::default()
)
.is_ok());
assert!(Index::exists(&directory).unwrap());
}

@@ -563,23 +576,43 @@ mod tests {
#[test]
fn open_or_create_should_open() {
let directory = RamDirectory::create();
assert!(Index::create(directory.clone(), throw_away_schema(), None).is_ok());
assert!(Index::create(
directory.clone(),
throw_away_schema(),
IndexSettings::default()
)
.is_ok());
assert!(Index::exists(&directory).unwrap());
assert!(Index::open_or_create(directory, throw_away_schema()).is_ok());
}

#[test]
fn create_should_wipeoff_existing() {
let directory = RamDirectory::create();
assert!(Index::create(directory.clone(), throw_away_schema(), None).is_ok());
assert!(Index::create(
directory.clone(),
throw_away_schema(),
IndexSettings::default()
)
.is_ok());
assert!(Index::exists(&directory).unwrap());
assert!(Index::create(directory.clone(), Schema::builder().build(), None).is_ok());
assert!(Index::create(
directory.clone(),
Schema::builder().build(),
IndexSettings::default()
)
.is_ok());
}

#[test]
fn open_or_create_exists_but_schema_does_not_match() {
let directory = RamDirectory::create();
assert!(Index::create(directory.clone(), throw_away_schema(), None).is_ok());
assert!(Index::create(
directory.clone(),
throw_away_schema(),
IndexSettings::default()
)
.is_ok());
assert!(Index::exists(&directory).unwrap());
assert!(Index::open_or_create(directory.clone(), throw_away_schema()).is_ok());
let err = Index::open_or_create(directory, Schema::builder().build());
@@ -714,7 +747,7 @@ mod tests {
let directory = RamDirectory::create();
let schema = throw_away_schema();
let field = schema.get_field("num_likes").unwrap();
let index = Index::create(directory.clone(), schema, None).unwrap();
let index = Index::create(directory.clone(), schema, IndexSettings::default()).unwrap();

let mut writer = index.writer_with_num_threads(8, 24_000_000).unwrap();
for i in 0u64..8_000u64 {