
Sort Index/Docids By Field #1026

Merged 53 commits on May 17, 2021.

Commits:
521075d
sort index by field
PSeitz Apr 27, 2021
5643ee2
support docid mapping in multivalue fastfield
PSeitz Apr 28, 2021
60bf3f8
handle docid map in bytes fastfield
PSeitz Apr 28, 2021
8469223
forward docid mapping, remap postings
PSeitz May 3, 2021
3cd436e
Merge remote-tracking branch 'upstream/main' into indexmeta
PSeitz May 3, 2021
045dfee
fix merge conflicts
PSeitz May 3, 2021
e97bdc9
move test to index_sorter
PSeitz May 3, 2021
62224fb
add docid index mapping old->new
PSeitz May 3, 2021
1ec2e61
remap docid in fielnorm
PSeitz May 3, 2021
8a7dc78
resort docids in recorder, more extensive tests
PSeitz May 4, 2021
179e859
handle index sorting in docstore
PSeitz May 4, 2021
855680b
refactor
PSeitz May 5, 2021
77b1aa1
u32 to DocId
PSeitz May 5, 2021
a7766fb
better doc_id_map creation
PSeitz May 5, 2021
18ef88c
add non mut method to FastFieldWriters
PSeitz May 5, 2021
d6775cd
remove sort_index
PSeitz May 5, 2021
451479b
fix clippy issues
PSeitz May 5, 2021
0ef02cd
fix SegmentComponent iterator
PSeitz May 5, 2021
3d537bf
fix test
PSeitz May 5, 2021
b0b0129
fmt
PSeitz May 5, 2021
0bbdd42
handle indexsettings deserialize
PSeitz May 5, 2021
b954fa6
add reading, writing bytes to doc store
PSeitz May 5, 2021
8e9278d
rename index_sorter to doc_id_mapping
PSeitz May 6, 2021
aca8cb8
fix compile issue, make sort_by_field optional
PSeitz May 6, 2021
a34cd0e
fix test compile
PSeitz May 6, 2021
ba5a0e6
validate index settings on merge
PSeitz May 7, 2021
20f10e0
fix doctest
PSeitz May 7, 2021
69dab3d
add itertools, use kmerge
PSeitz May 7, 2021
38c178f
implement/test merge for fastfield
PSeitz May 7, 2021
e7468e5
Use precalculated docid mapping in merger
PSeitz May 7, 2021
9aab1b9
fix fast field reader docs
PSeitz May 10, 2021
f8a3022
add test for multifast field merge
PSeitz May 10, 2021
4f77067
add num_bytes to BytesFastFieldReader
PSeitz May 10, 2021
39bcf13
add MultiValueLength trait
PSeitz May 10, 2021
c576e88
Add ReaderWithOrdinal, fix
PSeitz May 10, 2021
1b410d4
add test for merging bytes with sorted docids
PSeitz May 10, 2021
32a3a91
Merge fieldnorm for sorted index
PSeitz May 10, 2021
76f8de9
handle posting list in merge in sorted index
PSeitz May 10, 2021
b1c1c0d
handle doc store order in merge in sorted index
PSeitz May 10, 2021
eb0357c
fix typo, cleanup
PSeitz May 11, 2021
3129885
make IndexSetting non-optional
PSeitz May 11, 2021
00aab07
fix type, rename test file
PSeitz May 11, 2021
cd2711c
remove SegmentReaderWithOrdinal accessors
PSeitz May 11, 2021
4816fc4
cargo fmt
PSeitz May 11, 2021
5fc0ac4
add index sort & merge test to include deletes
PSeitz May 11, 2021
ade0ac0
Fix posting list merge issue
PSeitz May 11, 2021
b2a7fff
performance: cache field readers, use bytes for doc store merge
PSeitz May 11, 2021
aab65f0
change facet merge test to cover index sorting
PSeitz May 12, 2021
25cb568
add RawDocument abstraction to access bytes in doc store
PSeitz May 12, 2021
ea65dc1
Merge remote-tracking branch 'upstream/main' into indexmeta
PSeitz May 12, 2021
b6a0f42
fix deserialization, update changelog
PSeitz May 12, 2021
84da0be
cache store readers to utilize lru cache (4x performance)
PSeitz May 12, 2021
de0ea84
add include_temp_doc_store flag in InnerSegmentMeta
PSeitz May 14, 2021
1 change: 1 addition & 0 deletions Cargo.toml
@@ -51,6 +51,7 @@ smallvec = "1"
rayon = "1"
lru = "0.6"
fastdivide = "0.3"
itertools = "0.10.0"

[target.'cfg(windows)'.dependencies]
winapi = "0.3"
6 changes: 5 additions & 1 deletion bitpacker/src/bitpacker.rs
@@ -4,7 +4,11 @@ pub struct BitPacker {
mini_buffer: u64,
mini_buffer_written: usize,
}

impl Default for BitPacker {
fn default() -> Self {
BitPacker::new()
}
}
impl BitPacker {
pub fn new() -> BitPacker {
BitPacker {
7 changes: 5 additions & 2 deletions bitpacker/src/blocked_bitpacker.rs
@@ -15,6 +15,11 @@ pub struct BlockedBitpacker {
buffer: Vec<u64>,
offset_and_bits: Vec<BlockedBitpackerEntryMetaData>,
}
impl Default for BlockedBitpacker {
fn default() -> Self {
BlockedBitpacker::new()
}
}

/// `BlockedBitpackerEntryMetaData` encodes the
/// offset and bit_width into a u64 bit field
@@ -115,8 +120,6 @@ impl BlockedBitpacker {
self.buffer.clear();
self.compressed_blocks
.resize(self.compressed_blocks.len() + 8, 0); // add padding for bitpacker
} else {
return;
}
}
pub fn get(&self, idx: usize) -> u64 {
17 changes: 11 additions & 6 deletions src/core/index.rs
@@ -64,22 +64,28 @@ fn load_metas(
///
/// ```
/// use tantivy::schema::*;
/// use tantivy::{Index, IndexSettings};
/// use tantivy::{Index, IndexSettings, IndexSortByField, Order};
///
/// let mut schema_builder = Schema::builder();
/// let id_field = schema_builder.add_text_field("id", STRING);
/// let title_field = schema_builder.add_text_field("title", TEXT);
/// let body_field = schema_builder.add_text_field("body", TEXT);
/// let schema = schema_builder.build();
/// let settings = IndexSettings::default();
/// let settings = IndexSettings{sort_by_field: Some(IndexSortByField{field:"title".to_string(), order:Order::Asc})};
/// let index = Index::builder().schema(schema).settings(settings).create_in_ram();
///
/// ```
pub struct IndexBuilder {
schema: Option<Schema>,
index_settings: Option<IndexSettings>,
}
impl Default for IndexBuilder {
fn default() -> Self {
IndexBuilder::new()
}
}
impl IndexBuilder {
/// Creates a new `IndexBuilder`
pub fn new() -> Self {
Self {
schema: None,
@@ -135,11 +141,10 @@ impl IndexBuilder {
self.index_settings.as_ref().cloned()
}
fn get_expect_schema(&self) -> crate::Result<Schema> {
Ok(self
.schema
self.schema
.as_ref()
.cloned()
.ok_or_else(|| TantivyError::IndexBuilderMissingArgument("schema"))?)
.ok_or(TantivyError::IndexBuilderMissingArgument("schema"))
}
/// Opens or creates a new index in the provided directory
pub fn open_or_create<Dir: Directory>(self, dir: Dir) -> crate::Result<Index> {
@@ -423,7 +428,7 @@ impl Index {

/// Helper to create an index writer for tests.
///
/// That index writer only has a single thread and a heap of 5 MB.
/// That index writer only has a single thread and a heap of 10 MB.
/// Using a single thread gives us a deterministic allocation of DocId.
#[cfg(test)]
pub fn writer_for_tests(&self) -> crate::Result<IndexWriter> {
51 changes: 44 additions & 7 deletions src/core/index_meta.rs
@@ -112,6 +112,7 @@ impl SegmentMeta {
SegmentComponent::Positions => ".pos".to_string(),
SegmentComponent::Terms => ".term".to_string(),
SegmentComponent::Store => ".store".to_string(),
SegmentComponent::TempStore => ".store.temp".to_string(),
SegmentComponent::FastFields => ".fast".to_string(),
SegmentComponent::FieldNorms => ".fieldnorm".to_string(),
SegmentComponent::Delete => format!(".{}.del", self.delete_opstamp().unwrap_or(0)),
@@ -193,9 +194,36 @@ impl InnerSegmentMeta {
}
}

/// Search Index Settings
#[derive(Clone, Default, Serialize)]
pub struct IndexSettings {}
/// Search Index Settings.
///
/// Contains settings that are applied to the whole
/// index, such as presorting documents.
#[derive(Clone, Serialize, Deserialize, Eq, PartialEq)]
pub struct IndexSettings {
/// Sorts the documents by the information
/// provided in `IndexSortByField`
pub sort_by_field: Option<IndexSortByField>,
}
/// Settings to presort the documents in an index
///
/// Presorting documents can greatly improve performance
/// in some scenarios by enabling top-n
/// optimizations.
#[derive(Clone, Serialize, Deserialize, Eq, PartialEq)]
pub struct IndexSortByField {
/// The field to sort the documents by
pub field: String,
/// The order to sort the documents by
pub order: Order,
}
/// The order to sort by
#[derive(Clone, Serialize, Deserialize, Eq, PartialEq)]
pub enum Order {
/// Ascending Order
Asc,
/// Descending Order
Desc,
}
/// Meta information about the `Index`.
///
/// This object is serialized on disk in the `meta.json` file.
@@ -227,6 +255,7 @@ pub struct IndexMeta {
#[derive(Deserialize)]
struct UntrackedIndexMeta {
pub segments: Vec<InnerSegmentMeta>,
pub index_settings: Option<IndexSettings>,
pub schema: Schema,
pub opstamp: Opstamp,
#[serde(skip_serializing_if = "Option::is_none")]
@@ -236,7 +265,7 @@ struct UntrackedIndexMeta {
impl UntrackedIndexMeta {
pub fn track(self, inventory: &SegmentMetaInventory) -> IndexMeta {
IndexMeta {
index_settings: None,
index_settings: self.index_settings,
segments: self
.segments
.into_iter()
@@ -289,7 +318,10 @@ impl fmt::Debug for IndexMeta {
mod tests {

use super::IndexMeta;
use crate::schema::{Schema, TEXT};
use crate::{
schema::{Schema, TEXT},
IndexSettings, IndexSortByField, Order,
};
use serde_json;

#[test]
@@ -300,7 +332,12 @@ mod tests {
schema_builder.build()
};
let index_metas = IndexMeta {
index_settings: None,
index_settings: Some(IndexSettings {
sort_by_field: Some(IndexSortByField {
field: "text".to_string(),
order: Order::Asc,
}),
}),
segments: Vec::new(),
schema,
opstamp: 0u64,
@@ -309,7 +346,7 @@
let json = serde_json::ser::to_string(&index_metas).expect("serialization failed");
assert_eq!(
json,
r#"{"segments":[],"schema":[{"name":"text","type":"text","options":{"indexing":{"record":"position","tokenizer":"default"},"stored":false}}],"opstamp":0}"#
r#"{"index_settings":{"sort_by_field":{"field":"text","order":"Asc"}},"segments":[],"schema":[{"name":"text","type":"text","options":{"indexing":{"record":"position","tokenizer":"default"},"stored":false}}],"opstamp":0}"#
);
}
}
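The `IndexSettings`/`IndexSortByField` structs above only declare how an index should be sorted; the actual old-to-new doc id permutation is computed at serialization time from the sort field's values. A minimal standalone sketch of that idea (names and types are illustrative, not tantivy's actual implementation):

```rust
/// Simplified sketch: given one fast field value per document, compute the
/// permutation implied by the requested sort order. The returned vector is
/// `new_order[new_doc_id] = old_doc_id`, i.e. the order in which the old
/// documents must be written into the sorted segment.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum Order {
    Asc,
    Desc,
}

pub fn compute_doc_id_mapping(values: &[u64], order: Order) -> Vec<u32> {
    let mut old_ids: Vec<u32> = (0..values.len() as u32).collect();
    match order {
        // sort_by_key is a stable sort: equal values keep insertion order.
        Order::Asc => old_ids.sort_by_key(|&id| values[id as usize]),
        Order::Desc => old_ids.sort_by_key(|&id| std::cmp::Reverse(values[id as usize])),
    }
    old_ids
}
```

The stable sort keeps ties in their original insertion order, which makes the resulting mapping deterministic for a given segment.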
4 changes: 3 additions & 1 deletion src/core/mod.rs
@@ -10,7 +10,9 @@ mod segment_reader;

pub use self::executor::Executor;
pub use self::index::{Index, IndexBuilder};
pub use self::index_meta::{IndexMeta, IndexSettings, SegmentMeta, SegmentMetaInventory};
pub use self::index_meta::{
IndexMeta, IndexSettings, IndexSortByField, Order, SegmentMeta, SegmentMetaInventory,
};
pub use self::inverted_index_reader::InvertedIndexReader;
pub use self::searcher::Searcher;
pub use self::segment::Segment;
12 changes: 10 additions & 2 deletions src/core/segment.rs
@@ -1,5 +1,4 @@
use super::SegmentComponent;
use crate::core::Index;
use crate::core::SegmentId;
use crate::core::SegmentMeta;
use crate::directory::error::{OpenReadError, OpenWriteError};
@@ -8,6 +7,7 @@ use crate::directory::{FileSlice, WritePtr};
use crate::indexer::segment_serializer::SegmentSerializer;
use crate::schema::Schema;
use crate::Opstamp;
use crate::{core::Index, indexer::doc_id_mapping::DocIdMapping};
use std::fmt;
use std::path::PathBuf;

@@ -97,5 +97,13 @@ pub trait SerializableSegment {
///
/// # Returns
/// The number of documents in the segment.
fn write(&self, serializer: SegmentSerializer) -> crate::Result<u32>;
///
/// doc_id_map is used when the index is created and sorted, to map to the new doc_id order.
/// It is not used by the `IndexMerger`, since doc_id mapping across segments works
/// differently.
fn write(
&self,
serializer: SegmentSerializer,
doc_id_map: Option<&DocIdMapping>,
) -> crate::Result<u32>;
}
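The `doc_id_map` parameter added to `write` carries an old-to-new permutation of doc ids. A hypothetical sketch of what such a mapping provides, and why postings must be re-sorted after remapping (names and API are illustrative, not tantivy's actual `DocIdMapping`):

```rust
/// Illustrative model of a doc id mapping holding both directions
/// of the permutation.
pub struct DocIdMapping {
    /// new_to_old[new_doc_id] = old_doc_id
    new_to_old: Vec<u32>,
    /// old_to_new[old_doc_id] = new_doc_id
    old_to_new: Vec<u32>,
}

impl DocIdMapping {
    /// Builds the mapping from the new document order.
    pub fn from_new_order(new_to_old: Vec<u32>) -> Self {
        let mut old_to_new = vec![0u32; new_to_old.len()];
        for (new_id, &old_id) in new_to_old.iter().enumerate() {
            old_to_new[old_id as usize] = new_id as u32;
        }
        DocIdMapping { new_to_old, old_to_new }
    }

    pub fn get_old_doc_id(&self, new_doc_id: u32) -> u32 {
        self.new_to_old[new_doc_id as usize]
    }

    /// Remaps a posting list of ascending old doc ids into ascending new
    /// doc ids, as a writer must do before serializing a sorted segment.
    pub fn remap_postings(&self, old_doc_ids: &[u32]) -> Vec<u32> {
        let mut new_ids: Vec<u32> = old_doc_ids
            .iter()
            .map(|&old| self.old_to_new[old as usize])
            .collect();
        new_ids.sort_unstable(); // the permutation does not preserve order
        new_ids
    }
}
```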
5 changes: 4 additions & 1 deletion src/core/segment_component.rs
@@ -22,20 +22,23 @@ pub enum SegmentComponent {
/// Accessing a document from the store is relatively slow, as it
/// requires to decompress the entire block it belongs to.
Store,
/// Temporary storage of the documents, before being streamed to `Store`.
TempStore,
/// Bitset describing which document of the segment is deleted.
Delete,
}

impl SegmentComponent {
/// Iterates through the components.
pub fn iterator() -> slice::Iter<'static, SegmentComponent> {
static SEGMENT_COMPONENTS: [SegmentComponent; 7] = [
static SEGMENT_COMPONENTS: [SegmentComponent; 8] = [
SegmentComponent::Postings,
SegmentComponent::Positions,
SegmentComponent::FastFields,
SegmentComponent::FieldNorms,
SegmentComponent::Terms,
SegmentComponent::Store,
SegmentComponent::TempStore,
SegmentComponent::Delete,
];
SEGMENT_COMPONENTS.iter()
5 changes: 3 additions & 2 deletions src/directory/mmap_directory.rs
@@ -614,9 +614,10 @@ mod tests {
reader.reload().unwrap();
let num_segments = reader.searcher().segment_readers().len();
assert!(num_segments <= 4);
let num_components_except_deletes = crate::core::SegmentComponent::iterator().len() - 1;
let num_components_except_deletes_and_tempstore =
crate::core::SegmentComponent::iterator().len() - 2;
assert_eq!(
num_segments * num_components_except_deletes,
num_segments * num_components_except_deletes_and_tempstore,
mmap_directory.get_cache_info().mmapped.len()
);
}
2 changes: 1 addition & 1 deletion src/directory/owned_bytes.rs
@@ -36,8 +36,8 @@ impl OwnedBytes {
let bytes: &[u8] = box_stable_deref.as_ref();
let data = unsafe { mem::transmute::<_, &'static [u8]>(bytes.deref()) };
OwnedBytes {
box_stable_deref,
data,
box_stable_deref,
}
}

17 changes: 16 additions & 1 deletion src/fastfield/bytes/reader.rs
@@ -1,7 +1,7 @@
use crate::directory::FileSlice;
use crate::directory::OwnedBytes;
use crate::fastfield::FastFieldReader;
use crate::DocId;
use crate::{directory::FileSlice, fastfield::MultiValueLength};

/// Reader for byte array fast fields
///
@@ -40,8 +40,23 @@ impl BytesFastFieldReader {
&self.values.as_slice()[start..stop]
}

/// Returns the length of the bytes associated to the given `doc`
pub fn num_bytes(&self, doc: DocId) -> usize {
let (start, stop) = self.range(doc);
stop - start
}

/// Returns the overall number of bytes in this bytes fast field.
pub fn total_num_bytes(&self) -> usize {
self.values.len()
}
}

impl MultiValueLength for BytesFastFieldReader {
fn get_len(&self, doc_id: DocId) -> u64 {
self.num_bytes(doc_id) as u64
}
fn get_total_len(&self) -> u64 {
self.total_num_bytes() as u64
}
}
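The new `num_bytes`/`total_num_bytes` accessors, and the `MultiValueLength` impl built on them, all derive from one offset index: a start offset per document plus a final end offset. A simplified standalone model of that layout (the real reader decodes offsets from a compressed u64 fast field, so this is only a sketch):

```rust
/// Simplified model of a bytes fast field: `offsets` has one entry per
/// document plus a trailing end offset into `values`.
pub struct BytesIndex {
    pub offsets: Vec<u64>,
    pub values: Vec<u8>,
}

impl BytesIndex {
    fn range(&self, doc: u32) -> (usize, usize) {
        (
            self.offsets[doc as usize] as usize,
            self.offsets[doc as usize + 1] as usize,
        )
    }

    /// Bytes associated with `doc`.
    pub fn get_bytes(&self, doc: u32) -> &[u8] {
        let (start, stop) = self.range(doc);
        &self.values[start..stop]
    }

    /// Length of the bytes associated with `doc`.
    pub fn num_bytes(&self, doc: u32) -> usize {
        let (start, stop) = self.range(doc);
        stop - start
    }

    /// Overall number of bytes in the field.
    pub fn total_num_bytes(&self) -> usize {
        self.values.len()
    }
}
```

Documents with no value simply get a zero-length range between two equal offsets.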
54 changes: 48 additions & 6 deletions src/fastfield/bytes/writer.rs
@@ -1,8 +1,8 @@
use std::io;

use crate::fastfield::serializer::FastFieldSerializer;
use crate::schema::{Document, Field, Value};
use crate::DocId;
use crate::{fastfield::serializer::FastFieldSerializer, indexer::doc_id_mapping::DocIdMapping};

/// Writer for byte array (as in, any number of bytes per document) fast fields
///
@@ -72,20 +72,62 @@ impl BytesFastFieldWriter {
doc
}

/// Returns an iterator over values per doc_id in ascending doc_id order.
///
/// Normally the order is simply that of iterating `self.doc_index`.
/// With doc_id_map it accounts for the new mapping, returning values in the order of the
/// new doc_ids.
fn get_ordered_values<'a: 'b, 'b>(
&'a self,
doc_id_map: Option<&'b DocIdMapping>,
) -> impl Iterator<Item = &'b [u8]> {
let doc_id_iter = if let Some(doc_id_map) = doc_id_map {
Box::new(doc_id_map.iter_old_doc_ids().cloned()) as Box<dyn Iterator<Item = u32>>
} else {
Box::new(self.doc_index.iter().enumerate().map(|el| el.0 as u32))
as Box<dyn Iterator<Item = u32>>
};
doc_id_iter.map(move |doc_id| self.get_values_for_doc_id(doc_id))
}

/// Returns all values for a given doc_id.
fn get_values_for_doc_id(&self, doc_id: u32) -> &[u8] {
let start_pos = self.doc_index[doc_id as usize] as usize;
let end_pos = self
.doc_index
.get(doc_id as usize + 1)
.cloned()
.unwrap_or(self.vals.len() as u64) as usize; // special case, last doc_id has no offset information
&self.vals[start_pos..end_pos]
}

/// Serializes the fast field values by pushing them to the `FastFieldSerializer`.
pub fn serialize(&self, serializer: &mut FastFieldSerializer) -> io::Result<()> {
pub fn serialize(
&self,
serializer: &mut FastFieldSerializer,
doc_id_map: Option<&DocIdMapping>,
) -> io::Result<()> {
// writing the offset index
let mut doc_index_serializer =
serializer.new_u64_fast_field_with_idx(self.field, 0, self.vals.len() as u64, 0)?;
for &offset in &self.doc_index {
let mut offset = 0;
for vals in self.get_ordered_values(doc_id_map) {
doc_index_serializer.add_val(offset)?;
offset += vals.len() as u64;
}
doc_index_serializer.add_val(self.vals.len() as u64)?;
doc_index_serializer.close_field()?;
// writing the values themselves
serializer
.new_bytes_fast_field_with_idx(self.field, 1)
.write_all(&self.vals)?;
let mut value_serializer = serializer.new_bytes_fast_field_with_idx(self.field, 1);
// the else could be removed, but this is faster (difference not benchmarked)
if let Some(doc_id_map) = doc_id_map {
for vals in self.get_ordered_values(Some(doc_id_map)) {
// sort values in case of remapped doc_ids?
value_serializer.write_all(vals)?;
}
} else {
value_serializer.write_all(&self.vals)?;
}
Ok(())
}
}
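The rewritten `serialize` above recomputes the offset index while streaming values in the (possibly remapped) doc id order, instead of copying the stored offsets verbatim. A simplified sketch of that logic with plain vectors (illustrative, not the PR's exact code):

```rust
/// Re-emits the offset index and value bytes of a bytes fast field in the
/// order given by an optional old->new mapping, where
/// `new_order[new_doc_id] = old_doc_id`. `None` keeps the original order.
pub fn serialize_sorted(
    vals: &[u8],
    doc_index: &[u64], // start offset per old doc id
    new_order: Option<&[u32]>,
) -> (Vec<u64>, Vec<u8>) {
    // Values for one old doc id; the last doc has no next offset, so it
    // ends at vals.len(), mirroring get_values_for_doc_id above.
    let get = |old: usize| -> &[u8] {
        let start = doc_index[old] as usize;
        let end = doc_index
            .get(old + 1)
            .map(|&v| v as usize)
            .unwrap_or(vals.len());
        &vals[start..end]
    };
    let order: Vec<u32> = match new_order {
        Some(o) => o.to_vec(),
        None => (0..doc_index.len() as u32).collect(),
    };
    let (mut offsets, mut bytes, mut offset) = (Vec::new(), Vec::new(), 0u64);
    for old in order {
        let v = get(old as usize);
        offsets.push(offset); // running offset, recomputed in new order
        offset += v.len() as u64;
        bytes.extend_from_slice(v);
    }
    offsets.push(offset); // trailing end offset, as in serialize()
    (offsets, bytes)
}
```

With no remapping this reproduces the original offsets; with a mapping, both the offsets and the value bytes come out in the new document order.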