Commit

Closes #1022
fulmicoton committed Apr 26, 2021
1 parent aead5d4 commit 2dc5403
Showing 16 changed files with 257 additions and 146 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -9,6 +9,7 @@ Tantivy 0.15.0
- Date field support for range queries (@rihardsk) #516
- Added lz4-flex as the default compression scheme in tantivy (@PSeitz) #1009
- Renamed a lot of symbols to avoid all uppercasing on acronyms, as per new clippy recommendation. For instance, RAMDirectory -> RamDirectory. (@pmasurel)
- Simplified positions index format (@fulmicoton) #1022

Tantivy 0.14.0
=========================
2 changes: 1 addition & 1 deletion src/core/inverted_index_reader.rs
@@ -142,7 +142,7 @@ impl InvertedIndexReader {
let positions_data = self
.positions_file_slice
.read_bytes_slice(term_info.positions_range.clone())?;
let position_reader = PositionReader::new(positions_data)?;
let position_reader = PositionReader::open(positions_data)?;
Some(position_reader)
} else {
None
3 changes: 2 additions & 1 deletion src/directory/mmap_directory.rs
@@ -614,8 +614,9 @@ mod tests {
reader.reload().unwrap();
let num_segments = reader.searcher().segment_readers().len();
assert!(num_segments <= 4);
let num_components_except_deletes = crate::core::SegmentComponent::iterator().len() - 1;
assert_eq!(
num_segments * 7,
num_segments * num_components_except_deletes,
mmap_directory.get_cache_info().mmapped.len()
);
}
2 changes: 1 addition & 1 deletion src/indexer/merger.rs
@@ -628,7 +628,7 @@ impl IndexMerger {
segment_postings.positions(&mut positions_buffer);

let delta_positions = delta_computer.compute_delta(&positions_buffer);
field_serializer.write_doc(remapped_doc_id, term_freq, delta_positions)?;
field_serializer.write_doc(remapped_doc_id, term_freq, delta_positions);
}

doc = segment_postings.advance();
2 changes: 1 addition & 1 deletion src/lib.rs
@@ -141,7 +141,7 @@ pub mod collector;
pub mod directory;
pub mod fastfield;
pub mod fieldnorm;
pub(crate) mod positions;
pub mod positions;
pub mod postings;
pub mod query;
pub mod schema;
205 changes: 123 additions & 82 deletions src/positions/mod.rs
@@ -1,28 +1,29 @@
/// Positions are stored in three parts and over two files.
//
/// The `SegmentComponent::Positions` file contains all of the bitpacked positions delta,
/// for all terms of a given field, one term after the other.
///
/// If the last block is incomplete, it is simply padded with zeros.
/// It cannot be read alone, as it actually does not contain the number of bits used for
/// each blocks.
/// .
/// Each block is serialized one after the other.
/// If the last block is incomplete, it is simply padded with zeros.
///
///
/// The `SegmentComponent::PositionsSKIP` file contains the number of bits used in each block in `u8`
/// stream.
///
/// This makes it possible to rapidly skip over `n positions`.
///
/// For every block #n where n = k * `LONG_SKIP_INTERVAL` blocks (k>=1), we also store
/// in this file the sum of number of bits used for all of the previous block (blocks `[0, n[`).
/// That is useful to start reading the positions for a given term: The TermInfo contains
/// an address in the positions stream, expressed in "number of positions".
/// The long skip structure makes it possible to skip rapidly to the a checkpoint close to this
/// value, and then skip normally.
///
//! Tantivy can (if instructed to do so in the schema) store the term positions in a given field.
//! These positions are expressed as token ordinals. For instance,
//! in "The beauty and the beast", the term "the" appears at position 0 and position 3.
//! This information is useful to run phrase queries.
//!
//! The `SegmentComponent::Positions` file contains all of the bitpacked position deltas,
//! for all terms of a given field, one term after the other.
//!
//! Each term is encoded independently.
//! As for posting lists, tantivy relies on SIMD bitpacking to encode the position deltas in blocks of 128 deltas.
//! Because the number of deltas is rarely a multiple of 128, a final block may encode the remaining values using variable byte encoding.
//!
//! In order to make reading possible, the delta positions of a term first encode the number of bitpacked blocks,
//! then the bit width of each block, then the actual bitpacked blocks, and finally the variable-int encoded block.
//!
//! Contrary to posting lists, the reader does not have access to the number of positions that are encoded, and instead
//! stops decoding the last block when its byte slice has been entirely read.
//!
//! More formally:
//! * *Positions* := *NumBitPackedBlocks* *BitPackedPositionsDeltaBitWidth* *BitPackedPositionBlock*^(*P* / 128) *VIntPosDeltas*?
//! * *NumBitPackedBlocks* := *P* / 128 encoded as a variable byte integer.
//! * *BitPackedPositionBlock* := bit width encoded block of 128 position deltas
//! * *BitPackedPositionsDeltaBitWidth* := (*BitWidth*: u8)^*NumBitPackedBlocks*
//! * *VIntPosDeltas* := *VIntPosDelta*^(*P* % 128).
//!
//! Encoding the bit widths separately makes it easy and fast to skip over n positions.
mod reader;
mod serializer;

@@ -38,42 +39,96 @@ pub mod tests {
use super::PositionSerializer;
use crate::directory::OwnedBytes;
use crate::positions::reader::PositionReader;
use proptest::prelude::*;
use proptest::sample::select;
use std::iter;

fn create_positions_data(vals: &[u32]) -> OwnedBytes {
fn create_positions_data(vals: &[u32]) -> crate::Result<OwnedBytes> {
let mut positions_buffer = vec![];
{
let mut serializer = PositionSerializer::new(&mut positions_buffer);
for &val in vals {
serializer.write_all(&[val]).unwrap();
let mut serializer = PositionSerializer::new(&mut positions_buffer);
serializer.write_positions_delta(&vals);
serializer.close_term()?;
serializer.close()?;
Ok(OwnedBytes::new(positions_buffer))
}

fn gen_delta_positions() -> BoxedStrategy<Vec<u32>> {
select(&[0, 1, 70, 127, 128, 129, 200, 255, 256, 257, 270][..])
.prop_flat_map(|num_delta_positions| {
proptest::collection::vec(
select(&[1u32, 2u32, 4u32, 8u32, 16u32][..]),
num_delta_positions,
)
})
.boxed()
}

proptest! {
#[test]
fn test_position_delta(delta_positions in gen_delta_positions()) {
let delta_positions_data = create_positions_data(&delta_positions).unwrap();
let mut position_reader = PositionReader::open(delta_positions_data).unwrap();
let mut minibuf = [0u32; 1];
for (offset, &delta_position) in delta_positions.iter().enumerate() {
position_reader.read(offset as u64, &mut minibuf[..]);
assert_eq!(delta_position, minibuf[0]);
}
serializer.close_term().unwrap();
serializer.close().unwrap();
}
OwnedBytes::new(positions_buffer)
}

#[test]
fn test_position_read() {
let v: Vec<u32> = (0..1000).collect();
let positions_data = create_positions_data(&v[..]);
fn test_position_read() -> crate::Result<()> {
let position_deltas: Vec<u32> = (0..1000).collect();
let positions_data = create_positions_data(&position_deltas[..])?;
assert_eq!(positions_data.len(), 1224);
let mut position_reader = PositionReader::new(positions_data).unwrap();
let mut position_reader = PositionReader::open(positions_data)?;
for &n in &[1, 10, 127, 128, 130, 312] {
let mut v = vec![0u32; n];
position_reader.read(0, &mut v[..]);
for i in 0..n {
assert_eq!(v[i], i as u32);
assert_eq!(position_deltas[i], i as u32);
}
}
Ok(())
}

#[test]
fn test_position_read_with_offset() {
let v: Vec<u32> = (0..1000).collect();
let positions_data = create_positions_data(&v[..]);
fn test_empty_position() -> crate::Result<()> {
let mut positions_buffer = vec![];
let mut serializer = PositionSerializer::new(&mut positions_buffer);
serializer.close_term()?;
serializer.close()?;
let position_delta = OwnedBytes::new(positions_buffer);
assert!(PositionReader::open(position_delta).is_ok());
Ok(())
}

#[test]
fn test_multiple_write_positions() -> crate::Result<()> {
let mut positions_buffer = vec![];
let mut serializer = PositionSerializer::new(&mut positions_buffer);
serializer.write_positions_delta(&[1u32, 12u32]);
serializer.write_positions_delta(&[4u32, 17u32]);
serializer.write_positions_delta(&[443u32]);
serializer.close_term()?;
serializer.close()?;
let position_delta = OwnedBytes::new(positions_buffer);
let mut output_delta_pos_buffer = vec![0u32; 5];
let mut position_reader = PositionReader::open(position_delta)?;
position_reader.read(0, &mut output_delta_pos_buffer[..]);
assert_eq!(
&output_delta_pos_buffer[..],
&[1u32, 12u32, 4u32, 17u32, 443u32]
);
Ok(())
}

#[test]
fn test_position_read_with_offset() -> crate::Result<()> {
let position_deltas: Vec<u32> = (0..1000).collect();
let positions_data = create_positions_data(&position_deltas[..])?;
assert_eq!(positions_data.len(), 1224);
let mut position_reader = PositionReader::new(positions_data).unwrap();
let mut position_reader = PositionReader::open(positions_data)?;
for &offset in &[1u64, 10u64, 127u64, 128u64, 130u64, 312u64] {
for &len in &[1, 10, 130, 500] {
let mut v = vec![0u32; len];
@@ -83,15 +138,16 @@ pub mod tests {
}
}
}
Ok(())
}

#[test]
fn test_position_read_after_skip() {
let v: Vec<u32> = (0..1_000).collect();
let positions_data = create_positions_data(&v[..]);
fn test_position_read_after_skip() -> crate::Result<()> {
let position_deltas: Vec<u32> = (0..1_000).collect();
let positions_data = create_positions_data(&position_deltas[..])?;
assert_eq!(positions_data.len(), 1224);

let mut position_reader = PositionReader::new(positions_data).unwrap();
let mut position_reader = PositionReader::open(positions_data)?;
let mut buf = [0u32; 7];
let mut c = 0;

@@ -105,14 +161,15 @@ pub mod tests {
c += 1;
}
}
Ok(())
}

#[test]
fn test_position_reread_anchor_different_than_block() {
fn test_position_reread_anchor_different_than_block() -> crate::Result<()> {
let positions_delta: Vec<u32> = (0..2_000_000).collect();
let positions_data = create_positions_data(&positions_delta[..]);
let positions_data = create_positions_data(&positions_delta[..])?;
assert_eq!(positions_data.len(), 5003499);
let mut position_reader = PositionReader::new(positions_data.clone()).unwrap();
let mut position_reader = PositionReader::open(positions_data.clone())?;
let mut buf = [0u32; 256];
position_reader.read(128, &mut buf);
for i in 0..256 {
@@ -122,57 +179,40 @@ pub mod tests {
for i in 0..256 {
assert_eq!(buf[i], (128 + i) as u32);
}
Ok(())
}

#[test]
#[should_panic(expected = "offset arguments should be increasing.")]
fn test_position_panic_if_called_previous_anchor() {
let positions_delta: Vec<u32> = (0..2_000_000).collect();
let positions_data = create_positions_data(&positions_delta[..]);
assert_eq!(positions_data.len(), 5_003_499);
fn test_position_requesting_passed_block() -> crate::Result<()> {
let positions_delta: Vec<u32> = (0..512).collect();
let positions_data = create_positions_data(&positions_delta[..])?;
assert_eq!(positions_data.len(), 533);
let mut buf = [0u32; 1];
let mut position_reader = PositionReader::new(positions_data).unwrap();
let mut position_reader = PositionReader::open(positions_data)?;
position_reader.read(230, &mut buf);
assert_eq!(buf[0], 230);
position_reader.read(9, &mut buf);
assert_eq!(buf[0], 9);
Ok(())
}

#[test]
fn test_positions_bug() {
let mut positions_delta: Vec<u32> = vec![];
for i in 1..200 {
for j in 0..i {
positions_delta.push(j);
}
}
let positions_data = create_positions_data(&positions_delta[..]);
let mut buf = Vec::new();
let mut position_reader = PositionReader::new(positions_data).unwrap();
let mut offset = 0;
for i in 1..24 {
buf.resize(i, 0);
offset += i as u64;
position_reader.read(offset, &mut buf[..]);
let expected_positions_delta: Vec<u32> = (0..i as u32).collect();
assert_eq!(buf, &expected_positions_delta[..], "Failed for offset={},i={}", offset, i);
}
}

#[test]
fn test_position() {
fn test_position() -> crate::Result<()> {
const CONST_VAL: u32 = 9u32;
let positions_delta: Vec<u32> = iter::repeat(CONST_VAL).take(2_000_000).collect();
let positions_data = create_positions_data(&positions_delta[..]);
let positions_data = create_positions_data(&positions_delta[..])?;
assert_eq!(positions_data.len(), 1_015_627);
let mut position_reader = PositionReader::new(positions_data).unwrap();
let mut position_reader = PositionReader::open(positions_data)?;
let mut buf = [0u32; 1];
position_reader.read(0, &mut buf);
assert_eq!(buf[0], CONST_VAL);
Ok(())
}

#[test]
fn test_position_advance() {
fn test_position_advance() -> crate::Result<()> {
let positions_delta: Vec<u32> = (0..2_000_000).collect();
let positions_data = create_positions_data(&positions_delta[..]);
let positions_data = create_positions_data(&positions_delta[..])?;
assert_eq!(positions_data.len(), 5_003_499);
for &offset in &[
10,
@@ -181,10 +221,11 @@ pub mod tests {
128 * 1024 + 7,
128 * 10 * 1024 + 10,
] {
let mut position_reader = PositionReader::new(positions_data.clone()).unwrap();
let mut position_reader = PositionReader::open(positions_data.clone())?;
let mut buf = [0u32; 1];
position_reader.read(offset, &mut buf);
assert_eq!(buf[0], offset as u32);
}
Ok(())
}
}
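
The byte lengths asserted in the tests above (1224, 533, 1_015_627) can be cross-checked against the layout described in the new module documentation. The sketch below is not part of this commit: `num_bits`, `vint_len` and `estimated_positions_len` are hypothetical helpers written for illustration. It assumes that a block's bit width is the number of bits of its largest delta, that a bitpacked block of 128 deltas of width `b` occupies `128 * b / 8` bytes, and that variable-byte integers carry 7 payload bits per byte.

```rust
/// Number of deltas per bitpacked block, as stated in the module docs.
const BLOCK_LEN: usize = 128;

/// Number of bits needed to represent `val` (0 for `val == 0`).
/// Assumption: the block bit width is derived this way from the block maximum.
fn num_bits(val: u32) -> u32 {
    32 - val.leading_zeros()
}

/// Length in bytes of a variable-byte integer.
/// Assumption: 7 payload bits per byte.
fn vint_len(mut val: u64) -> usize {
    let mut len = 1;
    while val >= 128 {
        val >>= 7;
        len += 1;
    }
    len
}

/// Estimated serialized size, in bytes, of the position deltas of one term:
/// block count (vint), one `u8` bit width per block, the bitpacked blocks,
/// then the remaining deltas as vints.
fn estimated_positions_len(deltas: &[u32]) -> usize {
    let num_blocks = deltas.len() / BLOCK_LEN;
    let (packed, remainder) = deltas.split_at(num_blocks * BLOCK_LEN);

    let header_len = vint_len(num_blocks as u64);
    let bit_widths_len = num_blocks; // one byte per block
    let blocks_len: usize = packed
        .chunks(BLOCK_LEN)
        .map(|block| {
            let width = block.iter().copied().map(num_bits).max().unwrap_or(0);
            BLOCK_LEN * width as usize / 8
        })
        .sum();
    let tail_len: usize = remainder
        .iter()
        .map(|&delta| vint_len(u64::from(delta)))
        .sum();

    header_len + bit_widths_len + blocks_len + tail_len
}
```

Under these assumptions, the deltas `0..1000` used in `test_position_read` give 1 + 7 + 1008 + 208 = 1224 bytes: one byte for the block count, seven bit widths, seven bitpacked blocks, and 104 two-byte vints. The deltas `0..512` from `test_position_requesting_passed_block` give 1 + 4 + 528 = 533 bytes.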