feat: support position delete writer #704

ZENOTME · 2024-11-19T13:34:39Z

Complete #340

liurenjie1024

There are two kinds of writers in iceberg:

1. Plain position delete writer: https://github.com/apache/iceberg/blob/da2ad389fd9ba8222f6fb3f57922209c239a7045/core/src/main/java/org/apache/iceberg/deletes/PositionDeleteWriter.java#L49
2. Sorting position delete writer: https://github.com/apache/iceberg/blob/da2ad389fd9ba8222f6fb3f57922209c239a7045/core/src/main/java/org/apache/iceberg/deletes/PositionDeleteWriter.java#L49

It seems that this PR tries to implement 2, but some parts are still missing. I would suggest implementing 1 first since it's easier. What do you think?

liurenjie1024 · 2024-11-27T07:00:08Z

crates/iceberg/src/arrow/schema.rs

@@ -607,6 +607,19 @@ impl SchemaVisitor for ToArrowSchemaConverter {
    }
 }

+/// Convert iceberg field to an arrow field.
+pub fn field_to_arrow_field(field: &crate::spec::NestedFieldRef) -> Result<FieldRef> {
+    let mut converter = ToArrowSchemaConverter;


This implementation looks a bit hacky to me. How about creating a one-field schema, converting it with the existing Arrow schema conversion, and then taking the resulting field?
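A minimal sketch of that suggestion, using simplified stand-in types rather than the real `iceberg` and `arrow` APIs. `IcebergField`, `IcebergSchema`, `ArrowField`, `ArrowSchema`, and `schema_to_arrow_schema` are all hypothetical here. The idea is to wrap the field in a one-field schema, run the whole-schema conversion, and take the single resulting field.

```rust
// Stand-in types for illustration only; the real crates define richer ones.
#[derive(Clone, Debug, PartialEq)]
pub struct IcebergField {
    pub id: i32,
    pub name: String,
    pub required: bool,
}

#[derive(Debug)]
pub struct IcebergSchema {
    pub fields: Vec<IcebergField>,
}

#[derive(Clone, Debug, PartialEq)]
pub struct ArrowField {
    pub name: String,
    pub nullable: bool,
    pub field_id: i32,
}

#[derive(Debug)]
pub struct ArrowSchema {
    pub fields: Vec<ArrowField>,
}

/// Hypothetical stand-in for the existing whole-schema conversion.
pub fn schema_to_arrow_schema(schema: &IcebergSchema) -> Result<ArrowSchema, String> {
    let fields = schema
        .fields
        .iter()
        .map(|f| ArrowField {
            name: f.name.clone(),
            nullable: !f.required,
            field_id: f.id,
        })
        .collect();
    Ok(ArrowSchema { fields })
}

/// Convert a single field by reusing the schema conversion,
/// instead of driving the visitor by hand.
pub fn field_to_arrow_field(field: &IcebergField) -> Result<ArrowField, String> {
    let one_field_schema = IcebergSchema {
        fields: vec![field.clone()],
    };
    let arrow_schema = schema_to_arrow_schema(&one_field_schema)?;
    arrow_schema
        .fields
        .into_iter()
        .next()
        .ok_or_else(|| "converted schema has no fields".to_string())
}
```

The cost is one extra schema allocation per call, in exchange for a single conversion code path.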

liurenjie1024 · 2024-11-27T07:04:18Z

crates/iceberg/src/writer/base_writer/position_delete_file_writer.rs

+    fn write<'life0, 'async_trait>(
+        &'life0 mut self,
+        input: PositionDeleteInput<'a>,
+    ) -> ::core::pin::Pin<
+        Box<dyn ::core::future::Future<Output = Result<()>> + ::core::marker::Send + 'async_trait>,
+    >
+    where
+        'life0: 'async_trait,
+        Self: 'async_trait,


Please remove these auto-generated lifetime markers and the fully qualified type prefixes.

We use a synchronous function here, so it seems we need to declare these auto-generated lifetimes explicitly.
The reason we need the sync version is that the input holds a reference, e.g. struct PositionDeleteInput<'a>, so we have to convert it into a record batch in the synchronous part of the function and then return an async future that writes this record batch.
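The pattern described in this reply can be sketched with the standard library alone. This is an illustration under assumptions, not the PR's code: `OwnedBatch` stands in for an Arrow `RecordBatch`, and `SketchWriter`, `block_on`, and `example` are hypothetical names. The borrowed input is consumed synchronously, so the returned future depends only on the `&mut self` borrow and not on the input lifetime.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Borrowed input, shaped like the struct discussed in this review.
pub struct PositionDeleteInput<'a> {
    pub path: &'a str,
    pub row: i64,
}

/// Hypothetical owned stand-in for an Arrow `RecordBatch`.
#[derive(Debug, Default, PartialEq)]
pub struct OwnedBatch {
    pub paths: Vec<String>,
    pub rows: Vec<i64>,
}

/// Hypothetical writer that only collects the batches it is given.
#[derive(Default)]
pub struct SketchWriter {
    pub written: Vec<OwnedBatch>,
}

impl SketchWriter {
    /// Not an `async fn`: the input is converted eagerly, so the future
    /// is bound to `'w` (the writer borrow) and never captures `'a`.
    pub fn write<'w, 'a>(
        &'w mut self,
        input: Vec<PositionDeleteInput<'a>>,
    ) -> Pin<Box<dyn Future<Output = Result<(), String>> + Send + 'w>> {
        let mut batch = OwnedBatch::default();
        for pd in input {
            batch.paths.push(pd.path.to_string());
            batch.rows.push(pd.row);
        }
        Box::pin(async move {
            self.written.push(batch);
            Ok(())
        })
    }
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Tiny polling loop for the sketch; real code would use an async runtime.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

/// The borrowed path is dropped before the future is awaited.
pub fn example() -> Vec<OwnedBatch> {
    let mut writer = SketchWriter::default();
    let fut = {
        let path = String::from("data/file-a.parquet");
        let input = vec![PositionDeleteInput { path: &path, row: 3 }];
        writer.write(input)
    };
    block_on(fut).unwrap();
    writer.written
}
```

With a plain `async fn`, the input would typically be captured by the future, which ties the future to the input lifetime; converting first avoids that.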

liurenjie1024 · 2024-11-27T07:10:17Z

crates/iceberg/src/writer/base_writer/position_delete_file_writer.rs

+}
+
+/// The memory position delete writer.
+pub struct MemoryPositionDeleteWriter<B: FileWriterBuilder> {


Suggested change

pub struct MemoryPositionDeleteWriter<B: FileWriterBuilder> {

pub struct PositionDeleteWriter<B: FileWriterBuilder> {

I don't think we should add a Memory prefix here, since it makes people think we are storing everything in memory. The same applies to all the other structs.

ZENOTME · 2024-11-27T13:25:01Z

Must position deletes be sorted, or is sorting optional? From the Iceberg spec, it looks like they must be sorted. https://iceberg.apache.org/spec/#position-delete-files:~:text=The%20rows%20in%20the%20delete%20file%20must%20be%20sorted%20by%20file_path%20then%20pos%20to%20optimize%20filtering%20rows%20while%20scanning.

liurenjie1024 · 2024-11-28T02:25:08Z

Yes, the spec requires it, but a compute engine could sort the rows before passing them to the writer, so the writer doesn't need to handle sorting itself.

ZENOTME · 2024-11-28T03:44:06Z

Makes sense. Let's implement 1 first.
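To make the division of labour concrete, here is a small sketch with hypothetical helper names. A plain writer can assume its caller already ordered the rows by `file_path` then `pos`, while a sorting writer would do the ordering itself before writing. Comparing paths byte-wise as Rust strings is an assumption of this sketch.

```rust
/// One position delete row: (file_path, pos).
pub type DeleteRow = (String, i64);

/// True if the rows are ordered by file_path, then pos.
/// Tuples compare lexicographically, so this checks both keys.
pub fn is_sorted_by_path_then_pos(rows: &[DeleteRow]) -> bool {
    rows.windows(2).all(|pair| pair[0] <= pair[1])
}

/// What a sorting writer (or the compute engine) would do up front.
pub fn sort_by_path_then_pos(rows: &mut [DeleteRow]) {
    rows.sort();
}
```

A plain writer could use a check like the first function in debug builds to catch callers that pass unsorted input.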

ZENOTME · 2024-12-03T02:12:23Z

I think we can resolve #741 first before this PR.

jonathanc-n · 2025-02-12T08:00:03Z

@ZENOTME Are you still working on this? I'm looking to work on one of the two writers

ZENOTME · 2025-02-12T12:20:37Z

Sorry for the late reply, I will work on this later. Would you like to work on the sorting position delete writer after this PR?

ZENOTME · 2025-02-12T13:29:14Z

Hi @liurenjie1024 @jonathanc-n. I have fixed this PR. It's ready for review.

jonathanc-n

Overall lgtm! I can get started on the sorting position delete writer next after merge

jonathanc-n · 2025-02-13T04:58:05Z

crates/iceberg/src/writer/base_writer/position_delete_file_writer.rs

+    async fn write(&mut self, input: Vec<PositionDeleteInput>) -> Result<()> {
+        let mut path_column_builder = StringBuilder::new();
+        let mut offset_column_builder = PrimitiveBuilder::<Int64Type>::new();
+        for input in input.into_iter() {


Rename the loop variable here? E.g. pd_input.

liurenjie1024

Thanks @ZENOTME for this pr, generally LGTM, left some minor suggestions.

liurenjie1024 · 2025-02-20T03:58:40Z

crates/iceberg/src/writer/base_writer/position_delete_file_writer.rs

+                2147483546,
+                "file_path",


Please make these constants.
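One possible shape for this, as a sketch: the two reserved field IDs and names quoted in these review hunks, pulled out into named constants. The constant names are hypothetical; only the values and field names come from the diff.

```rust
/// Reserved field id of the `file_path` column (value from the diff).
pub const FIELD_ID_POSITION_DELETE_FILE_PATH: i32 = 2147483546;
/// Name of the `file_path` column.
pub const FIELD_NAME_POSITION_DELETE_FILE_PATH: &str = "file_path";

/// Reserved field id of the `pos` column (value from the diff).
pub const FIELD_ID_POSITION_DELETE_POS: i32 = 2147483545;
/// Name of the `pos` column.
pub const FIELD_NAME_POSITION_DELETE_POS: &str = "pos";

/// The (id, name) pairs in column order, for building the delete schema.
pub fn position_delete_columns() -> [(i32, &'static str); 2] {
    [
        (FIELD_ID_POSITION_DELETE_FILE_PATH, FIELD_NAME_POSITION_DELETE_FILE_PATH),
        (FIELD_ID_POSITION_DELETE_POS, FIELD_NAME_POSITION_DELETE_POS),
    ]
}
```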

liurenjie1024 · 2025-02-20T03:58:47Z

crates/iceberg/src/writer/base_writer/position_delete_file_writer.rs

+                2147483545,
+                "pos",


liurenjie1024 · 2025-02-20T04:07:14Z

crates/iceberg/src/writer/base_writer/position_delete_file_writer.rs

+    /// The offset of the position delete.
+    pub offsets: Vec<i64>,


Suggested change

/// The offset of the position delete.

pub offsets: Vec<i64>,

/// The row number in the data file.

pub row: i64,

We should not ask users to think about the container.

liurenjie1024 · 2025-02-20T04:08:30Z

crates/iceberg/src/writer/base_writer/position_delete_file_writer.rs

+#[derive(Clone, PartialEq, Eq, Ord, PartialOrd, Debug)]
+pub struct PositionDeleteInput {
+    /// The path of the file.
+    pub path: String,


Suggested change

pub path: String,

pub path: &'a str,
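Combining this suggestion with the `row: i64` one above, the input could look roughly like the following sketch. The helper `deletes_for_file` is hypothetical; it only shows that many inputs for the same data file can share one borrowed path without allocating a `String` per row.

```rust
/// One deleted row, borrowing the data file path from the caller.
/// Deriving `Ord` with `path` declared first orders by path, then row.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub struct PositionDeleteInput<'a> {
    /// The path of the data file.
    pub path: &'a str,
    /// The row number in the data file.
    pub row: i64,
}

/// Build one input per deleted row of a single data file.
/// Every element borrows the same `path`; nothing is cloned.
pub fn deletes_for_file<'a>(path: &'a str, rows: &[i64]) -> Vec<PositionDeleteInput<'a>> {
    rows.iter()
        .map(|&row| PositionDeleteInput { path, row })
        .collect()
}
```

The trade-off is the lifetime parameter, which is what later rules out a plain `#[async_trait]` method in this PR.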

liurenjie1024 · 2025-02-20T04:09:30Z

crates/iceberg/src/writer/base_writer/position_delete_file_writer.rs

+}
+
+/// Position delete writer.
+pub struct PositionDeleteWriter<B: FileWriterBuilder> {


We should buffer the input row numbers in memory.

Do you mean that we should buffer the PositionDeleteInput values and write them as a batch? 🤔 E.g.

pub struct PositionDeleteWriter {
    // path -> row_num
    buffer: HashMap<String, Vec<i64>>,
}

I didn't add a buffer here because we will add SortPositionDeleteWriter later, and it will buffer the input and sort it, so I'm not sure whether we need a buffer here as well. Or we could make it optional?
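For reference, a buffering variant along the lines of the struct above might look like this sketch. It is not the PR's code: `BufferedPositionDeletes` is a hypothetical name, and it deliberately swaps the `HashMap` from the comment for a `BTreeMap` so that paths come out in sorted order when the buffer is flushed.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory buffer of position deletes.
#[derive(Default)]
pub struct BufferedPositionDeletes {
    // path -> row numbers (BTreeMap keeps paths ordered)
    buffer: BTreeMap<String, Vec<i64>>,
}

impl BufferedPositionDeletes {
    /// Record one deleted row; nothing is written yet.
    pub fn delete(&mut self, path: &str, row: i64) {
        self.buffer.entry(path.to_string()).or_default().push(row);
    }

    /// Drain the buffer as (file_path, pos) rows sorted by path, then pos.
    pub fn flush(&mut self) -> Vec<(String, i64)> {
        let mut out = Vec::new();
        for (path, mut rows) in std::mem::take(&mut self.buffer) {
            rows.sort_unstable();
            out.extend(rows.into_iter().map(|row| (path.clone(), row)));
        }
        out
    }
}
```

Whether this belongs in the plain writer or only in a later sorting writer is exactly the open question in this thread.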

ZENOTME · 2025-02-23T06:23:41Z

crates/iceberg/src/writer/base_writer/position_delete_file_writer.rs

+    partition_value: Struct,
+}
+
+impl<'a, B: FileWriterBuilder> IcebergWriter<Vec<PositionDeleteInput<'a>>>


We can't simplify this code with #[async_trait] because PositionDeleteInput holds a reference, so here we convert the inputs into a RecordBatch in sync code first and then return an async future that writes it. cc @liurenjie1024

@Xuanwo

#704 fails the MSRV check, and I found that's because `cargo update faststr` updates munge to `0.4.2` instead of `0.4.1`. A simple fix is to pin the precise version of munge, but I'm not sure whether that's good practice here. Do you have any suggestions? cc @Xuanwo @xxchan
Co-authored-by: ZENOTME <st810918843@gmail.com>

ZENOTME force-pushed the pos_delete branch from a38203b to 03f49ae Compare November 19, 2024 13:43

liurenjie1024 reviewed Nov 27, 2024

ZENOTME force-pushed the pos_delete branch from 03f49ae to 803cd39 Compare November 28, 2024 12:45

ZENOTME mentioned this pull request Dec 23, 2024

feat: support write risingwavelabs/iceberg-rust#10

Merged

ZENOTME force-pushed the pos_delete branch from 803cd39 to 37069e9 Compare February 12, 2025 13:24

jonathanc-n approved these changes Feb 13, 2025

liurenjie1024 reviewed Feb 20, 2025

ZENOTME commented Feb 23, 2025

ZENOTME force-pushed the pos_delete branch 2 times, most recently from 797379e to 391a061 Compare February 23, 2025 06:29

ZENOTME mentioned this pull request Feb 23, 2025

fix: speficy the version of munge for msrv check #987

Merged

ZENOTME added 3 commits February 24, 2025 16:16

add position delete file writer

813bdb4

refine name

c1c3e9d

refine input of writer

99ad49e

ZENOTME force-pushed the pos_delete branch from 391a061 to 99ad49e Compare February 24, 2025 08:16
