Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect IPC schema encoding for multiple dictionaries #7058

Open
XiangpengHao opened this issue Feb 1, 2025 · 10 comments · May be fixed by #7109
Open

Incorrect IPC schema encoding for multiple dictionaries #7058

XiangpengHao opened this issue Feb 1, 2025 · 10 comments · May be fixed by #7109

Comments

@XiangpengHao
Copy link
Contributor

Describe the bug

The following test case should pass, but it errors:

thread 'tests::test_dict_schema' panicked at arrow-flight/src/lib.rs:851:56:
called `Result::unwrap()` on an `Err` value: ParseError("Unable to convert flight info to a message: Type `i64` at position 100 is unaligned.\n\twhile verifying table field `id` at position 100\n\twhile verifying table field `dictionary` at position 76\n\twhile verifying vector element 1 at position 56\n\twhile verifying table field `fields` at position 44\n\t while verifying union variant `MessageHeader::Schema` at position 24\n\twhile verifying table field `header` at position 24\n\n")

To Reproduce

    #[test]
    fn test_dict_schema() {
        let schema = Schema::new(vec![
            Field::new(
                "a",
                DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8)),
                false,
            ),
            Field::new(
                "b",
                DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8)),
                false,
            ),
        ]);

        let flight_info = FlightInfo::new().try_with_schema(&schema).unwrap();

        let new_schema = Schema::try_from(flight_info).unwrap();
        assert_eq!(schema, new_schema);
    }

Expected behavior

If the schema only has one dictionary, it would work just fine.

Additional context

@XiangpengHao
Copy link
Contributor Author

Update: my latest finding is that it will panic whenever the dict_id is not 0

Probably related to: #5981

@XiangpengHao
Copy link
Contributor Author

For anyone following along, I think the bug is from flatbuffer, and the workaround is to turnoff the verifier on receiving the data. The data seems to be correct, just the verifier is not happy with it:
XiangpengHao@be255bc

I don't have the bandwidth to fix it from flatbuffer, but if anyone wants to push this, I'm happy to share more my findings.

@alamb
Copy link
Contributor

alamb commented Feb 6, 2025

FYI @bkietz and @tustvold

Maybe a pragmatic near term solution is to have an option to disable flatbuffer validation (perhaps similarly to)

@tustvold
Copy link
Contributor

tustvold commented Feb 6, 2025

I wonder if this is an unintentional consequence of the recent alignment changes, do older versions of arrow-rs behave the same way?

@XiangpengHao
Copy link
Contributor Author

I wonder if this is an unintentional consequence of the recent alignment changes, do older versions of arrow-rs behave the same way?

I tested up to last August and the problem can still reproduce.

@bkietz
Copy link
Member

bkietz commented Feb 7, 2025

Just as a sanity check, could we simplify the reproducer by removing reference to flight?

let schema = Schema::new(vec![
            Field::new(
                "a",
                DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8)),
                false,
            ),
            Field::new(
                "b",
                DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8)),
                false,
            ),
        ]);

let options = IpcWriteOptions::default();
let msg = SchemaAsIpc::new(&schema, &options).try_into().unwrap();
let new_schema: Schema = msg.try_into().unwrap();
assert_eq!(schema, new_schema);

@XiangpengHao
Copy link
Contributor Author

Just as a sanity check, could we simplify the reproducer by removing reference to flight?

let schema = Schema::new(vec![
Field::new(
"a",
DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8)),
false,
),
Field::new(
"b",
DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8)),
false,
),
]);

let options = IpcWriteOptions::default();
let msg = SchemaAsIpc::new(&schema, &options).try_into().unwrap();
let new_schema: Schema = msg.try_into().unwrap();
assert_eq!(schema, new_schema);

Exactly, it's just IPC not flight related

@bkietz
Copy link
Member

bkietz commented Feb 7, 2025

(We've encountered similar failures in the C++ library before ultimately due to the lack of alignment guarantees in protobuf's bytes, which are used to store the schema. IIRC our solution was to recopy the flatbuffers message if it did not start on an 8-byte aligned address.)

@bkietz
Copy link
Member

bkietz commented Feb 7, 2025

this tool might be useful for looking at the bytes you get

@bkietz
Copy link
Member

bkietz commented Feb 10, 2025

Inlining a bit further, the issue seems to be peculiar to flatbuffers::size_prefixed_root::<Message>. The schema can be roundtripped without error using arrow_ipc::reader::parse_message

#[test]
fn test_roundtrip_schema() {
    let schema = Schema::new(vec![
        Field::new(
            "a",
            DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8)),
            false,
        ),
        Field::new(
            "b",
            DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8)),
            false,
        ),
    ]);

    let options = IpcWriteOptions::default();
    let data_gen = IpcDataGenerator::default();
    let mut dict_tracker = DictionaryTracker::new(false);
    let encoded_data =
        data_gen.schema_to_bytes_with_dictionary_tracker(&schema, &mut dict_tracker, &options);
    let mut schema_bytes = vec![];
    write_message(&mut schema_bytes, encoded_data, &options).expect("write_message");

    let begin_offset: usize = if schema_bytes[0..4].eq(&CONTINUATION_MARKER) { 4 } else { 0 };
    size_prefixed_root_as_message(&schema_bytes[begin_offset..])
            .expect_err("size_prefixed_root_as_message");
    let msg = parse_message(&schema_bytes).expect("parse_message");
    let ipc_schema = msg.header_as_schema().expect("header_as_schema");
    let new_schema = fb_to_schema(ipc_schema);

    assert_eq!(schema, new_schema);
}

that function seems to be following the prefixed length as though it were an offset... which seems highly suspect, I feel I must be reading that incorrectly

@bkietz bkietz linked a pull request Feb 10, 2025 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants