Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot deserialize Uuid #203

Closed
raj-nimble opened this issue Aug 2, 2024 · 11 comments
Closed

Cannot deserialize Uuid #203

raj-nimble opened this issue Aug 2, 2024 · 11 comments

Comments

@raj-nimble
Copy link

raj-nimble commented Aug 2, 2024

I am having an issue using serde_arrow to extract Fields from structs containing Uuid.

When I try to create fields from a record containing a Uuid type, the program panics with the following error:

thread 'main' panicked at src/main.rs:35:10:
failed to get schema: <Error: serde::de::Error: UUID parsing failed: invalid length: expected length 32 for simple format, found 0
Backtrace:
   0: serde_arrow::internal::error::Error::custom
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_arrow-0.11.6/src/internal/error.rs:30:24
   1: <serde_arrow::internal::error::Error as serde::de::Error>::custom
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_arrow-0.11.6/src/internal/error.rs:129:9
   2: uuid::external::serde_support::<impl serde::de::Deserialize for uuid::Uuid>::deserialize::de_error
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/uuid-1.10.0/src/external/serde_support.rs:60:13
   3: core::ops::function::FnOnce::call_once
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
   4: core::result::Result<T,E>::map_err
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/result.rs:829:27
   5: <uuid::external::serde_support::<impl serde::de::Deserialize for uuid::Uuid>::deserialize::UuidVisitor as serde::de::Visitor>::v
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/uuid-1.10.0/src/external/serde_support.rs:74:21
   6: serde::de::Visitor::visit_borrowed_str
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.204/src/de/mod.rs:1515:9
   7: <serde_arrow::internal::schema::from_type::TraceAny as serde::de::Deserializer>::deserialize_str
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_arrow-0.11.6/src/internal/schema/from_type/mod.
   8: uuid::external::serde_support::<impl serde::de::Deserialize for uuid::Uuid>::deserialize
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/uuid-1.10.0/src/external/serde_support.rs:109:13
   9: <core::marker::PhantomData<T> as serde::de::DeserializeSeed>::deserialize
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.204/src/de/mod.rs:801:9
  10: <serde_arrow::internal::schema::from_type::TraceStruct as serde::de::MapAccess>::next_value_seed
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_arrow-0.11.6/src/internal/schema/from_type/mod.
  11: serde::de::MapAccess::next_value
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.204/src/de/mod.rs:1874:9
  12: <purr::_::<impl serde::de::Deserialize for purr::Record>::deserialize::__Visitor as serde::de::Visitor>::visit_map
             at ./src/main.rs:7:21
  13: <serde_arrow::internal::schema::from_type::TraceAny as serde::de::Deserializer>::deserialize_struct
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_arrow-0.11.6/src/internal/schema/from_type/mod.
  14: purr::_::<impl serde::de::Deserialize for purr::Record>::deserialize
             at ./src/main.rs:7:21
  15: serde_arrow::internal::schema::from_type::<impl serde_arrow::internal::schema::tracer::Tracer>::from_type
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_arrow-0.11.6/src/internal/schema/from_type/mod.
  16: <serde_arrow::internal::schema::SerdeArrowSchema as serde_arrow::internal::schema::SchemaLike>::from_type
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_arrow-0.11.6/src/internal/schema/mod.rs:304:9
  17: serde_arrow::arrow_impl::schema::<impl serde_arrow::internal::schema::SchemaLike for alloc::vec::Vec<alloc::sync::Arc<arrow_sche
             at /home/sahae/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_arrow-0.11.6/src/arrow_impl/schema.rs:140:12
  18: purr::main
             at ./src/main.rs:34:18
  19: core::ops::function::FnOnce::call_once
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
  20: std::sys_common::backtrace::__rust_begin_short_backtrace
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/sys_common/backtrace.rs:155:18
  21: std::rt::lang_start::{{closure}}
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:166:18
  22: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:284:13
  23: std::panicking::try::do_call
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:552:40
  24: std::panicking::try
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:516:19
  25: std::panic::catch_unwind
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panic.rs:146:14
  26: std::rt::lang_start_internal::{{closure}}
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:148:48
  27: std::panicking::try::do_call
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:552:40
  28: std::panicking::try
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:516:19
  29: std::panic::catch_unwind
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panic.rs:146:14
  30: std::rt::lang_start_internal
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:148:20
  31: std::rt::lang_start
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/rt.rs:165:17
  32: main
  33: __libc_start_main
             at /build/glibc-LcI20x/glibc-2.31/csu/../csu/libc-start.c:308:16
  34: _start
>
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panicking.rs:72:14
   2: core::result::unwrap_failed
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/result.rs:1654:5
   3: core::result::Result<T,E>::expect
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/result.rs:1034:23
   4: purr::main
             at ./src/main.rs:34:18
   5: core::ops::function::FnOnce::call_once
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5

Here is a reproducible example that is extended from the example given in the serde_arrow crates.io page.

use arrow::datatypes::FieldRef;
use parquet::arrow::ArrowWriter;
use serde::{Deserialize, Serialize};
use serde_arrow::schema::{SchemaLike, TracingOptions};
use uuid::Uuid;

#[derive(Serialize, Deserialize)]
struct Record {
    a: f32,
    b: i32,
    c: Uuid,
}

fn main() {
    let records = vec![
        Record {
            a: 1.0,
            b: 1,
            c: Uuid::new_v4(),
        },
        Record {
            a: 2.0,
            b: 2,
            c: Uuid::new_v4(),
        },
        Record {
            a: 3.0,
            b: 3,
            c: Uuid::new_v4(),
        },
    ];

    // Determine Arrow schema
    let fields = Vec::<FieldRef>::from_type::<Record>(TracingOptions::default())
        .expect("failed to get schema");

    // Build a record batch
    let batch =
        serde_arrow::to_record_batch(&fields, &records).expect("failed to build record batch");

    let file = std::fs::File::create("example.pq").expect("failed to create parquet file");
    let mut writer =
        ArrowWriter::try_new(file, batch.schema(), None).expect("failed to create writer");

    writer.write(&batch).expect("failed to write batch");
    writer.close().expect("failed to close writer");
}

The Cargo.toml dependencies I used for this were

[dependencies]
arrow = "52.2.0"
parquet = "52.2.0"
serde = { version = "1.0.204", features = ["derive"] }
serde_arrow = { version = "0.11.6", features = ["arrow-52"] }
uuid = { version = "1.10.0", features = ["serde", "v4"] }

My rust version info

$ rustup --version
rustup 1.27.1 (54dd3d00f 2024-04-24)
info: This is the version for the rustup toolchain manager, not the rustc compiler.
info: The currently active `rustc` version is `rustc 1.78.0 (9b00956e5 2024-04-29)`
@chmp
Copy link
Owner

chmp commented Aug 2, 2024

Thanks a lot for the detailed report. That's super helpful. I will have a look and figure out what's going on!

@raj-nimble
Copy link
Author

raj-nimble commented Aug 2, 2024

From a brief look through, I think what might be going on is, trying to get the FieldRef by deserializing from_type figures out what the serde internal data model is for that type then calls deserialize on it to get the default representation of it for serde_arrow to use, but the internal data model for Uuid is String, and so it's deserializing a default string which is actually not valid for a Uuid. Actually, it might be deserializing as a borrowed str, so it's not even calling Default::default() in the deserializer, it's actually just passing in an empty string literal.

So I think I can work around this by getting the field types using from_samples, I hope. I'll report back if I find more.

@raj-nimble
Copy link
Author

raj-nimble commented Aug 2, 2024

FYI, using from_samples does work. It's not a perfect solution but it does mean I'm not blocked for the time being. It would be great if from_type eventually worked.

One drawback (this may in fact be totally separate and not specific to using from_samples) is that the resulting parquet file shows a uuid as a str.

>>> import polars as pl
>>> pl.read_parquet("example.pq")
shape: (3, 3)
┌─────┬─────┬─────────────────────────────────┐
│ a   ┆ b   ┆ c                               │
│ --- ┆ --- ┆ ---                             │
│ f32 ┆ i32 ┆ str                             │
╞═════╪═════╪═════════════════════════════════╡
│ 1.0 ┆ 1   ┆ 60ea4638-3956-44a5-a258-845c70… │
│ 2.0 ┆ 2   ┆ c986c97b-a9a5-4df3-887d-8ac5e5… │
│ 3.0 ┆ 3   ┆ 5fe544b6-ba04-4e34-a41e-ec5826… │
└─────┴─────┴─────────────────────────────────┘

While not a major issue at all, it would be nice if somehow it deserialized into a python UUID object.
I'm mentioning this here but I realize it likely is outside the scope of this crate.

@chmp
Copy link
Owner

chmp commented Aug 2, 2024

@raj-nimble Thanks for the investigation. From what I understand, your explanation is spot on. I am afraid though there is not much that can be done about it. from_type only sees the type without further information, so it knows the object is of type "String", but cannot know anything about the content.

Re. serializing the type as UUID: it seems there is a canonical extension type for UUID. It would definitely make sense to add some functionality to simplify dealing with UUIDs. It is currently not supported by pyarrow, though. Oh. And polars does not have support for extension types currently, so resulting files would not readable from polars, as far as I understand.

Finally, thanks to your report, I also realized that the "human_readable" flag is inconsistently set throughout the crate. I think, it should be false throughout, as I would expect this to result in smaller data files throughout. In your case (UUIDs), this would result the data to be serialized as Bytes. Probably less readable than it is now, but much smaller file sizes (and it would be compatible with the UUID extension type).

I think the following changes would make sense:

  • Mark all serializes / deserializers as non-human readable
  • Add a helper to construct fields with UUID type
  • Include UUIDs in the list of data types that cannot be handled by from_type in its docs

@chmp
Copy link
Owner

chmp commented Aug 3, 2024

I added a warning on using from_type for Uuid to the docs with #206. I am afraid due to the way Serde works, there is not much else that I can do here.

Thanks a lot for bringing this issue to my attention!

@chmp chmp closed this as completed Aug 3, 2024
@raj-nimble
Copy link
Author

Thanks @chmp for the follow up.

@raj-nimble
Copy link
Author

Small follow up to UUID issues, if you serialize/deserialize with the uuid::serde::compact module, then using from_type will work. The resulting parquet file unfortunately isn't human readable and you would have to further process the UUID in byte array form but at least it writes out

struct Record {
    #[serde(with = "uuid::serde::compact")]
    c: Uuid,
}

and the result parquet file

$ python test.py test.pq
                                                                                                                                                                    c  ...               timestamp
0  {'0': 142, '1': 38, '2': 179, '3': 146, '4': 201, '5': 168, '6': 75, '7': 50, '8': 180, '9': 250, '10': 243, '11': 254, '12': 19, '13': 206, '14': 205, '15': 127}  ... 2024-09-19 22:11:10.200
1   {'0': 218, '1': 170, '2': 157, '3': 110, '4': 119, '5': 92, '6': 74, '7': 69, '8': 183, '9': 141, '10': 247, '11': 19, '12': 218, '13': 155, '14': 106, '15': 17}  ... 2024-09-19 22:11:10.200
2        {'0': 129, '1': 22, '2': 102, '3': 25, '4': 0, '5': 23, '6': 69, '7': 20, '8': 158, '9': 89, '10': 83, '11': 117, '12': 197, '13': 105, '14': 97, '15': 146}  ... 2024-09-19 22:11:10.200

[3 rows x 4 columns]

@raj-nimble
Copy link
Author

Sadly, this fails for newtype structs ☹️

@raj-nimble
Copy link
Author

raj-nimble commented Sep 19, 2024

Actually it works for newtype structs in a release version. I think the failure with newtype is related to some bug in the enum flattening changes in that MR.

@chmp
Copy link
Owner

chmp commented Sep 30, 2024

Added an explanation to the error for IpAddr in #242

@raj-nimble
Copy link
Author

raj-nimble commented Oct 22, 2024

I was able to work around this in a much better way by adding a custom deserializer that allows for empty strings.
i.e.

mod allow_empty_uuid_str {
    use std::str::FromStr;
    use uuid::Uuid;

    pub fn deserialize<'de, D>(deserializer: D) -> Result<Uuid, D::Error>
    where
        D: serde::Deserializer<'de>,
    {
        let s: &str = <&str as serde::Deserialize>::deserialize(deserializer)?;
        if s.is_empty() {
            Ok(Uuid::nil())
        } else {
            Uuid::from_str(s).map_err(serde::de::Error::custom)
        }
    }
}

and then

struct MyData {
    #[serde(deserialize_with = "allow_empty_uuid_str::deserialize")
    uuid: Uuid
}

and then in the resulting parquet it is the actual uuid, not the byte array

                                uuid
0    1ad88379-b4b0-47be-81a2-caa079f663ce

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants