Exotic Types as fields of data provider structs #523

sffc · 2021-03-03T19:28:20Z

In #480, @gregtatum is proposing that we put an exotic type (in this case Pattern) in the data struct definition, and I expect @echeran will do something similar in enumerated properties. This is novel in the data provider architecture, and I want to discuss the implications.

Definition: When I say "exotic type", I mean an opaque type defined by ICU4X that implements Serde in a non-obvious way. I do not consider TinyStr and LiteMap to be exotic types, because they are serialization-compatible with String and HashMap.

Advantages of Exotic Types

When we put exotic types in data structs, it means that the serialized form is more opaque to the reader of the code. If we used only simple types in the data structs, it makes it easier to reason about the lifecycle of data structs.

However, as Greg pointed out in #480, putting exotic types in the data provider means that deserialization into those exotic types happens during data loading rather than in the constructor, which is a nice advantage that is consistent with the "pre-processing of data" design goal I outlined in data_provider.md.

On the other hand, we could design data structs to contain "low-level" representations of ICU4X concepts, like the list of tokens in a Pattern rather than the CLDR syntax for a Pattern. This would resolve the "pre-processing of data" issue, but it may tie our hands a bit more by encoding the internal representation of types into the data struct.

Stability

As a reminder, the reason we have versioned data keys and data structs (like plurals/ordinal@1 and PluralRulesV1) is so that a data file from a newer ICU4X can power code from an older ICU4X (and vice-versa as much as possible), as discussed in data_pipeline.md.

We need to think about what implications exotic types have on data struct stability. For example, if you have a struct like FooV1 { bar: Bar }, then the serialized form of Bar must also be stable.

Here are some implications to start with:

If a field of a data struct is an enum, then we need to make sure the serialized form of the enum is stable: both the string representation (for JSON) and presumably the discriminant (integer) version as well.
I don't think #[non_exhaustive] can work for fields of data structs, because if a new variant gets added to an enum, for example, a newer data file, which might have the new enum variant, won't be able to power an older ICU4X client, which doesn't know how to interpret that variant.

Portability

Currently, the Rust structs are the source of truth for the schema of the serialized files. However, I want us to keep the door open to using a cross-platform declarative syntax like JSON Schema or YANG to define the data struct schemas.

In other words, the serialized form of exotic types should be rooted in the building blocks of modern data markup languages: strings, numbers, arrays, and maps, with possible restrictions on those types (e.g., a string that is 1 character long, or a number that is in a certain range).

Portability is also an argument in favor of making the deserialization step as simple as possible. Going back to the Pattern exotic type, if I wanted to implement a data provider in JavaScript that serves a data struct across the FFI boundary, I would have to implement CLDR pattern parsing code.

The text was updated successfully, but these errors were encountered:

sffc · 2021-03-09T17:30:05Z

Notes from 2021-03-09:

@gregtatum: My expectation is that the human-readable CLDR pattern strings are used in JSON, and the array of fields is used in Bincode.
@sffc: Deserialization needs to be plumbed to support Bincode having a different form than JSON
@zbraniecki: We could use a flag to enable or disable the pattern parsing code
@sffc: The flag could enable or disable JSON
@Manishearth: If we make these structs #[repr(C)], then everything inside of them needs to be #[repr(C)], or else we could make mirror structs for FFI that are convertible to and from the Rust representation. Mirror structs could be useful to make it harder to accidentally change the FFI representation.
@sffc: How do we enforce serialization compatibility?
@Manishearth: We could have a test that runs on older versions of serialized data and ensures that the data can be deserialized from that form, and serialized back to the same form.
@sffc: We can also introduce a JSON Schema for the data structs, which will flag any problems if the serialized form changes.

Decision: Exotic types in data structs are OK. Add some documentation in the style guide explaining when to use them.

sffc · 2021-03-26T01:45:16Z

After #539 is implemented, I would like to spend some time profiling code size. I have trouble seeing how the compiler could know that it can eliminate the is_human_readable code path without additional help, especially when erased_serde is being used. Since we're now using erased_serde over FFI, we need a story for how we eliminate the JSON deserialization code path when building with erased_serde.

sffc · 2021-03-27T00:48:38Z

In documenting this, also discuss expectations for validation of user input, which will be necessary in Serde impls of exotic types.

sffc added discuss Discuss at a future ICU4X-SC meeting C-data-infra Component: provider, datagen, fallback, adapters labels Mar 3, 2021

sffc added T-docs-tests Type: Code change outside core library and removed discuss Discuss at a future ICU4X-SC meeting labels Mar 9, 2021

sffc self-assigned this Mar 9, 2021

sffc added this to the 2021-Q2-m1 milestone Mar 9, 2021

sffc added the S-tiny Size: Less than an hour (trivial fixes) label Apr 3, 2021

This was referenced Apr 7, 2021

Optimize performance of LocaleCanonicalizer::maximize. #504

Merged

Correctness, performance, and style #624

Closed

sffc mentioned this issue May 4, 2021

Add sections on zero-copy and exotic types to style guide #699

Merged

sffc modified the milestones: 2021-Q2-m1, ICU4X 0.3 May 13, 2021

Manishearth closed this as completed in #699 Jun 10, 2021

sffc mentioned this issue Sep 28, 2021

Introduce DTF runtime Pattern backed by ZV #1112

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exotic Types as fields of data provider structs #523

Exotic Types as fields of data provider structs #523

sffc commented Mar 3, 2021 •

edited

Loading

sffc commented Mar 9, 2021

sffc commented Mar 26, 2021

sffc commented Mar 27, 2021

Exotic Types as fields of data provider structs #523

Exotic Types as fields of data provider structs #523

Comments

sffc commented Mar 3, 2021 • edited Loading

Advantages of Exotic Types

Stability

Portability

sffc commented Mar 9, 2021

sffc commented Mar 26, 2021

sffc commented Mar 27, 2021

sffc commented Mar 3, 2021 •

edited

Loading