End-to-end tagging: Rust #8304

teh-cmc · 2024-12-03T16:24:07Z

I had to give up on the idea of splitting this thing into neat little PRs -- the enormous amount of extra work needed in this case is just not worth it, it's not even close (turns out changing the definition of Component has cascading consequences 😶).

I'll add a thorough description of what's going on to compensate, and can walk someone through this if needed.

Goals and non-goals

The goal of this PR is to get component tags in, store them, and then get them out.

The goal of this PR is not to port every single bit of component-name based logic to component-descriptor based logic (including but certainly not limited to datastore queries).
That will be the next step: #8293.

Types and traits

First and foremost, this ofc introduces the new ComponentDescriptor type:

/// A [`ComponentDescriptor`] fully describes the semantics of a column of data.
///
/// Every component is uniquely identified by its [`ComponentDescriptor`].
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct ComponentDescriptor {
    /// Optional name of the `Archetype` associated with this data.
    ///
    /// `None` if the data wasn't logged through an archetype.
    ///
    /// Example: `rerun.archetypes.Points3D`.
    pub archetype_name: Option<ArchetypeName>,

    /// Optional name of the field within `Archetype` associated with this data.
    ///
    /// `None` if the data wasn't logged through an archetype.
    ///
    /// Example: `positions`.
    pub archetype_field_name: Option<ArchetypeFieldName>,

    /// Semantic name associated with this data.
    ///
    /// This is fully implied by `archetype_name` and `archetype_field_name`, but
    /// included for semantic convenience.
    ///
    /// Example: `rerun.components.Position3D`.
    pub component_name: ComponentName,
}

Note that this is a Rerun type, not a Sorbet type: i.e. it uses Rerun terminology (archetypes, fields, etc), not Sorbet terminology.
As is now tradition, this terminology gets translated into its Sorbet equivalent when leaving the land of internal Chunks for the land of external RecordBatches and Dataframes.
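To make the "uniquely identified by its descriptor" point concrete, here is a minimal standalone sketch. It is not the real `re_types_core` type: plain `String`s stand in for `ArchetypeName`, `ArchetypeFieldName` and `ComponentName`, and the `untagged`/`tagged` helpers are hypothetical names made up for illustration.

```rust
// Simplified stand-in for the real type: plain `String`s replace the
// `ArchetypeName`, `ArchetypeFieldName` and `ComponentName` types (assumption).
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct ComponentDescriptor {
    pub archetype_name: Option<String>,
    pub archetype_field_name: Option<String>,
    pub component_name: String,
}

impl ComponentDescriptor {
    /// Hypothetical helper: a descriptor that carries nothing but a component name.
    pub fn untagged(component_name: &str) -> Self {
        Self {
            archetype_name: None,
            archetype_field_name: None,
            component_name: component_name.to_owned(),
        }
    }

    /// Hypothetical helper: attaches archetype tags to an existing descriptor.
    pub fn tagged(mut self, archetype_name: &str, archetype_field_name: &str) -> Self {
        self.archetype_name = Some(archetype_name.to_owned());
        self.archetype_field_name = Some(archetype_field_name.to_owned());
        self
    }
}
```

With this model, `untagged("rerun.components.Position3D")` and the same descriptor tagged with `rerun.archetypes.Points3D` / `positions` share a component name but compare as different descriptors, which is the whole point of tagging.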

Components are now uniquely identified by a ComponentDescriptor rather than a ComponentName:

/// A [`Component`] describes semantic data that can be used by any number of [`Archetype`]s.
///
/// Implementing the [`Component`] trait automatically derives the [`ComponentBatch`] implementation,
/// which makes it possible to work with lists' worth of data in a generic fashion.
pub trait Component: Loggable {
    /// Returns the complete [`ComponentDescriptor`] for this [`Component`].
    ///
    /// Every component is uniquely identified by its [`ComponentDescriptor`].
    //
    // NOTE: Builtin Rerun components don't (yet) have anything but a `ComponentName` attached to
    // them (other tags are injected at the Archetype level), therefore having a full
    // `ComponentDescriptor` might seem overkill.
    // It's not:
    // * Users might still want to register Components with specific tags.
    // * In the future, `ComponentDescriptor`s will very likely cover more than Archetype-related tags
    //   (e.g. generics, metric units, etc).
    fn descriptor() -> ComponentDescriptor;

    /// The fully-qualified name of this component, e.g. `rerun.components.Position2D`.
    ///
    /// This is a trivial but useful helper for `Self::descriptor().component_name`.
    ///
    /// The default implementation already does the right thing: do not override unless you know
    /// what you're doing.
    /// `Self::name()` must exactly match the value returned by `Self::descriptor().component_name`,
    /// or undefined behavior ensues.
    //
    // TODO(cmc): The only reason we keep this around is for convenience, and the only reason we need this
    // convenience is because we're still in this weird half-way in-between state where some things
    // are still indexed by name. Remove this entirely once we've ported everything to descriptors.
    #[inline]
    fn name() -> ComponentName {
        Self::descriptor().component_name
    }
}

Component::name still exists for now, as a convenience during the interim (that is, until we propagate ComponentDescriptor to every last corner of the app).
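As a rough illustration of the relationship between `descriptor()` and the default `name()`, here is a self-contained sketch. The trait below is a trimmed-down stand-in (no `Loggable` supertrait, `&'static str` instead of the real name types), and `Confidence` / `user.Confidence` are hypothetical names.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ComponentDescriptor {
    pub archetype_name: Option<&'static str>,
    pub archetype_field_name: Option<&'static str>,
    pub component_name: &'static str,
}

/// Trimmed-down stand-in for the real `Component` trait (assumption: no
/// `Loggable` supertrait, string slices instead of `ComponentName`).
pub trait Component {
    fn descriptor() -> ComponentDescriptor;

    /// Default helper: always derived from the descriptor, so the two cannot drift.
    #[inline]
    fn name() -> &'static str {
        Self::descriptor().component_name
    }
}

/// Hypothetical user-defined component.
pub struct Confidence(pub f32);

impl Component for Confidence {
    fn descriptor() -> ComponentDescriptor {
        ComponentDescriptor {
            archetype_name: None,
            archetype_field_name: None,
            component_name: "user.Confidence",
        }
    }
    // `name()` is intentionally not overridden: the default already matches
    // `descriptor().component_name`.
}
```

Leaving `name()` alone is the safe choice here: an override that returned anything else would break the invariant spelled out in the doc comment above.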

MaybeOwnedComponentBatch now has the possibility to augment and/or fully-override the ComponentDescriptor of the data within:

/// Some [`ComponentBatch`], optionally with an overridden [`ComponentDescriptor`].
///
/// Used by implementers of [`crate::AsComponents`] to both efficiently expose their component data
/// and assign the right tags given the surrounding context.
pub struct MaybeOwnedComponentBatch<'a> {
    /// The component data.
    pub batch: ComponentBatchCow<'a>,

    /// If set, will override the [`ComponentBatch`]'s [`ComponentDescriptor`].
    pub descriptor_override: Option<ComponentDescriptor>,
}

This is a crucial part of the story, as this is how e.g. archetypes inject their own tags when component data gets logged on their behalf.

Override model

The override model is simple:

Every Component has an associated ComponentDescriptor.
Every ComponentBatch inherits from its underlying Component's ComponentDescriptor.
AsComponents has an opportunity to override each ComponentBatch's ComponentDescriptor (by means of MaybeOwnedComponentBatch).
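The three rules above boil down to "use the override if there is one, otherwise fall back to the batch's own descriptor". A minimal sketch of that resolution, assuming a full (not field-by-field) override; `BatchWithOverride` and `Positions` are hypothetical stand-ins, not the real `MaybeOwnedComponentBatch` or any builtin component:

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ComponentDescriptor {
    pub archetype_name: Option<&'static str>,
    pub archetype_field_name: Option<&'static str>,
    pub component_name: &'static str,
}

/// Stand-in for `ComponentBatch`: it only knows its component's own descriptor.
pub trait ComponentBatch {
    fn descriptor(&self) -> ComponentDescriptor;
}

/// Hypothetical stand-in for `MaybeOwnedComponentBatch`.
pub struct BatchWithOverride<'a> {
    pub batch: &'a dyn ComponentBatch,
    pub descriptor_override: Option<ComponentDescriptor>,
}

impl BatchWithOverride<'_> {
    /// The override wins when present; otherwise the batch's descriptor is inherited.
    pub fn descriptor(&self) -> ComponentDescriptor {
        self.descriptor_override
            .clone()
            .unwrap_or_else(|| self.batch.descriptor())
    }
}

/// Hypothetical batch of positions whose component carries no tags of its own.
pub struct Positions(pub Vec<[f32; 3]>);

impl ComponentBatch for Positions {
    fn descriptor(&self) -> ComponentDescriptor {
        ComponentDescriptor {
            archetype_name: None,
            archetype_field_name: None,
            component_name: "rerun.components.Position3D",
        }
    }
}
```

An archetype logging `Positions` on its own behalf would set `descriptor_override` to a descriptor carrying its archetype name and field name, while a bare `Positions` batch logged directly keeps the untagged descriptor.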

The goal is to try and carry those semantics over to the two other SDKs (Python, C++), while somehow keeping changes to a minimum.

Undefined behavior

Logging the same component multiple times on a single entity (e.g. by logging different archetypes that share parts of their definitions) has always been, for all intents and purposes, UB.

This PR propagates descriptors just enough to get things up and running, no more, no less. By which I mean that it is possible to get component tags in and out of the system, but many things still assume that Components are uniquely identified by their names.
This means that some parts of the codebase are still indexing things by name, while others index by descriptor. Where these parts meet, what was UB before is even more UB now, as we generally just pick one random component among the ones available.
You'll see a lot of get_first_component in the code: every single one of those is UB if there are multiple components under the same name (for now!).

Debug builds assert for duplicated components, until we properly use descriptors everywhere (remember: nothing should ever be indexed by ComponentName in the future).
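For readers who want to see why a name-only lookup becomes ambiguous, here is a toy model. It only borrows the `get_first_component` name: the signature, the `Tags` alias and the map layout are assumptions for illustration, not the actual chunk API.

```rust
use std::collections::BTreeMap;

/// Stand-in for the tag part of a descriptor: (archetype name, archetype field name).
type Tags = (Option<&'static str>, Option<&'static str>);

/// Toy layout: component name first, then one entry per distinct set of tags.
type Components<T> = BTreeMap<&'static str, BTreeMap<Tags, T>>;

/// Name-only lookup: returns *some* entry stored under `name`.
/// Which one is an accident of ordering as soon as there is more than one.
pub fn get_first_component<'a, T>(components: &'a Components<T>, name: &str) -> Option<&'a T> {
    components.get(name)?.values().next()
}

/// Mirrors the spirit of the debug assertion: at most one entry per component name.
pub fn is_unambiguous<T>(components: &Components<T>, name: &str) -> bool {
    components.get(name).map_or(true, |per_desc| per_desc.len() <= 1)
}
```

As long as `is_unambiguous` holds for every name, the name-only lookup and a descriptor-based lookup agree; once it stops holding, callers of the name-only path silently get an arbitrary pick.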

Fully-qualified component names & column paths

ComponentDescriptor defines its fully-qualified name as such:

match (archetype_name, component_name, archetype_field_name) {
    (None, component_name, None) => component_name.to_owned(),
    (Some(archetype_name), component_name, None) => {
        format!("{archetype_name}:{component_name}")
    }
    (None, component_name, Some(archetype_field_name)) => {
        format!("{component_name}#{archetype_field_name}")
    }
    (Some(archetype_name), component_name, Some(archetype_field_name)) => {
        format!("{archetype_name}:{component_name}#{archetype_field_name}")
    }
}

which yields e.g. rerun.archetypes.Points3D:rerun.components.Position3D#positions, which is generally shortened to Points3D:Position3D#positions when there is no ambiguity.

In the dataframe API, a fully-qualified column path now becomes {entity_path}@{archetype_name}:{component_name}#{archetype_field_name}, e.g. /my/points@rerun.archetypes.Points3D:rerun.components.Position3D#positions or /my/points@Points3D:Position3D#positions.
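The naming rules above can be exercised in isolation with a small sketch. `String` fields stand in for the real name types, and `column_path` is a hypothetical free function that simply applies the `{entity_path}@{...}` template to the full (unshortened) name.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ComponentDescriptor {
    pub archetype_name: Option<String>,
    pub archetype_field_name: Option<String>,
    pub component_name: String,
}

impl ComponentDescriptor {
    /// Fully-qualified name: `{archetype_name}:{component_name}#{archetype_field_name}`,
    /// with the optional parts (and their separators) dropped when absent.
    pub fn full_name(&self) -> String {
        match (
            &self.archetype_name,
            &self.component_name,
            &self.archetype_field_name,
        ) {
            (None, component, None) => component.clone(),
            (Some(archetype), component, None) => format!("{archetype}:{component}"),
            (None, component, Some(field)) => format!("{component}#{field}"),
            (Some(archetype), component, Some(field)) => {
                format!("{archetype}:{component}#{field}")
            }
        }
    }
}

/// Hypothetical helper: builds the proposed fully-qualified column path.
pub fn column_path(entity_path: &str, descriptor: &ComponentDescriptor) -> String {
    format!("{entity_path}@{}", descriptor.full_name())
}
```

Feeding it the `Points3D` / `Position3D` / `positions` descriptor and the entity path `/my/points` reproduces the long-form examples given above; the shortened forms would additionally need the `short_name` logic, which is not sketched here.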

This syntax needs to be debated. I have intentionally disabled the syntax in the dataframe APIs so as not to break anything external-facing.

Transport and metadata

ArchetypeName and ArchetypeFieldName are now exposed as rerun.archetype_name and rerun.archetype_field_name in TransportChunk's arrow metadata.

I really cannot wait for a better metadata system.
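A sketch of what the two metadata keys amount to, using a plain string map in place of arrow field metadata. Only the two key names come from this PR; the helper names, the "absent key means untagged" rule and passing the component name separately are assumptions made for the example.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ComponentDescriptor {
    pub archetype_name: Option<String>,
    pub archetype_field_name: Option<String>,
    pub component_name: String,
}

const ARCHETYPE_NAME_KEY: &str = "rerun.archetype_name";
const ARCHETYPE_FIELD_NAME_KEY: &str = "rerun.archetype_field_name";

/// Hypothetical helper: writes the optional tags as string key/value metadata.
pub fn to_metadata(descriptor: &ComponentDescriptor) -> BTreeMap<String, String> {
    let mut metadata = BTreeMap::new();
    if let Some(archetype_name) = &descriptor.archetype_name {
        metadata.insert(ARCHETYPE_NAME_KEY.to_owned(), archetype_name.clone());
    }
    if let Some(field_name) = &descriptor.archetype_field_name {
        metadata.insert(ARCHETYPE_FIELD_NAME_KEY.to_owned(), field_name.clone());
    }
    metadata
}

/// Hypothetical helper: rebuilds a descriptor, treating missing keys as "untagged".
pub fn from_metadata(
    component_name: &str,
    metadata: &BTreeMap<String, String>,
) -> ComponentDescriptor {
    ComponentDescriptor {
        archetype_name: metadata.get(ARCHETYPE_NAME_KEY).cloned(),
        archetype_field_name: metadata.get(ARCHETYPE_FIELD_NAME_KEY).cloned(),
        component_name: component_name.to_owned(),
    }
}
```

The property worth checking is the roundtrip: whatever tags go in must come back out unchanged, and an untagged descriptor must produce no tag keys at all.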

Performance

ComponentDescriptors add an extra layer of mappings everywhere: we used to have IntMap<ComponentName, T> all over the place, now we have IntMap<ComponentName, IntMap<ComponentDescriptor, T>>.
The extra ComponentName layer is needed because it is very common to want to look for anything matching a ComponentName, without any further tags specified.

Like before, these are NoHash maps, so performance impact should be minimal (ComponentDescriptor implements NoHash by xor'ing everything).
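The two-level layout can be pictured with the sketch below. `BTreeMap` is used purely as a stand-in for `IntMap` (so nothing here reflects the NoHash behavior), and `PerComponent` with its methods is a hypothetical wrapper, not a type from the codebase.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct ComponentDescriptor {
    pub archetype_name: Option<String>,
    pub archetype_field_name: Option<String>,
    pub component_name: String,
}

/// Hypothetical wrapper around the nested `name -> descriptor -> T` layout.
pub struct PerComponent<T> {
    inner: BTreeMap<String, BTreeMap<ComponentDescriptor, T>>,
}

impl<T> PerComponent<T> {
    pub fn new() -> Self {
        Self { inner: BTreeMap::new() }
    }

    /// Inserts under the descriptor's component name, then under the full descriptor.
    pub fn insert(&mut self, descriptor: ComponentDescriptor, value: T) -> Option<T> {
        self.inner
            .entry(descriptor.component_name.clone())
            .or_default()
            .insert(descriptor, value)
    }

    /// Exact lookup: both the name and all tags must match.
    pub fn get(&self, descriptor: &ComponentDescriptor) -> Option<&T> {
        self.inner.get(&descriptor.component_name)?.get(descriptor)
    }

    /// Name-only lookup: everything stored under `name`, whatever its tags.
    pub fn get_by_name(&self, name: &str) -> Vec<(&ComponentDescriptor, &T)> {
        self.inner
            .get(name)
            .map(|per_descriptor| per_descriptor.iter().collect())
            .unwrap_or_default()
    }
}
```

The outer level is what keeps the common "anything called `Position3D`" query to a single lookup, instead of a scan over every descriptor.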

Examples / testing / roundtrips

See:

docs/snippets/all/descriptors/descr_builtin_archetype.rs
docs/snippets/all/descriptors/descr_builtin_component.rs
docs/snippets/all/descriptors/descr_custom_archetype.rs
docs/snippets/all/descriptors/descr_custom_component.rs

These snippets play all roles at once, as usual. In particular they make sure that all languages (well, only Rust for now, Python and C++ coming soon) carry all the right tags in all the right situations.

Part of Tagged components milestone 1: end-to-end tagging #7948

github-actions · 2024-12-03T16:25:51Z

Web viewer failed to build.

Result	Commit	Link
❌		https://rerun.io/viewer/pr/8304

Note: This comment is updated whenever you push a commit.

teh-cmc · 2024-12-03T17:02:31Z

crates/store/re_chunk_store/src/dataframe.rs

+            // NOTE: Uncomment this to expose fully-qualified names in the Dataframe APIs!
+            // I'm not doing that right now, to avoid breaking changes (and we need to talk about
+            // what the syntax for these fully-qualified paths needs to look like first).
+            format!("{}:{}", entity_path, descriptor.component_name.short_name()),
+            // format!("{entity_path}@{}", descriptor.short_name()),


As mentioned, I kept the existing column-path syntax around for now, to avoid breaking anything downstream. TBD.

teh-cmc · 2024-12-03T17:05:47Z

crates/store/re_chunk_store/src/store.rs

+        let is_tombstone = re_types_core::archetypes::Clear::all_components()
+            .iter()
+            .any(|descr| descr.component_name == *component_name);


This is one of many many many examples where we still have component-name based logic that needs to be migrated to component-descriptor based logic.

The goal of this PR is to get tags in and out, not migrate everything that exists in one go (that PR is big enough as is!).

teh-cmc · 2024-12-03T17:07:21Z

crates/store/re_chunk_store/src/writes.rs

+        #[cfg(debug_assertions)]
+        for (component_name, per_desc) in chunk.components().iter() {
+            assert!(
+                per_desc.len() <= 1,
+                "Insert Chunk with multiple values for component named `{component_name}`: this is currently UB",
+            );
+        }


One day this will be legal, but first we need to remove every single bit of component-name based logic across the entire app.

teh-cmc · 2024-12-03T17:54:21Z

@rerun-bot full-check

github-actions · 2024-12-03T17:54:51Z

Started a full build: https://github.com/rerun-io/rerun/actions/runs/12145398537

teh-cmc · 2024-12-04T09:19:51Z

crates/store/re_types_core/src/loggable_batch.rs

+///
+/// Used by implementers of [`crate::AsComponents`] to both efficiently expose their component data
+/// and assign the right tags given the surrounding context.
+pub struct MaybeOwnedComponentBatch<'a> {


Maybe at this point this deserves a new name? I don't think it matters all that much, all those APIs will break in the most extraordinary ways once we move to eager serialization anyhow...

That said, DescribedComponentBatch could be made to match what we use on the Python side, and maybe even the C++ side if we get lucky...

agreed. Maybe simpler ComponentBatchCowWithDesc?

By having one opinion more than me, you win the opinion bidding war: ComponentBatchCowWithDescriptor it is

teh-cmc · 2024-12-04T11:45:02Z

cargo machete passes locally, and cargo machete --version matches the CI... I dunno man.

teh-cmc · 2024-12-04T11:45:46Z

I had to give a tour to @Wumpf so that he could help me with the C++ port, so he's probably the best equipped person to review this at this point.

crates/store/re_types_core/src/component_descriptor.rs

Wumpf

oof, that was a lot :)
ship it!

crates/store/re_chunk/src/chunk.rs

Wumpf · 2024-12-05T15:45:27Z

crates/store/re_chunk/src/transport.rs

+                // TODO(cmc): disgusting, but good enough for now.
+                (field.name == RowId::descriptor().component_name.as_str()).then_some(column)


what would the clean alternative look like? why not extract the full descriptor from the field's metadata and compare that? I suppose for RowId that's a bit overzealous given that a rowid would never be tagged

The clean alternative is that all these things should be working at the new protobuf metadata layer, with proper Descriptor types etc.

Wumpf · 2024-12-05T17:16:27Z

crates/store/re_chunk_store/src/store.rs

+    ///
+    /// See also:
+    /// * [`ChunkStore::new`]
+    pub fn from_log_msgs(


the only difference between this and the thing above is just the error context? couldn't that be done with another with_context call on the result of from_log_msgs?

but then from_log_msgs needs to take an iterator of results instead and things get annoying for other people... I'd rather have a bit of duplication and make everyone happy

examples/rust/dna/src/main.rs

examples/rust/dataframe_query/src/main.rs

Wumpf · 2024-12-06T09:05:43Z

crates/store/re_types_core/src/loggable.rs

+    // * Users might still want to register Components with specific tags.
+    // * In the future, `ComponentDescriptor`s will very likely cover more than Archetype-related tags
+    //   (e.g. generics, metric units, etc).
+    fn descriptor() -> ComponentDescriptor;


despite the above line of argumentation, it does irk me a bit that you can set field names and archetype on the descriptor here. It's almost like a stricter subset of the descriptor would be better here: do we really want to be able to set field names at the component type level?
I do however concur that there may be more stuff later that is very relevant to the component, and a separate type to split those out is likely overkill.

It can definitely be useful yes, I've seen it when implementing actual examples.

If you're implementing a custom component and you know you always want it to be part of a specific archetype, you'd rather just do it once in your component definition, rather than having to specify overrides at every log site.
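A sketch of that use case, with a trimmed-down `Component` trait and made-up names (`CustomPosition3D`, `user.CustomPoints3D`): the tags are baked into `descriptor()` once, so log sites need no override.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ComponentDescriptor {
    pub archetype_name: Option<&'static str>,
    pub archetype_field_name: Option<&'static str>,
    pub component_name: &'static str,
}

/// Trimmed-down stand-in for the real `Component` trait (assumption).
pub trait Component {
    fn descriptor() -> ComponentDescriptor;
}

/// Hypothetical custom component that always belongs to one specific archetype.
pub struct CustomPosition3D(pub [f32; 3]);

impl Component for CustomPosition3D {
    fn descriptor() -> ComponentDescriptor {
        ComponentDescriptor {
            // Tags set once, at the component type level.
            archetype_name: Some("user.CustomPoints3D"),
            archetype_field_name: Some("positions"),
            component_name: "user.CustomPosition3D",
        }
    }
}
```

Every batch of `CustomPosition3D` then inherits these tags by default, and an archetype can still replace them through the override mechanism when it needs to.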

crates/store/re_types_core/src/component_descriptor.rs

Wumpf · 2024-12-06T09:10:54Z

crates/top/re_sdk/src/recording_stream.rs

-            components?.into_iter().collect();
+        let components: ChunkComponents = components?.into_iter().collect();
+
+        {


makes sense but how did this validation end up in here? isn't that more general than send_columns?

oO I did not write this code, I'm not sure what's going on

I'll just remove it, I'm not sure what that's about 🤷

Wumpf · 2024-12-06T09:14:39Z

docs/snippets/all/descriptors/descr_builtin_archetype.rs

+        &rerun::Points3D::new([(1.0, 2.0, 3.0)]).with_radii([0.3, 0.2, 0.1]),
+    )?;
+
+    // When this snippet runs through the snippet comparison machinery, this environment variable


that's a really weird hack to do the checking. Nothing about this is /docs/ anymore :/
wouldn't the older "roundtrip tests" do the same job just as well?

if not can we maybe at least add a short readme to this folder to point out that these are not really snippets for documenting but rather tests?

I'll see if i can move it to roundtrips/

oh no wait, i remember why I picked snippets/ -- these are docs, they show how to override tags in different scenarios! but yeah the presence of the testing code in the rust version kinda sucks, hmm...

I'll see if i can "hide" the testing code in a separate function at the very least

Co-authored-by: Andreas Reich <andreas@rerun.io>

@Wumpf

Implemented with the help of @Wumpf. This semantically mimics very closely the way things are done in Rust, minus all technical differences due to the differences between both the languages and the SDKs. For that reason, everything stated in #8304 (comment) basically applies as-is. Pretty happy about it, I must say. * DNM: requires #8304 * Part of #7948 --------- Co-authored-by: Andreas Reich <andreas@rerun.io>

This semantically mimics very closely the way things are done in Rust, minus all technical differences due to the differences between both the languages and the SDKs. For that reason, everything stated in #8304 (comment) basically applies as-is. Pretty happy about it, I must say. * DNM: requires #8316 * Part of #7948

teh-cmc changed the title ~~End-to-end tagging for everything Rust~~ End-to-end tagging: Rust Dec 3, 2024

teh-cmc force-pushed the cmc/end_to_end_tags_rust branch 2 times, most recently from 83a1662 to c515636 Compare December 3, 2024 16:59

teh-cmc marked this pull request as ready for review December 3, 2024 17:18

teh-cmc marked this pull request as draft December 3, 2024 17:18

teh-cmc force-pushed the cmc/end_to_end_tags_rust branch from c515636 to 05f5cb0 Compare December 3, 2024 17:29

rerun-io deleted a comment from github-actions bot Dec 3, 2024

teh-cmc added enhancement New feature or request ⛃ re_datastore affects the datastore itself 🔍 re_query affects re_query itself 🔩 data model 🪵 Log & send APIs Affects the user-facing API for all languages include in changelog labels Dec 3, 2024

teh-cmc marked this pull request as ready for review December 3, 2024 17:54

teh-cmc requested a review from Wumpf December 4, 2024 11:45

teh-cmc added the 🦀 Rust API Rust logging API label Dec 4, 2024

teh-cmc mentioned this pull request Dec 4, 2024

End-to-end tagging: C++ #8316

Merged

teh-cmc force-pushed the cmc/end_to_end_tags_rust branch 2 times, most recently from a82a5bb to 82a1dae Compare December 4, 2024 14:49

teh-cmc mentioned this pull request Dec 4, 2024

End-to-end tagging: Python #8298

Merged

teh-cmc force-pushed the cmc/end_to_end_tags_rust branch from 82a1dae to deb1615 Compare December 5, 2024 15:04

Wumpf approved these changes Dec 6, 2024

teh-cmc and others added 8 commits December 9, 2024 09:58

End-to-end tagging for everything Rust

2726915

disable cross-lang tag comparison, for obvious reasons

595e46c

backporting misc fixes

85c8c02

yep, i give up

0608a64

backports

11a3aab

Update crates/store/re_types_core/src/loggable.rs

c0599ff

Co-authored-by: Andreas Reich <andreas@rerun.io>

MaybeOwnedComponentBatch -> ComponentBatchWithDescriptor

2d93b59

addressing review

75a382d

teh-cmc force-pushed the cmc/end_to_end_tags_rust branch from 94ef347 to d8abb0b Compare December 9, 2024 08:59

teh-cmc added 2 commits December 9, 2024 10:02

no clue wth this is

7c2aee3

more review

772fb23

teh-cmc force-pushed the cmc/end_to_end_tags_rust branch from d8abb0b to 772fb23 Compare December 9, 2024 09:11

cargo fmt losing it?

289cceb

teh-cmc merged commit 67a3cac into main Dec 9, 2024
22 of 25 checks passed

teh-cmc deleted the cmc/end_to_end_tags_rust branch December 9, 2024 09:25
