End-to-end tagging: Rust #8304

Merged
merged 11 commits into from
Dec 9, 2024

teh-cmc
Member

@teh-cmc teh-cmc commented Dec 3, 2024

I had to give up on the idea of splitting this thing into neat little PRs -- the enormous amount of extra work needed in this case is just not worth it, it's not even close (turns out changing the definition of Component has cascading consequences 😶).

I'll add a thorough description of what's going on to compensate, and can walk someone through this if needed.


Goals and non-goals

The goal of this PR is to get component tags in, store them, and then get them out.

The goal of this PR is not to port every single bit of component-name based logic to component-descriptor based logic (including but certainly not limited to datastore queries).
That will be the next step: #8293.

Types and traits

First and foremost, this ofc introduces the new ComponentDescriptor type:

/// A [`ComponentDescriptor`] fully describes the semantics of a column of data.
///
/// Every component is uniquely identified by its [`ComponentDescriptor`].
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct ComponentDescriptor {
    /// Optional name of the `Archetype` associated with this data.
    ///
    /// `None` if the data wasn't logged through an archetype.
    ///
    /// Example: `rerun.archetypes.Points3D`.
    pub archetype_name: Option<ArchetypeName>,

    /// Optional name of the field within `Archetype` associated with this data.
    ///
    /// `None` if the data wasn't logged through an archetype.
    ///
    /// Example: `positions`.
    pub archetype_field_name: Option<ArchetypeFieldName>,

    /// Semantic name associated with this data.
    ///
    /// This is fully implied by `archetype_name` and `archetype_field_name`, but
    /// included for semantic convenience.
    ///
    /// Example: `rerun.components.Position3D`.
    pub component_name: ComponentName,
}

Note that this is a Rerun type, not a Sorbet type: i.e. it uses Rerun terminology (archetypes, fields, etc), not Sorbet terminology.
As is now tradition, this terminology gets translated into its Sorbet equivalent when leaving the land of internal Chunks for the land of external RecordBatches and Dataframes.

Components are now uniquely identified by a ComponentDescriptor rather than a ComponentName:

/// A [`Component`] describes semantic data that can be used by any number of [`Archetype`]s.
///
/// Implementing the [`Component`] trait automatically derives the [`ComponentBatch`] implementation,
/// which makes it possible to work with lists' worth of data in a generic fashion.
pub trait Component: Loggable {
    /// Returns the complete [`ComponentDescriptor`] for this [`Component`].
    ///
    /// Every component is uniquely identified by its [`ComponentDescriptor`].
    //
    // NOTE: Builtin Rerun components don't (yet) have anything but a `ComponentName` attached to
    // them (other tags are injected at the Archetype level), therefore having a full
    // `ComponentDescriptor` might seem overkill.
    // It's not:
    // * Users might still want to register Components with specific tags.
    // * In the future, `ComponentDescriptor`s will very likely cover more than Archetype-related tags
    //   (e.g. generics, metric units, etc).
    fn descriptor() -> ComponentDescriptor;

    /// The fully-qualified name of this component, e.g. `rerun.components.Position2D`.
    ///
    /// This is a trivial but useful helper for `Self::descriptor().component_name`.
    ///
    /// The default implementation already does the right thing: do not override unless you know
    /// what you're doing.
    /// `Self::name()` must exactly match the value returned by `Self::descriptor().component_name`,
    /// or undefined behavior ensues.
    //
    // TODO(cmc): The only reason we keep this around is for convenience, and the only reason we need this
    // convenience is because we're still in this weird half-way in-between state where some things
    // are still indexed by name. Remove this entirely once we've ported everything to descriptors.
    #[inline]
    fn name() -> ComponentName {
        Self::descriptor().component_name
    }
}

Component::name still exists for now, as a convenience during the interim (that is, until we propagate ComponentDescriptor to every last corner of the app).
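
To make the relationship between `descriptor()` and `name()` concrete, here is a minimal sketch using simplified stand-in types rather than the real Rerun ones: names are plain `String`s, the `Loggable` supertrait is dropped, and the `Confidence` component with its `user.*` names is entirely hypothetical. It shows a custom component that bakes archetype tags into its own descriptor, with `name()` derived from it by the default implementation.

```rust
// Simplified stand-ins for the types in this PR (assumption: plain `String`s
// instead of `ArchetypeName` / `ArchetypeFieldName` / `ComponentName`).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ComponentDescriptor {
    pub archetype_name: Option<String>,
    pub archetype_field_name: Option<String>,
    pub component_name: String,
}

// Stand-in for the `Component` trait, without the `Loggable` supertrait.
pub trait Component {
    fn descriptor() -> ComponentDescriptor;

    // Default helper: the name is always derived from the descriptor.
    fn name() -> String {
        Self::descriptor().component_name
    }
}

// Hypothetical custom component that always belongs to one archetype.
pub struct Confidence(pub f32);

impl Component for Confidence {
    fn descriptor() -> ComponentDescriptor {
        ComponentDescriptor {
            archetype_name: Some("user.Detection".to_owned()),
            archetype_field_name: Some("confidence".to_owned()),
            component_name: "user.Confidence".to_owned(),
        }
    }
}

fn main() {
    println!("{:?}", Confidence::descriptor());
    println!("{}", Confidence::name());
}
```

Because `name()` is not overridden, it cannot drift from `descriptor().component_name`, which is the invariant the doc comment above insists on.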

MaybeOwnedComponentBatch now has the possibility to augment and/or fully-override the ComponentDescriptor of the data within:

/// Some [`ComponentBatch`], optionally with an overridden [`ComponentDescriptor`].
///
/// Used by implementers of [`crate::AsComponents`] to both efficiently expose their component data
/// and assign the right tags given the surrounding context.
pub struct MaybeOwnedComponentBatch<'a> {
    /// The component data.
    pub batch: ComponentBatchCow<'a>,

    /// If set, will override the [`ComponentBatch`]'s [`ComponentDescriptor`].
    pub descriptor_override: Option<ComponentDescriptor>,
}

This is a crucial part of the story, as this is how e.g. archetypes inject their own tags when component data gets logged on their behalf.

Override model

The override model is simple:

  • Every Component has an associated ComponentDescriptor.
  • Every ComponentBatch inherits from its underlying Component's ComponentDescriptor.
  • AsComponents has an opportunity to override each ComponentBatch's ComponentDescriptor (by means of MaybeOwnedComponentBatch).

The goal is to try and carry those semantics over to the two other SDKs (Python, C++), while somehow keeping changes to a minimum.
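
The three rules above boil down to "use the override if there is one, otherwise fall back to the component's own descriptor". The sketch below illustrates that resolution with stand-in types; `DescribedBatch` and `effective_descriptor` are hypothetical names, and the actual batch payload is left out entirely.

```rust
// Simplified stand-in for the descriptor type (assumption: plain `String`s).
#[derive(Debug, Clone, PartialEq, Eq)]
struct ComponentDescriptor {
    archetype_name: Option<String>,
    archetype_field_name: Option<String>,
    component_name: String,
}

// Hypothetical stand-in for `MaybeOwnedComponentBatch`, minus the data itself.
struct DescribedBatch {
    // What the batch inherits from its underlying `Component`.
    component_descriptor: ComponentDescriptor,
    // What an `AsComponents` implementer (e.g. an archetype) may inject.
    descriptor_override: Option<ComponentDescriptor>,
}

impl DescribedBatch {
    // The override wins when present, otherwise the inherited descriptor is used.
    fn effective_descriptor(&self) -> &ComponentDescriptor {
        self.descriptor_override
            .as_ref()
            .unwrap_or(&self.component_descriptor)
    }
}

fn untagged_position() -> ComponentDescriptor {
    ComponentDescriptor {
        archetype_name: None,
        archetype_field_name: None,
        component_name: "rerun.components.Position3D".to_owned(),
    }
}

fn points3d_positions() -> ComponentDescriptor {
    ComponentDescriptor {
        archetype_name: Some("rerun.archetypes.Points3D".to_owned()),
        archetype_field_name: Some("positions".to_owned()),
        component_name: "rerun.components.Position3D".to_owned(),
    }
}

fn main() {
    let raw = DescribedBatch {
        component_descriptor: untagged_position(),
        descriptor_override: None,
    };
    let tagged = DescribedBatch {
        component_descriptor: untagged_position(),
        descriptor_override: Some(points3d_positions()),
    };
    println!("{:?}", raw.effective_descriptor());
    println!("{:?}", tagged.effective_descriptor());
}
```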

Undefined behavior

Logging the same component multiple times on a single entity (e.g. by logging different archetypes that share parts of their definitions) has always been, for all intents and purposes, UB.

This PR propagates descriptors just enough to get things up and running, no more no less. By which I mean that it is possible to get component tags in and out of the system, but many things still assume that Components are uniquely identified by their names.
This means that some parts of the codebase are still indexing things by name, while others index by descriptor. Where these parts meet, what was UB before is even more UB now, as we generally just pick one random component among the ones available.
You'll see a lot of get_first_component in the code: every single one of those is UB if there are multiple components under the same name (for now!).

Debug builds assert for duplicated components, until we properly use descriptors everywhere (remember: nothing should ever be indexed by ComponentName in the future).
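
As a rough illustration of why name-based lookups are ambiguous, the sketch below stores values in a two-level map (name, then descriptor) and looks one up by name only. Everything here is a stand-in: `BTreeMap` replaces the real map types, the helper names are hypothetical, and this `get_first_component` only mimics the idea of the real one.

```rust
use std::collections::BTreeMap;

// Simplified stand-in for the descriptor type (assumption: plain `String`s).
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct ComponentDescriptor {
    archetype_name: Option<String>,
    archetype_field_name: Option<String>,
    component_name: String,
}

// Name first, then descriptor.
type Components<T> = BTreeMap<String, BTreeMap<ComponentDescriptor, T>>;

fn insert<T>(components: &mut Components<T>, descriptor: ComponentDescriptor, value: T) {
    components
        .entry(descriptor.component_name.clone())
        .or_default()
        .insert(descriptor, value);
}

// Name-only lookup: returns whichever entry happens to come first.
fn get_first_component<'a, T>(components: &'a Components<T>, name: &str) -> Option<&'a T> {
    components.get(name)?.values().next()
}

// The condition the debug assertion guards against.
fn has_ambiguous_names<T>(components: &Components<T>) -> bool {
    components.values().any(|per_desc| per_desc.len() > 1)
}

fn descriptor(archetype: Option<&str>, field: Option<&str>) -> ComponentDescriptor {
    ComponentDescriptor {
        archetype_name: archetype.map(str::to_owned),
        archetype_field_name: field.map(str::to_owned),
        component_name: "rerun.components.Position3D".to_owned(),
    }
}

fn main() {
    let mut components = Components::new();
    insert(&mut components, descriptor(Some("rerun.archetypes.Points3D"), Some("positions")), 1);
    debug_assert!(!has_ambiguous_names(&components));
    println!("{:?}", get_first_component(&components, "rerun.components.Position3D"));
}
```

With a single descriptor per name the lookup is well defined; once a second descriptor shares the name, the result depends purely on map ordering.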

Fully-qualified component names & column paths

ComponentDescriptor defines its fully-qualified name as such:

match (archetype_name, component_name, archetype_field_name) {
    (None, component_name, None) => component_name.to_owned(),
    (Some(archetype_name), component_name, None) => {
        format!("{archetype_name}:{component_name}")
    }
    (None, component_name, Some(archetype_field_name)) => {
        format!("{component_name}#{archetype_field_name}")
    }
    (Some(archetype_name), component_name, Some(archetype_field_name)) => {
        format!("{archetype_name}:{component_name}#{archetype_field_name}")
    }
}

which yields e.g. rerun.archetypes.Points3D:rerun.components.Position3D#positions, which is generally shortened to Points3D:Position3D#positions when there is no ambiguity.
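
The naming rule above can be tried out in isolation with the sketch below. It uses a stand-in descriptor with plain `String` fields; `full_name` follows the `match` shown above, while `short_name` assumes that shortening means keeping only the last `.`-separated segment of each name, which may not be exactly what the real implementation does.

```rust
// Simplified stand-in for the descriptor type (assumption: plain `String`s).
struct ComponentDescriptor {
    archetype_name: Option<String>,
    archetype_field_name: Option<String>,
    component_name: String,
}

// Assumed shortening rule: keep what follows the last `.`.
fn last_segment(name: &str) -> &str {
    name.rsplit('.').next().unwrap_or(name)
}

fn join(archetype: Option<&str>, component: &str, field: Option<&str>) -> String {
    match (archetype, field) {
        (None, None) => component.to_owned(),
        (Some(a), None) => format!("{a}:{component}"),
        (None, Some(f)) => format!("{component}#{f}"),
        (Some(a), Some(f)) => format!("{a}:{component}#{f}"),
    }
}

impl ComponentDescriptor {
    fn full_name(&self) -> String {
        join(
            self.archetype_name.as_deref(),
            &self.component_name,
            self.archetype_field_name.as_deref(),
        )
    }

    fn short_name(&self) -> String {
        join(
            self.archetype_name.as_deref().map(last_segment),
            last_segment(&self.component_name),
            self.archetype_field_name.as_deref(),
        )
    }
}

fn points3d_positions() -> ComponentDescriptor {
    ComponentDescriptor {
        archetype_name: Some("rerun.archetypes.Points3D".to_owned()),
        archetype_field_name: Some("positions".to_owned()),
        component_name: "rerun.components.Position3D".to_owned(),
    }
}

fn main() {
    let descr = points3d_positions();
    println!("{}", descr.full_name());
    println!("{}", descr.short_name());
}
```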

In the dataframe API, a fully-qualified column path now becomes {entity_path}@{archetype_name}:{component_name}#{archetype_field_name}, e.g. /my/points@rerun.archetypes.Points3D:rerun.components.Position3D#positions or /my/points@Points3D:Position3D#positions.

This syntax needs to be debated. I have intentionally disabled the syntax in the dataframe APIs so as not to break anything external-facing.
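
For comparison, here is a small sketch of the two column-path shapes side by side: the existing `{entity_path}:{component}` form that stays enabled, and the proposed `{entity_path}@{descriptor}` form. Both function names are hypothetical, and the inputs are assumed to be already-shortened names.

```rust
// Existing syntax, kept for now: entity path, colon, short component name.
fn legacy_column_path(entity_path: &str, component_short_name: &str) -> String {
    format!("{entity_path}:{component_short_name}")
}

// Proposed syntax, currently disabled: entity path, `@`, descriptor name.
fn tagged_column_path(entity_path: &str, descriptor_name: &str) -> String {
    format!("{entity_path}@{descriptor_name}")
}

fn main() {
    println!("{}", legacy_column_path("/my/points", "Position3D"));
    println!("{}", tagged_column_path("/my/points", "Points3D:Position3D#positions"));
}
```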

Transport and metadata

ArchetypeName and ArchetypeFieldName are now exposed as rerun.archetype_name and rerun.archetype_field_name in TransportChunk's arrow metadata.
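
A minimal sketch of what those metadata entries could look like, assuming the metadata is a plain string-to-string map and that absent tags are simply omitted (both are assumptions; only the two key names come from this PR). The `descriptor_metadata` helper is hypothetical.

```rust
use std::collections::BTreeMap;

// Builds the per-column metadata entries for the two archetype tags.
// Assumption: a missing tag means the key is not written at all.
fn descriptor_metadata(
    archetype_name: Option<&str>,
    archetype_field_name: Option<&str>,
) -> BTreeMap<String, String> {
    let mut metadata = BTreeMap::new();
    if let Some(name) = archetype_name {
        metadata.insert("rerun.archetype_name".to_owned(), name.to_owned());
    }
    if let Some(field) = archetype_field_name {
        metadata.insert("rerun.archetype_field_name".to_owned(), field.to_owned());
    }
    metadata
}

fn main() {
    let metadata = descriptor_metadata(Some("rerun.archetypes.Points3D"), Some("positions"));
    for (key, value) in &metadata {
        println!("{key} = {value}");
    }
}
```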

I really cannot wait for a better metadata system.

Performance

ComponentDescriptors add an extra layer of mappings everywhere: we used to have IntMap<ComponentName, T> all over the place, now we have IntMap<ComponentName, IntMap<ComponentDescriptor, T>>.
The extra ComponentName layer is needed because it is very common to want to look for anything matching a ComponentName, without any further tags specified.

Like before, these are NoHash maps, so performance impact should be minimal (ComponentDescriptor implements NoHash by xor'ing everything).
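
To give an idea of what "xor'ing everything" can mean, the sketch below combines per-field hashes with XOR into a single `u64`. This is only an illustration under assumptions: it hashes strings with the standard library's `DefaultHasher`, whereas the real name types presumably carry precomputed hashes, and `xor_hash` is a hypothetical helper.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Simplified stand-in for the descriptor type (assumption: plain `String`s).
struct ComponentDescriptor {
    archetype_name: Option<String>,
    archetype_field_name: Option<String>,
    component_name: String,
}

fn hash_str(value: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

// XOR of the per-field hashes; a missing tag contributes 0.
fn xor_hash(descriptor: &ComponentDescriptor) -> u64 {
    let archetype = descriptor.archetype_name.as_deref().map_or(0, |s| hash_str(s));
    let field = descriptor.archetype_field_name.as_deref().map_or(0, |s| hash_str(s));
    archetype ^ field ^ hash_str(&descriptor.component_name)
}

fn position(archetype: Option<&str>, field: Option<&str>) -> ComponentDescriptor {
    ComponentDescriptor {
        archetype_name: archetype.map(str::to_owned),
        archetype_field_name: field.map(str::to_owned),
        component_name: "rerun.components.Position3D".to_owned(),
    }
}

fn main() {
    let tagged = position(Some("rerun.archetypes.Points3D"), Some("positions"));
    let untagged = position(None, None);
    println!("{:#x}", xor_hash(&tagged));
    println!("{:#x}", xor_hash(&untagged));
}
```

XOR is cheap and order-independent, at the cost of two identical field hashes cancelling each other out.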

Examples / testing / roundtrips

See:

  • docs/snippets/all/descriptors/descr_builtin_archetype.rs
  • docs/snippets/all/descriptors/descr_builtin_component.rs
  • docs/snippets/all/descriptors/descr_custom_archetype.rs
  • docs/snippets/all/descriptors/descr_custom_component.rs

These snippets play all roles at once, as usual. In particular they make sure that all languages (well, only Rust for now, Python and C++ coming soon) carry all the right tags in all the right situations.



github-actions bot commented Dec 3, 2024

Web viewer failed to build.

Result Commit Link
https://rerun.io/viewer/pr/8304

Note: This comment is updated whenever you push a commit.

@teh-cmc teh-cmc changed the title End-to-end tagging for everything Rust End-to-end tagging: Rust Dec 3, 2024
@teh-cmc teh-cmc force-pushed the cmc/end_to_end_tags_rust branch 2 times, most recently from 83a1662 to c515636 Compare December 3, 2024 16:59
Comment on lines +321 to +325
// NOTE: Uncomment this to expose fully-qualified names in the Dataframe APIs!
// I'm not doing that right now, to avoid breaking changes (and we need to talk about
// what the syntax for these fully-qualified paths needs to look like first).
format!("{}:{}", entity_path, descriptor.component_name.short_name()),
// format!("{entity_path}@{}", descriptor.short_name()),
Member Author

As mentioned, I kept the existing column-path syntax around for now, to avoid breaking anything downstream. TBD.

Comment on lines +669 to +671
let is_tombstone = re_types_core::archetypes::Clear::all_components()
    .iter()
    .any(|descr| descr.component_name == *component_name);
Member Author

This is one of many many many examples where we still have component-name based logic that needs to be migrated to component-descriptor based logic.

The goal of this PR is to get tags in and out, not migrate everything that exists in one go (this PR is big enough as is!).

Comment on lines +40 to +46
#[cfg(debug_assertions)]
for (component_name, per_desc) in chunk.components().iter() {
    assert!(
        per_desc.len() <= 1,
        "Insert Chunk with multiple values for component named `{component_name}`: this is currently UB",
    );
}
Member Author

One day this will be legal, but first we need to remove every single bit of component-name based logic across the entire app.

@teh-cmc teh-cmc marked this pull request as ready for review December 3, 2024 17:18
@teh-cmc teh-cmc marked this pull request as draft December 3, 2024 17:18
@teh-cmc teh-cmc force-pushed the cmc/end_to_end_tags_rust branch from c515636 to 05f5cb0 Compare December 3, 2024 17:29
@rerun-io rerun-io deleted a comment from github-actions bot Dec 3, 2024
@teh-cmc teh-cmc added enhancement New feature or request ⛃ re_datastore affects the datastore itself 🔍 re_query affects re_query itself 🔩 data model 🪵 Log & send APIs Affects the user-facing API for all languages include in changelog labels Dec 3, 2024
@teh-cmc
Member Author

teh-cmc commented Dec 3, 2024

@rerun-bot full-check

@teh-cmc teh-cmc marked this pull request as ready for review December 3, 2024 17:54

github-actions bot commented Dec 3, 2024

Started a full build: https://github.com/rerun-io/rerun/actions/runs/12145398537

///
/// Used by implementers of [`crate::AsComponents`] to both efficiently expose their component data
/// and assign the right tags given the surrounding context.
pub struct MaybeOwnedComponentBatch<'a> {
Member Author

Maybe at this point this deserves a new name? I don't think it matters all that much, all those APIs will break in the most extraordinary ways once we move to eager serialization anyhow...

Member Author

That said, DescribedComponentBatch could be made to match what we use on the Python side, and maybe even the C++ side if we get lucky...

Member

agreed. Maybe simpler ComponentBatchCowWithDesc?

Member Author

By having one opinion more than me, you win the opinion bidding war: ComponentBatchCowWithDescriptor it is

@teh-cmc
Member Author

teh-cmc commented Dec 4, 2024

cargo machete passes locally, and cargo machete --version matches the CI... I dunno man.

@teh-cmc
Member Author

teh-cmc commented Dec 4, 2024

I had to give a tour to @Wumpf so that he could help me with the C++ port, so he's probably the best equipped person to review this at this point.

@teh-cmc teh-cmc requested a review from Wumpf December 4, 2024 11:45
@teh-cmc teh-cmc added the 🦀 Rust API Rust logging API label Dec 4, 2024
@teh-cmc teh-cmc mentioned this pull request Dec 4, 2024
@teh-cmc teh-cmc force-pushed the cmc/end_to_end_tags_rust branch 2 times, most recently from a82a5bb to 82a1dae Compare December 4, 2024 14:49
@teh-cmc teh-cmc force-pushed the cmc/end_to_end_tags_rust branch from 82a1dae to deb1615 Compare December 5, 2024 15:04
Member

@Wumpf Wumpf left a comment

oof, that was a lot :)
ship it!

crates/store/re_chunk/src/chunk.rs (outdated, resolved)
Comment on lines +547 to +548
// TODO(cmc): disgusting, but good enough for now.
(field.name == RowId::descriptor().component_name.as_str()).then_some(column)
Member

what would the clean alternative look like? why not extract the full descriptor from the field's metadata and compare that? I suppose for RowId that's a bit overzealous given that a rowid would never be tagged

Member Author

The clean alternative is that all these things should be working at the new protobuf metadata layer, with proper Descriptor types etc.

///
/// See also:
/// * [`ChunkStore::new`]
pub fn from_log_msgs(
Member

the only difference between this and the thing above is just the error context? couldn't that be done with another with_context call on the result of from_log_msgs?

Member Author

but then from_log_msgs needs to take an iterator of results instead and things get annoying for other people... I'd rather have a bit of duplication and make everyone happy

examples/rust/dna/src/main.rs (outdated, resolved)
examples/rust/dataframe_query/src/main.rs (outdated, resolved)
// * Users might still want to register Components with specific tags.
// * In the future, `ComponentDescriptor`s will very likely cover more than Archetype-related tags
// (e.g. generics, metric units, etc).
fn descriptor() -> ComponentDescriptor;
Member

despite above line of argumentation it does irk me a bit that you can set field names and archetype on the descriptor here. It's almost like a stricter subset of the descriptor would be better here: do we really want to be able to set field names at the component type level?
I do however concur that there may be more stuff that is then very relevant to the component. And a separate type to split those out is likely overkill.

Member Author

It can definitely be useful yes, I've seen it when implementing actual examples.

If you're implementing a custom component and you know you always want it to be part of a specific archetype, you'd rather just do it once in your component definition, rather than having to specify overrides at every log site.

components?.into_iter().collect();
let components: ChunkComponents = components?.into_iter().collect();

{
Member

makes sense but how did this validation end up in here? isn't that more general than send_columns?

Member Author

oO I did not write this code, I'm not sure what's going on

Member Author

I'll just remove it, I'm not sure what that's about 🤷

&rerun::Points3D::new([(1.0, 2.0, 3.0)]).with_radii([0.3, 0.2, 0.1]),
)?;

// When this snippet runs through the snippet comparison machinery, this environment variable
Member

that's a really weird hack to do the checking. Nothing about this is /docs/ anymore :/
wouldn't the older "roundtrip tests" do the same job just as well?

if not can we maybe at least add a short readme to this folder to point out that these are not really snippets for documenting but rather tests?

Member Author

I'll see if i can move it to roundtrips/

Member Author

oh no wait, i remember why I picked snippets/ -- these are docs, they show how to override tags in different scenarios! but yeah the presence of the testing code in the rust version kinda sucks, hmm...

Member Author

I'll see if i can "hide" the testing code in a separate function at the very least

@teh-cmc teh-cmc force-pushed the cmc/end_to_end_tags_rust branch from 94ef347 to d8abb0b Compare December 9, 2024 08:59
@teh-cmc teh-cmc force-pushed the cmc/end_to_end_tags_rust branch from d8abb0b to 772fb23 Compare December 9, 2024 09:11
@teh-cmc teh-cmc merged commit 67a3cac into main Dec 9, 2024
22 of 25 checks passed
@teh-cmc teh-cmc deleted the cmc/end_to_end_tags_rust branch December 9, 2024 09:25
teh-cmc added a commit that referenced this pull request Dec 9, 2024
Implemented with the help of @Wumpf.

This semantically mimics very closely the way things are done in Rust,
minus all technical differences due to the differences between both the
languages and the SDKs.
For that reason, everything stated in
#8304 (comment) basically
applies as-is.

Pretty happy about it, I must say.

* DNM: requires #8304 
* Part of #7948

---------

Co-authored-by: Andreas Reich <andreas@rerun.io>
teh-cmc added a commit that referenced this pull request Dec 9, 2024
This semantically mimics very closely the way things are done in Rust,
minus all technical differences due to the differences between both the
languages and the SDKs.
For that reason, everything stated in
#8304 (comment) basically
applies as-is.

Pretty happy about it, I must say.

* DNM: requires #8316 
* Part of #7948
Labels
🔩 data model enhancement New feature or request include in changelog 🪵 Log & send APIs Affects the user-facing API for all languages ⛃ re_datastore affects the datastore itself 🔍 re_query affects re_query itself 🦀 Rust API Rust logging API