
add stripped down schema evolution #341

Merged (35 commits, Mar 15, 2023)
Conversation

@hegner (Collaborator) commented Oct 23, 2022

BEGINRELEASENOTES

  • Adding infrastructure for schema evolution
  • Added explicit version tracking to the metadata
  • Data model comparison tool w/ simple heuristics to identify potential omissions / mistakes (e.g. checking for the limits of the ROOT backend)
  • Changed handling of backwards compatibility for the collection info metadata

ENDRELEASENOTES

For the moment this relies only on the ROOT backend. The tooling needed for SIO comes in a separate PR.

@tmadlener (Collaborator) left a comment:

I have one conceptual question: Wouldn't it be easier to store the schema version as an integer on disk? From what I can tell that would simplify quite a few things, since otherwise all comparisons would need some string manipulation first.
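For illustration, a minimal sketch (not part of this PR) of the comparison an integer version would allow; SchemaVersionT, storedSchemaVersion and currentSchemaVersion are names assumed here for the example:

#include <cstdint>
#include <string>

using SchemaVersionT = uint32_t; // illustrative typedef, not podio's

// With an integer stored on disk the check is a plain comparison ...
bool needsEvolution(SchemaVersionT storedSchemaVersion, SchemaVersionT currentSchemaVersion) {
  return storedSchemaVersion < currentSchemaVersion;
}

// ... whereas a string on disk would first have to be parsed before comparing
bool needsEvolutionFromString(const std::string& storedVersion, SchemaVersionT currentSchemaVersion) {
  return static_cast<SchemaVersionT>(std::stoul(storedVersion)) < currentSchemaVersion;
}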

I have also left some smaller comments inline.

@tmadlener (Collaborator) commented:

What is currently still missing from this? Or rather, how far do we need to take this PR before moving the remaining work into a follow-up PR?

I would like to merge #343 rather soonish. It doesn't lead to any merge conflicts, but it would require a few trivial fixes to the Python import statements here.

@tmadlener (Collaborator) left a comment:

I have rebased this onto the latest version of master, fixed the few merge conflicts, and also addressed some of the pylint issues. I still have to go through the remaining ones to see how to best address them.

From my point of view, the most important change in this PR is the addition of the schemaversion to the yaml definition and the corresponding writing of it to file. We would still need to add the writing of this schema information for the SIO backend to this PR, but if we can get that merged, we would potentially unblock a few of the EDM4hep pull requests, because they could then simply increment the schema version for now, even if we do not yet have full schema evolution capabilities.
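As a rough sketch of the "writing of it to file" part for the ROOT backend (this is not the code from the PR; the branch name "SchemaVersion", the tree name and the leaflist are assumptions made for the example), the version could be attached to a metadata tree roughly like this:

#include <cstdint>
#include "TFile.h"
#include "TTree.h"

int main() {
  uint32_t schemaVersion = 2; // value taken from the schemaversion entry of the yaml definition (illustrative)

  TFile* file = TFile::Open("example_metadata.root", "RECREATE");
  auto* metadataTree = new TTree("podio_metadata_sketch", "metadata sketch");
  // store the schema version in its own branch so a reader can check it
  // before touching any collection data
  metadataTree->Branch("SchemaVersion", &schemaVersion, "SchemaVersion/i");
  metadataTree->Fill();
  metadataTree->Write();
  file->Close(); // the file owns and cleans up the tree
  delete file;
  return 0;
}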

One thing where I am not yet sure about the best approach is the actual generation of the schema evolution code (not yet part of this PR). Here it is part of the general code generation, which has the potential to become very overloaded. Furthermore, at this point it takes only one old definition file, but at some point it should take (schemaversion - 1) files in order to support schema evolution from all old versions. Also, we currently take an evolution file as well, so in principle we would not need the old definition file at all, since the evolution file should already have all the information (even from several versions).

To address the latter, there are two options at the moment, as I see it:

  • Either we keep everything as it is in this PR and then iterate in "public", at the cost of being quite unstable for our users until we have figured things out.
  • Or we only keep the parts where we are more certain that they are as they should be and move the other things into a later PR where we can play a bit more with this.

There are also a few things that we need to address regardless of how we decide here (see comments below). @hegner shall I do that or do you want to have a look at them?

Comment on lines 180 to 196
} else if (m_fileVersion < podio::version::Version{0, 17, 0}) {
  // Files written before podio 0.17.0 store the collection info without a schema
  // version, so read the old layout and fill in a schema version of 0
  auto* collInfoBranch = root_utils::getBranch(metadatatree, "CollectionTypeInfo");
  auto collectionInfoWithoutSchema = new std::vector<root_utils::CollectionInfoTWithoutSchema>;
  auto collectionInfo = new std::vector<root_utils::CollectionInfoT>;
  collInfoBranch->SetAddress(&collectionInfoWithoutSchema);
  metadatatree->GetEntry(0);
  for (const auto& [collID, collType, isSubsetColl] : *collectionInfoWithoutSchema) {
    collectionInfo->emplace_back(collID, collType, isSubsetColl, 0);
  }
  createCollectionBranches(*collectionInfo);
  delete collectionInfoWithoutSchema;
  delete collectionInfo;

Collaborator:

I think we can omit this bit, because we will deprecate this in any case, so there is no real use in introducing this branch here. The main reason to remove this is that then we can leave the versions in the CMakeLists.txt alone, since that is set by tagging scripts.

Collaborator:

Especially since we have #378 now. I don't see any reason to potentially diverge on the version reported by the tag and the CMakeLists.txt

Comment on lines +45 to +47
bool needsSchemaEvolution{false};
void* data{nullptr};
void* data_oldschema{nullptr};
Collaborator:

Suggested change:
- bool needsSchemaEvolution{false};
- void* data{nullptr};
- void* data_oldschema{nullptr};
+ void* data{nullptr};
+ SchemaVersionT bufferVersion;

The buffers don't need two versions of data. They should always only exist in one version. Here we simply keep track of which version the buffers are in.
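A hedged sketch of the bookkeeping this suggestion implies; only the data and bufferVersion members come from the suggestion above, the struct name and the helper are made up for illustration:

#include <cstdint>

using SchemaVersionT = uint32_t; // illustrative typedef

// The buffers carry a single data pointer plus the schema version that data is currently in
struct CollectionBuffersSketch {
  void* data{nullptr};
  SchemaVersionT bufferVersion{0};
};

// A reader then only compares bufferVersion against the current schema version
// to decide whether evolution is needed at all
bool buffersNeedEvolution(const CollectionBuffersSketch& buffers, SchemaVersionT currentVersion) {
  return buffers.bufferVersion < currentVersion;
}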

Collaborator Author:

It depends on how we want to split up the transformation in the long run. A possibility is to give it an additional state flag indicating whether a transformation is still needed and then do it in place.

Collaborator:

But doing it in place would mean manipulating the data buffer, right?

In principle, the way I imagine the actual schema evolution is that you send a data buffer into the black box and you get one back. Inside this black box we are free to do whatever we want, even simply nothing. However, I would like to avoid the situation where we have data in two different versions hanging around, as that is an easy way to end up in somewhat undefined states.
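A minimal sketch of that "black box", reusing the illustrative buffer struct from above; none of these names exist in podio and the actual transformation is left out:

#include <cstdint>

using SchemaVersionT = uint32_t; // illustrative typedef

struct CollectionBuffersSketch {
  void* data{nullptr};
  SchemaVersionT bufferVersion{0};
};

// Buffers in some old schema version go in, buffers in the current version come out.
// The box is free to do nothing if the versions already match, or to transform
// (even in place) otherwise; the caller never sees two versions of the data at once.
CollectionBuffersSketch evolveBuffers(CollectionBuffersSketch buffers, SchemaVersionT currentVersion) {
  if (buffers.bufferVersion == currentVersion) {
    return buffers; // nothing to do
  }
  // ... generated, per-type transformation of buffers.data would go here ...
  buffers.bufferVersion = currentVersion;
  return buffers;
}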

Collaborator Author:

Yes, the black box approach is what I wanted. At the moment (until the refactoring) we do not have good entities to keep track of the data buffers before and after, though. To me this is an item that goes on the todo list.

@@ -70,9 +70,10 @@ def _is_fixed_width_type(type_name):
 class DataType:
     """Simple class to hold information about a datatype or component that is
     defined in the datamodel."""
-    def __init__(self, klass):
+    def __init__(self, klass, schema_version=None):
Collaborator:

I think a default value that is also a valid value in the generated code would be better, to avoid issues in the code generation.

Alternatively, this could be a "global" value that we inject into the generation similar to a few other values, like we do here:

data['package_name'] = self.package_name
data['use_get_syntax'] = self.get_syntax
data['incfolder'] = self.incfolder

Mainly because all datatypes and components will have the same schema version at this point in the generation.

Collaborator Author:

That is unfortunately not the case. I kept it separate here, as chained/stacked yaml files may have different versions, and thus I wanted to avoid "globals".
We should not have the default "None" either, since in the future we will require a version to be present, and we want to spot the cases where we forgot one.

Collaborator:

I think at least for the generation part having a "global" is fine, because these are always just "global for the current EDM" that is being generated. I.e. extension models use their schema version here and there is no cross contamination from the upstream EDM.

Didn't think about catching non-defined schema versions more gracefully with the None default value. I think we can keep it. Maybe we should then also add a warning here to make it easier to spot from the output?

Collaborator Author:

I made it explicit now everywhere.


self._write_cmake_lists_file()
self.process_schema_evolution()
Collaborator:

This needs to be called earlier. Currently everything is already generated at this point, and this seems to be (or is at least supposed to be) filling information that should go into some of the templates (?)

Comment on lines 476 to 477
'old_schema_components': [DataType(d) for d in
self.old_datamodels_datatypes | self.old_datamodels_components]}
Collaborator:

The old_datamodels_datatypes and old_datamodels_components are not filled anywhere, so this will always be empty at the moment.

Collaborator Author:

Yes, this was a bit of a blocking piece last time. It seems I forgot to put it back. :-(

Collaborator:

Since the main concern for now is actually writing a schema version to the output files, should we defer this to later as well?

@hegner (Collaborator Author) commented Feb 27, 2023

@tmadlener - did you ever look into the SIO crashes after your merge?


@hegner (Collaborator Author) commented Mar 13, 2023

@tmadlener - I still have to fix the pylint messages.
And strangely enough flake8 behaved differently on my system...

@tmadlener (Collaborator) commented:

Yeah flake8 also seems to be slightly less strict locally in my case. I suspect a version mismatch is the cause for this. Here we would need some more tooling to make it easier to have the test environment available locally. But that is a general Key4hep thing, IMHO.

@jmcarcell (Member) commented:

I have seen that before and checked that the versions were the same as on my local PC (and the dotfiles too, of course), and it was still different from the GitHub CI 🤷. Maybe it depends on the Python version, but sometimes even the complaints about imports are different.

@hegner (Collaborator Author) commented Mar 14, 2023

@thomas - it seems like it now hinges on the current build of SIO in key4hep

@tmadlener (Collaborator) commented:

@thomas - it seems like it now hinges on the current build of SIO in key4hep

Yes. Unfortunately this is not something that we can fix really easily in this case, since we would need a build that already includes these changes for those failures to disappear. It is a bit of a catch-22. However, given that we build EDM4hep and run the tests there, and also that I can run these tests in a consistent environment locally, I think those should be OK to ignore.

Overall, I would like to merge #390 first so that we can make a tag before merging this, as we haven't done one in quite some time.

@hegner (Collaborator Author) commented Mar 14, 2023

#390 just needs the flake8 errors fixed, and then I can deal with it.

@tmadlener (Collaborator) commented:

Needed to resolve the conflict introduced by tagging the rest of podio
