
add stripped down schema evolution #341

Merged (35 commits, Mar 15, 2023)
Conversation

@hegner (Collaborator) commented Oct 23, 2022

BEGINRELEASENOTES

  • Adding infrastructure for schema evolution
  • Added explicit version tracking to the metadata
  • Data model comparison tool w/ simple heuristics to identify potential omissions / mistakes (e.g. checking for the limits of the ROOT backend)
  • Changed handling of backwards compatibility for the collection info metadata

ENDRELEASENOTES

For the moment this relies only on the ROOT backend. The tooling needed for SIO comes in a separate PR.

@tmadlener (Collaborator) left a comment:

I have one conceptual question: Wouldn't it be easier to store the schema version as an integer on disk? From what I can tell that would simplify quite a few things, since otherwise all comparisons would need some string manipulation first.
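For illustration, a minimal sketch (not part of this PR) of the comparison an integer version would allow; SchemaVersionT, storedSchemaVersion and currentSchemaVersion are names assumed here for the example:

#include <cstdint>
#include <string>

using SchemaVersionT = uint32_t; // illustrative typedef, not podio's

// With an integer stored on disk the check is a plain comparison ...
bool needsEvolution(SchemaVersionT storedSchemaVersion, SchemaVersionT currentSchemaVersion) {
  return storedSchemaVersion < currentSchemaVersion;
}

// ... whereas a string on disk would first have to be parsed before comparing
bool needsEvolutionFromString(const std::string& storedVersion, SchemaVersionT currentSchemaVersion) {
  return static_cast<SchemaVersionT>(std::stoul(storedVersion)) < currentSchemaVersion;
}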

I have also left some smaller comments inline.

@tmadlener (Collaborator) commented:

What is currently still missing from this? Or rather, how far do we need to take this PR before moving the remaining work into a follow-up PR?

I would like to merge #343 rather soonish. It doesn't lead to any merge conflicts, but it would require a few trivial fixes to the Python import statements here.

@tmadlener (Collaborator) left a comment:

I have rebased this onto the latest version of master, fixed the few merge conflicts, and also addressed some of the pylint issues. I still have to go through the remaining ones to see how to best address them.

From my point of view, the most important change in this PR is the addition of the schemaversion to the yaml definition and the corresponding writing of it to file. We would still need to add the writing of this schema information for the SIO backend to this PR, but if we can get that merged, we would potentially unblock a few of the EDM4hep pull requests, because they could then simply increment the schema version for now, even if we do not yet have full schema evolution capabilities.
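As a rough sketch of the "writing of it to file" part for the ROOT backend (this is not the code from the PR; the branch name "SchemaVersion", the tree name and the leaflist are assumptions made for the example), the version could be attached to a metadata tree roughly like this:

#include <cstdint>
#include "TFile.h"
#include "TTree.h"

int main() {
  uint32_t schemaVersion = 2; // value taken from the schemaversion entry of the yaml definition (illustrative)

  TFile* file = TFile::Open("example_metadata.root", "RECREATE");
  auto* metadataTree = new TTree("podio_metadata_sketch", "metadata sketch");
  // store the schema version in its own branch so a reader can check it
  // before touching any collection data
  metadataTree->Branch("SchemaVersion", &schemaVersion, "SchemaVersion/i");
  metadataTree->Fill();
  metadataTree->Write();
  file->Close(); // the file owns and cleans up the tree
  delete file;
  return 0;
}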

One thing where I am not yet sure about the best approach is the actual generation of the schema evolution code (not yet part of this PR). Here it is part of the general code generation, which has the potential to become very overloaded. Furthermore, at this point it takes only one old definition file, but at some point it should take (schemaversion - 1) files in order to support schema evolution from all old versions. Also, we currently take an evolution file as well, so in principle we would not need the old definition file at all, since the evolution file should already have all the information (even from several versions).

To address the latter, there are two options at the moment, as I see it:

  • Either we keep everything as it is in this PR and then iterate in "public", at the cost of being quite unstable for our users until we have figured things out.
  • Or we only keep the parts where we are more certain that they are as they should be and move the other things into a later PR where we can play a bit more with this.

There are also a few things that we need to address regardless of how we decide here (see comments below). @hegner shall I do that or do you want to have a look at them?

Comment on lines 180 to 196
} else if (m_fileVersion < podio::version::Version{0, 17, 0}) {
  // Files written before podio 0.17.0 store the collection info without a schema
  // version, so read the old layout and fill in a schema version of 0
  auto* collInfoBranch = root_utils::getBranch(metadatatree, "CollectionTypeInfo");
  auto collectionInfoWithoutSchema = new std::vector<root_utils::CollectionInfoTWithoutSchema>;
  auto collectionInfo = new std::vector<root_utils::CollectionInfoT>;
  collInfoBranch->SetAddress(&collectionInfoWithoutSchema);
  metadatatree->GetEntry(0);
  for (const auto& [collID, collType, isSubsetColl] : *collectionInfoWithoutSchema) {
    collectionInfo->emplace_back(collID, collType, isSubsetColl, 0);
  }
  createCollectionBranches(*collectionInfo);
  delete collectionInfoWithoutSchema;
  delete collectionInfo;

Collaborator:

I think we can omit this bit, because we will deprecate this in any case, so there is no real use in introducing this branch here. The main reason to remove this is that then we can leave the versions in the CMakeLists.txt alone, since that is set by tagging scripts.

Collaborator:

Especially since we have #378 now. I don't see any reason to potentially diverge on the version reported by the tag and the CMakeLists.txt

Comment on lines +45 to +47
bool needsSchemaEvolution{false};
void* data{nullptr};
void* data_oldschema{nullptr};
Collaborator:

Suggested change:
- bool needsSchemaEvolution{false};
- void* data{nullptr};
- void* data_oldschema{nullptr};
+ void* data{nullptr};
+ SchemaVersionT bufferVersion;

The buffers don't need two versions of data. They should always only exist in one version. Here we simply keep track of which version the buffers are in.
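A hedged sketch of the bookkeeping this suggestion implies; only the data and bufferVersion members come from the suggestion above, the struct name and the helper are made up for illustration:

#include <cstdint>

using SchemaVersionT = uint32_t; // illustrative typedef

// The buffers carry a single data pointer plus the schema version that data is currently in
struct CollectionBuffersSketch {
  void* data{nullptr};
  SchemaVersionT bufferVersion{0};
};

// A reader then only compares bufferVersion against the current schema version
// to decide whether evolution is needed at all
bool buffersNeedEvolution(const CollectionBuffersSketch& buffers, SchemaVersionT currentVersion) {
  return buffers.bufferVersion < currentVersion;
}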

Collaborator Author:

It depends on how we want to split up the transformation in the long run. A possibility is to give it an additional state flag indicating whether a transformation is still needed and then do it in place.

Collaborator:

But doing it in place would mean manipulating the data buffer, right?

In principle, the way I imagine the actual schema evolution is that you send a data buffer into the black box and you get one back. Inside this black box we are free to do whatever we want, even simply nothing. However, I would like to avoid the situation where we have data in two different versions hanging around, as that is an easy way to end up in somewhat undefined states.
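A minimal sketch of that "black box", reusing the illustrative buffer struct from above; none of these names exist in podio and the actual transformation is left out:

#include <cstdint>

using SchemaVersionT = uint32_t; // illustrative typedef

struct CollectionBuffersSketch {
  void* data{nullptr};
  SchemaVersionT bufferVersion{0};
};

// Buffers in some old schema version go in, buffers in the current version come out.
// The box is free to do nothing if the versions already match, or to transform
// (even in place) otherwise; the caller never sees two versions of the data at once.
CollectionBuffersSketch evolveBuffers(CollectionBuffersSketch buffers, SchemaVersionT currentVersion) {
  if (buffers.bufferVersion == currentVersion) {
    return buffers; // nothing to do
  }
  // ... generated, per-type transformation of buffers.data would go here ...
  buffers.bufferVersion = currentVersion;
  return buffers;
}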

Collaborator Author:

Yes, the black box approach is what I wanted. At the moment (until the refactoring) we do not have good entities to keep track of the data buffers before and after, though. To me this is an item that goes on the todo list.

@@ -70,9 +70,10 @@ def _is_fixed_width_type(type_name):
 class DataType:
     """Simple class to hold information about a datatype or component that is
     defined in the datamodel."""
-    def __init__(self, klass):
+    def __init__(self, klass, schema_version=None):
Collaborator:

I think a default value that is also a valid value in the generated code would be better, to avoid issues in the code generation.

Alternatively, this could be a "global" value that we inject into the generation similar to a few other values, like we do here:

data['package_name'] = self.package_name
data['use_get_syntax'] = self.get_syntax
data['incfolder'] = self.incfolder

Mainly because all datatypes and components will have the same schema version at this point in the generation.

Collaborator Author:

That is unfortunately not the case. I kept it separate here, as chained/stacked yaml files may have different versions, and thus I wanted to avoid "globals".
We should not have the default "None" either, since in the future we will require a version to be present, and we want to spot the cases where we forgot one.

Collaborator:

I think at least for the generation part having a "global" is fine, because these are always just "global for the current EDM" that is being generated. I.e. extension models use their schema version here and there is no cross contamination from the upstream EDM.

Didn't think about catching non-defined schema versions more gracefully with the None default value. I think we can keep it. Maybe we should then also add a warning here to make it easier to spot from the output?

Collaborator Author:

I made it explicit now everywhere.


self._write_cmake_lists_file()
self.process_schema_evolution()
Collaborator:

This needs to be called earlier. Currently everything is already generated at this point, and this seems to be (or is at least supposed to be) filling information that should go into some of the templates (?)

Comment on lines 476 to 477
'old_schema_components': [DataType(d) for d in
self.old_datamodels_datatypes | self.old_datamodels_components]}
Collaborator:

The old_datamodels_datatypes and old_datamodels_components are not filled anywhere, so this will always be empty at the moment.

Collaborator Author:

Yes, this was a bit of a blocking piece last time. It seems I forgot to put it back. :-(

Collaborator:

Since the main concern for now is actually writing a schema version to the output files, should we defer this to later as well?

@hegner (Collaborator Author) commented Feb 27, 2023

@tmadlener - did you ever look into the SIO crashes after your merge?


@hegner (Collaborator Author) commented Mar 13, 2023

@tmadlener - I still have to fix the pylint messages.
And strangely enough flake8 behaved differently on my system...

@tmadlener (Collaborator) commented:

Yeah flake8 also seems to be slightly less strict locally in my case. I suspect a version mismatch is the cause for this. Here we would need some more tooling to make it easier to have the test environment available locally. But that is a general Key4hep thing, IMHO.

@jmcarcell (Member) commented:

I have seen that before and checked that the versions were the same as on my local PC (and the dotfiles too, of course), and it was still different from the GitHub CI 🤷. Maybe it depends on the Python version, but sometimes even the complaints about imports are different.

@hegner (Collaborator Author) commented Mar 14, 2023

@thomas - it seems like it now hinges on the current build of SIO in key4hep

@tmadlener (Collaborator) commented:

@thomas - it seems like it now hinges on the current build of SIO in key4hep

Yes. Unfortunately this is not something that we can fix really easily in this case, since we would need a build that already includes these changes for those failures to disappear. It is a bit of a catch-22. However, given that we build EDM4hep and run the tests there, and also that I can run these tests in a consistent environment locally, I think those should be OK to ignore.

Overall, I would like to merge #390 first so that we can make a tag before merging this, as we haven't done one in quite some time.

@hegner (Collaborator Author) commented Mar 14, 2023

#390 just needs the flake8 errors fixed, and then I can deal with it.

@tmadlener (Collaborator) commented:

Needed to resolve the conflict introduced by tagging the rest of podio
