add stripped down schema evolution #341
Conversation
I have one conceptual question: Wouldn't it be easier to store the schema version as an integer on disk? From what I can tell that would simplify quite a few things, since otherwise all comparisons would need some string manipulation first.
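As a rough illustration of the point (purely a sketch; SchemaVersionT and the function names below are assumptions, not podio API), an integer version makes the check a plain comparison, while a string version first has to be parsed:

#include <cstdint>
#include <string>

using SchemaVersionT = uint32_t; // assumed integer alias, for illustration only

// Integer stored on disk: the compatibility check is a plain comparison.
bool needsEvolution(SchemaVersionT onDisk, SchemaVersionT current) {
  return onDisk < current;
}

// String stored on disk: the value has to be parsed before it can be compared.
bool needsEvolutionFromString(const std::string& onDisk, SchemaVersionT current) {
  return static_cast<SchemaVersionT>(std::stoul(onDisk)) < current;
}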
I have also left some smaller comments inline.
What is currently still missing from this? Or rather, how far do we have to take this before moving further work into a follow-up PR? I would like to merge #343 rather soonish. It doesn't lead to any merge conflicts, but it would require a few trivial fixes to the Python import statements here.
Co-authored-by: Thomas Madlener <thomas.madlener@desy.de>
I have rebased this onto the latest version of master, fixed the few merge conflicts, and also addressed some of the pylint issues. I still have to go through some of them to see how to address them best.
From my point of view, the most important change in this PR is the addition of the schemaversion
to the yaml definition and the corresponding writing of it to file. We would still need to add the writing of this schema information for the SIO backend to this PR, but if we can get that merged, we would potentially unblock a few of the EDM4hep pull requests, because they could then, for now, just increment the schema version, even if we do not yet have full schema evolution capabilities.
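As an illustration only (the branch name, leaflist, and helper below are assumptions, not necessarily what this PR does), writing the schema version as a plain integer into the ROOT metadata tree could look roughly like this:

#include "TTree.h"

#include <cstdint>

// Hypothetical helper: store the datamodel schema version as an integer branch
// in the metadata tree, so that readers can compare it without string parsing.
void writeSchemaVersion(TTree& metadataTree, uint32_t schemaVersion) {
  metadataTree.Branch("schemaVersion", &schemaVersion, "schemaVersion/i");
  metadataTree.Fill();
}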
One thing where I am not yet sure about the best approach is the actual generation of the schema evolution code (not yet part of this PR). Here it is part of the general code generation, which has the potential to become very overloaded. Furthermore, at this point it takes only one old definition file, but at some point it should take (schemaversion - 1) files in order to do schema evolution from all old versions. Also, we currently take an evolution file as well, so in principle we would not need the old definition file at all, since the evolution file should already contain all the information (even from several versions).
To address the latter, there are two options at the moment as I see it:
- Either we keep everything as it is in this PR and then iterate in "public", at the cost of being quite unstable for our users until we have figured things out.
- Or we only keep the parts where we are more certain that they are as they should be, and move the other things into a later PR where we can play a bit more with this.
There are also a few things that we need to address regardless of how we decide here (see comments below). @hegner shall I do that or do you want to have a look at them?
src/ROOTReader.cc
Outdated
} else if (m_fileVersion < podio::version::Version{0, 17, 0}) {
  // Files written before podio 0.17 do not store a schema version per collection,
  // so read the old CollectionTypeInfo layout and default the version to 0.
  auto* collInfoBranch = root_utils::getBranch(metadatatree, "CollectionTypeInfo");
  auto collectionInfoWithoutSchema = new std::vector<root_utils::CollectionInfoTWithoutSchema>;
  auto collectionInfo = new std::vector<root_utils::CollectionInfoT>;
  collInfoBranch->SetAddress(&collectionInfoWithoutSchema);
  metadatatree->GetEntry(0);
  for (const auto& [collID, collType, isSubsetColl] : *collectionInfoWithoutSchema) {
    collectionInfo->emplace_back(collID, collType, isSubsetColl, 0);
  }
  createCollectionBranches(*collectionInfo);
  delete collectionInfoWithoutSchema;
  delete collectionInfo;
I think we can omit this bit, because we will deprecate this in any case, so there is no real use in introducing this branch here. The main reason to remove this is that then we can leave the versions in the CMakeLists.txt alone, since that is set by tagging scripts.
Especially since we have #378 now, I don't see any reason to potentially diverge between the version reported by the tag and the one in CMakeLists.txt.
bool needsSchemaEvolution{false};
void* data{nullptr};
void* data_oldschema{nullptr};
Suggested change: replace the three members above with

void* data{nullptr};
SchemaVersionT bufferVersion;
The buffers don't need two versions of data. They should always only exist in one version. Here we simply keep track of which version the buffers are in.
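A minimal sketch of that bookkeeping (the struct and type names are illustrative assumptions, not the exact podio definitions):

#include <cstdint>

using SchemaVersionT = uint32_t; // assumed integer alias

// Illustrative read buffers: the raw data only ever exists in one schema
// version, and bufferVersion records which version that is.
struct ReadBuffers {
  void* data{nullptr};
  SchemaVersionT bufferVersion{0};
};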
It depends on how we want to split up the transformation in the long run.
A possibility is to give it an additional piece of state indicating whether a transformation is still needed, and do it in place.
But doing it in place would mean manipulating the data buffer, right?
In principle, the way I imagine the actual schema evolution, you send a data buffer into the black box and you get one back. Inside this black box we are free to do whatever we want, even simply nothing. However, I would like to avoid the situation where we have data in two different versions hanging around, as that is an easy way to end up in somewhat undefined state(s).
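A sketch of that black-box interface under the same assumptions (names and signature are hypothetical, not the actual podio API): buffers go in, in whatever version they were read, and come back in the current version, with no second copy of the data left lying around.

#include <cstdint>

using SchemaVersionT = uint32_t; // assumed integer alias

struct ReadBuffers {
  void* data{nullptr};
  SchemaVersionT bufferVersion{0};
};

// Hypothetical black box: it may transform the data in place or allocate new
// buffers internally, but the caller only ever sees one set of buffers, in one
// version.
ReadBuffers evolveToCurrent(ReadBuffers buffers, SchemaVersionT currentVersion) {
  if (buffers.bufferVersion == currentVersion) {
    return buffers; // nothing to do
  }
  // ... apply the (generated) schema evolution transformation(s) here ...
  buffers.bufferVersion = currentVersion;
  return buffers;
}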
Yes, the black box approach is what I wanted. At the moment (until the refactoring), however, we do not have good entities to keep track of the data buffers before and after.
To me that is an item that goes onto the todo list.
python/podio/generator_utils.py
Outdated
@@ -70,9 +70,10 @@ def _is_fixed_width_type(type_name):
 class DataType:
   """Simple class to hold information about a datatype or component that is
   defined in the datamodel."""
-  def __init__(self, klass):
+  def __init__(self, klass, schema_version=None):
I think a default value that is a valid value in the generated code would be better, to avoid issues in code generation.
Alternatively, this could be a "global" value that we inject into the generation, similar to a few other values, like we do here:
podio/python/podio_class_generator.py
Lines 197 to 199 in ba1594b
data['package_name'] = self.package_name
data['use_get_syntax'] = self.get_syntax
data['incfolder'] = self.incfolder
Mainly because all datatypes and components will have the same schema version at this point in the generation.
That is unfortunately not the case. I kept it separate here because chained/stacked yaml files may have different versions, and thus I wanted to avoid "globals".
We should not have the default "None" either, since in the future we will require a version to be present, and we want to spot the cases where we forgot it.
I think at least for the generation part having a "global" is fine, because these are always just "global for the current EDM" that is being generated. I.e. extension models use their own schema version here and there is no cross contamination from the upstream EDM.
I didn't think about catching non-defined schema versions more gracefully with the None default value. I think we can keep it. Maybe we should then also add a warning here to make it easier to spot from the output?
I made it explicit now everywhere.
self._write_cmake_lists_file()
self.process_schema_evolution()
This needs to be called earlier. Currently, everything has already been generated by the time this runs, and it seems to be (or at least is supposed to be) filling information that should end up in some of the templates (?)
python/podio_class_generator.py
Outdated
'old_schema_components': [DataType(d) for d in
                          self.old_datamodels_datatypes | self.old_datamodels_components]}
The old_datamodel_datatypes and old_datamodel_components are not filled anywhere, so this will always be empty at the moment.
Yes, this was a bit of the blocking piece last time. Seems I forgot to put it back. :-(
Since the main concern for now is actually writing a schema version to the output files, should we defer this to later as well?
@tmadlener - did you ever look into the SIO crashes after your merge?
Co-authored-by: Thomas Madlener <thomas.madlener@desy.de>
@tmadlener - I still have to fix the pylint messages.
Yeah, flake8 also seems to be slightly less strict locally in my case. I suspect a version mismatch is the cause of this. Here we would need some more tooling to make it easier to have the test environment available locally. But that is a general Key4hep thing, IMHO.
I have seen that before and checked that the versions were the same as on my local PC (and the dotfiles too, of course), and it was still different from the GitHub CI 🤷 Maybe it depends on the Python version, but sometimes even the complaints about imports are different.
@thomas - looks like it now hinges on the current build of SIO in key4hep
Yes. Unfortunately not something that we can fix really easily in this case, since we would need a build with these changes in for those to disappear. It is a bit of a catch-22. However, given that we build EDM4hep and run the tests there, and also that I can run these tests in a consistent environment locally, I think those should be OK to ignore. Overall, I would like to merge #390 before this, in order to make a tag before we merge this, as we haven't done one in quite some time.
#390 just needs the flake8 errors fixed, and then I can deal with it.
Needed to resolve the conflict introduced by tagging the rest of podio
BEGINRELEASENOTES
ENDRELEASENOTES
For the moment this relies only on the ROOT backend. The tooling needed for SIO comes in a separate PR.