Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: EDM4HEPSchema and Newstyle FCCSchema #1245

Merged
merged 38 commits into from
Mar 21, 2025

Conversation

prayagyadav
Copy link
Contributor

@lgray @davidlange6 @gomber

Here is a clean draft for the EDM4HEPSchema (edm4hep1) and FCCSchema based on the same. I have not yet managed to add many comments and descriptions, but I plan to add them eventually.

Workings:

  • The EDM4HEPSchema reads the edm4hep.yaml file from the assets directory. I felt this was necessary to add maximum functionality to the schema. Reading the specifications of all the 'components' and 'datatypes' from the yaml file helps to identify the 'members' (example, energy is a member of edm4hep::ReconstructedParticle datatype) and which members correspond to the various types of cross-branch relations in EDM4HEP: vector members, OneToOneRelations, OneToManyRelations and Links.

  • The Schema fetches the comments in the edm4hep.yaml file and assigns them as docstrings to the relevant branches.

  • The EDM4HEPSchema supports all these relations (With Links needing some manual boilerplate code from the user).

  • The version of the edm4hep.yaml file used is here. Please note that the way Links are represented in EDM4HEP has changed in the latest commit. @tmadlener can comment more on this. In any case, it seems necessary to find a way to track the changes from edm4hep.yaml, so that the COFFEA EDM4HEPSchema does not become obsolete after a few version changes.

Link to example Notebooks:

Tests:

  • The test sample for EDM4HEP was obtained from here (which, I believe, was pointed out by @tmadlener).
  • The test sample for the Newstyle (edm4hep1) FCC was provided by @davidlange6
  • Both the schema pass the basic tests (the tests are similar to the ones that I wrote for the oldstyle (pre-edm4hep1) FCCSchema)

Other comments:

  • I understand that this commit includes a lot of code, and its purpose may not be immediately clear, but I can provide a detailed explanation for specific snippets if needed.
  • While I have tried to add most of the functionality to the schema, I cannot figure out how to add the ExtraCode sections mentioned in edm4hep.yaml. They appear to be declarations for C++ methods specific to certain collections.

@prayagyadav prayagyadav changed the title EDM4HEPSchema and Newstyle FCCSChema feat: EDM4HEPSchema and Newstyle FCCSChema Jan 15, 2025
@lgray
Copy link
Collaborator

lgray commented Jan 15, 2025

can you rebase all the commits into 1 so that that 12MB file isn't into the repo history at all? Thanks!

@prayagyadav
Copy link
Contributor Author

can you rebase all the commits into 1 so that that 12MB file isn't into the repo history at all? Thanks!

Done

@prayagyadav prayagyadav changed the title feat: EDM4HEPSchema and Newstyle FCCSChema feat: EDM4HEPSchema and Newstyle FCCSchema Jan 21, 2025
@lgray
Copy link
Collaborator

lgray commented Jan 21, 2025

@prayagyadav is there anything more you want to do on this PR?

@prayagyadav
Copy link
Contributor Author

yeah, gotta sort out some features which I missed. Also, I have to add more comments ...

@lgray
Copy link
Collaborator

lgray commented Jan 21, 2025

OK - ping me when you are done! It's otherwise looking good so far.

@lgray
Copy link
Collaborator

lgray commented Feb 27, 2025

@prayagyadav you should be able to determine the C++ just from the file metadata right? If so if you dig through uproot a bit you can extract it without needing to resort to materializing an array.

Otherwise let's wait for that PR to go through and an uproot release.

Also @nsmith- had some interesting things to say about dealing with streamers when we were talking yesterday. May help this out.

@prayagyadav
Copy link
Contributor Author

@lgray Using this script that I found, I can list out all the relevant metadata for a given EDM4HEP based file.

I found out that some generations of FCC samples have the typename info in the podio_metadata/events___CollectionTypeInfo._1 location in the podio_metadata tree.

Unfortunately, the older generations like the Spring2021 campaign do not have the typename info in metadata.

Another issue is that the collection typenames are incorrect! Maybe it's just the problem with this script, but I am not sure.

In the long term, it would be very beneficial to get access to the Collection IDs which are stored in the metadata. Is there a way to access the podio_metadata tree at runtime, within a schema?

image

@tmadlener
Copy link

I found out that some generations of FCC samples have the typename info in the podio_metadata/events___CollectionTypeInfo._1 location in the podio_metadata tree.

Regarding the format of this metadata: I would like to point out that they are technically not part of the guarantees we give for stability as we do in other places, see also the documentation, or a concrete example of how we want to change this AIDASoft/podio#711

If you want to have the ground truth of the generation you can get to the actual full definition of the datamodel as JSON encoded string from podio_metadata.EDMDefinitions.

Another issue is that the collection typenames are incorrect! Maybe it's just the problem with this script, but I am not sure.

Maybe this is because the script gives you the names of the classes in the user-layer (see documentation), while what actually lands in the files comes from the POD layer. So if you see vector<XYZData> instead of XYZCollection that is most likely the reason for that.

@davidlange6
Copy link

@tmadlener - instead the output of the script seem shuffled? I guess Particle is not really a UserDataCollection<float> under the hood, is it?

@tmadlener
Copy link

Ah right, didn't catch that, sorry. From a quick look, I think this zip here is the issue, because there is no guarantee that the order in the idTable that links collection ids and collection names is the same as the one in CollectionTypeInfo. However, the CollectionTypeInfo has a field with the collection ID, so it can be matched. The tuple is defined here.

So Looking at events___CollectionTypeInfo._0 and matching that with the collection ID retrieved from events___idTable should give you the correct type from events___CollectionTypeInfo._1.

@lgray
Copy link
Collaborator

lgray commented Mar 10, 2025

It seems with the latest uproot we're working rather well?

@prayagyadav
Copy link
Contributor Author

Yeah. I think I have everything in place: the improved tests, the docstrings and the comments. Seems ready for a merge.

@lgray
Copy link
Collaborator

lgray commented Mar 10, 2025

how large are the yaml files that you've included in assets?

@lgray
Copy link
Collaborator

lgray commented Mar 10, 2025

Ah - only seems to be 10s of kb. Not worth compressing.

@prayagyadav
Copy link
Contributor Author

@lgray I have incorporated the changes suggested by you and @nsmith- . Please let me know if you find more issues in the code. Thanks

@prayagyadav
Copy link
Contributor Author

@lgray Hi, Do you have an update for the PR? The current status of the PR looks good to me. Let me know if some section of the code needs improvement. Thanks

@lgray
Copy link
Collaborator

lgray commented Mar 18, 2025

Hi @prayagyadav I've been taking care of issues elsewhere in the ecosystem, sorry! I'll try to review your PR today or tomorrow. A quick look seems to be in decent shape, but let me see.

@lgray
Copy link
Collaborator

lgray commented Mar 21, 2025

Code itself looks in good shape, is commented well as to function when it is unclear and is well tested.

Thanks for the contribution!

@lgray lgray merged commit f357cdc into scikit-hep:master Mar 21, 2025
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants