Resend metadata to PID providers when metadata schema used to register PIDs is modified #5144
FWIW, the /modifyRegistrationPIDMetadataAll and {id}/modifyRegistrationMetadata API calls provide a way to do this that could be included in release instructions. As of now, they send an update whether or not it is needed, so DataCite sees a new update date. For QDR, I've modified these calls to check the existing metadata and targetUrl and only submit updates if there is a difference. Would that update be a useful contribution?
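For anyone following along, here is a rough sketch of what invoking those two calls looks like. The server URL, API token, and PID are placeholders, the exact paths should be verified against your Dataverse version's API guide, and the curl lines are commented out because they need a live installation and a superuser token:

```shell
# Placeholders -- substitute your installation's values.
SERVER_URL="https://localhost:8080"
API_TOKEN="xxxxxxxx-xxxx"
PID="doi:10.7910/DVN/EXAMPLE"   # hypothetical dataset PID

# Per-dataset call: re-send one PID's registration metadata to the provider.
single_url="$SERVER_URL/api/datasets/:persistentId/modifyRegistrationMetadata?persistentId=$PID"

# Superuser call: re-send registration metadata for all published PIDs.
all_url="$SERVER_URL/api/datasets/modifyRegistrationPIDMetadataAll"

echo "$single_url"
echo "$all_url"

# Actual invocations (commented out; require a running server):
# curl -H "X-Dataverse-key: $API_TOKEN" -X POST "$single_url"
# curl -H "X-Dataverse-key: $API_TOKEN" -X POST "$all_url"
```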
@qqmyers to me it sounds like a useful contribution. If it's not too much effort for you to create a pull request, please go ahead.
@pdurbin - looking into it. I've also realized that these API calls are pre-file-PID and don't handle data file updates...
A possible consequence of this issue came up last week when a depositor reported that an Elsevier product named DataMonitor, which harvests Dataverse repository metadata from DataCite, is sometimes unable to determine which files are part of which datasets because some of the metadata that DataCite has about datasets and files in Dataverse repositories doesn't include relationTypes. DataMonitor uses those relationTypes in the DataCite metadata to let its users filter files and datasets when searching for data. (This reminded me of the GitHub issue at #5086.)

In the dataset record that DataCite has at https://search.datacite.org/works/10.7910/dvn/ayxqij and in the records DataCite has for that dataset's files (e.g. https://search.datacite.org/works/10.7910/dvn/ayxqij/6pw7rz), the DataCite XML available on those pages includes relationTypes that indicate which files are part of the dataset. I think that's because that dataset was published on the Harvard Dataverse Repository after the repository started using a Dataverse software update that adds relationTypes to the metadata it sends to DataCite when registering DOIs for datasets and files.

In DataCite's records for the dataset at https://search.datacite.org/works/10.7910/dvn/ai2oxs and for its 105 files (e.g. https://search.datacite.org/works/10.7910/dvn/ai2oxs/pkvu06), the DataCite XML available on those pages doesn't include those relationTypes. I think that's because that dataset was published before the Harvard Dataverse Repository was updated to add relationTypes to the metadata it sends.

It looks like #5505 would also need to be resolved if we're going to use APIs to send updated metadata to DataCite, which would include relationTypes for datasets and files that have DOIs.
It looks like #5505 is about only sending metadata updates to DataCite when there's something new, but the ability to re-send updates at all was added in pull request #5179 and documented at https://guides.dataverse.org/en/5.9/admin/dataverses-datasets.html#send-dataset-metadata-to-pid-provider. So if we want to update a single record like doi:10.7910/DVN/AYXQIJ, we could do that now.
Cool! So if I gathered a list of dataset and file DOIs in a repository, like Harvard's, for which DataCite needed updated metadata, I could use that API endpoint on each DOI? Maybe I could figure out which datasets and files with DOIs in the Harvard repo were published or updated before relationTypes were added to the metadata that's sent to DataCite, then write a script to send the new metadata for those datasets and files to DataCite. It sounds like #5505 would do the work of figuring out which dataset and file metadata needs to be updated in DataCite's database and then send the updated metadata. Is that right, too?
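A script like that could be a simple loop over the gathered list. Here's a hedged sketch: dois.txt and its contents are made up for illustration (in practice the list would come from working out which DOIs predate the relationType change), the endpoint path is the one documented in the admin guide, and the curl line is commented out since it needs a live installation and superuser token:

```shell
SERVER_URL="https://localhost:8080"
API_TOKEN="xxxxxxxx-xxxx"

# Hypothetical inventory file: one persistent identifier per line.
printf 'doi:10.7910/DVN/AI2OXS\ndoi:10.7910/DVN/EXAMPLE\n' > dois.txt

# Re-send registration metadata for each listed PID.
while IFS= read -r pid; do
  echo "Re-sending metadata for $pid"
  # curl -H "X-Dataverse-key: $API_TOKEN" -X POST \
  #   "$SERVER_URL/api/datasets/:persistentId/modifyRegistrationMetadata?persistentId=$pid"
done < dois.txt
```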
Yep, should work.
Right, that's the idea.
Ceilyn and Sonia prioritized and moved to Sprint Ready @jggautier @scolapasta
@pdurbin Final item in this issue is to make certain that the dev guide is updated to indicate that when the metadata exporter is changed, the release notes should let those updating their Dataverse software know that they need to apply those changes to the exports of the datasets that were published before the exporter was changed. Then, this issue can be closed. |
Sounds fine. It doesn't really fit into https://guides.dataverse.org/en/6.2/developers/making-releases.html#write-release-notes as written but I'm sure we'll figure something out. |
Kelly Stathis from DataCite let us know this week that the metadata that DataCite has for about 77,000 DOIs in Harvard Dataverse is in the Schema 3 version of their metadata standard. The first page of results in this DataCite API call shows some of these DOIs, and we can paginate through the results to see them all. Although at the end of that page I see a count of 74,298, so maybe 77k was an older count?

When DataCite deprecates Schema 3 on January 1, 2025, Harvard Dataverse won't be able to send DataCite any updates to the metadata of the 74k+ DOIs for which DataCite still has Schema 3 metadata. I've seen only dataset DOIs, but I'm assuming some of those DOIs point to files within datasets. GitHub issues like #7551 make me think that on January 1, 2025, Harvard Dataverse will prevent the owners of those 74k+ DOIs from creating or publishing new versions, unless Harvard Dataverse sends DataCite the metadata using Schema 4.

The dataset at https://doi.org/10.7910/DVN/BRCBFA was among the 74k+ DOIs in that API call, so apparently the metadata that DataCite had for it was in the Schema 3 version (and not in the Schema 4 version that I saw when I looked at that dataset's "DataCite" export). I was able to use the "Send Dataset metadata to PID provider" Dataverse API endpoint to resend that dataset's metadata, and that DOI was removed from the results of that API call. https://api.datacite.org/dois?query=doi:10.7910/DVN/BRCBFA also shows the update.

Resending metadata so that depositors are able to update their data seems more pressing than the other reasons we've talked about in this and other GitHub issues. @landreev, @pdurbin, @qqmyers and anyone else who knows more about this general issue of updating the metadata that Dataverse installations export and about the recent development work to address them:
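As a side note, one way to spot-check which schema version DataCite currently has for a given DOI is its REST API, which (as I understand it) reports a schemaVersion attribute on each DOI record. A sketch follows, with the live call commented out and a canned response fragment standing in; the attribute name and value shown are assumptions worth verifying against a real response:

```shell
# Live call (commented out; needs network access):
# curl -s "https://api.datacite.org/dois/10.7910/DVN/BRCBFA" > record.json

# Canned stand-in response containing just the field of interest.
cat > record.json <<'EOF'
{"data":{"attributes":{"schemaVersion":"http://datacite.org/schema/kernel-4"}}}
EOF

# Pull out the schemaVersion value without depending on jq.
schema=$(sed -n 's/.*"schemaVersion":"\([^"]*\)".*/\1/p' record.json)
echo "DataCite schema version: $schema"
```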
No one should get stuck. Any edit/publish of a new version would send the latest DataCite version. To update past ones, /modifyRegistrationMetadata should work, and it would be better if you run it on all DOIs, since it will only update if the new XML is different from the XML at DataCite. But I think /modifyRegistration would be lighter weight if you can just call it for the ones you know are bad and that aren't Drafts (which it skips). They are basically the same under the hood except for those differences, as far as I recall. (There are ...All variants of these API calls, but I assume it would be bad to do all of them in one go.)
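To restate that distinction as a sketch (endpoint paths are inferred from the call names in this thread and not verified; the server URL and PID are placeholders):

```shell
SERVER_URL="https://localhost:8080"
pid="doi:10.7910/DVN/EXAMPLE"   # hypothetical

# Diff-checking variant: safe to run across every DOI, since it only pushes
# XML that differs from what DataCite already has.
diff_url="$SERVER_URL/api/datasets/:persistentId/modifyRegistrationMetadata?persistentId=$pid"

# Lighter-weight variant: no diff check, so target it only at DOIs you
# already know are stale (and that aren't drafts, which it skips).
plain_url="$SERVER_URL/api/datasets/:persistentId/modifyRegistration?persistentId=$pid"

echo "$diff_url"
echo "$plain_url"
```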
Thanks, Jim. So, it sounds like the plan should be to deploy 6.3 in our prod. (this should happen within a few days), and then run |
This issue has been Sprint Ready since April. Any reason it can't get picked up for our upcoming sprint? @landreev @jggautier |
@cmbz It has a dependency on the prod. upgrade to 6.3. |
@jggautier FWIW - I thought DataCite was more worried about new v3 registrations being sent. That was happening at some sites because they used their DataCite account with non-Dataverse software. If the Harvard account(s) are used outside of Dataverse, making updates there might be a higher priority. That said, in addition to upping the version, I think we are adding more metadata, license info, etc. that wasn't in the originals, so updating older datasets would improve their findability. (You'd definitely want to do that after the proposed DataCite/OpenAire changes that are hopefully going into 6.4 - maybe that's a reason to delay updating right now?) |
No, it doesn't seem urgent. But also seems like something we should probably do anyway, as a matter of good housekeeping. |
Sounds good @landreev! I think we should just get this done as soon as we can. |
@cmbz you also wrote in April that the remaining task is "to make certain that the dev guide is updated to indicate that when the metadata exporter is changed, the release notes should let those updating their Dataverse software know that they need to apply those changes to the exports of the datasets that were published before the exporter was changed. Then, this issue can be closed." And @pdurbin replied that that "sounds fine. It doesn't really fit into https://guides.dataverse.org/en/6.2/developers/making-releases.html#write-release-notes as written but I'm sure we'll figure something out."

I think we could close this issue after that's figured out, right? How do we say in the dev guides that when a release includes changes to metadata exports, the release notes should encourage folks to update the metadata exports of datasets that were already published in their repositories?

To be honest, I mentioned this "Schema 3" issue in this GitHub issue only because it seemed like another example of the need to make sure that when the metadata schema used to register PIDs is modified, repositories resend that metadata to PID providers. But should I or someone else create a new GitHub issue about this in the Harvard Dataverse repo? Then we can record details of that work there, and we won't lose track of the broader goal of this GitHub issue.
Well, we have an "etc" at https://guides.dataverse.org/en/6.3/developers/version-control.html#writing-release-note-snippets to stand in for any upgrade task that should be mentioned in release note snippets. Perhaps we could add an explicit bullet for "re-export all". |
I'm also curious about how often instructions related to "re-export-all" have been included in previous release notes where a change was made to the metadata sent to PID providers. I don't remember us talking in meetings about how often this has or hasn't happened, and we haven't written about it in this GitHub issue, but I think it'll be useful to know. @pdurbin or others, do you have a sense off-hand about how often instructions related to "re-export-all" have been included in previous release notes? Otherwise I could take a look.

If notes for previous releases often or always included instructions, then this issue might not be resolved only by making sure that the release notes include these instructions when relevant, right? Maybe the release notes haven't been clear? Or maybe individual steps get overlooked when repositories upgrade through multiple Dataverse versions?
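One way to take that look is to grep the release-note files in a checkout of the source tree. A sketch: the doc/release-notes/ layout in the Dataverse repo is an assumption to verify, and the files created below are fabricated stand-ins so the commands are runnable offline:

```shell
# Stand-in for a checkout's release notes directory (the real path in the
# Dataverse repo is assumed to be doc/release-notes/).
mkdir -p release-notes
printf 'Run curl .../api/admin/metadata/reExportAll after upgrading.\n' \
  > release-notes/5.1-release-notes.md
printf 'No metadata export changes in this release.\n' \
  > release-notes/5.2-release-notes.md

# Case-insensitive match on both spellings; -l lists matching files.
grep -rilE 're-?export' release-notes/
```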
@jggautier looks like reexport all was mentioned in 12 recent releases:
|
This is awesome! Thanks @pdurbin! Seeing a list like this makes me think that re-export instructions are already almost always included in release notes 🥳. But maybe that's wrong and there have been releases that include changes to the metadata schema used to send metadata to PID providers and whose release notes don't include re-export instructions. To feel more confident that changes to the dev guide will result in updated metadata more often being sent to PID providers when the metadata schema is changed, maybe we can:
|
To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'. If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment. |
2024/08/23: Reopened. Connected to GREI and already sized and prioritized. |
2024/09/25: This functionality is included in 6.4 and its release notes, therefore we are closing this issue. If it turns out that it is not sufficient, we can open a new issue that specifically addresses how users should perform the update. |
During discussion of GitHub issue #5060, the team agreed to make a separate issue for resending metadata -- which Dataverse had already sent when registering persistent IDs for published datasets and files -- whenever Dataverse changes the metadata schema it uses to register those PIDs.
Currently (as of Dataverse 4.9.4), Dataverse should be sending new metadata to PID providers when:
It's not sending new metadata of already-published datasets unless new versions of those datasets are published. For example, when Dataverse adds related publication information (i.e. the relationship between a dataset and articles) to its DataCite metadata, DataCite will get this new metadata only for newly published datasets and newly published versions of already published datasets. But DataCite won't know about the related publications of already published datasets for which new dataset versions will never be published.
In this example, the metadata that DataCite has for all Dataverse datasets will need to be updated, even for already published datasets that won't be getting a new version.