Review 12 unpublished datasets with unreserved DOIs, check for duplicates, contact depositors #203

jggautier · 2022-12-02T17:07:02Z

After the recent DataCite outage, I used an API endpoint to see if other datasets in the Harvard repo are unpublished with unreserved DOIs. There were 17, including 5 datasets created on the day of the DataCite outage. I used another endpoint to reserve the DOIs of those 5, published the datasets, and followed up with the depositors who had emailed the support address to let them know their datasets were published.
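For anyone repeating this, here is a minimal sketch of the two calls described above. It assumes the native API paths `/api/pids/unreserved` and `/api/pids/:persistentId/reserve` and the `X-Dataverse-key` header, which is how I recall them from the Dataverse 5.12 API guide ("List Unreserved PIDs" and "Reserve a PID"); please check the guide for your version. The function names, server URL, token, and DOI are placeholders, and the functions only build the requests without sending them.

```python
import urllib.parse
import urllib.request


def unreserved_pids_request(server_url, api_token):
    """Build (but do not send) a GET request that lists unreserved PIDs.

    Assumption: the endpoint path is /api/pids/unreserved.
    """
    url = f"{server_url.rstrip('/')}/api/pids/unreserved"
    return urllib.request.Request(
        url, headers={"X-Dataverse-key": api_token}, method="GET"
    )


def reserve_pid_request(server_url, api_token, pid):
    """Build (but do not send) a POST request that reserves one PID.

    Assumption: the endpoint path is /api/pids/:persistentId/reserve and the
    PID is passed in the persistentId query parameter.
    """
    query = urllib.parse.urlencode({"persistentId": pid})
    url = f"{server_url.rstrip('/')}/api/pids/:persistentId/reserve?{query}"
    return urllib.request.Request(
        url, headers={"X-Dataverse-key": api_token}, method="POST"
    )
```

Passing either request to `urllib.request.urlopen` would send it. As far as I know, both endpoints need an API token with elevated permissions.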

The other 12 unpublished datasets whose DOIs are unreserved were created between 2019 and 2021. Info about them is in Google Sheets.

Since these datasets have been unpublished for a year or longer, we should:

- Search the repository to check that the depositors haven't published their data in another dataset. Depositors sometimes do this when datasets are locked for a long time (https://github.com/IQSS/dataverse-HDV-Curation/issues/402), so I think they might have done that here, and we want to avoid having two datasets published with the same data.
- Email the depositors:
  - If we find datasets with the same data, email the depositors to confirm the duplicate datasets and let them know we'll be deleting the duplicates.
  - For datasets where we can't find duplicates, reserve the datasets' DOIs and let the depositors know that their datasets are still unpublished and that they should be able to publish them when they're ready.
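The duplicate check in the first step could be partly scripted. The sketch below is only an illustration, not what was actually done here: it assumes the Dataverse Search API (`/api/search` with `q`, `type`, and `per_page` parameters), assumes that dataset results carry `name` and `global_id` keys, and treats an exact title match (ignoring case and spacing) as a "possible duplicate". The function names are made up for this example, and a manual look at the files is still needed.

```python
import urllib.parse


def search_url(server_url, title):
    """Build a Search API URL that looks for datasets with this title."""
    query = urllib.parse.urlencode(
        {"q": f'title:"{title}"', "type": "dataset", "per_page": 10}
    )
    return f"{server_url.rstrip('/')}/api/search?{query}"


def normalize(title):
    """Lowercase and collapse whitespace so near-identical titles compare equal."""
    return " ".join(title.lower().split())


def possible_duplicates(search_response, title, own_pid):
    """Return PIDs of other datasets in a parsed search response whose title matches.

    Assumption: the response looks like {"data": {"items": [{"name": ..., "global_id": ...}]}}.
    """
    wanted = normalize(title)
    matches = []
    for item in search_response.get("data", {}).get("items", []):
        pid = item.get("global_id")
        if pid and pid != own_pid and normalize(item.get("name", "")) == wanted:
            matches.append(pid)
    return matches
```

Without an API token the search only returns published datasets, so an unpublished copy of the same data would not show up this way.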

If the depositors don't reply, these unpublished datasets will eventually be included in the Harvard repo curation team's "production cleanup." In that process, the team tries to contact depositors of datasets that have been unpublished for a certain length of time to encourage them to publish, and removes the datasets if we can't get in touch with the depositors.

sbarbosadataverse · 2022-12-02T19:02:19Z

Thanks for capturing this, Julian! Let me know if we need further discussion.


jggautier · 2022-12-05T18:07:10Z

Thanks. I was able to reserve PIDs for 10 of the 12 datasets, after making sure the data hadn't already been published in other datasets.

The spreadsheet includes the URLs of the two datasets whose PIDs I haven't reserved.

For one of those datasets, I see that its data is in a second unpublished dataset that's been submitted for review in a journal's Dataverse collection. I've contacted the depositor (https://help.hmdc.harvard.edu/Ticket/Display.html?id=331263) to ask if one of the deposits can be deleted.
The second dataset has something in its Producer Affiliation field but its Producer Name field is empty. This isn't allowed anymore (Custom Metadata: Allow Dataverse Installations to Define Conditionally Required Fields for Compound Fields dataverse#7606) because of DataCite metadata requirements (Publish Dataset: Silently fail to publish dataset, server log shows facet minlength error. dataverse#7518), so trying to reserve a PID for that dataset returns an error like:
{"status":"ERROR","message":"Problem reserving PID for dataset id #######: Response from postMetadata: 422, DOI 10.7910/dvn/#######: [facet 'minLength'] The value has a length of '0'; this underruns the allowed minimum length of '1'. at line 26, column 0."}
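If more of these turn up, a small helper could flag this kind of failure automatically. This is a rough sketch that assumes the error body always has the shape shown above; the function name and the returned keys are made up for illustration, and the regular expressions are based on this one message only.

```python
import json
import re


def diagnose_reserve_error(body):
    """Pull the upstream HTTP status and failing schema facet out of an error body.

    Returns None when the response is not an error.
    """
    payload = json.loads(body)
    if payload.get("status") != "ERROR":
        return None
    message = payload.get("message", "")
    http = re.search(r"Response from postMetadata: (\d{3})", message)
    facet = re.search(r"\[facet '(\w+)'\]", message)
    return {
        "http_status": int(http.group(1)) if http else None,
        "facet": facet.group(1) if facet else None,
        # A minLength facet with a length of '0' suggests an empty required value.
        "empty_required_value": bool(facet)
        and facet.group(1) == "minLength"
        and "length of '0'" in message,
    }
```

For the message above, this would report status 422, facet `minLength`, and an empty required value, which matches the empty Producer Name field.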

Looks like the depositor emailed Harvard Dataverse support to report that they couldn't publish the dataset (https://help.hmdc.harvard.edu/Ticket/Display.html?id=293853), which was created before Dataverse's "conditionally required fields" update, and in the email @jyuenger rightly guessed that the problem is due to the DataCite metadata issue.

I don't know what to put in the Producer Name field. Maybe the depositor considers themselves to be the "Producer" and didn't fill in the Producer Name field because they've already added their name to other fields (like the Author Name and Contact fields). I've followed up in an email to the depositor to ask.

Hopefully they reply and we can do something to reserve the DOI and publish the dataset (such as adding a Producer Name or deleting what's in the Producer Affiliation field).

jggautier · 2023-01-03T15:14:37Z

The depositor of one of the two remaining datasets replied over the winter break and I was able to remove that unpublished dataset.

Just one dataset to go. I just sent a follow-up email (https://help.hmdc.harvard.edu/Ticket/Display.html?id=293853).

jggautier · 2023-01-09T17:11:18Z

I haven't heard back from the depositor of the last dataset whose DOI was unreserved. Because it's an unpublished dataset, I just removed what was typed in the Producer Affiliation field (the Producer Name field was already empty), re-saved the unpublished dataset, and used the API endpoint to reserve the DOI.

The curation team will probably remove this unpublished dataset eventually since it's pretty old.

I found another dataset whose DOI was unreserved and I was able to use the API endpoint to reserve it. It looks like these unreserved DOI errors don't happen as often as datasets being locked for a long time (https://github.com/IQSS/dataverse-HDV-Curation/issues/345), but I'll be checking every so often to see if any datasets' DOIs aren't reserved and reserve them.

jggautier self-assigned this Dec 9, 2022

jggautier closed this as completed Jan 9, 2023

jggautier mentioned this issue Aug 25, 2023

Publish Dataset: Dataset can sometimes become stuck in publish lock after publishing dataset. IQSS/dataverse#8875

Open
