Review 12 unpublished datasets with unreserved DOIs, check for duplicates, contact depositors #203

Closed
jggautier opened this issue Dec 2, 2022 · 4 comments

@jggautier
Collaborator

After the recent DataCite outage, I used an API endpoint to check whether other unpublished datasets in the Harvard repo have unreserved DOIs. There were 17, including 5 datasets created on the day of the DataCite outage. I used another endpoint to reserve the DOIs of those 5, published the datasets, and followed up with the depositors who had emailed support to let them know their datasets were published.
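The thread doesn't name the endpoints used. As a minimal sketch, assuming they are the Dataverse native API's `GET /api/pids/unreserved` and `POST /api/pids/:persistentId/reserve` (both expected to require a superuser API token), the requests could be built like this. The server URL and PID in the usage note are placeholders, and the helper names are hypothetical.

```python
from urllib.parse import quote


def unreserved_pids_url(server_url: str) -> str:
    """URL assumed to list unpublished datasets whose PIDs are not reserved (GET)."""
    return f"{server_url.rstrip('/')}/api/pids/unreserved"


def reserve_pid_url(server_url: str, pid: str) -> str:
    """URL assumed to reserve the PID of one dataset (POST)."""
    # Keep ':' and '/' readable, since PIDs look like 'doi:10.xxxx/ABC/DEF'.
    encoded_pid = quote(pid, safe=":/")
    return (
        f"{server_url.rstrip('/')}/api/pids/:persistentId/reserve"
        f"?persistentId={encoded_pid}"
    )


def auth_headers(api_token: str) -> dict:
    """Header Dataverse uses to carry the API token."""
    return {"X-Dataverse-key": api_token}
```

For example, `reserve_pid_url("https://dataverse.example.org", "doi:10.5072/FK2/EXAMPLE")` gives the URL to POST to with `auth_headers(token)` using any HTTP client.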

The other 12 unpublished datasets whose DOIs are unreserved were created between 2019 and 2021. Information about them is in a Google Sheets spreadsheet.

Since these datasets have been unpublished for a year or longer, we should:

  • Search the repository to check that the depositors haven't published their data in another dataset. Depositors do this sometimes when datasets are locked for a long time (https://github.com/IQSS/dataverse-HDV-Curation/issues/402), so I think they might've done that here, and we want to avoid having two datasets published with the same data.
  • Email the depositors:
    • If we find datasets with the same data, email the depositors to confirm the duplicate datasets and let them know we'll be deleting the duplicates.
    • For datasets where we couldn't find duplicates, reserve the datasets' DOIs and let the depositors know that their datasets are still unpublished and that they should be able to publish them when they're ready.
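The plan in the list above can be sketched as a small decision function. This is only an illustration: the `UnreservedDataset` type, its fields, and the wording of the steps are hypothetical, and the duplicate search itself is assumed to be done by hand in the repository.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UnreservedDataset:
    pid: str
    # PID of another dataset found to hold the same data, if any (hypothetical field).
    duplicate_pid: Optional[str] = None


def next_steps(ds: UnreservedDataset) -> List[str]:
    """Return the follow-up actions for one unpublished dataset with an unreserved DOI."""
    if ds.duplicate_pid is not None:
        return [
            f"email depositor to confirm {ds.pid} duplicates {ds.duplicate_pid}",
            "tell depositor the duplicate will be deleted",
        ]
    return [
        f"reserve DOI for {ds.pid}",
        "tell depositor the dataset is still unpublished and can be published when ready",
    ]
```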

If the depositors don't reply, these unpublished datasets will eventually be included in the Harvard repo curation team's "production cleanup." In that process, the team tries to contact depositors of datasets that have been unpublished for a certain length of time to encourage them to publish, and removes the datasets if the depositors can't be reached.

@sbarbosadataverse

sbarbosadataverse commented Dec 2, 2022 via email

@jggautier
Collaborator Author

jggautier commented Dec 5, 2022

Thanks. I was able to reserve PIDs for 10 of the 12 datasets, after making sure the data hadn't already been published in other datasets.

The spreadsheet includes the URLs of the two datasets whose PIDs I haven't reserved.

  1. For one of those datasets, I see that its data is in a second unpublished dataset that's been submitted for review in a journal's Dataverse collection. I've contacted the depositor (https://help.hmdc.harvard.edu/Ticket/Display.html?id=331263) to ask if one of the deposits can be deleted.

  2. The second dataset has a value in its Producer Affiliation field, but its Producer Name field is empty. This is no longer allowed (see dataverse#7606, "Custom Metadata: Allow Dataverse Installations to Define Conditionally Required Fields for Compound Fields") because of DataCite metadata requirements (see dataverse#7518, "Publish Dataset: Silently fail to publish dataset, server log shows facet minlength error."), so trying to reserve a PID for that dataset returns an error like:
    {"status":"ERROR","message":"Problem reserving PID for dataset id #######: Response from postMetadata: 422, DOI 10.7910/dvn/#######: [facet 'minLength'] The value has a length of '0'; this underruns the allowed minimum length of '1'. at line 26, column 0."}

    Looks like the depositor emailed Harvard Dataverse support to report that they couldn't publish the dataset (https://help.hmdc.harvard.edu/Ticket/Display.html?id=293853), which was created before Dataverse's "conditionally required fields" update, and in the email @jyuenger rightly guessed that the problem is due to the DataCite metadata issue.

    I don't know what to put in the Producer Name field. Maybe the depositor considers themselves to be the "Producer" and didn't fill in the Producer Name field because they've already added their name to other fields (like the Author Name and Contact fields). I've followed up in an email to the depositor to ask.

    Hopefully they reply and we can do something to reserve the DOI and publish the dataset (such as adding a Producer Name or deleting what's in the Producer Affiliation field).
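When reserving PIDs in bulk, a response like the one quoted above could be recognized automatically so that the dataset is set aside for a metadata fix instead of a retry. A minimal sketch, assuming the response body is JSON with the `status` and `message` keys shown in the quoted error; the function name is hypothetical.

```python
import json


def is_empty_value_rejection(response_body: str) -> bool:
    """True if the body looks like DataCite's 'minLength' rejection of an empty value."""
    try:
        payload = json.loads(response_body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    message = str(payload.get("message", ""))
    # Match on the marker text seen in the quoted error message.
    return payload.get("status") == "ERROR" and "facet 'minLength'" in message
```

Matching on the message text is brittle, so treat a `False` result as "unknown" rather than proof that the metadata is fine.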

@jggautier jggautier self-assigned this Dec 9, 2022
@jggautier
Collaborator Author

The depositor of one of the two remaining datasets replied over the winter break and I was able to remove that unpublished dataset.

Just one dataset to go. I just sent a follow-up email (https://help.hmdc.harvard.edu/Ticket/Display.html?id=293853).

@jggautier
Collaborator Author

I haven't heard back from the depositor of the last dataset whose DOI was unreserved. Because it's an unpublished dataset, I removed what was typed in the Producer Affiliation field, re-saved the dataset, and used the API endpoint to reserve the DOI.
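The metadata condition behind this failure, a producer entry with an affiliation but no name, could be checked for before attempting to reserve a DOI. A rough sketch, assuming each producer entry is a dict keyed by the child field names `producerName` and `producerAffiliation`, each holding a dict with a `value` key, as in Dataverse's JSON export of compound fields. The function name is hypothetical.

```python
from typing import Any, Dict, List


def producers_missing_name(producers: List[Dict[str, Any]]) -> List[int]:
    """Return indexes of producer entries that have an affiliation but no name."""

    def text(entry: Dict[str, Any], key: str) -> str:
        # A missing child field or a null value both count as empty.
        value = (entry.get(key) or {}).get("value") or ""
        return str(value).strip()

    return [
        index
        for index, entry in enumerate(producers)
        if text(entry, "producerAffiliation") and not text(entry, "producerName")
    ]
```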

The curation team will probably remove this unpublished dataset eventually since it's pretty old.

I found another dataset whose DOI was unreserved and I was able to use the API endpoint to reserve it. It looks like these unreserved DOI errors don't happen as often as datasets being locked for a long time (https://github.com/IQSS/dataverse-HDV-Curation/issues/345), but I'll be checking every so often to see if any datasets' DOIs aren't reserved and reserve them.
