Mint DOIs for Datasets with Handles in dataverse.harvard.edu #4

Closed
djbrooke opened this issue Feb 6, 2019 · 30 comments

@djbrooke
Contributor

djbrooke commented Feb 6, 2019

We have this endpoint:

http://guides.dataverse.org/en/latest/admin/dataverses-datasets.html#mint-new-pid-for-a-dataset

We should use it to mint DOIs for datasets in dataverse.harvard.edu with Handles in support of Make Data Count in #4821.
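A minimal sketch of calling that endpoint from a script. The path `/api/admin/$dataset-id/reregisterHDLToPID` and the `X-Dataverse-key` header are assumptions based on the linked guide section, so check them against the guide for your installed version. The helper name, server URL, dataset id and token below are hypothetical placeholders. The function only builds the request; nothing is sent until `urlopen` is called.

```python
import urllib.request


def build_mint_request(server, dataset_id, api_token):
    """Build (but do not send) the POST request that asks Dataverse to mint a new PID.

    Assumption: the admin endpoint is /api/admin/{dataset_id}/reregisterHDLToPID
    and the superuser API token goes in the X-Dataverse-key header.
    """
    url = f"{server.rstrip('/')}/api/admin/{dataset_id}/reregisterHDLToPID"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"X-Dataverse-key": api_token},
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    req = build_mint_request("http://localhost:8080", 42, "xxxxxxxx-xxxx")
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```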

@landreev landreev self-assigned this Feb 11, 2019
@landreev
Collaborator

landreev commented Feb 12, 2019

This is how it works. We take a dataset with a Handle identifier:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/M543V1
[Screenshot]
Then we run the DOI assignment API, and the dataset now has the new global identifier:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/C7Z0HI
[Screenshot]
Both URLs still work. File downloads work too: the files are still stored in the directory named after the handle, and we added some code that looks for them in the right place.

The citation always shows the DOI, regardless of whether you used the DOI or the handle to get to the page.

The handle only appears in the metadata tab, here:
[Screenshot]

Is this how we wanted it to work/look? (we didn't want that handle to appear more prominently somehow, at the top of the page, did we?)

@scolapasta

scolapasta commented Feb 12, 2019

Yes, this is what we had decided. Looks good, seems ready for CR.

@landreev
Collaborator

The update job is still running. The script sleeps for a few seconds between registration calls, so that we don't flood DC with requests.
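A rough sketch of that kind of throttled batch loop, not the actual script. The function name, the `register` callable and the 3-second default pause are all hypothetical. `register` stands in for whatever performs one registration call and is expected to raise on failure.

```python
import time


def register_in_batches(dataset_ids, register, pause_seconds=3.0, sleep=time.sleep):
    """Call register(dataset_id) for each id, pausing between calls.

    Returns (succeeded, failed) so that failures can be listed and retried later.
    `register` and `pause_seconds` are illustrative assumptions.
    """
    succeeded, failed = [], []
    for index, dataset_id in enumerate(dataset_ids):
        if index > 0:
            # Throttle: wait between calls so the registration service is not flooded.
            sleep(pause_seconds)
        try:
            register(dataset_id)
            succeeded.append(dataset_id)
        except Exception as exc:
            # Keep going; record the failure with its message for later investigation.
            failed.append((dataset_id, str(exc)))
    return succeeded, failed
```

Collecting failures instead of aborting lets the job finish the whole list and leaves a record of the datasets that need a closer look.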

@landreev
Collaborator

Quite a few of the datasets are failing to re-register; trying to understand why.

@landreev
Collaborator

landreev commented Feb 12, 2019

This may be due to missing metadata fields that are mandatory for DOI registration. Take this dataset, for example:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl%3A1902.1%2FEX6E1
Notably, it has no author.

@djbrooke djbrooke assigned djbrooke and scolapasta and unassigned landreev and scolapasta Feb 13, 2019
@djbrooke
Contributor Author

Thanks @landreev for getting most of these registered and for the list of those that were not successful. #5559 has been created to handle all of those that could not be registered for whatever reason. Between this and #5559, we'll unblock the Make Data Count work.

@djbrooke djbrooke transferred this issue from IQSS/dataverse Mar 13, 2019
@djbrooke djbrooke removed their assignment Mar 20, 2019
@djbrooke
Contributor Author

djbrooke commented Apr 2, 2019

When 4.12 (contains the fix for #5559) is on prod, we can finish this up.

@landreev
Collaborator

Earlier today I started a new batch job for the datasets with handles that are still unconverted.

@landreev
Collaborator

Of the 4225 datasets that still had handles, only 5 are still failing to obtain a DOI:

1902.1/01957
1902.1/10344
1902.1/10766
1902.1/11748
1902.1/12641

@jggautier
Collaborator

jggautier commented May 31, 2019

A new version of the dataset 1902.1/01957 is now published with the metadata in the Producer fields removed. @landreev, could you try again to register a DOI for this dataset?

Some of the datasets are missing either a Contact Name or Contact Email. DataCite doesn't require either of these, but

Not sure if this missing Contact metadata is the culprit. 1902.1/01957 has no Contact Name. If it's able to get a DOI, then the missing Contact Name isn't the problem.

@pdurbin
Member

pdurbin commented May 31, 2019

Dataverse does require a Contact Email to create datasets. I'm guessing these datasets were published before a Contact Email was a requirement.

@jggautier as you know, I'm of the opinion that we should simply make Contact Email a required field. EZID didn't require it but DataCite does. It would be a fix for IQSS/dataverse#3839 (thanks for linking to that issue above).

@jggautier
Collaborator

Do you mean you think we should make Contact Name a required field? (Dataverse already requires a Contact Email.)

How can I tell that DataCite requires Contact Name (or Contact Email)? None of the DataCite schema documentation lists those as required fields.

@pdurbin
Member

pdurbin commented May 31, 2019

@jggautier bah! Sorry, I meant Contact Name. The easiest way to exercise the bug is to simply delete the Contact Name (which is auto-populated) and try to publish the dataset. I just tried this on the demo site and I was a little surprised to see that it published just fine. Since you have a superuser account, maybe you could try this in production and "destroy" the dataset afterwards. Basically, I'm wondering if IQSS/dataverse#3839 is still a bug or not. It's hard to tell from the demo site.

@jggautier
Collaborator

@pdurbin I published a dataset on Harvard Dataverse without Contact Name metadata. I had to delete it, but I'm sure other real datasets have been published without a Contact Name, too. To be honest, I only brought this bug up in this issue on the chance that it might somehow be related (feels like I'm grasping at straws).

@scolapasta

I reran the API for 1902.1/01957 and it now has a DOI. The others will take more investigation.

@scolapasta

I figured it out!

The issue is that the name column for some datavariables is empty (not null, but the empty string '').

This query:
select df.owner_id, df.id, count(*)
from dvobject df, datatable dt, datavariable dv
where df.id=dt.datafile_id
and dt.id=dv.datatable_id and dv.name =''
group by df.owner_id, df.id order by df.owner_id;

returns 7 files, all of which belong to the 4 datasets above.
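To see what the query above reports, here is a small self-contained sketch that runs it against an in-memory SQLite database. The toy schema contains only the columns the query references, and the rows are invented for illustration. The production database is not SQLite, so this only demonstrates the shape of the result: one row per file, with the owning dataset id and the number of variables whose name is the empty string.

```python
import sqlite3

# The query from the comment above, unchanged apart from whitespace.
QUERY = """
select df.owner_id, df.id, count(*)
from dvobject df, datatable dt, datavariable dv
where df.id = dt.datafile_id
  and dt.id = dv.datatable_id and dv.name = ''
group by df.owner_id, df.id order by df.owner_id;
"""


def make_toy_db():
    """Create a tiny, made-up database with only the columns the query needs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table dvobject (id integer primary key, owner_id integer);
        create table datatable (id integer primary key, datafile_id integer);
        create table datavariable (id integer primary key, datatable_id integer, name text);
        insert into dvobject values (10, 1), (11, 1), (20, 2);
        insert into datatable values (100, 10), (101, 11), (200, 20);
        insert into datavariable values
            (1, 100, 'age'), (2, 100, ''), (3, 100, ''),
            (4, 101, 'income'),
            (5, 200, '');
    """)
    return conn


def files_with_unnamed_variables(conn):
    """Return (dataset_id, file_id, unnamed_variable_count) rows."""
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    for row in files_with_unnamed_variables(make_toy_db()):
        print(row)
```

Note that `dv.name = ''` matches empty strings only. Rows where the name is actually NULL would need an additional `dv.name is null` condition.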

@scolapasta

So, todos:

  1. Decide what to do with these. Is there a name we can give each variable? If not, should we "uningest" and possibly reingest these files? (Will this affect the UNF for the published ones?)

  2. Before we fix, we may want to update our admin validator code to get the files and variables for each dataset (though that may make the one that runs on the whole db take significantly longer).

Let's discuss this week either after standup or at backlog grooming.

@scolapasta scolapasta removed their assignment Jul 16, 2019
@djbrooke
Contributor Author

djbrooke commented Jul 16, 2019

  1. Give a generic variable name (this will allow us to successfully mint DOIs and implement Make Data Count in Harvard Dataverse)
  2. Extend validation API to optionally validate down to the variable level (#6026)

@landreev
Collaborator

OK, the affected variables have been given names of the type "varNN" where NN is the variable order in the datatable.
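A sketch of that naming rule, not the actual fix that was run. It assumes each variable record carries its position in the datatable under a key called `fileorder` (a hypothetical name here) and that the number is appended without zero padding. The comment above does not specify either detail.

```python
def fill_missing_names(variables):
    """Return copies of the variable records with empty names replaced by 'varNN'.

    variables: list of dicts for one datatable, each with
      'fileorder' (int, position in the datatable; assumed key name) and
      'name' (str or None).
    Existing non-empty names are left untouched.
    """
    renamed = []
    for var in variables:
        name = var["name"]
        if name is None or name.strip() == "":
            # Assumption: no zero padding, e.g. order 7 becomes "var7".
            name = f"var{var['fileorder']}"
        renamed.append({**var, "name": name})
    return renamed
```

This sketch does not check whether a generated name collides with a name already used in the same datatable, which would be worth verifying before applying anything like it to real data.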

@landreev
Collaborator

Finally, the last 4 handle datasets have been assigned DOIs.
The datasets:

| DBID | Handle | DOI |
|---|---|---|
| 61008 | 1902.1/10344 | 10.7910/DVN/8ZSDCH |
| 2669292 | 1902.1/10766 | 10.7910/DVN/SIWH9F |
| 54917 | 1902.1/11748 | 10.7910/DVN/E4TERD |
| 56380 | 1902.1/12641 | 10.7910/DVN/AULYGY |

@scolapasta I'm moving this straight to QA, since there's nothing to review.
@kcondon I guess the only QA left for this is to verify that there are no datasets left with handles for the primary identifier.

@landreev landreev removed their assignment Jul 19, 2019
@kcondon kcondon self-assigned this Jul 22, 2019
@kcondon
Contributor

kcondon commented Jul 22, 2019

Checked the dvobject table for non-DOIs, all clear. Also eyeballed the alternativepersistentidentifier table.
