Dataverse repoprovider and URLs #900
caveat: I need to read more about Dataverse and think about your ideas.

All other URLs that a BinderHub understands follow the same general pattern, so my naive expectation was that for Dataverse(s) we'd have something like https://mybinder.org/v2/dataverse/10.7910/DVN/MSIMRE which would do the same thing: fetch all the files associated with the DOI and hand the directory over to repo2docker. In this case it seems the DOI I found by clicking on the first thing I could find on the Harvard Dataverse isn't a great example, because it only contains PDFs.

This means we always operate on a group of files in a directory. That is "the shareable unit" in Binder-land. What is the use-case for wanting just one file?

An important thing to note is that an empty directory, or one containing no special files recognised by repo2docker, is still a valid directory to be using. You will still get a notebook UI etc, but you will be missing any extra libraries you might have specified.
Sure, and if real-time communication helps, I'm fine with that.
It might be cleaner to treat a DOI as a DOI, without having the provider in the URL. A DOI may be hosted on one repository software one year but then get migrated to another repository software in the future. Once the DOI resolves to a specific host, you can probably do a few tests to figure out if the DOI is being hosted by Zenodo, Dataverse, DSpace, Fedora Commons, CKAN, OSF (#216), or whatever system.
Yeah, a better example might be https://mybinder.org/v2/dataverse/10.7910/DVN/RLLL1V (or https://mybinder.org/v2/10.7910/DVN/RLLL1V with no "provider" if you like my idea above. 😄 ) That link is a 404 right now, of course, but the DOI is https://doi.org/10.7910/DVN/RLLL1V and resolves to https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/RLLL1V. I'm saying this dataset might be a good one to test with.
I was trying to explain what's possible today with zero changes to Dataverse code. External tools are currently only available at the file level and the "fileId" parameter is required. With the https://mybinder.org/v2/dataverse/?siteUrl=https://dev2.dataverse.org&datasetId=18&fileId=30 example above I at least moved it to the end of the URL. It's noise. It can be safely ignored.

From the Dataverse perspective there is a ton of value in providing external tools at the file level because we are so oriented toward data. Researchers often dig into a particular tabular file, for example, and may use a variety of browser-based external tools to play around with the data a bit as they decide whether or not the file is of interest to them. But you shouldn't worry about this too much. 😄 File level external tools are only relevant because they work today, and if there's agreement with the external tool provider that clicking "Whole Tale" or "Binder" on any file triggers a download of all the files in the dataset, the integration can work now without waiting for Dataverse to ship support for dataset level external tools as part of IQSS/dataverse#5028. Because of this workaround, this agreement that Whole Tale will download all the files from a dataset no matter what file you click, I was (again) able to demo the launching of Jupyter Notebooks from Dataverse recently: https://scholar.harvard.edu/pdurbin/blog/2019/jupyter-notebooks-and-crazy-ideas-for-dataverse

Which brings me back to "the shareable unit is a directory". My expectation would be a button at the dataset level to build the whole thing (https://mybinder.org/v2/dataverse/10.7910/DVN/MSIMRE) and then a button at the file level that links you directly to the file inside that dataset (https://mybinder.org/v2/dataverse/10.7910/DVN/MSIMRE?filepath=some-notebook.ipynb). It's good to know that you're ok with query parameters. 😄 Yes, all the files from a Dataverse dataset would go in a directory. Dataverse supports file hierarchy, so the directory could have an arbitrary number of subdirectories, just like you'd get with git. Please keep the questions coming! 😄
I just made pull request IQSS/dataverse#6059 and if it's merged Dataverse users will click "Explore" and then "Binder" (or whatever) and be sent to URLs like https://mybinder.org/v2/dataverse?datasetPid=doi:10.7910/DVN/RLLL1V via the Dataverse external tool manifest I'm using.
@betatim do you have any more questions? 😄
@nuest hi! I see you opened pull request #951 to add Figshare to the UI. Now that jupyterhub/repo2docker#739 has been merged (thanks!) and you're working on docs at jupyterhub/repo2docker#796 (thanks!), do you think you could help with adding Dataverse to the UI as well? I'm happy to answer any questions!
I really like @betatim's idea of having just the DOI in the URL, with no provider segment.
I'd be curious to hear which of those would be the preferred method (or if there are other options I'm missing).
I'd wait a bit longer before introducing that. We could reduce code duplication by making repo2docker a dependency of BinderHub and refactoring the code in r2d so that it can be called as a library. For caching to work we unfortunately need to know, before launching the build pod, what the answer is. Which I think means we can't just forward all DOIs to repo2docker without deciding whether they are supported or not. Needed: a good plan to get us from where we are now (supporting a handful of DOI providers) to that end goal.
I think I've said this before, but I'd like to point to an external discussion here: ropensci-archive/doidata#1. There is no way right now to handle arbitrary DOIs generically. I like the idea of having a master DOI provider, but I don't see what it adds compared to the current detection approach (call the DOI and see which URL we end up with). Just calling the DOI is really pragmatic and saves us from understanding DOI schemas. Re. refactoring repo2docker: I made a quick search only, but I guess there is no "download all files from a DOI" Python library yet. The content providers in repo2docker are a path towards that. Such a library could also be used by BinderHub to pre-check if a DOI is supported...
I appreciate all the chatter on this issue! Would it help if we made the scope of this issue smaller and re-titled it to "Add Harvard Dataverse to UI"? I'd feel bad for the other 47 installations of Dataverse around the world, but if it helps move things forward we could refactor later. When I look at the current list of repoproviders on https://mybinder.org, they are all centralized services, and Harvard Dataverse is also a centralized service. Pull request #951 is about adding another centralized service (Figshare) to Binder. Should we make our lives easier and continue to focus on centralized services in the short term? Later we can think harder about supporting software that allows you to self-host (GitLab, Dataverse, etc.). What do you think?

Update: I lied. Looking closer I can tell that the "Git repository" option allows arbitrary self-hosted Git servers. I still think the rest (GitHub, Zenodo, etc.) are centralized services, and I'm still fine with narrowing the scope of this issue to just Harvard Dataverse if that's easier.
To avoid confusion: with provider I was thinking of the Python class. I was thinking we could save some code by having one provider that all the different DOI items in the drop down menu could use. You give it a DOI, and if it resolves to a URL we support it says "yay". Pondering this even further: the difference between the Figshare and Zenodo providers is mostly in how they extract the provider specific ID; most of the difference between (and complexity of) the providers comes from that step. If we could remove the need to extract the provider specific ID, then a generic DOI provider would be "resolve this DOI; does the resulting hostname look like one we support; if yes, pass the DOI to repo2docker". Which would be nice (I think).
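To make the idea concrete, here is a minimal sketch of such a generic DOI provider check, assuming we resolve the DOI via doi.org and compare the final hostname against a hand-maintained allow-list. The hostnames, function names, and provider labels are illustrative, not BinderHub's actual code:

```python
from urllib.parse import urlparse
from urllib.request import urlopen

# Illustrative allow-list: final landing-page hostname -> content provider name.
SUPPORTED_HOSTS = {
    "zenodo.org": "zenodo",
    "figshare.com": "figshare",
    "dataverse.harvard.edu": "dataverse",
}

def provider_for_url(url):
    """Return the provider name for a resolved landing-page URL, or None."""
    host = urlparse(url).hostname or ""
    for known, provider in SUPPORTED_HOSTS.items():
        # Accept the host itself and any subdomain of it.
        if host == known or host.endswith("." + known):
            return provider
    return None

def provider_for_doi(doi):
    """Resolve a DOI via doi.org and map the final hostname to a provider.

    Network call; urlopen follows the redirect chain for us.
    """
    with urlopen("https://doi.org/" + doi) as resp:
        return provider_for_url(resp.geturl())
```

A call like `provider_for_doi("10.7910/DVN/RLLL1V")` would then answer "yay" (here: `"dataverse"`) or `None` before the build pod is ever launched.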
I don't think there exists anything to do that, precisely because each archive has its own uniquely beautiful interface ;) Back in October 2017 (OMG is that a long time ago) when the idea of "doi2docker" was first suggested I assumed it would be easy because "surely there is a standard way to get the files associated with a DOI". Two years later we are still working on it, but at least we have a few supported archives now :D
To avoid even further confusion I'll respond with code :)
If you're ok with using a "3rd party" service, the above would work for the general case too, provided that we have parity in our content providers. (Zenodo and Figshare are literally at the top of my todo list.)
What I take away from @betatim's last post is: this is a BinderHub issue, so it's primarily about detecting DOIs that repo2docker will support, without knowing/caring how repo2docker will do that. Re. provider specific ID: every Zenodo record has its own DOI. In repo2docker we need the provider specific ID to talk to the API, but I don't think we need a provider specific ID within BinderHub, do we? Re. "Harvard Dataverse vs. Dataverse": if I understand the feature correctly, any Dataverse installation is supported, so I don't see the need to limit it to the Harvard one.
I appreciate all the discussion above. Thank you! 🎉 The title of this issue is currently "Dataverse repoprovider and URLs" but it sounds like there is interest in expanding the scope of this issue to be less about Dataverse and more about detecting which DOIs repo2docker will support. Is that right? If so, can we please update the title of this issue? My understanding is that as of this writing, BinderHub supports DOI providers with a known "prefix" or "registrant code" (for example, 10.5281 for Zenodo).
Our goal is to support more DOI providers.
(Trying to catch up after some time away from work.) I am unsure BinderHub should start keeping lists of supported DOI prefixes. Those are pretty "permanent" but will grow, and they also don't guarantee support AFAICS, because we don't know that these prefixes will only point to Dataverse installations. That is likely what these listed institutions do now, but maybe one of them starts using (or is already using) the same prefix for a journal? So I think we should continue with the current approach of resolving the DOI and seeing where it leads. Re. relying on the Wholetale API for lookup: honestly, no offense, but IMO BinderHub should try to stick to resolving DOIs only. (Although Wholetale probably has more stable finances than BinderHub and can "catch up" with the repositories added to repo2docker?)
Just as a note of explanation: that's precisely what we do... If you call that API with a DOI we don't currently support (like Zenodo), you still get the resolved URL back, so in the worst case scenario you get exactly the same thing, only in JSON rather than as a redirect. It's significantly slower, I'll give you that, because we try other things too. Nevertheless, if you look at the corresponding PR it's all water under the bridge.
Thanks for the explanation @Xarthisius - I think you're doing something very important there, namely providing structured information based on a DOI, something that plain DOI resolution alone doesn't give you.
This is a very good point!
I recently asked a related question at https://www.pidforum.org/t/api-for-determining-which-institution-is-using-a-doi-prefix-registrant-code/687 There I learned that from https://doi.org/ra/10.11587 (for example) you can figure out that DataCite is the registration agency.
Once you know DataCite is the registration agency, you can use DataCite APIs to get additional JSON out. https://api.datacite.org/prefixes/10.11587 (for example) provides JSON about the institution that is using that prefix.
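A small sketch of that two-step lookup, using only the two endpoints named above. The helper names are mine, and the response shape noted in the comment is my understanding of the doi.org `/ra/` endpoint, so treat it as an assumption:

```python
import json
from urllib.request import urlopen

def ra_lookup_url(prefix):
    """doi.org endpoint that reports which registration agency owns a prefix."""
    return "https://doi.org/ra/" + prefix

def datacite_prefix_url(prefix):
    """DataCite API endpoint with JSON about the institution using a prefix."""
    return "https://api.datacite.org/prefixes/" + prefix

def registration_agency(prefix):
    """Ask doi.org which registration agency manages a DOI prefix.

    Network call; the response is assumed to be a JSON list like
    [{"DOI": "10.11587", "RA": "DataCite"}].
    """
    with urlopen(ra_lookup_url(prefix)) as resp:
        return json.load(resp)[0]["RA"]
```

So `registration_agency("10.11587")` should come back `"DataCite"`, and only then would one go on to query `datacite_prefix_url("10.11587")` for the institutional details.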
Does this help? https://doi.org/api/handles/10.11587/ERDG3O
I found this under "Proxy Server REST API" at https://www.doi.org/factsheets/DOIProxy.html#rest-api
@pdurbin it would, if the DOI world weren't nasty and full of corner cases: DOI resolutions can be chained through multiple redirects, so it's always safer to keep resolving until a 200 is reached.
@Xarthisius oh! That extra hop is our fault (Harvard Dataverse's fault). We changed hostnames a while back and should retarget old PIDs like that. I opened an issue about this: IQSS/dataverse.harvard.edu#40 If you want to treat some of these old PIDs as broken and not working, that's fine. It would nudge Dataverse installations to clean house a little bit. They'd be forced to update old DOI records to point them to current hostnames. Or you are welcome to keep following 302 redirects like you're doing now. Whatever works best for you, really! Thanks for working on this!
I just had a quick chat with @betatim about integrating Dataverse with Binder and while there are already some open issues and pull requests about this...
... the intention of this issue is to discuss the details of
Over at IQSS/dataverse#4714 (comment) I described what's possible today with no changes to the Dataverse code.
One can create a binder.json file like this:
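(The manifest itself was lost in transit here; the following is a hedged reconstruction of roughly what such a binder.json could contain, based on my reading of the Dataverse external tools manifest format. Field names and the curly-brace reserved words are assumptions; consult the Dataverse installation guide for the real schema.)

```json
{
  "displayName": "Binder",
  "description": "Run this dataset in Jupyter via mybinder.org.",
  "type": "explore",
  "toolUrl": "https://mybinder.org/v2/dataverse/",
  "toolParameters": {
    "queryParameters": [
      {"siteUrl": "{siteUrl}"},
      {"datasetId": "{datasetId}"},
      {"fileId": "{fileId}"}
    ]
  }
}
```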
And load up that Binder "external tool" into Dataverse like this:
curl http://localhost:8080/api/admin/externalTools -X POST --upload-file mybinder.json
Once the external tool has been loaded into the installation of Dataverse, a "Binder" or "MyBinder" (or whatever) button will appear under the "Explore" drop down.
When users click "MyBinder" they will be taken to URLs like the following:
https://mybinder.org/v2/dataverse/?siteUrl=https://dev2.dataverse.org&datasetId=18&fileId=30
Based on the query parameters for `siteUrl` and `datasetId`, I believe the code at jupyterhub/repo2docker#739 will be able to download all the files from Dataverse. I have a test GitHub repo with a "Launch Binder" button ready to play with: https://github.com/pdurbin/dataverse-irc-metrics
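This is not repo2docker's actual implementation, just a sketch of how those two query parameters could drive a download, assuming Dataverse's native API endpoints `/api/datasets/{id}` for listing a dataset's files and `/api/access/datafile/{id}` for fetching them (check the Dataverse API guide before relying on these paths):

```python
import json
from urllib.request import urlopen

def dataset_listing_url(site_url, dataset_id):
    """Assumed native-API URL for a dataset's metadata and file listing."""
    return f"{site_url}/api/datasets/{dataset_id}"

def datafile_url(site_url, file_id):
    """Assumed download URL for a single data file."""
    return f"{site_url}/api/access/datafile/{file_id}"

def file_ids(site_url, dataset_id):
    """List file ids in the latest version of a dataset (network call).

    Assumes the JSON layout data -> latestVersion -> files -> dataFile.id.
    """
    with urlopen(dataset_listing_url(site_url, dataset_id)) as resp:
        payload = json.load(resp)
    return [f["dataFile"]["id"]
            for f in payload["data"]["latestVersion"]["files"]]
```

With `siteUrl=https://dev2.dataverse.org` and `datasetId=18`, a content provider along these lines would list the file ids and then fetch each `datafile_url(...)` into the build directory handed to repo2docker.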
A few weeks ago I gave a demo of running a Jupyter Notebook against a TSV file in this repo using @whole-tale as an external tool: https://scholar.harvard.edu/pdurbin/blog/2019/jupyter-notebooks-and-crazy-ideas-for-dataverse
The plot I created is shown in the blog post linked above.
The goal is to offer two ways to use Binder with Dataverse: a button at the dataset level and a button at the file level.
I am happy to spin up Dataverse test servers to assist in this effort. At the moment, you can go to https://dev2.dataverse.org/file.xhtml?fileId=30 to see a MyBinder button.