Dataverse repoprovider and URLs #900

Closed · pdurbin opened this issue Jul 15, 2019 · 20 comments

@pdurbin
Contributor

pdurbin commented Jul 15, 2019

I just had a quick chat with @betatim about integrating Dataverse with Binder and while there are already some open issues and pull requests about this...

... the intention of this issue is to discuss the details of

  • What the URL should look like on the Binder side when someone clicks "Binder" from a dataset in Dataverse.
  • What the UI in Binder should look like when someone wants to operate on code and data stored in an installation of Dataverse. Dataverse supports both DOIs and Handles, but starting with just DOIs is certainly fine, as in the mockup below.

[Screenshot: mockup of the Binder UI with a Dataverse DOI option]

Over at IQSS/dataverse#4714 (comment) I described what's possible today with no changes to the Dataverse code.

One can create a binder.json file like this:

{
  "displayName": "MyBinder",
  "description": "Analyze in MyBinder",
  "type": "explore",
  "toolUrl": "https://mybinder.org/v2/dataverse/",
  "contentType": "application/x-ipynb+json",
  "toolParameters": {
    "queryParameters": [
      {
        "siteUrl": "{siteUrl}"
      },
      {
        "datasetId": "{datasetId}"
      },
      {
        "fileId": "{fileId}"
      }
    ]
  }
}

And load up that Binder "external tool" into Dataverse like this:

curl http://localhost:8080/api/admin/externalTools -X POST --upload-file mybinder.json

Once the external tool has been loaded into the installation of Dataverse, a "Binder" or "MyBinder" (or whatever) button will appear under the "Explore" drop-down like this:

[Screenshot: "MyBinder" button under the "Explore" drop-down on a Dataverse file page]

When users click "MyBinder" they will be taken to URLs like the following:

https://mybinder.org/v2/dataverse/?siteUrl=https://dev2.dataverse.org&datasetId=18&fileId=30

Based on the query parameters for siteUrl and datasetId, I believe the code at jupyterhub/repo2docker#739 will be able to download all the files from Dataverse.
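
As a rough sketch (this is not the repo2docker code), downloading everything in a dataset via the native Dataverse API could look like the following; the /api/datasets and /api/access/datafile endpoints are documented in the Dataverse API guide, and the target directory name here is arbitrary:

import os
import requests


def fetch_dataset(site_url, dataset_id, target_dir="dataset"):
    # List the files in the latest version of the dataset.
    resp = requests.get("{}/api/datasets/{}".format(site_url, dataset_id))
    resp.raise_for_status()
    files = resp.json()["data"]["latestVersion"]["files"]
    for entry in files:
        datafile = entry["dataFile"]
        # directoryLabel preserves any file hierarchy within the dataset.
        subdir = entry.get("directoryLabel", "")
        dest = os.path.join(target_dir, subdir, datafile["filename"])
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        # Download each file through the access API.
        content = requests.get(
            "{}/api/access/datafile/{}".format(site_url, datafile["id"]))
        content.raise_for_status()
        with open(dest, "wb") as fh:
            fh.write(content.content)


fetch_dataset("https://dev2.dataverse.org", 18)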

I have a test GitHub repo with a "Launch Binder" button ready to play with: https://github.com/pdurbin/dataverse-irc-metrics

A few weeks ago I gave a demo of running a Jupyter Notebook against a TSV file in this repo using @whole-tale as an external tool: https://scholar.harvard.edu/pdurbin/blog/2019/jupyter-notebooks-and-crazy-ideas-for-dataverse

The plot I created looked something like this:

[Plot: chart of IRC metrics produced by the notebook]

The goal is to offer two ways to use Binder with Dataverse:

  • By entering a Dataverse DOI in Binder, run code, such as a Jupyter Notebook.
  • By clicking "Binder" from a dataset in Dataverse with code and data, run code against the data in Binder.

I am happy to spin up Dataverse test servers to assist in this effort. At the moment, you can go to https://dev2.dataverse.org/file.xhtml?fileId=30 to see a MyBinder button.

@betatim
Member

betatim commented Jul 15, 2019

caveat: need to read more about dataverse and think about your ideas.

All other URLs that a BinderHub understands are of the form https://<hostname>/v2/<provider>/<spec>. For example https://mybinder.org/v2/gh/binder-examples/conda/master or https://mybinder.org/v2/zenodo/10.5281/zenodo.1470939. When you visit one we fetch the files associated with the <spec> from the <provider> and feed the resulting directory to repo2docker which then does its thing.

So my naive expectation was that for dataverse(s) we'd have something like https://mybinder.org/v2/dataverse/10.7910/DVN/MSIMRE which would do the same thing. It fetches all the files associated with the DOI and we hand the directory over to repo2docker. In this case it seems the DOI I found clicking on the first thing I could find on the Harvard dataverse isn't a great example because it only contains PDFs.

This means we always operate on a group of files in a directory. That is "the shareable unit" in Binder-land. What is the use-case for wanting just one file (the fileID parameter)? When I head over to https://dev2.dataverse.org/dataset.xhtml?persistentId=doi:10.5072/FK2/NYNHAM&version=1.1 the "explore" button is at the file level. A notebook by itself isn't super useful because it can't specify the environment in which it needs to be executed. Which brings me back to "the shareable unit is a directory". My expectation would be a button at the dataset level to build the whole thing (https://mybinder.org/v2/dataverse/10.7910/DVN/MSIMRE) and then a button at the file level that links you directly to the file inside that dataset (https://mybinder.org/v2/dataverse/10.7910/DVN/MSIMRE?filepath=some-notebook.ipynb).


An important thing to note is that an empty directory, or one containing no special files recognised by repo2docker, is a valid directory to use. You will still get a notebook UI etc., but you will be missing any extra libraries you might have specified in a requirements.txt. So I think it is OK to feed a directory full of PDFs to repo2docker (especially if you imagine it is actually a bunch of data files).

@pdurbin
Contributor Author

pdurbin commented Jul 16, 2019

caveat: need to read more about dataverse and think about your ideas.

Sure, and if real-time communication would help, I'm fine with that.

All other URLs that a BinderHub understands are of the form https://<hostname>/v2/<provider>/<spec>. For example https://mybinder.org/v2/gh/binder-examples/conda/master or https://mybinder.org/v2/zenodo/10.5281/zenodo.1470939. When you visit one we fetch the files associated with the <spec> from the <provider> and feed the resulting directory to repo2docker which then does its thing.

It might be cleaner to treat a DOI as a DOI without having the provider in the URL. A DOI may be hosted on certain repository software one year but then get migrated to another repository software in the future. Once the DOI resolves to a specific host, you can probably do a few tests to figure out if the DOI is being hosted by Zenodo or Dataverse or DSpace or Fedora Commons or CKAN or OSF (#216) or whatever system.
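
(A quick sketch of that detection idea: resolve the DOI, then probe the landing host. The /api/info/version probe is a real Dataverse endpoint; the Zenodo check is just an illustrative hostname match.)

from urllib.parse import urlparse

import requests


def detect_repository(doi):
    # Follow redirects from doi.org to the landing page.
    resp = requests.get("https://doi.org/{}".format(doi), allow_redirects=True)
    host = urlparse(resp.url).netloc
    if host.endswith("zenodo.org"):
        return "zenodo"
    # Every Dataverse installation exposes a version endpoint.
    try:
        probe = requests.get("https://{}/api/info/version".format(host))
        if probe.ok and probe.json().get("status") == "OK":
            return "dataverse"
    except (requests.RequestException, ValueError):
        pass
    return "unknown"


print(detect_repository("10.7910/DVN/RLLL1V"))  # "dataverse"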

So my naive expectation was that for dataverse(s) we'd have something like https://mybinder.org/v2/dataverse/10.7910/DVN/MSIMRE which would do the same thing. It fetches all the files associated with the DOI and we hand the directory over to repo2docker. In this case it seems the DOI I found clicking on the first thing I could find on the Harvard dataverse isn't a great example because it only contains PDFs.

Yeah, a better example might be https://mybinder.org/v2/dataverse/10.7910/DVN/RLLL1V

(Or https://mybinder.org/v2/10.7910/DVN/RLLL1V with no "provider" if you like my idea above. 😄 )

That link is a 404 right now, of course, but the DOI is https://doi.org/10.7910/DVN/RLLL1V and resolves to https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/RLLL1V

I'm saying this dataset might be a good one to test with because:

  • It has Jupyter Notebooks
  • Unlike many, many datasets in Harvard Dataverse, it went through a curation process for the American Journal of Political Science (AJPS) and includes the following note: "This dataset underwent an independent verification process that replicated the tables and figures in the primary article."

This means we always operate on a group of files in a directory. That is "the shareable unit" in Binder-land. What is the use-case for wanting just one file (the fileID parameter)? When I head over to https://dev2.dataverse.org/dataset.xhtml?persistentId=doi:10.5072/FK2/NYNHAM&version=1.1 the "explore" button is at the file level. A notebook by itself isn't super useful because it can't specify the environment in which it needs to be executed.

I was trying to explain what's possible today with zero changes to Dataverse code. External tools are currently only available at the file level and the "fileId" parameter is required. With the https://mybinder.org/v2/dataverse/?siteUrl=https://dev2.dataverse.org&datasetId=18&fileId=30 example above I at least moved it to the end of the URL. It's noise. It can be safely ignored.

From the Dataverse perspective there is a ton of value in providing external tools at the file level because we are so oriented toward data. Researchers often dig into a particular tabular file, for example, and may use a variety of browser-based external tools to play around with the data a bit as they decide whether or not the file is of interest to them. But you shouldn't worry about this too much. 😄 File-level external tools are only relevant because they work today, and if there's agreement with the external tool provider that clicking "Whole Tale" or "Binder" on any file triggers a download of all the files in the dataset, the integration can work now without waiting for Dataverse to ship support for dataset-level external tools as part of IQSS/dataverse#5028. Because of this workaround (the agreement that Whole Tale will download all the files from a dataset no matter which file you click), I was (again) able to demo launching Jupyter Notebooks from Dataverse recently: https://scholar.harvard.edu/pdurbin/blog/2019/jupyter-notebooks-and-crazy-ideas-for-dataverse

Which brings me back to "the shareable unit is a directory". My expectation would be a button at the dataset level to build the whole thing (https://mybinder.org/v2/dataverse/10.7910/DVN/MSIMRE) and then a button at the file level that links you directly to the file inside that dataset (https://mybinder.org/v2/dataverse/10.7910/DVN/MSIMRE?filepath=some-notebook.ipynb).

It's good to know that you're ok with query parameters. 😄

Yes, all the files from a Dataverse dataset would go in a directory. Dataverse supports file hierarchy so the directory could have an arbitrary number of subdirectories, just like you'd get with git.

Please keep the questions coming! 😄

@pdurbin
Contributor Author

pdurbin commented Jul 26, 2019

I just made pull request IQSS/dataverse#6059, and if it's merged, Dataverse users will click "Explore" and then "Binder" (or whatever) and be sent to URLs like https://mybinder.org/v2/dataverse?datasetPid=doi:10.7910/DVN/RLLL1V

Here's the Dataverse external tool manifest I'm using:

{
  "displayName": "MyBinder",
  "description": "Analyze in MyBinder",
  "type": "explore",
  "scope": "dataset",
  "toolUrl": "https://mybinder.org/v2/dataverse",
  "toolParameters": {
    "queryParameters": [
      {
        "datasetPid": "{datasetPid}"
      }
    ]
  }
}
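
It gets loaded the same way as the file-level tool above (the manifest filename here is arbitrary):

curl http://localhost:8080/api/admin/externalTools -X POST --upload-file mybinder-dataset.json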

@betatim do you have any more questions? 😄

@pdurbin
Contributor Author

pdurbin commented Sep 18, 2019

@nuest hi! I see you opened pull request #951 to add Figshare to the UI. Now that jupyterhub/repo2docker#739 has been merged (thanks!) and you're working on docs at jupyterhub/repo2docker#796 (thanks!), do you think you could help with adding Dataverse to the UI as well? I'm happy to answer any questions!

@Xarthisius
Contributor

I really like @betatim's idea of having just /v2/doi/<spec> instead of extending the dropdown with each individual provider. The latter approach also duplicates a lot of DOI resolution / provider detection code from r2d on the binderhub side, which in the case of Zenodo and Figshare is fairly easy, but for Dataverse is going to be a pain. There are multiple ways of tackling that problem:

  1. Add DataverseProvider with a fairly convoluted .get_resolved_ref (current approach)
  2. Use r2d on a DOI internally in binderhub and see what comes up.
  3. Use an external service via HTTP request that would resolve the DOI and return all the information that binderhub needs.

I'd be curious to hear which of those would be a preferred method (or if there are other options I'm missing).

@betatim
Copy link
Member

betatim commented Sep 19, 2019

I'd wait a bit longer with introducing /v2/doi/<spec> till we support "more" (whatever that means) DOI providers. It feels like until we have a >50% chance(??) of being able to handle a DOI that a user is likely to try, we should make it explicit. Though maybe we can make one repo provider already that we use for all kinds of DOIs that are supported?

We could reduce code duplication by making repo2docker a dependency of BinderHub and refactoring the code in r2d so that it can be called as a library.

For caching to work we unfortunately need to know before launching the build pod what the answer is. Which I think means we can't just forward all DOIs to repo2docker without deciding if they are supported or not.

Needed: a good plan to get us from where we are now (supporting a handful of DOIs) to the end goal of /v2/doi/<anyDOIevermade>.

@nuest
Contributor

nuest commented Sep 20, 2019

I think I've said this before, but I'd like to point to an external discussion here: ropensci-archive/doidata#1. There is no way right now to handle anyDOIevermade, because DOIs intentionally point to human-readable landing pages, not data.

I like the idea of having a master DOI provider, but I don't see what it adds compared to the current detection approach (call the DOI and see which URL we end up with). Just calling the DOI is really pragmatic and saves us from having to understand DOI schemas.

Re. refactoring repo2docker: I only made a quick search, but I guess there is no "download all files from a DOI" Python library yet. The content providers in repo2docker are a path towards that. Such a library could also be used by BinderHub to pre-check if a DOI is supported...

@pdurbin
Contributor Author

pdurbin commented Sep 20, 2019

I appreciate all the chatter on this issue!

Would it help if we make the scope of this issue smaller and re-title it to "Add Harvard Dataverse to UI"? I'd feel bad for the other 47 installations of Dataverse around the world but if it helps move things forward we could refactor later.

When I look at the current list of repoproviders on https://mybinder.org, they are all centralized services:

[Screenshot: the list of repoproviders in the mybinder.org drop-down]

Harvard Dataverse is also a centralized service. Pull request #951 is about adding another centralized service (Figshare) to Binder. Should we make our lives easier and continue to focus on centralized services in the short term? Later we can think harder about supporting software that allows you to self-host (GitLab, Dataverse, etc.). What do you think?


Update: I lied. Looking closer I can tell that the "Git repository" option allows arbitrary self-hosted Git servers:

[Screenshot: the "Git repository" option in the mybinder.org drop-down]

I still think the rest (GitHub, Zenodo, etc.) are centralized services and I'm still fine with narrowing the scope of this issue to just Harvard Dataverse if that's easier.

@betatim
Member

betatim commented Sep 20, 2019

I like the idea of having a master DOI provider, but I don't see what it adds compared to the current detection approach

To avoid confusion: with provider I was thinking of the Python class. I was thinking we could save some code by having one provider that all the different DOI items in the drop down menu could use. You give it a DOI and if it resolves to a URL we support it says "yay".

Pondering this even further: the difference between the figshare and zenodo providers is mostly in the get_resolved_ref() method. get_repo_url() returns the spec in both situations. get_build_slug() gives back a string combining the name and resolved ref.

Most of the difference between (and complexity of) the get_resolved_ref() methods is extracting the provider-specific ID. I can't remember why we don't just return an (escaped) version of the DOI if the hostname matches one we recognise. It might be because a DOI can point to different Zenodo entries?

If we could remove the need to extract the provider specific ID then a generic DOI provider would be "resolve this DOI, does the resulting hostname look like one we support, if yes pass the DOI to repo2docker". Which would be nice (I think).
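
Sketched out (none of this is existing BinderHub code — the host allow-list is illustrative, and real providers are Tornado coroutines like the class further down):

from urllib.parse import urlparse

import requests

SUPPORTED_HOSTS = ("zenodo.org", "figshare.com", "dataverse.harvard.edu")


class GenericDOIProvider:
    def __init__(self, spec):
        self.spec = spec  # the DOI itself, e.g. "10.7910/DVN/RLLL1V"

    def get_resolved_ref(self):
        # Resolve the DOI and look at where it lands.
        resp = requests.get(
            "https://doi.org/{}".format(self.spec), allow_redirects=True)
        host = urlparse(resp.url).netloc
        if not any(host.endswith(h) for h in SUPPORTED_HOSTS):
            raise ValueError("unsupported DOI host: {}".format(host))
        # No provider-specific ID extraction: repo2docker gets the DOI as-is.
        return self.spec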

Getting files from DOIs

I don't think there exists anything to do that, precisely because each archive has its own uniquely beautiful interface ;) Back in October 2017 (OMG is that a long time ago) when the idea of "doi2docker" was first suggested I assumed it would be easy because "surely there is a standard way to get the files associated with a DOI". Two years later we are still working on it, but at least we have a few supported archives now :D

@Xarthisius
Contributor

Xarthisius commented Sep 20, 2019

I like the idea of having a master DOI provider, but I don't see what it adds compared to the current detection approach

To avoid confusion: with provider I was thinking of the Python class. I was thinking we could save some code by having one provider that all the different DOI items in the drop down menu could use. You give it a DOI and if it resolves to a URL we support it says "yay".

To avoid even further confusion I'll respond with code :)

import json
from urllib.parse import urlencode

import escapism
from tornado import gen
from tornado.httpclient import AsyncHTTPClient

from binderhub.repoproviders import RepoProvider


class DataverseProvider(RepoProvider):

    @gen.coroutine
    def get_resolved_ref(self):
        # Resolve the spec (a DOI) through the Whole Tale lookup service.
        client = AsyncHTTPClient()
        lookup_url = "https://data.wholetale.org/api/v1/repository/lookup?{}".format(
            urlencode({"dataId": json.dumps([self.spec])}))
        r = yield client.fetch(lookup_url)

        resp = json.loads(r.body)
        assert resp[0]["repository"].lower() == "dataverse"

        self.record_id = resp[0]["doi"]
        return self.record_id

    def get_build_slug(self):
        # Escape the DOI so it is safe to use in an image name.
        return "dataverse-" + escapism.escape(self.record_id, escape_char="-").lower()

If you're ok with using a "3rd party" service, the above would work for the general case too, provided that we have parity in our content providers. (Zenodo and Figshare are literally at the top of my todo list.)

@nuest
Contributor

nuest commented Sep 22, 2019

What I take out of @betatim 's last post is: this is a BinderHub issue so it's primarily about detecting DOIs that repo2docker will support, without knowing/caring how repo2docker will do that.

Re. provider specific ID: Every Zenodo record has its own DOI. In repo2docker we need the provider-specific ID to talk to the API; I don't think we need a provider-specific ID within BinderHub, do we?

Re. "Harvard Dataverse vs. Dataverse": if I understand the feature correctly, any Dataverse provider is supported, so I don't see the need to limit to the Harvard one.

@pdurbin
Contributor Author

pdurbin commented Sep 22, 2019

I appreciate all the discussion above. Thank you! 🎉

The title of this issue is currently "Dataverse repoprovider and URLs" but it sounds like there is interest in expanding the scope of this issue to be less about Dataverse and more about detecting which DOIs repo2docker will support. Is that right? If so, can we please update the title of this issue?

My understanding is that, as of this writing, BinderHub supports the following DOI providers with the following "prefixes" or "registrant codes":

Our goal is to support more DOI providers.

@pdurbin
Contributor Author

pdurbin commented Sep 30, 2019

I thought it might be helpful to provide here a list of DOI "prefixes" or "registrant codes" used by installations of Dataverse:

Would it be helpful if I also provide the equivalent for installations of Dataverse that use Handles instead of DOIs? There are fewer of these, maybe three or four.

Xarthisius added a commit to Xarthisius/binderhub that referenced this issue Oct 1, 2019
Xarthisius added a commit to Xarthisius/binderhub that referenced this issue Oct 4, 2019
@nuest
Contributor

nuest commented Oct 8, 2019

(Trying to catch up after some time away from work.)

I am unsure BinderHub should start keeping lists of supported DOI prefixes. Those are pretty "permanent" but will grow, and they also don't guarantee support AFAICS because we don't know that these prefixes will only point to Dataverse installations. That is likely what these listed institutions do now, but maybe one of them starts using (or is already using) the same prefix for a journal?

So I think we should continue with the approach used in repo2docker and resolve the URL and then check if the URL is supported. For now we probably live with some duplication of code between BinderHub and repo2docker.

Re. relying on the Wholetale API for lookup: Honestly no offense, but IMO BinderHub should try to stick to resolving DOIs only. (Although Wholetale probably has more stable finances than BinderHub and can "catch up" with the repositories added to repo2docker?)

@Xarthisius
Contributor

Re. relying on the Wholetale API for lookup: Honestly no offense, but IMO BinderHub should try to stick to resolving DOIs only.

Just as a note of explanation: that's precisely what we do... If you call that API with a DOI we don't currently support (like Zenodo), you get:

$ curl -s -H 'Accept: application/json' \
  'https://data.wholetale.org/api/v1/repository/lookup?dataId=%5B%2210.5281%2Fzenodo.3465431%22%5D' | jq '.'
[
  {
    "dataId": "https://zenodo.org/record/3465431",
    "doi": null,
    "name": "3465431",
    "repository": "HTTP",
    "size": 30566
  }
]

so in the worst-case scenario you get exactly the same thing, only in JSON rather than a redirect. It's significantly slower, I'll give you that, because we try other things too.

Nevertheless, if you look at the corresponding PR it's all water under the bridge.

@nuest
Contributor

nuest commented Oct 8, 2019

Thanks for the explanation @Xarthisius - I think you're doing something very important there, namely providing structured information based on a DOI. Something that doi.org should offer...

@pdurbin
Contributor Author

pdurbin commented Oct 8, 2019

I am unsure BinderHub should start keeping lists of supported DOI prefixes. Those are pretty "permanent" but will grow, and they also don't guarantee support AFAICS because we don't know that these prefixes will only point to Dataverse installations. That is likely what these listed institutions do now, but maybe one of them starts using (or is already using) the same prefix for a journal?

This is a very good point!

providing structured information based on a DOI. Something that doi.org should offer...

I recently asked a related question at https://www.pidforum.org/t/api-for-determining-which-institution-is-using-a-doi-prefix-registrant-code/687

There I learned that from https://doi.org/ra/10.11587 (for example) you can figure out that DataCite is the registration agency:

[
  {
    "DOI": "10.11587",
    "RA": "DataCite"
  }
]

Once you know DataCite is the registration agency, you can use DataCite APIs to get additional JSON out.

https://api.datacite.org/prefixes/10.11587 (for example) provides JSON about the institution that is using that prefix.

curl https://api.datacite.org/prefixes/10.11587 | jq '.included[0].attributes.name' shows "The Austrian Social Science Data Archive"

@pdurbin
Contributor Author

pdurbin commented Oct 8, 2019

Does this help?

https://doi.org/api/handles/10.11587/ERDG3O

{
  "responseCode": 1,
  "handle": "10.11587/ERDG3O",
  "values": [
    {
      "index": 100,
      "type": "HS_ADMIN",
      "data": {
        "format": "admin",
        "value": {
          "handle": "10.admin/codata",
          "index": 300,
          "permissions": "111111111111"
        }
      },
      "ttl": 86400,
      "timestamp": "2019-06-27T09:18:24Z"
    },
    {
      "index": 1,
      "type": "URL",
      "data": {
        "format": "string",
        "value": "https://data.aussda.at/citation?persistentId=doi:10.11587/ERDG3O"
      },
      "ttl": 86400,
      "timestamp": "2019-06-27T09:18:24Z"
    }
  ]
}

I found this at "Proxy Server REST API" at https://www.doi.org/factsheets/DOIProxy.html#rest-api

@Xarthisius
Contributor

Xarthisius commented Oct 8, 2019

@pdurbin it would, if the DOI world weren't nasty and full of corner cases:

$ curl -s -L https://doi.org/api/handles/1902.1/22315 | jq '.values[1].data.value'
"http://thedata.harvard.edu/dvn/study?globalId=hdl:1902.1/22315"

$ curl -s -IL https://doi.org/1902.1/22315 | grep Location
Location: https://thedata.harvard.edu/dvn/study?globalId=hdl:1902.1/22315
Location: https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/22315

As illustrated above, DOI resolutions can be chained, so it's always safer to keep resolving until a 200 is reached.
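
(In Python, for example, requests keeps the intermediate hops around, so a minimal "resolve until 200" sketch looks like this:)

import requests

# Follow the whole redirect chain; the intermediate 30x hops
# end up in resp.history.
resp = requests.get("https://doi.org/1902.1/22315", allow_redirects=True)
for hop in resp.history:
    print(hop.status_code, "->", hop.headers["Location"])
print(resp.status_code, resp.url)  # the final 200 and the landing URL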

@pdurbin
Contributor Author

pdurbin commented Oct 11, 2019

@Xarthisius oh! That extra hop is our fault (Harvard Dataverse's fault). We changed hostnames a while back and should retarget old PIDs like that. I opened an issue about this: IQSS/dataverse.harvard.edu#40

If you want to treat some of these old PIDs as broken and not working, that's fine. It would nudge Dataverse installations to clean house a little bit. They'd be forced to update old DOI records to point them to current hostnames.

Or you are welcome to keep following 302 redirects like you're doing now. Whatever works best for you, really! Thanks for working on this!

Xarthisius added a commit to Xarthisius/binderhub that referenced this issue Nov 20, 2019
pdurbin added a commit to pdurbin/binderhub that referenced this issue Dec 5, 2019
pdurbin added a commit to pdurbin/binderhub that referenced this issue Dec 5, 2019
Xarthisius pushed a commit to Xarthisius/binderhub that referenced this issue Dec 8, 2019
choldgraf added a commit that referenced this issue Dec 10, 2019
Add Dataverse to UI. Fixes #900