Spec: DataStore and FileStore Consolidation

Originally based on this Google doc: https://docs.google.com/a/okfn.org/document/d/1cBW89bWtT2uMovasxHOHZ9hFcE0Gy612I9JjwPKTG5E/edit

Problem

The problems with the DataStore and FileStore in CKAN 2.0 include:

There's a big Data API button at the top of resource pages that, in almost all cases, is disabled, prompting lots of users to ask "How can I enable this button?"
Users don't understand or care about details such as whether their data has been added to the datastore, the filestore or both. Users should be protected from this complexity. (Here "users" probably does not include sysadmins, who need to deploy CKAN and therefore need to know about the filestore, the datastore and everything else.)
It's possible for the source file (uploaded or linked to) and the corresponding data in the datastore to diverge from each other, meaning that the data seen in the data preview or Data API is different from what you get if you download the file, which is confusing. And if someone then uploads a new copy of the source file, it'll overwrite the edits in the datastore!
Currently on most (all?) of our sites no resource files are getting pulled into the datastore, because we're not deploying the datastorer extension: it's too hard to deploy and maintain. As a result the datastore API is not available, and previews work via the dataproxy, which is unreliable.
If someone uploads an Excel file containing multiple sheets, only the first sheet goes into the datastore. This needs to be communicated to the user, or the behaviour improved, e.g. by creating a resource for each sheet.

Decisions

Whenever there is a source file, i.e. if a file has been uploaded to a CKAN resource or linked-to from a CKAN resource, then the datastore API's editing functions should be disabled, so that the data in the datastore and in the filestore cannot diverge. The only way to update the data will be to upload a new version of the file.
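
A minimal sketch of this rule in Python. It assumes a resource is a plain dict whose "url" field is non-empty whenever there is a source file (uploaded or linked-to). The function names and the dict layout are made up for illustration and are not CKAN's actual API.

```python
def datastore_edits_allowed(resource):
    """Return True only when the resource has no source file.

    `resource` is a hypothetical dict; a non-empty "url" stands for an
    uploaded or linked-to source file.
    """
    has_source_file = bool(resource.get("url"))
    return not has_source_file


def check_datastore_write(resource):
    """Raise PermissionError if a datastore API write must be refused."""
    if not datastore_edits_allowed(resource):
        raise PermissionError(
            "This resource has a source file; upload a new version "
            "of the file to update the data."
        )
```

With a check like this in front of every editing call, the datastore copy can only change when the source file does.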
We'll disable data previews for data that is not in the datastore, because the dataproxy is too unreliable. All data previews will work via the datastore. - Does this mean we'll remove the dataproxy from CKAN entirely?
We'll move the Data API button on the resource pages to the bottom of the page, to make it much less prominent. It should also mention all our APIs, not just the Data API, and could indicate whether the datastore is enabled for the resource. Examples:
- http://data.gov.uk/dataset/vat_registered_businesses
- http://opendatacommunities.org/datasets/additional-affordable-dwellings
From the technical side, I think we want to get the datastorer service finished and have all our sites configured to use it. This means that all resource files (linked-to or uploaded ones) will get pushed into the datastore automatically soon after the file is uploaded or updated. In the meantime, we're working on a paster command/cron job to pull them in "manually". This is much simpler to implement than the datastorer service, but after an upload or update, files don't get pulled into the datastore until the next time a sysadmin or cron job runs the command. This paster command will probably also come in handy in situations where the more complex datastorer service is not set up or has fallen over for some reason. I'm guessing we're going to throw away the ckanext-datastorer extension and never use it again.
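
One pass of such a paster command/cron job could look roughly like the sketch below. The field names "last_modified" and "datastore_pushed_at" and the `push` callable are hypothetical; they are not taken from CKAN or from any existing extension.

```python
def resources_needing_push(resources):
    """Select resources whose file changed after the last datastore push.

    Each resource is a hypothetical dict with comparable timestamps under
    "last_modified" and "datastore_pushed_at" (None if never pushed).
    """
    pending = []
    for res in resources:
        pushed_at = res.get("datastore_pushed_at")
        if pushed_at is None or res["last_modified"] > pushed_at:
            pending.append(res)
    return pending


def run_once(resources, push, now):
    """One run of the command: push every pending resource.

    `push` is a caller-supplied callable that loads one resource's file
    into the datastore. A failure on one resource does not stop the rest.
    Returns the ids of the resources that were pushed successfully.
    """
    pushed = []
    for res in resources_needing_push(resources):
        try:
            push(res)
        except Exception:
            continue  # leave it pending for the next run
        res["datastore_pushed_at"] = now
        pushed.append(res["id"])
    return pushed
```

Because nothing happens between runs, the delay before new data shows up in the datastore is bounded by how often the cron job is scheduled.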
There was also a Catalog Only option mentioned, where we would deploy a CKAN for someone with no data preview (no preview at all, not even the dataproxy) and no datastore or datastorer. The resource download link would be more prominent and the resource page much simpler, showcasing the URL, download link and additional info.

User Stories

Mark says he thinks these are common user stories:

I am a publisher with a large dataset. I want to easily make a small correction to my data so that anyone who uses the data in the future gets the correct version.
I am an experimental scientist. I want data from my field instruments to be recorded automatically and incrementally, so my calculations which use the data get all the most recent data included.
I am a local council. I want to publish a dataset of locations of street furniture, but include a mechanism for citizens to make/suggest corrections or additions, so that the data is as accurate as possible for anyone who uses it subsequently. (I probably need to be able to approve updates, so that malicious or spam changes are not presented to users.)
I am a researcher/data wrangler/journalist. I want to know which version of the data I have and how it has been processed, so that I understand the data I am working with.
I am a researcher/data wrangler/journalist. I want to be sure I have the most recent/accurate version of the data so that my results are as up-to-date / accurate as possible.
I work for a data publisher, and have some data in a spreadsheet. I want to understand clearly what I need to do to publish my data, so that I don't get paralysed by a confusing choice and give up.

Note the common feature of wanting the data to be accurate for anyone who uses it, which implies that there shouldn't be a particular way of accessing the data that gives a wrong or out-of-date version - at least, not unless there is a very clear health warning.

Suggested new datastore behaviour

Note: not everyone agrees with this, it's just a suggestion. Having to choose between four different types of resource on the create-resource page may be too confusing, and it may be too hard to clearly and concisely communicate the subtle differences between the different resource types to the user.

Also note: we could implement 1, 2 and 3 below fairly easily and then leave 4, which seems like it needs much more thought and new UI mockups and implementation, until later.

We'll have four different types of resource, and the user chooses between them at resource-create time. On the new-resource page, where we currently have three choices (1. Link to a file, 2. Link to an API and 3. Upload a file), we will now have four choices:

1. Link to a file
- Datastorer service will pull file into datastore, so it can be previewed and queried using the datastore API.
- Datastore API will be read-only
  - Datastore and data preview will not always be up-to-date with source file, if the source file has changed and the datastorer service has not yet updated the datastore.
    - How will the datastorer service know when the source file has been updated?
  - The only way to update the resource file is to update the source file on the remote site that hosts it, then wait for the datastorer service to update the datastore
  - What happens if the source file disappears?
    - Datastorer service deletes the data from the datastore?
      - Does it just delete the data from the datastore, or does it delete the resource itself? (probably just the data)
      - What does the resource page look like after this has happened?
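
One possible answer to the "how will the datastorer service know" question above, sketched in Python: keep a fingerprint of the bytes that were last imported and compare it with a freshly fetched copy. This is an assumption, not a decision recorded in this spec, and the function names are illustrative. It also means re-downloading the file on every check; HTTP conditional requests (ETag / Last-Modified) could avoid that where the remote server supports them.

```python
import hashlib


def content_fingerprint(data):
    """Return a hex SHA-256 digest of the file's bytes."""
    return hashlib.sha256(data).hexdigest()


def source_changed(new_bytes, stored_fingerprint):
    """True if the freshly fetched bytes differ from the last import.

    `stored_fingerprint` is the digest saved at the previous import,
    or None if the file has never been imported.
    """
    return content_fingerprint(new_bytes) != stored_fingerprint
```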
2. Link to an API

No datastore/r, no preview; download buttons say "API Endpoint", not "Download".
Note: this is currently quite broken. Some of the download buttons say "Download", and the preview says "No preview available" and "Resource format not specified". It needs to be clearer that nothing has gone wrong, but that API links do not have previews.

3. Upload a file (to the filestore)
- As far as datastore/r and preview are concerned, this works exactly like 1. Link to a file, both in terms of user experience and the implementation in the background
- This might be slightly easier than 1 when it comes to updating the datastore after a new version of the source file has been uploaded, because CKAN can ping the datastorer service.
4. Create a pure-datastore resource
- What do we call this in the UI?
  - Import Data?

In this case there is no source file, so the datastore and preview cannot be out of date
This is the only case where the datastore API's editing functions are enabled
You can add data to the resource using the datastore API, or you can "import" a file into the datastore
If you have imported a file, then you can import a new version of the same file or import a different file, but doing so will wipe out any data currently in the datastore (UI should warn users about this)
The user can still download the file, but in this case what they download is a dump from the datastore
What does the UI for this look like? (Mockups needed)
What does the create-resource page look like, now that we have these four choices? We need to somehow clearly and concisely show the user what the differences between the four choices are.
How do we explain to users that datastore queries and previews may not always be up to date with the source files that they'll get when they click the download button?

Suggestion: add a note near previews saying something like "Preview and Data API use cached data that may be up to X days old; click Download to get the latest data directly from the source."
If CKAN knows that the source file has been updated and it has pinged the datastorer service to update the datastore, do we want to notify the user on the resource page that an update is pending? Can we somehow show the status of the update?
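
The note suggested above could be generated from the time of the last datastore import. A rough sketch follows; the function name and the exact wording are placeholders, and it reports the actual age of the cached data rather than an upper bound.

```python
from datetime import datetime


def staleness_note(last_import, now):
    """Build the notice shown next to a preview.

    `last_import` and `now` are datetimes; only whole days are reported.
    """
    age_days = (now - last_import).days
    if age_days < 1:
        age_text = "less than a day"
    elif age_days == 1:
        age_text = "1 day"
    else:
        age_text = f"{age_days} days"
    return (
        f"Preview and Data API use cached data that is {age_text} old. "
        "Click Download to get the latest file from the source."
    )
```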

Are 3 and 4 too similar? In 3, the user selects a file to upload, CKAN is busy for a bit while it uploads, and when it finishes the user gets a resource page with a data preview and a download button. In 4, the user selects a file to "import", CKAN is busy for a bit while it uploads, and when it finishes the user gets a resource page with a data preview and a download button. This seems likely to get many users asking "What is the difference? These two things seem exactly the same." In fact there are subtle differences. With 3 (or 1), when you click download you download the original file; with 4 you download a dump from the datastore that, depending on the file, may be exactly the same as, subtly different from or completely different from the original file. In 3 (or 1) the datastore API editing functions are disabled; in 4 they're enabled. In 3 you can upload a new version of the file later; in 4, doing this will overwrite any edits you made using the datastore API. These seem like fairly subtle distinctions that depend on understanding what the filestore and datastore are.

Alternative Suggestion: FileStore and DataStore as one resource type

Can pure-filestore and pure-datastore resources be consolidated into a single resource type from the user's point of view? So that the user does not have to choose between uploading a file to the filestore and importing a file into the datastore at resource-create time, but can simply choose between uploading a file or linking to one?

Note: the flow below is pretty much how CKAN currently behaves! The only differences are:

If user has edited a resource using the Data API, then the download button downloads the current version of the data (exported from the datastore as CSV) and a separate "download original file" button appears
If the user has edited a resource using the Data API and then tries to upload a new version of the resource file, they get a warning saying this will overwrite the data stored in CKAN.

...

1a. User uploads a file                 1b. User links to a file
        |                                    |
        V                                    |
2. File goes into CKAN's FileStore           |
        |                                    |
        |        ____________________________|
        |        |
        V        V
3. File goes into CKAN's DataStore.
   Data preview and data query API (read-only) are enabled.
   "Download" button to download original file is shown.
    |
    V
4. At this point, the user can upload a new version of
   the file and the data in the datastore will be replaced.
    |
    V
5. User chooses to enable the data update API, and makes some edits to the
   data in the datastore.
    |
    V
6. "Download" button now downloads the current version of the data from the
   datastore, exported as a CSV file.
   A new "Download original file" button appears that downloads the original
   file from the FileStore (maybe this is hidden behind the download current
   version button, e.g. in small text or in a dropdown etc.).
    |
    V
7. At this point if the user tries to upload a new version of the file,
   they get a warning that the data currently in CKAN will be overwritten
   with the data from the file, and can choose to continue if they wish.

Notes:

At 2 the user never needs to know about or see the word "FileStore"; all they know is that they can download the file.

At 3 it would be the datastorer service or paster command/cron job that pushes the data into the datastore, but again the user never needs to know about this or see the word "DataStore".

At 4, maybe it would be nice to one day support versioning of uploaded files in the filestore, so that users can preview and download older versions of files.

At 5, maybe the user would have to explicitly click a button that says something like "Enable data editing for this resource". Or maybe data editing is just always enabled, and the changes that follow in the diagram simply happen automatically after the first time the user does a data editing action.

At 5, this is where the user would get the web-based data editor and the data versioning and data history viewing, if we ever implement such features. (But for now data editing just means using the datastore update API)

At 7, if we ever implement datastore versioning, then after uploading a new version of their file and wiping out their data, the user could still use the data history view to get back the previous version of their data.
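
The flow in the diagram can also be read as a small state machine. The sketch below uses the diagram's step numbers. The class, function names and return values are hypothetical, and resetting the "edited" flag after a confirmed overwrite is an assumption that the diagram does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    # Becomes True at step 5, once the user edits data via the update API.
    edited_via_api: bool = False


def download_buttons(res):
    """Buttons shown on the resource page (steps 3 and 6)."""
    if res.edited_via_api:
        return ["Download (current data as CSV)", "Download original file"]
    return ["Download"]


def upload_new_version(res, confirm_overwrite=False):
    """Steps 4 and 7: replace the datastore contents from a new file.

    Returns "replaced" on success, or "needs_confirmation" when edits made
    through the API would be lost and the user has not confirmed yet.
    """
    if res.edited_via_api and not confirm_overwrite:
        return "needs_confirmation"
    res.edited_via_api = False  # assumption: data now matches the file again
    return "replaced"
```

The point of the single flag is that the user never chooses a resource type up front; the page only changes once they actually start editing.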
