Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROPOSAL] Add download hash to Metadata #1703

Closed
wants to merge 9 commits into from

Conversation

techman83
Copy link
Member

NOTE Merging this will update all the current metadata on the next NetKAN-bot run. This is to aid in progressing #1682

@dbent - This is pretty much a duplicate of the DownloadSize Transformer. It generates a SHA1 of the downloaded file.

Tested on the redirect netkan

{
    "spec_version": 1,
    "identifier": "PlaneMode",
    "$kref": "#/ckan/http/http://www.zengei.com/tmp/redirect1",
    "version": "1.3.1",
    "license": "MIT"
}

Produces

{
    "spec_version": 1,
    "identifier": "PlaneMode",
    "license": "MIT",
    "version": "1.3.1",
    "download": "http://www.zengei.com/tmp/PlaneMode-1.3.1.zip",
    "download_size": 17530,
    "download_hash": "1F4B3F21A77D4A302E3417A7C7A24A0B63740FC5",
    "x_generated_by": "netkan"
}

Which matches what I can produce with shasum

leon@beast /tmp $ shasum C942169E-netkan-PlaneMode.zip | awk {'print $1}' |tr '[:lower:]' '[:upper:]' 
1F4B3F21A77D4A302E3417A7C7A24A0B63740FC5

I'm not hugely tied to the field name or the method of generation if there are better suggestions.

@dbent
Copy link
Member

dbent commented May 3, 2016

A few things:

  • You can stick the new hash method in IFileService the intent of that is to have all file related operations in an single interface that can be easily mocked for testing purposes.
  • As per Support sha1/md5 fingerprints of downloaded files #62 we may use the hash for security purposes and SHA1 is on its way out for those purposes. I'd recommend the 256-bit variant of SHA2.
  • I'd add the type of hash to the metadata in some capacity so we can change it later on, something like:
  {
     "download_hash": {
       "type": "sha256",
       "digest": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
     }
  }

Other than that looks good.

@dbent
Copy link
Member

dbent commented May 3, 2016

Also, I was indifferent towards the casing of the digest but RFC 4648 specifies A-F for its alphabet so upper case is the way to go!

@dbent
Copy link
Member

dbent commented May 3, 2016

And, I just realized you need the sha1 for the Internet Archive, so I'd recommend:

"download_hash": [
  {
    "type": "sha1",
    "digest": "1F4B3F21A77D4A302E3417A7C7A24A0B63740FC5"
  },
  {
    "type": "sha256",
    "digest": "E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855"
  }
]

@techman83 techman83 force-pushed the feature/filehash branch from 0c2baff to cd6a38c Compare May 3, 2016 16:34
@techman83
Copy link
Member Author

techman83 commented May 3, 2016

Having an array of hashes is a little more future proof as well. We can get the client to prefer stronger and fallback if in the future SHA256 is broken.

{
  "spec_version": 1,
  "identifier": "PlaneMode",
  "license": "MIT",
  "version": "1.3.1",
  "download": "http://www.zengei.com/tmp/PlaneMode-1.3.1.zip",
  "download_size": 17530,
  "download_hash": [
    {
      "type": "sha1",
      "digest": "1F4B3F21A77D4A302E3417A7C7A24A0B63740FC5"
    },
    {
      "type": "sha256",
      "digest": "7AB22F6E427B8FC96CF4635AF76515DE7EBBC8CAEDD2627FFFD7229A02EAA62D"
    }
  ],
  "x_generated_by": "netkan"
}

After writing it I figured 'FileService' is probably the place for such things, wasn't hard to merge it all.

@dbent
Copy link
Member

dbent commented May 3, 2016

LGTM, just need to update Spec.md for the new structure.

@dbent
Copy link
Member

dbent commented May 3, 2016

@techman83 Sorry, I just realized, it would be easier to work with and more compact if we did:

"download_hash": {
  "sha1": "1F4B3F21A77D4A302E3417A7C7A24A0B63740FC5",
  "sha256": "E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855"
}

Then if we want the sha256 hash we can just index off the download_hash object rather than iterate through the array looking for the appropriate type. This also makes it impossible to have duplicate hashes of the same type.

@techman83
Copy link
Member Author

Yep, that's easier for me too. A little less awkward to build the object as well!

{
  "spec_version": 1,
  "identifier": "PlaneMode",
  "license": "MIT",
  "version": "1.3.1",
  "download": "http://www.zengei.com/tmp/PlaneMode-1.3.1.zip",
  "download_size": 17530,
  "download_hash": {
    "sha1": "1F4B3F21A77D4A302E3417A7C7A24A0B63740FC5",
    "sha256": "7AB22F6E427B8FC96CF4635AF76515DE7EBBC8CAEDD2627FFFD7229A02EAA62D"
  },
  "x_generated_by": "netkan"
}

@techman83
Copy link
Member Author

Schema updated and tested.

feature/filehash!CKAN *> python bin/ckan-validate.py /tmp/DogeCoinFlag-v1.02.ckan 
Validating /tmp/DogeCoinFlag-v1.02.ckan.. Success!

@techman83
Copy link
Member Author

Closing this as the new changes go beyond just a 'download_hash'.

@dbent
Copy link
Member

dbent commented May 4, 2016

Closing this as I assume @techman83 just forgot to actually do the deed. 😉

@dbent dbent closed this May 4, 2016
@pjf pjf removed the pull request label May 4, 2016
@techman83
Copy link
Member Author

👯

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants