.jlap JSON Patch Incremental Updates for repodata.json #20
base: main
Conversation
If we generate the patch file before it hits the CDN, e.g. in
Some minor punctuation edits. Thanks!
I haven't given this a thorough read-through yet, but here are a couple of comments right away to consider. I would also suggest you think about providing a smaller/simpler example of the jlap patch - maybe include 1 or 2 patches rather than 9.
If there are properties in the patch that are extraneous, consider cutting them out, too.
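Along the lines of that suggestion, a minimal patch entry might carry only one or two RFC 6902 operations. The hashes, package names, and values below are purely hypothetical placeholders for illustration:

```json
{
  "to": "20a1b2c3...",
  "from": "19fe8d7c...",
  "patch": [
    {"op": "add", "path": "/packages.conda/somepkg-1.0-0.conda",
     "value": {"name": "somepkg", "version": "1.0", "depends": []}},
    {"op": "remove", "path": "/packages.conda/oldpkg-0.9-0.conda"}
  ]
}
```

Each entry describes how to move a cached file from one known state ("from") to a newer one ("to") by applying the listed operations.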
Just a few punctuation/grammar clarifications.
Co-authored-by: Katherine Kinnaman <kkinnaman@anaconda.com>
@conda-incubator/steering This pull request falls under the Enhancement Proposal Approval policy of the conda governance policy; please vote and/or comment on this PR. It needs 60% of the Steering Council to vote. To vote, please leave a regular pull request review (see
Also, if you would like to suggest changes to the current proposal, please leave a comment below, use the regular review comments, or push to this branch. This vote will end on 2022-10-21.
@dholth I think this is a great idea, and I recall Phil Elson wanting to implement something like this way back in 2018, but he never got to it. I'm definitely +1. The only confusion I have is: should this be a CEP? It is a desired improvement, but does it have any operational impacts that warrant voting? I may be missing something (I read the doc on my mobile phone), but I did not see any reason for this to be a CEP. I'd love to see the PR that implements this merged, though.
If we're voting now, my answer is no. I don't think we've reached a consensus on how this format might interact with the power loader stuff from Wolf and possible optimizations for repodata sharding and patching.
I think if we work some of those details out (and if we have, we should put that in the CEP, or I missed it!), then we could and should reconsider this.
I like the idea of being able to implement this in conda without worrying about CEPs. It does not change the format of repodata.json at all. I was able to meet Wolf in July; the C++ JSON library they are using implements jsonpatch. Mamba still wants to use the same on-disk HTTP cache as conda, which is a question of paths. The CEP does mention other patching schemes, like the one RPM uses, which is generic on bytes instead of generic on JSON, but it is only a C library and is not as optimized for HTTP round trips. We will be able to get fantastic network performance in Python conda without powerloader.
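The mechanism under discussion is ordinary RFC 6902 patching. As a sketch (not conda's actual implementation, which delegates to a jsonpatch library), a minimal applier for the common `add`/`replace`/`remove` operations over a repodata-shaped dict is only a few lines:

```python
def apply_patch(doc, ops):
    """Apply a minimal subset of RFC 6902 (add/replace/remove) in place."""
    for op in ops:
        # JSON Pointer (RFC 6901): split on "/", unescape ~1 -> "/" and ~0 -> "~"
        parts = [p.replace("~1", "/").replace("~0", "~")
                 for p in op["path"].lstrip("/").split("/")]
        target = doc
        for key in parts[:-1]:
            target = target[key]  # descend to the parent container
        if op["op"] in ("add", "replace"):
            target[parts[-1]] = op["value"]
        elif op["op"] == "remove":
            del target[parts[-1]]
    return doc

# Hypothetical repodata fragment and patch, for illustration only:
repodata = {"packages.conda": {"a-1.0-0.conda": {"version": "1.0"}}}
patch = [{"op": "replace",
          "path": "/packages.conda/a-1.0-0.conda/version",
          "value": "1.1"}]
apply_patch(repodata, patch)
```

A real client would use a tested library (Python `jsonpatch`, or the C++ JSON library's built-in support mentioned above) rather than hand-rolling pointer traversal.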
That's all great and I support it. Let's try it live in conda and with anaconda.org. We'll learn a lot. I simply don't want to see us declaring standards on this for conda just yet.
Please use faster compression formats if you're using any of them. Gzip and bzip2 are laughably slow at decompression compared to zstd.
@mariusvniekerk we will benchmark the whole thing, and we have zstandard available because of
Voting Results
This was a standard, non-timed-out vote.
This vote has NOT reached quorum (at least 60% of 16). It has also NOT passed, since it recorded 0 "yes" votes and 1 "no".
FYI, I talked to an internal team that also needs the old cache filenames; I will update to keep the filenames the same.
This feature is available in conda 23.3.1 with the
@wolfv @ocefpaf @beckermr @mariusvniekerk Maybe you are not all still voting members, but how do you feel about this CEP now that there is a complete implementation?
cep-jlap.md
<tr><td> Created </td><td> Mar 30, 2022 </td></tr>
<tr><td> Updated </td><td> Oct 11, 2022 </td></tr>
<tr><td> Discussion </td><td> NA </td></tr>
<tr><td> Implementation </td><td> https://github.com/dholth/repodata-fly </td></tr>
Might as well link to the implementations in conda and librattler. conda's jlap core is pretty elegant. The network error handling less so.
@baszalmstra @jaimergp ping
@travishathaway implemented this CEP in rattler, where it is enabled by default, which means pixi has been using it by default for a while now. This may be due to the implementation in rattler, but we have not been particularly happy with it. As far as I can tell, the algorithm provides a tradeoff between bandwidth and compute: you transfer significantly less data, but computing the full repodata requires some CPU and (in our case) a lot of memory (on the order of GBs). People with low-memory (embedded) devices often run into out-of-memory issues because of this.

With zstd-compressed files now the default, the download overhead has also become significantly smaller, though I realize that is not a scalable solution either.

But! I have not yet actively profiled this or really tried to improve the current situation. I will make some time for that in the next few weeks and report back here with some actual numbers. I'm very curious how other people experience this. @wolfv, @jaimergp: have you tried using JLAP? It's not implemented in (micro)mamba, right? @travishathaway: do you think there are more avenues we can explore to optimize the performance of the implementation in rattler?
On my machine, memray shows a peak of about 420MB for

Some users do have more compute than bandwidth.
In Python, it might be possible to have a sqlite-backed repodata.json that returns proxy objects for
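A sketch of that sqlite-backed idea, under the assumption that each package record is stored as a JSON text row keyed by filename (class name, table name, and schema are all hypothetical):

```python
import json
import sqlite3


class LazyRepodata:
    """Dict-like view over package records stored in SQLite.

    Each record is parsed from JSON only when accessed, so patching or
    querying one package does not require loading the whole repodata
    file into memory.
    """

    def __init__(self, conn):
        self.conn = conn

    def __getitem__(self, filename):
        row = self.conn.execute(
            "SELECT record FROM packages WHERE filename = ?", (filename,)
        ).fetchone()
        if row is None:
            raise KeyError(filename)
        return json.loads(row[0])

    def __setitem__(self, filename, record):
        self.conn.execute(
            "INSERT OR REPLACE INTO packages (filename, record) VALUES (?, ?)",
            (filename, json.dumps(record)),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE packages (filename TEXT PRIMARY KEY, record TEXT)")
packages = LazyRepodata(conn)
packages["a-1.0-0.conda"] = {"version": "1.0"}  # hypothetical record
```

Peak memory then scales with the size of the records actually touched, not the full 220MB file, at the cost of a one-time import into the database.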
I thought about this for a bit and revisited the Rust implementation. Perhaps one thing that would help is removing the re-ordering of the document that is done so the checksums match at the end. That's just a hunch, though. No matter what, we will always have to at least load the full repodata.json into memory in order to successfully apply the patches. I think that's just the sad reality we must face when choosing to use a JSON file as our database 😭. Perhaps we could come up with a way to partition repodata into separate files, but that seems like it would only yield an unacceptably complex solution. Please let us know what happens when you perform a proper profiling. I am interested to hear the results.
@travishathaway It's designed so that the checksum doesn't match at the end. We don't specify the server's JSON formatting, but we keep track of the on-disk state versus the putative "logically equivalent to server-side repodata.json with this checksum".
Yeah, but I don't think that's strictly required. It's because we use JSON as the level of abstraction at which we patch the content. We don't have to parse and load the entire JSON to query information from it in rattler, by sparsely reading the content. I haven't thought about this enough yet, but we could also patch on the byte level, using zchunk for instance. The CEP also mentions zchunk. Has that been evaluated? I'm very curious about any issues encountered there.
If we look at https://gist.github.com/dholth/cc76ce07f1c6ff099f440fc901bea35b, which contains the 'op' and 'path' properties of individual JSON patch elements, it shows that almost all of the patch operations in the current https://conda.anaconda.org/conda-forge/linux-64/repodata.jlap look like

In Python, it would be easy to author a lazy object that looked like

With knowledge of the patch format, it might be reasonable in a less-dynamic language to load objects (individual package records) based on the first part of the patch path, then send the rest of the path to the JSON patch library.
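That routing step is just splitting the JSON Pointer once. A Python sketch of the idea (function name and the example op are illustrative, not from any implementation):

```python
def route_patch(op):
    """Split a patch path like "/packages.conda/<filename>/<field>" into
    (top_key, package_filename, remainder_path).

    The caller can then load only the one package record named by the
    first two components and hand the remainder to a JSON patch library.
    """
    parts = op["path"].lstrip("/").split("/", 2)
    top, filename = parts[0], parts[1]
    remainder = "/" + parts[2] if len(parts) > 2 else ""
    return top, filename, remainder


op = {"op": "replace",
      "path": "/packages.conda/somepkg-1.0-0.conda/sha256",
      "value": "abc123"}
print(route_patch(op))  # ('packages.conda', 'somepkg-1.0-0.conda', '/sha256')
```

An empty remainder means the operation targets the whole package record (the add/remove-whole-record case), which is the other common shape in the gist.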
One of the first prototypes was in Rust, but I found that the pypy version was a bit faster on diffs. IIRC, Python's
I've added a two-file scheme to this CEP. Instead of trying to update repodata.json each time, we accumulate patches in a simple overlay. It will be a while before (re-)writing those additions approaches the time needed to reserialize a 220MB conda-forge/linux-64/repodata.json. There is an implementation in conda+libmamba that shows writing and reading patches. It eliminates about half of the time needed (the time spent calling json.dumps() on repodata.json). It will eliminate all
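Under the stated assumptions, the read path of such a two-file scheme could be as simple as merging a small overlay of already-patched records over the untouched base file. A sketch (function and file names are hypothetical, not the CEP's):

```python
import json


def read_with_overlay(base_path, overlay_path):
    """Merge a small overlay of updated package records over the large
    base repodata.json, without ever rewriting the base file.

    The overlay mirrors the base's top-level sections (e.g. "packages",
    "packages.conda"); its records win over the base's.
    """
    with open(base_path) as f:
        repodata = json.load(f)
    try:
        with open(overlay_path) as f:
            overlay = json.load(f)
    except FileNotFoundError:
        return repodata  # no patches accumulated yet
    for section, records in overlay.items():
        repodata.setdefault(section, {}).update(records)
    return repodata
```

Writing a patch then means updating only the small overlay file, which is why the expensive json.dumps() of the full document disappears from the hot path.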
Describes a system to save client bandwidth on conda install when fetching repodata.json updates, based on patch series from specific earlier versions of the file.