.jlap JSON Patch Incremental Updates for repodata.json #20

Open · wants to merge 24 commits into main

Conversation

dholth (Contributor) commented Apr 27, 2022

Describes a system to save client bandwidth during conda install when fetching repodata.json updates, based on a patch series from specific earlier versions of the file.

dholth (Contributor, Author) commented Apr 27, 2022

If we generate the patch file before it hits the CDN, e.g. in conda index, then it can't include the original repodata.json headers. The idea is that a paranoid client checks that repodata.json's Last-Modified or ETag is within a certain range compared to the patches, in case a mirror stops updating the .jlap file but does not delete it.
The individual patch lines should have a size limit. If an individual patch is too large, we can leave it out. In testing, a very large patch occurred when a bug computed the patch between an empty JSON document {} and the entire repodata.json. If we leave out a necessary patch, clients will simply download the whole .json again.
Each patch has a to and a from hash, so the patch generator can be dumb. This also makes it possible to do more sophisticated path-finding between the revision we have and the latest revision. That might not help for repodata.json, which is mostly additive, but it could do something for current_repodata.json, which adds and removes packages each time.
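To make that concrete, one patch line might look like the following (an illustrative sketch only; the field names follow this proposal and the hashes are fake placeholders):

```python
# Illustrative sketch only: one line of a .jlap patch series names the "from"
# revision it applies to, the "to" revision it produces, and an RFC 6902 JSON
# Patch. The hashes below are fake placeholders.
import json

patch_line = {
    "from": "aaaa" * 16,  # placeholder hash of the revision we have
    "to": "bbbb" * 16,    # placeholder hash of the revision we want
    "patch": [
        {
            "op": "add",
            "path": "/packages.conda/pymeep-1.27.0-mpi_mpich_py310h1234567_2.conda",
            "value": {"name": "pymeep", "version": "1.27.0"},  # truncated record
        }
    ],
}
print(json.dumps(patch_line, separators=(",", ":")))
```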
If the .jlap is too large, we could drop earlier lines with some hysteresis. For example, with a 3 MB limit, we could cut the first 2 MB whenever the file exceeds 3 MB. The file would then stay between 1 and 3 MB while still keeping some continuity whenever it is reduced in size.
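A rough sketch of that trimming, treating the .jlap as a list of complete lines (the limits below are just the example values above, not part of any spec):

```python
# Rough sketch of the hysteresis trimming idea, not any particular
# implementation: once the file exceeds LIMIT bytes, drop whole leading lines
# until about CUT bytes are gone, so the size oscillates between roughly
# LIMIT - CUT and LIMIT.
LIMIT = 3 * 2**20  # 3 MB
CUT = 2 * 2**20    # 2 MB

def trim(lines: list[bytes]) -> list[bytes]:
    if sum(len(line) for line in lines) <= LIMIT:
        return lines
    removed = 0
    while lines and removed < CUT:
        removed += len(lines.pop(0))
    return lines
```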
We could include a "next" URL, but the client would have to avoid following redirect loops, or downloading so many patches that it might as well have downloaded a complete repodata.json.

kathatherine (Contributor) left a comment

Some minor punctuation edits. Thanks!

barabo left a comment

I haven't given this a thorough read-through yet, but here are a couple of comments right away to consider. I would also suggest you think about providing a smaller/simpler example of the jlap patch - maybe include 1 or 2 patches rather than 9.

If there are properties in the patch that are extraneous, consider cutting them out, too.

kathatherine (Contributor) left a comment

Just a few punctuation/grammar clarifications.

jezdez changed the title from "repodata.json updates via patch series to save client bandwidth" to "CEP 10: JSON Patch Incremental Updates for repodata.json" on Oct 11, 2022
jezdez (Member) commented Oct 14, 2022

@conda-incubator/steering

This pull request falls under the Enhancement Proposal Approval policy of the conda governance policy; please vote and/or comment on this PR.

It needs 60% of the Steering Council to vote yes to pass.

To vote, please leave a regular pull request review (see Files changed tab above > green Review changes button) with one of the following options:

  • Approve for yes
  • Request Changes for no.

Also, if you would like to suggest changes to the current proposal, please leave a comment below, use the regular review comments, or push to this branch.

This vote will end on 2022-10-21.

ocefpaf commented Oct 14, 2022

@dholth I think this is a great idea, and I recall Phil Elson wanting to implement something like this way back in 2018 but never getting to it. I'm definitely +1.

The only confusion I have is: should this be a CEP? It is a desired improvement, but does it have any operational impacts that warrant voting? I may be missing something (I read the doc on my mobile phone), but I did not see any reason for this to be a CEP. I'd love to see the PR that implements this merged, though.

beckermr (Contributor) left a comment

If we're voting now, my answer is no. I don't think we've reached a consensus on how this format might interact with the power loader stuff from Wolf and possible optimizations for repodata sharding and patching.

I think if we work some of those details out (and if we already have, we should put that in the CEP, or I missed it!), then we could and should reconsider this.

dholth (Contributor, Author) commented Oct 14, 2022

I like the idea of being able to implement this in conda without worrying about CEPs. It does not change the format of repodata.json at all.

I was able to meet Wolf in July. The C++ JSON library they are using implements jsonpatch. Mamba still wants to use the same on-disk HTTP cache as conda, which is a question of paths. The CEP does mention other patching schemes, like the one RPM uses, which is generic on bytes instead of on JSON, but that is only available as a C library and is not as optimized for HTTP round trips.

We will be able to get fantastic network performance in Python conda without powerloader.

beckermr (Contributor) commented

That's all great and I support it. Let's try it live in conda and with anaconda.org. We'll learn a lot.

I simply don't want to see us declaring standards on this for conda just yet.

mariusvniekerk commented Oct 22, 2022

Please use faster compression formats if you're using any of them. Gzip and bzip2 are laughably slow at decompression compared to zstd.
Our files are big enough that decompression speed matters.
-1 for pickles, as those very tightly couple us to Python.

dholth (Contributor, Author) commented Oct 23, 2022

jezdez (Member) commented Oct 26, 2022

Voting Results

This was a standard, non-timed-out vote.

This vote has NOT reached quorum (at least 60% of 16).

It has also NOT passed since it recorded 0 "yes" votes and 1 "no".

jezdez removed the "vote" label (Voting following governance policy) on Oct 26, 2022
dholth (Contributor, Author) commented Oct 26, 2022

FYI, I talked to an internal team that also needs the old cache filenames; I will update to keep the filenames the same.

jezdez changed the title from "CEP 10: JSON Patch Incremental Updates for repodata.json" to "JSON Patch Incremental Updates for repodata.json" on Mar 31, 2023
dholth (Contributor, Author) commented Mar 31, 2023

This feature is available in conda 23.3.1 with the "--experimental=jlap" flag. It is most impressive after the third cache expiration: on an empty cache, we fetch repodata.json; the first .jlap fetch downloads the entire remote .jlap file; the third fetch downloads only the end of the remote .jlap file.

dholth (Contributor, Author) commented May 8, 2023

@wolfv @ocefpaf @beckermr @mariusvniekerk Maybe you are not all still voting members, but how do you feel about this CEP now that there is a complete implementation?

dholth changed the title from "JSON Patch Incremental Updates for repodata.json" to ".jlap JSON Patch Incremental Updates for repodata.json" on Sep 27, 2023
cep-jlap.md Outdated
<tr><td> Created </td><td> Mar 30, 2022</td></tr>
<tr><td> Updated </td><td> Oct 11, 2022</td></tr>
<tr><td> Discussion </td><td> NA </td></tr>
<tr><td> Implementation </td><td> https://github.com/dholth/repodata-fly </td></tr>
dholth (Contributor, Author) left a comment

Might as well link to the implementations in conda and librattler. conda's jlap core is pretty elegant. The network error handling less so.

dholth (Contributor, Author) commented Oct 20, 2023

@baszalmstra @jaimergp ping

baszalmstra (Contributor) commented

@travishathaway implemented this CEP in rattler. There it is enabled by default, which means pixi uses it by default, and we have been running it that way for a while.

This may be due to the implementation in rattler, but we have not been particularly happy with it. As far as I can tell, the algorithm trades bandwidth for compute. You transfer significantly less data, but computing the full repodata requires some CPU and (in our case) a lot of memory (on the order of GBs). People with low-memory (embedded) devices often run into out-of-memory issues because of this.

With zstd-compressed files now being the default, the download overhead has also become significantly smaller. But I also realize that this is not a scalable solution.

But! I have not yet actively profiled this, or really tried to improve the current situation. I will make some time for that in the next few weeks and report back here with some actual numbers.

I'm very curious about how other people experience this. @wolfv, @jaimergp, have you tried using JLAP? It's not implemented in (micro)mamba, right?

@travishathaway Do you think there are more avenues we can explore to optimize the performance of the implementation in rattler?

dholth (Contributor, Author) commented Oct 20, 2023

On my machine, memray shows a peak of about 420 MB for repodata = json.load(open("/Users/dholth/miniconda3/pkgs/cache/09cdf8bf.json")), which is the conda-forge noarch repodata.json.

Some users do have more compute than bandwidth.

dholth (Contributor, Author) commented Oct 20, 2023

In Python, it might be possible to have a sqlite-backed repodata.json that returns proxy objects for "packages" and "packages.conda", and apply a typical "patch": [{"op": "add", "path": "/packages/pre_commit-3.4.0-hd3eb1b0_1.tar.bz2" ... without loading the entire file. (I've timed a repodata.json-to-sqlite conversion that processes the full repodata.json each time, and it is much too slow.) Of course, this would also make programs that expect cached JSON unhappy.
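A minimal sketch of what such a proxy object could look like (hypothetical, not conda's implementation; the sqlite table packages(grp, filename, record) is an assumed layout):

```python
# Hypothetical sketch, not conda's implementation: a read-only mapping that
# behaves like repodata.json's "packages" / "packages.conda" dicts but loads
# one record from sqlite per lookup. Assumed table:
#   CREATE TABLE packages (grp TEXT, filename TEXT, record TEXT)
import json
import sqlite3
from collections.abc import Mapping

class PackageGroup(Mapping):
    def __init__(self, conn: sqlite3.Connection, grp: str):
        self.conn, self.grp = conn, grp

    def __getitem__(self, filename: str) -> dict:
        row = self.conn.execute(
            "SELECT record FROM packages WHERE grp = ? AND filename = ?",
            (self.grp, filename),
        ).fetchone()
        if row is None:
            raise KeyError(filename)
        return json.loads(row[0])

    def __iter__(self):
        rows = self.conn.execute(
            "SELECT filename FROM packages WHERE grp = ?", (self.grp,)
        )
        return (filename for (filename,) in rows)

    def __len__(self) -> int:
        (count,) = self.conn.execute(
            "SELECT count(*) FROM packages WHERE grp = ?", (self.grp,)
        ).fetchone()
        return count
```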

travishathaway (Contributor) commented Oct 24, 2023

@baszalmstra,

I thought about this for a bit and revisited the Rust implementation. Perhaps one thing that would help is removing the re-ordering of the document that is done in order for the checksums to match at the end. That's just a hunch though.

No matter what, we will always have to at least load the full repodata.json into memory in order to successfully apply the patches. I think that's just the sad reality we must face when choosing to use a JSON file as our database 😭.

Perhaps we could come up with a way to partition repodata into separate files, but that seems like it would only yield an unacceptably complex solution.

Please let us know what happens when you perform a proper profiling. I am interested to hear the results.

dholth (Contributor, Author) commented Oct 24, 2023

@travishathaway It's designed so that the checksum doesn't match at the end: we don't specify the server's JSON formatting. Instead we keep track of the on-disk file versus the putative "logically equivalent to server-side repodata.json with this checksum" state.

baszalmstra (Contributor) commented Oct 25, 2023

No matter what, we will always have to at least load the full repodata.json into memory in order to successfully apply the patches. I think that's just the sad reality we must face when choosing to use a JSON file as our database 😭.

Yeah, but I don't think that's strictly required. It's because we use JSON as the level of abstraction at which we patch the content. In rattler we don't have to parse and load the entire JSON to query information from it; we read the content sparsely.

I haven't thought about this enough yet, but we could also patch at the byte level, using zchunk for instance. The CEP also mentions zchunk. Has that been evaluated? I'm very curious about any issues encountered there.

dholth (Contributor, Author) commented Oct 25, 2023

If we look at https://gist.github.com/dholth/cc76ce07f1c6ff099f440fc901bea35b, which lists the 'op' and 'path' properties of individual JSON Patch elements, it shows that almost all of the patch operations in the current https://conda.anaconda.org/conda-forge/linux-64/repodata.jlap look like ('add', '/packages.conda/pymeep-1.27.0-mpi_mpich_py310h1234567_2.conda') and operate on a single package at a time.

In Python, it would be easy to author a lazy object that looked like repodata.json but loaded individual packages from storage by hooking the dictionary index [] operator on packages and packages.conda.

With knowledge of the patch format, it might be reasonable in a less-dynamic language to load objects (individual package records) based on the first part of the patch path, then send the rest of the path to the json patch library.
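As a hedged Python sketch of that routing (the same idea would apply in a less-dynamic language; it reuses the hypothetical sqlite layout from the proxy-object example above and the third-party jsonpatch library for the remainder of the pointer):

```python
# Hedged sketch only: route each op by the first path segment, handle whole-
# record adds directly, and let the jsonpatch library apply anything deeper
# inside a single record. Ignores JSON Pointer ~-escaping and error handling.
import json
import sqlite3

import jsonpatch  # third-party library; an assumption, not part of the CEP

def apply_op(conn: sqlite3.Connection, op: dict) -> None:
    _, grp, rest = op["path"].split("/", 2)   # "", "packages.conda", "<file>[/...]"
    filename, _, subpath = rest.partition("/")
    if op["op"] == "add" and not subpath:
        record = json.dumps(op["value"])       # complete new package record
    else:
        # Patch inside a single record without touching the rest of repodata.
        row = conn.execute(
            "SELECT record FROM packages WHERE grp = ? AND filename = ?",
            (grp, filename),
        ).fetchone()
        patched = jsonpatch.apply_patch(
            json.loads(row[0]), [{**op, "path": "/" + subpath}]
        )
        record = json.dumps(patched)
    conn.execute(
        "INSERT OR REPLACE INTO packages (grp, filename, record) VALUES (?, ?, ?)",
        (grp, filename, record),
    )
```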

zchunk is neat. It's hard to implement in pure Python. Running it against the possibly-formatted repodata.json and preserving whitespace is a waste. It would make more individual requests to the server to get the necessary patch information.

One of the first prototypes was in Rust, but I found that the PyPy version was a bit faster on diffs. IIRC Python's json.loads() is also faster than serde deserializing to a generic serde_json::Value. When applying the patches, parsing and serializing JSON take all the time; actually applying the patches is very fast because they operate at the logical JSON level.

dholth (Contributor, Author) commented Nov 16, 2023

I've added a two-file scheme to this CEP. Instead of trying to update repodata.json each time, we accumulate patches in a simple overlay. It will be a while before (re-)writing those additions approaches the time needed to reserialize a 220 MB conda-forge/linux-64/repodata.json. There is an implementation in conda+libmamba that shows writing and reading patches. It eliminates about half of the time needed (the time spent calling json.dumps() on repodata.json). It will eliminate all repodata.json parsing on the Python side for patches that only add complete new packages. A bit more work is needed to integrate conda with the new two-file API for the solver.
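The shape of the idea, as an illustrative sketch only (not the actual conda+libmamba API): repodata.json stays untouched, whole-package additions accumulate in a small overlay file, and a reader merges the two.

```python
# Illustrative sketch only, not the actual conda+libmamba API: rewrite a small
# overlay instead of reserializing the ~220 MB repodata.json, and merge the
# overlay's "add" ops (complete new package records) when reading.
import json

def append_to_overlay(overlay_path: str, patch: list[dict]) -> None:
    try:
        with open(overlay_path) as f:
            overlay = json.load(f)
    except FileNotFoundError:
        overlay = {"patch": []}
    overlay["patch"].extend(patch)
    with open(overlay_path, "w") as f:
        json.dump(overlay, f)  # rewriting kilobytes, not hundreds of megabytes

def read_merged(repodata_path: str, overlay_path: str) -> dict:
    with open(repodata_path) as f:
        repodata = json.load(f)
    with open(overlay_path) as f:
        overlay = json.load(f)
    for op in overlay["patch"]:
        if op["op"] == "add":  # the common case observed in the patch series
            _, group, filename = op["path"].split("/", 2)
            repodata.setdefault(group, {})[filename] = op["value"]
    return repodata
```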
