.jlap JSON Patch Incremental Updates for repodata.json #20

Open · wants to merge 24 commits into main

Conversation

dholth (Contributor) commented Apr 27, 2022

Describes a system to save client bandwidth during conda install when fetching repodata.json updates, based on a patch series from specific earlier versions of the file.

dholth (Contributor, Author) commented Apr 27, 2022

If we generate the patch file before it hits the CDN, e.g. in conda index, then it can't include the original repodata.json headers. The idea is that a paranoid client checks that repodata.json's Last-Modified or ETag is within a certain range compared to the patches, in case a mirror stops updating the .jlap file but does not delete it.
The individual patch lines should have a size limit. If an individual patch is too large, we can leave it out. In testing, a very large patch occurred when a bug computed the patch between an empty JSON document {} and the entire repodata.json. If we leave out a necessary patch, clients will simply download the whole .json again.
Each patch has a to and a from hash, so the patch generator can be dumb. This also makes it possible to do more sophisticated path-finding between the revision we have and the latest revision. That might not help for repodata.json, which is mostly additive, but it could do something for current_repodata.json, which adds and removes packages each time.
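To make that concrete, one patch line might look like the following (an illustrative sketch only; the field names follow this proposal and the hashes are fake placeholders):

```python
# Illustrative sketch only: one line of a .jlap patch series names the "from"
# revision it applies to, the "to" revision it produces, and an RFC 6902 JSON
# Patch. The hashes below are fake placeholders.
import json

patch_line = {
    "from": "aaaa" * 16,  # placeholder hash of the revision we have
    "to": "bbbb" * 16,    # placeholder hash of the revision we want
    "patch": [
        {
            "op": "add",
            "path": "/packages.conda/pymeep-1.27.0-mpi_mpich_py310h1234567_2.conda",
            "value": {"name": "pymeep", "version": "1.27.0"},  # truncated record
        }
    ],
}
print(json.dumps(patch_line, separators=(",", ":")))
```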
If the .jlap is too large, we could drop earlier lines with some hysteresis. For example, with a 3 MB limit, we could cut the first 2 MB whenever the file exceeds 3 MB. The file would then stay between 1 and 3 MB while still keeping some continuity whenever it is reduced in size.
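A rough sketch of that trimming, treating the .jlap as a list of complete lines (the limits below are just the example values above, not part of any spec):

```python
# Rough sketch of the hysteresis trimming idea, not any particular
# implementation: once the file exceeds LIMIT bytes, drop whole leading lines
# until about CUT bytes are gone, so the size oscillates between roughly
# LIMIT - CUT and LIMIT.
LIMIT = 3 * 2**20  # 3 MB
CUT = 2 * 2**20    # 2 MB

def trim(lines: list[bytes]) -> list[bytes]:
    if sum(len(line) for line in lines) <= LIMIT:
        return lines
    removed = 0
    while lines and removed < CUT:
        removed += len(lines.pop(0))
    return lines
```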
We could include a "next" URL, but the client would have to avoid following redirect loops, or downloading so many patches that it might as well have downloaded a complete repodata.json.

kathatherine (Contributor) left a comment

Some minor punctuation edits. Thanks!

barabo left a comment

I haven't given this a thorough read-through yet, but here are a couple of comments right away to consider. I would also suggest you think about providing a smaller/simpler example of the jlap patch - maybe include 1 or 2 patches rather than 9.

If there are properties in the patch that are extraneous, consider cutting them out, too.

kathatherine (Contributor) left a comment

Just a few punctuation/grammar clarifications.

jezdez changed the title from "repodata.json updates via patch series to save client bandwidth" to "CEP 10: JSON Patch Incremental Updates for repodata.json" on Oct 11, 2022
jezdez (Member) commented Oct 14, 2022

@conda-incubator/steering

This pull request falls under the Enhancement Proposal Approval policy of the conda governance policy; please vote and/or comment on this PR.

It needs 60% of the Steering Council to vote yes to pass.

To vote, please leave a regular pull request review (see Files changed tab above > green Review changes button) with one of the following options:

  • Approve for yes
  • Request Changes for no.

Also, if you would like to suggest changes to the current proposal, please leave a comment below, use the regular review comments, or push to this branch.

This vote will end on 2022-10-21.

ocefpaf commented Oct 14, 2022

@dholth I think this is a great idea, and I recall Phil Elson wanting to implement something like this way back in 2018 but never getting to it. I'm definitely +1.

The only confusion I have is: should this be a CEP? It is a desired improvement, but does it have any operational impacts that warrant voting? I may be missing something (I read the doc on my mobile phone), but I did not see any reason for this to be a CEP. I'd love to see the PR that implements this merged, though.

beckermr (Contributor) left a comment

If we're voting now, my answer is no. I don't think we've reached a consensus on how this format might interact with the power loader stuff from Wolf and possible optimizations for repodata sharding and patching.

I think if we work some of those details out (and if we already have, we should put that in the CEP, or I missed it!), then we could and should reconsider this.

dholth (Contributor, Author) commented Oct 14, 2022

I like the idea of being able to implement this in conda without worrying about CEPs. It does not change the format of repodata.json at all.

I was able to meet Wolf in July. The C++ JSON library they are using implements jsonpatch. Mamba still wants to use the same on-disk HTTP cache as conda, which is a question of paths. The CEP does mention other patching schemes, like the one RPM uses, which is generic on bytes instead of on JSON, but that is only available as a C library and is not as optimized for HTTP round trips.

We will be able to get fantastic network performance in Python conda without powerloader.

beckermr (Contributor) commented

That's all great and I support it. Let's try it live in conda and with anaconda.org. We'll learn a lot.

I simply don't want to see us declaring standards on this for conda just yet.

mariusvniekerk commented Oct 22, 2022

Please use faster compression formats if you're using any of them. Gzip and bzip2 are laughably slow at decompression compared to zstd.
Our files are big enough that decompression speed matters.
-1 for pickles, as those very tightly couple us to Python.

dholth (Contributor, Author) commented Oct 23, 2022

jezdez (Member) commented Oct 26, 2022

Voting Results

This was a standard, non-timed-out vote.

This vote has NOT reached quorum (at least 60% of 16).

It has also NOT passed since it recorded 0 "yes" votes and 1 "no".

jezdez removed the "vote" label (Voting following governance policy) on Oct 26, 2022
dholth (Contributor, Author) commented Oct 26, 2022

FYI, I talked to an internal team that also needs the old cache filenames; I will update to keep the filenames the same.

jezdez changed the title from "CEP 10: JSON Patch Incremental Updates for repodata.json" to "JSON Patch Incremental Updates for repodata.json" on Mar 31, 2023
dholth (Contributor, Author) commented Mar 31, 2023

This feature is available in conda 23.3.1 with the "--experimental=jlap" flag. It is most impressive after the third cache expiration: on an empty cache, we fetch repodata.json; the first .jlap fetch downloads the entire remote .jlap file; the third fetch downloads only the end of the remote .jlap file.

dholth (Contributor, Author) commented May 8, 2023

@wolfv @ocefpaf @beckermr @mariusvniekerk Maybe you are not all still voting members, but how do you feel about this CEP now that there is a complete implementation?

dholth changed the title from "JSON Patch Incremental Updates for repodata.json" to ".jlap JSON Patch Incremental Updates for repodata.json" on Sep 27, 2023
cep-jlap.md Outdated
<tr><td> Created </td><td> Mar 30, 2022</td></tr>
<tr><td> Updated </td><td> Oct 11, 2022</td></tr>
<tr><td> Discussion </td><td> NA </td></tr>
<tr><td> Implementation </td><td> https://github.com/dholth/repodata-fly </td></tr>
dholth (Contributor, Author) left a comment

Might as well link to the implementations in conda and librattler. conda's jlap core is pretty elegant. The network error handling less so.

dholth (Contributor, Author) commented Oct 20, 2023

@baszalmstra @jaimergp ping

baszalmstra (Contributor) commented

@travishathaway implemented this CEP in rattler. There it is enabled by default, which means pixi uses it by default, and we have been running it that way for a while.

This may be due to the implementation in rattler, but we have not been particularly happy with it. As far as I can tell, the algorithm trades bandwidth for compute. You transfer significantly less data, but computing the full repodata requires some CPU and (in our case) a lot of memory (on the order of GBs). People with low-memory (embedded) devices often run into out-of-memory issues because of this.

With zstd-compressed files now being the default, the download overhead has also become significantly smaller. But I also realize that this is not a scalable solution.

But! I have not yet actively profiled this, or really tried to improve the current situation. I will make some time for that in the next few weeks and report back here with some actual numbers.

I'm very curious about how other people experience this. @wolfv, @jaimergp, have you tried using JLAP? It's not implemented in (micro)mamba, right?

@travishathaway Do you think there are more avenues we can explore to optimize the performance of the implementation in rattler?

dholth (Contributor, Author) commented Oct 20, 2023

On my machine, memray shows a peak of about 420 MB for repodata = json.load(open("/Users/dholth/miniconda3/pkgs/cache/09cdf8bf.json")), which is the conda-forge noarch repodata.json.

Some users do have more compute than bandwidth.

dholth (Contributor, Author) commented Oct 20, 2023

In Python, it might be possible to have a sqlite-backed repodata.json that returns proxy objects for "packages" and "packages.conda", and apply a typical "patch": [{"op": "add", "path": "/packages/pre_commit-3.4.0-hd3eb1b0_1.tar.bz2" ... without loading the entire file. (I've timed a repodata.json-to-sqlite conversion that processes the full repodata.json each time, and it is much too slow.) Of course, this would also make programs that expect cached JSON unhappy.
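A minimal sketch of what such a proxy object could look like (hypothetical, not conda's implementation; the sqlite table packages(grp, filename, record) is an assumed layout):

```python
# Hypothetical sketch, not conda's implementation: a read-only mapping that
# behaves like repodata.json's "packages" / "packages.conda" dicts but loads
# one record from sqlite per lookup. Assumed table:
#   CREATE TABLE packages (grp TEXT, filename TEXT, record TEXT)
import json
import sqlite3
from collections.abc import Mapping

class PackageGroup(Mapping):
    def __init__(self, conn: sqlite3.Connection, grp: str):
        self.conn, self.grp = conn, grp

    def __getitem__(self, filename: str) -> dict:
        row = self.conn.execute(
            "SELECT record FROM packages WHERE grp = ? AND filename = ?",
            (self.grp, filename),
        ).fetchone()
        if row is None:
            raise KeyError(filename)
        return json.loads(row[0])

    def __iter__(self):
        rows = self.conn.execute(
            "SELECT filename FROM packages WHERE grp = ?", (self.grp,)
        )
        return (filename for (filename,) in rows)

    def __len__(self) -> int:
        (count,) = self.conn.execute(
            "SELECT count(*) FROM packages WHERE grp = ?", (self.grp,)
        ).fetchone()
        return count
```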

travishathaway (Contributor) commented Oct 24, 2023

@baszalmstra,

I thought about this for a bit and revisited the Rust implementation. Perhaps one thing that would help is removing the re-ordering of the document that is done in order for the checksums to match at the end. That's just a hunch though.

No matter what, we will always have to at least load the full repodata.json into memory in order to successfully apply the patches. I think that's just the sad reality we must face when choosing to use a JSON file as our database 😭.

Perhaps we could come up with a way to partition repodata into separate files, but that seems like it would only yield an unacceptably complex solution.

Please let us know what happens when you perform a proper profiling. I am interested to hear the results.

dholth (Contributor, Author) commented Oct 24, 2023

@travishathaway It's designed so that the checksum doesn't match at the end: we don't specify the server's JSON formatting. Instead we keep track of the on-disk file versus the putative "logically equivalent to server-side repodata.json with this checksum" state.

baszalmstra (Contributor) commented Oct 25, 2023

No matter what, we will always have to at least load the full repodata.json into memory in order to successfully apply the patches. I think that's just the sad reality we must face when choosing to use a JSON file as our database 😭.

Yeah, but I don't think that's strictly required. It's because we use JSON as the level of abstraction at which we patch the content. In rattler we don't have to parse and load the entire JSON to query information from it; we read the content sparsely.

I haven't thought about this enough yet, but we could also patch at the byte level, using zchunk for instance. The CEP also mentions zchunk. Has that been evaluated? I'm very curious about any issues encountered there.

dholth (Contributor, Author) commented Oct 25, 2023

If we look at https://gist.github.com/dholth/cc76ce07f1c6ff099f440fc901bea35b, which lists the 'op' and 'path' properties of individual JSON Patch elements, it shows that almost all of the patch operations in the current https://conda.anaconda.org/conda-forge/linux-64/repodata.jlap look like ('add', '/packages.conda/pymeep-1.27.0-mpi_mpich_py310h1234567_2.conda') and operate on a single package at a time.

In Python, it would be easy to author a lazy object that looked like repodata.json but loaded individual packages from storage by hooking the dictionary index [] operator on packages and packages.conda.

With knowledge of the patch format, it might be reasonable in a less-dynamic language to load objects (individual package records) based on the first part of the patch path, then send the rest of the path to the json patch library.
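As a hedged Python sketch of that routing (the same idea would apply in a less-dynamic language; it reuses the hypothetical sqlite layout from the proxy-object example above and the third-party jsonpatch library for the remainder of the pointer):

```python
# Hedged sketch only: route each op by the first path segment, handle whole-
# record adds directly, and let the jsonpatch library apply anything deeper
# inside a single record. Ignores JSON Pointer ~-escaping and error handling.
import json
import sqlite3

import jsonpatch  # third-party library; an assumption, not part of the CEP

def apply_op(conn: sqlite3.Connection, op: dict) -> None:
    _, grp, rest = op["path"].split("/", 2)   # "", "packages.conda", "<file>[/...]"
    filename, _, subpath = rest.partition("/")
    if op["op"] == "add" and not subpath:
        record = json.dumps(op["value"])       # complete new package record
    else:
        # Patch inside a single record without touching the rest of repodata.
        row = conn.execute(
            "SELECT record FROM packages WHERE grp = ? AND filename = ?",
            (grp, filename),
        ).fetchone()
        patched = jsonpatch.apply_patch(
            json.loads(row[0]), [{**op, "path": "/" + subpath}]
        )
        record = json.dumps(patched)
    conn.execute(
        "INSERT OR REPLACE INTO packages (grp, filename, record) VALUES (?, ?, ?)",
        (grp, filename, record),
    )
```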

zchunk is neat. It's hard to implement in pure Python. Running it against the possibly-formatted repodata.json and preserving whitespace is a waste. It would make more individual requests to the server to get the necessary patch information.

One of the first prototypes was in Rust, but I found that the PyPy version was a bit faster on diffs. IIRC Python's json.loads() is also faster than serde deserializing to a generic serde_json::Value. When applying the patches, parsing and serializing JSON take all the time; actually applying the patches is very fast because they operate at the logical JSON level.

dholth (Contributor, Author) commented Nov 16, 2023

I've added a two-file scheme to this CEP. Instead of trying to update repodata.json each time, we accumulate patches in a simple overlay. It will be a while before (re-)writing those additions approaches the time needed to reserialize a 220 MB conda-forge/linux-64/repodata.json. There is an implementation in conda+libmamba that shows writing and reading patches. It eliminates about half of the time needed (the time spent calling json.dumps() on repodata.json). It will eliminate all repodata.json parsing on the Python side for patches that only add complete new packages. A bit more work is needed to integrate conda with the new two-file API for the solver.
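The shape of the idea, as an illustrative sketch only (not the actual conda+libmamba API): repodata.json stays untouched, whole-package additions accumulate in a small overlay file, and a reader merges the two.

```python
# Illustrative sketch only, not the actual conda+libmamba API: rewrite a small
# overlay instead of reserializing the ~220 MB repodata.json, and merge the
# overlay's "add" ops (complete new package records) when reading.
import json

def append_to_overlay(overlay_path: str, patch: list[dict]) -> None:
    try:
        with open(overlay_path) as f:
            overlay = json.load(f)
    except FileNotFoundError:
        overlay = {"patch": []}
    overlay["patch"].extend(patch)
    with open(overlay_path, "w") as f:
        json.dump(overlay, f)  # rewriting kilobytes, not hundreds of megabytes

def read_merged(repodata_path: str, overlay_path: str) -> dict:
    with open(repodata_path) as f:
        repodata = json.load(f)
    with open(overlay_path) as f:
        overlay = json.load(f)
    for op in overlay["patch"]:
        if op["op"] == "add":  # the common case observed in the patch series
            _, group, filename = op["path"].split("/", 2)
            repodata.setdefault(group, {})[filename] = op["value"]
    return repodata
```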
