A proposed redesign for Gecko's encoders and decoders. #92
@Ms2ger Thank you for the pointer. I haven't read the whole thing, but I will reply here to the particular section mentioning the shortcomings (or design differences) of Encoding:
Ultimately, I won't release 1.0.0 without those cases addressed (or explicitly rejected). I had no energy to work on them for some months (fortunately for me, Encoding seems to be a very stable library), but I can definitely say that they are mostly within reach, and if someone is interested I can always grant the required permissions.
I only briefly looked at one example, but that should be replaceable with an FST since you're using it as a map. (An FST compresses both the keys and the values, lookup is proportional to the key size, and initializing an FST that has already been built is basically free and trivial to do with …)
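For concreteness, a minimal sketch of the FST-as-a-map idea using the `fst` crate (an assumed dependency, API roughly as of fst 0.4); the code-point keys and packed byte values below are made up for illustration and are not real encoding data:

```rust
use fst::Map;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Keys must be supplied in sorted byte order; big-endian code points keep
    // numeric order and byte order consistent.
    let entries = vec![
        (0x00A9u32.to_be_bytes(), 0xA1A9u64), // hypothetical mapping
        (0x20ACu32.to_be_bytes(), 0xA3E1u64), // hypothetical mapping
    ];
    let map = Map::from_iter(entries)?;

    // Lookup cost is proportional to the key length, not the table size.
    assert_eq!(map.get(0x20ACu32.to_be_bytes()), Some(0xA3E1));

    // A pre-built FST serialized to bytes can be reopened with Map::new(bytes)
    // without any rebuild cost.
    Ok(())
}
```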
First of all, I think it's completely reasonable from @lifthrasiir's perspective not to make changes and not allow me to make changes because of what I'd like to have for Gecko. In general, from the perspective of a package maintainer, it's not a good idea to let some person on the Internet rewrite the package to their requirements or request that the package be rewritten to their requirements. It probably makes sense for everyone to wait and see if I actually get some code written according to my design. As for reconciling what I'd like to have for Gecko and having such a framework expose the current rust-encoding API, I think the fundamental API-level mismatches are:
Is there a design doc saying what this would look like?
The only way to get accurate numbers is to build the alternative and compare. It's possible to estimate the table sizes, but that doesn't take into account the size of the more complex lookup code. But for starters, I'd get rid of all the encode-side index data structures and search the decode-optimized data structures in reverse, which should get rid of at least half of the footprint. By compression I don't mean stuff like Deflate or Brotli. I mean things like storing only a quarter of Windows-1252 as a lookup table and three quarters as offsets that happen to be zero, and omitting ranges of uninteresting index slots as seen in the Gecko Big5 data.
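For illustration, a minimal sketch of the windows-1252 point above; it shows only the idea, not Gecko's or rust-encoding's actual data layout, and all but a handful of the 0x80..=0x9F entries are elided:

```rust
// Only bytes 0x80..=0x9F need stored code points; everything else maps to the
// identical code point, i.e. an offset that happens to be zero and needs no
// table at all. Only a few of the 0x80..=0x9F entries are shown here.
fn decode_windows_1252(byte: u8) -> char {
    match byte {
        0x80 => '\u{20AC}', // EURO SIGN
        0x91 => '\u{2018}', // LEFT SINGLE QUOTATION MARK
        0x92 => '\u{2019}', // RIGHT SINGLE QUOTATION MARK
        0x99 => '\u{2122}', // TRADE MARK SIGN
        // ... the remaining 0x80..=0x9F entries are elided in this sketch ...
        // ASCII and 0xA0..=0xFF are identity mappings: no table rows needed.
        _ => byte as char,
    }
}

fn main() {
    assert_eq!(decode_windows_1252(b'A'), 'A');
    assert_eq!(decode_windows_1252(0x80), '\u{20AC}');
    assert_eq!(decode_windows_1252(0xE9), '\u{00E9}'); // é, offset zero
}
```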
I don't either. :-) An FST does "compression" by virtue of representing a map as a finite state machine and minimizing the machine. Just throwing it out there; I certainly don't know whether it fares better or worse than something more domain-specific as you suggest.
Okay, I've done some research into the table sizes in my spare time. (The API side is a bit harder to research, and while I have a brief design idea and some code from past months, it will take some more weeks until I can come up with concrete code or a proposal.) Specifically, I've tested three methods for table "compression":
A crude modification to … produced the following result. As a comparison, I've also put the sizes of Gecko's current mapping tables below:
(*) Gecko has no, or a very slow, encoder implementation for these indices. The corresponding backward mappings in Encoding have also been omitted. Gecko surely beats Encoding, but not by that much, especially given that it lacks encoding tables for Big5 and JIS X 0212. When those are taken into account, Encoding is only ~10% bigger than Gecko, and much more compression is possible when encoding to legacy encodings doesn't need to be that fast (I think that is the case, given that Big5 encoding is sooooo slow, but I would be glad to be corrected). It should also be noted that Gecko's current data structure is not very well optimized: Gecko does a binary search through blocks of consecutive data values (which may be optimized for one lone entry or for values with a constant difference from their indices), while Encoding relies on a simple lookup table (plus a bitset for Big5) for forward mapping and a two-level trie for backward mapping. Since the constant-difference case is relatively common for CJK encodings, Gecko should theoretically be superior (and that explained ~35% of the original Encoding overhead), but in reality it wasn't. Given all this data, I feel that the table size concern is mostly a myth (fortunately!) and it is more a matter of how much to penalize encoding to legacy encodings. What do you think?
How does comparing to current Gecko result in that conclusion? Except for Big5, Gecko currently has tables for both encode and decode. My proposal is to have only decode tables in Gecko in the future (as seen with Big5 today) and to search the decode table (linearly) on the encoder side. This should, trivially, save the size of the encoder tables (while making encode slower), i.e. make the resulting decode-only tables smaller than the encode+decode tables in either current Gecko or rust-encoding.
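A minimal sketch of the search-the-decode-table proposal, with toy table data standing in for a real index (this is not the actual Gecko or rust-encoding layout, and byte serialization is out of scope):

```rust
// Decode-oriented table: index position -> code point.
static DECODE_TABLE: [u16; 4] = [0x4E00, 0x4E8C, 0x4E09, 0x56DB];

/// Returns the index ("pointer") that decodes to `c`, or None if the encoding
/// cannot represent it. O(table size), but needs no encode-specific data.
fn encode_lookup(c: char) -> Option<usize> {
    let cp = c as u32;
    if cp > 0xFFFF {
        return None; // the toy table only covers the BMP
    }
    DECODE_TABLE.iter().position(|&entry| entry as u32 == cp)
}

fn main() {
    assert_eq!(encode_lookup('二'), Some(1)); // U+4E8C
    assert_eq!(encode_lookup('€'), None);
}
```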
I don't have my analysis script on this computer, but when I looked at consecutive index slots that map to consecutive code points in CJK a couple of weeks back, I noticed that there aren't massive ranges of that kind. (This surprised me, since I had expected Unicode to have taken the non-Kanji/Hanja parts of the old Japanese and Korean standards over more directly.) Therefore, there don't seem to be big wins available from doing offsets in CJK, since the bookkeeping data grows large. I haven't done a proper analysis of the single-byte encodings yet, but I see more opportunity there: split the 7 bits (of the upper, non-ASCII half) into a 3-bit prefix and a 4-bit suffix, and index into a table of 8 16-bit values using the 3-bit prefix. The most-significant bit of the value tags it either as a code point offset, in which case you take the offset plus the 4-bit suffix and use that as the code point, or as a lookup index offset, in which case you take the offset plus the 4-bit suffix and use that to index into a lookup table common across all single-byte encodings to find the code point. It's unclear to me if this is worth it, since the single-byte data isn't that large to begin with.
To clarify the thinking behind my single-byte idea: it obviously adds an overhead of 432 bytes of data plus some bytes of extra instructions. What it would save is stuff like the lookup tables for A0 upwards for windows-1252 and 9F downwards for various ISO-8859-* encodings, as well as stuff like E0 to EF for ISO-8859-8 and various similar ranges in the Cyrillic encodings. Additionally, the common lookup table would compress away some duplicate lookup-table ranges between similar encodings. Again, I'm not sure if the savings are worth the trouble.
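A minimal sketch of the 3-bit-prefix/4-bit-suffix scheme from the two comments above, with made-up table contents (this is not any real encoding); per single-byte encoding it costs one 8-entry table of 16-bit values, i.e. 16 bytes, plus the shared lookup table:

```rust
/// Shared code-point lookup table, common across all single-byte encodings.
static SHARED_POINTS: [u16; 16] = [
    0x2018, 0x2019, 0x201C, 0x201D, 0x2020, 0x2021, 0x2022, 0x2026,
    0x2030, 0x2039, 0x203A, 0x20AC, 0x2122, 0x00A0, 0x00A9, 0x00AE,
];

/// Per-encoding table, one entry per 3-bit prefix.
/// MSB set:   the low 15 bits are an offset into SHARED_POINTS.
/// MSB clear: the low 15 bits are a code-point offset added to the suffix.
static TOY_SINGLE_BYTE: [u16; 8] = [
    0x0080, 0x0090, 0x8000, 0x00B0, 0x00C0, 0x00D0, 0x00E0, 0x00F0,
];

fn decode_upper_half(byte: u8) -> u16 {
    debug_assert!(byte >= 0x80);
    let low7 = (byte & 0x7F) as usize;
    let prefix = low7 >> 4; // 3 bits: one of 8 groups of 16 bytes
    let suffix = (low7 & 0x0F) as u16; // 4 bits: position inside the group
    let entry = TOY_SINGLE_BYTE[prefix];
    if entry & 0x8000 != 0 {
        // Lookup-index offset into the shared table.
        SHARED_POINTS[((entry & 0x7FFF) + suffix) as usize]
    } else {
        // Direct code-point offset.
        entry + suffix
    }
}

fn main() {
    assert_eq!(decode_upper_half(0xE9), 0x00E9); // code-point-offset case
    assert_eq!(decode_upper_half(0xAB), 0x20AC); // shared-table case
}
```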
(To avoid any confusion: the preceding comment was made without fully reading the proposal itself. I have since read it, and I understand why Big5 is that slow and why comparing against current Gecko is a bit pointless there.) I now have a working (fully tested and benchmarked) version of Encoding with smaller backward mappings. The mapping function is now at most 50x to 150x slower than the original version (the worst case is for unsupported characters), while the actual encoding process (which is what it is actually tuned for) is only about 10x slower, because the sample texts have no errors and also contain many ASCII characters. As a ballpark figure, the corresponding results for the latter are as follows:
The table sizes are as follows (the parenthesized numbers):
For the single-byte indices a plain linear search is applied, because there is not much point optimizing a 128-entry linear search. (The corresponding optimized mappings took only 20 KB of tables, after all.) For the multi-byte indices the traditional two-layer trie is used, where the lower layer contains a list of index ranges. EDIT: I've special-cased single-entry ranges so that the table size does not change but cache misses are reduced a lot (I hadn't noticed this earlier because I originally focused on the impact on the data size). This optimization improves GB 18030 a lot. Does this trade-off look fine to you?
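A minimal sketch of the range-based lower layer described above, with toy ranges (not the actual rust-encoding layout); the upper trie level that selects a bucket from the high bits of the code point is left out:

```rust
/// Code points first_cp..first_cp+len map to indices first_index..first_index+len.
#[derive(Copy, Clone)]
struct Range {
    first_cp: u32,
    first_index: u16,
    len: u16,
}

fn lookup_in_ranges(cp: u32, ranges: &[Range]) -> Option<u16> {
    for r in ranges {
        if cp >= r.first_cp && cp < r.first_cp + r.len as u32 {
            return Some(r.first_index + (cp - r.first_cp) as u16);
        }
    }
    None
}

fn main() {
    // One toy bucket: U+4E00..U+4E03 -> indices 100..103, plus a single-entry
    // range (the kind that the EDIT above special-cases).
    let bucket = [
        Range { first_cp: 0x4E00, first_index: 100, len: 4 },
        Range { first_cp: 0x4E10, first_index: 250, len: 1 },
    ];
    assert_eq!(lookup_in_ranges(0x4E02, &bucket), Some(102));
    assert_eq!(lookup_in_ranges(0x4E10, &bucket), Some(250));
    assert_eq!(lookup_in_ranges(0x4E20, &bucket), None);
}
```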
Why does the size reduction slow down decode, too? As for what trade-off level is right for encode, it's hard to say. The way I'm looking at it, it's good to reduce the size that legacy encodings take in Gecko as much as possible without making non-UTF-8 pages that have unescaped non-ASCII in URLs take a noticeable perf hit, or making non-UTF-8 form submission with normal amounts of the sort of text that a user would realistically enter into a form noticeably slow. The main reason why I care about size in Gecko is that engineering is blocked from advancing non-legacy i18n correctness (e.g. collation) on Android due to total app size considerations, so making legacy stuff that's not perf-sensitive take as little space as possible seems reasonable. Of course, everything is perf-sensitive if perf becomes user-perceivably awful. If it's easy to parametrize the trade-off at compile time, one option would be zapping the encode tables totally on Android and letting there be some encode acceleration elsewhere.
Partly because the "reduced (old)" and the "reduced (new)" benchmarks were not run at the same time: decoding performance was mostly the same for them, while encoding performance improved as I've indicated. If you mind that, I'll update the results, but keep in mind that … Partly because the forward mapping table is "compressed" (actually, remapped) to avoid large gaps (as you have suggested in the proposal): Japanese and Korean encodings have been affected and slowed down by about 25%. This corresponds to the size difference between "original" and "method 1" above (about 20 KB). I'm not sure whether taking this remapping out would improve or degrade the performance.
Does the uncompressed app size matter here? Otherwise you can make the table data more compressible (by LZ77, obviously) by simply encoding difference values instead, i.e. storing … I also wonder how much you can get with (say) 200 KB of data and code. (That's the entire size of the backward mappings in Gecko, FYI.) Any pointer would be appreciated.
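A minimal sketch of the difference-value idea above; the value-minus-index transform here is an assumption chosen for illustration, and the table is toy data:

```rust
// Delta-encode a mapping table so that a general-purpose compressor (LZ77 etc.)
// sees long runs of identical values instead of slowly increasing ones.
fn delta_encode(table: &[u16]) -> Vec<i32> {
    table
        .iter()
        .enumerate()
        .map(|(i, &v)| v as i32 - i as i32) // identity-like runs become constants
        .collect()
}

fn main() {
    // Toy forward table: two runs where code points track the index.
    let table = [0x4E00, 0x4E01, 0x4E02, 0x5300, 0x5301, 0x5303];
    let deltas = delta_encode(&table);
    // Runs of equal deltas compress far better than the raw values.
    assert_eq!(deltas, [0x4E00, 0x4E00, 0x4E00, 0x52FD, 0x52FD, 0x52FE]);
}
```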
If you agree to this plan, I can commit a version of Encoding with customizable …
(Just to be clear, I've pushed the …)
@lifthrasiir What's the unit in your table? Bytes or bits?
@lygstate IEEE 1541-2002 standardizes on …
Tell me, where is the …
@lygstate Wait, wasn't that about this comment? Anyway, in that table the unit is bytes (implied in other ways, but to be clear), and I would never use bits as a unit there.
Whoa. I had written a reply already. I have no idea why the submission has apparently failed. Sorry. Trying again:
Yes. Storage on cheap devices is sometimes cited as a problem, but in any case, libxul.so is not compressed in the APK (IIRC to make it faster to access at run time), so the code is delivered over the network and stored on the device uncompressed. (Yes, it's sad.)
Bug 864843 was the original bug where the landing of ICU (converters excluded already) was blocked on Android over size concerns, despite the non-landing causing badness in various places in the codebase, because we can't categorically get rid of Netscape-era legacy code, because Android uses the legacy code instead of ICU for collation, etc. The bug is now marked as FIXED, because the bug morphed in scope to cover b2gdroid only. The remains of the original scope moved to bug 1215247.

As can be seen in those bugs, ICU is over a 3 MB increase, so a 200 KB decrease cannot buy enough space for ICU. On one hand, there is no single 3 MB decrease to be found, so finding 200 KB in one place is already a lot. On the other hand, if it's not enough, then there's a possibility that the saving will go to waste (i.e. the space gets taken by random careless stuff). I expect there isn't going to be a 3 MB decrease, but I was planning on arguing that we should take ICU after making an effort to make some real savings, after documenting more thoroughly the badness of having to keep around Netscape-era non-ICU code paths just for Android, and after Web apps start to expect the ICU-enabled features to be there in Chrome for Android.

As for whether it makes sense to have something faster than linear search on the encoder side at a relatively small space cost, I don't have an idea of how much space cost is appropriate for what speedup. My current thinking is that if linear search is good enough for the use cases, then any optimization is YAGNI. My current thinking may well be misguided, and I haven't proven the "good enough" part.
Yesterday's conversation with @hsivonen from …
Now I have investigated a new way to do encoding conversion with a succinct data structure. The main idea comes from http://arxiv.org/pdf/0902.1038.pdf
@lygstate The figure looks interesting, but please note that method 3 above was a proof of concept that was much slower than the current implementation; you should look at this comment to see the current figures. Your scheme seems to have 223K of total data, compared to 486K optimal (which really equals "method 1" above) and 185K traded-off (10x slower); it wouldn't match @hsivonen's intended use case anyway (the required encoding perf there is bounded by network I/O, so a 10x slowdown is probably acceptable) and it would result in a significant slowdown in the "optimized" encoding process. [1] Even if it does not impact the decoding performance (and it should!), it may incur too much complexity for the size reduction. Ultimately, that approach requires actual Rust code that can be tested and benchmarked. I'll gladly review and test such a PR.

[1] My wild guess is that your scheme will be 3-5x slower depending on the trade-off; the current trie-based encoding is essentially two memory ops plus a handful of arithmetic operations. The succinct dictionary that forms the basis of the wavelet tree already requires three or more memory ops, and that's just to locate the easily-compressible run containing that index (if you have faithfully implemented the paper). My personal threshold for encoding slowdown in the "optimized" build is, by comparison, probably 1.5-2x (about 100 MB/s on my fastest box). [2]

[2] Of course, as I've indicated above, the encoder and decoder using those functions are not super-optimized, so this figure may change.
@lifthrasiir I didn't test the performance yet, but there won't be a big slowdown.
@lygstate If you think there won't be a big slowdown, please file a PR showing that.
@lifthrasiir I am working on that.
Now that I've written the code and it has been deployed, I feel I should follow up here.
I think this was the right design decision considering the Gecko use case. It allowed for a clean C++ wrapper and it allowed for SIMD acceleration of the ASCII range, which is very common in HTML/CSS/JS inputs.
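A minimal sketch of what an ASCII fast path looks like in spirit; this SWAR (word-at-a-time) version is only an illustration of the idea and stands in for, but is not, the actual SIMD code in encoding_rs:

```rust
use std::convert::TryInto;

// Scan the input in 8-byte chunks; any byte with its high bit set is
// non-ASCII, at which point the caller would fall back to the table-driven
// conversion path.
fn first_non_ascii(bytes: &[u8]) -> Option<usize> {
    let mut i = 0;
    while i + 8 <= bytes.len() {
        let chunk = u64::from_ne_bytes(bytes[i..i + 8].try_into().unwrap());
        if chunk & 0x8080_8080_8080_8080 != 0 {
            break; // a non-ASCII byte is somewhere in this chunk
        }
        i += 8;
    }
    // Finish (or pinpoint the offending byte) one byte at a time.
    bytes[i..].iter().position(|&b| b >= 0x80).map(|off| i + off)
}

fn main() {
    assert_eq!(first_non_ascii(b"plain ASCII markup"), None);
    assert_eq!(first_non_ascii("caf\u{e9}".as_bytes()), Some(3));
}
```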
During the development of encoding_rs I came up with a solution that imposes no complication on callers that don't care about how errors are attributed to input bytes but still allows for precise attribution for callers that do care (that are then responsible for remembering up to 6 bytes of past input).
encoding_rs as configured in Firefox ended up shipping without any encode-specific data tables. That is, the legacy encoders work by doing linear search over decode-oriented data tables (except for EUC-KR Hangul, which is nicely enough sorted that binary search works). I tested that the wall-clock time for encoding to legacy encodings the amount of Japanese or Chinese text that one might expect a user to submit as one form submission was acceptable on a Raspberry Pi 3, which stood in for a slow phone. The result was that Gecko's binary size went down even though encoding_rs added decode to UTF-8 and encode from UTF-8, i.e. the exposed feature set grew.

The above regressed Japanese and Chinese legacy encode speed compared to uconv, so as a backup plan (had the above turned out to be user-visibly slow despite the RPi measurement), I implemented an option (the default for encoding_rs) to have binary-search-based encode acceleration tables for encoding ideographs into the legacy Japanese and Chinese encodings. (I figured that ideographs have so little use in the modern Korean context that I didn't do this for EUC-KR.) Obviously, this is still slower than the approach rust-encoding takes. I don't know if anyone actually wants the half-way approach, but I haven't had a good enough reason to implement yet another approach in order to win encode benchmarks. (encoding_rs is faster than rust-encoding when decoding HTML content. When encoding plain-text content, encoding_rs is faster for "encode" to UTF-8, which just returns a borrowed …
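A minimal sketch of binary-search-based encode acceleration over a sorted (code point, index) table; the pairs below are toy data, not the actual encoding_rs tables:

```rust
// A small acceleration table sorted by code point, consulted before falling
// back to the linear scan of the decode-oriented table described above.
static IDEOGRAPH_POINTERS: [(u16, u16); 4] = [
    (0x4E00, 1200),
    (0x4E8C, 1534),
    (0x56DB, 2101),
    (0x9FA0, 7000),
];

fn encode_accelerated(c: char) -> Option<u16> {
    let cp = c as u32;
    if cp > 0xFFFF {
        return None; // the toy table only covers the BMP
    }
    IDEOGRAPH_POINTERS
        .binary_search_by_key(&(cp as u16), |&(point, _)| point)
        .ok()
        .map(|i| IDEOGRAPH_POINTERS[i].1)
    // On a miss, the caller would fall back to the linear search.
}

fn main() {
    assert_eq!(encode_accelerated('二'), Some(1534)); // U+4E8C
    assert_eq!(encode_accelerated('A'), None);
}
```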
@hsivonen is proposing to replace Gecko's encoding converters with a library written in Rust: https://docs.google.com/document/d/13GCbdvKi83a77ZcKOxaEteXp1SOGZ_9Fmztb9iX22v0/edit.
He identifies some arguments against using rust-encoding in its current form. Since I believe it would be great to have the wider Rust ecosystem as well as Servo and Gecko share the same library, I'd appreciate your thoughts on the proposal and whether it would be an option to adjust rust-encoding to the proposed design (keeping the current API in place as much as possible).