A proposed redesign for Gecko's encoders and decoders. #92
@Ms2ger Thank you for the pointer. I haven't read the whole thing, but I will reply here to the particular section mentioning the shortcomings (or design differences) of Encoding:
Ultimately, I won't release 1.0.0 without those cases addressed (or explicitly rejected). I had no energy to work on them for some months (fortunately for me, Encoding seems to be a very stable library), but I can definitely say that they are mostly within reach, and if someone is interested I can always grant the required permissions.
I only briefly looked at one example, but that should be replaceable with an FST since you're using it as a map. (An FST compresses both the keys and the values, lookup is proportional to the key size, and initializing an FST that has already been built is basically free and trivial to do with …)
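For concreteness, a minimal sketch of the FST-as-a-map idea using the `fst` crate (an assumed dependency, API roughly as of fst 0.4); the code-point keys and packed byte values below are made up for illustration and are not real encoding data:

```rust
use fst::Map;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Keys must be supplied in sorted byte order; big-endian code points keep
    // numeric order and byte order consistent.
    let entries = vec![
        (0x00A9u32.to_be_bytes(), 0xA1A9u64), // hypothetical mapping
        (0x20ACu32.to_be_bytes(), 0xA3E1u64), // hypothetical mapping
    ];
    let map = Map::from_iter(entries)?;

    // Lookup cost is proportional to the key length, not the table size.
    assert_eq!(map.get(0x20ACu32.to_be_bytes()), Some(0xA3E1));

    // A pre-built FST serialized to bytes can be reopened with Map::new(bytes)
    // without any rebuild cost.
    Ok(())
}
```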
First of all, I think it's completely reasonable from @lifthrasiir's perspective not to make changes and not allow me to make changes because of what I'd like to have for Gecko. In general, from the perspective of a package maintainer, it's not a good idea to let some person on the Internet rewrite the package to their requirements or request that the package be rewritten to their requirements. It probably makes sense for everyone to wait and see if I actually get some code written according to my design. As for reconciling what I'd like to have for Gecko and having such a framework expose the current rust-encoding API, I think the fundamental API-level mismatches are:
Is there a design doc saying what this would look like?
The only way to get accurate numbers is to build the alternative and compare. It's possible to estimate the table sizes, but that doesn't take into account the size of the more complex lookup code. But for starters, I'd get rid of all the encode-side index data structures and search the decode-optimized data structures in reverse, which should get rid of at least half of the footprint. By compression I don't mean stuff like Deflate or Brotli. I mean things like storing only a quarter of Windows-1252 as a lookup table and three quarters as offsets that happen to be zero, and omitting ranges of uninteresting index slots as seen in the Gecko Big5 data.
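For illustration, a minimal sketch of the windows-1252 point above; it shows only the idea, not Gecko's or rust-encoding's actual data layout, and all but a handful of the 0x80..=0x9F entries are elided:

```rust
// Only bytes 0x80..=0x9F need stored code points; everything else maps to the
// identical code point, i.e. an offset that happens to be zero and needs no
// table at all. Only a few of the 0x80..=0x9F entries are shown here.
fn decode_windows_1252(byte: u8) -> char {
    match byte {
        0x80 => '\u{20AC}', // EURO SIGN
        0x91 => '\u{2018}', // LEFT SINGLE QUOTATION MARK
        0x92 => '\u{2019}', // RIGHT SINGLE QUOTATION MARK
        0x99 => '\u{2122}', // TRADE MARK SIGN
        // ... the remaining 0x80..=0x9F entries are elided in this sketch ...
        // ASCII and 0xA0..=0xFF are identity mappings: no table rows needed.
        _ => byte as char,
    }
}

fn main() {
    assert_eq!(decode_windows_1252(b'A'), 'A');
    assert_eq!(decode_windows_1252(0x80), '\u{20AC}');
    assert_eq!(decode_windows_1252(0xE9), '\u{00E9}'); // é, offset zero
}
```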
I don't either. :-) An FST does "compression" by virtue of representing a map as a finite state machine and minimizing the machine. Just throwing it out there; I certainly don't know whether it fares better or worse than something more domain-specific as you suggest.
Okay, I've done some research into the table sizes in my spare time. (The API side is a bit harder to research, and while I have a brief design idea and some code from past months, it will take some more weeks until I can come up with concrete code or a proposal.) Specifically, I've tested three methods for table "compression":
A crude modification to … produced the following result. As a comparison, I've also put the sizes of Gecko's current mapping tables below:
(*) Gecko has no, or a very slow, encoder implementation for these indices. The corresponding backward mappings in Encoding have also been omitted. Gecko surely beats Encoding, but not by that much, especially given that it lacks encoding tables for Big5 and JIS X 0212. When those are taken into account, Encoding is only ~10% bigger than Gecko, and much more compression is possible when encoding to legacy encodings doesn't need to be that fast (I think that is the case, given that Big5 encoding is sooooo slow, but I would be glad to be corrected). It should also be noted that Gecko's current data structure is not very well optimized: Gecko does a binary search through blocks of consecutive data values (which may be optimized for one lone entry or for values with a constant difference from their indices), while Encoding relies on a simple lookup table (plus a bitset for Big5) for forward mapping and a two-level trie for backward mapping. Since the constant-difference case is relatively common for CJK encodings, Gecko should theoretically be superior (and that explained ~35% of the original Encoding overhead), but in reality it wasn't. Given all this data, I feel that the table size concern is mostly a myth (fortunately!) and it is more a matter of how much to penalize encoding to legacy encodings. What do you think?
How does comparing to current Gecko result in that conclusion? Except for Big5, Gecko currently has tables for both encode and decode. My proposal is to have only decode tables in Gecko in the future (as seen with Big5 today) and to search the decode table (linearly) on the encoder side. This should, trivially, save the size of the encoder tables (while making encode slower), i.e. make the resulting decode-only tables smaller than the encode+decode tables in either current Gecko or rust-encoding.
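A minimal sketch of the search-the-decode-table proposal, with toy table data standing in for a real index (this is not the actual Gecko or rust-encoding layout, and byte serialization is out of scope):

```rust
// Decode-oriented table: index position -> code point.
static DECODE_TABLE: [u16; 4] = [0x4E00, 0x4E8C, 0x4E09, 0x56DB];

/// Returns the index ("pointer") that decodes to `c`, or None if the encoding
/// cannot represent it. O(table size), but needs no encode-specific data.
fn encode_lookup(c: char) -> Option<usize> {
    let cp = c as u32;
    if cp > 0xFFFF {
        return None; // the toy table only covers the BMP
    }
    DECODE_TABLE.iter().position(|&entry| entry as u32 == cp)
}

fn main() {
    assert_eq!(encode_lookup('二'), Some(1)); // U+4E8C
    assert_eq!(encode_lookup('€'), None);
}
```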
I don't have my analysis script on this computer, but when I looked at consecutive index slots that map to consecutive code points in CJK a couple of weeks back, I noticed that there aren't massive ranges of that kind. (This surprised me, since I had expected Unicode to have taken the non-Kanji/Hanja parts of the old Japanese and Korean standards over more directly.) Therefore, there don't seem to be big wins available from doing offsets in CJK, since the bookkeeping data grows large. I haven't done a proper analysis of the single-byte encodings yet, but I see more opportunity there: split the 7 bits (of the upper, non-ASCII half) into a 3-bit prefix and a 4-bit suffix, and index into a table of 8 16-bit values using the 3-bit prefix. The most-significant bit of the value tags it either as a code point offset, in which case you take the offset plus the 4-bit suffix and use that as the code point, or as a lookup index offset, in which case you take the offset plus the 4-bit suffix and use that to index into a lookup table common across all single-byte encodings to find the code point. It's unclear to me if this is worth it, since the single-byte data isn't that large to begin with.
To clarify the thinking behind my single-byte idea: it obviously adds an overhead of 432 bytes of data plus some bytes of extra instructions. What it would save is stuff like the lookup tables for A0 upwards for windows-1252 and 9F downwards for various ISO-8859-* encodings, as well as stuff like E0 to EF for ISO-8859-8 and various similar ranges in the Cyrillic encodings. Additionally, the common lookup table would compress away some duplicate lookup-table ranges between similar encodings. Again, I'm not sure if the savings are worth the trouble.
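A minimal sketch of the 3-bit-prefix/4-bit-suffix scheme from the two comments above, with made-up table contents (this is not any real encoding); per single-byte encoding it costs one 8-entry table of 16-bit values, i.e. 16 bytes, plus the shared lookup table:

```rust
/// Shared code-point lookup table, common across all single-byte encodings.
static SHARED_POINTS: [u16; 16] = [
    0x2018, 0x2019, 0x201C, 0x201D, 0x2020, 0x2021, 0x2022, 0x2026,
    0x2030, 0x2039, 0x203A, 0x20AC, 0x2122, 0x00A0, 0x00A9, 0x00AE,
];

/// Per-encoding table, one entry per 3-bit prefix.
/// MSB set:   the low 15 bits are an offset into SHARED_POINTS.
/// MSB clear: the low 15 bits are a code-point offset added to the suffix.
static TOY_SINGLE_BYTE: [u16; 8] = [
    0x0080, 0x0090, 0x8000, 0x00B0, 0x00C0, 0x00D0, 0x00E0, 0x00F0,
];

fn decode_upper_half(byte: u8) -> u16 {
    debug_assert!(byte >= 0x80);
    let low7 = (byte & 0x7F) as usize;
    let prefix = low7 >> 4; // 3 bits: one of 8 groups of 16 bytes
    let suffix = (low7 & 0x0F) as u16; // 4 bits: position inside the group
    let entry = TOY_SINGLE_BYTE[prefix];
    if entry & 0x8000 != 0 {
        // Lookup-index offset into the shared table.
        SHARED_POINTS[((entry & 0x7FFF) + suffix) as usize]
    } else {
        // Direct code-point offset.
        entry + suffix
    }
}

fn main() {
    assert_eq!(decode_upper_half(0xE9), 0x00E9); // code-point-offset case
    assert_eq!(decode_upper_half(0xAB), 0x20AC); // shared-table case
}
```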
(To avoid any confusion: the preceding comment was made without fully reading the proposal itself. I have since read it, and I understand why Big5 is that slow and why comparing against current Gecko is a bit pointless there.) I now have a working (fully tested and benchmarked) version of Encoding with smaller backward mappings. The mapping function is now at most 50x to 150x slower than the original version (the worst case is for unsupported characters), while the actual encoding process (which is what it is actually tuned for) is only about 10x slower, because the sample texts have no errors and also contain many ASCII characters. As a ballpark figure, the corresponding results for the latter are as follows:
The table sizes are as follows (the parenthesized numbers):
For the single-byte indices a plain linear search is applied, because there is not much point optimizing a 128-entry linear search. (The corresponding optimized mappings took only 20 KB of tables, after all.) For the multi-byte indices the traditional two-layer trie is used, where the lower layer contains a list of index ranges. EDIT: I've special-cased single-entry ranges so that the table size does not change but cache misses are reduced a lot (I hadn't noticed this earlier because I originally focused on the impact on the data size). This optimization improves GB 18030 a lot. Does this trade-off look fine to you?
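A minimal sketch of the range-based lower layer described above, with toy ranges (not the actual rust-encoding layout); the upper trie level that selects a bucket from the high bits of the code point is left out:

```rust
/// Code points first_cp..first_cp+len map to indices first_index..first_index+len.
#[derive(Copy, Clone)]
struct Range {
    first_cp: u32,
    first_index: u16,
    len: u16,
}

fn lookup_in_ranges(cp: u32, ranges: &[Range]) -> Option<u16> {
    for r in ranges {
        if cp >= r.first_cp && cp < r.first_cp + r.len as u32 {
            return Some(r.first_index + (cp - r.first_cp) as u16);
        }
    }
    None
}

fn main() {
    // One toy bucket: U+4E00..U+4E03 -> indices 100..103, plus a single-entry
    // range (the kind that the EDIT above special-cases).
    let bucket = [
        Range { first_cp: 0x4E00, first_index: 100, len: 4 },
        Range { first_cp: 0x4E10, first_index: 250, len: 1 },
    ];
    assert_eq!(lookup_in_ranges(0x4E02, &bucket), Some(102));
    assert_eq!(lookup_in_ranges(0x4E10, &bucket), Some(250));
    assert_eq!(lookup_in_ranges(0x4E20, &bucket), None);
}
```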
Why does the size reduction slow down decode, too? As for what trade-off level is right for encode, it's hard to say. The way I'm looking at it, it's good to reduce the size that legacy encodings take in Gecko as much as possible without making non-UTF-8 pages that have unescaped non-ASCII in URLs take a noticeable perf hit, or making non-UTF-8 form submission with normal amounts of the sort of text that a user would realistically enter into a form noticeably slow. The main reason why I care about size in Gecko is that engineering is blocked from advancing non-legacy i18n correctness (e.g. collation) on Android due to total app size considerations, so making legacy stuff that's not perf-sensitive take as little space as possible seems reasonable. Of course, everything is perf-sensitive if perf becomes user-perceivably awful. If it's easy to parametrize the trade-off at compile time, one option would be zapping the encode tables totally on Android and letting there be some encode acceleration elsewhere.
Partly because the "reduced (old)" and the "reduced (new)" benchmarks were not run at the same time: decoding performance was mostly the same for them, while encoding performance improved as I've indicated. If you mind that, I'll update the results, but keep in mind that … Partly because the forward mapping table is "compressed" (actually, remapped) to avoid large gaps (as you have suggested in the proposal): Japanese and Korean encodings have been affected and slowed down by about 25%. This corresponds to the size difference between "original" and "method 1" above (about 20 KB). I'm not sure whether taking this remapping out would improve or degrade the performance.
Does the uncompressed app size matter here? Otherwise you can make the table data more compressible (by LZ77, obviously) by simply encoding difference values instead, i.e. storing … I also wonder how much you can get with (say) 200 KB of data and code. (That's the entire size of the backward mappings in Gecko, FYI.) Any pointer would be appreciated.
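A minimal sketch of the difference-value idea above; the value-minus-index transform here is an assumption chosen for illustration, and the table is toy data:

```rust
// Delta-encode a mapping table so that a general-purpose compressor (LZ77 etc.)
// sees long runs of identical values instead of slowly increasing ones.
fn delta_encode(table: &[u16]) -> Vec<i32> {
    table
        .iter()
        .enumerate()
        .map(|(i, &v)| v as i32 - i as i32) // identity-like runs become constants
        .collect()
}

fn main() {
    // Toy forward table: two runs where code points track the index.
    let table = [0x4E00, 0x4E01, 0x4E02, 0x5300, 0x5301, 0x5303];
    let deltas = delta_encode(&table);
    // Runs of equal deltas compress far better than the raw values.
    assert_eq!(deltas, [0x4E00, 0x4E00, 0x4E00, 0x52FD, 0x52FD, 0x52FE]);
}
```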
If you agree to this plan, I can commit a version of Encoding with customizable …
(Just to be clear, I've pushed the …)
@lifthrasiir What's the unit in your table? Bytes or bits?
@lygstate IEEE 1541-2002 standardizes on …
Tell me, where is the …
@lygstate Wait, wasn't that about this comment? Anyway, in that table the unit is bytes (implied in other ways, but to be clear), and I would never use bits as a unit there.
Whoa. I had written a reply already. I have no idea why the submission has apparently failed. Sorry. Trying again:
Yes. Storage on cheap devices is sometimes cited as a problem, but in any case, libxul.so is not compressed in the APK (IIRC to make it faster to access at run time), so the code is delivered over the network and stored on the device uncompressed. (Yes, it's sad.)
Bug 864843 was the original bug where the landing of ICU (converters excluded already) was blocked on Android over size concerns, despite the non-landing causing badness in various places in the codebase, because we can't categorically get rid of Netscape-era legacy code, because Android uses the legacy code instead of ICU for collation, etc. The bug is now marked as FIXED, because the bug morphed in scope to cover b2gdroid only. The remains of the original scope moved to bug 1215247.

As can be seen in those bugs, ICU is over a 3 MB increase, so a 200 KB decrease cannot buy enough space for ICU. On one hand, there is no single 3 MB decrease to be found, so finding 200 KB in one place is already a lot. On the other hand, if it's not enough, then there's a possibility that the saving will go to waste (i.e. the space gets taken by random careless stuff). I expect there isn't going to be a 3 MB decrease, but I was planning on arguing that we should take ICU after making an effort to make some real savings, after documenting more thoroughly the badness of having to keep around Netscape-era non-ICU code paths just for Android, and after Web apps start to expect the ICU-enabled features to be there in Chrome for Android.

As for whether it makes sense to have something faster than linear search on the encoder side at a relatively small space cost, I don't have an idea of how much space cost is appropriate for what speedup. My current thinking is that if linear search is good enough for the use cases, then any optimization is YAGNI. My current thinking may well be misguided, and I haven't proven the "good enough" part.
Yesterday's conversation with @hsivonen from …
Now I have investigated a new way to do encoding conversion with a succinct data structure. The main idea comes from http://arxiv.org/pdf/0902.1038.pdf
@lygstate The figure looks interesting, but please note that method 3 above was a proof of concept that was much slower than the current implementation; you should look at this comment to see the current figures. Your scheme seems to have 223K of total data, compared to 486K optimal (which really equals "method 1" above) and 185K traded-off (10x slower); it wouldn't match @hsivonen's intended use case anyway (the required encoding perf there is bounded by network I/O, so a 10x slowdown is probably acceptable) and it would result in a significant slowdown in the "optimized" encoding process. [1] Even if it does not impact the decoding performance (and it should!), it may incur too much complexity for the size reduction. Ultimately, that approach requires actual Rust code that can be tested and benchmarked. I'll gladly review and test such a PR.

[1] My wild guess is that your scheme will be 3-5x slower depending on the trade-off; the current trie-based encoding is essentially two memory ops plus a handful of arithmetic operations. The succinct dictionary that forms the basis of the wavelet tree already requires three or more memory ops, and that's just to locate the easily-compressible run containing that index (if you have faithfully implemented the paper). My personal threshold for encoding slowdown in the "optimized" build is, by comparison, probably 1.5-2x (about 100 MB/s on my fastest box). [2]

[2] Of course, as I've indicated above, the encoder and decoder using those functions are not super-optimized, so this figure may change.
@lifthrasiir I didn't test the performance yet, but there won't be a big slowdown.
@lygstate If you think there won't be a big slowdown, please file a PR showing that.
@lifthrasiir I am working on that.
Now that I've written the code and it has been deployed, I feel I should follow up here.
I think this was the right design decision considering the Gecko use case. It allowed for a clean C++ wrapper and it allowed for SIMD acceleration of the ASCII range, which is very common in HTML/CSS/JS inputs.
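A minimal sketch of what an ASCII fast path looks like in spirit; this SWAR (word-at-a-time) version is only an illustration of the idea and stands in for, but is not, the actual SIMD code in encoding_rs:

```rust
use std::convert::TryInto;

// Scan the input in 8-byte chunks; any byte with its high bit set is
// non-ASCII, at which point the caller would fall back to the table-driven
// conversion path.
fn first_non_ascii(bytes: &[u8]) -> Option<usize> {
    let mut i = 0;
    while i + 8 <= bytes.len() {
        let chunk = u64::from_ne_bytes(bytes[i..i + 8].try_into().unwrap());
        if chunk & 0x8080_8080_8080_8080 != 0 {
            break; // a non-ASCII byte is somewhere in this chunk
        }
        i += 8;
    }
    // Finish (or pinpoint the offending byte) one byte at a time.
    bytes[i..].iter().position(|&b| b >= 0x80).map(|off| i + off)
}

fn main() {
    assert_eq!(first_non_ascii(b"plain ASCII markup"), None);
    assert_eq!(first_non_ascii("caf\u{e9}".as_bytes()), Some(3));
}
```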
During the development of encoding_rs I came up with a solution that imposes no complication on callers that don't care about how errors are attributed to input bytes but still allows for precise attribution for callers that do care (that are then responsible for remembering up to 6 bytes of past input).
encoding_rs as configured in Firefox ended up shipping without any encode-specific data tables. That is, the legacy encoders work by doing linear search over decode-oriented data tables (except for EUC-KR Hangul, which is nicely enough sorted that binary search works). I tested that the wall-clock time for encoding to legacy encodings the amount of Japanese or Chinese text that one might expect a user to submit as one form submission was acceptable on a Raspberry Pi 3, which stood in for a slow phone. The result was that Gecko's binary size went down even though encoding_rs added decode to UTF-8 and encode from UTF-8, i.e. the exposed feature set grew.

The above regressed Japanese and Chinese legacy encode speed compared to uconv, so as a backup plan (had the above turned out to be user-visibly slow despite the RPi measurement), I implemented an option (the default for encoding_rs) to have binary-search-based encode acceleration tables for encoding ideographs into the legacy Japanese and Chinese encodings. (I figured that ideographs have so little use in the modern Korean context that I didn't do this for EUC-KR.) Obviously, this is still slower than the approach rust-encoding takes. I don't know if anyone actually wants the half-way approach, but I haven't had a good enough reason to implement yet another approach in order to win encode benchmarks. (encoding_rs is faster than rust-encoding when decoding HTML content. When encoding plain-text content, encoding_rs is faster for "encode" to UTF-8, which just returns a borrowed …
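A minimal sketch of binary-search-based encode acceleration over a sorted (code point, index) table; the pairs below are toy data, not the actual encoding_rs tables:

```rust
// A small acceleration table sorted by code point, consulted before falling
// back to the linear scan of the decode-oriented table described above.
static IDEOGRAPH_POINTERS: [(u16, u16); 4] = [
    (0x4E00, 1200),
    (0x4E8C, 1534),
    (0x56DB, 2101),
    (0x9FA0, 7000),
];

fn encode_accelerated(c: char) -> Option<u16> {
    let cp = c as u32;
    if cp > 0xFFFF {
        return None; // the toy table only covers the BMP
    }
    IDEOGRAPH_POINTERS
        .binary_search_by_key(&(cp as u16), |&(point, _)| point)
        .ok()
        .map(|i| IDEOGRAPH_POINTERS[i].1)
    // On a miss, the caller would fall back to the linear search.
}

fn main() {
    assert_eq!(encode_accelerated('二'), Some(1534)); // U+4E8C
    assert_eq!(encode_accelerated('A'), None);
}
```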
@hsivonen is proposing to replace Gecko's encoding converters with a library written in Rust: https://docs.google.com/document/d/13GCbdvKi83a77ZcKOxaEteXp1SOGZ_9Fmztb9iX22v0/edit.
He identifies some arguments against using rust-encoding in its current form. Since I believe it would be great to have the wider Rust ecosystem as well as Servo and Gecko share the same library, I'd appreciate your thoughts on the proposal and whether it would be an option to adjust rust-encoding to the proposed design (keeping the current API in place as much as possible).