Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Replace JSON parser with RapidJSON #7708

Merged
merged 2 commits into from
Sep 6, 2019
Merged

Conversation

lpereira
Copy link

This gets rid of the Casablanca JSON parser and replaces it with RapidJSON, which is used in in-situ mode. This reduces temporary allocations quite significantly; so much so that, in my test, this shaves ~5.5ms from startup time of a webapp ASP.NET program.

This is a draft PR mostly to gather comments around portability issues (e.g. string vs. wide strings), and possibly some gotchas that I may have missed when converting the code to use this other library. The APIs are very similar, so it shouldn't be difficult to compare the diff.

RapidJSON is based off of commit d87b698, as there has not been a release in a few years now. I have not tested it with SIMD support, but since that's a compile-time option, I don't know if it's a good idea to use it (maybe SSE2 on x64, as that's part of the ISA, but we can't enable SSE4.2 for everybody; in any case, I'd have to measure to see if it's worth the trouble).

@AaronRobinsonMSFT
Copy link
Member

@lpereira Can you also provide the binary size comparison for both with Casablanca and with RapidJSON. The startup win is awesome!

/cc @jeffschwMSFT

Copy link
Member

@AaronRobinsonMSFT AaronRobinsonMSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a first pass.

json::object &val = prop.second.as_object();
e.assembly = val.at(_X("assembly")).as_string();
e.type = val.at(_X("type")).as_string();
auto &val = prop.value.GetObject();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a huge fan of auto when the type is observed on the right side or when using iterators. In this case, the type is opaque from usage and difficult to determine in review. I personally am okay with this but other might have a different opinion, just an FYI.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I'm a fan of the explicit type.

@lpereira
Copy link
Author

lpereira commented Aug 13, 2019

@lpereira Can you also provide the binary size comparison for both with Casablanca and with RapidJSON. The startup win is awesome!

These are the sizes with RapidJSON (release mode, Linux, x86-64, stripped):

   text	   data	    bss	    dec	    hex	filename
 306034	   2292	   1056	 309382	  4b886	libhostfxr.so
 258763	   2612	    800	 262175	  4001f	libhostpolicy.so

And these are the sizes with Casablanca (release mode, Linux, x86-64, stripped):

   text	   data	    bss	    dec	    hex	filename
 340175	   4340	   1088	 345603	  54603	libhostfxr.so
 294478	   4636	    840	 299954	  493b2	libhostpolicy.so

Which is pretty nifty, too: ~69KB of text + ~9KB of data shaved, for ~79KB of gain in size.

@lpereira
Copy link
Author

Force-pushed to fix a build issue (it was not picking up the copy of RapidJSON in the repository; I have it installed as part of the system so the build was working just fine but broke in CI), and addressed some of the comments by @AaronRobinsonMSFT (thanks!)

Copy link
Member

@elinor-fung elinor-fung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How hard would it be to also get time/size diffs on Windows? 😃

json::object &val = prop.second.as_object();
e.assembly = val.at(_X("assembly")).as_string();
e.type = val.at(_X("type")).as_string();
auto &val = prop.value.GetObject();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I'm a fan of the explicit type.

@vitek-karas
Copy link
Member

We'll also need to add attribution to the https://github.com/dotnet/core-setup/blob/master/THIRD-PARTY-NOTICES.TXT (or maybe elsewhere - I'm not sure exactly how this works)

@lpereira
Copy link
Author

We'll also need to add attribution to the https://github.com/dotnet/core-setup/blob/master/THIRD-PARTY-NOTICES.TXT (or maybe elsewhere - I'm not sure exactly how this works)

Absolutely -- I'll also add a README file pointing to RapidJSON, with information about which commit hash I obtained the files from, this kind of stuff.

@lpereira
Copy link
Author

(Force-pushed with some changes, addressing some of the comments here -- but mostly to check if this will actually build on Windows.)

@lpereira
Copy link
Author

Force-pushed after cleaning up the encoding story: data is read from the file as UTF-8, but processed as UTF-16 on Windows. This unfortunately means that we can't do in-situ parsing on Windows at this point; we might be able to convert the UTF-8 data to UTF-16 and parse that in-situ, but I'll probably do that later. At this point I'm only looking towards making this work on Windows too, things can be improved later.

@lpereira
Copy link
Author

@elinor-fung wrote:

How hard would it be to also get time/size diffs on Windows? smiley

Now that this is building on Windows, it should be possible to get the size diff with ease. I imagine performance will be slightly better, but not as good as on non-Windows platforms due to in-situ parsing not being used, though.

@lpereira
Copy link
Author

The test failures are expected: they're testing malformed JSON files, and output has changed. Will need to change the tests themselves.

@lpereira lpereira force-pushed the rapidjson-master branch 2 times, most recently from 5eef42e to 49004f0 Compare August 30, 2019 22:52
@lpereira
Copy link
Author

Finally all tests are passing on Windows.

@lpereira lpereira marked this pull request as ready for review August 30, 2019 23:38
@lpereira
Copy link
Author

lpereira commented Sep 3, 2019

Rebased, force-pushed, and addressed comments by @vitek-karas.

If this variable is set in your environment, running `build.sh -t` will
pass it to the test harness, causing these tests to fail.

Teach Command.EnvironmentVariable() to handle `null` as an empty string,
so that if the DotNetRoot() extension method is called with a `null`
parameter, the $DOTNET_ROOT and $DOTNET_ROOT(x86) environment variables
are essentially unset as far as pal::getenv() is concerned.
This decreases the host startup performance by a few milliseconds (at
the order of 5ms) at the same time that the host libraries are reduced
by almost 80KiB in size on a release build for x86-64/Linux.

For comparison, these are the sizes with RapidJSON (release mode, Linux,
x86-64, stripped):

     text	   data	    bss	    dec	    hex	filename
   306034	   2292	   1056	 309382	  4b886	libhostfxr.so
   258763	   2612	    800	 262175	  4001f	libhostpolicy.so

And these are the sizes with Casablanca (release mode, Linux, x86-64,
stripped):

     text	   data	    bss	    dec	    hex	filename
   340175	   4340	   1088	 345603	  54603	libhostfxr.so
   294478	   4636	    840	 299954	  493b2	libhostpolicy.so

RapidJSON is based off of commit d87b698d0fcc10a5f632ecbc80a9cb2a8fa094a5.
@lpereira
Copy link
Author

lpereira commented Sep 5, 2019

Force-pushed addressing comments.

Copy link
Member

@elinor-fung elinor-fung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd still be interested in seeing the Windows sizes / times too.

@lpereira
Copy link
Author

lpereira commented Sep 5, 2019

I'd still be interested in seeing the Windows sizes / times too.

If I did everything right, here are the sizes (used size from mingw in WSL). Same deal, Windows, x86-64, release, stripped binaries. Before:

   text    data     bss     dec     hex filename
546434   14504       0  560938   88f2a hostpolicy.dll
554414   11952       0  566366   8a45e hostfxr.dll

After:

   text    data     bss     dec     hex filename
530634   10928       0  541562   8437a hostfxr.dll
523864   13480       0  537344   83300 hostpolicy.dll

So there's some size reduction on Windows as well, although not quite as dramatic (~48KB overall).

The timing difference is less dramatic, clocking in at ~1.8ms. I imagined that it wouldn't be in the same ballpark as on Linux, due to it not using the in-situ parsing mode and the required conversion, but it's still a win. (I might be making some rookie mistake when measuring this on Windows, though.)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants