Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Download test data from an external source during build #1588

Closed
wants to merge 4 commits into from

Conversation

nickaein
Copy link
Contributor

@nickaein nickaein commented Apr 28, 2019

Fixes #96.

This is an attempt to reduce the repository size by moving the test data files test/data to an external repository. This can significantly cuts down the size by ~50% (from ~435 MB to ~245 MB).

To make the build process seamless and minimize the effects on the users, the external repository will be downloaded by the build scripts automatically and only when the user desire to build the tests, either by cmake or make. Therefore, this separation should be transparent to regular users.

A similar PR can be considered for the files in benchmarks/data later.

Note: As for now, the test data is being downloaded from my personal repository. They should be moved into a new repository owned by @nlohmann before merging this PR.


Pull request checklist

Read the Contribution Guidelines for detailed information.

  • Changes are described in the pull request, or an existing issue is referenced.
  • The test suite compiles and runs without error.
  • Code coverage is 100%. Test cases can be added by editing the test suite.
  • The source code is amalgamated; that is, after making changes to the sources in the include/nlohmann directory, run make amalgamate to create the single-header file single_include/nlohmann/json.hpp. The whole process is described here.

Please don't

  • The C++11 support varies between different compilers and versions. Please note the list of supported compilers. Some compilers like GCC 4.7 (and earlier), Clang 3.3 (and earlier), or Microsoft Visual Studio 13.0 and earlier are known not to work due to missing or incomplete C++11 support. Please refrain from proposing changes that work around these compiler's limitations with #ifdefs or other means.
  • Specifically, I am aware of compilation problems with Microsoft Visual Studio (there even is an issue label for these kind of bugs). I understand that even in 2016, complete C++11 support isn't there yet. But please also understand that I do not want to drop features or uglify the code just to make Microsoft's sub-standard compiler happy. The past has shown that there are ways to express the functionality such that the code compiles with the most recent MSVC - unfortunately, this is not the main objective of the project.
  • Please refrain from proposing changes that would break JSON conformance. If you propose a conformant extension of JSON to be supported by the library, please motivate this extension.
  • Please do not open pull requests that address multiple issues.

@nickaein
Copy link
Contributor Author

The LGTM test fails with the following log messages. The build environment might not have access to the outside world to download the repo.

[2019-04-28 22:11:39] [build] Scanning dependencies of target doctest_main
[2019-04-28 22:11:39] [build] [  1%] Building CXX object test/CMakeFiles/doctest_main.dir/src/unit.cpp.o
[2019-04-28 22:11:50] [build] [  1%] Built target doctest_main
[2019-04-28 22:11:50] [build] Scanning dependencies of target ExternalTestData
[2019-04-28 22:11:50] [build] [  2%] Creating directories for 'ExternalTestData'
[2019-04-28 22:11:51] [build] [  3%] Performing download step (git clone) for 'ExternalTestData'
[2019-04-28 22:11:51] [build] Cloning into 'data'...
[2019-04-28 22:11:54] [build] fatal: the remote end hung up unexpectedly
[2019-04-28 22:11:54] [build] Cloning into 'data'...
[2019-04-28 22:11:54] [build] fatal: the remote end hung up unexpectedly
[2019-04-28 22:11:54] [build] Cloning into 'data'...
[2019-04-28 22:11:55] [build] fatal: the remote end hung up unexpectedly
[2019-04-28 22:11:55] [build] -- Had to git clone more than once:
[2019-04-28 22:11:55] [build]           3 times.
[2019-04-28 22:11:55] [build] CMake Error at /opt/src/_lgtm_build_dir/test/ExternalTestData-prefix/tmp/ExternalTestData-gitclone.cmake:66 (message):
[2019-04-28 22:11:55] [build]   Failed to clone repository:
[2019-04-28 22:11:55] [build]   'https://github.com/nickaein/nlohmann-json-testdata.git'
[2019-04-28 22:11:55] [build] make[2]: *** [test/CMakeFiles/ExternalTestData.dir/build.make:91: test/ExternalTestData-prefix/src/ExternalTestData-stamp/ExternalTestData-download] Error 1
[2019-04-28 22:11:55] [build] make[1]: *** [CMakeFiles/Makefile2:1801: test/CMakeFiles/ExternalTestData.dir/all] Error 2
[2019-04-28 22:11:55] [build] make: *** [Makefile:141: all] Error 2

@nickaein
Copy link
Contributor Author

As documented here, LGTM doesn't support git clone for C++ projects.

For a workaround, I modified CMakeLists.txt to check for LGTM_WORKSPACE variable and skip downloading the external test data if LGTM is detected. Since we don't actually run the tests in LGTM, the external data were not needed in the first place.

@jaredgrubb
Copy link
Contributor

Don't the git objects still exist in the repo? I understand that this would adjust the size of the working directory (the files currently checked out), but the repo-clone and on-disk storage would not be affected by this patch, right?

@nickaein
Copy link
Contributor Author

nickaein commented Apr 29, 2019

@jaredgrubb That's correct. The largest directories are test/data (~197 MBs), benchmarks/data (~60 MBs) and the git history .git (~179 MBs). This PR tackles the first one.

To reduce .git size, we have to rewrite the git history to remove the commits related to the large files. I'm still not clear how much of hassle it would cause to the downstream though.

@coveralls
Copy link

Coverage Status

Coverage remained the same at 100.0% when pulling 800ef23 on nickaein:external-tests into ee4028b on nlohmann:develop.

@jaredgrubb
Copy link
Contributor

@nlohmann : Just to confirm that after this PR happens, we will rewrite the repo? I personally don't object, I just want to make sure that is the PoR, or else this patch really doesn't add any benefit.

@nickaein
Copy link
Contributor Author

nickaein commented Apr 29, 2019

Is this PR useful on its own?

Very good point! I agree that the PR would be less effective if we are not going to reduce the size of .git later. However, I believe this can be still useful since:

  1. Having this change in place, the future additions to test/data and benchmarks/data will not further increase the size of repository. We might be stuck with a ~200 MBs .git for now, but it will stay around the same size in the future.

  2. The users who does not need the full repo history can make a shallow clone (e.g. --depth 1 switch), assuming their setup supports it. Currently, even a shallow clone contains test data lead to a large repo.

But still, I agree that removing test/benchmark data from git history would be the way to go.

Rewriting history is probably a bad idea

Regarding the git history rewrite, I think I was too optimistic initially. It could be controversial to say the least. Ignoring contributors, the most significant side-effect of rewriting history for the end-users is that all commit hashes will be changed. This invalidates all previous references that based on commit hashes which is be problematic.

For instance, in Yocto project (a Linux distribution builder), a library can be pinned down to a version using its commit hash (they also support git tags):

http://cgit.openembedded.org/meta-openembedded/tree/meta-oe/recipes-devtools/nlohmann-json/nlohmann-json_git.bb?h=master

They have pinned down to version 3.3.0 by setting this hash:

SRCREV=aafad2be1f3cd259a1e79d2f6fcf267d1ede9ec7

A history rewrite would force them to change this commit. Also, a history rewrite will premananetly invalides all links like above.

To put it simply, rewriting the history would not only affects the contributors, but also the users and scripts who rely on current commit hashes.

Possible solution

A safer option would be:

  1. At some point (e.g. at version 4.0.0) we make this repository read-only and mark it as legacy.
  2. A new repository (e.g. https://github.com/nlohmann/modern-json) is created as a clone of legacy repo with its history rewritten to remove all test/benchmark data.
  3. Since we are rewriting the history, we can also do other things like revisiting the directories layout, removing/squashing other commits, etc. if necessary.
  4. The future developments is done inside the new repository.

This can be done other way around to keep the repository stars and followers: Rename the current repo to modern-json and create a read-only repo at previous URL json.

In this way, all of the previous history with original commit IDs will be kept as before and will be available as long as desired. The legacy repo can warns the users to use to new repo for the latest releases. Any newcomer or a current user who is willing to upgrade can use to the new repo.

I believe this is the viable and least intrusive solution for reducing the size of git history significantly. The question of When this should be done?, and Does it worth it? is up to the @nlohmann and other contributors to discuss and answer.

@eliasdaler
Copy link

Just want to say that even if the repo remains as-is, I'd want to have release zips with source code which contain no testing data. This for me would be very useful and a good compromise.

@stale
Copy link

stale bot commented Jun 20, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated label Jun 20, 2019
@nlohmann nlohmann removed the state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated label Jun 25, 2019
@stale
Copy link

stale bot commented Jul 20, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated label Jul 20, 2019
@stale stale bot closed this Jul 27, 2019
@tuananh
Copy link

tuananh commented Sep 16, 2019

Just want to say that even if the repo remains as-is, I'd want to have release zips with source code which contain no testing data. This for me would be very useful and a good compromise.

yeah, i dont really worry about repo size. as long as release remains small.

@eliasdaler
Copy link

The problem that remains for me is that release zip doesn't contain CMakeLists.txt, so it's hard to integrate the library into CMake build process (you have to make the target yourself)

@nickaein nickaein deleted the external-tests branch April 23, 2021 23:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated
Projects
None yet
Development

Successfully merging this pull request may close these issues.

17 MB / 90 MB repo size!?
6 participants