
Build problems for several processor architectures #474

Open
lepalom opened this issue Jun 1, 2020 · 19 comments

@lepalom

lepalom commented Jun 1, 2020

We are trying to incorporate 0.6.1 into Debian and have found several build problems on some platforms. You can check them here:
https://buildd.debian.org/status/package.php?p=fcl&suite=experimental

For instance,

  • In arm64:
    92% tests passed, 3 tests failed out of 37
    The following tests FAILED:
    9 - test_fcl_capsule_capsule (Failed)
    32 - test_gjk_libccd-inl_epa (Failed)
    34 - test_gjk_libccd-inl_gjk_doSimplex2 (Failed)

  • In i386:
    81% tests passed, 7 tests failed out of 37
    The following tests FAILED:
    9 - test_fcl_capsule_capsule (Failed)
    21 - test_fcl_signed_distance (Failed)
    30 - test_convex (Failed)
    31 - test_capsule (Failed)
    32 - test_gjk_libccd-inl_epa (Failed)
    35 - test_sphere_box (Failed)
    36 - test_sphere_cylinder (Failed)

  • In mips64:
    The test suite is not run.

  • In mipsel,
    Does not build because of memory problems.

  • In ppc64el,
    92% tests passed, 3 tests failed out of 37
    The following tests FAILED:
    9 - test_fcl_capsule_capsule (Failed)
    32 - test_gjk_libccd-inl_epa (Failed)

  • In s390x,
    95% tests passed, 2 tests failed out of 37
    The following tests FAILED:
    9 - test_fcl_capsule_capsule (Failed)
    32 - test_gjk_libccd-inl_epa (Failed)

Could you please take a look at the build logs and help us get fcl running on those platforms?

@sherm1
Member

sherm1 commented Jun 1, 2020

Random observation: lots of warnings like this in the logs:

/include/fcl/geometry/bvh/BVH_model-inl.h:141:11: 
warning: ‘void* memcpy(void*, const void*, size_t)’ writing to an object of type 
‘struct fcl::BVNode<fcl::RSS<double> >’ with no trivial copy-assignment; 
use copy-assignment or copy-initialization instead [-Wclass-memaccess]
  141 |     memcpy(bvs, other.bvs, sizeof(BVNode<BV>) * num_bvs);

Not clear whether those might be causing the test failures.

@simonschmeisser

Two months until the feature freeze for the next Debian version; it would be nice to get 0.6 in!

@sherm1
Member

sherm1 commented Nov 10, 2020

@jamiesnape @SeanCurtis-TRI any thoughts on diagnosing these problems?

@SeanCurtis-TRI
Contributor

Sorry for the long delay in responding to this. You can consider that delay a reflection of the maintainers' current level of resource availability. In principle, we like the idea of broad platform support; in practice, it can be problematic. As such, we propose the following and invite further insights and arguments from the community.

Currently supported

  • amd64 (currently under test in CI)

Probably will explicitly support

  • arm64

Probably won't explicitly support -- available on Travis-CI but not worth the CI "cost"

  • ppc64el
  • s390x

Almost definitely will not provide formal support

  • i386
  • mips64
  • mipsel

For arm64, we're willing to extend the CI to cover that architecture (and deal with whatever failures arise). We're loath to claim support for any platform that we don't have legitimate CI coverage for. We're unsure about the utility of the PowerPC platforms (even though Travis-CI does support them) and are reluctant to force every PR to run through them without good reasons for explicit support.

This is our first pass on "resolving" this issue and, again, we're open to other views. But the message that I really hope to make clear is that we're going to have to rely on the community's help (which has been great as recent build PRs attest) if we want to extend support across more platforms.

@jamiesnape
Contributor

There was historically some utility in running i386 builds to flush out a few obscure bugs that assume 64 bits, but now that ARM is mostly 64-bit, maybe the assumption is fine regardless. It looks (unsurprisingly) like a lot of tolerance issues, but anything mentioning memcpy raises red flags for me with respect to multi-architecture support.

@lepalom
Author

lepalom commented Dec 14, 2020

FYI: @roehling did a good job patching fcl so that it builds on several platforms. See:

https://buildd.debian.org/status/package.php?p=fcl&suite=experimental

@jamiesnape
Contributor

jamiesnape commented Dec 17, 2020

So these three patches? https://sources.debian.org/src/fcl/0.6.1-3/debian/patches/ (the last one being the most relevant)

@lepalom
Author

lepalom commented Dec 17, 2020

Yes. If you want, please use them.

@jamiesnape
Contributor

jamiesnape commented Dec 17, 2020

They all look good to me. I can PR them if @SeanCurtis-TRI and/or @sherm1 like the idea.

@SeanCurtis-TRI
Contributor

Seems like a definite improvement. I'm in favor of it.

@jamiesnape
Contributor

Seems like it should be easy enough, so I will put up a PR shortly.

@jamiesnape
Contributor

#517

@roehling
Contributor

roehling commented Dec 17, 2020

I'm glad you find my work useful. Still, I want to point out that those patches do not really eliminate the test failures. I have to confess I "fixed" the failing builds by ignoring the failing test results on non-amd64 architectures…

I believe that the test failures are largely due to differences in hardware floating-point formats, which lead to unexpected loss of precision. You can eliminate all test failures on i386 with -mfpmath=sse -msse2, so that the infamous x87 FPU does not mess up intermediate results with gratuitous conversions between double and extended precision. The s390x architecture has a non-IEEE floating-point register format as well.

The most interesting test failures are those on arm64, both because of your stated intention to support that architecture and because they seem to be the common denominator of all the failures; if you fix the problem there, there is a good chance that the improved numerical stability will also fix the other problematic architectures:

The following tests FAILED:
	  9 - test_fcl_capsule_capsule (Failed)
	 32 - test_gjk_libccd-inl_epa (Failed)
	 34 - test_gjk_libccd-inl_gjk_doSimplex2 (Failed)

For reference, I'll post the outputs of the failing tests here. Maybe someone with a better grasp of the codebase has a good idea on how to tackle those:

[ RUN      ] CapsuleCapsuleSegmentTest/0.OverlappingCenterLines
./test/test_fcl_capsule_capsule.cpp:755: Failure
Value of: this->RunTestConfiguration( CapsuleCapsuleSegmentTest<TypeParam>::kMultipleOverlap)
  Actual: false (Values at (0, 0) exceed tolerance: 11.862063780292655 vs. 12.178490743795633, diff = 0.31642696350297861, tolerance = 2.0097183471152322e-14
m1 =
 11.862063780292655
 13.293042338207796
-2.8160494855694154
m2 =
 12.178490743795633
 13.201144346700222
-1.6102598123986924
delta=
-0.31642696350297861
0.091897991507574162
 -1.2057896731707229)
Expected: true
[  FAILED  ] CapsuleCapsuleSegmentTest/0.OverlappingCenterLines, where TypeParam = double (0 ms)
[ RUN      ] FCL_GJK_EPA.ComputeVisiblePatchColinearNewVertex
./test/narrowphase/detail/convexity_based_algorithm/test_gjk_libccd-inl_epa.cpp:731: Failure
Value of: std::unordered_set<ccd_pt_edge_t*>( {&(tet.e(1)), &(tet.e(4)), &(tet.e(5))})
  Actual: { 0xaaaaedb29090, 0xaaaaedb29110, 0xaaaaedb28090 }
Expected: border_edges
Which is: { 0xaaaaedb28110, 0xaaaaedb29110, 0xaaaaedb27320 }
./test/narrowphase/detail/convexity_based_algorithm/test_gjk_libccd-inl_epa.cpp:733: Failure
Value of: 3u
  Actual: 3
Expected: visible_faces.size()
Which is: 1
./test/narrowphase/detail/convexity_based_algorithm/test_gjk_libccd-inl_epa.cpp:734: Failure
Value of: std::unordered_set<ccd_pt_face_t*>( {&(tet.f(0)), &(tet.f(1)), &(tet.f(3))})
  Actual: { 0xaaaaedb27c10, 0xaaaaedb27560, 0xaaaaedb27500 }
Expected: visible_faces
Which is: { 0xaaaaedb27c10 }
./test/narrowphase/detail/convexity_based_algorithm/test_gjk_libccd-inl_epa.cpp:736: Failure
Value of: 3u
  Actual: 3
Expected: internal_edges.size()
Which is: 0
./test/narrowphase/detail/convexity_based_algorithm/test_gjk_libccd-inl_epa.cpp:737: Failure
Value of: std::unordered_set<ccd_pt_edge_t*>( {&(tet.e(0)), &(tet.e(2)), &(tet.e(3))})
  Actual: { 0xaaaaedb27320, 0xaaaaedb28110, 0xaaaaedb27c70 }
Expected: internal_edges
Which is: {}
[  FAILED  ] FCL_GJK_EPA.ComputeVisiblePatchColinearNewVertex (1 ms)
[ RUN      ] DoSimplex2Test.OriginInSimplex
./test/narrowphase/detail/convexity_based_algorithm/test_gjk_libccd-inl_gjk_doSimplex2.cpp:72: Failure
Value of: 0
Expected: norm_OB_.dot(phat_OB_)
Which is: 7.5711e-18
[  FAILED  ] DoSimplex2Test.OriginInSimplex (0 ms)
[ RUN      ] DoSimplex2Test.NeedMoreComputing
./test/narrowphase/detail/convexity_based_algorithm/test_gjk_libccd-inl_gjk_doSimplex2.cpp:72: Failure
Value of: 0
Expected: norm_OB_.dot(phat_OB_)
Which is: 7.5711e-18
[  FAILED  ] DoSimplex2Test.NeedMoreComputing (0 ms)
[ RUN      ] DoSimplex2Test.Region1Boundary1
./test/narrowphase/detail/convexity_based_algorithm/test_gjk_libccd-inl_gjk_doSimplex2.cpp:72: Failure
Value of: 0
Expected: norm_OB_.dot(phat_OB_)
Which is: 7.5711e-18
[  FAILED  ] DoSimplex2Test.Region1Boundary1 (5 ms)
[ RUN      ] DoSimplex2Test.Region1Boundary2
./test/narrowphase/detail/convexity_based_algorithm/test_gjk_libccd-inl_gjk_doSimplex2.cpp:72: Failure
Value of: 0
Expected: norm_OB_.dot(phat_OB_)
Which is: 7.5711e-18
[  FAILED  ] DoSimplex2Test.Region1Boundary2 (5 ms)
[ RUN      ] DoSimplex2Test.Region1Boundary3
./test/narrowphase/detail/convexity_based_algorithm/test_gjk_libccd-inl_gjk_doSimplex2.cpp:72: Failure
Value of: 0
Expected: norm_OB_.dot(phat_OB_)
Which is: 7.5711e-18
[  FAILED  ] DoSimplex2Test.Region1Boundary3 (4 ms)

@SeanCurtis-TRI
Contributor

@roehling Thanks for the elaboration. While it doesn't get us to the targeted support, at the very least, it's a monotonic improvement of the code. :) So, I'll take the little victory today even if we still haven't won the battle.

@jamiesnape
Contributor

We can automatically pass -mfpmath=sse -msse2 on i386 only; that is pretty trivial and won't affect anyone else. But it won't be exercised by CI here, so if anything else breaks on i386, it is not going to get caught.
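A sketch of how that could be wired up in CMake (the processor-matching regex and placement here are illustrative, not FCL's actual build logic):

```cmake
# Only on 32-bit x86: force SSE math so intermediates are not kept
# in x87 80-bit registers with excess precision.
if(CMAKE_SYSTEM_PROCESSOR MATCHES "^i[3-6]86$")
  add_compile_options(-mfpmath=sse -msse2)
endif()
```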

@sancelot

sancelot commented Dec 19, 2020

Regarding my issues trying to compile for MSYS2/MinGW64 with GCC 10.2 (#516), this patch solved many problems, but some DLL import/export definition problems remain.

@jamiesnape jamiesnape changed the title Build problems in several platforms Build problems for several processor architectures Dec 21, 2020
@mahiuchun
Contributor

The test result on my Apple M1 box:

95% tests passed, 2 tests failed out of 41
The following tests FAILED:
21 - test_fcl_signed_distance (Failed)
38 - test_sphere_box (Failed)

We can make the tests pass with a slight increase in tolerances.

@jamiesnape
Contributor

@mahiuchun We would welcome a PR with the exact tolerance changes.

@mahiuchun
Contributor

#522
