Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermittent compilation failures due to std::_Select utility #792

Closed
leezu opened this issue May 5, 2020 · 5 comments
Closed

Intermittent compilation failures due to std::_Select utility #792

leezu opened this issue May 5, 2020 · 5 comments
Labels
external This issue is unrelated to the STL

Comments

@leezu
Copy link

leezu commented May 5, 2020

Describe the bug
Various Nvidia software (Thrust, CUB) causes intermittent compilation failures with MSVC 2019.
All come in the style of ::_Select<__formal>::_Apply': 'X' is not a valid template type argument for parameter '<unnamed-symbol>'. Would this be a bug in MSVC / STL or do you have any recommendations about how NVidia should adapt their software for MSVC support?

This affects many projects using MSVC and Cuda, such as MXNet and PyTorch.

There are more examples of the intermittent failures at NVIDIA/thrust#1090

@alliepiper
Copy link

To summarize the discussion on the linked issue -- it looks like there is an issue with MSVC's STL in some usage of its internal std::_Select<bool> utility, since it seems to be getting a non-bool for its template argument.

Some details from my attempts to track this down:

The line that the issue refers to is an expansion of this macro:

#define __THRUST_DEFINE_HAS_NESTED_TYPE(trait_name, nested_type_name) \
template<typename T> \
  struct trait_name  \
{                    \
  typedef char yes_type; \
  typedef int  no_type;  \
  template<typename S> static yes_type test(typename S::nested_type_name *); \
  template<typename S> static no_type  test(...); \
  static bool const value = sizeof(test<T>(0)) == sizeof(yes_type);\
  typedef thrust::detail::integral_constant<bool, value> type;\
};

With these inputs:

__THRUST_DEFINE_HAS_NESTED_TYPE(has_value_type, value_type)

(Let's please just ignore the reserved __ identifier for now unless it's actually relevant...fixing these is on my TODO list!)

This is a standard and common MPL pattern to test for a member type alias, and does not invoke any STL traits, etc. The thrust::detail::integral_constant seems like a possible hiding place for this sort of bug, but its implementation is straightforward:

 template<typename T, T v>
   struct integral_constant
   {
     THRUST_INLINE_INTEGRAL_MEMBER_CONSTANT T value = v;

     typedef T                       value_type;
     typedef integral_constant<T, v> type;

     #if THRUST_CPP_DIALECT >= 2011
     integral_constant() = default;
     integral_constant(integral_constant const&) = default;
     integral_constant& operator=(integral_constant const&) = default;

     constexpr __host__ __device__
     integral_constant(std::integral_constant<T, v>) noexcept {}
     #endif

     THRUST_CONSTEXPR __host__ __device__ 
     operator value_type() const THRUST_NOEXCEPT { return value; }
     THRUST_CONSTEXPR __host__ __device__ 
     value_type operator()() const THRUST_NOEXCEPT { return value; }
   };

Maybe it's being used in a _Select call but the implicit conversion isn't firing? Otherwise, I don't see where this bad _Select instantiation is coming from, and the issue being intermittent is especially bizarre.

The macros used above expand to nothing surprising:

#if THRUST_CPP_DIALECT >= 2011
#  define THRUST_INLINE_INTEGRAL_MEMBER_CONSTANT static constexpr
#  define THRUST_CONSTEXPR constexpr
#  define THRUST_NOEXCEPT noexcept
#else
#  define THRUST_INLINE_INTEGRAL_MEMBER_CONSTANT static const
#  define THRUST_CONSTEXPR
#  define THRUST_NOEXCEPT throw()
#endif

The error messages don't provide any other breadcrumbs to track down, and only contains unexpanded local template typename parameters like U1. Any tips from the Visual Studio team to help diagnose this further are welcome.

I'll keep following this issue in case Thrust needs to follow up with this, though this seems like a toolchain / environment issue since it is intermittent and also reported to affect Eigen headers in pytorch, too.

We have not been able to reproduce this issue in Thrust directly.

@CaseyCarter CaseyCarter added the bug Something isn't working label May 5, 2020
@StephanTLavavej
Copy link
Member

This is definitely an unusual compiler bug - 99.9% of the time, when the STL triggers compiler bugs, they are totally deterministic.

Any tips from the Visual Studio team to help diagnose this further are welcome.

It looks like the error might be emitted by the MSVC compiler, after the CUDA compiler has split up the source code for the host vs. device compilers? (My understanding of the CUDA compilation flow is still very vague.) If so, is it possible to capture the source code sent to MSVC when the error happens? It would be useful to know if the intermittent behavior is happening before or after MSVC processes the code.

Separately, I observe that the __THRUST_DEFINE_HAS_NESTED_TYPE macro involved here uses classic C++98/03 SFINAE. That should behave reliably, but years ago we switched the STL over to using void_t SFINAE. The idea was that void_t, in addition to being less verbose and more powerful, can be used consistently - and if we perform all SFINAE through the same pattern, we're less likely to trigger unusual compiler bugs. (The STL has triggered about a million zillion SFINAE bugs over the years.) Perhaps using the void_t technique (doesn't have to be with the STL helper types) in this macro would avoid whatever bizarre behavior is happening here, although that would be a workaround instead of finding the root cause.

As for _Select, there are only a few uses throughout the entire STL; they are:

STL/stl/inc/atomic

Lines 1481 to 1494 in 65d98ff

template <class _Ty>
using _Choose_atomic_base2_t =
typename _Select<is_integral_v<_Ty> && !is_same_v<bool, _Ty>>::template _Apply<_Atomic_integral_facade<_Ty>,
typename _Select<is_pointer_v<_Ty> && is_object_v<remove_pointer_t<_Ty>>>::template _Apply<_Atomic_pointer<_Ty>,
_Atomic_storage<_Ty>>>;
#if _HAS_CXX20
template <class _Ty>
using _Choose_atomic_base_t =
typename _Select<is_floating_point_v<_Ty>>::template _Apply<_Atomic_floating<_Ty>, _Choose_atomic_base2_t<_Ty>>;
#else // ^^^ _HAS_CXX20 // !_HAS_CXX20 vvv
template <class _Ty>
using _Choose_atomic_base_t = _Choose_atomic_base2_t<_Ty>;
#endif //_HAS_CXX20

STL/stl/inc/xmemory

Lines 736 to 739 in 65d98ff

template <class _Alloc, class _Value_type>
using _Maybe_rebind_alloc_t =
typename _Select<is_same_v<typename _Alloc::value_type, _Value_type>>::template _Apply<_Alloc&,
_Rebind_alloc_t<_Alloc, _Value_type>>;

STL/stl/inc/type_traits

Lines 966 to 971 in 65d98ff

template <>
struct _Make_signed2<4> {
template <class _Ty>
using _Apply = // assumes LLP64
typename _Select<is_same_v<_Ty, long> || is_same_v<_Ty, unsigned long>>::template _Apply<long, int>;
};

STL/stl/inc/type_traits

Lines 1011 to 1017 in 65d98ff

template <>
struct _Make_unsigned2<4> {
template <class _Ty>
using _Apply = // assumes LLP64
typename _Select<is_same_v<_Ty, long> || is_same_v<_Ty, unsigned long>>::template _Apply<unsigned long,
unsigned int>;
};

STL/stl/inc/type_traits

Lines 1204 to 1209 in 65d98ff

template <class _Ty>
struct decay { // determines decayed version of _Ty
using _Ty1 = remove_reference_t<_Ty>;
using _Ty2 = typename _Select<is_function_v<_Ty1>>::template _Apply<add_pointer<_Ty1>, remove_cv<_Ty1>>;
using type = typename _Select<is_array_v<_Ty1>>::template _Apply<add_pointer<remove_extent_t<_Ty1>>, _Ty2>::type;
};

We're using _Select to improve compiler throughput, and there are no known bugs in the library code itself. Accordingly, I believe that this issue should be resolved as external because it is a compiler bug (of as-yet-unknown origin), not a bug in library code, although I'd be happy to keep the issue open for discussion for a while.

@alliepiper
Copy link

It looks like the error might be emitted by the MSVC compiler, after the CUDA compiler has split up the source code for the host vs. device compilers? (My understanding of the CUDA compilation flow is still very vague.) If so, is it possible to capture the source code sent to MSVC when the error happens?

Good idea. @leezu @mikoro Could one of you try this? According to this, it looks like the intermediate version can be obtained by:

  1. Change -c in the failing nvcc invocation to -cuda and remove the -o [filename] option. Invoke this in a VS developer terminal.
  2. This should produce a [source filename].cpp.ii file. Paste that into a github snippet/pastebin and link it here.
  3. Maybe do it a few times and diff the results to see if the nvcc output is consistent.

Separately, I observe that the __THRUST_DEFINE_HAS_NESTED_TYPE macro involved here uses classic C++98/03 SFINAE. That should behave reliably, but years ago we switched the STL over to using void_t SFINAE.

It's good to know that void_t is a preferred way of handling this on MSVC. Thrust recently (last month) deprecated C++03, and we're looking to simplify our metaprogramming layer in these sorts of ways.

As for _Select, there are only a few uses throughout the entire STL; they are:

The allocator usage might be relevant, since the error points at Thrust's has_value_type<T> trait, and
has_value_type<T> is only used in is_allocator<T>. I'm not quite sure how this fits in yet, since @leezu and @mikoro have confirmed that their code does not use this trait directly, and I can't find any internal usages of it, either.

I'd be happy to keep the issue open for discussion for a while.

That'd be great -- until we figure this out, it'll be helpful to use this issue to discuss any MSVC-related issues that come up.

@StephanTLavavej StephanTLavavej added external This issue is unrelated to the STL and removed bug Something isn't working labels May 6, 2020
@alliepiper
Copy link

In light of NVIDIA/thrust#1090 (comment), this can be closed. Thanks for the help!

@CaseyCarter
Copy link
Contributor

In light of thrust/thrust#1090 (comment), this can be closed. Thanks for the help!

Glad you finally ran the bug down - this was a nasty one. Let us know if there's anything else we can do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
external This issue is unrelated to the STL
Projects
None yet
Development

No branches or pull requests

4 participants