Vectorized (SIMD) Numerical Schemes #1022
Conversation
```cpp
template<size_t nDim>
struct CCompressiblePrimitives {
  enum : size_t {nVar = nDim+4};
  VectorDbl<nVar> all;
  FORCEINLINE Double& temperature() { return all(0); }
  FORCEINLINE Double& pressure() { return all(nDim+1); }
  FORCEINLINE Double& density() { return all(nDim+2); }
  FORCEINLINE Double& enthalpy() { return all(nDim+3); }
  FORCEINLINE Double& velocity(size_t iDim) { return all(iDim+1); }
  FORCEINLINE const Double& temperature() const { return all(0); }
  FORCEINLINE const Double& pressure() const { return all(nDim+1); }
  FORCEINLINE const Double& density() const { return all(nDim+2); }
  FORCEINLINE const Double& enthalpy() const { return all(nDim+3); }
  FORCEINLINE const Double& velocity(size_t iDim) const { return all(iDim+1); }
  FORCEINLINE const Double* velocity() const { return &velocity(0); }
};
```
I'm also trying to address the magic indexes (iDim+1) by defining specific types for primitive and conservative variables rather than using raw arrays.
Initially I thought of making temperature, pressure, etc. references pointing into the correct position of the storage vector (all), but unfortunately that disables compiler-generated constructors... so they are functions.
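For context, here is a minimal self-contained sketch (not the PR code) of the trade-off mentioned above: with reference members the compiler-generated copy assignment is deleted, while accessor functions keep the type fully copyable and assignable.

```cpp
#include <array>

// Reference members force the compiler to delete the copy assignment operator
// (and, without in-class initializers, the default constructor too).
struct WithReferences {
  std::array<double, 5> all{};
  double& temperature = all[0];
  // WithReferences& operator=(const WithReferences&) is implicitly deleted.
};

// Accessor functions provide the same named access without disabling anything.
struct WithAccessors {
  std::array<double, 5> all{};
  double& temperature() { return all[0]; }
  const double& temperature() const { return all[0]; }
};
```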
```cpp
/*!
 * \brief Get a SIMD gather iterator to the inner dimension of the container.
 */
template<size_t nCols, class T, size_t N>
FORCEINLINE CInnerIterGather<simd::Array<T,N> > innerIter(simd::Array<T,N> row) const noexcept
{
  return CInnerIterGather<simd::Array<T,N> >(m_data, IsRowMajor? 1 : this->rows(), IsRowMajor? row*nCols : row);
}
```
The only places where the "vector nature" of the SIMD type needs to be handled explicitly are when retrieving data from containers (C2DContainer and CVectorOfMatrix) and when putting it back into other containers (CSysVector and CSysMatrix).
Data is retrieved via iterators (see #789 for the reason; spoiler alert, it's for performance).
EDIT: Or via methods that copy the data into a static container of SIMD types; the iterator approach only performs well if the target CPU has fast gather instructions (the majority don't).
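To make the "copy into a static container of SIMD types" idea concrete, here is a minimal self-contained sketch (not the PR code; Pack, SIMD_SIZE, and gatherVariables are made-up names) showing how each variable of a SIMD group of points can be loaded lane by lane instead of relying on hardware gather instructions:

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t SIMD_SIZE = 4;                 // number of lanes, assumption
using Pack = std::array<double, SIMD_SIZE>;          // stand-in for the SIMD type

// Copy nVar variables for SIMD_SIZE points into a static array of packs.
// "data(iPoint, iVar)" is any 2D container with that access signature.
template<std::size_t nVar, class Container>
std::array<Pack, nVar> gatherVariables(const Container& data,
                                       const std::array<std::size_t, SIMD_SIZE>& iPoint) {
  std::array<Pack, nVar> vars{};
  for (std::size_t iVar = 0; iVar < nVar; ++iVar)
    for (std::size_t k = 0; k < SIMD_SIZE; ++k)
      vars[iVar][k] = data(iPoint[k], iVar);         // scalar load per lane, no gather needed
  return vars;
}
```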
So in #789 I mentioned that I want these new numerics to be thread-safe and to have minimal virtual-call overhead.
For that I'm using CRTP (the curiously recurring template pattern) to get static polymorphism; the mechanics are explained below.
```cpp
                     const CVariable& solution_,
                     UpdateType updateType,
                     CSysVector<su2double>& vector,
                     CSysMatrix<su2mixedfloat>& matrix) const final {

  const bool implicit = (config.GetKind_TimeIntScheme() == EULER_IMPLICIT);
  const auto& solution = static_cast<const CEulerVariable&>(solution_);
```
Another significant change from the current numerics is that (again for thread safety) we do not "set" anything into them; we pass them the whole variable structure (the solver nodes) and the numerics can read (and only read) any data they need.
Similarly to what is done in the solvers, since we know the type of variable that goes with the numerical scheme, we can static_cast to the derived type to avoid polymorphism.
I see the base Roe flux is working with CEulerVariable. Given it's an inviscid flux considering density, momenta, and energy, I suppose this is something that will work for all compressible solvers (compressible Euler, NS, and RANS)?
It will work since CEulerVariable is used in all of those in one way or another, i.e. CNSVariable is also a CEulerVariable with extra methods.
@pcarruscag , this significant contribution on vectorization seems very valuable as one more initiative to speed up (and tidy up) the code. It's great to see SU2 evolving to make use of handy programming and C++ performance strategies.
I am not an expert in this type of implementation, but overall it seems to be implemented in a neat and advanced way, making good use of C++ tools, functions, and user-defined types.
Thank you for the review Catarina, based on your comments I will try to explain the new structure better.

The interface

These numerics classes are still abstract and so the outside world only needs to know about their interface:

```cpp
class CNumericsSIMD {
public:
  virtual void ComputeFlux(...) const = 0;
};
```

Any class that wants to be a SIMD numerics needs to inherit from this base and, in so doing, is forced to implement ComputeFlux.

Static decorator pattern

Now, in this structure we do not have separate convective and viscous classes. Instead, we have convective schemes that can be "decorated" with viscous fluxes. For example:

```cpp
template<int nDim>
class CCompressibleViscousFlux : public CNumericsSIMD {
protected:
  void viscousTerms(...) const {...}
};
```

Note that this class template is derived from CNumericsSIMD but it does not implement ComputeFlux. Convective classes then also need to be class templates so that we can programmatically change their base class:

```cpp
template<class Base>
class MyConvectiveScheme : public Base {
public:
  void ComputeFlux(...) const {
    // do my own thing
    Base::viscousTerms(...);
    // update the linear system with the result
  }
};
```

To create an instance of this class template we do, for example:

```cpp
auto obj = new MyConvectiveScheme< CCompressibleViscousFlux<2> >(...);
```

which creates an object for 2D problems with viscous terms. And so we need at least 4 instantiations of these class templates, 2D/3D with or without viscous terms, and this is done in the factory method implemented in CNumericsSIMD.cpp, which is the only cpp file in this entire implementation (a rough sketch of such a factory is given after this comment).

Static polymorphism

Another concept used in this implementation for efficiency is static polymorphism:

```cpp
template<class Derived>
class Parent {
public:
  void parentMethod() {
    // "I know I am also a Derived and so I can cast myself."
    auto derived = static_cast<Derived*>(this);
    // "now I can use a method of derived"
    derived->childMethod();
  }
};

// A derived class needs to inform the parent about itself.
class Child : public Parent<Child> {
public:
  void childMethod() {...}
};
```

Why is this better? Note that two such derived classes do not actually have the same parent, i.e. each one inherits from a different instantiation of Parent, so the parent knows the exact derived type at compile time and the call to childMethod is resolved without virtual dispatch (and can be inlined).

Putting it all together

For vectorized centered schemes we have something like:

```cpp
// Intermediate class for centered schemes, note the 2 template parameters.
template<class Derived, class Base>
class CCenteredBase : public Base {
public:
  // Main public method implemented here making use of "Derived" and "Base".
  void ComputeFlux(...) const final {
    ... // gather data, do some computation
    derived->DissipationTerms(...); // static polymorphism
    Base::ViscousTerms(...); // static decorator
    ... // update linear system
  }
};

// A final centered scheme, which is what we instantiate, with some viscous decorator.
template<class Decorator>
class CJSTScheme : public CCenteredBase<CJSTScheme<Decorator>, Decorator> {
protected:
  void DissipationTerms(...) const {...}
};
```
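For reference, a hedged sketch of what the factory mentioned above could look like; the real function in CNumericsSIMD.cpp has a different signature and covers more schemes, and CNoViscousFlux is assumed here as the name of the inviscid decorator:

```cpp
// Pick one of the four instantiations (2D/3D, with or without viscous terms).
// Illustration only; real constructors take configuration/geometry arguments.
CNumericsSIMD* createNumerics(int nDim, bool viscous) {
  if (nDim == 2) {
    if (viscous) return new CJSTScheme< CCompressibleViscousFlux<2> >();
    return new CJSTScheme< CNoViscousFlux<2> >();
  }
  if (viscous) return new CJSTScheme< CCompressibleViscousFlux<3> >();
  return new CJSTScheme< CNoViscousFlux<3> >();
}
```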
PR #1022 (SIMD) introduced some minor differences in my(!) CHT regression tests. 8184779..4b9f2a8x contains #1022 & #1080 (only 5 lines). The change is in the solid zone only and only affects the 2D case, not the 3D one. I don't know what specifically introduced the changes, but as they are small I assume for now that it is just a small numerical difference.
Proposed Changes
The objective is to provide a natural (i.e. readable) way to write code that extracts all the performance modern CPUs have to offer.
As described in the last update of #789 this requires completely different numerics classes, which also addresses other inefficiencies they have. For now all centered schemes and plain Roe have been "ported" (this also adds JST with matrix dissipation).
Most of the auxiliary CNumerics functions were also ported as a consequence (inviscid fluxes, Jacobians, and so on), so implementing new schemes only gets easier.
If you want to test this, the new numerics are called for Roe on ideal gas without low Mach corrections (enabled by option USE_VECTORIZATION=YES) and also for centered schemes.
To get good performance on gcc you need the following optimization flags:
-O2 -funroll-loops -ffast-math -march=?? -mtune=??
where ?? should be something with AVX (e.g. haswell, skylake). There are AVX- and AVX512-specific optimizations.
The OpenMP simd directive is also used; at least on gcc this makes a difference (it is stubborn about vectorizing), so pass -Dwith-omp=true when calling meson.py (a minimal illustration of the directive follows below).
You can expect around a 30% speedup (±10%) on problems that do not need a lot of linear solver iterations, or more on machines with AVX512.
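As a hedged illustration (not the actual SU2 code; the Pack type and its layout are assumptions), this is the kind of per-lane loop where the OpenMP simd directive helps the compiler vectorize:

```cpp
#include <cstddef>

// Toy SIMD wrapper, only to show where "#pragma omp simd" is applied.
template<class T, std::size_t N>
struct Pack {
  T vals[N];

  // Element-wise addition; the directive asks the compiler to vectorize the
  // loop even when its cost model would otherwise decline.
  Pack& operator+=(const Pack& other) {
#pragma omp simd
    for (std::size_t k = 0; k < N; ++k) vals[k] += other.vals[k];
    return *this;
  }
};
```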
Related Work
#789
PR Checklist