Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Begin migrate ScalarQuantizer to simdlib #3613

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mdouze
Copy link
Contributor

@mdouze mdouze commented Jul 5, 2024

Summary:
As a demo for Mengdi.

The steps to fully migrate to simdlib are:

  1. change all function interfaces to use the generic simd8float32 and friends prototypes -- make sure it compiles on fbcode.

  2. make sure it also compiles on ARM

  3. see which functions can be mirgrated to only use the generic codepath

  4. benchmark if the simd emulated path is competitve with the scalar (for platforms without specific SIMD support)

Differential Revision: D59395882

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D59395882

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D59395882

mdouze added a commit to mdouze/faiss that referenced this pull request Jul 5, 2024
Summary:
Pull Request resolved: facebookresearch#3613

As a demo for Mengdi.

The steps to fully migrate to simdlib are:

1. change all function interfaces to use the generic simd8float32 and friends prototypes -- make sure it compiles on fbcode.

2. make sure it also compiles on ARM

3. see which functions can be mirgrated to only use the generic codepath

4. benchmark if the simd emulated path is competitve with the scalar (for platforms without specific SIMD support)

Differential Revision: D59395882
@mdouze mdouze force-pushed the export-D59395882 branch from 6c98579 to 8e2068a Compare July 5, 2024 14:01
Summary:
Pull Request resolved: facebookresearch#3613

As a demo for Mengdi.

The steps to fully migrate to simdlib are:

1. change all function interfaces to use the generic simd8float32 and friends prototypes -- make sure it compiles on fbcode.

2. make sure it also compiles on ARM

3. see which functions can be mirgrated to only use the generic codepath

4. benchmark if the simd emulated path is competitve with the scalar (for platforms without specific SIMD support)

The rationale here is that there are many SIMD instructions that are straightforward, like adding or subtracting registers, they can be put in common between implementations. The only code that may remain with arch-specific intrinsics is where they way of doing things is very different between AVX and NEON.

Differential Revision: D59395882
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D59395882

@mdouze mdouze force-pushed the export-D59395882 branch from 8e2068a to 05027cd Compare July 5, 2024 14:08
@alexanderguzhva
Copy link
Contributor

@mdouze Do you have any plans to support ARM SVE, if possible? The primary problem of simdlib with ARM SVE is that it implies SIMD registers of a variable size. Technically, there are two the popular models on the market: Amazon Graviton 3 with SIMD width 256b and an upcoming Graviton 4 with SIMD with 512b, so maybe one could stick with 256 bits for now.

@mdouze
Copy link
Contributor Author

mdouze commented Jul 29, 2024

@alexanderguzhva IMO it would be great to support SVE.
What I don't understand is if the SVE size needs to be known at compile time. In that case, we could just add it as another SIMD compile for the 256 and 512 versions.

@alexanderguzhva
Copy link
Contributor

alexanderguzhva commented Jul 29, 2024

@mdouze Yes, the SVE size is known at the compile time. Usually, it is done via svcntb() instruction. The PROBLEM is that for x86 you can have registers, such as __m256, to be a part of a class or struct, but you cannot have SVE registers such as svuint8_t to be so. This will trigger a compiler error O_o. So, you will have to use workarounds, such as keeping std::uint8_t tmp[16]; inside your simdlib for SVE256, and do loads / stores between a register and a buffer. I'm not sure how compiler will be able to optimize it, I hope it will be.

@alexanderguzhva
Copy link
Contributor

what is the status of this diff? Should I wait before I bring some updates to ScalarQuantizer?

@mengdilin
Copy link
Contributor

@alexanderguzhva I'm starting to work on this but it's gonna take some time. If you want to make your changes in now, feel free to and I can work on refactoring later down the line

@alexanderguzhva
Copy link
Contributor

@mengdilin any time estimates on your end? Basically, are you in a stage where you know what to do exactly or are you in a research stage?

mengdilin added a commit to mengdilin/faiss that referenced this pull request Aug 26, 2024
Summary:
Pull Request resolved: facebookresearch#3613

As a demo for Mengdi.

The steps to fully migrate to simdlib are:

1. change all function interfaces to use the generic simd8float32 and friends prototypes -- make sure it compiles on fbcode.

2. make sure it also compiles on ARM

3. see which functions can be mirgrated to only use the generic codepath

4. benchmark if the simd emulated path is competitve with the scalar (for platforms without specific SIMD support)

The rationale here is that there are many SIMD instructions that are straightforward, like adding or subtracting registers, they can be put in common between implementations. The only code that may remain with arch-specific intrinsics is where they way of doing things is very different between AVX and NEON.

Differential Revision: D59395882
@mengdilin
Copy link
Contributor

@alexanderguzhva I think I can finish up AVX2/Neon in ScalarQuantizer around October (have other work items at hand atm). My understanding here is I should move the respective parts of AVX2 and Neon code in ScalarQuantizer into faiss/utils/simdlib_avx2.h and faiss/utils/simdlib_neon.h as part of my SIMD ramp-up. I've made some progress on the refactor, but I have not thought about how simdlib can be extended to support SVE. Before committing my progress, I'm building out a performance regression test suites that can ensure my changes don't introduce regressions across AVX2, Neon, and no optimizations.

I'm a SIMD noob here. Let me know if I'm moving in the right direction for the refactor or if I'm missing anything major.

mengdilin added a commit to mengdilin/faiss that referenced this pull request Sep 25, 2024
Summary:
Pull Request resolved: facebookresearch#3613

As a demo for Mengdi.

The steps to fully migrate to simdlib are:

1. change all function interfaces to use the generic simd8float32 and friends prototypes -- make sure it compiles on fbcode.

2. make sure it also compiles on ARM

3. see which functions can be mirgrated to only use the generic codepath

4. benchmark if the simd emulated path is competitve with the scalar (for platforms without specific SIMD support)

The rationale here is that there are many SIMD instructions that are straightforward, like adding or subtracting registers, they can be put in common between implementations. The only code that may remain with arch-specific intrinsics is where they way of doing things is very different between AVX and NEON.

Differential Revision: D59395882
mengdilin added a commit to mengdilin/faiss that referenced this pull request Sep 25, 2024
Summary:
Pull Request resolved: facebookresearch#3613

As a demo for Mengdi.

The steps to fully migrate to simdlib are:

1. change all function interfaces to use the generic simd8float32 and friends prototypes -- make sure it compiles on fbcode.

2. make sure it also compiles on ARM

3. see which functions can be mirgrated to only use the generic codepath

4. benchmark if the simd emulated path is competitve with the scalar (for platforms without specific SIMD support)

The rationale here is that there are many SIMD instructions that are straightforward, like adding or subtracting registers, they can be put in common between implementations. The only code that may remain with arch-specific intrinsics is where they way of doing things is very different between AVX and NEON.

Differential Revision: D59395882
@vorj
Copy link
Contributor

vorj commented Oct 3, 2024

@mdouze @alexanderguzhva I found it now, so I comment about above discussion:

an upcoming Graviton 4 with SIMD with 512b

Graviton4 has 128bit SVE registers:

user@ip-172-31-xx-xx:/tmp$ cat test.cpp
#include<iostream>
#include<arm_sve.h>

int main(){
        std::cout << svcntb()*8 << std::endl;
}
user@ip-172-31-xx-xx:/tmp$ g++ -march=armv9-a+sve2 -otest test.cpp
user@ip-172-31-xx-xx:/tmp$ ./test
128

What I don't understand is if the SVE size needs to be known at compile time.

Yes, the SVE size is known at the compile time. Usually, it is done via svcntb() instruction. The PROBLEM is that for x86 you can have registers, such as __m256, to be a part of a class or struct, but you cannot have SVE registers such as svuint8_t to be so. This will trigger a compiler error O_o.

Let's summarize the information around this:

  • svcntb() can't be called at compile time, because the function is not constexpr .
  • Usually, the SVE registers size can't be detected at compile time
    • Because the size is different for each CPUs, for example Graviton3 has 256bit and Graviton4 has 128bit
    • We can execute same binary on each CPUs because the binary detects SVE register length at run time
    • Thus, sizeof(svuint8_t) is not determined at compile time. This causes:
      • We can't arithmetic operate to a pointer of SVE register
        • Because (to simplify to the extereme) ptr + 1 means that reinterpret_cast<char*>(ptr) + sizeof(*ptr)
      • We can't create an array of SVE registers
        • Because arr[1] means that *(arr + 1) . Again, we can't arithmetic operate to the pointer.
      • We can't contain SVE registers into class
        • Because obj.member means that obj->*(&klass::member) . Member object pointer is caluculated with the offset from a head of a class at compile time, but anyone can't caluculate the unsized offset.

          struct S{
            svint8_t x;  // start from 0byte to ???byte of S
            int y;       // start from... where?
          };
    • So, programming the abstracted code with SVE needs some techniques. Although some abstraction is possible to make excellent use of C++ templates, the appearance of the code is quite complicated than the tradisional SIMD codes.
  • Actually, you can fix the register size of SVE for your code with -msve-vector-bits= option
    • svcntb() can't be called at compile time even if this case, but we can detect the SVE vector length with __ARM_FEATURE_SVE_BITS macro
    • When this option is passed to compiler, SVE register types will be sized types.
      • Thus sizeof(svuint8_t) will be enabled and there will be no limitation in programming.
    • However, the option makes the binary unportable across CPUs that have different SVE vector length
    • see more info at here

I've tried to make simdlib supporting SVE, but as you know that is extremely hard job. For the time being, it's better to write SVE code without much abstraction IMHO. If the package file size bloat is acceptable, fixing the vector length is an alternative.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

6 participants