
[Target][TOPI] Use LLVM for x86 CPU feature lookup #15685

Merged: 1 commit, Sep 14, 2023

Conversation

cbalint13
Contributor

@cbalint13 cbalint13 commented Sep 6, 2023

Hi folks,

This PR leverages LLVM itself for CPU feature lookup, replacing the hard-coded lists.
Relying on LLVM keeps the x86 family & feature tables maintainable.


Changes:

  • Introduce a single target_has_feature(XXX) replacing all target_has_XXX()
  • PY+FFI: expose new llvm_x86_get_archlist, llvm_x86_get_features & llvm_x86_has_feature
  • PY: expose new target_has_feature wrapper to _ffi.llvm_x86_has_feature

There is a unit test for a comprehensive check against the old behaviour.
For better reliability, this way of feature checking can also be implemented for other arches.
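To illustrate the consolidation, here is a simplified, self-contained sketch (not the actual TVM code; the arch list and feature table are tiny hand-picked excerpts standing in for what LLVM would report):

```python
# Illustrative sketch only: the real implementation queries LLVM via FFI.

# Old style: one hand-maintained arch list per predicate.
AVX512_ARCHES = {"skylake-avx512", "cascadelake", "icelake-server"}

def target_has_avx512(mcpu):
    # Static lookup: must be kept in sync with new CPUs by hand.
    return mcpu in AVX512_ARCHES

# New style: a single entry point backed by LLVM-derived feature sets.
LLVM_FEATURES = {
    "skylake": {"avx2", "fma"},
    "cascadelake": {"avx2", "avx512f", "avx512bw", "avx512vnni"},
}

def target_has_feature(feature, mcpu):
    # One query answers every "does this arch have X?" question.
    return feature in LLVM_FEATURES.get(mcpu, set())

print(target_has_feature("avx512vnni", "cascadelake"))  # True
print(target_has_feature("avx512vnni", "skylake"))      # False
```

The point is that the new style needs no per-feature function: adding a feature check is just a new string argument, and the table itself comes from LLVM rather than from hand maintenance.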

Thanks,
~Cristian.

Cc: @elvin-n , @vvchernov , @echuraev , @vinx13 , @jcf94 , @masahi

Copy link
Contributor

@kparzysz-quic kparzysz-quic left a comment


Looks good. Thanks!

Member

@junrushao junrushao left a comment


This is an awesome feature I've been thinking of. Thanks for the patch!

Contributor

@echuraev echuraev left a comment


LGTM! Thank you for your contribution. It is very useful!

@junrushao
Member

The CI fails because the LLVM version on CI is pretty low (==10). I'm curious if there's any variant of this API on LLVM 10? If not, we should bump LLVM to 15 or 16

@cbalint13
Contributor Author

cbalint13 commented Sep 8, 2023

The CI fails because the LLVM version on CI is pretty low (==10). I'm curious if there's any variant of this API on LLVM 10? If not, we should bump LLVM to 15 or 16

Folks,
@junrushao ,

Yes, I am aware of this llvm<=10 issue, so llvm==11 would be the minimum (tested).
I am looking at llvm<=10 for another way (at a small extra code cost) of tapping into their API differently.

I am strongly in favour of backward compatibility for this case.

The API fracture, at first glance:

Allow a little time (1-2 days) to investigate a way to go below llvm==11; then I'll be back with the results.

@cbalint13
Contributor Author

cbalint13 commented Sep 10, 2023

I am strongly in favour of backward compatibility for this case.
Allow a little time (1-2 days) to investigate a way to go below llvm==11; then I'll be back with the results.

Details of investigation:

For llvm>=11:

  • Direct access via the TargetParser public headers.
  • Immediate: no need for a target machine; details are fetched via the public getters.

The implementation here is slim, and the info coming from LLVM is precise and maintained.

For llvm<=10:

  • The useful data from the tablegen descriptor in class MCSubtargetInfo is private, with no useful accessor.
  • There is an unhelpful way of passing std::string("help") at the creation of MCSubtargetInfo().
  • Useful queries can be done only via a full llvm target machine; this compatibility path is implemented here.

There is the burden of LLVMint() and target-machine creation, but the final check/legalizer is precise w.r.t. the arch.
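The dual code path above can be sketched in Python as follows (the helper names are placeholders for the two C++ paths, not TVM or LLVM APIs, and the feature table is a stand-in for what either path would return):

```python
# Placeholder table standing in for what LLVM reports for an arch.
_DEMO_TABLE = {"skylake": {"avx2", "fma", "sse4.2"}}

def query_target_parser(mcpu):
    # llvm>=11 path: TargetParser public getters answer directly,
    # without creating a target machine.
    return _DEMO_TABLE.get(mcpu, set())

def query_target_machine(mcpu):
    # llvm<=10 path: build a full target machine and read the enabled
    # features off its MCSubtargetInfo.
    return _DEMO_TABLE.get(mcpu, set())

def get_features(mcpu, llvm_major):
    """Dispatch on the linked LLVM major version; both paths must agree."""
    if llvm_major >= 11:
        return query_target_parser(mcpu)
    return query_target_machine(mcpu)

print(get_features("skylake", 17) == get_features("skylake", 10))  # True
```

The design point is that callers see one function regardless of LLVM version; only the (slower) llvm<=10 path pays the target-machine creation cost.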


@junrushao , @kparzysz-quic, @echuraev

There are quite a few changes now; please help by re-reviewing them.

Thanks.

Contributor

@vvchernov vvchernov left a comment


Hello @cbalint13! Thank you for the big work and the good improvement and unification of the target check. It looks to me like "avx512bw" is not the best name to split plain avx512 (e.g. for skylake) from avx512 with VNNI (e.g. for cascadelake); maybe return "avx512"? I do not see places where it is used in another context.

@cbalint13
Contributor Author

cbalint13 commented Sep 11, 2023

Hello @cbalint13! Thank you for the big work and the good improvement and unification of the target check. It looks to me like "avx512bw" is not the best name to split plain avx512 (e.g. for skylake) from avx512 with VNNI (e.g. for cascadelake); maybe return "avx512"? I do not see places where it is used in another context.

@vvchernov ,

Good question regarding avx512bw!

Please help me double-check the statements below:

There are also other avx512 subsets, but none holding instructions for our topi/tir intrinsics; their names are:

"avx512vl", "avx512dq", "avx512cd", "avx512er", "avx512pf", "avx512vbmi", "avx512ifma",
"avx5124vnniw", "avx5124fmaps", "avx512vpopcntdq", "avx512vbmi2","avx512vnni", 
"avx512bitalg", "avx512bf16"

UPDATE: a much clearer view of what avx512bw provides: llvm/clang/Basic/BuiltinsX86.def#L1057-L1058 .


Some arches investigated here:

  • The "skylake" don't have it:
print( tvm.target.codegen.llvm_x86_get_features("skylake") )
["cmov", "mmx", "popcnt", "sse", "sse2", "sse3", "ssse3", "sse4.1", "sse4.2", 
"avx", "avx2", "fma", "bmi", "bmi2", "aes", "pclmul", "adx", "clflushopt", "cx16",
"cx8", "crc32", "f16c", "fsgsbase", "fxsr", "invpcid", "lzcnt", "movbe", "prfchw", 
"rdrnd", "rdseed", "sahf", "sgx", "x87", "xsave", "xsavec", "xsaveopt", "xsaves"]
  • The "skylake-avx512" have it:
print( tvm.target.codegen.llvm_x86_get_features("skylake-avx512") )
["cmov", "mmx", "popcnt", "sse", "sse2", "sse3", "ssse3", "sse4.1", "sse4.2", "avx", "avx2", "fma",
 "avx512f", "bmi", "bmi2", "aes", "pclmul", "avx512vl", "avx512bw", "avx512dq", "avx512cd", 
"adx", "clflushopt", "clwb", "cx16", "cx8", "crc32", "f16c", "fsgsbase", "fxsr", "invpcid", 
"lzcnt", "movbe", "pku", "prfchw", "rdrnd", "rdseed", "sahf", "x87", "xsave", "xsavec", 
"xsaveopt", "xsaves"]
  • The "cascadelake" have it:
print( tvm.target.codegen.llvm_x86_get_features("cascadelake") )
["cmov", "mmx", "popcnt", "sse", "sse2", "sse3", "ssse3", "sse4.1", "sse4.2", 
"avx", "avx2", "fma", "avx512f", "bmi", "bmi2", "aes", "pclmul",
 "avx512vl", "avx512bw", "avx512dq", "avx512cd", "avx512vnni",
 "adx", "clflushopt", "clwb", "cx16", "cx8", "crc32", "f16c", "fsgsbase", 
"fxsr", "invpcid", "lzcnt", "movbe", "pku", "prfchw", "rdrnd", "rdseed", 
"sahf", "x87", "xsave", "xsavec", "xsaveopt", "xsaves"]
  • The "knl" which have some avx512* but not the desired avx512bw:
print( tvm.target.codegen.llvm_x86_get_features("knl") )
["cmov", "mmx", "popcnt", "sse", "sse2", "sse3", "ssse3", "sse4.1", "sse4.2", 
"avx", "avx2", "fma", "avx512f", "bmi", "bmi2", "aes", "pclmul", 
"avx512cd", "avx512er", "avx512pf", "adx", "cx16", "cx8", "crc32", "f16c",
 "fsgsbase", "fxsr", "invpcid", "lzcnt", "movbe", "prefetchwt1", "prfchw", 
"rdrnd", "rdseed", "sahf", "x87", "xsave", "xsaveopt"]
  • The odd "alderlake" has avxvnni but neither avx512bw nor avx512vnni:
print( tvm.target.codegen.llvm_x86_get_features("alderlake") )
["cmov", "mmx", "popcnt", "sse", "sse2", "sse3", "ssse3", "sse4.1", "sse4.2", 
"fma", "bmi", "bmi2", "aes", "pclmul", "gfni", "vpclmulqdq", "adx", "cldemote", 
"clflushopt", "clwb", "cx16", "cx8", "crc32", "f16c", "fsgsbase", "fxsr", "invpcid", 
"widekl", "lzcnt", "movbe", "movdir64b", "movdiri", "pconfig", "pku", "prfchw", 
"ptwrite", "rdpid", "rdrnd", "rdseed", "sahf", "serialize", "sgx", "sha", "shstk", 
"vaes", "waitpkg", "x87", "xsave", "xsavec", "xsaveopt", "xsaves", "hreset", 
"avxvnni"]

BTW, "alderlake" here, is the only "strangeness" having avxvnni but not avx512vnni (intel's early one ?).

  • The "sapphirerapids" have avx512bw avx512vnni , and amx-int8 among with lot of new interesting things.
print( tvm.target.codegen.llvm_x86_get_features("sapphirerapids") )
["cmov", "mmx", "popcnt", "sse", "sse2", "sse3", "ssse3", "sse4.1", "sse4.2", "avx", "avx2", "fma", 
"avx512f", "bmi", "bmi2", "aes", "pclmul", "avx512vl", "avx512bw", "avx512dq", "avx512cd", 
"avx512vbmi", "avx512ifma", "avx512vpopcntdq", "avx512vbmi2", "gfni", "vpclmulqdq", 
"avx512vnni", "avx512bitalg", "avx512bf16", "adx", "amx-bf16", "amx-int8", "amx-tile", 
"cldemote", "clflushopt", "clwb", "cx16", "cx8", "crc32", "enqcmd", "f16c", "fsgsbase", 
"fxsr", "invpcid", "lzcnt", "movbe", "movdir64b", "movdiri", "pconfig", "pku", "prfchw", 
"ptwrite", "rdpid", "rdrnd", "rdseed", "sahf", "serialize", "sgx", "sha", "shstk", "tsxldtrk", 
"uintr", "vaes", "waitpkg", "wbnoinvd", "x87", "xsave", "xsavec", "xsaveopt", "xsaves", 
"avx512fp16", "avxvnni"]
  • The full archlist as of llvm=17:
print( tvm.target.codegen.llvm_x86_get_archlist() )
["i386", "i486", "winchip-c6", "winchip2", "c3", "i586", "pentium", "pentium-mmx", "pentiumpro", 
"i686", "pentium2", "pentium3", "pentium3m", "pentium-m", "c3-2", "yonah", "pentium4", "pentium4m", 
"prescott", "nocona", "core2", "penryn", "bonnell", "atom", "silvermont", "slm", "goldmont", 
"goldmont-plus", "tremont", "nehalem", "corei7", "westmere", "sandybridge", "corei7-avx", 
"ivybridge", "core-avx-i", "haswell", "core-avx2", "broadwell", "skylake", "skylake-avx512", 
"skx", "cascadelake", "cooperlake", "cannonlake", "icelake-client", "rocketlake", "icelake-server", 
"tigerlake", "sapphirerapids", "alderlake", "raptorlake", "meteorlake", "sierraforest", "grandridge", 
"graniterapids", "graniterapids-d", "emeraldrapids", "knl", "knm", "lakemont", "k6", "k6-2", "k6-3", 
"athlon", "athlon-tbird", "athlon-xp", "athlon-mp", "athlon-4", "k8", "athlon64", "athlon-fx", "opteron", 
"k8-sse3", "athlon64-sse3", "opteron-sse3", "amdfam10", "barcelona", "btver1", "btver2", "bdver1", 
"bdver2", "bdver3", "bdver4", "znver1", "znver2", "znver3", "znver4", 
"x86-64", "x86-64-v2", "x86-64-v3", "x86-64-v4", "geode"]

I put Cc @Qianshui-Jiang (author of #13642); he might help us with a second opinion on this too.

@cbalint13 cbalint13 requested a review from vvchernov September 11, 2023 12:20
@vvchernov
Contributor

Hello @cbalint13! Very hard work! One note I should make: TVM needs three avx512 intrinsics: vpmaddwd, vpmaddubsw and vpaddd. For the latter, "+" is used, but it is automatically replaced by the intrinsic by llvm, and it is in the avx512f set (I checked it here).

@cbalint13
Contributor Author

cbalint13 commented Sep 11, 2023

Hello @cbalint13! Very hard work! One note I should make: TVM needs three avx512 intrinsics: vpmaddwd, vpmaddubsw and vpaddd. For the latter, "+" is used, but it is automatically replaced by the intrinsic by llvm, and it is in the avx512f set (I checked it here).

@vvchernov ,

I see now, very good point!

Let's ask for both; I will change it with these comments:

// avx512f:  llvm.x86.avx512.addpd.w.512 (LLVM auto, added)
// avx512bw: llvm.x86.avx512.pmaddubs.w.512 (TVM required)
//           + llvm.x86.avx512.pmaddw.d.512
if target_has_features(["avx512f", "avx512bw"]):

Just a side note (curiosity) on this topic:

"F" would stand for foundation, probably avx512bw implies that CPU also have avx512f (but not vice-versa)
One example of partial avx512 is the "knl" having the avx512f (foundation) part but not having the avx512bw part.
Maybe one day someone can add separate _compute(), _update() for topi (more precise control) not letting LLVM itself.
I would be curious of outcome, in the first (i. case) LLVM would ommit addpd.w.512 by adding something suboptimal.

  • i. "llvm -mcpu=x86-64 -mattr=+avx512bw"
  • ii. "llvm -mcpu=x86-64 -mattr=+avx512bw,+avx512f"
    As i. versus ii., "x86-64" is the plainest configuration for llvm, allowing only explicit flags.

Anyway, let's go now with avx512bw && avx512f.
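The agreed gate (require both flags, logical AND) can be sketched like this. This is a self-contained mock, not the TVM `target_has_features` function; the feature sets are excerpts from the LLVM dumps earlier in this thread:

```python
# Feature-set excerpts from the llvm_x86_get_features dumps above.
FEATURES = {
    "cascadelake": {"avx2", "avx512f", "avx512bw", "avx512vnni"},
    "knl": {"avx2", "avx512f", "avx512cd", "avx512er", "avx512pf"},
}

def target_has_features(feats, mcpu):
    """True only if *every* requested feature is present (logical AND)."""
    if isinstance(feats, str):
        feats = [feats]
    return all(f in FEATURES.get(mcpu, set()) for f in feats)

# cascadelake carries both flags; knl has only the foundation part,
# so the AND gate correctly rejects it.
print(target_has_features(["avx512f", "avx512bw"], "cascadelake"))  # True
print(target_has_features(["avx512f", "avx512bw"], "knl"))          # False
```

Requiring the conjunction is exactly what protects against partial-avx512 arches like "knl" and the odd "x86-64-v4" case discussed below.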

@vvchernov
Contributor

Hello @elvin-n! Maybe a second opinion from you to double-check?

@cbalint13
Contributor Author

cbalint13 commented Sep 11, 2023

"F" would stand for foundation, probably avx512bw implies that CPU also have avx512f (but not vice-versa)
One example of partial avx512 is the "knl" having the avx512f (foundation) part but not having the avx512bw part.

@vvchernov ,

One more informal experiment:

  • I made the experiment below, and the results disprove that avx512bw implies avx512f (aka the "foundation" part).
  • The x86-64-v4 popping up is kind of "arch-generic", so let's enforce avx512bw && avx512f presence.
  • For this ODD case, something like llvm -mcpu=x86-64-v4 -mattr=+avx512f passes the test below (checked).
$ cat ./tvm-check-avx512bw.py 
#!/usr/bin/python3

import tvm
from tvm.target import codegen
from tvm.target.x86 import target_has_features

for mcpu in codegen.llvm_x86_get_archlist():
  with tvm.target.Target("llvm -mcpu=%s" % mcpu):
    if target_has_features("avx512bw"):
      has_avx512f = target_has_features("avx512f")
      print("ARCH [%s] having `avx512bw` has avx512f=[%i]" % (mcpu, has_avx512f))
  • With LLVM=17:
$ ./tvm-check-avx512bw.py
ARCH [skylake-avx512] having `avx512bw` has avx512f=[1]
ARCH [skx] having `avx512bw` has avx512f=[1]
ARCH [cascadelake] having `avx512bw` has avx512f=[1]
ARCH [cooperlake] having `avx512bw` has avx512f=[1]
ARCH [cannonlake] having `avx512bw` has avx512f=[1]
ARCH [icelake-client] having `avx512bw` has avx512f=[1]
ARCH [rocketlake] having `avx512bw` has avx512f=[1]
ARCH [icelake-server] having `avx512bw` has avx512f=[1]
ARCH [tigerlake] having `avx512bw` has avx512f=[1]
ARCH [sapphirerapids] having `avx512bw` has avx512f=[1]
ARCH [graniterapids] having `avx512bw` has avx512f=[1]
ARCH [graniterapids-d] having `avx512bw` has avx512f=[1]
ARCH [emeraldrapids] having `avx512bw` has avx512f=[1]
ARCH [znver4] having `avx512bw` has avx512f=[1]
ARCH [x86-64-v4] having `avx512bw` has avx512f=[0] <--- ODD !!! (but pass with -mattr=+avx512f).
  • With LLVM=10:
$ ./tvm-check-avx512bw.py 
ARCH [cannonlake] having `avx512bw` has avx512f=[1]
ARCH [cascadelake] having `avx512bw` has avx512f=[1]
ARCH [cooperlake] having `avx512bw` has avx512f=[1]
ARCH [icelake-client] having `avx512bw` has avx512f=[1]
ARCH [icelake-server] having `avx512bw` has avx512f=[1]
ARCH [skx] having `avx512bw` has avx512f=[1]
ARCH [skylake-avx512] having `avx512bw` has avx512f=[1]
ARCH [tigerlake] having `avx512bw` has avx512f=[1]

So once again, let's enforce avx512bw && avx512f.

Contributor

@vvchernov vvchernov left a comment


LGTM

@Qianshui-Jiang
Contributor

Qianshui-Jiang commented Sep 13, 2023

@cbalint13 @vvchernov big thanks for your hard work and discussion! Here are a few comments.
Actually, before CascadeLake we used avx512 to handle 8-bit and 16-bit integers, so pmaddubs used here is for SkyLake.

And starting from CascadeLake we have avx512vnni, so there are more instructions like vpdpbusd and vpdpwssd; they fuse some of the pmadd instructions we used before.

Now, moving to SapphireRapids, we have amx-vnni, which uses amx instructions to handle 8-bit integers.

Yes, you're right, avxvnni in AlderLake is due to the lack of the avx512 instruction set in the 12th-gen client CPUs.

I support letting TVM use feature names to decide the schedule method; it seems much clearer than using arch names.

@cbalint13
Contributor Author

@Qianshui-Jiang ,

Thank you very much for your time & clarifications!

@cbalint13
Contributor Author

cbalint13 commented Sep 14, 2023

To sum it up:

  • compliant across all versions; the llvm<=10 issue is solved (Cc @junrushao); tested versions: llvm={10,14,16,17}
  • the implementation here exposes a single target_has_features(XXX) replacing the old target_has_XXX() functions
  • it also exposes the friendly llvm_x86_get_archlist() and llvm_x86_get_features(arch) to obtain full arch-related info

There is a comprehensive unit test that also checks against the old behaviour (by incorporating the old static lookup table).
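In the spirit of that cross-check, a toy version might look like this (the legacy arch list and feature table are truncated, hand-copied excerpts for illustration, not the real test data):

```python
# Excerpt of an old static avx512 arch list (hand-maintained).
LEGACY_AVX512_ARCHES = {"skylake-avx512", "cascadelake"}

# Excerpt of LLVM-derived feature sets for the same arches.
FEATURES = {
    "skylake": {"avx2", "fma"},
    "skylake-avx512": {"avx2", "avx512f", "avx512bw"},
    "cascadelake": {"avx2", "avx512f", "avx512bw", "avx512vnni"},
}

# The new feature-based check must reproduce the old verdict on every arch.
for arch, feats in FEATURES.items():
    legacy = arch in LEGACY_AVX512_ARCHES
    new = {"avx512f", "avx512bw"} <= feats
    assert legacy == new, arch
print("old and new checks agree")
```

Running the same comparison over the full archlist is what gives confidence that dropping the static tables changes no behaviour.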

This is the final state, passing CI, pending any explicit review requests.

@junrushao
Member

I am very excited about this feature and cannot wait to try it out myself! Thank you @cbalint13 for this super well-documented and well-tested PR, and it's going to be super useful for downstream applications!

@junrushao junrushao merged commit 67df20f into apache:main Sep 14, 2023
@cbalint13
Contributor Author

I am very excited about this feature and cannot wait to try it out myself! Thank you @cbalint13 for this super well-documented and well-tested PR, and it's going to be super useful for downstream applications!

It was just a simple idea for a utility, crediting the work already done before by @vvchernov, @Qianshui-Jiang and @elvin-n (the original lookup).

Thanks folks !
