
Update forked repository to latest #1

Merged: 1,109 commits merged on Jan 8, 2025
Conversation


@joncamp joncamp commented Jan 8, 2025

No description provided.

snnn and others added 30 commits October 30, 2024 19:25
…ly correct (#22624)

As the title suggests, recompilation is done if a mismatch is detected.
Changed the logs to reflect that behavior.
### Description
Consolidate the GPU data transfer in the CUDA, ROCm and MIGraphX EPs.
(1) Remove some redundant stream synchronization on the default stream, per
the cudaMemcpy spec.
(2) Consolidate CUDA, ROCm and MIGraphX to use the same logic where possible.

### Motivation
This is a follow-up to the review of
#22589.

### Context


https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior
##### cudaMemcpy()
* For transfers from pageable host memory to device memory, a stream
sync is performed before the copy is initiated. The function will return
once the pageable buffer has been copied to the staging memory for DMA
transfer to device memory, **but the DMA to final destination may not
have completed**.
* For transfers from pinned host memory to device memory, the function
is synchronous with respect to the host.
* For transfers from device to either pageable or pinned host memory,
the function returns only once the copy has completed.
* For transfers from device memory to device memory, **no host-side
synchronization is performed**.
* For transfers from any host memory to any host memory, the function is
fully synchronous with respect to the host.

#### cudaMemcpyAsync

* For transfers between device memory and pageable host memory, the
function might be synchronous with respect to host.
* For transfers from any host memory to any host memory, the function is
fully synchronous with respect to the host.
* If pageable memory must first be staged to pinned memory, the driver
may synchronize with the stream and stage the copy into pinned memory.
* For all other transfers, the function should be fully asynchronous.


https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/group___memory.html

##### hipMemcpyAsync()

If host or dest are not pinned, the memory copy will be performed
synchronously. For best performance, use hipHostMalloc to allocate host
memory that is transferred asynchronously.
on HCC hipMemcpyAsync does not support overlapped H2D and D2H copies.
For hipMemcpy, the copy is always performed by the device associated
with the specified stream.

##### hipMemcpy()
For hipMemcpy, the copy is always performed by the current device (set
by hipSetDevice).

https://github.com/ROCm/ROCm/blob/roc-5.7.x/tools/autotag/templates/rocm_changes/5.6.1.md

ROCm 5.6.1 release note: hipMemcpy device-to-device (intra device) is
now asynchronous with respect to the host

### Description
The CastNonStringTester test in CastOpTest was failing due to bitwise
mismatches when casting other types to bool. This was caused by bool
being represented as uint8 in DML. Added a clipping step in
DmlOperatorCast to ensure correct bitwise matching after casting to bool.
ref: https://dev.azure.com/microsoft/OS/_workitems/edit/44572678
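A minimal numpy sketch of the mismatch being fixed (illustrative values only; the real fix lives in DmlOperatorCast):

```python
import numpy as np

# Illustrative sketch: when bool is backed by uint8, a raw cast can leave
# values other than 0/1 in the output, so a bitwise comparison fails.
x = np.array([0.0, 1.0, 2.0, 255.0], dtype=np.float32)

raw_uint8 = x.astype(np.uint8)        # 2.0 stays 2, not a canonical bool bit pattern
clipped = np.clip(raw_uint8, 0, 1)    # the added clipping step forces values to 0/1

reference = x.astype(np.bool_).astype(np.uint8)  # CPU reference: any nonzero -> 1
assert np.array_equal(clipped, reference)
```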


### Motivation and Context
Since opset 18, the 'scales' and 'sizes' constant inputs can be 2D tensors.
Transposing 2D tensors is not supported by the current implementation, so
fix it by only allowing 4D constant inputs.
### Description
We now need to build CUDA and DML in one package, but the CUDA EP and DML
EP can't run in one process; doing so throws the exception `the GPU device
instance has been suspended`.
So the issue is that the CUDA EP and DML EP coexist at compile time but
can't coexist at run time.

This PR splits the CUDA EP tests and DML EP tests in all unit tests.
The solution is to use two environment variables, NO_CUDA_TEST and
NO_DML_TEST, in CI.

For example, if NO_CUDA_TEST is set, DefaultCudaExecutionProvider
will be nullptr, and the test will not run with the CUDA EP.
While debugging, the CUDAExecutionProvider will not be called.
I think that as long as CUDA functions, like cudaSetDevice, are not called,
the DML EP tests can pass.

Disabled the Java test testDIrectML because it doesn't work now even
without the CUDA EP.
### Description
1. Add concurrency setting to codeql workflow
2. Modify lint workflow's PATH setting.


### Motivation and Context
To save machine resources.
…info" (#22669)

Reverts #22556 since it causes incorrect fallback.
- cast
- argmax
- gelu
- cast
- LayerNorm
- GroupNorm
- InstanceNorm

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
BUG #22031

In the demucs model, there are lots of MatMul ops with shapes like
below:
`input[0]: [3448,1,512] | float32, input[1]: [512,1536] | float32,
output[0]: [3448,1,1536] | float32`

We can see that for this kind of shape, the batch size is a big value,
but M = 1. Our current algorithm partitions tiles based on [M, N],
which is not efficient for such shapes. This PR reshapes the
inputs to improve the MatMul performance.
Before:  [3448,1,512] x [512,1536] =  [3448,1,1536]
After: [1, 3448, 512] x [512, 1536] = [1, 3448, 1536] , then the output
can be reshaped to [3448, 1, 1536]

The overall MatMul time in demucs model becomes 1778.45 ms from 4418.17
ms on my iGPUs.
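A small numpy sketch of the reshape equivalence this relies on (shapes taken from the demucs example above):

```python
import numpy as np

a = np.random.rand(3448, 1, 512).astype(np.float32)
b = np.random.rand(512, 1536).astype(np.float32)

# Original batched MatMul: batch = 3448, M = 1 -> poor tile utilization.
out_batched = np.matmul(a, b)                                   # [3448, 1, 1536]

# Reshaped MatMul: batch = 1, M = 3448 -> tiles partitioned over a large M.
out_reshaped = np.matmul(a.reshape(1, 3448, 512), b).reshape(3448, 1, 1536)

assert np.allclose(out_batched, out_reshaped)
```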

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
…e can be easily used (#22345)

### Description
The local build of the native library was being included by almost every
project, but is only needed to run tests. Due to the multiple inclusions,
attempting to use a pre-built package clashed with any local builds
that were available.

Create a helper file to include either a local build or a pre-built
package, and include that in the two test projects.

Clean up various miscellaneous things.

### Motivation and Context

Create setup to simplify running on-device tests with the nuget
packages.
Use suggest-changes@v2
(parkerbxyz/suggest-changes#36 (comment))
to post suggested changes as comments instead of requested changes to
streamline the review process.

- Also updated the script to `set +e` to ignore the exit code only for the
linter run, so that if there are errors in dependency installation we can
still get signals.
These logs are not quite useful and create a lot of noise during
debugging, especially when working with large models.
Allow writing security events to post lint messages on PRs.
### Description
Partial answer to issue #19997. The example succeeds after this change.
…e ETW registration can fail (#22699)

### Description
Make ETW provider registration non-fatal and do not throw an exception.

This needs to work both in builds with exceptions enabled and with
--disable_exceptions.

### Motivation and Context
ORT should not crash.

Addresses #22475. Privately tested by the filer of that issue.
### Description

Fixes:
(1) CPU kernel: apply the scale before the bias and mask, like other MHA ops.
(2) CPU kernel: correct the offset when appending past to present.
(3) CUDA kernel: apply the mask if provided; fix the output_qk offset.

Add DMMHA unit tests.
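A hedged numpy sketch of fix (1), applying the scale before adding the bias and mask (names and shapes here are illustrative, not the actual kernel code):

```python
import numpy as np

# Score computation order per fix (1): scale Q*K^T first, then add bias and mask.
def attention_scores(q, k, scale, attn_bias=None, mask=None):
    scores = (q @ k.transpose(0, 1, 3, 2)) * scale   # scale first
    if attn_bias is not None:
        scores = scores + attn_bias                   # then the attention bias
    if mask is not None:
        scores = scores + mask                        # then the (additive) mask
    return scores
```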
…ary (#22695)

### Description

Add an I/O binding example using the ONNX data type to the Python API summary.
The API has been available since the 1.20 release.

### Motivation and Context

Follow-up of #22306 to add
some documentation.
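For reference, a minimal I/O binding sketch with the Python API (hypothetical model.onnx with input "X" and output "Y"; the actual example added to the API summary may differ):

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

x = np.random.rand(1, 3, 224, 224).astype(np.float32)
x_ortvalue = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)   # copy the input to GPU once

binding = sess.io_binding()
binding.bind_ortvalue_input("X", x_ortvalue)
binding.bind_output("Y", "cuda", 0)                           # keep the output on GPU

sess.run_with_iobinding(binding)
y = binding.get_outputs()[0].numpy()                          # copy back only when needed
```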
…sources that are CPU accessible. (#22680)

Adjust max chunk size to fix error limit check from DX12 for large
resources that are CPU accessible.

### Description
The current Agility SDK restricts CPU-visible buffers to 0xFFFF0000 bytes,
slightly smaller than 4 GiB. Verified the restriction is still present in the
latest Agility SDK 1.614.1.


### Motivation and Context
Allocations of resources 4 GiB or larger fail in the DX12 verification layer.

---------

Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>
This PR supports Sign and CumSum operators for WebNN EP. @Honry @fdwr
PTAL, thanks.
ORT will optimize identical scalar initializers into one; we should not skip
registering such a scalar as a WebNN Constant.
### Description
Set SDL's git submodule option to false.

### Motivation and Context
* The previous job's SDL logs contain a 'git submodule sync' command, which
synchronizes all submodules.

* After setting SDL's git submodules option to false, the logs no longer
contain the 'git submodule sync' command.
- Pass inputs to WebNN directly; WebNN will handle the broadcasting.
- If `zero_point` is not provided, make a WebNN Constant with 0 values
and the same shape as the `scale` input.
WebNN doesn't provide a dedicated op for SimplifiedLayerNormalization, so use
a couple of WebNN ops to emulate it in the WebNN EP:

X --> Pow --> ReduceMean --> Add --> Sqrt --> Div --> Mul
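A numpy sketch of the decomposition above (RMSNorm-style normalization over the last axis; the epsilon and scale handling are assumptions for illustration):

```python
import numpy as np

def simplified_layer_norm(x, scale, epsilon=1e-5):
    pow_out = np.power(x, 2)                          # Pow
    mean = np.mean(pow_out, axis=-1, keepdims=True)   # ReduceMean
    add = mean + epsilon                              # Add
    sqrt = np.sqrt(add)                               # Sqrt
    div = x / sqrt                                    # Div
    return div * scale                                # Mul

x = np.random.rand(2, 4, 8).astype(np.float32)
scale = np.ones(8, dtype=np.float32)
y = simplified_layer_norm(x, scale)
```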
### Description

This change enhances the Node.js binding with the following features:
- support WebGPU EP
- lazy initialization of `OrtEnv`
- being able to initialize ORT with default log level setting from
`ort.env.logLevel`.
- session options:
  - `enableProfiling` and `profileFilePrefix`: support profiling.
  - `externalData`: explicit external data (optional in Node.js binding)
  - `optimizedModelFilePath`: allow dumping the optimized model for diagnosis
purposes
  - `preferredOutputLocation`: support IO binding.

======================================================
`Tensor.download()` is not implemented in this PR.
Build pipeline update is not included in this PR.
BUG #22031

The total Gemm time in the demucs model drops from over 1000 ms to 181.14 ms
on my iGPUs.

### Description
Refactor the CMake code related to delay loading. Provide a
CMake option to control whether delay loading should be enabled.
The option is disabled when Python is enabled, due to a known issue.

### Motivation and Context
ONNX Runtime's Python package depends on DirectML.dll, but supposedly
the DLL should be delay loaded.
This PR only refactors the code. It doesn't change the behavior.
### Description
The nuget-zip-java packaging pipeline has been failing for 4 days, since
it was introduced in #22591.
Honry and others added 28 commits December 24, 2024 12:44
The algorithm of `SkipSimplifiedLayerNormalization` is quite similar to
`SimplifiedLayerNormalization`; the only difference is that
`SkipSimplifiedLayerNormalization` provides an additional output holding
the sum of the input, skip and bias (if it exists).

Also, fix a bug in `SimplifiedLayerNormalization`: add the bias if it
exists.
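A short numpy sketch of the Skip variant described above, assuming normalization over the last axis: the skip and bias are added to the input first, and that sum is both normalized and returned as the extra output:

```python
import numpy as np

def skip_simplified_layer_norm(x, skip, scale, bias=None, epsilon=1e-5):
    input_skip_bias_sum = x + skip + (bias if bias is not None else 0)
    mean_square = np.mean(np.square(input_skip_bias_sum), axis=-1, keepdims=True)
    normalized = input_skip_bias_sum / np.sqrt(mean_square + epsilon) * scale
    return normalized, input_skip_bias_sum  # second output: sum of input, skip and bias
```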
### Description
Refactor compute plan profiling.

Support caching the CoreML model to speed up session initialization. This is
only supported via a user-provided cache path, and the user is responsible for
managing the cache.


With the cache, session initialization time can be reduced by 50% or
more:
|model| before| after|
|--|--|--|
|yolo11.onnx| 0.6s|0.1s|
|yolo11-fp16.onnx|1.8s|0.1s|



---------

Co-authored-by: wejoncy <wejoncy@.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
### Description
Enable the delay-loading hook for Python packages.
The SAL2 macros are not always available there.

### Description

Make SAL2 macros only available on MSVC.

### Motivation and Context

#1175
Remove PostBuildCleanup tasks since they are deprecated. This addresses a
warning in our pipelines:

"Task 'Post Build Cleanup' version 3 (PostBuildCleanup@3) is dependent
on a Node version (6) that is end-of-life. Contact the extension owner
for an updated version of the task. Task maintainers should review Node
upgrade guidance: https://aka.ms/node-runner-guidance"

Now the cleanup is controlled in another place:

https://learn.microsoft.com/en-us/azure/devops/pipelines/yaml-schema/workspace?view=azure-pipelines


The code change was generated by the following Linux command:
```bash
find . -name \*.yml -exec sed -i '/PostBuildCleanup/,+2d' {} \;
```
### Description
Make arrays with cubin data const.


### Motivation and Context
Non-const arrays are put into the .data section which might cause
excessive memory usage in some scenarios. Making cubin arrays const
allows them to be put into the .rodata section.
### Description
For legacy Jetson users on JetPack 5.x, the latest TRT version is 8.5.
Add version checks to newer TRT features to fix the build on JetPack 5.x
(CUDA 11.8 + GCC 11 are required).


### Description
Changed all supported tensor types from IR version 9 to IR version 10.


### Motivation and Context
- See issue #23205

Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
### Description
The Web CI pipeline uses three different Windows machine pools:
1. onnxruntime-Win2022-webgpu-A10
2. onnxruntime-Win2022-VS2022-webgpu-A10
3. onnxruntime-Win-CPU-2022-web

This PR merges them together to reduce ongoing maintenance cost.
### Description

Use `https.get` instead of `fetch` in ORT Nodejs binding package install
script.

### Motivation and Context

According to discussions in #23232, the package `global-agent` cannot
work with `fetch` API. To make it work with the proxy agent, this PR
replaces the `fetch` API with `https.get` in the install script.
### Description
This PR makes it convenient to post-process the generated JSON file when
profiling is enabled. The kernel type can be used to aggregate the overall
time of kernels of the same type.
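As an illustration, a hedged sketch of the kind of post-processing this enables; the event layout follows the ORT profiling JSON ("cat", "dur", "args"), but the exact key holding the kernel type is an assumption here:

```python
import json
from collections import defaultdict

def aggregate_kernel_time(profile_path, type_key="kernel_type"):
    # type_key is assumed to live under event["args"]; adjust to the actual field name.
    with open(profile_path) as f:
        events = json.load(f)
    totals = defaultdict(int)
    for event in events:
        if event.get("cat") == "Kernel":
            kernel_type = event.get("args", {}).get(type_key, "unknown")
            totals[kernel_type] += event.get("dur", 0)  # duration in microseconds
    return dict(totals)

# print(aggregate_kernel_time("onnxruntime_profile.json"))
```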
Move Linux GPU CI pipeline to A10 machines which are more advanced.
Retire onnxruntime-Linux-GPU-T4 machine pool.
Disable run_lean_attention test because the new machines do not have
enough shared memory.

```
skip loading trt attention kernel fmha_mhca_fp16_128_256_sm86_kernel because no enough shared memory
[E:onnxruntime:, sequential_executor.cc:505 ExecuteKernel] Non-zero status code returned while running MultiHeadAttention node. Name:'MultiHeadAttention_0' Status Message: CUDA error cudaErrorInvalidValue:invalid argument
```
…#23232)

### Description
Add proxy agent to fetch request



### Motivation and Context
Fixes #23231

---------

Signed-off-by: Junze Wu <junze.wu@intel.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description

Update `mocha` to v11.0.1 and `fs-extra` to v11.2.0

```
# npm audit report

nanoid  <3.3.8
Severity: moderate
Predictable results in nanoid generation when given non-integer values - GHSA-mwcw-c2x4-8c55
fix available via `npm audit fix`
node_modules/nanoid
  mocha  8.2.0 - 10.2.0
  Depends on vulnerable versions of nanoid
  node_modules/mocha

2 moderate severity vulnerabilities
```
### Description
1. Currently the Python-Cuda-Publishing-Pipeline only publishes Linux
wheels, not Windows wheels. This is because we recently refactored the
upstream pipeline ("Python-CUDA-Packaging-Pipeline") to use 1ES PT. This
PR fixes the issue.
2. tools/ci_build/github/azure-pipelines/stages/py-win-gpu-stage.yml no
longer includes component-governance-component-detection-steps.yml ,
because 1ES PT already inserted such a thing
3. Delete tools/ci_build/github/windows/eager/requirements.txt because
it is no longer used.

### Motivation and Context
The "Python-CUDA-Packaging-Pipeline" is for CUDA 12.
"Python CUDA ALT Packaging Pipeline" is for CUDA 11.

The two pipelines are very similar, except the CUDA versions are
different.
Each of them has three parts: build, test, publish.
"Python-CUDA-Packaging-Pipeline" is the first part: build.
"Python CUDA12 Package Test Pipeline" is the second part.
"Python-Cuda-Publishing-Pipeline" is the third part that publishes the
packages to an internal ADO feed.
### Description
Separate the result processor out from profiler.py without changing the
behavior of the current profile.py.



### Motivation and Context
Less dependency and smaller code for processing profile from other
scenarios.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
The skip and bias (if it exists) should be added to the input first.
### Description
This PR 1) uses the override shape instead of the tensor's original shape in
the shader key to reduce some shader variants, and 2) adds the indices shape
rank to the shader key to avoid some potential errors.
### Description
Fusing Pad & AveragePool requires AveragePool to use
`count_include_pad=1`. If the AveragePool already set some padding and
`count_include_pad=0`, fusion can't happen.

This PR adds a condition to perform fusion depending on those
attributes. If fusion occurs, `count_include_pad` is always set to `1`.
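A simplified pseudocode sketch of the eligibility check described above (not the actual optimizer code):

```python
def can_fuse_pad_into_average_pool(avgpool_pads, count_include_pad):
    # If the AveragePool has its own padding and excludes pads from the average
    # (count_include_pad == 0), folding the Pad in would change the result.
    has_own_padding = any(p != 0 for p in avgpool_pads)
    if has_own_padding and count_include_pad == 0:
        return False
    return True  # after fusion, count_include_pad is always set to 1
```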

### Motivation and Context
Fix #22177 (mislabelled as a performance issue but there's an actual bug
in the implementation)
Bug introduced in #21556
Mitigates #23183 while we
investigate the final solution.
### Description
Fix comparison of narrow type with wide type in loop condition.

### Motivation and Context
Comparison between types of different widths in a loop condition can
cause the loop to fail to terminate.
Some quantized models have QDQ around Conv/Gemm, but the weight and/or
bias are not quantized. This PR adds a WeightBiasQuantization optimizer to
quantize the float weight and/or bias to INT8 and INT32 tensors,
respectively. We only do this for weight and/or bias initializers so that
ConstantFolding will fold the sub-graph into real quantized initializers
during the next round of graph optimization.
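A rough numpy sketch of the idea (symmetric per-tensor scales assumed purely for illustration; the real optimizer inserts Q/DQ nodes that ConstantFolding then folds into quantized initializers):

```python
import numpy as np

def quantize_weight_and_bias(weight, bias, input_scale):
    # Weight -> INT8 with a symmetric per-tensor scale.
    weight_scale = np.abs(weight).max() / 127.0
    weight_int8 = np.clip(np.round(weight / weight_scale), -128, 127).astype(np.int8)
    # Bias -> INT32, using the product of the input and weight scales.
    bias_scale = input_scale * weight_scale
    bias_int32 = np.round(bias / bias_scale).astype(np.int32)
    return weight_int8, weight_scale, bias_int32, bias_scale
```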
ONNX's MatMul is the same as numpy.matmul, which supports input tensors with
rank >= 1, but QNN's MatMul can only support input tensors with rank >= 2.
This PR adds a MatMulOpBuilder for the QNN EP to build a QNN graph that
supports all possible cases of ONNX's MatMul, adding Reshape nodes where
necessary, e.g., reshaping a 1D input to 2D if one exists, and reshaping the
output to the expected shape at the end.

This PR also tries to use the FullyConnected op for MatMul if the 2nd input is
a 2D initializer or a 1D tensor, because FullyConnected is faster than MatMul
on the QNN EP. If the 2nd input is a 2D tensor, we require it to be an
initializer because FullyConnected requires the 2nd input in [n, k] shape; we
can transpose it while building the graph if it's an initializer (we don't
want to add an extra Transpose node).

As an example, take the swin_base model, which contains several MatMul nodes
whose 2nd input is a 2D initializer (not followed by Add). Running on a Gen3
mobile device, it takes 34.8876 ms before the change and 27.0639 ms after.
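A numpy sketch of the rank handling described above (the FullyConnected path and QNN specifics are omitted; this only illustrates the Reshape-around-MatMul idea):

```python
import numpy as np

def matmul_with_reshapes(a, b):
    # Reshape rank-1 inputs to 2D, run a 2D MatMul, then reshape back
    # following ONNX/numpy MatMul rules for 1D operands.
    a2 = a.reshape(1, -1) if a.ndim == 1 else a
    b2 = b.reshape(-1, 1) if b.ndim == 1 else b
    out = a2 @ b2
    if a.ndim == 1:
        out = out.reshape(out.shape[-1:])  # drop the prepended dimension
    if b.ndim == 1:
        out = out.reshape(out.shape[:-1])  # drop the appended dimension
    return out

assert np.allclose(matmul_with_reshapes(np.ones(4), np.ones((4, 3))),
                   np.ones(4) @ np.ones((4, 3)))
```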
### Description
Add a temporary path to RN 0.69.3 to update the Boost URL.


### Motivation and Context
Fix the React Native CI until we update RN to 0.70.15 or 0.73.3+.
### Description

Changes vcpkg manifest and configuration file (vcpkg.json &
vcpkg-configuration.json)

* Update vcpkg version to
https://github.com/microsoft/vcpkg/releases/tag/2024.12.16
* Use protobuf 3.21.12(= `v21.12`) to sync with
[cmake/deps.txt](https://github.com/microsoft/onnxruntime/blob/main/cmake/deps.txt)
  * Resolve #22750
* Add `onnx` to vcpkg manifest so `find_package(ONNX)` and
`find_dependency(Protobuf)` can work as expected.
  * Currently, it uses 1.16.2
* v1.17.0 will become available after
microsoft/vcpkg#42942

However, `onnx` in vcpkg doesn't configure the
`ONNX_DISABLE_STATIC_REGISTRATION` build option.

* microsoft/vcpkg#38879
* Create "cmake/vcpkg-triplets/" folder and triplet files which use
`VCPKG_CMAKE_CONFIGURE_OPTIONS` for the option
* This requires the `VCPKG_OVERLAY_TRIPLETS` environment variable for CI
steps, which is a bit inconvenient.
     I will try to find a simpler way to get the same result.

### Motivation and Context

* Help #23158 
  * "ONNX is not consumed from vcpkg"
* "Mismatch protobuf version. When vcpkg is enabled , we should not
fetch protoc from Github which may cause version mismatches."
* microsoft/vcpkg#43126
* #21348
@joncamp joncamp merged commit d578127 into Cephable:main Jan 8, 2025
joncamp pushed a commit that referenced this pull request Jan 8, 2025
### Description
Add [Lean Attention](https://arxiv.org/abs/2405.10480) and its
integration with the MultiHeadAttention operator for LLMs on GPU.

LeanAttention speeds up self-attention for the token-generation phase
(decode-phase) of decoder-only transformer models, especially on long
context lengths.

- [x] Initial implementation of Lean Attention (by Srikant Bharadwaj)
- [x] Integration with MultiHeadAttention operator
- [x] Add parity tests
- [x] Add benchmark

#### Implementation Details

(1) Lean Attention is enabled in the build for Linux, and disabled for
Windows.
(2) Lean Attention is disabled by default. Enable it through the CUDA
provider option sdpa_kernel, or use the environment variable
`ORT_ENABLE_LEAN_ATTENTION=1`.
(3) It only works for token generation (sequence_length == 1,
past_sequence_length > 0).
(4) Like flash attention, it only works on Ampere or newer GPUs.

We can revisit #1 and #2 after comparing with
DecoderMaskedMultiHeadAttention and XQA kernels.
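A hedged sketch of enabling it from Python, based on the description above; the `sdpa_kernel` value shown is a placeholder, not a documented constant:

```python
import os
import onnxruntime as ort

# Option 1: environment variable, taken from the PR text.
os.environ["ORT_ENABLE_LEAN_ATTENTION"] = "1"

# Option 2: CUDA EP provider option; the value 0 is a placeholder,
# consult the CUDA EP documentation for the actual kernel selector value.
provider_options = {"sdpa_kernel": 0}
sess = ort.InferenceSession(
    "decoder_model.onnx",  # hypothetical decoder-only model using MultiHeadAttention
    providers=[("CUDAExecutionProvider", provider_options)],
)
```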

#### Benchmark

```
cd onnxruntime/test/python/transformers 
/bin/bash benchmark_mha.sh lean
```

Example outputs in H100:

Note that past and present do not share a buffer for MHA for now, so we
see low TFLOPS. The relative ratios will change after buffer sharing
is enabled, but we expect that the ordering (kernel A is faster than B)
will remain the same.

Note that common settings `sequence_length=1;
causal=True;attn_bias=None;cuda_graph=False` are not shown in the below
table.

batch_size | past_sequence_length | num_heads | head_size | average_latency | tflops | kernel
-- | -- | -- | -- | -- | -- | --
1 | 512 | 16 | 64 | 0.000059 | 0.0178 | ort:flash
1 | 512 | 16 | 64 | 0.000068 | 0.0155 | ort:efficient
1 | 512 | 16 | 64 | 0.000065 | 0.0161 | ort:math
1 | 512 | 16 | 64 | 0.000060 | 0.0176 | ort:lean
1 | 512 | 32 | 128 | 0.000062 | 0.0674 | ort:flash
1 | 512 | 32 | 128 | 0.000064 | 0.0661 | ort:efficient
1 | 512 | 32 | 128 | 0.000067 | 0.0625 | ort:math
1 | 512 | 32 | 128 | 0.000062 | 0.0678 | ort:lean
1 | 1024 | 16 | 64 | 0.000061 | 0.0345 | ort:flash
1 | 1024 | 16 | 64 | 0.000086 | 0.0244 | ort:efficient
1 | 1024 | 16 | 64 | 0.000065 | 0.0322 | ort:math
1 | 1024 | 16 | 64 | 0.000063 | 0.0332 | ort:lean
1 | 1024 | 32 | 128 | 0.000075 | 0.1125 | ort:flash
1 | 1024 | 32 | 128 | 0.000088 | 0.0951 | ort:efficient
1 | 1024 | 32 | 128 | 0.000079 | 0.1068 | ort:math
1 | 1024 | 32 | 128 | 0.000072 | 0.1171 | ort:lean
1 | 2048 | 16 | 64 | 0.000069 | 0.0606 | ort:flash
1 | 2048 | 16 | 64 | 0.000125 | 0.0336 | ort:efficient
1 | 2048 | 16 | 64 | 0.000064 | 0.0655 | ort:lean
1 | 2048 | 32 | 128 | 0.000098 | 0.1720 | ort:flash
1 | 2048 | 32 | 128 | 0.000132 | 0.1270 | ort:efficient
1 | 2048 | 32 | 128 | 0.000092 | 0.1828 | ort:lean
1 | 4096 | 16 | 64 | 0.000076 | 0.1097 | ort:flash
1 | 4096 | 16 | 64 | 0.000207 | 0.0406 | ort:efficient
1 | 4096 | 16 | 64 | 0.000069 | 0.1209 | ort:lean
1 | 4096 | 32 | 128 | 0.000140 | 0.2394 | ort:flash
1 | 4096 | 32 | 128 | 0.000213 | 0.1575 | ort:efficient
1 | 4096 | 32 | 128 | 0.000139 | 0.2419 | ort:lean
1 | 8192 | 16 | 64 | 0.000104 | 0.1609 | ort:flash
1 | 8192 | 16 | 64 | 0.000392 | 0.0428 | ort:efficient
1 | 8192 | 16 | 64 | 0.000093 | 0.1809 | ort:lean
1 | 8192 | 32 | 128 | 0.000212 | 0.3160 | ort:flash
1 | 8192 | 32 | 128 | 0.000360 | 0.1866 | ort:efficient
1 | 8192 | 32 | 128 | 0.000212 | 0.3162 | ort:lean
1 | 16384 | 16 | 64 | 0.000139 | 0.2410 | ort:flash
1 | 16384 | 16 | 64 | 0.000731 | 0.0459 | ort:efficient
1 | 16384 | 16 | 64 | 0.000136 | 0.2465 | ort:lean
1 | 16384 | 32 | 128 | 0.000361 | 0.3722 | ort:flash
1 | 16384 | 32 | 128 | 0.000667 | 0.2014 | ort:efficient
1 | 16384 | 32 | 128 | 0.000357 | 0.3765 | ort:lean
1 | 32768 | 16 | 64 | 0.000210 | 0.3194 | ort:flash
1 | 32768 | 16 | 64 | 0.001428 | 0.0470 | ort:efficient
1 | 32768 | 16 | 64 | 0.000209 | 0.3211 | ort:lean
1 | 32768 | 32 | 128 | 0.000659 | 0.4074 | ort:flash
1 | 32768 | 32 | 128 | 0.001270 | 0.2114 | ort:efficient
1 | 32768 | 32 | 128 | 0.000651 | 0.4123 | ort:lean
1 | 65536 | 16 | 64 | 0.000355 | 0.3785 | ort:flash
1 | 65536 | 16 | 64 | 0.002736 | 0.0491 | ort:efficient
1 | 65536 | 16 | 64 | 0.000349 | 0.3845 | ort:lean
1 | 65536 | 32 | 128 | 0.001251 | 0.4290 | ort:flash
1 | 65536 | 32 | 128 | 0.002480 | 0.2165 | ort:efficient
1 | 65536 | 32 | 128 | 0.001239 | 0.4333 | ort:lean
4 | 512 | 16 | 64 | 0.000063 | 0.0665 | ort:flash
4 | 512 | 16 | 64 | 0.000069 | 0.0607 | ort:efficient
4 | 512 | 16 | 64 | 0.000066 | 0.0634 | ort:math
4 | 512 | 16 | 64 | 0.000062 | 0.0674 | ort:lean
4 | 512 | 32 | 128 | 0.000100 | 0.1677 | ort:flash
4 | 512 | 32 | 128 | 0.000099 | 0.1703 | ort:efficient
4 | 512 | 32 | 128 | 0.000108 | 0.1557 | ort:math
4 | 512 | 32 | 128 | 0.000092 | 0.1818 | ort:lean
4 | 1024 | 16 | 64 | 0.000077 | 0.1094 | ort:flash
4 | 1024 | 16 | 64 | 0.000099 | 0.0850 | ort:efficient
4 | 1024 | 16 | 64 | 0.000081 | 0.1038 | ort:math
4 | 1024 | 16 | 64 | 0.000072 | 0.1161 | ort:lean
4 | 1024 | 32 | 128 | 0.000143 | 0.2343 | ort:flash
4 | 1024 | 32 | 128 | 0.000137 | 0.2447 | ort:efficient
4 | 1024 | 32 | 128 | 0.000150 | 0.2245 | ort:math
4 | 1024 | 32 | 128 | 0.000135 | 0.2496 | ort:lean
4 | 2048 | 16 | 64 | 0.000096 | 0.1757 | ort:flash
4 | 2048 | 16 | 64 | 0.000156 | 0.1078 | ort:efficient
4 | 2048 | 16 | 64 | 0.000089 | 0.1892 | ort:lean
4 | 2048 | 32 | 128 | 0.000223 | 0.3010 | ort:flash
4 | 2048 | 32 | 128 | 0.000217 | 0.3101 | ort:efficient
4 | 2048 | 32 | 128 | 0.000209 | 0.3209 | ort:lean
4 | 4096 | 16 | 64 | 0.000137 | 0.2448 | ort:flash
4 | 4096 | 16 | 64 | 0.000256 | 0.1312 | ort:efficient
4 | 4096 | 16 | 64 | 0.000133 | 0.2530 | ort:lean
4 | 4096 | 32 | 128 | 0.000389 | 0.3450 | ort:flash
4 | 4096 | 32 | 128 | 0.000376 | 0.3574 | ort:efficient
4 | 4096 | 32 | 128 | 0.000354 | 0.3794 | ort:lean
4 | 8192 | 16 | 64 | 0.000210 | 0.3198 | ort:flash
4 | 8192 | 16 | 64 | 0.000453 | 0.1480 | ort:efficient
4 | 8192 | 16 | 64 | 0.000206 | 0.3260 | ort:lean
4 | 8192 | 32 | 128 | 0.000725 | 0.3705 | ort:flash
4 | 8192 | 32 | 128 | 0.000693 | 0.3874 | ort:efficient
4 | 8192 | 32 | 128 | 0.000653 | 0.4114 | ort:lean
4 | 16384 | 16 | 64 | 0.000355 | 0.3782 | ort:flash
4 | 16384 | 16 | 64 | 0.000849 | 0.1581 | ort:efficient
4 | 16384 | 16 | 64 | 0.000346 | 0.3874 | ort:lean
4 | 16384 | 32 | 128 | 0.001395 | 0.3848 | ort:flash
4 | 16384 | 32 | 128 | 0.001337 | 0.4017 | ort:efficient
4 | 16384 | 32 | 128 | 0.001252 | 0.4288 | ort:lean
4 | 32768 | 16 | 64 | 0.000647 | 0.4146 | ort:flash
4 | 32768 | 16 | 64 | 0.001649 | 0.1628 | ort:efficient
4 | 32768 | 16 | 64 | 0.000639 | 0.4204 | ort:lean
4 | 32768 | 32 | 128 | 0.002721 | 0.3947 | ort:flash
4 | 32768 | 32 | 128 | 0.002601 | 0.4128 | ort:efficient
4 | 32768 | 32 | 128 | 0.002434 | 0.4411 | ort:lean
4 | 65536 | 16 | 64 | 0.001231 | 0.4361 | ort:flash
4 | 65536 | 16 | 64 | 0.003238 | 0.1658 | ort:efficient
4 | 65536 | 16 | 64 | 0.001217 | 0.4412 | ort:lean
4 | 65536 | 32 | 128 | 0.005357 | 0.4009 | ort:flash
4 | 65536 | 32 | 128 | 0.005118 | 0.4196 | ort:efficient
4 | 65536 | 32 | 128 | 0.004781 | 0.4492 | ort:lean
16 | 512 | 16 | 64 | 0.000098 | 0.1724 | ort:flash
16 | 512 | 16 | 64 | 0.000104 | 0.1616 | ort:efficient
16 | 512 | 16 | 64 | 0.000118 | 0.1420 | ort:math
16 | 512 | 16 | 64 | 0.000087 | 0.1926 | ort:lean
16 | 512 | 32 | 128 | 0.000220 | 0.3062 | ort:flash
16 | 512 | 32 | 128 | 0.000208 | 0.3237 | ort:efficient
16 | 512 | 32 | 128 | 0.000237 | 0.2838 | ort:math
16 | 512 | 32 | 128 | 0.000209 | 0.3216 | ort:lean
16 | 1024 | 16 | 64 | 0.000136 | 0.2465 | ort:flash
16 | 1024 | 16 | 64 | 0.000150 | 0.2235 | ort:efficient
16 | 1024 | 16 | 64 | 0.000148 | 0.2266 | ort:math
16 | 1024 | 16 | 64 | 0.000129 | 0.2611 | ort:lean
16 | 1024 | 32 | 128 | 0.000367 | 0.3663 | ort:flash
16 | 1024 | 32 | 128 | 0.000351 | 0.3829 | ort:efficient
16 | 1024 | 32 | 128 | 0.000400 | 0.3357 | ort:math
16 | 1024 | 32 | 128 | 0.000349 | 0.3853 | ort:lean
16 | 2048 | 16 | 64 | 0.000209 | 0.3206 | ort:flash
16 | 2048 | 16 | 64 | 0.000243 | 0.2762 | ort:efficient
16 | 2048 | 16 | 64 | 0.000201 | 0.3338 | ort:lean
16 | 2048 | 32 | 128 | 0.000671 | 0.4002 | ort:flash
16 | 2048 | 32 | 128 | 0.000645 | 0.4163 | ort:efficient
16 | 2048 | 32 | 128 | 0.000642 | 0.4185 | ort:lean
16 | 4096 | 16 | 64 | 0.000360 | 0.3732 | ort:flash
16 | 4096 | 16 | 64 | 0.000425 | 0.3162 | ort:efficient
16 | 4096 | 16 | 64 | 0.000341 | 0.3933 | ort:lean
16 | 4096 | 32 | 128 | 0.001292 | 0.4156 | ort:flash
16 | 4096 | 32 | 128 | 0.001251 | 0.4291 | ort:efficient
16 | 4096 | 32 | 128 | 0.001241 | 0.4327 | ort:lean
16 | 8192 | 16 | 64 | 0.000666 | 0.4030 | ort:flash
16 | 8192 | 16 | 64 | 0.000804 | 0.3339 | ort:efficient
16 | 8192 | 16 | 64 | 0.000627 | 0.4283 | ort:lean
16 | 8192 | 32 | 128 | 0.002541 | 0.4226 | ort:flash
16 | 8192 | 32 | 128 | 0.002454 | 0.4376 | ort:efficient
16 | 8192 | 32 | 128 | 0.002438 | 0.4405 | ort:lean
16 | 16384 | 16 | 64 | 0.001292 | 0.4156 | ort:flash
16 | 16384 | 16 | 64 | 0.001571 | 0.3417 | ort:efficient
16 | 16384 | 16 | 64 | 0.001217 | 0.4411 | ort:lean
16 | 16384 | 32 | 128 | 0.005042 | 0.4260 | ort:flash
16 | 16384 | 32 | 128 | 0.004859 | 0.4420 | ort:efficient
16 | 16384 | 32 | 128 | 0.004827 | 0.4449 | ort:lean
16 | 32768 | 16 | 64 | 0.002537 | 0.4233 | ort:flash
16 | 32768 | 16 | 64 | 0.003103 | 0.3461 | ort:efficient
16 | 32768 | 16 | 64 | 0.002385 | 0.4501 | ort:lean
16 | 32768 | 32 | 128 | 0.009961 | 0.4312 | ort:flash
16 | 32768 | 32 | 128 | 0.009605 | 0.4472 | ort:efficient
16 | 32768 | 32 | 128 | 0.009524 | 0.4510 | ort:lean
16 | 65536 | 16 | 64 | 0.005019 | 0.4279 | ort:flash
16 | 65536 | 16 | 64 | 0.006133 | 0.3502 | ort:efficient
16 | 65536 | 16 | 64 | 0.004703 | 0.4566 | ort:lean
16 | 65536 | 32 | 128 | 0.019746 | 0.4350 | ort:flash
16 | 65536 | 32 | 128 | 0.019027 | 0.4515 | ort:efficient
16 | 65536 | 32 | 128 | 0.018864 | 0.4554 | ort:lean
