Get index of first non ascii char #39507
265901b to 280db35
0807a27 to 9a7de4e
Tagging subscribers to this area: @tannergooding
@@ -747,6 +787,18 @@ private static unsafe nuint GetIndexOfFirstNonAsciiChar_Sse2(char* pBuffer, nuin
goto FoundNonAsciiDataInFirstOrSecondVector;
}
}
else if (AdvSimd.Arm64.IsSupported)
{
currentMask = Unicode.Utf16Utility.GetNonAsciiBytes(AdvSimd.AddSaturate(combinedVector, asciiMaskForAddSaturate).AsByte(), bitmask);
Minor nit: the goto FoundNonAsciiDataInFirstOrSecondVector; statement is replicated across many branches, which could make the logic harder to modify in the long term. Consider factoring the SIMD logic into a predicate method, e.g. bool ContainNonAsciiDataInFirstOrSecondVector(), and then consuming it like so:
if (ContainNonAsciiDataInFirstOrSecondVector(..args..))
{
goto FoundNonAsciiDataInFirstOrSecondVector;
}
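A minimal sketch of such a predicate, assuming .NET 7 or later (for the Vector128<T> bitwise and equality operators). The name ContainsNonAsciiDataInFirstOrSecondVector and its two-vector signature are hypothetical, and the body swaps the PR's per-ISA branches for one portable test: OR the two vectors together and check whether any UTF-16 code unit has a bit set in 0xFF80, i.e. is above 0x7F.

```csharp
using System.Runtime.Intrinsics;

public static class AsciiPredicateSketch
{
    // Hypothetical helper: true if either vector holds a UTF-16 code unit above 0x7F.
    // Portable Vector128 operators stand in for the PR's Sse2/AdvSimd-specific code.
    public static bool ContainsNonAsciiDataInFirstOrSecondVector(
        Vector128<ushort> firstVector, Vector128<ushort> secondVector)
    {
        // Any bit in 0xFF80 set means the code unit is outside [0x00, 0x7F].
        Vector128<ushort> nonAsciiBits = Vector128.Create((ushort)0xFF80);

        // One OR merges both vectors, so a single test covers 16 chars.
        return ((firstVector | secondVector) & nonAsciiBits) != Vector128<ushort>.Zero;
    }
}
```

Each branch would then shrink to if (ContainsNonAsciiDataInFirstOrSecondVector(firstVector, secondVector)) goto FoundNonAsciiDataInFirstOrSecondVector;. One trade-off worth weighing: a bool-returning predicate throws away currentMask, so the "found" path would have to recompute it if it needs the exact position.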
Do you mind sending a PR to
Vector128<byte> bitmask = BitConverter.IsLittleEndian ?
Vector128.Create(0x80402010_08040201).AsByte() :
Vector128.Create(0x01020408_10204080).AsByte();
I think this is also true for other PRs that Carlos sent, but can you remind me why we use bitmask for !IsLittleEndian if big-endian is not supported? Is the idea that when we start supporting it, the code will just work?
Yup, that's the idea.
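Why the two constants differ: the usual purpose of a per-lane weight vector like this is that byte lane i always carries the weight 1 << i, so AND-ing a 0xFF/0x00 compare mask with it and summing each 8-byte half can emulate an SSE2-style movemask (that this is exactly what GetNonAsciiBytes does is an assumption here). The scalar check below, not code from the PR, confirms that the big-endian constant is just the byte-reversed little-endian one, so both land in lane order identically.

```csharp
using System;
using System.Buffers.Binary;

public static class BitmaskEndiannessSketch
{
    // The two constants from the PR, one per byte order.
    public const ulong LittleEndianConstant = 0x80402010_08040201UL;
    public const ulong BigEndianConstant = 0x01020408_10204080UL;

    // Bytes of the selected constant in native memory order, which is the
    // lane order observed after Vector128.Create(ulong).AsByte().
    public static byte[] LaneWeights()
    {
        ulong constant = BitConverter.IsLittleEndian ? LittleEndianConstant : BigEndianConstant;
        return BitConverter.GetBytes(constant);
    }

    // The big-endian constant is the byte-reversed little-endian one.
    public static bool ConstantsAreByteReversed() =>
        BinaryPrimitives.ReverseEndianness(LittleEndianConstant) == BigEndianConstant;
}
```

On either byte order LaneWeights() yields 0x01, 0x02, 0x04, ..., 0x80, which is why the code is expected to "just work" once big-endian is supported.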
You should not port the SSE2 logic to AdvSimd as-is. AdvSimd performance can be better, and it doesn't have to go through the AddSaturate() logic.
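The AddSaturate() step exists because SSE2 has no unsigned 16-bit compare; the saturating add is a roundabout way of asking "is this code unit above 0x7F?". A scalar sketch per code unit, assuming asciiMaskForAddSaturate is 0x7F80 in every lane (its value is not shown in this excerpt), shows the two formulations agree, so on AdvSimd a single unsigned compare can replace the add.

```csharp
using System;

public static class NonAsciiFlagSketch
{
    // Assumed value of asciiMaskForAddSaturate; not shown in the PR excerpt.
    private const int AsciiMaskForAddSaturate = 0x7F80;

    // SSE2-style: a saturating add pushes any code unit >= 0x80 into bit 15.
    public static bool ViaAddSaturate(ushort c)
    {
        int sum = Math.Min(c + AsciiMaskForAddSaturate, ushort.MaxValue);
        return (sum & 0x8000) != 0;
    }

    // Direct unsigned comparison, the kind of check AdvSimd can do in one instruction.
    public static bool ViaCompare(ushort c) => c > 0x7F;
}
```

0x7F + 0x7F80 = 0x7FFF keeps bit 15 clear, while 0x80 + 0x7F80 = 0x8000 sets it, and saturation keeps larger inputs from wrapping back below 0x8000.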
} | ||
else if (AdvSimd.Arm64.IsSupported) | ||
{ | ||
currentMask = Unicode.Utf16Utility.GetNonAsciiBytes(AdvSimd.AddSaturate(firstVector, asciiMaskForAddSaturate).AsByte(), bitmask); |
You should check out the excellent feedback from @TamarChristinaArm in #39050 (comment) and #39050 (comment) about optimizing this code. I think you should follow the same approach, because in the end all you are doing is calling BitOperations.TrailingZeroCount() on it.
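To make the "all you do is TrailingZeroCount()" point concrete, here is a scalar model of an SSE2-style path over one 16-byte chunk (8 chars). Assumptions, not taken from this excerpt: the add constant is 0x7F80, the layout is little-endian, and a NonAsciiDataSeenMask of 0b_1010_1010_1010_1010 keeps only the odd bits (the high byte of each char), since the low-byte flags are noise after the add. Names are hypothetical.

```csharp
using System;
using System.Numerics;

public static class FirstNonAsciiIndexSketch
{
    // Odd bits of the byte mask map to the high byte of each UTF-16 char (little-endian).
    private const uint NonAsciiDataSeenMask = 0b_1010_1010_1010_1010;

    // Scalar stand-in for MoveMask(AddSaturate(chars, 0x7F80).AsByte()); only the first 8 chars are used.
    public static uint ByteMask(ReadOnlySpan<char> chars)
    {
        uint mask = 0;
        for (int i = 0; i < chars.Length && i < 8; i++)
        {
            int sum = Math.Min(chars[i] + 0x7F80, ushort.MaxValue);
            if ((sum & 0x80) != 0) mask |= 1u << (2 * i);       // low-byte msb: noise
            if ((sum & 0x8000) != 0) mask |= 1u << (2 * i + 1); // high-byte msb: real flag
        }
        return mask;
    }

    // Char index of the first non-ASCII char among the first 8, or -1 if none.
    public static int IndexOfFirstNonAscii(ReadOnlySpan<char> chars)
    {
        uint mask = ByteMask(chars) & NonAsciiDataSeenMask;
        if (mask == 0) return -1;
        // TrailingZeroCount yields a byte offset; each char spans two bytes.
        return BitOperations.TrailingZeroCount(mask) / 2;
    }
}
```

The full mask is only ever consumed by that one TrailingZeroCount() call, which is why a cheaper AdvSimd sequence that yields any bit pattern with the same trailing-zero behavior is sufficient.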
Agreed with splitting it off. In this case, it looks like this code only cares about the index of the first non-zero element, so we can do even better here.
If you use a mask of 0x0f00 and BIC, you can get a much more efficient sequence (see https://github.com/ARM-software/optimized-routines/blob/224cb5f67b71757b99fe1e10b5a437c17a1d733c/string/aarch64/strlen.S#L164), essentially:
cmlt v1.16b, v1.16b, #0
bic v1.8h, 0x0f, lsl 8
umaxp v1.16b, v1.16b, v1.16b
fmov x1, d1
rbit x1, x1
clz x0, x1
as the sequence to get the first element that has the msb set. Of course, for the cases where it's just doing if (currentMask != 0), you just need a maxp and fmov, as I explained in the other post.
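A scalar model of the sequence above, written for readability and not as the PR's implementation (little-endian assumed; names hypothetical). BIC clears the low nibble of each odd byte, so after the pairwise max a set even byte contributes 0xFF and a set odd byte alone contributes 0xF0. Each original byte therefore owns 4 bits of the 64-bit syndrome, and the trailing-zero count (rbit + clz) shifted right by 2 is the byte index.

```csharp
using System;
using System.Numerics;

public static class SyndromeSketch
{
    // Scalar model of: cmlt .16b #0; bic .8h #0x0f, lsl 8; umaxp .16b; fmov x, d
    public static ulong Syndrome(ReadOnlySpan<byte> block)
    {
        if (block.Length != 16) throw new ArgumentException("expected 16 bytes", nameof(block));

        // cmlt #0: 0xFF for lanes whose most significant bit is set, else 0x00.
        Span<byte> lanes = stackalloc byte[16];
        for (int i = 0; i < 16; i++)
            lanes[i] = (block[i] & 0x80) != 0 ? (byte)0xFF : (byte)0x00;

        // bic .8h, #0x0f, lsl 8: clear bits 8..11 of each halfword (low nibble of each odd byte).
        for (int i = 1; i < 16; i += 2)
            lanes[i] &= 0xF0;

        // umaxp folds byte pairs into 8 bytes; fmov moves them into one 64-bit value.
        ulong syndrome = 0;
        for (int k = 0; k < 8; k++)
        {
            byte folded = Math.Max(lanes[2 * k], lanes[2 * k + 1]);
            syndrome |= (ulong)folded << (8 * k);
        }
        return syndrome;
    }

    // rbit + clz is a trailing-zero count; 4 syndrome bits per input byte.
    public static int IndexOfFirstMsbSet(ReadOnlySpan<byte> block)
    {
        ulong syndrome = Syndrome(block);
        return syndrome == 0 ? -1 : BitOperations.TrailingZeroCount(syndrome) >> 2;
    }
}
```

This works at byte granularity, as in strlen; applying it to UTF-16 data would additionally need the per-char flag in a known byte of each char and a final division by 2 to get a char index.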
This is for little-endian, by the way; for big-endian you need a slight variation, though that can be avoided by loading with LD1 instead of LDR (which I think is what LoadVector128 does here).
} | ||
else if (AdvSimd.Arm64.IsSupported) | ||
{ | ||
firstVector = AdvSimd.LoadVector128((ushort*)pBuffer); |
Just a note for the future: this would be a great place for LoadPair #39243
} | ||
else if (AdvSimd.Arm64.IsSupported) | ||
{ | ||
currentMask = Unicode.Utf16Utility.GetNonAsciiBytes(AdvSimd.AddSaturate(firstVector, asciiMaskForAddSaturate).AsByte(), bitmask); |
Updated Perf:
Filed #40805 to follow up on Tamar's suggestions here. Considering that we're already seeing decent perf improvements here, how do folks feel about merging this PR now and investigating the suggestions in a future PR?
Couple of things:
With that, I am not sure if we should rush this in for RC1.
@pgovind I'm going to close this PR since this work is on hold. When we get back around to this work, we can create a fresh PR.
I think this is ready and can be reviewed now.
Implements GetIndexOfFirstNonAsciiChar from #35034
Updated Perf: