[BUG] Invalid characters in URI query are handled differently when reading from GPU #10036

Closed
hyperbolic2346 opened this issue Dec 13, 2023 · 3 comments
Assignees
Labels
bug Something isn't working

Comments

@hyperbolic2346
Collaborator

Describe the bug
Invalid characters, such as extended-ASCII bytes written as hex escapes (e.g. %9F), are handled differently when parse_uri returns a query. The GPU kernel leaves these values as-is and returns an extended-ASCII byte, which is not valid UTF-8, whereas the CPU replaces such characters with � (U+FFFD, UTF-8 bytes 0xEF 0xBF 0xBD).
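
For illustration only, here is a minimal Java sketch of the byte-level effect described above. It assumes the CPU path behaves like Java's standard lenient UTF-8 decoding; the class and method names are hypothetical and are not taken from the plugin or from Spark.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the plugin or Spark.
class InvalidUtf8Demo {
    // Strict check: a fresh CharsetDecoder reports malformed input by throwing.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Lenient decode: the String constructor substitutes U+FFFD for malformed input.
    static String lenientDecode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Lowercase hex dump of a string's UTF-8 encoding.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] decodedEscape = {(byte) 0x9F}; // the byte that %9F stands for
        System.out.println(isValidUtf8(decodedEscape));            // a lone 0x9F is a stray continuation byte
        System.out.println(utf8Hex(lenientDecode(decodedEscape))); // bytes of the replacement character
    }
}
```

A GPU kernel that emits the 0x9F byte unchanged therefore produces a string that fails the strict check, while a lenient CPU decoder produces the three-byte replacement character instead.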

Steps/Code to reproduce bug
Call parse_uri with a string such as "http://www.nvidia.com/object.php?object=ะก-Ð%9Fะฑ".

Expected behavior
Ideally the GPU would produce results that are bit-for-bit identical to the CPU's. It is not yet known which other characters the CPU replaces.

hyperbolic2346 added the bug and ? - Needs Triage labels Dec 13, 2023
@hyperbolic2346
Collaborator Author

This seems very similar to #9560 and the solutions may be the same depending on how CSV parsing is handled.

@hyperbolic2346
Collaborator Author

This may be a difference between Java and Spark: the Spark shell does not unescape these values.

hyperbolic2346 self-assigned this Dec 19, 2023
@hyperbolic2346
Collaborator Author

Spark calls the raw variants of the Java URI functions, which do not decode hex escapes. This means we can simply pass the hex-encoded data through as-is; no translation is necessary.
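
As a rough sketch of the difference, assuming the "raw API" here means accessors such as java.net.URI.getRawQuery (the thread does not name the exact call), the snippet below compares the raw and decoded query for a reduced URL. The URL is a shortened stand-in for the one in the report, and the class and method names are hypothetical.

```java
import java.net.URI;

// Hypothetical demo class; the choice of getRawQuery is an assumption about
// which raw accessor is meant in the comment above.
class RawQueryDemo {
    // Raw accessor: returns the query exactly as written, escapes included.
    static String rawQuery(String url) {
        return URI.create(url).getRawQuery();
    }

    // Decoding accessor: escaped octets are decoded as UTF-8, and octets that
    // cannot be decoded are replaced with U+FFFD.
    static String decodedQuery(String url) {
        return URI.create(url).getQuery();
    }

    public static void main(String[] args) {
        String url = "http://www.nvidia.com/object.php?object=%9F"; // reduced stand-in URL
        System.out.println(rawQuery(url));     // escape is left untouched
        System.out.println(decodedQuery(url)); // escape is decoded, then replaced
    }
}
```

If the raw form is the reference output, a GPU implementation only has to copy the query bytes through, so the question of UTF-8 validity never arises.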

mattahrens closed this as not planned Dec 20, 2023
mattahrens removed the ? - Needs Triage label Dec 20, 2023

2 participants