Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggestion: Don't rewrite unicode escape sequences to unicode #1188

Closed
wessberg opened this issue Oct 26, 2020 · 4 comments · Fixed by #1191
Closed

Suggestion: Don't rewrite unicode escape sequences to unicode #1188

wessberg opened this issue Oct 26, 2020 · 4 comments · Fixed by #1191
Assignees
Milestone

Comments

@wessberg
Copy link
Contributor

wessberg commented Oct 26, 2020

Describe the feature

Over at esbuild (which is often used in conjunction with swc to target ES5), the decision was recently made to switch over to producing ASCII-only output by default.

The decision was triggered specifically by the discussion in this issue (and originally this one) due to a combination of interoperability and performance concerns.

First, in terms of interoperability, this is in line with other bundlers such as Webpack, Parcel, and Rollup, and also in line with babel that also preserves unicode escape sequences.

Parcel and Webpack generates minified bundles by default, and they use terser for that. Terser applies an optimization that transforms unicode escape sequences into utf-8 characters, so you might end out with unicode-characters that weren't there in your source code in your output bundle(s) from these tools - but the default behavior of them before terser is in the picture is to respect whatever was written in the source code.

From a performance perspective, however, it get's really interesting. It turns out that many browsers have an optimized scanning path for ASCII code which seems to indicate that parsing an ASCII string is around 1.7x faster in V8 than parsing an UTF-8 string. So there's potential performance gains as well.

@kdy1
Copy link
Member

kdy1 commented Oct 26, 2020

It's a breaking change for rust users. My midterm exam ends on tomorrow, so I'll see if it's viable then.

cc @bartlomieju @ry @dsherret
How do you think about this?

@ry
Copy link

ry commented Oct 26, 2020

SGTM

@kdy1 kdy1 added this to the v1.2.37 milestone Oct 27, 2020
@kdy1 kdy1 mentioned this issue Oct 27, 2020
@kdy1 kdy1 self-assigned this Oct 27, 2020
@kitsonk
Copy link
Contributor

kitsonk commented Oct 28, 2020

Actually we just had this reported in Deno: denoland/deno#8161 and I was just about to raise an issue about it here.

Given the input of:

export function getIndex(c: string): number {
  return "\x00\r\n\x85\u2028\u2029".indexOf(c);
}

We would expect the output to be:

export function getIndex(c) {
    return "\x00\r\n\x85\u2028\u2029".indexOf(c);
}

But we are instead getting (the unicode chars have been encoded to unicode):

export function getIndex(c) {
    return "\0\r\n�

".indexOf(c);
}

@kdy1 kdy1 mentioned this issue Oct 28, 2020
kdy1 added a commit that referenced this issue Oct 29, 2020
swc_ecma_codegen:
 - Emit only ascii characters. (#1187, #1188)
@swc-bot
Copy link
Collaborator

swc-bot commented Oct 26, 2022

This closed issue has been automatically locked because it had no new activity for a month. If you are running into a similar issue, please create a new issue with the steps to reproduce. Thank you.

@swc-project swc-project locked as resolved and limited conversation to collaborators Oct 26, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Development

Successfully merging a pull request may close this issue.

5 participants