Suggestion: Don't rewrite unicode escape sequences to unicode #1188

wessberg · 2020-10-26T10:44:38Z

Describe the feature

Over at esbuild (which is often used in conjunction with swc to target ES5), the decision was recently made to switch over to producing ASCII-only output by default.

The decision was triggered specifically by the discussion in this issue (and originally this one) due to a combination of interoperability and performance concerns.

First, in terms of interoperability, this is in line with other bundlers such as Webpack, Parcel, and Rollup, and with Babel, which also preserves unicode escape sequences.

Parcel and Webpack generate minified bundles by default, and they use terser for that. Terser applies an optimization that transforms unicode escape sequences into UTF-8 characters, so the output bundle(s) from these tools may end up containing unicode characters that weren't there in your source code. The default behavior of these tools before terser is in the picture, however, is to respect whatever was written in the source code.

From a performance perspective, however, it gets really interesting. It turns out that many browsers have an optimized scanning path for ASCII-only code, and parsing an ASCII string appears to be around 1.7x faster in V8 than parsing a UTF-8 string. So there are potential performance gains as well.
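To make the distinction concrete, here is a minimal sketch of how one might check whether emitted code is ASCII-only. The helper `firstNonAsciiIndex` and the two sample sources are made up for illustration and are not part of swc or esbuild. Both samples define the same runtime string, but only the escaped one would stay on an ASCII-only scanning path.

```typescript
// Hypothetical helper (not an swc or esbuild API): returns the index of the
// first UTF-16 code unit above 0x7f, or -1 if the source is ASCII-only.
function firstNonAsciiIndex(source: string): number {
  for (let i = 0; i < source.length; i++) {
    if (source.charCodeAt(i) > 0x7f) {
      return i;
    }
  }
  return -1;
}

// Emitted code that keeps the escape sequence: six ASCII characters (\u2028).
const escapedSource = 'const sep = "\\u2028";';

// Emitted code where the escape was rewritten to the literal character U+2028.
const literalSource = 'const sep = "\u2028";';

console.log(firstNonAsciiIndex(escapedSource)); // -1
console.log(firstNonAsciiIndex(literalSource)); // 13
```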


kdy1 · 2020-10-26T11:04:46Z

It's a breaking change for Rust users. My midterm exam ends tomorrow, so I'll see if it's viable then.

cc @bartlomieju @ry @dsherret
What do you think about this?

ry · 2020-10-26T12:30:45Z

SGTM

kitsonk · 2020-10-28T03:37:34Z

Actually we just had this reported in Deno: denoland/deno#8161 and I was just about to raise an issue about it here.

Given the input of:

export function getIndex(c: string): number {
  return "\x00\r\n\x85\u2028\u2029".indexOf(c);
}

We would expect the output to be:

export function getIndex(c) {
    return "\x00\r\n\x85\u2028\u2029".indexOf(c);
}

But we are instead getting (the unicode escape sequences have been decoded to literal unicode characters):

export function getIndex(c) {
    return "\0\r\n�  ".indexOf(c);
}
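For illustration, an "emit only ASCII" step could look roughly like the sketch below. This is an assumption about the general technique, not swc's actual codegen: `escapeNonAscii` is a hypothetical helper that takes the runtime value of a string and returns an ASCII-only literal body. It matters for U+2028 and U+2029 in particular, because before ES2019 those characters counted as line terminators inside string literals, so emitting them literally can produce an unterminated string in older parsers.

```typescript
// Hypothetical helper (not swc's implementation): turn a string's runtime
// value into the body of a double-quoted literal using only printable ASCII.
function escapeNonAscii(value: string): string {
  let out = "";
  for (let i = 0; i < value.length; i++) {
    const ch = value.charAt(i);
    const code = value.charCodeAt(i); // UTF-16 code unit, 0..0xffff
    if (ch === "\\" || ch === '"') {
      // Keep the literal well-formed.
      out += "\\" + ch;
    } else if (code >= 0x20 && code <= 0x7e) {
      // Printable ASCII passes through unchanged.
      out += ch;
    } else if (code <= 0xff) {
      // Two-digit hex escape, e.g. \x85.
      out += "\\x" + ("00" + code.toString(16)).slice(-2);
    } else {
      // Four-digit unicode escape, e.g. \u2028. Astral characters are
      // emitted as two surrogate escapes, which is valid in a JS literal.
      out += "\\u" + ("0000" + code.toString(16)).slice(-4);
    }
  }
  return out;
}

console.log(escapeNonAscii("\x00\r\n\x85\u2028\u2029"));
// prints \x00\x0d\x0a\x85\u2028\u2029
```

Note that this sketch normalizes `\r` and `\n` to `\x0d` and `\x0a` instead of preserving the spelling from the source. Preserving the original spelling, as the expected output above does, would require keeping the raw source text of each literal around.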


swc-bot · 2022-10-26T00:30:35Z

This closed issue has been automatically locked because it had no new activity for a month. If you are running into a similar issue, please create a new issue with the steps to reproduce. Thank you.

wessberg mentioned this issue Oct 26, 2020

Not all Unicode Escape Sequences are escaped correctly, leading to "Unterminated string constant" errors #1187

Closed

kdy1 added this to the v1.2.37 milestone Oct 27, 2020

kdy1 mentioned this issue Oct 27, 2020

Fix issues #1189

Merged

kdy1 self-assigned this Oct 27, 2020

kitsonk mentioned this issue Oct 28, 2020

tests(cli): add test for improper unicode encoding denoland/deno#8162

Merged

kdy1 mentioned this issue Oct 28, 2020

Emit only ascii #1191

Merged

kdy1 closed this as completed in #1191 Oct 29, 2020

kdy1 added a commit that referenced this issue Oct 29, 2020

Emit only ascii (#1191)

32b3bbd

swc_ecma_codegen: - Emit only ascii characters. (#1187, #1188)

swc-project locked as resolved and limited conversation to collaborators Oct 26, 2022
