Added support for alternate character encodings #107

hexane360 · 2020-12-30T23:42:57Z

This PR adds support for alternate UTF character encodings. It hews as closely as possible to the YAML spec, as described in section 5.2. In particular, this means that byte order marks (BOMs) are allowed at the beginning of any document, not just the beginning of a stream.
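For readers unfamiliar with section 5.2, the BOM part of the detection can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `detect_bom` is a hypothetical helper name, and the sketch covers only explicit BOMs (the spec also infers the encoding from null-byte patterns when no BOM is present, which is omitted here).

```julia
# Hypothetical helper, not the PR's implementation.
# Returns (encoding name, BOM length in bytes) based on the leading bytes.
function detect_bom(bytes::AbstractVector{UInt8})
    n = length(bytes)
    # Check the 4-byte UTF-32 marks before the 2-byte UTF-16 marks:
    # the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
    if n >= 4 && bytes[1:4] == [0x00, 0x00, 0xFE, 0xFF]
        return ("UTF-32BE", 4)
    elseif n >= 4 && bytes[1:4] == [0xFF, 0xFE, 0x00, 0x00]
        return ("UTF-32LE", 4)
    elseif n >= 2 && bytes[1:2] == [0xFE, 0xFF]
        return ("UTF-16BE", 2)
    elseif n >= 2 && bytes[1:2] == [0xFF, 0xFE]
        return ("UTF-16LE", 2)
    elseif n >= 3 && bytes[1:3] == [0xEF, 0xBB, 0xBF]
        return ("UTF-8", 3)
    else
        return ("UTF-8", 0)  # no BOM found: fall back to UTF-8
    end
end
```

Since BOMs are allowed at the start of any document, a check like this would presumably have to be repeated at each document boundary rather than once per stream.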

codecov · 2020-12-30T23:44:36Z

Codecov Report

Merging #107 (b4a3627) into master (86e6ca6) will increase coverage by 0.55%.
The diff coverage is 98.18%.

@@            Coverage Diff             @@
##           master     #107      +/-   ##
==========================================
+ Coverage   83.82%   84.38%   +0.55%     
==========================================
  Files          12       12              
  Lines        1478     1531      +53     
==========================================
+ Hits         1239     1292      +53     
  Misses        239      239

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/YAML.jl | `94.59% <ø> (+5.40%)` | ⬆️ |
| src/buffered_input.jl | `93.75% <80.00%> (-2.41%)` | ⬇️ |
| src/parser.jl | `75.65% <100.00%> (-0.16%)` | ⬇️ |
| src/scanner.jl | `84.43% <100.00%> (+1.00%)` | ⬆️ |
| src/tokens.jl | `100.00% <100.00%> (ø)` | |
| src/writer.jl | `98.07% <0.00%> (+0.03%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 86e6ca6...b4a3627.

kescobo · 2020-12-31T19:45:46Z

Project.toml

@@ -6,6 +6,7 @@ version = "0.4.5"
 Base64 = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f"
 Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
 Printf = "de0858da-6303-5e67-8744-51eddeeeb8d7"
+StringEncodings = "69024149-9ee7-55f6-a4c4-859efe599b68"


Does this need a compat entry, or is it part of the StdLib?

It definitely needs a compat entry.
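A compat entry for the new dependency could look like the fragment below. The version bound is an assumption for illustration; the thread does not say which StringEncodings release was targeted.

```toml
[compat]
StringEncodings = "0.3"  # assumed bound, not taken from the PR
```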

kescobo · 2020-12-31T19:47:23Z

src/scanner.jl

@@ -1419,6 +1474,9 @@ function scan_plain_spaces(stream::TokenStream, indent::Integer,
                forwardchars!(stream)
            else
                push!(breaks, scan_line_break(stream))
+                if peek(stream.input) == '\uFEFF'


Can you add something to test this branch?

Yeah, that's a good idea. IIRC it's triggered for a YAML document with just a bare string:

--- just a string
\uFEFF--- another document
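The situation that branch handles can be reproduced outside the scanner with a small stand-alone sketch. It assumes the BOM directly follows a line break, as the `scan_line_break` call in the diff suggests; `skip_bom!` is a hypothetical helper, not a function from YAML.jl. It uses `mark`/`reset` rather than `peek` so that it does not depend on the Julia version.

```julia
# Hypothetical helper: consume a BOM at the current position, if there is one.
# Returns true if a BOM was skipped; otherwise leaves the position unchanged.
function skip_bom!(io::IO)
    eof(io) && return false
    mark(io)                      # remember where we are
    if read(io, Char) == '\uFEFF'
        unmark(io)                # keep the BOM consumed
        return true
    end
    reset(io)                     # not a BOM: rewind to the mark
    return false
end

io = IOBuffer("--- just a string\n\uFEFF--- another document")
first_doc = readline(io)          # "--- just a string"
had_bom = skip_bom!(io)           # true
rest = read(io, String)           # "--- another document"
```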

kescobo · 2020-12-31T19:50:33Z

I don't know what this is for exactly, but if it gets us closer to the YAML spec I'm all for it. I left a couple of naive comments and would love to get a more knowledgeable reviewer if possible.

Also - any idea what's going on with v1.3 tests? I don't remember why we test on v1.3, but I have some vague memory that it's there for a reason, so I don't just want to nuke it.

hexane360 · 2021-01-02T06:05:23Z

I'll need to look into what's going on with Julia 1.3. I suspect method resolution is behaving slightly differently.

I'd appreciate a more knowledgeable reviewer taking a look.

My goal with this is to be able to handle almost any valid YAML file in the wild. Maybe exotic UTF encodings are pretty rare, but there's still a lot of software (e.g. notepad.exe) that generates byte order marks, so handling those correctly is important.

hexane360 · 2021-01-22T18:43:23Z

Julia 1.3 was easier than expected; it turns out io.jl used the peek function differently.

Just now, I added a test case that should cover line 1477. It's needed for a document with a BOM after a document with trailing whitespace.

kescobo · 2021-01-22T19:02:36Z

Fantastic, thanks! I'm just going to wait for CI to complete and then merge 👍

hexane360 · 2021-01-25T18:44:03Z

I think the checks are stalled. Oddly enough, though, the run shows as completed in GitHub Actions: https://github.com/JuliaData/YAML.jl/actions/runs/504374788

christopher-dG · 2021-01-25T20:44:07Z

Restarted CI to see if that magically fixes anything, but I think the required check might be set up incorrectly (cc @kescobo).

kescobo · 2021-01-26T14:38:04Z

I don't understand why Julia 1 isn't completing... That's the only thing I listed as being required. But that's identical right now to Julia 1.5, right? I'll just force merge and remove the lock for now.

Added support for alternate character encodings

77af681

kescobo reviewed Dec 31, 2020

Fix for encodings on Julia 1.3

c1a53a8

kescobo approved these changes Jan 22, 2021

Added test for BOM after document with trailing whitespace

b4a3627

kescobo merged commit 94d8f13 into JuliaData:master Jan 26, 2021

hexane360 mentioned this pull request May 22, 2021

Fails on flow sequences #114

Closed
