Added support for alternate character encodings #107

Merged
kescobo merged 3 commits into JuliaData:master on Jan 26, 2021

hexane360
Contributor

This PR adds support for alternate UTF character encodings. It hews as closely as possible to the YAML spec, as described in section 5.2. In particular, this means that byte order marks (BOMs) are allowed at the beginning of any document, not just the beginning of a stream.
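
For reference, the detection rules in section 5.2 of the YAML spec can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in this PR: `detect_encoding` is a hypothetical helper, it only inspects the first four bytes, and it leaves the actual decoding (for example via StringEncodings) out entirely.

```julia
# Hypothetical helper: guess the encoding of a YAML stream from its leading bytes,
# following the table in section 5.2 of the YAML 1.2 spec.
# Returns (encoding_name, number_of_BOM_bytes_to_skip).
function detect_encoding(bytes::Vector{UInt8})
    n = length(bytes)
    # True when `bytes` begins with the given byte pattern.
    starts(prefix) = n >= length(prefix) && bytes[1:length(prefix)] == prefix

    # Order matters: the UTF-32 patterns must be tried before the UTF-16 ones,
    # because the UTF-32LE BOM begins with the UTF-16LE BOM.
    starts(UInt8[0x00, 0x00, 0xFE, 0xFF]) && return ("UTF-32BE", 4)  # explicit BOM
    n >= 4 && starts(UInt8[0x00, 0x00, 0x00]) && return ("UTF-32BE", 0)  # ASCII first char
    starts(UInt8[0xFF, 0xFE, 0x00, 0x00]) && return ("UTF-32LE", 4)  # explicit BOM
    n >= 4 && bytes[2:4] == UInt8[0x00, 0x00, 0x00] && return ("UTF-32LE", 0)
    starts(UInt8[0xFE, 0xFF]) && return ("UTF-16BE", 2)              # explicit BOM
    n >= 2 && bytes[1] == 0x00 && return ("UTF-16BE", 0)             # ASCII first char
    starts(UInt8[0xFF, 0xFE]) && return ("UTF-16LE", 2)              # explicit BOM
    n >= 2 && bytes[2] == 0x00 && return ("UTF-16LE", 0)             # ASCII first char
    starts(UInt8[0xEF, 0xBB, 0xBF]) && return ("UTF-8", 3)           # explicit BOM
    return ("UTF-8", 0)                                              # default
end
```

Because the spec allows a BOM at the start of every document, a check like this only settles the encoding of the stream; later BOMs still have to be skipped by the scanner.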

@codecov

codecov bot commented Dec 30, 2020

Codecov Report

Merging #107 (b4a3627) into master (86e6ca6) will increase coverage by 0.55%.
The diff coverage is 98.18%.

@@            Coverage Diff             @@
##           master     #107      +/-   ##
==========================================
+ Coverage   83.82%   84.38%   +0.55%     
==========================================
  Files          12       12              
  Lines        1478     1531      +53     
==========================================
+ Hits         1239     1292      +53     
  Misses        239      239              
Impacted Files          Coverage Δ
src/YAML.jl             94.59% <ø> (+5.40%) ⬆️
src/buffered_input.jl   93.75% <80.00%> (-2.41%) ⬇️
src/parser.jl           75.65% <100.00%> (-0.16%) ⬇️
src/scanner.jl          84.43% <100.00%> (+1.00%) ⬆️
src/tokens.jl           100.00% <100.00%> (ø)
src/writer.jl           98.07% <0.00%> (+0.03%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -6,6 +6,7 @@ version = "0.4.5"
Base64 = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f"
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
Printf = "de0858da-6303-5e67-8744-51eddeeeb8d7"
StringEncodings = "69024149-9ee7-55f6-a4c4-859efe599b68"
Collaborator

Does this need a compat entry, or is it part of the StdLib?

Contributor Author

It definitely needs a compat entry.

@@ -1419,6 +1474,9 @@ function scan_plain_spaces(stream::TokenStream, indent::Integer,
forwardchars!(stream)
else
push!(breaks, scan_line_break(stream))
if peek(stream.input) == '\uFEFF'
Collaborator

Can you add something to test this branch?

Contributor Author

Yeah, that's a good idea. IIRC it's triggered for a YAML document with just a bare string:

---
just a string
\uFEFF---
another document
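
A rough way to build that input in Julia and confirm where the BOM sits is sketched below. It uses only Base and does not call the parser; `lines_starting_with_bom` is a hypothetical helper for illustration, not part of YAML.jl.

```julia
# Build the two-document stream from the example, with a BOM directly
# before the second document's `---` marker.
stream = string("---\n", "just a string\n", "\uFEFF", "---\n", "another document\n")

# Hypothetical helper: return the 1-based numbers of the lines whose first
# character is a BOM, i.e. the places where a BOM directly follows a line break
# (or starts the stream).
function lines_starting_with_bom(text::AbstractString)
    return [i for (i, line) in enumerate(split(text, '\n')) if startswith(line, "\uFEFF")]
end

bom_lines = lines_starting_with_bom(stream)
```

Here the BOM opens the third line, right after the line break that ends the plain scalar, which is the situation the new branch in `scan_plain_spaces` checks for.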

@kescobo
Collaborator

kescobo commented Dec 31, 2020

I don't know what this is for exactly, but if it gets us closer to the YAML spec I'm all for it. I left a couple of naive comments, and would love to get a more knowledgeable reviewer if possible.

Also - any idea what's going on with v1.3 tests? I don't remember why we test on v1.3, but I have some vague memory that it's there for a reason, so I don't just want to nuke it.

@hexane360
Contributor Author

I'll need to look into what's going on with Julia 1.3. I suspect method resolution is behaving slightly differently.

I'd appreciate a more knowledgeable reviewer taking a look.

My goal with this is to be able to handle almost any valid YAML file in the wild. Maybe exotic UTF encodings are pretty rare, but there's still a lot of software (e.g. notepad.exe) that generates byte order marks, so handling those correctly is important.

@hexane360
Contributor Author

Julia 1.3 was easier than expected; it turns out io.jl used the peek function differently.

Just now, I added a test case that should cover line 1477. It's needed for a document with a BOM after a document with trailing whitespace.
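
For context on the `peek` point: the scanner has to look at the next character (such as a BOM) without consuming it. A minimal sketch that avoids depending on how `peek` behaves in a given Julia release is shown below. It assumes the stream supports `mark` and `reset` (as `IOBuffer` does); `peek_char` is a hypothetical name, not the function this PR uses.

```julia
# Hypothetical helper: return the next character of `io` without consuming it,
# or `nothing` at end of stream. Uses only mark/read/reset from Base.
# Note: this overwrites any mark previously set on `io`.
function peek_char(io::IO)
    eof(io) && return nothing
    mark(io)            # remember the current position
    c = read(io, Char)  # decode one UTF-8 character
    reset(io)           # jump back to the marked position
    return c
end

io = IOBuffer("\uFEFF---\nanother document\n")
next_char = peek_char(io)
```

After the call the stream position is unchanged, so a scanner can test `next_char == '\uFEFF'` and then decide whether to skip the BOM.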

@kescobo
Collaborator

kescobo commented Jan 22, 2021

Fantastic, thanks! I'm just going to wait for CI to complete and then merge 👍

@hexane360
Contributor Author

I think the checks are stalled. Oddly enough, though, the run shows as completed in GitHub Actions: https://github.com/JuliaData/YAML.jl/actions/runs/504374788

@christopher-dG
Contributor

Restarted CI to see if that magically fixes anything, but I think the required check might be set up incorrectly (cc @kescobo).

@kescobo
Collaborator

kescobo commented Jan 26, 2021

I don't understand why Julia 1 isn't completing... That's the only thing I listed as being required. But that's identical right now to Julia 1.5, right? I'll just force merge and remove the lock for now.

@kescobo kescobo merged commit 94d8f13 into JuliaData:master Jan 26, 2021
3 participants