-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added support for alternate character encodings #107
Conversation
Codecov Report
@@ Coverage Diff @@
## master #107 +/- ##
==========================================
+ Coverage 83.82% 84.38% +0.55%
==========================================
Files 12 12
Lines 1478 1531 +53
==========================================
+ Hits 1239 1292 +53
Misses 239 239
Continue to review full report at Codecov.
|
@@ -6,6 +6,7 @@ version = "0.4.5" | |||
Base64 = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f" | |||
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a" | |||
Printf = "de0858da-6303-5e67-8744-51eddeeeb8d7" | |||
StringEncodings = "69024149-9ee7-55f6-a4c4-859efe599b68" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need a compat entry, or is it part of the StdLib?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It definitely needs a compat entry.
@@ -1419,6 +1474,9 @@ function scan_plain_spaces(stream::TokenStream, indent::Integer, | |||
forwardchars!(stream) | |||
else | |||
push!(breaks, scan_line_break(stream)) | |||
if peek(stream.input) == '\uFEFF' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add something to test this branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's a good idea. IIRC it's triggered for a YAML document with just a bare string:
---
just a string
\uFEFF---
another document
I don't know what this is for exactly, but if it gets us closer to YAML spec I'm all for it. I left a couple of naive comments, would love to get a more knowledgable reviewer if possible. Also - any idea what's going on with v1.3 tests? I don't remember why we test on v1.3, but I have some vague memory that it's there for a reason, so I don't just want to nuke it. |
I'll need to look into what's going on with Julia 1.3. I suspect method resolution is behaving slightly differently. I'd appreciate a more knowledgeable reviewer taking a look. My goal with this is to be able to handle almost any valid YAML file in the wild. Maybe exotic UTF encodings are pretty rare, but there's still a lot of software (e.g. notepad.exe) that generates byte order marks, so handling those correctly is important. |
Julia 1.3 was easier than expected; it turns out Just now, I added a test case that should cover line 1477. It's needed for a document with a BOM after a document with trailing whitespace. |
Fantastic, thanks! I'm just going to wait for CI to complete and then merge 👍 |
I think the checks are stalled. Oddly enough though, it shows as completed in Github Actions: https://github.com/JuliaData/YAML.jl/actions/runs/504374788 |
Restarted CI to see if that magically fixes anything, but I think the required check might be set up incorrectly (cc @kescobo). |
I don't understand why |
This PR adds support for alternate UTF character encodings. It hews as closely as possible to the YAML spec, as described in section 5.2. In particular, this means that byte order marks (BOMs) are allowed at the beginning of any document, not just the beginning of a stream.