OCF Document: RFC3986 or RFC3987? #808
Comments
I don't believe 3987 supersedes 3986; more that it builds on it. The base URI is defined in 3986, for example. 3987 only makes reference to using the algorithms in 3986 in the relative IRI references section. ODF does similar, which I expect is what OCF drew on (see, for example, http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part3.html#__RefHeading__752821_826425813). The space question is interesting. Most of this prose goes back to EPUB 2.0, and I can't recall it being questioned except for some additions to the list in 3.4. I just tested percent encoding a space, and epubcheck and Readium couldn't make sense of the rootfile. They do handle unencoded spaces, though. I wonder whether the example has come to influence implementations to the point where enforcing what is supposed to be done would cause problems for existing implementations and content. But this isn't my comfort zone, either. |
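To make the rootfile experiment above concrete, here is a minimal Python sketch that reads the full-path attribute from a container.xml and shows what its percent-encoded form would look like. The sample document and the helper name rootfile_paths are assumptions made for illustration; this is not how epubcheck or Readium process the file.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote, unquote

# Namespace of OCF container.xml documents.
NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}

# Hypothetical sample, modelled on the rootfile example discussed in this issue.
SAMPLE = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="EPUB/Great Expectation.opf"
              media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""


def rootfile_paths(container_xml: str) -> list[str]:
    """Return the raw full-path value of every rootfile element."""
    root = ET.fromstring(container_xml)
    return [rf.get("full-path") for rf in root.findall("c:rootfiles/c:rootfile", NS)]


raw = rootfile_paths(SAMPLE)[0]
encoded = quote(raw)  # "/" is kept as-is by default; the space becomes %20
print(raw)
print(encoded)
print(unquote(encoded) == raw)
```

Whether a reading system should compare the raw or the decoded value against the ZIP entry name is exactly the open question in this thread; the sketch only shows the two forms side by side.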
Hm. I am even more lost at this point, probably because I never looked at the OASIS document before. But the reference you give seems to say that ODF relies solely on 3987, which actually reinforces my first point: there is no real reason why our OCF document should refer to 3986 at all! |
To be more precise: 3987 does refer to 3986 for the various BNF constructions et al. But that is internal to 3987; our own starting point should be 3987. |
On the issue of space characters, I have done some extra research, for reference. RFC 3987 says at Relative IRI references:
Earlier in RFC3987 it says:
The limitations it refers to are in the ABNF Rules. The important entry is the reference to
The last entry is a bunch of extra Unicode characters, irrelevant here. The next point to look at is the definition of ALPHA. The strange thing is that
I.e., ALPHA does not include the space character. What this means is that the space character is definitely not allowed in a relative URI reference in RFC3987 (or RFC3986). More pragmatically, I believe this makes sense in practice. While it is true that a file name on my Mac may include a space character, I am not sure this is true on all Linux systems or Windows (maybe in Windows 10 it is fine; I never checked). I know that whenever I push a file from my machine to my Web server, I am careful to replace the space character with, say, a |
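The point about the ABNF can be checked mechanically. The following Python sketch encodes the RFC 3986 pchar production (unreserved, pct-encoded, sub-delims, ":" and "@") as a regular expression and tests single path segments against it. The function name is_rfc3986_segment is a hypothetical helper, and the sketch covers only one non-empty path segment, not a full URI reference.

```python
import re

# RFC 3986:
#   unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
#   sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
#   pchar      = unreserved / pct-encoded / sub-delims / ":" / "@"
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCHAR = rf"(?:[{UNRESERVED}{SUB_DELIMS}:@]|%[0-9A-Fa-f]{{2}})"
SEGMENT_NZ = re.compile(rf"{PCHAR}+")


def is_rfc3986_segment(name: str) -> bool:
    """True if name is a non-empty RFC 3986 path segment."""
    return SEGMENT_NZ.fullmatch(name) is not None


for candidate in ["Great Expectation.opf", "Great%20Expectation.opf",
                  "Great_Expectation.opf", "téléphone"]:
    print(candidate, is_rfc3986_segment(candidate))
```

Under this grammar both the literal space and the non-ASCII letters are rejected; only RFC 3987 widens the set to cover the latter.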
I wonder what I have been doing.... First, I think that we should explicitly say that a file or path name matches isegment-nz in RFC 3987. Or, we might even want to use ipath-rootless (but we would have to disallow a trailing empty segment). Second, W3C has Legacy Extended IRIs (LEIRIs). It shows a list of characters allowed by legacy variants of IRIs but disallowed by RFC 3987. Some characters in this list are explicitly disallowed by OCF 3.1, but they do not have to be. They are shown below:
Third, I believe that the characters in the next itemized list are already disallowed by RFC 3987 and do not have to be mentioned in OCF 3.1.
Fourth, we disallow some characters, although they are allowed by RFC 3987. I do not think that we should lift this limitation.
Fifth, we should explicitly allow the use of the space character, although we use RFC 3987 as a basis. |
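As a rough illustration of the first and fifth points, the sketch below tests a name against an approximation of isegment-nz from RFC 3987 and offers an opt-in flag for the literal space. The ucschar ranges are transcribed from the RFC's ABNF and should be verified against it; the names is_ucschar, is_isegment_nz and the allow_space flag are hypothetical, and none of the additional OCF restrictions are modelled.

```python
import re

# ASCII part of ipchar: unreserved / sub-delims / ":" / "@" (as in RFC 3986).
ASCII_PCHAR = re.compile(r"[A-Za-z0-9\-._~!$&'()*+,;=:@]")
PCT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")


def is_ucschar(ch: str) -> bool:
    """Approximation of the RFC 3987 ucschar production."""
    cp = ord(ch)
    if 0xA0 <= cp <= 0xD7FF or 0xF900 <= cp <= 0xFDCF or 0xFDF0 <= cp <= 0xFFEF:
        return True
    if 0xE1000 <= cp <= 0xEFFFD:
        return True
    # Planes 1 to 13: %x10000-1FFFD through %xD0000-DFFFD.
    return 0x10000 <= cp <= 0xDFFFD and (cp & 0xFFFF) <= 0xFFFD


def is_isegment_nz(name: str, allow_space: bool = False) -> bool:
    """True if name is a non-empty run of ipchar (optionally plus space)."""
    if not name:
        return False
    i = 0
    while i < len(name):
        ch = name[i]
        if PCT_ENCODED.match(name, i):
            i += 3  # skip "%" and its two hex digits
        elif ASCII_PCHAR.match(ch) or is_ucschar(ch) or (allow_space and ch == " "):
            i += 1
        else:
            return False
    return True


print(is_isegment_nz("téléphone"))
print(is_isegment_nz("Great Expectation.opf"))
print(is_isegment_nz("Great Expectation.opf", allow_space=True))
```

The allow_space flag stands in for the explicit exception proposed in the fifth point; without it the space is rejected, as the ABNF requires.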
As far as I know, the only reason for referencing RFC 3986 is to borrow some terms which are not even mentioned in RFC 3987. |
Just one comment…
|
Let me roll back a little bit: I have just checked and it seems that space is allowed in Linux; my bad. However, if one uses a URL to access a file that is in the 'exploded' version of an EPUB instance, that URL MUST use %20 instead of a space per existing RFCs. I.e., although my argument on Linux is wrong, I still stand by my conclusion: it may backfire on us later if we allow spaces in file names within a publication... |
I have to mention that spaces in file names inside the EPUB package are grounds for rejection of EPUB files by distributors today. |
At this point, I believe that a radical re-write of the OCF document in terms of references may be too much to do; let us leave this (and flag this for clean-up!) for a later version. But (also in view of @laudrain's comment) I believe the issue of the space character is real and a bug we should not perpetuate. To solve the issue for EPUB 3.1, I would propose the following: change all the examples in the document (there are quite a few) by removing the space character, or replacing it with the
|
+1 to Ivan's above. I don't think we should wade into spec changes beyond the examples. |
+1 for me, just change the examples. |
#755 - change alt-script to alt-rep and clarify language
#761 - make image cmts required when there is a viewport
#773 - update roadmap and add diagram
#778 - clarify package conformance
#780 - generalize backwards compatibility statement
#800 - clarify svg handling for fxl documents
#808 - replace spaces with underscores in rootfile examples
#822 - fix obsolete feature labels/descriptions
#823 - add note about incomplete RS requirements for scrolled-continuous
#824 - add clearer content model for nav elements
#826 - note toc nav is required in intro
#828 - clarify ordering requirements for toc nav references
#829 - note optional use of pagebreak with page-list
adds a link to the informative a11y faq; patches errata not applied to doi examples; probably some other minor stuff, too
Should we do something in EPUB 3.2? |
This issue just came up in an epubcheck discussion about the reporting of spaces in file names and URIs. Epubcheck emits a warning if spaces are included in file names, but that's being done without any specific resolution to this issue. (Technically, the only invalid aspect right now is if URIs that reference the files are not percent encoded.) Do we want to revisit this and perhaps note in the section on file naming restrictions that the use of spaces, while maybe not forbidden, is not recommended so that we can pair the warning up with a proper statement in the specification? |
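For reference, a check of this kind is easy to sketch. The Python below lists the ZIP entry names in a container that contain a space and prints a warning for each. It is only an illustration of the idea, not epubcheck's implementation; the helper names and the sample archive contents are assumptions.

```python
import io
import zipfile


def names_with_spaces(epub_bytes: bytes) -> list[str]:
    """Return the ZIP entry names that contain a space character."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        return [name for name in zf.namelist() if " " in name]


def build_sample() -> bytes:
    """Build a tiny in-memory archive with hypothetical entry names."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("EPUB/Great Expectation.opf", "<package/>")
        zf.writestr("EPUB/chapter_1.xhtml", "<html/>")
    return buf.getvalue()


for name in names_with_spaces(build_sample()):
    print(f"WARNING: file name contains a space: {name}")
```

Reporting this as a warning rather than an error matches the "not forbidden, but not recommended" wording suggested above.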
Should we reference the WHATWG URL specification? Should we consider WHATWG URL API in Node.js as a reference implementation? |
But I am also aware of MY URL ISN’T YOUR URL. |
This is certainly what newer W3C documents refer to (similarly to the references to the WHATWG HTML spec). The W3C and the WHATWG are hammering out an agreement on working together, and the URL spec is definitely part of that agreement.
I am not sure why we would look at a reference implementation here. I would think what really counts is the implementations in browsers rather than Node.js and, in this sense, what counts is the relevant URL test suite... |
I'm curious: most OSes (Linux, Windows, macOS) accept spaces in file names, and URIs or IRIs referencing those files percent-encode the spaces. Nothing special here. |
Ya, they just have a way of tripping up command line tools, piping operations, etc. Making them illegal is probably a bit much, but I tested back to the epubcheck 3.0.1 release from 2013 and it has been emitting a warning about their use since at least then, so adding a warning to the specification isn't really changing reality in any way. But if we decide they should be allowed, then conversely epubcheck needs to be modified. |
Yes, to retain compatibility with HTML! For example, here's how HTML defines the
|
I think we need to talk about the space character, and whether we can move to using the WHATWG URL spec. |
I don't think this is an issue anymore. There was a time when the URL specification only defined parsing of URLs (as complained about in the article mentioned above), but it now includes a section that defines the syntax for valid URLs. Looks like that was added sometime around 2017. Without that syntax, we'd have lost validation, and that would have made for an interoperability mess. |
The issue was discussed in a meeting on 2021-05-07. List of resolutions:
5. OCF Document: RFC3986 or RFC3987?
See github issue #808.
Dave Cramer: basically the question is generally how we define URLs in our spec
Dave Cramer: if we change our spec to refer to that instead of these RFCs then we are better off
|
(I know this is a huge can of worms, and I regularly get it wrong; maybe it is the case this time, too...)
The issue is what the allowed characters are for file names. More specifically, what the allowed characters are for the value of full-path in the container.xml file.
/téléphone is all right, although this would not be acceptable via RFC3986. (Although I wonder whether the reference to RFC3986 is necessary in the first place; in this respect RFC3987 supersedes RFC3986, doesn't it?)
full-path. This seems to be in contradiction with what is in Section 3.3
path portion thereof) to avoid mixup? Isn't it enough to quote RFC3987 and, if necessary, list the possible restrictions (I have not checked which of the characters listed in the 4th bullet point are excluded from the path segment of an IRI anyway) when it comes to file name? (Yes, of course, we have to refer to the last portion of an IRI path as the file name.)
full-path="EPUB/Great Expectation.opf" is not a valid value for @full-path, although it is indeed a valid file name per Section 3.4 :-(
As I said, it is a can of worms, and one of you guys may prove me wrong in my interpretation... But if I am right, my proposal would be: