Root of locators for identifying/retrieving content within a Web Publication #44

tcole3 · 2017-08-21T22:45:51Z

The DPub IG identified use cases for being able to on-the-fly mint a locator that points to arbitrary segments of content (smaller granularity than a Primary / Secondary Resource) within a Web Publication, e.g., [1] [2] [3]. There is competing prior art.

First question (this issue): Should we take the EPUB CFI approach and mint locators that require parsing the Web Publication Manifest in order to resolve? Or should we take the Web Annotation approach of minting locators based on the URLs of the Primary / Secondary Resource containing the target? Or is there another approach we should consider?

CFI Approach: EPUB 3.1 uses Canonical Fragment Identifiers (CFI) as a way to help support use cases requiring a way to address arbitrary content within a publication. The EPUB Canonical Fragment Identifiers 1.1 [4] specifies that, "The process of resolving an EPUB CFI to a location within an EPUB Publication begins with the root package element of the Package Document." When used as a fragment identifier, CFIs are appended to the EPUB URL (e.g., book.epub#epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/3:10)), which is not necessarily the URL of the containing Primary or Secondary Resource. There are trade-offs to tying all such locators to the publication (rather than constituent resources) and relying on having to parse the manifest in order to resolve the CFI locator.
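To make the structure of the example CFI above concrete, here is a minimal Python sketch that splits it into the steps resolved against the Package Document and the steps resolved against the referenced content document. The name `parse_cfi` is hypothetical, and the sketch assumes a simplified grammar (no ranges, no escaped characters, only ID assertions in brackets); it is not a conforming CFI parser.

```python
import re

# One CFI step: "/" + integer, optionally followed by an "[id]" assertion.
_STEP = re.compile(r"/(\d+)(?:\[([^\]]*)\])?")


def parse_cfi(fragment):
    """Split a simple epubcfi(...) fragment into per-document step lists.

    Simplified sketch: no ranges, no escaping, no text-location assertions.
    """
    match = re.fullmatch(r"epubcfi\((.*)\)", fragment)
    if match is None:
        raise ValueError("not an epubcfi fragment")
    path = match.group(1)

    # A trailing ":N" is a character offset into the final text node.
    offset = None
    if ":" in path:
        path, offset_text = path.rsplit(":", 1)
        offset = int(offset_text)

    # "!" marks indirection: the steps after it are resolved in the document
    # referenced by the element reached so far.
    documents = [
        [(int(index), node_id or None) for index, node_id in _STEP.findall(part)]
        for part in path.split("!")
    ]
    return documents, offset


documents, offset = parse_cfi("epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/3:10)")
print(documents[0])  # steps resolved starting from the package element
print(documents[1])  # steps resolved inside the referenced content document
print(offset)
```

The first step list can only be followed after the Package Document has been fetched and parsed, which is the dependency this issue asks about.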

Web Anno Approach: An alternative approach [5] [6] is to model the location of arbitrary content within a Web Publication by relying on the URL of the Primary or Secondary Resource containing the content. This of course assumes that every Primary / Secondary Resource listed in a Web Publication manifest has a URL (a requirement that is not yet expressed in the most recent draft of our spec, but seems to be the direction we're heading - e.g., see issue #5 from last month). This has the immediate advantage (?) that the locator can be used independently of, and potentially persist beyond the life of, the Web Publication, assuming of course that the Primary / Secondary Resource involved persists. Note that the Web Annotation data model provides an optional property (scope) that we might be able to use to also include the URL of the Web Publication involved, in cases where this might be useful.

Neither the CFI approach nor the Web Annotation approach alone covers all use cases, but both models are extensible and there is actually significant overlap and some similarities between the two approaches. We probably could extend either approach to meet our needs. I'm working on a table to compare the approaches by generic use case supported.

So, in terms of minting locators for content within Web Publications, which approach should we use?

Is there already consensus in the Working Group that resolving locators for Web Publication content based on parsing the Manifest is the wrong strategy for Web Publications? (Full disclosure, I'm partial to the Web Anno approach, in part because I suspect it requires less work to adapt for our use.)

[1] - https://www.w3.org/TR/pwp-ucr/#identify_const_resources
[2] - https://www.w3.org/TR/pwp-ucr/#random-access
[3] - https://www.w3.org/TR/dpub-annotation-uc/
[4] - http://www.idpf.org/epub/linking/cfi/epub-cfi.html
[5] - https://www.w3.org/TR/selectors-states/
[6] - https://www.w3.org/TR/annotation-model/

laudrain · 2017-08-22T07:06:09Z

Setting aside detailed technical points, I am in favor of the Web Anno approach.

CFI is very EPUB-ish, and I am not aware of any widespread implementation, even in the EPUB ecosystem.

Web Annotation is very Web-ish... There are use cases where the collaborative Web would bring useful annotations to WP content, such as image descriptions for a11y. Web Annotation is more likely to be used here than CFI.

tcole3 · 2017-08-28T16:27:43Z

For another option that we could consider, see #18 (which is being closed given that we can discuss here).

murata2makoto · 2017-09-10T11:01:09Z

I see a lot of value in the Web Anno approach, but is it possible to create a single URI from a primary/secondary resource URI and a locator that is based on the annotation data model? I am not saying that the Web Anno approach is useless. I am just saying that we also need a mechanism that allows us to create a single URI.

I am against the EPUBCFI approach, because it does not play well with the OWP. For more about this, see EPUB features hampering unification with OWP.

iherman · 2017-09-11T05:28:56Z

Hi @murata0204,

> is it possible to create a single URI from a primary/secondary resource URI and a locator, which is based on the annotation data model?

The answer is yes and no.

Yes: the Web Annotation WG has defined a fragment identifier format for the selectors, as part of the separate Selector Note. The fragment identifier is fairly simple; it is a functional "translation" of the selector structure. The identifiers can become fairly long, but that is due to the complexity of the corresponding selector. There is even a simple converter tool to show what a selector would look like in the form of a fragment (and vice versa). The underlying JavaScript is a simple parser that could be incorporated into an implementation. That fragment identifier, combined with the URL of a specific resource, provides a proper URL for the selection, much like EPUBCFI does.

No: the obstacle is not technical. At this moment, that fragment is defined in a WG Note, i.e., it has no real formal weight. A fragment ID must be registered through IANA, and is defined for a specific media type. This means that, formally, the fragment identifier syntax and semantics defined in this section should be registered for each media type separately. In our case this means the fragment identifier should be registered for HTML and SVG, and this constitutes, in fact, an "extension" to the text/html (resp. application/svg+xml) media types (at least in my understanding). I.e., this should be done together with the Web App WG (resp. SVG WG), and through an IETF document (see a separate example for such a document). All this may become a tedious procedure. We may have to go down that route; I simply do not know at this point.

/CC @azaroth42 @tcole3 @BigBlueHat

murata2makoto · 2017-09-11T07:25:29Z

@iherman Nice to hear that it is technically possible. The current situation sounds similar to W3C media fragments, but a WG Note has no formal weight.

murata2makoto · 2017-09-11T07:30:11Z

@iherman Is it possible for a locator in the Web Annotation data model to take advantage of fragment identifiers of resources? For example, can a locator use SVG view fragments?

iherman · 2017-09-11T07:37:58Z

@murata0204

> is it possible for a locator in the Web Annotation data model to take advantage of fragment identifiers of resources? For example, can a locator use SVG view fragments?

Yes. First of all, the selector model includes a special selector, called the fragment selector, that allows all other forms of fragment identifiers to be 'incorporated' into the selector model. (B.t.w., epubcfi is explicitly mentioned among those :-)

Furthermore, note that the URI version of a selector is of the form http://www.example.org#selector(...), using the functional notation, i.e., the selector syntax is orthogonal to the traditional simple fragment IDs, or would "include", as a simple sub-expression, any more complex fragments.

As for SVG, that is also included in the model explicitly at the SVG Fragments. SVG fragments also follow a functional syntax, ie, the orthogonality also applies.

tcole3 · 2017-09-11T15:52:36Z

@iherman, @azaroth42, @BigBlueHat - The Selector Reference Note mentioned earlier in this thread makes use of selector and state from the Web Annotation data model, but not scope. The Web Annotation scope property "...captures the context in which it [the annotation] was made". I've forgotten, why was scope or some more generalized property analogous to scope not included in Selector Reference Note?

One potential advantage of the Web Annotation approach is that a segment of a Resource is located relative to the Resource's URL alone, i.e., without needing to first retrieve the Web Publication infoset (whereas the EPUB CFI approach would presumably always require starting with the WP analog of the EPUB Package Document). On the other hand, this decouples a segment of a resource that is a component of a Web Publication from any direct link to the Web Publication itself.

It is not clear that a direct link to the Web Publication infoset from the locator of the segment of a Resource included as part of that Web Publication is essential, but if such a link is desired, might scope or a similar extension property be useful for this purpose?
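As a rough illustration of what using scope this way could look like in the data model itself, here is a minimal Python sketch that builds a Web Annotation SpecificResource whose source is the constituent Resource and whose optional scope is the Web Publication. The helper name `specific_resource` and the example.org URLs are made up for illustration; only the `source`, `selector`, and `scope` keys come from the Web Annotation data model.

```python
import json


def specific_resource(source, selector, scope=None):
    """Build a SpecificResource dict; scope is added only when provided."""
    target = {"type": "SpecificResource", "source": source, "selector": selector}
    if scope is not None:
        # Here scope carries the (hypothetical) Web Publication URL.
        target["scope"] = scope
    return target


target = specific_resource(
    source="https://example.org/book/chapter1.html",  # hypothetical WP Resource
    selector={"type": "TextQuoteSelector", "exact": "some quoted text"},
    scope="https://example.org/book/",  # hypothetical Web Publication URL
)
print(json.dumps(target, indent=2))
```

Because scope is optional, the same target stays resolvable from the Resource URL alone when the publication link is left out.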

azaroth42 · 2017-09-11T16:05:34Z

scope sounds appropriate for that usage to me.

I think it was omitted to focus on the more commonly used parts -- both for accessibility, and that the note would have quickly become identical to the section in the model if we had included everything.

tcole3 · 2017-09-15T15:50:19Z

A few questions illustrated with concrete examples of possible candidate locators identifying segments of text in a WP. As stand-in for a WP, I'll use the instance of Moby Dick that Dave and Benjamin put up for their html-first proposal. Assume https://dauwhe.github.io/html-first/MobyDickNav/html/epigraph.html is a WP Resource as we've defined in our current draft. Assume I want to reference the phrase, "higgledy-piggledy" in the text. Using the model described in section 5 of the Selectors & States Web Anno Working Group Note, either of the following URLs would work (they use different methods, but identify the same text fragment):

https://dauwhe.github.io/html-first/MobyDickNav/html/epigraph.html#selector(type=TextQuoteSelector,exact=higgledy-piggledy,prefix=at%20least%2C%20take%20the%20,suffix=%20whale%20statements%2C%20however)

https://dauwhe.github.io/html-first/MobyDickNav/html/epigraph.html#selector(type=TextPositionSelector,start=387,end=404)

We'll use the latter for the rest of this post since it is simpler and shorter, albeit more brittle.
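For anyone who wants to experiment, the two URLs above can be produced with a few lines of Python. This is only a sketch of the functional fragment syntax as used in this post: the helper names `selector_fragment` and `position_selector` are made up, and the sketch assumes every parameter value is percent-encoded except for unreserved characters.

```python
from urllib.parse import quote

RESOURCE = "https://dauwhe.github.io/html-first/MobyDickNav/html/epigraph.html"


def selector_fragment(selector_type, **params):
    """Serialize selector parameters as selector(type=...,name=value,...)."""
    parts = [f"type={selector_type}"]
    for name, value in params.items():
        # safe="" also encodes "," "(" ")" and "=", which delimit the syntax.
        parts.append(f"{name}={quote(str(value), safe='')}")
    return "selector(" + ",".join(parts) + ")"


def position_selector(text, exact):
    """Derive start/end offsets for the first occurrence of exact in text."""
    start = text.find(exact)
    if start < 0:
        raise ValueError("quoted text not found")
    # end is the offset just past the last selected character.
    return {"start": start, "end": start + len(exact)}


quote_url = RESOURCE + "#" + selector_fragment(
    "TextQuoteSelector",
    exact="higgledy-piggledy",
    prefix="at least, take the ",
    suffix=" whale statements, however",
)
position_url = RESOURCE + "#" + selector_fragment(
    "TextPositionSelector", start=387, end=404
)
print(quote_url)
print(position_url)
```

The position form is shorter because it carries only two integers, and it is brittle for the same reason: any edit to the text before offset 387 shifts the target, whereas the quote form can still be re-anchored by searching for the prefix, exact text, and suffix.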

Question 1: The identifier of the Web Publication is not mentioned. Should it be?

If so, we could consider extending the Selectors & States approach to allow the expression of the scope property as defined in the Web Anno Data Model. We would have to discuss exactly how to do this, but making up a WP canonical identifier of https://example.org/dauwhe/MobyDick.wpub, a possibility might be:

https://dauwhe.github.io/html-first/MobyDickNav/html/epigraph.html#selector(type=TextPositionSelector,start=387,end=404)&scope=https%3A%2F%2Fexample.org%2Fdauwhe%2FMobyDick.wpub

Question 2: As noted in the Selectors & States Note, HTML already has an IANA-registered fragment identifier scheme (syntax and semantics). Additionally, we can expect that some WP Resources might themselves have fragment identifiers in their URLs. If we wanted to create WP locators that highlighted the parent Web Publication canonical identifier, would a better approach be to make explicit use of the WebAnno source property, e.g.:

https://example.org/dauwhe/MobyDick.wpub#selector(type=TextPositionSelector,start=387,end=404)&source=https%3A%2F%2Fdauwhe.github.io%2Fhtml-first%2FMobyDickNav%2Fhtml%2Fepigraph.html

We could then register this class of fragment identifier for Web Publications. A drawback of this approach is that the default behavior of a UA would be to fetch MobyDick.wpub and assume that it could isolate the text of interest within this file, but in reality epigraph.html would have to be retrieved to get the text segment of interest. Note, this is how CFIs (also fragment identifiers in syntax) involving indirection work already.
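The retrieval difference between the two candidate forms can be sketched in Python. Everything here is hypothetical: the `&scope=` and `&source=` extensions are only the made-up syntax from Questions 1 and 2, and `split_locator` / `resources_to_fetch` are illustrative names, not part of any spec.

```python
from urllib.parse import unquote, urldefrag


def split_locator(locator):
    """Return (base_url, selector_text, extras) for a candidate locator."""
    base, fragment = urldefrag(locator)
    selector = None
    extras = {}
    for piece in fragment.split("&"):
        if piece.startswith("selector("):
            selector = piece
        elif "=" in piece:
            name, value = piece.split("=", 1)
            extras[name] = unquote(value)  # scope / source are percent-encoded
    return base, selector, extras


def resources_to_fetch(locator):
    """List the URLs a UA would have to retrieve to reach the target text."""
    base, _selector, extras = split_locator(locator)
    if "source" in extras:
        # Publication-rooted form: the UA fetches the publication first, but
        # the selector applies to the source Resource, so that is fetched too.
        return [base, extras["source"]]
    # Resource-rooted form: the selector applies to the base URL directly.
    return [base]


wp_rooted = (
    "https://example.org/dauwhe/MobyDick.wpub"
    "#selector(type=TextPositionSelector,start=387,end=404)"
    "&source=https%3A%2F%2Fdauwhe.github.io%2Fhtml-first%2FMobyDickNav%2Fhtml%2Fepigraph.html"
)
resource_rooted = (
    "https://dauwhe.github.io/html-first/MobyDickNav/html/epigraph.html"
    "#selector(type=TextPositionSelector,start=387,end=404)"
    "&scope=https%3A%2F%2Fexample.org%2Fdauwhe%2FMobyDick.wpub"
)
print(resources_to_fetch(wp_rooted))
print(resources_to_fetch(resource_rooted))
```

In this sketch the publication-rooted form always costs one extra retrieval, while the resource-rooted form treats the scope value as informational only.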

Question 3: Do we need to have a way to refer to the whole of a WP Resource in the context of the Web Publication of which it is a part, e.g., to refer to the whole of the epigraph, e.g.:

https://example.org/dauwhe/MobyDick.wpub#source=https%3A%2F%2Fdauwhe.github.io%2Fhtml-first%2FMobyDickNav%2Fhtml%2Fepigraph.html

Or

https://dauwhe.github.io/html-first/MobyDickNav/html/epigraph.html#scope=https%3A%2F%2Fexample.org%2Fdauwhe%2FMobyDick.wpub

This seems possibly of interest for WP Resources that are part of multiple Web Publications, although I personally do not have a use case in mind and we are starting to get far removed from how fragment identifiers normally work.

Finally (for this post), 2 more questions:

Question 4: Does it matter which URL is used (i.e., as source) if the Web Publication Resource is available from multiple URLs? Or another way to ask this, do we want a SHOULD (we can't enforce a MUST) that the URL for the Web Publication Resource referenced should be the same as the URL used to reference the Resource in the manifest? This is a non-issue if we go with a CFI approach.

Question 5: Do we need to support a single URL that identifies a chunk of content spanning multiple WP Resources - e.g., spanning from the last paragraph of chapter 2, all the way through chapter 3, and ending with the first paragraph of chapter 4?

iherman · 2017-09-18T09:29:11Z

@tcole3

> Question 1: The identifier of the Web Publication is not mentioned. Should it be?

> If so, we could consider extending the Selectors & States approach to allow the expression of the scope property as defined in the Web Anno Data Model. We would have to discuss exactly how to do this, but making up a WP canonical identifier of https://example.org/dauwhe/MobyDick.wpub, a possibility might be:

> https://dauwhe.github.io/html-first/MobyDickNav/html/epigraph.html#selector(type=TextPositionSelector,start=387,end=404)&scope=https%3A%2F%2Fexample.org%2Fdauwhe%2FMobyDick.wpub

I am not sure what that would bring in practice. To give a tentative answer to my own question :-) I guess the result would be that, say, an Annotation Server would contain annotations for the WP only if they include the WP identifier. But I am not sure what practical use cases this would reflect, i.e., whether it is worth the trouble.

> Question 2: As noted in the Selectors & States Note, HTML already has an IANA-registered fragment identifier scheme (syntax and semantics). Additionally, we can expect that some WP Resources might themselves have fragment identifiers in their URLs. If we wanted to create WP locators that highlighted the parent Web Publication canonical identifier, would a better approach be to make explicit use of the WebAnno source property, e.g.:

> https://example.org/dauwhe/MobyDick.wpub#selector(type=TextPositionSelector,start=387,end=404)&source=https%3A%2F%2Fdauwhe.github.io%2Fhtml-first%2FMobyDickNav%2Fhtml%2Fepigraph.html

That actually would make more sense imho. However... (see below)

> We could then register this class of fragment identifier for Web Publications. A drawback of this approach is that the default behavior of a UA would be to fetch MobyDick.wpub and assume that it could isolate the text of interest within this file, but in reality epigraph.html would have to be retrieved to get the text segment of interest. Note, this is how CFIs (also fragment identifiers in syntax) involving indirection work already.

That would also require that we register Web Publications as a separate media type; otherwise we cannot define fragids. That being said, and turning the argument around: because fragids must be registered in conjunction with a media type, if we have full control of the WP media type, we have complete liberty to do that. But the issue of several HTTP requests to get to the target may become a major obstacle :-(

> Question 3: Do we need to have a way to refer to the whole of a WP Resource in the context of the Web Publication of which it is a part, e.g., to refer to the whole of the epigraph, e.g.:

> https://example.org/dauwhe/MobyDick.wpub#source=https%3A%2F%2Fdauwhe.github.io%2Fhtml-first%2FMobyDickNav%2Fhtml%2Fepigraph.html

> Or

> https://dauwhe.github.io/html-first/MobyDickNav/html/epigraph.html#scope=https%3A%2F%2Fexample.org%2Fdauwhe%2FMobyDick.wpub

> This seems possibly of interest for WP Resources that are part of multiple Web Publications, although I personally do not have a use case in mind and we are starting to get far removed from how fragment identifiers normally work.

I do not know:-(

> Finally (for this post), 2 more questions:

> Question 4: Does it matter which URL is used (i.e., as source) if the Web Publication Resource is available from multiple URLs? Or another way to ask this, do we want a SHOULD (we can't enforce a MUST) that the URL for the Web Publication Resource referenced should be the same as the URL used to reference the Resource in the manifest? This is a non-issue if we go with a CFI approach.

I would be in favour of a should, but this is more of an instinctive reaction rather than use case based...

> Question 5: Do we need to support a single URL that identifies a chunk of content spanning multiple WP Resources - e.g., spanning from the last paragraph of chapter 2, all the way through chapter 3, and ending with the first paragraph of chapter 4?

For some features the Range Selector almost provides that; it contains a start selector and an end selector, and I would think that it should be possible to define a separate source for each selector. (I am not sure we define it this way but, syntax- and vocabulary-wise, it is possible, I guess.) Whether we want to go beyond that: I am not sure the added complexity would be worth it. But it is a matter of use cases...
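To show what a separate source per selector might look like, here is a speculative Python sketch of a Range Selector whose start and end selectors each name their own Resource. The `type`, `startSelector`, and `endSelector` keys are from the Web Annotation data model; the per-endpoint `source` key, the helper name `cross_resource_range`, and the example.org URLs are assumptions made up for this illustration, and the model as published does not define them.

```python
import json


def cross_resource_range(start_source, start_selector, end_source, end_selector):
    """Build a RangeSelector whose endpoints carry their own source URL.

    The per-endpoint "source" key is a hypothetical extension, not part of
    the Web Annotation data model as published.
    """
    return {
        "type": "RangeSelector",
        "startSelector": dict(start_selector, source=start_source),
        "endSelector": dict(end_selector, source=end_source),
    }


selector = cross_resource_range(
    "https://example.org/book/chapter2.html",  # hypothetical chapter 2
    {"type": "XPathSelector", "value": "/html/body/p[last()]"},
    "https://example.org/book/chapter4.html",  # hypothetical chapter 4
    # A Range Selector runs up to, but does not include, the end selection,
    # so the end points at the second paragraph to cover the first one.
    {"type": "XPathSelector", "value": "/html/body/p[2]"},
)
print(json.dumps(selector, indent=2))
```

Note that nothing in this structure says which Resources lie between the two endpoints (chapter 3 here); a consumer would presumably need the publication's reading order for that.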

iherman · 2017-09-29T07:38:33Z

Worth noting something re EPUBCFI

@dauwhe notes, in another comment:

handling top-level non-HTML resources in general,

I only just discovered that the HTML spec does talk about this quite a bit, essentially giving the user agent responsibility to generate the “missing” elements.

What this means is that, in HTML5, and in contrast to XML/XHTML, the structure you see in the textual encoding of the content via angle brackets does not necessarily align with the DOM tree that the browser creates. This makes any fragment identifier that depends on the XML-like structure virtually unusable. And… I believe that would be exactly the case for EPUBCFI :-(

dauwhe · 2017-09-29T14:26:23Z

This is really interesting.

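See this example: http://software.hixie.ch/utilities/js/live-dom-viewer/?%3Chtml%3E%0A%3Chead%3E%0A%3Cstyle%3Etbody%20%7B%20color%3A%20red%20%7D%3C%2Fstyle%3E%0A%3C%2Fhead%3E%0A%3Cbody%3E%0A%3Ctable%3E%0A%3Ctr%3E%3Ctd%3Ehello%3C%2Ftd%3E%3C%2Ftr%3E%0A%3C%2Ftable%3E%0A%3C%2Fbody%3E%0A%3C%2Fhtml%3E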

The markup does not include tbody, but it's in the DOM, and therefore styleable by CSS. It would be fun to test CFI on a similar example!
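A toy Python sketch of why this matters for structure-based locators: the two trees below are written by hand as stand-ins for the table as authored and the table as an HTML parser builds it (no HTML parser is run here), and `path_to` is a made-up helper that returns the element path to the first match.

```python
def path_to(tree, target, prefix=""):
    """Return the slash-separated element path to the first target element."""
    tag, children = tree
    here = f"{prefix}/{tag}"
    if tag == target:
        return here
    for child in children:
        found = path_to(child, target, here)
        if found is not None:
            return found
    return None


# Hand-written stand-ins: (tag, [children]).
as_written = ("table", [("tr", [("td", [])])])
as_parsed = ("table", [("tbody", [("tr", [("td", [])])])])

print(path_to(as_written, "td"))
print(path_to(as_parsed, "td"))
```

A locator minted from the markup as written would have one step fewer than the tree a browser exposes, so it would not resolve to the same cell.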

tcole3 · 2017-09-29T15:11:02Z

I was planning to include this note from 4.2.3 (XPath Selector) of the Web Anno Data Model (https://www.w3.org/TR/annotation-model/#xpath-selector). We can wordsmith as necessary:

Note: Implementers should note that the HTML5 specification allows parsers to add elements into the DOM that are considered to be missing. XPaths should be constructed to include these elements, rather than from the element structure in the document.

BillKasdorf · 2017-09-29T15:11:06Z

Well I’ll be dommed!

llemeurfr · 2017-09-29T16:37:18Z

The table/tbody test prepared by Dave is really surprising, as the presence of tbody is optional in a table, therefore this element is not really *missing*. Laurent


mattgarrish · 2017-09-29T16:57:16Z

@llemeurfr It's no different than adding in the other implied tags.

The HTML tag omission rules for tbody allow the missing tags to be inferred when there are only tr elements present. Combine that with the table definition and you have to have zero or more explicit tbody tags or one or more tr elements which represent an implied tbody.

iherman · 2017-12-05T13:38:44Z

This issue has become moot in this repository: the locator document is now managed in a different repository. Closing the issue in this repo.

tcole3 added the topic:locators label Aug 21, 2017

TzviyaSiegman mentioned this issue Aug 28, 2017

WP-dependent URIs of resources #18

Closed

w3c deleted a comment from css-meeting-bot Sep 18, 2017

iherman closed this as completed Dec 5, 2017

iherman mentioned this issue May 8, 2019

Manifest files need their own MIME Media Type (because canonicalization) #409

Closed
