Bring back popular EPUB 2.x/3.x metadata? #187

HadrienGardeur · 2018-05-07T15:51:42Z

This is based on my gap analysis between EPUB 3.2 and WP: #176 (comment)

Many dc:* metadata elements from the EPUB specification have no equivalent in WP: dc:coverage, dc:description, dc:format, dc:publisher, dc:relation, dc:rights, dc:source, dc:subject and dc:type.

Back in the days of EPUB 3.0.1, we also received feedback from the wider community about the popularity of some of these elements. @kovidgoyal, creator of Calibre, provided the following comment:

As for the proposed metadata changes. I'll say the following metadata fields are most often used by calibre users:

title
multiple authors
rating
series
series_index
tags
publisher
identifiers (isbn, asin, etc -- one identifier is pointless)
comments (dc:description)

HadrienGardeur · 2018-05-07T15:58:08Z

IMO this is an item primarily for EPUB4 but we need to address extensibility in WP to make sure that it is easily possible.

In the list of elements listed above, the following ones were never well-defined enough to be properly used by the community and IMO should be dropped:

dc:coverage
dc:format
dc:relation

dc:rights and dc:source were used sometimes, but it was always very difficult to know the proper use for them. If we reintroduce them, it should probably be in a different way (for example by linking to a CC license using a link for rights).

In @kovidgoyal's comment he mentions the requirement for multiple identifiers, which we haven't addressed in the current WP draft (although this could be handled using links as well if we support #186).

He also mentions dc:publisher, dc:description and dc:subject as popular items. We don't have an equivalent for them in our infoset.

Finally, dc:type was meant to express the nature/type of a publication. We don't have an equivalent for this in our current infoset either. We might inherit this from JSON-LD though if we decide to use that serialization, but it's worth discussing this at an infoset level as well.

HadrienGardeur · 2018-05-07T16:33:13Z

On a separate note, we've had people asking for series over and over again, yet this has never been properly handled in EPUB-land (I'm not counting our poor attempt at dealing with series using vague terms and refines in EPUB 3.0.x).

For comics/manga (cc @murata2makoto for example), this can be as important as the title itself.

I'll also let @kovidgoyal chime in, I'm sure he has something to say about this.

kovidgoyal · 2018-05-07T16:42:51Z

Yes, indeed, I do. series is the single most enquired about metadata field, behind only title and authors. calibre uses a custom metadata element for it in EPUB 2 and uses the refines mechanism for it in EPUB 3. Entire genres of fiction publish works that are almost always part of a larger series or universe. It's insanely useful for end users -- they use it to find the next book in the series. It's insanely useful for publishers -- they use it to sell more books in a series. It's completely baffling to me how such a useful and basic piece of metadata has been overlooked for so long.

Coming to technical issues, calibre supports an infinite number of series fields (one builtin one and infinite user defined ones). Series numbers are stored as floating point numbers with a precision of two digits, which I have found is enough for ~ all use cases involving series numbers. It is important to also allow the values of zero and negative numbers (prequels and the like).
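
The storage rule described in this comment (two digits of precision, with zero and negative values allowed) can be sketched roughly as follows. This is only an illustration and not calibre's actual code: the names `normalize_series_index` and `sort_series` are hypothetical, and rounding half up is an assumption.

```python
from decimal import Decimal, ROUND_HALF_UP


def normalize_series_index(raw: str) -> float:
    """Parse a series number and round it to two decimal places.

    Zero and negative values are accepted on purpose (prequels and the like).
    Rounding half up is an assumption made for this sketch.
    """
    value = Decimal(raw.strip())
    if not value.is_finite():
        raise ValueError(f"series index must be a finite number: {raw!r}")
    return float(value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))


def sort_series(entries: list[tuple[str, str]]) -> list[str]:
    """Order (title, raw_index) pairs by their normalized series index."""
    ordered = sorted(entries, key=lambda entry: normalize_series_index(entry[1]))
    return [title for title, _ in ordered]
```

With this ordering, a prequel numbered -1 sorts before an entry numbered 0, which in turn sorts before book 1, and a fractional index such as 1.5 slots between books 1 and 2.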

llemeurfr · 2018-05-07T17:12:05Z

This is where IMO the WP infoset should be a bit richer; i.e. define 15 dc metadata, with a precise meaning (more precise than what dc defines, tailored for digital publications), a precise format for dates and a string format for all others. This would allow user agents to display such core metadata.

We could also create an extension point in the json manifest (and in the corresponding schema) to allow authors' injection of custom metadata, but as a reading system developer I don't have a strong view on this.

iherman · 2018-05-08T11:14:06Z

@llemeurfr I am a little bit concerned about

define 15 dc metadata, with a precise meaning (more precise than what dc defines, tailored for digital publications)

if you mean that we should keep on using, say dc:format but with a more precise meaning than what DCMI specifies. We should, instead (but this is maybe what you referred to), define the equivalents of those terms in our own namespace with a more precise definition (though, in RDF sense, we can define those as being sub-properties of their DCMI equivalents). I wonder whether this is not worth considering together with schema.org as well, as a possible extension thereof.

mattgarrish · 2018-05-09T00:17:20Z

I wonder whether this is not worth considering together with schema.org as well, as a possible extension thereof.

If there's a need for more precise metadata, we should definitely work with the appropriate metadata groups to provide it. Dublin Core's lack of precision has been at the heart of EPUB's various failed forays into attributes and refinements. Let's not go on another of those adventures here.

iherman · 2018-05-09T10:26:44Z

So, what would be the action item here? We could

invite Tom Baker to a (possibly separate) call on the way this could be worked out with DCMI
similar action with DanBri and schema.org

Although, for the latter, we sort of know: some of the people in this group have already gone through the exercise of adding terms to schema.org, and the only thing we should find out from DanBri is whether schema.org would be interested, in principle, in a set of more precise, publication oriented terms.

Note that DanBri will be in Berlin next week, meaning that Garth, Tzviya, Rachel and I can talk to him (and some other people may be there as well, I have not checked). I.e., it would be good to agree on an action this week...

HadrienGardeur · 2018-05-09T10:38:00Z

This issue deals primarily with the infoset @iherman.

Without a serialization, it feels too early to invite anyone to talk about DC or schema.org. If we adopt the WAM, we won't have easy access to either vocabulary for instance.

iherman · 2018-05-09T10:53:58Z

@HadrienGardeur, respectfully disagree. There are a number of metadata items, listed in this issue, that we think should be reproduced somehow; for the time being as part of the infoset items. My understanding is that EPUB used the dc terms, but with a semantics that is more restrictive than DCMI's (see, eg, #187 (comment)). That is not a proper way of doing things: we should not have our own terms for terms that are used elsewhere.

We do not have a serialization yet, true. We are discussing whether those terms would be in a JSON manifest or an HTML <meta>/<link> element; whichever way we go, the terms themselves should be properly defined in a standard vocabulary. (Even if we adopt the WAM, we would have to add our extensions to the WAM, and these terms are part of them.)

Bottom line: having a clear understanding of our options is important; the earlier we do it, the better it is.

HadrienGardeur · 2018-05-09T11:31:40Z

@iherman I'm OK with discussing options as long as we clearly keep in mind that they won't necessarily be available to us.

That said, I still think that this is a separate issue from what I've raised here. This issue deals primarily with:

the gap between the EPUB and WP infosets for metadata
popularity and usage of certain metadata in the EPUB world

There's a significant gap between what our group (and previously the IDPF) considers to be useful, and what power users and developers are actually using.
I think that instead of sticking our heads in the sand, we need to have an open discussion with the larger community and broaden our horizon.

iherman · 2018-05-09T11:35:57Z

@HadrienGardeur yes, there may be these two issues. And, I believe (speaking as a former Semantic Web person) we cannot expect to cover all the metadata ourselves. We should have a core, and clear mechanism to allow using other vocabularies if needed. I was thinking about that 'core' only.

llemeurfr · 2018-05-09T11:37:45Z

@iherman I agree that refining existing definitions without creating our own vocabulary would be dangerous.

I like DC metadata because they cover the basics of what is needed for describing any publication, are universally known, make an ISO standard, are quite a small set (DCMES) and can be serialized in multiple ways. And this is exactly what EPUB/WP is missing currently.

But first, why should we add a restricted set of metadata to the WP infoset? The answer is interoperability of searches in a personal catalog of publications. We are adding metadata for end users of Calibre or any other reading systems (sorry, user agents) able to maintain a catalog of 100+ publications. ONIX is a B2B vocabulary, we want a B2C vocabulary here. A vocabulary that is easy to implement in every user agent, which will be displayed in a "publication screen" and allow simple full text searches in the catalog.

If we don't define a correct set of metadata in the spec and rely on publishers and end users adding ANY metadata from e.g. schema.org, interop will be dead.

We all agree on identifier (multiple), modified (a qualified date), title, some specializations of creator and contributor (multiple), language (multiple, with a main one).

What users still need is a set of descriptive metadata, including publisher, description, subject (as textual keywords) and even coverage (example could be "Russia, 19th century" for "War and Peace").

Then there is the question of type, format, source, relation and rights. Their use by end users is not obvious at all. But if we suppress them from our list, we create a subset of DC that will be questioned. If we define them as textual metadata, only useful for display, we use the standard as-is, which IMO is better.

And sure, there is a need for additional metadata relative to series (comics, bd, manga etc.), which are all strings also.

This will make a good 'core'.
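
To make the "publication screen plus simple full text search" use case concrete, here is a minimal sketch of such a core held as plain strings. The `CoreMetadata` class, its field names, and the `search` helper are hypothetical and chosen only for illustration; nothing here is defined by the WP infoset.

```python
from dataclasses import dataclass, field


@dataclass
class CoreMetadata:
    """Hypothetical string-only core record for one publication."""
    title: str
    identifiers: list[str] = field(default_factory=list)
    creators: list[str] = field(default_factory=list)
    languages: list[str] = field(default_factory=list)  # assumption: first entry is the main language
    publisher: str = ""
    description: str = ""
    subjects: list[str] = field(default_factory=list)
    coverage: str = ""
    series: str = ""

    def searchable_text(self) -> str:
        """Concatenate every displayable value into one case-folded string."""
        parts = [self.title, self.publisher, self.description, self.coverage, self.series]
        parts += self.creators + self.subjects + self.identifiers
        return " ".join(part for part in parts if part).casefold()


def search(catalog: list[CoreMetadata], query: str) -> list[CoreMetadata]:
    """Return the publications whose text contains every word of the query."""
    terms = query.casefold().split()
    return [book for book in catalog if all(term in book.searchable_text() for term in terms)]
```

Because every field is a string or a list of strings, a user agent needs no vocabulary-specific logic to display or search the record; for instance, a query for "russia" would match a record whose coverage is "Russia, 19th century".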

TzviyaSiegman · 2018-05-09T16:20:48Z

If our goal is WEB Publications, we should be using the metadata of the Web, which is primarily schema.org. I imagine we would point to an ONIX record, but there is no reason to embed the ONIX record. Since there is schema.org and DCMI alignment, we should embrace it.

BigBlueHat · 2018-05-10T20:53:43Z

There is also the Open Graph Protocol which (afaict) is frequently used either alongside schema.org terms or often in place of them. The OGP use case is "pretty" embedding across social networking platforms. It was created by Facebook, is used by Twitter, Buffer, etc, and can also inform "rich site snippets" for Google listing.

Ultimately there is not one singular vocabulary, so we need to be clear about where we expect to find it (OGP picked RDFa Lite; schema.org is frequently processed out of RDFa, Microdata, and/or JSON-LD, etc).

So perhaps the "core choice" is less about what we express, but how we express it.

Linked Data (regardless of encoding) seems to be the most widely used across our shared industry space as well as among search engines and other index providers (Library of Congress, Wikipedia, etc).

avneeshsingh · 2018-05-25T10:13:34Z

In addition, metadata was added to EPUB 3 for accessibility conformance reporting. We also need to find a path for bringing it into WP/PWP:
https://idpf.github.io/epub-vocabs/package/a11y/

HadrienGardeur · 2018-06-04T16:06:57Z

With the alignment to schema.org, most of the important DC terms should have an equivalent available:

dc:publisher can be mapped to http://schema.org/publisher
dc:description can be mapped to http://schema.org/description
dc:subject could be mapped to http://schema.org/keywords

Series are also handled in schema.org through http://schema.org/Series
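
As a rough sketch, the three mappings suggested in this comment can be written as a lookup table. The `map_dc_terms` helper and the shape of its input are hypothetical; terms without a suggested equivalent are reported rather than guessed.

```python
# The three DC-to-schema.org mappings suggested in this comment.
DC_TO_SCHEMA = {
    "dc:publisher": "http://schema.org/publisher",
    "dc:description": "http://schema.org/description",
    "dc:subject": "http://schema.org/keywords",
}


def map_dc_terms(dc_metadata: dict[str, list[str]]) -> tuple[dict[str, list[str]], list[str]]:
    """Translate DC terms to schema.org properties.

    Returns the mapped values plus the list of DC terms that have no
    suggested equivalent, so that nothing is dropped silently.
    """
    mapped: dict[str, list[str]] = {}
    unmapped: list[str] = []
    for term, values in dc_metadata.items():
        target = DC_TO_SCHEMA.get(term)
        if target is None:
            unmapped.append(term)
        else:
            mapped.setdefault(target, []).extend(values)
    return mapped, unmapped
```

Passing a record that also carries dc:coverage, for example, returns that term in the unmapped list, which mirrors the gap this issue is about.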

llemeurfr · 2018-06-04T16:43:19Z

IMO dc:subject would be better translated using http://schema.org/about, def: The subject matter of the content. This is rdf oriented indeed but understandable.

dc:subject in EPUB 3 was specified to use individual terms (optionally from controlled vocabularies) and keywords (plural) is by schema.org def: Keywords or tags used to describe this content. Multiple entries in a keywords list are typically delimited by commas.

Note: It was indeed the choice of the IPTC when subject was mapped to schema.org (http://schema.org/NewsArticle).

By the way, how can we map
<dc:subject opf:authority="http://www.ams.org/msc/msc2010.html" opf:term="11"> Number Theory </dc:subject>
using schema.org?

HadrienGardeur · 2018-06-04T16:46:22Z

dc:subject has been mostly used for simple strings in various revisions of EPUB, which is a good match for http://schema.org/keywords.

It's only in EPUB 3.1 that opf:authority and opf:term were added, and I'm not sure if they'll make the cut in 3.2 (@dauwhe or @GarthConboy probably know that better than I do).

llemeurfr · 2018-06-04T16:50:52Z

This is mostly not a matter of plain value vs controlled vocabulary but a matter of semantics and cardinality.

subject: individual expression of the subject matter of the work. The property can be repeated.
keywords: comma separated tags that may not refer to the subject matter of the work. The property can't be repeated
about: see subject.
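
The cardinality difference can be illustrated with a small sketch. The function names are hypothetical, and the comma-delimited treatment of keywords follows the schema.org definition quoted earlier in this thread; the sample subject values are made up for the example.

```python
def subjects_to_about(subjects: list[str]) -> list[str]:
    """Keep one value per subject: a repeatable property preserves each entry."""
    return [s.strip() for s in subjects if s.strip()]


def subjects_to_keywords(subjects: list[str]) -> str:
    """Collapse all subjects into a single comma-separated keywords string."""
    return ", ".join(subjects_to_about(subjects))


def keywords_to_tags(keywords: str) -> list[str]:
    """Split a keywords string back into tags on commas."""
    return [tag.strip() for tag in keywords.split(",") if tag.strip()]


subjects = ["Number Theory", "Russia, 19th century"]
round_trip = keywords_to_tags(subjects_to_keywords(subjects))
# The subject that itself contains a comma comes back as two separate tags.
```

Going through a single comma-separated string loses the boundaries of any subject that contains a comma, whereas a repeatable property keeps each expression intact.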

HadrienGardeur · 2018-06-04T16:54:39Z

If we stick to the EPUB 3.1+ use of dc:subject, I agree with you that http://schema.org/about is a better fit (you'll notice that I used "could" in my list of initial suggestions).

If we stick to EPUB 3.0.1 and before, it's really somewhere in between the two.

Right now it looks like the 3.2 draft is following the direction of 3.1 but using refines instead of attributes.

dauwhe · 2018-06-04T17:02:05Z

opf:authority and opf:term are in the current 3.2 draft, as no one has complained that they exist :)

iherman · 2018-06-04T17:10:28Z

A meta question/issue: at the moment, e.g., subject is not part of the minimal infoset in the spec. If we keep it this way, then we would not have an 'authority' to decide on whether dc:subject or schema:keywords is used.

That being said, it is good to look at these, because it may validate or invalidate our decision to go the schema.org way. I also wonder whether we would have some sort of a best-practice like document detailing how the various schema.org terms would be used. But that is not in the spec, I presume.

llemeurfr · 2018-06-04T21:01:49Z

I think we should get rid of controlled vocabularies in descriptive metadata both in WP and in 3.2. These metadata are for user filter/search purposes and localized text is much simpler/efficient in this use case. Much different from the B2B ONIX use case.

I’m also in favor of adding « about » to the infoset because of this user use case. Whatever the consensus goes, guidelines will be a must.

laudrain · 2018-06-05T07:06:10Z

There is a "best practice" effort in the CG for 3.2. Should we start one for WP, and when?

iherman · 2019-03-18T17:33:09Z

This issue was discussed in a meeting.

No actions or resolutions

2.4. Bring back popular EPUB 2.x/3.x metadata?
Wendy Reid: #187
Ivan Herman: +1
Wendy Reid: #187: Bring back popular EPUB 2.x/3.x metadata?
Luc Audrain: +1
Ivan Herman: it’s all doable with the extension mechanism, not necessary for the core spec
Wendy Reid: close #187

HadrienGardeur added the topic:manifest label May 7, 2018

HadrienGardeur added the topic:metadata label May 7, 2018

HadrienGardeur mentioned this issue Jun 11, 2018

Scope of our infoset #176

Closed

wareid added the propose closing label Feb 7, 2019

wareid closed this as completed Mar 18, 2019
