Bring back popular EPUB 2.x/3.x metadata? #187
Comments
IMO this is an item primarily for EPUB4 but we need to address extensibility in WP to make sure that it is easily possible. In the list of elements above, the following ones were never well defined enough to be properly used by the community and IMO should be dropped:
In his comment, @kovidgoyal mentions the requirement for multiple identifiers, which we haven't addressed in the current WP draft (although this could be handled using links as well if we support #186). He also mentions Finally, |
On a separate note, we've had people asking for series over and over again, yet this has never been properly handled in EPUB-land (I'm not counting our poor attempt at dealing with series using vague terms and refines in EPUB 3.0.x). For comics/manga (cc @murata2makoto for example), this can be as important as the title itself. I'll also let @kovidgoyal chime in, I'm sure he has something to say about this. |
Yes, indeed, I do. Series is the most enquired-about metadata field, behind only title and authors. calibre uses a custom metadata element for it in EPUB 2 and uses the refines mechanism for it in EPUB 3. Entire genres of fiction publish works that are almost always part of a larger series or universe. It's insanely useful for end users -- they use it to find the next book in the series. It's insanely useful for publishers -- they use it to sell more books in a series. It's completely baffling to me how such a useful and basic piece of metadata has been overlooked for so long. Coming to technical issues, calibre supports an unlimited number of series fields (one built-in and any number of user-defined ones). Series numbers are stored as floating point numbers with a precision of two digits, which I have found is enough for almost all use cases involving series numbers. It is also important to allow zero and negative values (prequels and the like).
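To make the series-number point concrete, here is a minimal sketch in Python. It is not calibre's actual code: `SeriesEntry`, `reading_order`, and the sample titles are hypothetical names made up for illustration. It only assumes what the comment above states: a series index is a number kept to two decimal places, and zero, negative, and fractional values are all valid.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeriesEntry:
    """Hypothetical record: one book's membership in a named series."""
    title: str
    series: str
    index: float  # position in the series; zero, negative, fractional all allowed

    def __post_init__(self):
        # Keep two decimal places, as described in the comment above.
        object.__setattr__(self, "index", round(float(self.index), 2))


def reading_order(entries, series):
    """Return the titles belonging to `series`, sorted by series index."""
    members = [e for e in entries if e.series == series]
    return [e.title for e in sorted(members, key=lambda e: e.index)]


# Made-up sample data: a prequel (-1), a "book zero", and an interlude (1.5).
books = [
    SeriesEntry("Book Two", "Example Saga", 2),
    SeriesEntry("Prequel", "Example Saga", -1),
    SeriesEntry("Interlude", "Example Saga", 1.5),
    SeriesEntry("Book One", "Example Saga", 1),
    SeriesEntry("Origins", "Example Saga", 0),
    SeriesEntry("Unrelated", "Other Series", 1),
]

order = reading_order(books, "Example Saga")
print(order)
```

A plain numeric sort is enough to place prequels and interludes correctly, which is the main argument for not restricting the value to positive integers.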
This is where IMO the WP infoset should be a bit richer; i.e. define 15 dc metadata, with a precise meaning (more precise than what dc defines, tailored for digital publications), a precise format for dates and a string format for all others. This would allow user agents to display such core metadata. We could also create an extension point in the JSON manifest (and in the corresponding schema) to allow authors to inject custom metadata, but as a reading system developer I don't have a strong view on this.
@llemeurfr I am a little bit concerned about
if you mean that we should keep on using, say |
If there's a need for more precise metadata, we should definitely work with the appropriate metadata groups to provide it. Dublin Core's lack of precision has been at the heart of EPUB's various failed forays into attributes and refinements. Let's not go on another of those adventures here. |
So, what would be the action item here? We could
Although, for the latter, we sort of know: some of the people in this group have already gone through the exercise of adding terms to schema.org, and the only thing we should find out from DanBri is whether schema.org would be interested, in principle, in a set of more precise, publication-oriented terms. Note that DanBri will be in Berlin next week, meaning that Garth, Tzviya, Rachel and I can talk to him (and some other people may be there as well, I have not checked). I.e., it would be good to agree on an action this week...
This issue deals primarily with the infoset @iherman. Without a serialization, it feels too early to invite anyone to talk about DC or schema.org. If we adopt the WAM, we won't have easy access to either vocabulary for instance. |
@HadrienGardeur, respectfully disagree. There are a number of metadata items, listed in this issue, that we think should be reproduced somehow; for the time being as part of the infoset items. My understanding is that EPUB used the dc terms, but with semantics that are more restrictive than DCMI's (see, eg, #187 (comment)). That is not a proper way of doing things: we should not have our own definitions for terms that are used elsewhere. We do not have a serialization yet, true. We are discussing whether those terms would be in a JSON manifest or an HTML Bottom line: having a clear understanding of our options is important, and the earlier we do it, the better.
@iherman I'm OK with discussing options as long as we clearly keep in mind that they won't necessarily be available to us. That said, I still think that this is a separate issue from what I've raised here. This issue deals primarily with:
There's a significant gap between what our group (and previously the IDPF) considers to be useful, and what power users and developers are actually using. |
@HadrienGardeur yes, there may be these two issues. And, I believe (speaking as a former Semantic Web person) we cannot expect to cover all the metadata ourselves. We should have a core, and clear mechanism to allow using other vocabularies if needed. I was thinking about that 'core' only. |
@iherman I agree that refining existing definitions without creating our own vocabulary would be dangerous. I like DC metadata because they cover the basics of what is needed for describing any publication, are universally known, are an ISO standard, are quite a small set (DCMES) and can be serialized in multiple ways. And this is exactly what EPUB/WP is missing currently. But first, why should we add a restricted set of metadata to the WP infoset? The answer is interoperability of searches in a personal catalog of publications. We are adding metadata for end users of Calibre or any other reading system (sorry, user agent) able to maintain a catalog of 100+ publications. ONIX is a B2B vocabulary; we want a B2C vocabulary here: a vocabulary that is easy to implement in every user agent, which will be displayed in a "publication screen" and allow simple full-text searches in the catalog. If we don't define a correct set of metadata in the spec and rely on publishers and end users adding ANY metadata from e.g. schema.org, interop will be dead. We all agree on What users still need is a set of descriptive metadata, including Then there is the question of And sure, there is a need for additional metadata relative to series (comics, bd, manga etc.), which are all strings also. This will make a good 'core'.
If our goal is WEB Publications, we should be using the metadata of the Web, which is primarily schema.org. I imagine we would point to an ONIX record, but there is no reason to embed the ONIX record. Since there is schema.org and DCMI alignment, we should embrace it.
There is also the Open Graph Protocol which (afaict) is frequently used either alongside schema.org terms or often in place of them. The OGP use case is "pretty" embedding across social networking platforms. It was created by Facebook, is used by Twitter, Buffer, etc, and can also inform "rich site snippets" for Google listing. Ultimately there is not one singular vocabulary, so we should be clear about where we expect to find it (OGP picked RDFa Lite; schema.org is frequently processed out of RDFa, Microdata, and/or JSON-LD, etc). So perhaps the "core choice" is less about what we express than about how we express it. Linked Data (regardless of encoding) seems to be the most widely used across our shared industry space as well as among search engines and other index providers (Library of Congress, Wikipedia, etc).
In addition, metadata was added to EPUB 3 for accessibility conformance reporting. We also need to find a path for bringing it into WP/PWP.
With the alignment to schema.org, most of the important DC terms should have an equivalent available:
Series are also handled in schema.org through http://schema.org/Series |
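For illustration only, here is a rough Python sketch of what a series relationship might look like when expressed with schema.org terms and serialized as JSON-LD. `Book`, `BookSeries`, `isPartOf`, and `position` are existing schema.org terms; I use `BookSeries` here rather than the `Series` type linked above. The helper name `book_in_series` and the sample values are hypothetical, and nothing here implies how WP would actually serialize metadata, since that is still undecided.

```python
import json


def book_in_series(title, series_name, position):
    """Build a JSON-LD-style dict describing a book and the series it belongs to.

    Hypothetical helper; the shape is one plausible use of schema.org terms,
    not something defined by the WP draft.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Book",
        "name": title,
        "isPartOf": {"@type": "BookSeries", "name": series_name},
        "position": position,  # place of this book within the series
    }


record = book_in_series("Book One", "Example Saga", 1)
print(json.dumps(record, indent=2))
```

Whether `position` should accept fractional or negative values, as requested earlier in this thread, is something a best-practice document would have to spell out.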
IMO dc:subject in EPUB 3 was specified to use individual terms (optionally from controlled vocabularies) and Note: It was indeed the choice of the IPTC when By the way, how can we map |
It's only in EPUB 3.1 that |
This is mostly not a matter of plain value vs controlled vocabulary but a matter of semantics and cardinality.
If we stick to the EPUB 3.1+ use of If we stick to EPUB 3.0.1 and before, it's really somewhere in between the two. Right now it looks like the 3.2 draft is following the direction of 3.1 but using refines instead of attributes.
A meta question/issue: at the moment, e.g., subject is not part of the minimal infoset in the spec. If we keep it this way, then we would not have an 'authority' to decide on whether That being said, it is good to look at these, because it may validate or invalidate our decision to go the schema.org way. I also wonder whether we would have some sort of a best-practice like document detailing how the various schema.org terms would be used. But that is not in the spec, I presume. |
I think we should get rid of controlled vocabularies in descriptive metadata both in WP and in 3.2. These metadata are for user filter/search purposes, and localized text is much simpler and more efficient for this use case, which is very different from the B2B ONIX use case. I’m also in favor of adding « about » to the infoset because of this user use case. Whichever way the consensus goes, guidelines will be a must.
There is a "best practice" effort in the CG for 3.2. Should we start one for WP, and when? |
This issue was discussed in a meeting.
This is based on my gap analysis between EPUB 3.2 and WP: #176 (comment)
Many dc:* metadata from the EPUB specification have no equivalent in WP: dc:coverage, dc:description, dc:format, dc:publisher, dc:relation, dc:rights, dc:source, dc:subject or dc:type.
Back in the days of EPUB 3.0.1, we also received feedback from the wider community about the popularity of some of these elements. @kovidgoyal, creator of Calibre, provided the following comment: