Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Description-Content-Type field #258

Merged
merged 16 commits into from
May 19, 2017
Merged
Changes from 7 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
86 changes: 85 additions & 1 deletion source/specifications.rst
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ The current core metadata file format, version 1.2, is specified in :pep:`345`.

However, the version specifiers and environment markers sections of that PEP
have been superceded as described below. In addition, metadata files are
permitted to contain the following additional field:
permitted to contain the following additional fields:

Provides-Extra (multiple use)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Expand Down Expand Up @@ -52,6 +52,90 @@ respectively.
It is legal to specify ``Provides-Extra:`` without referencing it in any
``Requires-Dist:``.

Description-Content-Type
~~~~~~~~~~~~~~~~~~~~~~~~

A string containing the format of the distribution's description, so that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"stating the markup syntax (if any) used in" might be clearer than "containing the format of"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it. Changed to that in b371449.

tools can intelligently render the description.

Historically, PyPI supported descriptions in plain text and `reStructuredText
(reST) <http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html>`_,
and could render reST into HTML. However, it is common for distribution
authors to write the description in `Markdown
<https://daringfireball.net/projects/markdown/>`_ (`RFC 7763
<https://tools.ietf.org/html/rfc7763>`_) as many code hosting sites render
Markdown READMEs, and authors would reuse the file for the description. PyPI
didn't recognize the format and so could not render the description correctly.
This resulted in many packages on PyPI with poorly-rendered descriptions when
Markdown is left as plain text, or worse, was attempted to be rendered as reST.
This field allows the distribution author to specify the format of their
description, opening up the possibility for PyPI and other tools to be able to
render Markdown and other formats.

The format of this field is the same as the ``Content-Type`` header in HTTP
(e.g.:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an "i.e." ("for explanation") rather than an "e.g." ("for example"), since RFC 1341 is the relevant normative reference here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a3c07c5. Thanks!

`RFC 1341 <https://www.w3.org/Protocols/rfc1341/4_Content-Type.html>`_).
Briefly, this means that it has a ``type/subtype`` part and then it can
optionally have a number of parameters:

Format::

Description-Content-Type: <type>/<subtype>; charset=<charset>[; <param_name>=<param value> ...]

The ``type/subtype`` part has only a few legal values:

- ``text/plain``
- ``text/x-rst``
- ``text/markdown``

The ``charset`` parameter can be used to specify whether the character set in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

charset is a character encoding. See for example https://www.w3.org/International/articles/http-charset/index

Copy link
Contributor Author

@msabramo msabramo May 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@merwok: The doc has changed a bit since you reviewed in February, mostly as a result of me merging msabramo#2. Does the explanation look good in the new version:

The charset parameter can be used to specify the character encoding of
the description. The only legal value is UTF-8. If omitted, it is assumed to
be UTF-8.

or do you still think it needs to be tweaked?

use is UTF-8, ASCII, etc. If ``charset`` is not provided, then it is
recommended that the implementation (e.g.: PyPI) treat the content as
UTF-8.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe move the sentence "If charset is not provided..." down below? (See next comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does e4b39db help?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @pfmoore that this can be dropped in favour of the more concise specification below.


Other parameters might be specific to the chosen subtype. For example, for the
``markdown`` subtype, there is a ``variant`` parameter that allows specifying
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it expected that PyPI will support rendering different variants of markdown (which will presumably require using a number of different libraries to achieve this)?

I would much rather just have a single idea of what "Markdown" is and just pick a single standard and stick with it, ideally something that is compatible with what Github thinks Markdown is so that the same file will work with both.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention here (possibly needs to be clarified?) is that CommonMark is what PyPI would initially support, as it seems like the best standard right now, but I was trying to leave room for other variants so we can adapt to change.

I do wonder if I should remove the references to GFM and original Markdown, so that folks don't think they can use them and get their exact semantics (because it will likely be unrecognized by PyPI and thus treated as CommonMark). Folks are likely to see GFM and get excited and specify and then expect it to work exactly like GitHub and then complain when it doesn't.

Should I remove GFM and original Markdown and say CommonMark is the only legal variant right now?

Copy link
Contributor

@nicktimko nicktimko May 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per the GitHub Flavored Markdown spec, it's a superset of CommonMark with the following additional features ("extensions"):

  • Tables
  • Strikethrough using ~ like this. Used more often for comedic effect than srs bizniz.
  • Task Lists
    • Probably
    • Irrelevant
    • For PyPI
  • More Autolinks. CM wants <> around links to flag them as being links. This is probably just a minor usability thing to make "quick" comments more useful. A negligible barrier for any document that's scrutinized for a couple seconds
  • More HTML disallowed...but presumably there's going to be some (high) level of HTML sanitization going on anyways before/after conversion, so that might render it moot anyways. If the filters aren't turned up too high, users could do tables manually.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rendering mechanism used by both PyPI and Warehouse does not rely on the renderer to emit safe HTML and we do our own strict HTML sanitization after it has been rendered. Theoretically we could have an option for raw HTML as a long_description and it shouldn't provide any avenue for attack. The list of whitelisting tags/attributes can be found at readme_renderer/clean.py#L24-48.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like neither <del> or <s> is allowed anyway, so the strikethrough feature is borderline moot in GFM as there's 0 users using it on PyPI ;).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can of course add things to the list where needed as well! The current list is basically compiled by adding things whenever we found something that didn't render correctly on PyPI that we felt was a reasonable thing to want to have rendered. I didn't exhaustively look through every HTML tag and make a determination on it.

the variant of Markdown in use, such as:

- ``CommonMark`` for `CommonMark`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last backtick is causing incorrect rendering

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Fixed in 4b17b0b.

<https://tools.ietf.org/html/rfc7764#section-3.5>`_

- ``GFM`` for `GitHub Flavored Markdown (GFM)
<https://tools.ietf.org/html/rfc7764#section-3.2>`_

- ``Original`` for `Gruber's original Markdown syntax
<https://tools.ietf.org/html/rfc7763#section-6.1.4>`_

Example::

Description-Content-Type: text/plain; charset=UTF-8

Example::

Description-Content-Type: text/x-rst; charset=UTF-8

Example::

Description-Content-Type: text/markdown; charset=UTF-8; variant=CommonMark

Example::

Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

Example::

Description-Content-Type: text/markdown; charset=UTF-8; variant=Original

If a ``Description-Content-Type`` is not specified or it's set to an
unrecognized value, then the assumed content type is ``text/x-rst;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably wrong... slightly. My assumption is that PyPI will, if an explicit content-type has been picked, will error if it doesn't correctly render on upload (preventing the current problem of "oops, I accidentally broke my long description"). This is different than just assuming text/x-rst because for that we attempt to render it as text/x-rst and then we do a fallback to text/plain.

My suggestion would be to say:

  • If a Description-Content-Type is not specified, then applications should attempt to render it as text/x-rst and fall back to text/plain if it is not valid rst.
  • If a Description-Content-Type is an unrecognized value, then the assumed content type is text/plain (Although PyPI will probably reject anything with an unrecognized value).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I used your wording in a8c3b77. Let me know if you have other comments.

charset=UTF-8``.

If the ``charset`` is not specified or it's set to an unrecognized value, then
the assumed ``charset`` is ``UTF-8``.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 92–94 above merely recommend that implementations assume the charset is UTF-8, but this line seems to require that they do so.

You should resolve the contradiction one way or the other. I'd prefer if UTF-8 is required to be the default for simplicity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be all for UTF-8 only. What do others think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm +1 on saying UTF-8 if otherwise unspecified, but I think it's potentially useful to allow an alternative user-specified encoding. I doubt it'll be used a lot, but it's consistent with similar MIME-type standards.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a3c07c5. Thanks!


If the subtype is ``markdown`` and ``variant`` is not specified or it's set to
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line duplicates what's said above.

an unrecognized value, then the assumed ``variant`` is ``CommonMark``.

Copy link
Member

@pfmoore pfmoore Jul 31, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe include the case of a missing charset here? Something like

If a Description-Content-Type is present, but has no charset specified, then the character set is assumed to be UTF-8

I'm not sure I see much benefit in distinguishing between "charset isn't given but the implementation should treat it as UTF-8" and "when charset isn't given, it defaults to UTF-8".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pfmoore Does e4b39db address your concern?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, looks fine to me, thanks.


Version Specifiers
==================
Expand Down