Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework the safety related API code #189

Closed
wants to merge 1 commit into from
Closed

Rework the safety related API code #189

wants to merge 1 commit into from

Conversation

ingydotnet
Copy link
Member

The main change is that 'Danger' has been renamed to 'Python' and that
the default dump() and dump_all() functions use the 'Python' schema
to be able to dump any Python data structure.

NOTE: In YAML, 'Schema' is used to mean all the semantics and rules of
what a YAML document means and how it is processed.

The load() and load_all() functions continue to use the Safe schema.

The dump() and load() sugar functions should be similar in that they
both do the must useful and safe operations.

There are top level functions for each schema (Safe and Python) and
those functions should be used when feeding data from one system to the
other and expecting the same semantics (schema):

  • safe_dump safe_dump_all
  • safe_load safe_load_all
  • python_dump python_dump_all
  • python_load python_load_all

When we have a schema language for YAML, the generic methods with be:

  • yaml.dump(node, Schema='foo.schema')
  • yaml.load(yaml, Schema='foo.schema')

A loader class like SafeLoader is a loader with a hardcoded schema.
Right now pyyaml has 2 schemas:

  • Python - serialize any python data
  • Safe - only serialize in a way that won't trigger code

'Danger' was used in response to a situation where people were caught
unaware that something bad could happen in a seemingly normal, default
situation. Now we've fixed the default to be safe, and Safe is an OK
name for a schema, but Danger really is not. It's not the purpose of the
schema to be dangerous. The purpose is to serialize Python data
structures.

The danger_ API functions can be removed because they have only been
released for a couple days and they aren't documented anywhere.


This also fixes a bug in that safe_load() and load() were aliases. They
shouldn't be, because load() accepts a Loader kwarg, and safe_load()
should not. ie safe_load(yaml, Loader=PythonLoader) shouldn't be
allowed.

The main change is that 'Danger' has been renamed to 'Python' and that
the default `dump()` and `dump_all()` functions use the 'Python' schema
to be able to dump any Python data structure.

NOTE: In YAML, 'Schema' is used to mean all the semantics and rules of
what a YAML document means and how it is processed.

The `load()` and `load_all()` functions continue to use the Safe schema.

The dump() and load() sugar functions should be similar in that they
both do the must useful and safe operations.

There are top level functions for each schema (Safe and Python) and
those functions should be used when feeding data from one system to the
other and expecting the same semantics (schema):

* safe_dump safe_dump_all
* safe_load safe_load_all
* python_dump python_dump_all
* python_load python_load_all

When we have a schema language for YAML, the generic methods with be:

* yaml.dump(node, Schema='foo.schema')
* yaml.load(yaml, Schema='foo.schema')

A loader class like SafeLoader is a loader with a hardcoded schema.
Right now pyyaml has 2 schemas:

* Python - serialize any python data
* Safe - only serialize in a way that won't trigger code

'Danger' was used in response to a situation where people were caught
unaware that something bad could happen in a seemingly normal, default
situation. Now we've fixed the default to be safe, and Safe is an OK
name for a schema, but Danger really is not. It's not the purpose of the
schema to be dangerous. The purpose is to serialize Python data
structures.

The danger_ API functions can be removed because they have only been
released for a couple days and they aren't documented anywhere.

----

This also fixes a bug in that safe_load() and load() were aliases. They
shouldn't be, because load() accepts a Loader kwarg, and safe_load()
should not. ie safe_load(yaml, Loader=PythonLoader) shouldn't be
allowed.
@alex
Copy link
Contributor

alex commented Jun 28, 2018 via email

@ingydotnet
Copy link
Member Author

This relates to #187

@ingydotnet
Copy link
Member Author

@alex I think this is an issue to be taken care of in the docs.

Adding a 'danger_' prefix to things doesn't get across the purpose of the function. The purpose is not to do something dangerous, it's to load python data. Which btw is only potentially dangerous (if you are using untrusted input).

The 'safe_' prefix, otoh does indicate the purpose. To load data that matches the safe schema and thus is believed not to trigger code using untrusted data.

@perlpunk perlpunk self-requested a review June 28, 2018 22:55
Copy link
Member

@perlpunk perlpunk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me. I also suggested python_load. I think it's a good name because it enables !!python/... tags.

@ingydotnet
Copy link
Member Author

ingydotnet commented Jun 28, 2018

@perlpunk interesting mnemonic 👍

Kirill gave me the pyyaml.org content, so we can update the docs there soon. He also suggested putting it all on https://readthedocs.org/ which I think is a great idea.

@glyph
Copy link

glyph commented Jun 29, 2018

I was very excited to see the renaming to danger_ because this makes pyyaml uncharacteristically safe; far too many yaml libraries are the source of distressing remote code execution vulnerabilities.

danger_ is a good name for these functions. Please don't change the name.

The danger is multi-faceted:

  • Serialization of arbitrary Python objects with no restrictions makes it much easier to accidentally serialize data that was not meant to be present in the serialization. This increases the risk of catastrophically huge serialized payloads, unintended sensitive information being saved, etc. So it's dangerous in that you might serialize too much.
  • (De)-serialization of arbitrary objects with no explicit schema guarantees or type information makes it likely that you might rehydrate an object in a state that was valid in a previous version of your library, but is now invalid. Combined with the first point, it's likely that you accidentally serialize objects that you had no intent of enforcing a compatibility contract on, but are now committed to maintaining forever because they show up in your customer's saved state files. So it's dangerous because it may massively increase your maintenance and compatibility burden over time, and it's dangerous because it may load objects that cause your program to break in hard-to-debug ways.
  • Expressing this in terms of saving and loading "python objects" makes many things superficially easier. It makes the operations sound safe, unless you are the sort of programmer who has previously put your own eye out with a ZODB; but in that case, you'll know to just never use these functions, but this terminology actively misleads newcomers. By default, people should think of the "python schema" here as extremely scary and dangerous, and only feel safe calling these APIs after they have FULLY considered the implications. So this naming is dangerous because it puts the burden of understanding on the audience least prepared to do that understanding.

In my considered opinion, there are exactly 2 times when it is valid to serialize arbitrary objects in this way:

  1. when you are in a prototyping environment and you know you are going to throw away all your saved data after a few iterations and commit to something more strictly enforced, and
  2. when you are communicating data between two processes that are part of the same system, where one has launched the other, and you know for sure that they are running the exact same version of the code.

These are both fairly niche applications, neither of which should be the primary use of pyyaml, which is to produce and consume vaguely human-readable configuration documents. So the fact that danger_ naming will likely make this type of usage considerably more obscure is very much a good thing.

@ingydotnet
Copy link
Member Author

@glyph,

There are a lot of dangerous operations in computing but I've never seen the prefix danger_ applied to any of them. The eval function is considered dangerous in any language but we don't call it danger_eval.

Any operation that you don't understand is potentially dangerous. The solution is to help people understand what is dangerous (and why) and what is not. We can do that by explaining things well in the docs.

I might be OK with adding a danger_load alias for the python_load call which is really the same as yaml.load(text, Loader=PythonLoader). So then you can use it if that's your style.

The other option is to get rid of both (danger_ and python_) shortcuts, since yaml.load() has already been made safe now. If people want to load unsafe things they could just be explicit with yaml.load(text, Loader=PythonLoader). (Note: we can't get rid of safe_load as it already exists in 3.12)

Regarding the danger_dump function, it's pretty useless and misleading because dumping is not known to be dangerous, and the default behavior of yaml.dump(...) should be to dump whatever it's given.

Regarding the naming of the Loader/Dumper classes, these really need to be named after the schemas they are enforcing.

Overall I think naming things 'danger' is both unprecedented and it's going to feel silly to be forced to use it after you understand how things work. I think it's a short-sighted solution, just as thinking of the YAML Data Language as a configuration language is a short-sighted point of view.

@sigmavirus24
Copy link
Contributor

I don't really have much time, but let me provide some things that are separate from the discussion of naming.

Let's say I have something I want to serialize with PyYAML. Let's say I've sanitized most of it but accidentally missed a random object and now I use yaml.dump to serialize it to YAML. I've now, unknowingly, caused it to include the !!python/. Unsuspectingly, I now go and try to load it with yaml.load and I can't. Further, if I had used danger_load my application may have broken in production.

I don't particularly care about the danger_ prefix or not (although I prefer it) but I think everything should be safe by default. I also agree with @glyph and @alex unsurprisingly that the python_ prefix is not quite helpful. Speaking as someone working on a product that had problems with serializing and deserializing python objects earlier this year that caused significant production downtime, I think the current API sufficiently protects users, which is what library designers need to be concerned about. Documenting one's way out of an unsafe paper bag is a disservice to users.

@ingydotnet
Copy link
Member Author

@sigmavirus24 perhaps the best thing for now is to not pollute the import namespace with new functions that are not great. Since we can get what we want with explicit calls let's just use those for now.

People who want to serialize outside the Safe schema can do so without the sugar. If they don't like it, we will get real world feedback instead of subjective opinions.

Note: The safe_ functions have been in PyYAML since the start. Making things safe by default kind of defeats the purpose of having them there but they'll need to stay.

@ingydotnet
Copy link
Member Author

@sigmavirus24 after thinking more on this I believe it would be wise to revert the #74 merge and let it sit out the 4.2 release. There is a ton going on already and this commit is the most contentious and breaks backward compatibility for almost all usage.

I am sorry that I wasn't paying attention when this got merged last August, but then again something of this magnitude shouldn't have gotten in without my signoff. I'm quite sure I would have had very similar reactions back then.

I agree that this is issue is very important, but I think it deserves a release of its own, and should not be lumped in with an already heavily overloaded 4.2 release.

@glyph
Copy link

glyph commented Jun 29, 2018

There are a lot of dangerous operations in computing but I've never seen the prefix danger_ applied to any of them. The eval function is considered dangerous in any language but we don't call it danger_eval.

  1. Yes, you're right, as an industry we have often put the public at risk and treated users' security somewhat cavalierly. It's understandable; software is a new discipline, few of us have any formal training, formal training has huge gaps and frequently falls behind; so we all have a lot to learn.

  2. The term "hazardous materials" as popularized by https://cryptography.io/en/latest/hazmat/primitives/ would be an acceptable substitute for "danger".

@glyph
Copy link

glyph commented Jun 29, 2018

... this commit is the most contentious and breaks backward compatibility for almost all usage ...

If you're going to reverse course on this, I think this CVE needs to be updated to indicate that subsequent versions are still vulnerable? http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-18342

@ingydotnet
Copy link
Member Author

@glyph there is nothing inherently dangerous about the yaml.load function (or any language's eval function). If I'm using either on trusted data there's no reason to be forced to write code that looks like yaml.danger_load(trusted_yaml). It's annoying and confusing.

re the CVE, technically the course is already reversed as 4.1 was retracted from PyPI earlier today.
I find the CVE to be a bit ironic because safe_load has been a part of PyYAML since the 3.01 release, over 12 years ago.

@cdent
Copy link

cdent commented Jun 29, 2018

From my standpoint, I think the danger* terms are potentially useful for sugar for the exposed interface that people commonly use as they signal "dragons!". safe_load, python_load and danger_load could all be facades over various calls to load with different Loaders (some the same). Under the hood, however, I think naming the loaders and schema "Danger" is something akin to over the top because it is sort of inflammatory and non-descriptive.

I think this option

The other option is to get rid of both (danger_ and python_) shortcuts, since yaml.load() has already been made safe now. If people want to load unsafe things they could just be explicit with yaml.load(text, Loader=PythonLoader)

is also a good one. The critical aspect of it is that you're doing something specific to overcome safe-by-default. This might also be reasonable for dump but I'm not really sure because dumping is not something I do a lot of. The scenario of confusion that @sigmavirus24 is pretty compelling.

With regard to where to go from here: safe-by-default seems to me like the keystone feature of a big new PyYAML 4, so releasing something without it doesn't seem like the way to go.

(And to be self-serving, but the sugar aspect of things is less critical to me than getting a clear and consistent definition of what "safe" will mean: #187.)

@perlpunk
Copy link
Member

Just throwing in some thoughts.

If the majority thinks there has to be a shortcut for *load() that is not safe, I would call it unsafe_load for example, instead of danger_load.

I'm coming from perl and did not know what kinds of exploits are possible with yaml.load.
In the various perl YAML modules, exploits are not trivial to create and only work if they fit to the code that is doing the load (means that the module the exploit targets has to be already loaded). No modules are loaded (you can create an instance of a class without loading that module in perl). There is no eval or anything involved unless you enable a specific option to load literal perl code.

In pyyaml, the situation seems to be different, and it's very easy to inject any code. So I'm in favor of dropping python_load and danger_load to avoid having easy shortcuts.

I still don't like the name DangerLoader, but I can see why you think PythonLoader may sound too harmless. How about something like PythonCodeLoader or PythonEvalLoader?

@sigmavirus24
Copy link
Contributor

@perlpunk The old load behaviour allows for Remote Code Execution, see #193 (comment)

@perlpunk
Copy link
Member

obsolete, see #257

@perlpunk perlpunk closed this Mar 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants