Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Argument Clinic] Change the errors raised by the path_t converter #122228

Closed
picnixz opened this issue Jul 24, 2024 · 12 comments
Closed

[Argument Clinic] Change the errors raised by the path_t converter #122228

picnixz opened this issue Jul 24, 2024 · 12 comments

Comments

@picnixz
Copy link
Member

picnixz commented Jul 24, 2024

Feature or enhancement

Proposal:

Currently, a ValueError is raised by the path_t converter if the input path is too long or if there are embedded null bytes. For any C code that is using this converter, such as os.stat, this actually makes os.stat raise a ValueError instead of an OSError in this case.

EDIT: Remove paragraph on NUL bytes. For NUL bytes, the converter must anyway raise ValueError. I fixed linecache in #122176 for that.

See #122170 (comment) for another issue (only on Windows though I don't know why since Linux may also have too long paths in general).

So, what I suggest is to raise OSError, or have a way to only disable ValueError that would not make stat(2) fail (i.e., keep ValueError for NUL bytes but not for long paths). I don't know which is the best but I'd prefer just ignoring long paths in path_t converter itself.

Links to previous discussion of this feature:

#122170 (comment)

@picnixz picnixz added type-feature A feature request or enhancement topic-argument-clinic labels Jul 24, 2024
@serhiy-storchaka
Copy link
Member

All functions that check for embedded null characters or bytes raise ValueError. This is built in PyArg_Parse() and Argument Clinic and reproduced across the code.

This is definitely not an OS error. These OS functions take null-terminated strings, therefore they cannot check for embedded null characters. ValueError is raised on the boundary between Python and OS, because there is no way to pass a string with embedded null characters to OS.

Embedded null characters is not the only problem you should guard against. You can also get UnicodeEncodeError or UnicodeDecodeError (which are ValueError subclasses, so existing guards against ValueError work for them).

You should guards against ValueError only if you can get a right answer at the low level (for example os.path.exists() returns False for such paths, and this is reasonable). And, of course, you can always get a TypeError or an AttributeError if pass wrong arguments, a MemoryError or a KeyboardInterrupt. The library code usually should not guard against them.

@picnixz
Copy link
Member Author

picnixz commented Jul 24, 2024

Ok, that's what I thought about the embedded null bytes (but then I don't know the import system). However, the question remains on the too long names. Should we raise a ValueError or an OSError (or should we do something else?) Because os.stat() raising a ValueError because of a too long name doesn't seem right to me (at least according to Eryk's #122170 (comment))

@picnixz
Copy link
Member Author

picnixz commented Jul 24, 2024

I updated my post and restricted my proposal to:

So, what I suggest is to raise OSError, or have a way to only disable ValueError that would not make stat(2) fail (i.e., keep ValueError for NUL bytes but not for long paths). I don't know which is the best but I'd prefer just ignoring long paths in path_t converter itself.

If the path does not have embedded NUL bytes, but is too long, are there any system calls that could fail with a non-OS-like error?

@serhiy-storchaka
Copy link
Member

Isn't it for Windows only?

I do not know what happens on Windows when too long path is passed. If it crashes or truncates the path, we should raise ValueError to prevent this. If it requires the length to be representable as a signed 16-bit integer, we should raise ValueError, because there is no way to represent the length.

Otherwise we can remove the guard, but we should check that all functions work with long paths on all supported Windows versions. This is also a breaking change, so we should investigate what effect it causes on the third-party code.

@picnixz
Copy link
Member Author

picnixz commented Jul 24, 2024

Isn't it for Windows only?

I think so (sorry I forgot about the Windows label). By the way, the latest modification on this code path was #118355 which introduced the possibility to suppress ValueError (AFAICT) but this would also suppress NUL bytes related errors (which would not be good).

I think the ValueError in path_t converter for too long paths is quite old, so it would probably be a breaking change. A quick search https://github.com/search?q=%2F%3A+path_t%5C%28%2F+language%3AC&type=code&l=C&p=1 seems to give no result that is not CPython so I don't know if the converter is widely used...

I do not know what happens on Windows when too long path is passed. If it crashes or truncates the path, we should raise ValueError to prevent this. If it requires the length to be representable as a signed 16-bit integer, we should raise ValueError, because there is no way to represent the length.

That.. is something I don't know. Eryk seemed to have much deeper knowledge in that area. Alternatively, I can try to do what I did for linecache, namely guarding against ValueError explicitly when needed (at least, we wouldn't break anything).

@eryksun
Copy link
Contributor

eryksun commented Jul 24, 2024

I do not know what happens on Windows when too long path is passed. If it crashes or truncates the path, we should raise ValueError to prevent this. If it requires the length to be representable as a signed 16-bit integer, we should raise ValueError, because there is no way to represent the length.

The check was added by Larry Hastings in #58831 in Python 3.3, back in 2012 when Windows XP was still supported. Larry didn't discuss a rationale for adding it.

Pathname arguments in the Windows file API are null-terminated strings. Internally, the base NT system uses a counted UNICODE_STRING, which stores the byte length as an unsigned 16-bit integer. Thus a 16-bit wchar_t string is limited to either 32766 or 32767 characters, depending on whether the API includes the null character in the MaximumLength value of Buffer. Generally it does, so the limit is usually 32766.

It could be that there was an issue in NT 5.1 (Windows XP) with long paths prefixed by "\\?\" getting truncated. I know that NT 4.0 and 5.0 (Windows 2000) used RtlInitUnicodeString(), which silently truncates if a string exceeds the maximum length. This was fixed in NT 5.1 with the addition of RtlUnicodeStringInitEx(), which fails with STATUS_NAME_TOO_LONG (corresponding to Windows ERROR_FILENAME_EXCED_RANGE) instead of silently truncating. Maybe some functions in the XP file API were still using the old init function, but that seems unlikely. Maybe Larry assumed it was still a problem in NT 5.1.

Otherwise we can remove the guard, but we should check that all functions work with long paths on all supported Windows versions.

In #122170 (comment), I discussed the behavior when the pathname or filename exceeds the maximum allowed length in WinAPI CreateFileW(). Other file API functions behave similarly, though the length limits may differ (e.g. CreateDirectoryW() reserves space for an 8.3 short filename if long pathnames aren't enabled and it isn't a "\\?\" extended path), or the error code may differ (e.g. SetCurrentDirectoryW() always uses ERROR_FILENAME_EXCED_RANGE, not ERROR_PATH_NOT_FOUND).

@zooba
Copy link
Member

zooba commented Aug 6, 2024

There are a number of issues on this topic going on right now, but for this one, we should not change the error type. A path longer than 32KB is always an invalid value on Windows, and so ValueError is correct.

Places where it ought to be handled silently rather than letting the user know they've provided an invalid value can be discussed individually. There's no general rule to apply here. Let's approach each one in terms of the real problems that are being observed.

@picnixz
Copy link
Member Author

picnixz commented Aug 12, 2024

Thanks for confirming that it's indeed the correct exception type. I'm closing this specific issue as not planned then.

@picnixz picnixz closed this as not planned Won't fix, can't repro, duplicate, stale Aug 12, 2024
@picnixz picnixz removed the type-feature A feature request or enhancement label Aug 12, 2024
@eryksun
Copy link
Contributor

eryksun commented Aug 12, 2024

There are a number of issues on this topic going on right now, but for this one, we should not change the error type. A path longer than 32KB is always an invalid value on Windows, and so ValueError is correct.

@zooba, I see no reason for Python to be required to raise ValueError for a long path on Windows, at least not after Windows Vista, and probably starting with Windows XP (when RtlUnicodeStringInitEx() was added to the system library). No other platform raises ValueError for a pathname or filename that's too long for the system API. The reason Python started raising ValueError on Windows is just that Larry Hastings silently added it in 3.3, without any discussion in the associated issue. It has been maintained as the status quo for years. Without that code, an OSError would be raised, for ERROR_FILENAME_EXCED_RANGE, ERROR_PATH_NOT_FOUND, or ERROR_INVALID_NAME (the final filename component exceeds the length limit of the filesystem).

@zooba
Copy link
Member

zooba commented Aug 12, 2024

It's that inconsistency in error that I don't like. Long names are pretty thoroughly documented as only being 32KB, so we can reliably raise a suitable error. We wouldn't do it for the smaller limit because we couldn't reliably detect the ways of bypassing it, but those ways don't exist at this limit.

One of the biggest problems with the 260 limit is that the error doesn't say that it's too long. If all the error codes were consistent (and ideally if they said it was too long), then I'd let them bubble out. But because it's a bit of a mess and it's easy for us to verify this particular case, so we should do it.

I'd argue that ERROR_INVALID_NAME (and some others) would make a better ValueError than OSError, since there's nothing about the OS state that could change to make it valid - repeating the call with the same value will get the same result. This is different from PATH_NOT_FOUND where it's possible that something else could create the path in the meantime. But by the time we've only got an OS error code we won't change the exception type outside of the base class, as it's already being raised with that type.

@eryksun
Copy link
Contributor

eryksun commented Aug 12, 2024

Our existing check has a one-off error. The upper limit is 32766, not 32767, because RtlInitUnicodeStringEx() includes the trailing null character in the UNICODE_STRING buffer.

Beyond that, the check doesn't take into account the path length after the system API resolves the working directory and "\??\" NT device prefix (e.g. "spam" -> "\??\C:\long\working\path\spam"). This can lead to internal library calls failing, as either boolean false or with status codes such as STATUS_NAME_TOO_LONG or STATUS_OBJECT_NAME_INVALID. For example, CreateFileW() calls RtlDosPathNameToRelativeNtPathName_U_WithStatus(), which fails with STATUS_OBJECT_NAME_INVALID if the length of the working path and filename exceeds the upper limit of a UNICODE_STRING. In this case, CreateFileW() and CreateDirectoryW() fail with ERROR_PATH_NOT_FOUND.


FYI, I spoke too soon about the API having been fixed back in Windows XP. I just discovered that for some reason they never updated FindFirstFileW() to call RtlInitUnicodeStringEx() instead of the unsafe RtlInitUnicodeString() function, which truncates a long path. At best that leads to a vague ERROR_PATH_NOT_FOUND (as opposed to an explicit ERROR_FILENAME_EXCED_RANGE), but at worst the truncated path may actually be an existing file or directory. The Windows team should fix this. I wouldn't be satisfied to allow undefined behavior just because the 32767 character limit is documented.

@zooba
Copy link
Member

zooba commented Aug 13, 2024

Our existing check has a one-off error.

That's a bug in the check, not a problem with the error type 😉 We should fix it - does it have its own issue yet? I'm sure it's an "easy" tag one.

after the system API resolves the working directory and "??" NT device prefix (e.g. "spam" -> "??\C:\long\working\path\spam")

These are impacted by the OS state, and so if the OS state causes the value to become unusable, then it's an OSError. Changing the OS state could (theoretically) resolve it. So this is consistent with my reasoning from above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants