Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade CString to not allocate on empty. #40547

Closed
wants to merge 1 commit into from

Conversation

clarfonthey
Copy link
Contributor

@clarfonthey clarfonthey commented Mar 15, 2017

This is done by letting CString contain an empty slice, while also returning a static, NUL-terminated slice when necessary.

This shouldn't be too much of a maintenance burden, but I'll put this PR up and let the libs team decide if they want to keep this.

@rust-highfive
Copy link
Collaborator

r? @brson

(rust_highfive has picked a reviewer for you, use r? to override)

let slice = slice::from_raw_parts(ptr, len as usize);
CString { inner: mem::transmute(slice) }
if len == 1 {
debug_assert_eq!(ptr, NUL_TERMINATED_EMPTY.as_ptr() as *mut c_char);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion makes no sense. It is requiring all ptrs pointing to null byte be NUL_TERMINATED_EMPTY, which is a breaking change and also weird.

(It precludes code like one below from working)

let cstring = CString::from_raw(b"\0");
...
mem::forget(cstring);

@@ -274,7 +287,12 @@ impl CString {
/// Failure to call `from_raw` will lead to a memory leak.
#[stable(feature = "cstr_memory", since = "1.4.0")]
pub fn into_raw(self) -> *mut c_char {
Box::into_raw(self.into_inner()) as *mut c_char
if let Some(bytes) = self.into_inner() {
if bytes.len() > 1 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not see a reason for this conditional. Return the pointer always. Alternatively assert that if bytes.len() == 1, the pointer is the same as of NUL_TERMINATED_EMPTY slice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is already an assertion; I accidentally added that condition. Will remove.

@alexcrichton alexcrichton added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Mar 15, 2017
@oli-obk
Copy link
Contributor

oli-obk commented Mar 15, 2017

Was CString NonZero before this change? It won't be after the change.

Transmute::<Option, Something>(a) will stop compiling, too, if it compiled before...

#[stable(feature = "rust1", since = "1.0.0")]
pub struct CString {
// Invariant 1: the slice ends with a zero byte and has a length of at least one.
// Invariant 2: the slice contains only one zero byte.
// Improper usage of unsafe function can break Invariant 2, but not Invariant 1.
inner: Box<[u8]>,
inner: Option<Box<[u8]>,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like lacks a > here?

@clarfonthey
Copy link
Contributor Author

@oli-obk It was because it was just a wrapper for a Box, so, that is something to consider. In the future I can imagine that we'd like to back CString with a Vec instead of a Box to bring its API up to the same level as OsString and String, and if that happened, we'd have a similar problem.

I've also fixed the errors; initially I had also coded this change to replace Box with Vec, then changed it back last minute, which is why there were so many mistakes. I figured that I'd offer this change first because it doesn't require future plans for new APIs.

@@ -260,8 +267,13 @@ impl CString {
#[stable(feature = "cstr_memory", since = "1.4.0")]
pub unsafe fn from_raw(ptr: *mut c_char) -> CString {
let len = libc::strlen(ptr) + 1; // Including the NUL byte
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the strlen be moved the into the else branch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it should!

@mzji
Copy link

mzji commented Mar 15, 2017

How about change CString to something like

enum CString {
    Empty([u8; 1]),
    Some(Box<[u8]>),
}

and initialize it to Empty([0]) for empty strings? This approach looks like more cache-friendly.

@abonander
Copy link
Contributor

@mzji That won't work because CString needs to support being leaked for passing "ownership" over FFI, so giving out a pointer to the stack isn't what we want here.

@abonander
Copy link
Contributor

abonander commented Mar 15, 2017

To preserve null-pointer optimization, you can keep the original Box<[u8]>/Vec<u8> structure and then just don't push a zero-byte if the provided data is empty. Then in into_ptr(), CStr::to_ptr() and *_with_nul() methods, return a pointer to the static [0] if the data is empty.

@mzji
Copy link

mzji commented Mar 15, 2017

@abonander We cannot pass the ownership of both stack memory and static memory over the FFI boundary, right?

@abonander
Copy link
Contributor

abonander commented Mar 15, 2017

@mzji Static memory is fine because it's guaranteed to be around for the life of the program. There is some issue if the consumer of the string means to mutate it (because the static will be stored in a non-writable region, and also potentially shared with other instances), but that's not commonly done, to my knowledge; string manipulations generally involve copying to a new allocation. It's also meaningless to mutate a zero-length C-string because all you could mutate is the nul-byte, in which case your handling of C-strings is erroneous to begin with, and a segfault from attempting to write non-writable memory would probably be the nicest way to discover that bug.

Leaking heap memory is also fine because it won't ever be reclaimed unless you get the pointer back and free it yourself. That's the point of CString::into_ptr().

However, stack memory is an issue because if you're wanting to pass a C-string to some library which will then hold onto it, having the backing memory on the stack is an issue because when your function returns and another one is called, you'll have some other code using the exact same memory region as your C-string.

@mzji
Copy link

mzji commented Mar 15, 2017

@abonander Ok the 'C code will hold the pointer' part makes sense. Anyway that's shouldn't be called 'passing ownership', since the buffer will be freed by rust code, not C code. I misunderstood this.

@abonander
Copy link
Contributor

abonander commented Mar 15, 2017

That's why I put "ownership" in quotes. Even if the concept of ownership isn't built into the language, the C code (or any other language that's ABI-compatible with C) is still taking responsibility for the pointer and freeing it prematurely could cause Bad Things™ to happen.

@mzji
Copy link

mzji commented Mar 15, 2017

What I thought is actually that the C code just takes a borrow (in Rust meaning) of the string. However, yeah, some C code just does some cute bad things... So put empty string in static memory could almost immidiately figure out some C code just works incorrectly.

@abonander
Copy link
Contributor

If the FFI code doesn't assume the pointer will remain valid after the call returns, then passing a temporary pointer is fine; you would use CStr::as_ptr() there.

@mzji
Copy link

mzji commented Mar 15, 2017

That's why I put "ownership" in quotes. Even if the concept of ownership isn't built into the language, the C code is still taking responsibility for the pointer and freeing it prematurely would cause Bad Things™ to happen.

Also one thing we should remember, is that the memory should be freed by the allocator which allocates it. That means, even for pointers generated by CString::into_ptr(), C code shouldn't free them, except the C code will call the allocator provided by Rust code. Noted.

@clarfonthey clarfonthey force-pushed the c_string branch 2 times, most recently from 766957d to 4417e8b Compare March 15, 2017 23:53
@sfackler
Copy link
Member

How often are empty strings passed through FFI? Is there a specific use case this is targeting?

@dgel
Copy link

dgel commented Mar 17, 2017

I'm not sure if this scenario is a concern for rust libraries, but potentially you could load a dynamic library at runtime using dlopen, call a function returning an empty string, unload it and then try to access the string.
Disregarding deallocation, it might be nice if this doesn't crash on accessing the returned string

@clarfonthey clarfonthey force-pushed the c_string branch 2 times, most recently from 8c89b4d to 063cfe8 Compare April 6, 2017 18:30
@nagisa
Copy link
Member

nagisa commented Apr 7, 2017

Static memory is fine because it's guaranteed to be around for the life of the program.

This is not true. The dynamic libraries and their static memory may be loaded and unloaded at will.

All in all, I find it ever so harder to justify making this change.

@alexcrichton
Copy link
Member

We discussed this PR during libs triage the other day, and we were in general hesitant to accept this due to the difference from how types like String and Vec operate.

@clarcharr can you clarify use cases and perhaps provide some real-world benchmarks to motivate this PR?

@Mark-Simulacrum Mark-Simulacrum added the S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. label Apr 14, 2017
@shepmaster
Copy link
Member

@clarcharr thanks for the contribution! We're looking for some clarification; would you mind responding to the previous comment?

@clarfonthey
Copy link
Contributor Author

@shepmaster @alexcrichton sorry for not responding! I kind of forgot about this PR and I'm not 100% sure if I have a compelling enough argument for it.

Right now, my main thoughts were that it would support a future implementation of CString which allows Vec-type operations like String. It makes sense to not allocate by default because allowing manipulations of CStrings would make it very natural to include an empty CString in a constructor for a type, then later modify that string in future method calls. I can't think of a lot of cases where I'd necessarily do that, but then again, I consider CString to be one of the less-used types in stdlib.

Uses of that would basically be anywhere where some Rust code needs to build strings for C code. Which, again, I'm not sure how common that is.

I'd be willing to come up with some benchmarks if this kind of thing is desired, although right now I am leaning toward closing this and starting more of a discussion elsewhere on how desired Vec-like manipulation of CString is.

@clarfonthey
Copy link
Contributor Author

Also for context, I've currently been looking up inconsistencies for rust-lang/rfcs#1876 that can be fixed now without an RFC, so that there's less that'd be needed to talk about for a future RFC. I sort of included this in that list as a "maybe we want this" because I figured that it'd be easier to come up with an implementation than to speculate about how it'd be done.

@alexcrichton
Copy link
Member

Hm ok. I figured that we wouldn't end up expending that much effort enhancing CString, but rather if you're working with C strings and building them you'd quickly move to the Rust world of slices and Vec. Is there a concrete benefit from enhancing CString with various operations like on Vec and whatnot? It seems like we could just expose into_vec on CString (I don't think we currently have that?) in which case we wouldn't need to enhance the C string API?

@clarfonthey
Copy link
Contributor Author

We currently do have that method, but the main downside I see is that you either a) have to go unsafe and assert yourself that there are no internal NULs or b) pay an extra penalty checking yourself.

I also figure that this is similar to what you get from String, where you either have to go unsafe or pay an extra penalty if you convert to Vec.

And considering how they'd all just be shallow façades over Vec methods anyway I don't think it'd be too unreasonable to add them.

@alexcrichton
Copy link
Member

Ok well in any case the original thoughts of the libs team were that lacking motivation/benchmarks we'd be inclined to close this PR. Would you like to gather that though before closing?

@clarfonthey
Copy link
Contributor Author

Personally I think it's best to close this now and wait until more discussion/an RFC happens before continuing. I can reopen later if there's a desire to merge this later.

@clarfonthey clarfonthey deleted the c_string branch April 15, 2018 20:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.