Add a way to return allocated strings to C and then free them later #25777

Merged (2 commits) on Jun 11, 2015

shepmaster
Member

As far as I was able to determine, it's currently impossible to allocate a C NUL-terminated string in Rust and then return it to C (transferring ownership), without leaking memory. There is support for passing the string to C (borrowing).

To complicate matters, it's not possible for the C code to just call free on the allocated string, due to the different allocators in use.

CString has no way to recreate itself from a pointer. This commit adds one. This is complicated a bit because Rust Vecs want the pointer, size, and capacity.

To deal with that, another method to shrink and "leak" the CString to a char * is also provided.

We can then use strlen to determine the length of the string, which must match the capacity.

TODO

  • Improve documentation
  • Add stability markers
  • Convert to Box<[u8]>

Example code

With this example code:

#![feature(libc)]
#![feature(cstr_to_str)]
#![feature(c_str_memory)]

extern crate libc;

use std::ffi::{CStr,CString};

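#[no_mangle]
pub extern fn reverse(s: *const libc::c_char) -> *const libc::c_char {
    // Borrow the caller's string; ownership of `s` stays with C.
    let s = unsafe { CStr::from_ptr(s) };
    let s2 = s.to_str().unwrap();
    let s3: String = s2.chars().rev().collect();
    let s4 = CString::new(s3).unwrap();
    // Hand ownership of the new allocation to the caller.
    s4.into_ptr()
}

#[no_mangle]
pub extern fn cleanup(s: *const libc::c_char) {
    // Rebuild the CString and drop it right away, so Rust's allocator frees it.
    unsafe { CString::from_ptr(s) };
}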

Compiled using rustc --crate-type dylib str.rs, I was able to link against it from C (gcc -L. -l str str.c -o str):

#include <stdio.h>

extern char *reverse(char *);
extern void cleanup(char *);

int main() {
  char *s = reverse("Hello, world!");
  printf("%s\n", s);
  cleanup(s);
}

As well as dynamically link via Ruby:

require 'fiddle'
require 'fiddle/import'

module LibSum
  extend Fiddle::Importer

  dlload './libstr.dylib'
  extern 'char* reverse(char *)'
  extern 'void cleanup(char *)'
end

s = LibSum.reverse("hello, world!")
puts s
LibSum.cleanup(s)
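
For readers on current stable Rust, here is a minimal sketch of the same pair of functions. It assumes the names under which these methods were later stabilized, `CString::into_raw` and `CString::from_raw`, which use `*mut c_char`. The `main` function is only a stand-in for the C or Ruby caller, and a real `cdylib` would also export the functions with `#[no_mangle]` (written `#[unsafe(no_mangle)]` on the 2024 edition).

```rust
use std::ffi::{c_char, CStr, CString};

/// Reverses a NUL-terminated UTF-8 string and hands the result to the caller.
pub extern "C" fn reverse(s: *const c_char) -> *mut c_char {
    // Borrow the caller's string; the caller keeps ownership of `s`.
    let input = unsafe { CStr::from_ptr(s) };
    let reversed: String = input.to_str().unwrap().chars().rev().collect();
    // `into_raw` gives up ownership on purpose; `cleanup` must reclaim it.
    CString::new(reversed).unwrap().into_raw()
}

/// Frees a string previously returned by `reverse`.
pub extern "C" fn cleanup(s: *mut c_char) {
    if s.is_null() {
        return;
    }
    // Rebuild the CString so that dropping it uses Rust's allocator.
    drop(unsafe { CString::from_raw(s) });
}

fn main() {
    let input = CString::new("Hello, world!").unwrap();
    let out = reverse(input.as_ptr());
    let text = unsafe { CStr::from_ptr(out) }.to_str().unwrap().to_owned();
    println!("{}", text);
    cleanup(out);
}
```

The null check in `cleanup` is an addition for the sketch; it is not part of the example above.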

@rust-highfive
Collaborator

r? @huonw

(rust_highfive has picked a reviewer for you, use r? to override)

let len_with_nul = len as usize + 1;
let vec = Vec::from_raw_parts(ptr as *mut u8, len_with_nul, len_with_nul);
CString::from_vec_unchecked(vec)
}
Member

This function itself needs a warning that it only handles precisely the pointers returned from into_raw_ptr. Or otherwise explain the preconditions.

Member Author

Absolutely.

@shepmaster
Member Author

One question is whether this should be included in the higher-level documentation, as an example of how to return allocated strings via FFI.

@shepmaster
Member Author

Also note I've left all stability attributes off these methods, as I'm not sure what is most appropriate.

@alexcrichton alexcrichton added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label May 26, 2015
@alexcrichton
Member

Interesting! I agree with @bluss that this needs some documentation bolstering to outline some of the key aspects of this API:

  • Allocation/deallocation can only be done with these two methods, the pointer cannot be freed.
  • Incoming pointers to from_raw_ptr can only be those generated from into_raw_ptr.
  • To properly free a pointer from into_raw_ptr, one must call from_raw_ptr later on in the program.

In terms of naming this may want to consider either raw or ptr, but not both together. I think we've avoided that sort of terminology elsewhere in the standard library as it's a little wordy.

Stability-wise these just need explicit #[unstable] annotations with an appropriate feature name (e.g. c_str_memory) and a reason that explains why they're unstable. The reason can include reservations about the API, but it is also just frequently "recently added API".


At a higher level, I'm curious what you were doing to want to do this kind of operation? I would not expect this to be a common operation at the FFI boundary (due to the need to free the resources at some point).

@shepmaster
Member Author

I'm curious what you were doing to want to do this kind of operation?

I've not done anything specific myself, but I've helped various people who want to do something similar. It's a straightforward next step after an FFI example that adds two numbers: one that concatenates two strings.

I would not expect this to be a common operation at the FFI boundary

Perhaps I'm missing something, but how would you foresee any library that does heavy string processing to be called? One of the ideas I'm most excited about in Rust is writing a core library in Rust and then exposing C bindings. We can then use that library in other languages (Ruby is of particular interest to me).

I feel like CString is running into the same problem that Vec and Box have already tackled. It feels like one of the core types that wants to be transferred across the FFI boundary and be directly usable in a C caller.

  • Box has: from_raw and into_raw
  • Vec has from_raw_parts and (as_ptr, len, capacity)
  • CString will have from_raw and into_raw (to match Box, and address your naming points above)
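
To make the comparison concrete, here is a small sketch of the existing `Box` and `Vec` round trips. The helper names `box_round_trip` and `vec_round_trip` are hypothetical and exist only for illustration; the point is that `Box` needs only the pointer back, while `Vec` needs the pointer, length, and capacity.

```rust
use std::mem::ManuallyDrop;

/// Sends a boxed value through a raw pointer and back.
fn box_round_trip(value: u32) -> u32 {
    let raw: *mut u32 = Box::into_raw(Box::new(value));
    // A C caller would hold `raw` here; only the pointer has to survive.
    let boxed = unsafe { Box::from_raw(raw) };
    *boxed
}

/// Takes a Vec apart into (pointer, length, capacity) and rebuilds it.
fn vec_round_trip(v: Vec<u8>) -> Vec<u8> {
    // ManuallyDrop stops the original Vec from freeing the buffer.
    let mut v = ManuallyDrop::new(v);
    let (ptr, len, cap) = (v.as_mut_ptr(), v.len(), v.capacity());
    // All three values are required to rebuild the Vec correctly.
    unsafe { Vec::from_raw_parts(ptr, len, cap) }
}

fn main() {
    assert_eq!(box_round_trip(42), 42);
    assert_eq!(vec_round_trip(vec![1, 2, 3]), vec![1u8, 2, 3]);
}
```

A C string only carries a pointer across the boundary, which is why the `CString` version has to recover the length from the NUL terminator.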

@shepmaster
Member Author

due to the need to free the resources at some point

FWIW, I see Vec and Box having the same "problem" - you need to pass it back to Rust to properly drop the type and use the right allocator.

@alexcrichton
Member

Perhaps I'm missing something, but how would you foresee any library that does heavy string processing to be called?

I suppose it definitely depends on the exact library in play, but I would expect some other owned Box<T> is passed around where T contains an internal CString and then there are accessors which pull out the &CStr and return the underlying pointer.
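
A rough sketch of that opaque-handle pattern, under stated assumptions: the `Greeting` type and the `greeting_*` function names are hypothetical, and the sketch uses today's stable `std::ffi` API. C only ever sees a `*mut Greeting`, borrows the string through an accessor, and returns the handle to Rust to free it.

```rust
use std::ffi::{c_char, CStr, CString};

/// Hypothetical opaque type; the C side never looks inside it.
pub struct Greeting {
    text: CString,
}

/// Allocates a Greeting and transfers ownership of the handle to the caller.
pub extern "C" fn greeting_new(name: *const c_char) -> *mut Greeting {
    let name = unsafe { CStr::from_ptr(name) }.to_string_lossy();
    let text = CString::new(format!("Hello, {}!", name)).unwrap();
    Box::into_raw(Box::new(Greeting { text }))
}

/// Returns a borrowed pointer that stays valid until `greeting_free` is called.
pub extern "C" fn greeting_text(g: *const Greeting) -> *const c_char {
    let g = unsafe { &*g };
    g.text.as_ptr()
}

/// Takes the handle back and drops it, freeing the inner CString as well.
pub extern "C" fn greeting_free(g: *mut Greeting) {
    if !g.is_null() {
        drop(unsafe { Box::from_raw(g) });
    }
}

fn main() {
    let name = CString::new("Ruby").unwrap();
    let handle = greeting_new(name.as_ptr());
    let text = unsafe { CStr::from_ptr(greeting_text(handle)) };
    println!("{}", text.to_str().unwrap());
    greeting_free(handle);
}
```

The caller still has to hand the handle back for cleanup, so the deallocation question is the same as for a bare string.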

I feel like CString is running into the same problem that Vec and Box have already tackled. It feels like one of the core types that wants to be transferred across the FFI boundary and be directly usable in a C caller.

Yeah I do agree that this seems like a good tool to have in the toolbox when dealing with FFI, I'm just somewhat hesitant because memory management is often quite difficult when it comes to an FFI boundary and I don't think this will necessarily open up the door to easy management of strings per se. For example in the question you linked to the returned string would need to be deallocated at some point (assuming the incoming string is changed to *const c_char), and it starts making the Ruby bits pretty hairy pretty quickly.

Not that any of that should stop this at the gates, just something to keep in mind!

@shepmaster
Member Author

returned string would need to be deallocated at some point [...] and it starts making the Ruby bits pretty hairy pretty quickly.

But you'd have the same issue if you return the Box<T> to Ruby as well, right?

@shepmaster shepmaster force-pushed the cstring-return-to-c branch from ff86ab6 to 12151bc Compare May 26, 2015 22:50
@shepmaster
Member Author

@bluss @alexcrichton I've updated the docs a smidge and added the stability attributes.

@gkoz
Contributor

gkoz commented May 27, 2015

Perhaps I'm missing something, but how would you foresee any library that does heavy string processing to be called?

If the caller were to free the strings, wouldn't they want to use free, not this lib's custom destructor? Could CString be taught to support that?

@shepmaster
Member Author

If the caller were to free the strings, wouldn't they want to use free

Generally, no. Many larger C libraries have their own allocators, and you need to use them to prevent passing invalid pointers between allocators.

GLib and libxml2 are two examples that come to mind. You can potentially compile these libraries with alternate allocation functions — perhaps you have your own allocator that does extra profiling for example.

Could CString be taught to support that?

I don't know how, as I haven't seen anything parameterized by allocator yet. However, were this to ever exist, it could be neat to use libc::malloc as the allocator and potentially have to worry less about matching up.

@alexcrichton
Member

@shepmaster, @gkoz yeah I agree that we want to not recommend using free, and unfortunately even if CString used malloc as its backing memory we still wouldn't want to recommend calling free across library boundaries. Depending on how you link the libraries together, you can still end up with a mismatch of allocators once you cross the library boundary.

I think this all looks good to me, thanks @shepmaster! I realize now though that the sibling function of CString::from_raw is CStr::from_ptr, so these methods may actually want to use ptr instead of raw (just a minor nit).

cc @rust-lang/libs, with the renaming of raw => ptr I'm ok with this, any other reservations?

@aturon
Member

aturon commented May 28, 2015

@alexcrichton 👍

// determinable from the string length, and shrinking to fit
// is the only way to be sure.
let mut vec = self.inner;
vec.shrink_to_fit();
Contributor

I thought that shrink_to_fit does not guarantee that the capacity will be exactly equal to length. Isn't this important here?

Member

The implementation in Vec guarantees that vec.capacity() == vec.len(), so the vector has no more information than ptr + length at that point either. Maybe the docs should be updated to guarantee that(?).

Member Author

For reference, Vec::shrink_to_fit says:

Shrinks the capacity of the vector as much as possible.

It will drop down as close as possible to the length but the allocator may still inform the vector that there is space for a few more elements.

Contributor

Here's a quote from the docs:

It will drop down as close as possible to the length but the allocator may still inform the vector that there is space for a few more elements.

Maybe it should state more clearly what actually happens there, or be removed altogether if it no longer holds. Also, vec.capacity() == vec.len() does not mean that the actually allocated memory will be equal to vec.len(), does it?

Member

It doesn't. The vector needs the capacity value as a cookie to give to the allocator when freeing the memory. If len == capacity is fine in the vector internally, that's going to be a compatible cookie value to use.
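
As a small illustration of the property under discussion, the sketch below uses `Vec::into_boxed_slice`, which drops any excess capacity, instead of relying on `shrink_to_fit`. The helper name `exact_fit` is hypothetical. It returns the boxed slice's length together with the capacity of the `Vec` obtained by converting the box back.

```rust
/// Returns (length of the boxed slice, capacity after converting back to a Vec).
fn exact_fit(bytes: &[u8]) -> (usize, usize) {
    // Deliberately over-allocate so there is spare capacity to discard.
    let mut v = Vec::with_capacity(64);
    v.extend_from_slice(bytes);
    assert!(v.capacity() >= 64);

    // A boxed slice has no separate capacity field, only pointer + length.
    let boxed: Box<[u8]> = v.into_boxed_slice();
    let len = boxed.len();

    // Converting back yields a Vec whose capacity equals that length.
    let back = boxed.into_vec();
    (len, back.capacity())
}

fn main() {
    let (len, cap) = exact_fit(b"hello\0");
    println!("len = {}, cap = {}", len, cap);
}
```

With a boxed slice, the length is the only value that has to be recovered on the way back, which is what a `strlen`-based reconstruction can provide.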

Member Author

could this actually switch to using into_boxed_slice actually

@alexcrichton that was originally my first attempt, but I would have needed to change CString, as you noticed. I can give that refactoring a shot to start with, and then add this commit on top of that.

Member Author

as we already depend on shrink_to_fit to actually "shrink to fit"

Should the docs for this method be updated? Or maybe some non-doc-comments added that denote that it's being used to truly shrink-to-fit?

Member

I'd leave the shrink_to_fit docs as-is for now, but layering on a second commit sounds good to me!

Contributor

The note on allocators is, to my knowledge (I'm pretty sure I wrote that doc comment), a reservation for future behaviour that is not implemented.

Contributor

In particular, it might happen once we get local allocators, which naturally might favour a coarser granularity than full-blown jemalloc.

@pnkfelix
Member

cc me (I'm wondering about potential overlap with allocator APIs that should be coming in the future).

@shepmaster
Member Author

the sibling function of CString::from_raw is CStr::from_ptr, so these methods may actually want to use ptr instead of raw (just a minor nit).

@alexcrichton I actually deliberately chose to not call it that, because CStr::from_ptr can accept an arbitrary C string, whereas CString::from_raw can only accept the special one from CString::into_raw. I'm OK with changing it, if you feel there won't be more confusion.

@huonw
Member

huonw commented Jun 1, 2015

r? @alexcrichton (Transferring reviewership, I haven't been keeping track at all...)

@rust-highfive rust-highfive assigned alexcrichton and unassigned huonw Jun 1, 2015
@alexcrichton
Member

@shepmaster ah yes indeed! I think I would personally err on the side of ptr for consistency still, but others may feel differently!

@alexcrichton alexcrichton added the I-needs-decision Issue: In need of a decision. label Jun 2, 2015
@shepmaster shepmaster force-pushed the cstring-return-to-c branch from 12151bc to a425a42 Compare June 3, 2015 22:44
@shepmaster
Member Author

@alexcrichton OK, updated with the change to Box<[u8]> and the naming suggestions. Thanks!

pub unsafe fn from_ptr(ptr: *const libc::c_char) -> CString {
let len = libc::strlen(ptr);
let len_with_nul = len as usize + 1;
let vec = Vec::from_raw_parts(ptr as *mut u8, len_with_nul, len_with_nul);
Member

Instead of going through Vec, I think it may be best to go directly to a Box<[u8]> because it's where the pointer originally came from, for example:

let len = libc::strlen(ptr) + 1;
let slice = slice::from_raw_parts(ptr, len as usize);
CString { inner: mem::transmute(slice) }

Going through Vec and then possibly hitting shrink_to_fit may be adding a bit more indirection than necessary (and may also be somewhat less robust)
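
A self-contained sketch of that idea, not the code that was merged: the helper names `leak` and `reclaim` are hypothetical, and `CStr::from_ptr` stands in for `libc::strlen` so that no external crate is needed. The buffer is leaked as a `Box<[u8]>` and rebuilt from the pointer plus the length found by scanning for the NUL.

```rust
use std::ffi::{c_char, CStr};

/// Leaks a NUL-terminated buffer and returns it as a C string pointer.
fn leak(bytes_with_nul: Vec<u8>) -> *mut c_char {
    // The only NUL must be the last byte, otherwise the length recovered
    // later would not match the allocation.
    let nul_pos = bytes_with_nul.iter().position(|&b| b == 0);
    assert!(
        matches!(nul_pos, Some(i) if i + 1 == bytes_with_nul.len()),
        "expected exactly one NUL, at the end"
    );
    let boxed: Box<[u8]> = bytes_with_nul.into_boxed_slice();
    Box::into_raw(boxed) as *mut c_char
}

/// Rebuilds the box. Sound only for pointers that came from `leak`.
unsafe fn reclaim(ptr: *mut c_char) -> Box<[u8]> {
    unsafe {
        // Equivalent to strlen(ptr) + 1: the length including the NUL.
        let len = CStr::from_ptr(ptr).to_bytes_with_nul().len();
        Box::from_raw(std::ptr::slice_from_raw_parts_mut(ptr as *mut u8, len))
    }
}

fn main() {
    let ptr = leak(b"hello\0".to_vec());
    let boxed = unsafe { reclaim(ptr) };
    println!("{} bytes reclaimed", boxed.len());
}
```

If C code shortens the string by writing an earlier NUL, the recovered length no longer matches the allocation, so the buffer's length must not change while C holds the pointer.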

Member

Ah as one final note, could you add to the docs that this method will calculate the length of the string specified by ptr as well?

Member Author

@alexcrichton both concerns addressed! Let me know if the added doc line conveys what you intended.

@alexcrichton
Member

Looks great to me, thanks @shepmaster!

@shepmaster shepmaster force-pushed the cstring-return-to-c branch from a425a42 to e20a6db Compare June 6, 2015 15:22
@alexcrichton
Member

@bors: r+ e20a6db

@alexcrichton alexcrichton removed the I-needs-decision Issue: In need of a decision. label Jun 10, 2015
bors added a commit that referenced this pull request Jun 10, 2015
@bors
Contributor

bors commented Jun 10, 2015

⌛ Testing commit e20a6db with merge 01ab4f7...

@bors bors merged commit e20a6db into rust-lang:master Jun 11, 2015
@gkoz
Contributor

gkoz commented Jun 11, 2015

Isn't there a naming inconsistency now between slice and Box that provide "from/into_raw" and CString with "from/into_ptr"?

@shepmaster shepmaster deleted the cstring-return-to-c branch June 11, 2015 13:06
@shepmaster
Member Author

Isn't there a naming inconsistency now

Kind of, but there was already one with CStr::from_ptr. Since CString and CStr are conceptually closer than CString and Vec / Box, it made more sense to use something with ptr. If you check upthread, there's some discussion about naming as well.

@shepmaster
Member Author

@gkoz not that it can't be changed of course, but at least we thought about it some 😺

@gkoz
Contributor

gkoz commented Jun 11, 2015

@shepmaster I'd ask @alexcrichton to take another look at this. The from_raw fn names give a very strong hint that they only accept pointers returned by into_raw (while CStr::from_ptr can take any old char *). I consider breaking this convention a footgun. Maybe the names [for CStr and CString] should indeed differ because the contracts aren't the same.

Maybe slice::from_raw_parts is the wrong name because it actually doesn't require a Rust-produced pointer.
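
The difference between the two contracts can be sketched as follows. The sketch assumes the names under which the owning methods were eventually stabilized (`CString::into_raw` and `CString::from_raw`); the wrapper names `borrow_any` and `take_back` are hypothetical.

```rust
use std::ffi::{c_char, CStr, CString};

/// Borrows any valid NUL-terminated buffer; nothing is freed afterwards.
unsafe fn borrow_any<'a>(ptr: *const c_char) -> &'a CStr {
    unsafe { CStr::from_ptr(ptr) }
}

/// Takes ownership back. Sound only for a pointer from `CString::into_raw`.
unsafe fn take_back(ptr: *mut c_char) -> CString {
    unsafe { CString::from_raw(ptr) }
}

fn main() {
    // Static memory can be borrowed, because borrowing never deallocates.
    static LITERAL: &[u8] = b"static data\0";
    let borrowed = unsafe { borrow_any(LITERAL.as_ptr() as *const c_char) };
    println!("{}", borrowed.to_str().unwrap());

    // A Rust-allocated string can be borrowed and later owned again.
    let raw = CString::new("owned data").unwrap().into_raw();
    println!("{}", unsafe { borrow_any(raw) }.to_str().unwrap());
    let owned = unsafe { take_back(raw) };
    println!("{}", owned.to_str().unwrap());

    // `take_back(LITERAL.as_ptr() as *mut c_char)` would compile, but dropping
    // the result would pass static memory to Rust's allocator, which is
    // undefined behaviour.
}
```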

@alexcrichton
Member

@gkoz thanks for pointing this out, I hadn't considered the Box case (just CString/CStr). I'll add some notes to deal with these naming issues during stabilization.

@rkjnsn
Contributor

rkjnsn commented Jun 16, 2015

For what it's worth, I think raw is more clear.

@gkoz
Contributor

gkoz commented Jul 28, 2015

Apparently some APIs do want to free the strings you give them.

When writing a Zabbix loadable module:

You should note that when the result of an item is a string (text, message and log) Zabbix expects to receive a raw pointer to a previously allocated memory of the resulting string and once done with the result Zabbix will free(3) the memory.

Corroborated by https://github.com/zabbix/zabbix/blob/trunk/include/module.h#L109-114

/* NOTE: always allocate new memory for val! DON'T USE STATIC OR STACK MEMORY!!! */
#define SET_STR_RESULT(res, val)        \
(                       \
    (res)->type |= AR_STRING,       \
    (res)->str = (char *)(val)      \
)

So there might be some demand for a malloc-using CString variant after all.

I wonder how the RFC: Allow changing the default allocator is going to impact this. Looks like CString would be using malloc out of the box in this case.
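
Until something like that exists, one workaround is to copy the string into a buffer obtained from the C allocator. The following is only a sketch under several assumptions: the helper name `to_malloced` is hypothetical, `malloc` and `free` are declared by hand rather than through the `libc` crate, the program is assumed to link against the same C runtime as the consumer, and the `unsafe extern` block syntax needs Rust 1.82 or newer (older toolchains would write plain `extern "C"`).

```rust
use std::ffi::{c_char, c_void, CStr, CString};

unsafe extern "C" {
    // Hand-written declarations of the C allocator entry points.
    fn malloc(size: usize) -> *mut c_void;
    fn free(ptr: *mut c_void);
}

/// Copies `s`, including its NUL, into a malloc-allocated buffer that the
/// C side may release with free(3). Returns null if malloc fails.
fn to_malloced(s: &CStr) -> *mut c_char {
    let bytes = s.to_bytes_with_nul();
    unsafe {
        let buf = malloc(bytes.len()) as *mut u8;
        if buf.is_null() {
            return std::ptr::null_mut();
        }
        std::ptr::copy_nonoverlapping(bytes.as_ptr(), buf, bytes.len());
        buf as *mut c_char
    }
}

fn main() {
    let owned = CString::new("result").unwrap();
    let ptr = to_malloced(&owned);
    assert!(!ptr.is_null());
    println!("{}", unsafe { CStr::from_ptr(ptr) }.to_str().unwrap());
    // Stand-in for the consumer (for example Zabbix) freeing the string.
    unsafe { free(ptr as *mut c_void) };
}
```

The extra copy is the cost of this approach, and the earlier caveat still applies: it only works when both sides really share the same `malloc`/`free` implementation.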
