C String Buffers in Rust

Tue, 2 Feb 2021

Intro
Open and close a file the wrong way
Minimal working example
Interlude: Why not use CStr or CString from std?
A string buffer structure
Data served three ways
Examples
Digression: Chasing dragons

Intro

I needed to call some C functions from Rust which take a character buffer and buffer length. The C function then fills the buffer (partly) with string data.

My particular case was using a vendor library for interacting with a motion control PLC. Since you probably don’t have these libraries or a controller to test with, we’ll use fgets from libc as an example, since it has a similar interface. The C declaration is:

char *fgets(char *s, int size, FILE *stream);

Like my PLC function, fgets takes a buffer, buffer size, and an object to read from. fgets will return the s pointer if data are read (and s now points to a null-terminated string). Otherwise, fgets returns a null pointer and we would have to jump through hoops with ferror to determine whether there was an error or just no more data.

In this post, I’ll simplify error checking and treat a null return value as end of file. Nobody should use this code as an example of how to read a file in Rust anyway :)

Code is available at https://github.com/duelafn/blog-code/tree/main/2021/c-string-buffer

Open and close a file the wrong way

Rust has fine file management tools, but if we want to use fgets, we have to open and close our file using libc. I’ll include the code here along with required prefix code and imports so you can run the examples below. I’ll have to do similar things to open and close the connection to my PLC.

    use std::ffi::CString;
    type FILE = std::ffi::c_void;
    extern {
        fn fgets(buf: *mut i8, n: i32, stream: *mut FILE) -> *mut i8;
        fn fopen(pathname: *const i8, mode: *const i8) -> *mut FILE;
        fn fclose(stream: *mut FILE) -> i32;
    }

    let pathname = CString::new("test.txt").expect("CString::new failed");
    let mode = CString::new("r").expect("CString::new failed");
    let fh = unsafe { fopen(pathname.as_ptr(), mode.as_ptr()) };
    if !fh.is_null() {
        // ... do stuff ...
        unsafe { fclose(fh) };
    }

The question now is: how do we “... do stuff ...”?

Minimal working example

If we only needed to do this once, we might come up with the following (fh is the file handle opened above). I’ll say more about these steps when describing the more complete solution later.

// Initialize a buffer of all zeroes
let mut buf: Vec<u8> = vec![0; 128];

// Pass a mutable pointer to our foreign function,
// We cast it to be of the right type
let rv = unsafe { fgets(buf.as_mut_ptr() as *mut i8, buf.len() as i32, fh) };

if !rv.is_null() {
    // Search for the position of the null byte
    let strlen = match buf.iter().position(|&x| 0 == x) {
        Some(n) => n,
        None    => buf.len(),
    };

    // Chop off null and trailing garbage
    buf.truncate(strlen);

    // Interpret the bytes as a string, provided they are a valid UTF-8 sequence
    let result = String::from_utf8(buf);  // Result<String, FromUtf8Error>

    // ... use the result ...
}

This gets the job done without copying: the Vec memory is the same memory used by the resulting String. The code is ugly though, so we’d like to hide it in a structure. We’ll also add some improvements and features along the way.

Interlude: Why not use CStr or CString from std?

A repeating theme in the documentation of CStr and CString is that these structures contain exactly one null byte at the end of the string. While our final strings won’t contain null bytes, the full buffer will contain multiple null bytes. Therefore, our basic buffer type will need to be something more general than a CStr or CString.
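
To make that concrete, here is a small sketch that checks which byte sequences the std constructors accept. The helper names are mine and not part of std. A string followed by several null bytes, which is what a partly filled C buffer looks like, is rejected by both types:

```rust
use std::ffi::{CStr, CString};

/// Hypothetical helper: can these bytes be viewed as a CStr?
/// CStr::from_bytes_with_nul requires exactly one null byte, at the very end.
fn accepted_as_cstr(bytes: &[u8]) -> bool {
    CStr::from_bytes_with_nul(bytes).is_ok()
}

/// Hypothetical helper: can these bytes be turned into a CString?
/// CString::new appends the terminator itself and rejects any null byte in its input.
fn accepted_as_cstring(bytes: &[u8]) -> bool {
    CString::new(bytes.to_vec()).is_ok()
}
```

Here accepted_as_cstr(b"hi\0") is true, while accepted_as_cstr(b"hi\0\0\0") and accepted_as_cstring(b"hi\0\0") are both false.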

The underlying data structure for a String is Vec<u8>, which is general enough and should make for easy conversions. In principle, we should be able to use Box<[u8]>, but allocating a large boxed array without first building it on the stack is awkward, so I’m going to stick with a Vec.

A string buffer structure

If a Vec is good enough for String, then it should be good enough for us. All we need is a Vec:

pub struct CStrBuf { vec: Vec<u8> }

In my constructor, I take the length as a usize because that is more natural in Rust code, but I verify that the buffer length will fit into the i32 that we’ll use when passing it to our foreign code.

impl CStrBuf {
    pub fn new(len: usize) -> CStrBuf {
        if len > (i32::MAX as usize) {
            panic!("Expected length <= i32::MAX");
        }
        CStrBuf { vec: vec![0; len] }
    }

We will need pointers and a buffer length to pass to our foreign functions. Vec’s as_ptr, as_mut_ptr, and len methods return the wrong types though (*const u8, *mut u8, and usize), so our wrappers convert to the types we want. Since the pointer cast is only between 8-bit unsigned and signed integers, it is harmless and free. The length conversion is potentially troublesome, but we dealt with that using a check in our constructor.

    pub fn as_ptr(&self) -> *const i8 {
        self.vec.as_ptr() as *const i8
    }
    pub fn as_mut_ptr(&mut self) -> *mut i8 {
        self.vec.as_mut_ptr() as *mut i8
    }
    pub fn buffer_len(&self) -> i32 {
        self.vec.len() as i32
    }
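
One portability caveat: C’s char is not signed on every target. On x86-64 Linux std::os::raw::c_char is i8, but on 64-bit ARM Linux it is u8. As an aside, the following sketch shows the same wrappers written with the platform aliases instead of i8 and i32. PortableBuf is a hypothetical name, and the extern declarations would need to use the same aliases for this to line up.

```rust
use std::os::raw::{c_char, c_int};

/// Hypothetical variant of the buffer that uses the platform's C type aliases.
pub struct PortableBuf {
    vec: Vec<u8>,
}

impl PortableBuf {
    pub fn new(len: usize) -> PortableBuf {
        // The length must be representable as a C int.
        assert!(len <= c_int::MAX as usize, "Expected length <= c_int::MAX");
        PortableBuf { vec: vec![0; len] }
    }

    /// Pointer typed as C's `char *`, whatever its signedness on this target.
    pub fn as_mut_ptr(&mut self) -> *mut c_char {
        self.vec.as_mut_ptr() as *mut c_char
    }

    /// Buffer length typed as C's `int`.
    pub fn buffer_len(&self) -> c_int {
        self.vec.len() as c_int
    }
}
```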

Just because we aren’t using a CString, doesn’t mean we can’t use it for inspiration. Browsing through the source for CString, we find a nugget of optimization. Apparently, iter().position() can be an order of magnitude slower than using “memchr”. Unfortunately, if we want to use memchr we have to add an external dependency. Feel free to swap the “match” lines if you’re unwilling to do so.

    pub fn strlen(&self) -> usize {
        // match self.vec.iter().position(|&x| 0 == x) {
        match memchr::memchr(0, &self.vec) {
            Some(n) => n,
            None    => self.vec.len(),
        }
    }
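
The fallback arm matters when a foreign function fills the whole buffer without writing a terminator. fgets always terminates its output, but I wouldn’t assume that of every vendor function. A dependency-free sketch of the same logic as a standalone function over a byte slice:

```rust
/// Length of the string at the start of `buf`: the number of bytes before the
/// first null byte, or the whole buffer length if there is no null byte at all.
fn strlen(buf: &[u8]) -> usize {
    buf.iter().position(|&b| b == 0).unwrap_or(buf.len())
}
```

With this, strlen(b"abc\0def\0") is 3 (everything after the first null is ignored), and strlen(b"abc") is also 3 because no null byte was found.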

Data served three ways

Now the interesting methods: how do we get usable Rust Strings (or str) out of our buffer? There are three ways we might use a buffer:

1. A one-off buffer: we fill the buffer, then won’t need it again (as a buffer). We then want to reinterpret its contents as a string without copying.
2. A multi-use buffer where, after filling the buffer, we want to extract the string contents for storage and use in other parts of our code. In this case we have no choice but to copy the string contents out, but we need only copy up to the null byte, not the whole buffer.
3. A multi-use buffer where we process the string contents (read only) immediately after each read, and can discard them when we next fill the buffer.

There are situations where any of these can make sense, and luckily, we can support all of these with our CStrBuf.

Single-use buffer

Our zero-copy string conversion has to consume (invalidate) the buffer struct. We signal this to the Rust compiler by taking self by value (mut self) rather than by reference (&mut self); the mut only lets us truncate the Vec inside the method. Because the method takes ownership, the compiler won’t let us use the buffer after converting it to a string. Code attempting to do so won’t even compile.

    pub fn into_string(mut self) -> Result<String, std::string::FromUtf8Error> {
        let len = self.strlen();
        self.vec.truncate(len);
        return String::from_utf8(self.vec);
    }

NOTE: With let mut content = buf.into_string().unwrap(), we get a String in content whose capacity equals our original buffer size. The extra capacity might be useful if we intend to modify the string, but otherwise it may just be wasted space. If you intend to keep content around for a while without modifying it, weigh the cost of shrink_to_fit() (by my understanding this will copy the string contents) against the cost of the wasted capacity (when the content does not use the full buffer).
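
The capacity effect is easy to observe. Below is a minimal sketch that repeats the into_string steps as a free function over a plain Vec, so it stands alone. The names filled_buffer and capacities are hypothetical helpers for the demonstration.

```rust
use std::string::FromUtf8Error;

/// Same steps as into_string above, written as a free function over a Vec.
fn into_string(mut buf: Vec<u8>) -> Result<String, FromUtf8Error> {
    let len = buf.iter().position(|&b| b == 0).unwrap_or(buf.len());
    buf.truncate(len); // changes the length, not the allocated capacity
    String::from_utf8(buf) // reuses the Vec's allocation
}

/// Hypothetical helper: a zeroed buffer of `size` bytes with `text` at its start.
/// Panics if `text` is longer than `size`.
fn filled_buffer(text: &str, size: usize) -> Vec<u8> {
    let mut buf = vec![0u8; size];
    buf[..text.len()].copy_from_slice(text.as_bytes());
    buf
}

/// Returns (length, capacity before shrinking, capacity after shrinking).
fn capacities(text: &str, size: usize) -> (usize, usize, usize) {
    let mut s = into_string(filled_buffer(text, size)).expect("valid UTF-8");
    let before = s.capacity();
    s.shrink_to_fit();
    (s.len(), before, s.capacity())
}
```

Calling capacities("hello", 128) gives a length of 5 and a capacity before shrinking of at least 128. How far the capacity drops afterwards is up to the allocator.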

Multi-use buffer, long-life strings

In this case, we are forced to copy the string content out of our buffer so that the string can outlive the buffer. The newly copied string will be right-sized (capacity == length), and can have its own lifetime independent of the buffer or reuse of the buffer.

The copy is made with slice::to_vec:

    pub fn to_string(&self) -> Result<String, std::string::FromUtf8Error> {
        let len = self.strlen();
        return String::from_utf8(self.vec[0..len].to_vec());
    }

Multi-use buffer, short-lived read-only str

It is common to process buffer data immediately upon receiving it, either handling it fully or parsing the string into some other data structure. In this case, we might only need to borrow a str reference, to be dropped before reading the next chunk into our buffer.

Rust’s borrow checker handles this with ease: we can borrow a slice of the Vec as a str. The Rust compiler will allow us to use that str as long as we do not modify it and do not attempt to write to the buffer before finishing our work with the borrowed str.

The conversion uses str::from_utf8:

    pub fn to_str(&self) -> Result<&str, std::str::Utf8Error> {
        let len = self.strlen();
        return std::str::from_utf8(&self.vec[0..len]);
    }
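
The read, process, reuse pattern can be sketched without any C code. In this sketch fake_fill is a hypothetical stand-in for the foreign function, and as_str repeats the to_str logic over a plain slice. Each borrowed str is dropped at the end of its loop iteration, so the next write to the buffer is accepted by the compiler.

```rust
/// Hypothetical stand-in for a C function: copies `msg` into `buf` and
/// null-terminates it, truncating if needed. Bytes after the terminator are
/// left as they were. Use ASCII input here, since truncation is byte-wise.
fn fake_fill(buf: &mut [u8], msg: &str) {
    if buf.is_empty() {
        return;
    }
    let n = msg.len().min(buf.len() - 1);
    buf[..n].copy_from_slice(&msg.as_bytes()[..n]);
    buf[n] = 0;
}

/// Borrow the bytes before the first null byte as a str.
fn as_str(buf: &[u8]) -> Result<&str, std::str::Utf8Error> {
    let len = buf.iter().position(|&b| b == 0).unwrap_or(buf.len());
    std::str::from_utf8(&buf[..len])
}

/// Process several messages through one 16-byte buffer, summing their lengths.
fn total_bytes(messages: &[&str]) -> usize {
    let mut buf = vec![0u8; 16];
    let mut total = 0;
    for msg in messages {
        fake_fill(&mut buf, msg);
        let s = as_str(&buf).expect("valid UTF-8");
        total += s.len();
        // The borrow held by `s` ends here, so the next fake_fill may write again.
    }
    total
}
```

After filling with "hello world" and then "hi", the buffer still holds stale bytes from the first message beyond the terminator, but as_str only sees "hi".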

Examples

See the tests in the code repository for examples using each of these interfaces.

Digression: Chasing dragons

When I started writing this post, I had intended to avoid zero initialization of the buffer Vec, since our foreign functions were going to fill the buffer for us. This was always going to be a silly micro-optimization since any performance-oriented code will create only a single buffer and reuse it, but I wanted to see how it would be done.

The problem is that uninitialized memory is not just dangerous from a “you might have garbage in there” standpoint. The compiler has an “undef” marker that it uses for various optimizations. Rust will mark uninitialized memory with that same marker. The consequence is that if the compiler can prove that you are using uninitialized memory, strange things happen.

Attempting to create a safe wrapper around an uninitialized block of memory is “difficult” for this situation because the CStrBuf structure can never know whether the user actually used the buffer. For instance, a user might call as_mut_ptr() and then never write to it, or never even pass it to C due to some other condition. If Rust can prove that the write never happens, we could end up reading “undef”.
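
For illustration only, here is a sketch of the one-shot case where the write and the read live in the same function, so we know the write happened. It is not a safe general-purpose wrapper, which is exactly the difficulty described above. fake_c_write is a hypothetical stand-in for a C function that is guaranteed to null-terminate its output, and the soundness of the sketch rests entirely on that guarantee.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Hypothetical stand-in for a C function that writes `msg` and a null byte.
///
/// Safety: `dst` must be valid for writes of `msg.len() + 1` bytes.
unsafe fn fake_c_write(dst: *mut c_char, msg: &[u8]) {
    unsafe {
        std::ptr::copy_nonoverlapping(msg.as_ptr(), dst as *mut u8, msg.len());
        *dst.add(msg.len()) = 0;
    }
}

/// Allocate without zeroing, let the "C" side fill the memory, then adopt
/// only the bytes that were actually written.
fn read_without_zeroing(cap: usize, msg: &[u8]) -> String {
    assert!(msg.len() < cap, "message plus null byte must fit");
    // Allocated but not initialized: the Vec's length is still 0.
    let mut vec: Vec<u8> = Vec::with_capacity(cap);
    unsafe {
        fake_c_write(vec.as_mut_ptr() as *mut c_char, msg);
        // Scan through the raw pointer up to the terminator, so only bytes
        // the writer initialized are ever read.
        let len = CStr::from_ptr(vec.as_ptr() as *const c_char).to_bytes().len();
        // Every byte in 0..len has been written, and len < cap.
        vec.set_len(len);
    }
    String::from_utf8(vec).expect("valid UTF-8")
}
```

If the writer ever skipped the write or the terminator, the scan would read uninitialized memory, and nothing in the types prevents that.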

In the end, while there may be ways to make a safe uninitialized buffer, I decided to let that dragon roam free. There just isn’t any real benefit in this particular application.
