Stream: t-libs

Topic: Read + uninit memory


Steven Fackler (Jan 09 2020 at 00:46, on Zulip):

I have a prototype of a more robust approach towards use of uninitialized buffers in the Read trait, if people are interested: https://paper.dropbox.com/doc/IO-Buffer-Initialization--AneImNfaJpZ0LCvEngs0_eFTAg-MvytTgjIOTNpJAS6Mvw38 for background, and https://github.com/sfackler/read-buf for the code

Steven Fackler (Jan 09 2020 at 00:46, on Zulip):

I'm going to be starting to write an RFC for it all this week

Amanieu (Jan 09 2020 at 09:51, on Zulip):

I've been looking into some of this stuff for non-libstd work.

Amanieu (Jan 09 2020 at 09:51, on Zulip):

One thing that I find very useful is for read to return an initialized slice rather than a usize.

Amanieu (Jan 09 2020 at 09:51, on Zulip):

Basically it returns the part of the buffer that has been filled.

Amanieu (Jan 09 2020 at 09:52, on Zulip):

You can easily get the usize back by calling .len().

Amanieu (Jan 09 2020 at 10:01, on Zulip):

Regarding the buffer argument, what I had in mind what a ReadBuf trait which is implemented by both &mut [u8] and &mut [MaybeUninit<u8>]. The read method would then become fn read(buf: impl ReadBuf) -> &[u8].

Amanieu (Jan 09 2020 at 10:01, on Zulip):

However this is no longer object-safe since it uses generics.

Steven Fackler (Jan 09 2020 at 13:42, on Zulip):

Taking an impl ReadBuf makes it not object safe which we need.

Steven Fackler (Jan 09 2020 at 13:43, on Zulip):

The returning a slice approach ends up being pretty awkward - there are no guarantees as to where that slice actually points, which you need if you're trying to have an interface that converts [MaybeUninit<u8>] into [u8].

Amanieu (Jan 09 2020 at 13:46, on Zulip):

I don't understand the problem with returning a slice. The slice points to the buffer that was provided by the caller. Specifically the part of the buffer that was filled by the read operation (the rest of the buffer still being uninitialized).

Steven Fackler (Jan 09 2020 at 14:06, on Zulip):

Read is not an unsafe trait, so all a caller knows is that the returned buffer points to some slice of initialized bytes. It is hopefully a slice from the head of the input buffer, but it could be a slice from the middle of it, or a slice into static memory, or something else

Steven Fackler (Jan 09 2020 at 14:06, on Zulip):

code doing something like e.g. read_exact needs to know that the head of the input buffer is the bit being initialized if it's going to be able to return the entire buffer as initialized memory at the end

Lokathor (Jan 09 2020 at 18:20, on Zulip):

you can always check the pointer address if you care that much about the exact location.

nikomatsakis (Jan 11 2020 at 14:03, on Zulip):

@Steven Fackler thinking a bit more about this, I'm wondering if the right way to think about the RFC is that we are adding a type to the std library for working with partially initialized buffers -- and then using that in the Read trait? (In other words, this is a general abstraction that may have use elsewhere, presumably..?)

Steven Fackler (Jan 11 2020 at 17:18, on Zulip):

that could make sense, though I'm not totally sure what other use cases would come up. Minimally we should probably parameterize it instead of being locked down to u8

Steven Fackler (Jan 11 2020 at 17:21, on Zulip):

the version for vectored IO is going to have to be locked down to being read specific since we need to match up with readv's ABI

Lokathor (Jan 11 2020 at 20:22, on Zulip):

code doing something like e.g. read_exact needs to know that the head of the input buffer is the bit being initialized if it's going to be able to return the entire buffer as initialized memory at the end

Actually, this is interesting. As I read it again I realize that maybe I don't understand the imagined usage pattern. Can you give an example?

I've not yet seen code that makes an uninit buffer, fills it a part at a time, and then returns that whole thing to some outer scope for some reason.

Steven Fackler (Jan 11 2020 at 20:30, on Zulip):

fn read_exact2(&mut self, buf: &mut [MaybeUninit<u8>]) -> io::Result<&[u8]>

Lokathor (Jan 11 2020 at 20:44, on Zulip):

right, so you get some buffer of initialized bytes back, that seems like a totally fine API

Lokathor (Jan 11 2020 at 20:45, on Zulip):

when you're done with that pass, you just go again with the full MaybeUninit slice from last pass

simulacrum (Jan 11 2020 at 20:46, on Zulip):

read_exact needs to do multiple reads, but if the returned buffers aren't consecutive (i.e., may not be from the head) then you can't return a consecutive buffer

Lokathor (Jan 11 2020 at 20:46, on Zulip):

oh you mean for the default read_exact method?

simulacrum (Jan 11 2020 at 20:46, on Zulip):

well, it too, but yes

Lokathor (Jan 11 2020 at 20:47, on Zulip):

well if you're not using a default read_exact method, then whatever internals did the first read clearly can continue to do the second read until the read_exact is complete

Steven Fackler (Jan 11 2020 at 20:49, on Zulip):

I want you to write the default implementation of that read_exact2 method

simulacrum (Jan 11 2020 at 20:49, on Zulip):

Maybe? But the default is not possible to write I think is the point

Lokathor (Jan 11 2020 at 20:50, on Zulip):

If you have a new read_exact, you have a new read also, naturally, wouldn't you say?

Lokathor (Jan 11 2020 at 20:50, on Zulip):

so you'd have read2 and then read_exact2, right?

Steven Fackler (Jan 11 2020 at 20:50, on Zulip):

yes

Steven Fackler (Jan 11 2020 at 20:50, on Zulip):

Write the implementation of read_exact2 that uses read2

Lokathor (Jan 11 2020 at 20:51, on Zulip):

i am on my phone, so please understand this will take a moment, but sure

Lokathor (Jan 11 2020 at 21:03, on Zulip):
/// you should init from the front of the buffer
fn read2(&mut self, buf: &mut [MaybrUninit<u8>]) -> io::result<&[u8]>{
  // assumed
}

/// the default impl fails if your read2 doesn't fill from the front of the buffer
fn read_exact2(&mut self, start: &mut [MaybrUninit<u8>]) -> io::result<&[u8]>{
  let mut buf = &mut *start;
  while buf.len() > 0 {
    let in_addr = buf.as_ptr() as usize;
    let out = self.read(buf)?;
    if out.as_ptr() as usize != in_addr {
      // didn't fill from the fromt
      return Err(they)
    } else {
      // this much is init now  chop away and continue
      let (head, tail) = buf.split_at_mut(out.len());
      buf = tail;
    }
  }
  // all of it is init now
  unsafe { transmute_slice_however(start) };
}
Steven Fackler (Jan 11 2020 at 21:06, on Zulip):

Right, and so everyone else that wants to incrementally fill a buffer like read_exact2 does has to compare raw pointers, which is not a great experience

Lokathor (Jan 11 2020 at 21:07, on Zulip):

..what did you expect from MaybeUninit?

Lokathor (Jan 11 2020 at 21:07, on Zulip):

MaybeUninit isn't a great experience in the first place.

simulacrum (Jan 11 2020 at 21:08, on Zulip):

I seem to recall something to the effect of it not being sound to stitch together buffers like this even if you compare pointers

Lokathor (Jan 11 2020 at 21:08, on Zulip):

as i also said, If you override the read_exact2 your own version of read_exact2 can skip the pointer comparison step because you know what your own internal read2 method does

Steven Fackler (Jan 11 2020 at 21:08, on Zulip):

in this case it's not stitching, it's making sure they're the same thing all along so that should be fine

Steven Fackler (Jan 11 2020 at 21:09, on Zulip):

I don't agree that we should accept a bad user experience because some part of the API has a non-ideal user experience

Lokathor (Jan 11 2020 at 21:09, on Zulip):

there's no stitching! please note the part where we have start, then reborrow it, and edit along the reborrowed one, and then release that reborrow and transmute start once at the end

simulacrum (Jan 11 2020 at 21:10, on Zulip):

I guess yeah, it seems plausibly fine to do this "stitching" (yes, not really, but at a high level)

Lokathor (Jan 11 2020 at 21:10, on Zulip):

@Steven Fackler Users are always free to override a default method

Lokathor (Jan 11 2020 at 21:10, on Zulip):

rather, people implementing the trait i should say

Steven Fackler (Jan 11 2020 at 21:10, on Zulip):

I don't understand how that's relevant to the claim that this read2 API forces awkward pointer double-checking

Lokathor (Jan 11 2020 at 21:11, on Zulip):

because if you overwrite read_exact2 with your own version you don't need to pointer check at all

Steven Fackler (Jan 11 2020 at 21:13, on Zulip):

someone implementing the Read trait doesn't need to do any checking of any kind in read_exact2 because they don't need to write it in the first place!

Steven Fackler (Jan 11 2020 at 21:14, on Zulip):

I am talking about other code in other places that wants to do a similar thing of incrementally initializing a buffer

Lokathor (Jan 11 2020 at 21:14, on Zulip):

that's what i asked for an example for that you never really gave a usage of :P

Steven Fackler (Jan 11 2020 at 21:15, on Zulip):

read_exact is an example. Unless you're claiming that that is the only code that ever incrementally initializes a buffer, then there are other instances of that thing

Steven Fackler (Jan 11 2020 at 21:15, on Zulip):

for example read_to_end

Steven Fackler (Jan 11 2020 at 21:16, on Zulip):

but in any case, the other major problem with that version of read2 is that the default implementation of read2 itself that needs to delegate to read can't be implemented sufficiently efficiently

Steven Fackler (Jan 11 2020 at 21:17, on Zulip):

there are also APIs like parsers that take in a slice of bytes and either process them or say "no, I need more stuff"

Steven Fackler (Jan 11 2020 at 21:17, on Zulip):

which then needs to incrementally fill a buffer

Lokathor (Jan 11 2020 at 21:17, on Zulip):

So the usual desired usage of MaybeUninit for a buffer that I see is something like

let cap = 5000;
let v = Vec::with_capacity(cap);
let buf = unsafe { slice::from_raw_parts(v.as_mut_ptr() as *mut MaybeUninit<u8>, cap) };
while let Ok(bytes_this_pass) = source.read_to_unit_buffer(&mut buf) {
  do_work(bytes_this_pass)
}

and people usually do not desire to actually _keep_ the bytes from pass to pass of the loop

Steven Fackler (Jan 11 2020 at 21:18, on Zulip):

what happens when do_work says "bytes_this_pass is too small for me to do anything with"?

Lokathor (Jan 11 2020 at 21:18, on Zulip):

then you return Err :P

there's always the possibilty of problems

Steven Fackler (Jan 11 2020 at 21:19, on Zulip):

but it's not an error condition

Lokathor (Jan 11 2020 at 21:19, on Zulip):

or you can use read_uninit_exact if you must have some quantity

Steven Fackler (Jan 11 2020 at 21:19, on Zulip):

you don't always know the exact quantity

Steven Fackler (Jan 11 2020 at 21:19, on Zulip):

e.g. you are trying to parse a single json object

Lokathor (Jan 11 2020 at 21:20, on Zulip):

if you're steaming in json then be prepared to handle part of a json; that's a problem you might have both with and without an uninit buffer. a normal read might also cut out partway through an object, you just plain need to be able to handle that somehow

Steven Fackler (Jan 11 2020 at 21:21, on Zulip):

you handle it by reading the rest of the object

Lokathor (Jan 11 2020 at 21:22, on Zulip):

so if your buffer is 10 (to pick a number), and you read 10, and then you're partway through an object.. you gotta do something with that 10 before you can read another 10 bytes, so you already have to be able to handle partial objects

Lokathor (Jan 11 2020 at 21:22, on Zulip):

also, before I forget: I absolutely wouldn't suggest putting an uninit based read onto the existing Read trait as ant sort of "default method" on top of read, that would be crazy.

Lokathor (Jan 11 2020 at 21:23, on Zulip):

read2 would be like, a totally different method on a new trait for types that are able to handle having MaybeUninit buffers safely

Steven Fackler (Jan 11 2020 at 21:23, on Zulip):

why would that be crazy

Lokathor (Jan 11 2020 at 21:24, on Zulip):

i mean what would the read2 default impl even be?

Steven Fackler (Jan 11 2020 at 21:25, on Zulip):

the problem with that proposal is that it doesn't have a good default impl

Steven Fackler (Jan 11 2020 at 21:25, on Zulip):

it is important to not bifurcate the read trait

Steven Fackler (Jan 11 2020 at 21:25, on Zulip):

code should be able to take advantage of readers that can work with uninit memory when possible, without needing to just forbid other readers in the type signature

Steven Fackler (Jan 11 2020 at 21:26, on Zulip):

there are other proposals that do have a good default impl, and don't force a second trait

Lokathor (Jan 11 2020 at 21:27, on Zulip):

Well, returning the slice of what you wrote is mostly just better for a lot of uses, for both &[u8] or &[MaybeUninit<u8>]

Steven Fackler (Jan 11 2020 at 21:27, on Zulip):

better than what?

Lokathor (Jan 11 2020 at 21:28, on Zulip):

so, in the sense that, people who want Result<usize> are somehow already best served by getting Result<usize>, yes that trait should continue to use Result<usize> even for the uninit buffer version

Lokathor (Jan 11 2020 at 21:28, on Zulip):

i mean better than Result<usize>

Amanieu (Jan 11 2020 at 22:00, on Zulip):

read2 could just guarantee that the returned slice starts at the beginning of the buffer.

Lokathor (Jan 11 2020 at 22:03, on Zulip):

Since the trait overall is safe, the writer of the trait is technically within their rights to just not do that though, and unsafe code can't simply rely on it correctly being start of buffer, so they'd have to do the pointer value check.

Lokathor (Jan 11 2020 at 22:04, on Zulip):

I think that would only end up happening with a deliberately hostile impl, but it could still happen at all.

Amanieu (Jan 11 2020 at 22:05, on Zulip):

Anyways, I kinda like the ReadBuf proposal.

simulacrum (Jan 11 2020 at 22:06, on Zulip):

And I believe we've pretty much agreed that we don't want to make Read (or an equivalent, but new) trait unsafe, even if we ignore backwards compat concerns? i.e., it's too common to write to force people to think about the unsafety aspects?

Amanieu (Jan 11 2020 at 22:06, on Zulip):

However I would make one small change: have read2 update the ReadBuf in-place to set the number of bytes read, and give ReadBuf a way to safely extract the initialized portion of the buffer.

Amanieu (Jan 11 2020 at 22:07, on Zulip):

This would works for ReadBufs too.

Amanieu (Jan 11 2020 at 22:08, on Zulip):

Basically I want to be able to, with only safe code, create a [MaybeUninit<u8>], pass it to read2 and get a [u8] back.

Amanieu (Jan 11 2020 at 22:08, on Zulip):

All the unsafe code should be in the read2 implementation.

Amanieu (Jan 11 2020 at 22:09, on Zulip):

(or in ReadBuf if the read2 implementation doesn't use unsafe code)

Lokathor (Jan 11 2020 at 22:16, on Zulip):

how does one make [MaybeUninit<u8>] in safe code again?
is there like a nightly box api for that or something?

Amanieu (Jan 11 2020 at 22:17, on Zulip):

MaybeUninit::uninit() is safe

Amanieu (Jan 11 2020 at 22:17, on Zulip):

You can create an uninitialized value just note read inside it.

Amanieu (Jan 11 2020 at 22:19, on Zulip):

@Steven Fackler After having a chance to fully review your proposal, I really like the ReadBuf abstraction. The only change I would recommend is for read2 to return Result<()> and instead always update the length of the ReadBuf in-place. This makes it more ergonomic to obtain the result of a read.

Lokathor (Jan 11 2020 at 22:20, on Zulip):

oh so you mean like, [MaybeUninit::<u8>::unint(); 1024], okay i get it

Amanieu (Jan 11 2020 at 22:27, on Zulip):

Then again I'm having second thoughts on how well that would work with ReadBufs...

Steven Fackler (Jan 11 2020 at 23:31, on Zulip):

ReadBufdoesn't track the length of "read" data internally, it tracks the amount of initialized memory. It's important that that's monotonically increasing so that you don't end up repeatedly re-zeroing bits of the buffer

Amanieu (Jan 11 2020 at 23:32, on Zulip):

Ah fair enough. In that case I have no objections.

Steven Fackler (May 17 2020 at 20:29, on Zulip):

I finally got around to writing the RFC for this, if anyone wants to take a look before I open it up: https://github.com/sfackler/rfcs/blob/read-buf/text/0000-read-buf.md

RalfJ (May 17 2020 at 21:45, on Zulip):

might be worth saying that let mut buf: [u8; 1024] = unsafe { MaybeUninit::uninit().assume_init() }; is also UB already even with a known reader according to the current rules (but those rules are under discussion)

RalfJ (May 17 2020 at 21:45, on Zulip):

(it doesn't affect the discussion much I think, but just to make sure that people don't read that code and think it is officially blessed.)

RalfJ (May 17 2020 at 21:46, on Zulip):

shameless plug: if you want a citation for "uninitialized memory does not just have an arbitrary value", you could link to my blog post: https://www.ralfj.de/blog/2019/07/14/uninit.html

RalfJ (May 17 2020 at 21:47, on Zulip):

though you already have a pretty convincing example in there :)

RalfJ (May 17 2020 at 21:48, on Zulip):

Sometimes, working with uninitialized buffers can be more complex than working with regular initialized buffers!

I'd add "however", as in: "Sometimes, however, working with ..."

RalfJ (May 17 2020 at 21:52, on Zulip):

what about assert_initialized -> assume_init, to match MaybeUninit?

RalfJ (May 17 2020 at 21:53, on Zulip):

so ReadBuf has an invariant initialized >= written? might be worth stating explicitly.

RalfJ (May 17 2020 at 21:56, on Zulip):

RalfJ said:

what about assert_initialized -> assume_init, to match MaybeUninit?

oh it already is assume later, just not in the example code

RalfJ (May 17 2020 at 21:57, on Zulip):

initialize_unwriten_to is an odd name, to me that sounds like a function that sets all uninit memory to some value n

RalfJ (May 17 2020 at 22:05, on Zulip):

But overall, big :+1: :-)

Steven Fackler (May 17 2020 at 23:24, on Zulip):

Thanks! Yeah I'm not sure about the naming of some of the methods on ReadBuf

Steven Fackler (May 17 2020 at 23:24, on Zulip):

in particular initialize_unwritten/initialize_unwritten_to

BurntSushi (May 17 2020 at 23:51, on Zulip):

@Steven Fackler love it! great RFC. i very much like the ReadBuf type. the way it works feels very transparent and makes a lot of sense. i always agree with your decision on io::Result<()> vs io::Result<usize>. having all of that encapsulated in ReadBuf feels like the right way to go. (i say this as someone who hasn't really tracked progress on this issue.)

i think my "biggest" criticism is definitely that trivial Read impls become possible. it's kind of a bummer. a lint would definitely help with that. the other aspect of this that stinks though is the docs. today, it's clear that there is one single required method to implement, and i'd guess that read is what most folks will continue using even with read_buf. in the new world, there will be no required methods listed in the place one expects, which seems like kind of a bummer.

how much has the case of not making read have a default impl been considered? it would create some guaranteed boiler plate for some subset of impls which stinks, but maybe it's not _that_ terrible?

Steven Fackler (May 18 2020 at 00:45, on Zulip):

I'd hope that people would actually start implementing read_buf by default rather than read as time goes on. We can provide APIs like ReadBuf::append to avoid making people mess with unsafe while still avoiding initialization.

I'll add leaving read required as an alternative.

simulacrum (May 18 2020 at 12:51, on Zulip):

I imagine that if we wanted to it wouldn't be too hard to add a compiler lint/error that required you to implement one of read or read_buf, fwiw -- it'd be a perma-unstable attribute pair or something like that

simulacrum (May 18 2020 at 12:52, on Zulip):

one downside of the current approach of mutually recursive defaults it that e.g. rust-analyzer's "add required methods" assist stops working etc, since there's no required methods

simulacrum (May 18 2020 at 12:53, on Zulip):

in the ideal world we'd e.g. make 2021 edition code have read_buf be required and prior editions get read defaulted

simulacrum (May 18 2020 at 12:53, on Zulip):

which in theory isn't impossible

Steven Fackler (May 18 2020 at 14:06, on Zulip):

Yeah I think the lint check would be pretty easy to add, and we could probably even just upgrade it to some rustc-only annotation to make it a hard error

Steven Fackler (May 18 2020 at 14:06, on Zulip):

which rust-analyzer could pick up on as well

simulacrum (May 18 2020 at 14:07, on Zulip):

yeah -- I wouldn't mind a hard error personally -- one downside of the mutually default approach is due to LLVM treating recursion without side effects as UB it would make it pretty easy to accidentally trigger that (whereas today it's sort of hard(er) to accidentally hit that)

Josh Triplett (May 18 2020 at 16:52, on Zulip):

I would love to see attributes for what the minimum required implementation is.

Steven Fackler (May 18 2020 at 21:23, on Zulip):

https://github.com/rust-lang/rfcs/pull/2930

simulacrum (May 18 2020 at 21:25, on Zulip):

@Steven Fackler hm, I think I don't see anything about the possibility of an custom attribute to force either read or read_buf to get implemented? do you just consider that out of scope?

simulacrum (May 18 2020 at 21:26, on Zulip):

IMO, it well may be -- tying the two together could just lead to unnecessary noise

Steven Fackler (May 18 2020 at 21:26, on Zulip):

There's a note about a lint

Steven Fackler (May 18 2020 at 21:26, on Zulip):

We would ideally create a lint that exactly one of the two methods is implemented, but that is not a hard requirement.

simulacrum (May 18 2020 at 21:27, on Zulip):

ah okay, thanks! I had skimmed and mostly paid attention to unresolved questions / future work section

Steven Fackler (May 18 2020 at 21:46, on Zulip):

I don't think it'll be very hard to implement that check, so I'm not too worried about it

simulacrum (May 18 2020 at 22:19, on Zulip):

sure, yeah

simulacrum (May 18 2020 at 22:19, on Zulip):

I agree it seems separate and not something that really needs to be RFCd (though probably deserves check boxes or whatever)

Last update: Jul 02 2020 at 12:55UTC