Stream: t-lang/wg-unsafe-code-guidelines

Topic: data-invariants


RalfJ (Aug 08 2018 at 22:46, on Zulip):

At some point I want to jot down my thoughts on data invariants, starting with the basic invariants that layout optimizations rely on -- what are they, and when do they have to hold, etc. What would be the best place to submit that? one of our two UCG github repos, a blog post, or directly an RFC?

Nicole Mazzuca (Aug 08 2018 at 22:57, on Zulip):

what do you mean? I'd say, given an object of type T, that object should always hold a value of type T. That seems like the place you'd want to be.

nikomatsakis (Aug 08 2018 at 23:09, on Zulip):

At some point I want to jot down my thoughts on data invariants, starting with the basic invariants that layout optimizations rely on -- what are they, and when do they have to hold, etc. What would be the best place to submit that? one of our two UCG github repos, a blog post, or directly an RFC?

I think it makes sense to write up something in https://github.com/rust-rfcs/unsafe-code-guidelines/

alercah (Aug 08 2018 at 23:19, on Zulip):

If we're considering data layout invariants, one possible thing to think about is whether we offer guarantees when correct bit-patterns happen to be written

alercah (Aug 08 2018 at 23:21, on Zulip):

e.g. is it okay to store a repr(rust) value on disk and read it back out to memory? do we offer guarantees that this works across identically defined types? across different compiler versions? can the compiler randomly reassign discriminants for enums every time the program runs?

Nicole Mazzuca (Aug 08 2018 at 23:40, on Zulip):

So this is a question of ABI - I'd argue that ABI should only be guaranteed (at the _most_) on exactly one compiler version and definition

Nicole Mazzuca (Aug 08 2018 at 23:41, on Zulip):

(We've broken ABI before)

RalfJ (Aug 09 2018 at 07:31, on Zulip):

@Nicole Mazzuca well that is an entirely cyclic definition. ;) You just said "the invariant to be maintained for T is the invariant for T".

RalfJ (Aug 09 2018 at 07:32, on Zulip):

@alercah These are interesting questions but somewhat orthogonal. At this point, I am concerned with what the rules are once a particular data layout is fixed by the compiler.

RalfJ (Aug 09 2018 at 07:39, on Zulip):

So, what I am imagining is something like:
When it has to hold: The layout invariant must hold always; whenever you construct a value of type T (transmute, are there other operations we have to list) and whenever you load a value of type T (i.e., actually deref a place or use ptr::read).
However, the invariant is pretty basic. It might be a good idea to adapt the principle that data layout invariants are the kind of invariant that, when it holds once for a value, it will always hold. That excludes making "& is dereferencable" part of the invariant (because that could hold once, but then stop holding when memory gets deallocated), but would otherwise be nice for reasoning.
So e.g. my proposal for the data layout invariant for &T would be that it must be non-NULL and aligned. For bool it must be 0 or 1, I think that is quite clear. Another interesting question is what about integer types: Is it okay to have undef/poison in an i32? For unions, I think the invariant should allow anything but @Vadim Petrochenkov thinks we should have some invariant even here. Trait object pointers must have a valid vtable. And so on.

RalfJ (Aug 09 2018 at 07:42, on Zulip):

The reason I think not putting "dereferencable" here is that this falls out of my aliasing model: Whenever you load a reference, it gets activated and retagged. That is UB if the reference is not dereferencable.

Nicole Mazzuca (Aug 09 2018 at 14:18, on Zulip):

No, there's a difference between "values" and "objects"

Nicole Mazzuca (Aug 09 2018 at 14:19, on Zulip):

it's like, the value 5 vs the box that 5 gets put in

Nicole Mazzuca (Aug 09 2018 at 14:21, on Zulip):

I recommend the book Elements of Programming, it goes over all of this stuff

Nicole Mazzuca (Aug 09 2018 at 14:57, on Zulip):

Like, here's an illustration of the idea:

Given a builtin type T, there exists a set of valid values for T; for a bool, that would be {true, false}, for i32, that would be [-2**31, 2**31) ∩ ℤ, etc.

Given a struct or tuple (hereafter referred to as a product) type T, there exists a set of valid values for T which is constructed by taking the tuple of it's members - i.e., for (bool, bool) that would be the set bool * bool, or {(true, true), (true, false), (false, true), (false, false)}

(continued)

Nicole Mazzuca (Aug 09 2018 at 15:12, on Zulip):

Given a union or enum type T, hereafter referred to as a sum type, there exists a set of valid values for T which is constructed by either
- (for the union) taking the union of the values of the member types - i.e., union { bool, i32 } would be the set {true, false} ∪ ([-2**31, 2**31) ∩ ℤ),
- (for the enum)
- given a mapping nth_ctor : (ℕ ∩ [0, N)) → Type (where N is the number of constructors for T)
- which is defined as the map from the ctor index to the product type of that ctor
- the valid values of T are the set ∪(i = 0 to N) {(i, v) | v ∈ nth_ctor(i)}

Nicole Mazzuca (Aug 09 2018 at 15:28, on Zulip):

Let's call this map from the type to the set of its values, val : Type -> Set Value

A Bit is the set of values {X, 1, 0}.

⊑ : List Bit * List Bit -> Bool is defined: x ⊑ y if and only if
- the lists have the same length, N
- for each index n ∈ [0, N) ∩ ℕ
- if yₙ = 0, then xₙ = 0
- if yₙ = 1, then xₙ = 1
- if yₙ = X, then xₙ can be any bit
Note: this relation has the effect of saying whether x has the same value as y.

Then, there exists a mapping bits : val(T) → (l: List Bit) where l has size size_of::<T>() - note that there exists no guarantee of distinctness, due to references and unions.

An object o with type T also has a list of Bits (of size size_of::<T>()). Let's call this list repr : Object -> List Bit. An object o with type T is valid if and only if ∃v : val(T) s.t. repr(o) ⊑ bits(v).

RalfJ (Aug 09 2018 at 17:57, on Zulip):

Not sure when I will have tome to digest all of this, but just one comment up front: The Box that you put 5 in is also a value. that value is a pointer/address/location in memory.
it is entirely unnecessary to talk about objects here. no objects appear anywhere in any of my formal models. or maybe you are just using "object" as a term for "location in memory"?

Nicole Mazzuca (Aug 09 2018 at 19:20, on Zulip):

Yes, an object is a location in memory with space for a value. You have to separate them when discussing mutation

Nicole Mazzuca (Aug 09 2018 at 19:21, on Zulip):

You also have things like this:

let mut x = 0i32;
*((&mut x) as *mut _ as *mut [u8; 4]) = [0, 1, 2, 3];

which do type punning on a single object.

If you don't have locations with values in them, how do you have mutation?

let mut x = 0;
x += 1;

x is one location, but it has both 0 and 1 in it at different times.

alercah (Aug 12 2018 at 18:35, on Zulip):

Yeah, this distinction becomes important if we think about where it is legal to have a pointer (which indicates a location) pointing to an invalid value.

alercah (Aug 12 2018 at 18:38, on Zulip):

If we're talking about bits as the set of valid values, then a 0 reference is definitely invalid. An unaligned reference should be invalid as well.

alercah (Aug 12 2018 at 18:38, on Zulip):

We could require raw pointers be aligned as well, but I don't think it makes a big difference one way or the other.

alercah (Aug 12 2018 at 18:39, on Zulip):

(also by "0 reference" I mean the null value, which may or may not be all-bits-0)

alercah (Aug 12 2018 at 18:42, on Zulip):

If we make it a data layout invariant that fields are always nonoverlapping (I mentioned this on github and have more thoughts on this: a compiler could pack, say, multiple bools into one byte in a struct/array if it could prove that there was never any separate addressing of them. But this could just be done by the as-if rule anyway.), then any arbitrarily nested field can be referred to without necessarily needing the rest of the struct (or any nested one) to be valid.

alercah (Aug 12 2018 at 18:42, on Zulip):

Likewise individual fields can be loaded without requiring validity of the others.

alercah (Aug 12 2018 at 18:43, on Zulip):

I can't think of any inherent reason we need to think of structs as being valid or invalid in toto.

alercah (Aug 12 2018 at 18:48, on Zulip):

You can refer to them safely without any problems. You can even move them around, and you still don't actually run into any problems. As long as you don't try to actually access a field or do anything conditional on it, I don't think that we can have any issues there. The only thing I can think of is if the compiler does an oversize load and relies on the validity of the extra memory loaded to perform some tricky optimization (e.g. for a boolean, knowing that only one bit is set, so ANDing against something you know has the bit cleared will always produce zeroes). I'm not sure if that is worth keeping the door open to.

alercah (Aug 12 2018 at 18:49, on Zulip):

Actually I guess the most likely case I can think of that is if I have, e.g.:

enum Singleton {
  Variant = 0,
}
struct Thing {
  b: bool,
  s: Singleton,
}

fn some_size(t: Thing) -> usize {
  if t.b {
    5
  } else {
    91
  }
}

The compiler here might rely on t.s necessarily having value 0 in order to do an oversized load and then compare to 0. If this is an optimization we permit, then we have to define loading at the entire struct level, so that accessing a field for a load also loads the rest of the struct.

alercah (Aug 12 2018 at 18:55, on Zulip):

But then that rapidly becomes complicated if we consider that the load might happen deep in some other function: if, say, we pass &t.b into another function, and then that loads from it, could the compiler inline and then do the oversize load? What if raw pointers were involved?

So my temptation here is to say that the compiler cannot rely on invariants in values in oversized loads.

alercah (Aug 12 2018 at 19:02, on Zulip):

Enum layout optimizations do prevent us from allowing *p = v where v is invalid, but I don't think that applies for a fully-owned value. So ptr::read could read an invalid value safely if we wanted?

alercah (Aug 12 2018 at 19:03, on Zulip):

(provided that the result's invalid fields were not conditioned on, ofc)

alercah (Aug 12 2018 at 19:08, on Zulip):

But that may prove untenable in practice? not sure

alercah (Aug 12 2018 at 19:09, on Zulip):

worth thinking about that, though

RalfJ (Aug 23 2018 at 20:34, on Zulip):

@Nicole Mazzuca I think bringing up some kind of denotational semantics for values like you are doing here is entirely unnecessary. in fact, from all I know, it is an unsolved problem to scale this to higher-order imperative languages. you'd have to explain "values" that makes up the function space, and then you run into recursive domain equation that you cannot solve... and much later you end up with essentially something like logical relations, where you do not need that abstract set of values at all. but anyway that is entirely off-topic, all we are discussion here is basic data layout invariants.

RalfJ (Aug 23 2018 at 20:35, on Zulip):

these invariants are on values represented in memory, i..e, sequences of bytes -- where a "byte" is the byte of an abstract machine that keeps track of extra state like whether something is initialized.

RalfJ (Aug 23 2018 at 20:36, on Zulip):

If you don't have locations with values in them, how do you have mutation?

oh, we totally have locations, and we have a memory that maps locations to bytes. I know C uses the term object for some of this, but I think it is rather confusing as we are not talking about OOP.

RalfJ (Aug 23 2018 at 20:37, on Zulip):

also, we are unlikely to adopt C's extremely complicated object model. In LLVM, locations form a mostly "flat" address space -- separate allocations are distinguished, but you can do pointer arithmetic within an allocation across what C would call object boundaries

RalfJ (Aug 23 2018 at 20:37, on Zulip):

@alercah every bool can have its address taken so you cannot really change how an individual bool looks

RalfJ (Aug 23 2018 at 20:37, on Zulip):

(bool, bool) must always be 2 bytes

RalfJ (Aug 23 2018 at 20:42, on Zulip):

So, coming back to memory, locations and objects -- I am coming at this from the perspctive that before you do anything else, you should define your state. C never does that, and it's a horrible mistake. that state will likely have some big finite partial function location -fin> byte -- where both location and byte don't have to be the "obvious" choice. in a boring IR, you could pick location = u64 and byte = u8. in miri, we have location = (AllocId, Offset) and byte basically an enum Initialized(u8), Uninit, except we add some extra stuff to be able to also store pointers in memory. so on paper I would use byte = enum { Data(u8), Uninit, PtrFragment { ptr: location, idx: u8 } where PtrFragment(ptr, n) represents byte n of a pointer. the only reason we do not have that in miri is performance.

RalfJ (Aug 23 2018 at 20:42, on Zulip):

so then, based on that, the validity invariant is a predicate over byte^n -- and maybe it can also depend on the current state of memory.

Nicole Mazzuca (Aug 24 2018 at 16:19, on Zulip):

How C does this is that functions are not object - in fact, Rust is the same way. A function is fundamentally different from an object. That's why we have &T and fn()

Nicole Mazzuca (Aug 24 2018 at 16:21, on Zulip):

A function doesn't have a value in the classical sense, in systems languages

Nicole Mazzuca (Aug 24 2018 at 16:21, on Zulip):

Also, "object" for "location + value" is a common wording in systems languages, and I feel it's unnecessary to reinvent the wheel when it comes to language here

RalfJ (Aug 24 2018 at 17:13, on Zulip):

location + value? as in, the value stored at that location?

RalfJ (Aug 24 2018 at 17:14, on Zulip):

however that only makes sense when you also consider types. otherwise you do not know how many bytes to consider.

RalfJ (Aug 24 2018 at 17:14, on Zulip):

and memory is better treated as inherently untyped.

Nicole Mazzuca (Aug 24 2018 at 18:01, on Zulip):

Yea, but lvalues are inherently typed

RalfJ (Aug 24 2018 at 18:02, on Zulip):

hm, no, I dont think so. in miri we have Place and it is not typed by itself. we often carry it along with its type information, but the concept makes sense without types.

nikomatsakis (Aug 24 2018 at 18:03, on Zulip):

well, a Place is an "expression", essentially, in some context, so you can compute its type?

RalfJ (Aug 24 2018 at 18:03, on Zulip):

I was talking about miri::Place, not mir::Place

nikomatsakis (Aug 24 2018 at 18:03, on Zulip):

or maybe miri means something distinct from MIR when you say Place

RalfJ (Aug 24 2018 at 18:04, on Zulip):

i.e. the thing the expression computes to

nikomatsakis (Aug 24 2018 at 18:04, on Zulip):

I see, ok

RalfJ (Aug 24 2018 at 18:04, on Zulip):

which is essentially a location + alignment + info for unsized stuff

RalfJ (Aug 24 2018 at 18:04, on Zulip):

but to be fair I think we have hardly any use in miri for a place without a type

RalfJ (Aug 24 2018 at 18:05, on Zulip):

but that's because places are an "ephemeral" concept -- they exist during evluation of a statement, but are not really part of the machine state

Nicole Mazzuca (Aug 24 2018 at 18:05, on Zulip):

so you're saying you have void* for all lvalues, and each read is read(type, offset, lvalue)?

Nicole Mazzuca (Aug 24 2018 at 18:05, on Zulip):

or something along those lines?

RalfJ (Aug 24 2018 at 18:05, on Zulip):

when you want to actually read you need a size, yes. not a full type though.

RalfJ (Aug 24 2018 at 18:05, on Zulip):

(miri needs a type for reasons that are only related to optimizations)

Nicole Mazzuca (Aug 24 2018 at 18:06, on Zulip):

that seems like a very odd model

Nicole Mazzuca (Aug 24 2018 at 18:06, on Zulip):

values... should... be typed

RalfJ (Aug 24 2018 at 18:07, on Zulip):

no I dont think so. types only matter when you are actually computing.

RalfJ (Aug 24 2018 at 18:08, on Zulip):

and even then, I'd make unsigned-mul and signed-mul just different operations

RalfJ (Aug 24 2018 at 18:08, on Zulip):

and then both can work on untyped bytes

Nicole Mazzuca (Aug 24 2018 at 18:08, on Zulip):

I... really dislike that model

RalfJ (Aug 24 2018 at 18:08, on Zulip):

I think types are just making trouble on this level

Nicole Mazzuca (Aug 24 2018 at 18:08, on Zulip):

I don't agree at all

RalfJ (Aug 24 2018 at 18:08, on Zulip):

types should be used to reason about program behavior, not to define program behavior

RalfJ (Aug 24 2018 at 18:09, on Zulip):

but I appreciate that this is an opinion :)

RalfJ (Aug 24 2018 at 18:11, on Zulip):

and I can live with types being used during the evaluation of a single expression -- as long as the machine state that is kept from one evaluation step to the next does not have types

Nicole Mazzuca (Aug 24 2018 at 18:13, on Zulip):

I would argue that, for example, fn () -> i32 should actually return i32 in the model, and not just "4 bytes"

Nicole Mazzuca (Aug 24 2018 at 18:13, on Zulip):

and similarly, that fn (i32) -> i32 should be callable only with i32

nikomatsakis (Aug 24 2018 at 18:13, on Zulip):

values... should... be typed

I mean -- values can have multiple types, clearly... (union, transmute). So you have to reconcile for that at minimum. This does tend to lead you towards Ralf's POV

Nicole Mazzuca (Aug 24 2018 at 18:14, on Zulip):

@nikomatsakis I very much disagree

nikomatsakis (Aug 24 2018 at 18:14, on Zulip):

although @RalfJ I'd still like to understand how transmute plays with the ref tracking in stacked borrows =)

Nicole Mazzuca (Aug 24 2018 at 18:14, on Zulip):

union and transmute do not have values of multiple types

Nicole Mazzuca (Aug 24 2018 at 18:15, on Zulip):

transmute takes the object representation of one value, and gives you a value of a different type with the same object representation

Nicole Mazzuca (Aug 24 2018 at 18:21, on Zulip):

my favorite example where, imo, that kind of thinking breaks down, is with return values of struct type

given a type struct i64x2 { x: i64, y: i64 }, and a function fn f() -> i64x2 { i64x2 { x: 0, y: 0 } }, there _does_ exist a value of type i64x2, but there never exists an _object_ of type i64x2

Nicole Mazzuca (Aug 24 2018 at 18:21, on Zulip):

(assuming x64 ABI)

Nicole Mazzuca (Aug 24 2018 at 18:21, on Zulip):

you don't ever actually have a byte ^ 16 thing

rkruppe (Aug 25 2018 at 00:03, on Zulip):

ABIs are implementation details though, for language semantics it seems absolutely fine -- even desirable for simplicity -- that every function writes its result to a place that is passed in (and indeed this is how MIR currently works)

Nicole Mazzuca (Aug 25 2018 at 00:21, on Zulip):

why is that more simple than "returns a value"

nikomatsakis (Aug 25 2018 at 01:08, on Zulip):

it is also true that computer hardware is not typed; so there is some advantage if you can map your model to types being something that the code "adds" to the base layer

nikomatsakis (Aug 25 2018 at 01:08, on Zulip):

although the mapping between this base layer and hardware is...suggestive at best, I suppose :)

RalfJ (Aug 25 2018 at 08:58, on Zulip):

I think transmute is just a NOP

RalfJ (Aug 25 2018 at 08:58, on Zulip):

it copies bytes around, done

RalfJ (Aug 25 2018 at 08:58, on Zulip):

that thing with struct return value is entirely a platform implementation detail

RalfJ (Aug 25 2018 at 08:59, on Zulip):

as far as language spec is concerned, a place exists somewhere that is prepared by the caller, and where the callee writes this return value into

RalfJ (Aug 25 2018 at 08:59, on Zulip):

and by "write the return value", I mean that it does return_place.x = 0; return_place.y = 0;

RalfJ (Aug 25 2018 at 09:00, on Zulip):

or return_place = i64x2 { ... } if we have aggregate creation as primitive (which IIRC MIR has)

RalfJ (Aug 25 2018 at 09:00, on Zulip):

why is that more simple than "returns a value"

why is that more simple than "returns a value"

we do not need the concept of a "value".

RalfJ (Aug 25 2018 at 09:01, on Zulip):

we have scalars, which is something we can immediately perform computation on; everything else just exists as place/operand in memory

rkruppe (Aug 25 2018 at 09:01, on Zulip):

@Nicole Mazzuca sorry it's not inherently simpler in itself, but it's one component (along with having places rather than just values in other parts of the language) of allowing us to avoid specifying validity for values, or more generally a really formalized denotational semantics

RalfJ (Aug 25 2018 at 09:01, on Zulip):

no need to invent a concept of a "value" which is something like "a bunch of bytes, but not sitting in memory currently"

Nicole Mazzuca (Aug 25 2018 at 15:05, on Zulip):

I think the question is "what is the purpose of the model". If it's "allow people to reason about their Rust", I'd argue a typed semantics is far easier to reason about; if the question is "allow us to reason about the Rust model", then it might be better to have it be untyped. I personally think it's much easier to reason about unsafe code which has values and types, not buffers and invariants.

rkruppe (Aug 25 2018 at 15:10, on Zulip):

quite a few of the tricky questions that people writing unsafe/systems-y code ask are naturally about memory contents: can i type pun this buffer, is this layout compatible with C, can i read this uninitialized memory and do X with it, etc. so i don't really see how a typed-values-first approach will be automatically more natural

Nicole Mazzuca (Aug 25 2018 at 15:10, on Zulip):

that's kind of the less interesting parts of unsafe code, imo

Nicole Mazzuca (Aug 25 2018 at 15:13, on Zulip):

maybe it's just due to being in the C++ community, where we have a value based semantics and it's very useful (and fairly reasonable) to read the standard

Nicole Mazzuca (Aug 25 2018 at 15:16, on Zulip):

it also more closely matches the actual machine, imo

Nicole Mazzuca (Aug 25 2018 at 15:16, on Zulip):

we really aren't dealing with memory, most of the time; we're dealing with registers (which are different depending on type!)

rkruppe (Aug 25 2018 at 15:18, on Zulip):

i think this is a red herring, despite 90% of my compiler work dealing intimately with these sorts of Real Machine Widgets
a language definition needs to permit efficient compilation, but it doesn't need to be very suggestive of that compilation strategy, especially since any analogy to real machines will likely be superficial anyway

rkruppe (Aug 25 2018 at 15:19, on Zulip):

aside: if one really wanted a language semantics tailored to register machines, it would be more useful to give up the pretense that certain kinds of values (e.g., condition codes, masks, variable-length vectors) can be stored in memory _at all_

Nicole Mazzuca (Aug 25 2018 at 15:20, on Zulip):

sure

Nicole Mazzuca (Aug 25 2018 at 15:20, on Zulip):

I don't disagree with that idea lol

Nicole Mazzuca (Aug 25 2018 at 15:36, on Zulip):

also, obviously this is all opinion, and Ralf's model is also entirely valid; I just think a value-based model is easier to understand (due to being super familiar with them with C++)

RalfJ (Aug 31 2018 at 08:42, on Zulip):

maybe it's just due to being in the C++ community, where we have a value based semantics and it's very useful (and fairly reasonable) to read the standard

You are the first person that I hear say such things about C++.^^ I consider the C++ standard to be huge and unreadable... and also fundamentally deficient because it is not written int he style of an "abstract machine". C++ is axiomatic, it essentially says "I wish all these things to be true" and then leaves it to others to figure out if, and how, that is possible.

All the work for actually reasoning formally about low-level code that I am ware of (first and foremost, CompCert) uses an untyped memory. Types make it harder to define what your machine is, and a more complicated machine is more complicated to reason about.

RalfJ (Aug 31 2018 at 08:45, on Zulip):

So, I think the purpose of this exercise is to define Rust, and a definition should be (a) executable, written with some form of abstract machine in mind, because that's how people think about their programs, and (b) as simple as possible. The latter is crucially important also when reasoning about your code.

Registers are entirely an implementation detail. No need to clutter our language definition with such concerns.

RalfJ (Aug 31 2018 at 08:47, on Zulip):

That said, I know plenty of people who consider axiomatic models useful and maybe even superior to the operational approach I am proposing. There is certainly a large amount of personal preference in there, and it probably also depends on what you are doing. However, the history of UB in C/C++ IMHO shows that having only an axiomatic definition is not sufficient -- clearly, the axiomatic definition of C/C++ did not really help programmers to determine if their code is UB; in fact, even compiler authors and academics struggle to make sense out of it. I think that is a piece of hard evidence against such an axiomatic approach.

RalfJ (Aug 31 2018 at 08:49, on Zulip):

Also see John Regehr advocating an operational definition, and even Torvalds agrees.
Another piece of evidence is the story of weak memory models: Initially they were all axiomatic, but none of them managed to solve the "out of thin air" problem. one of the best contenders we have right now is the first realistic *operational* model of relaxed memory.

RalfJ (Aug 31 2018 at 08:53, on Zulip):

I am not debating that a higher-level view is useful for reasoning, that's why I am working on program logics for my PhD. ;) but the gold standard here is to prove that your value-based fiction is actually correct with respect to a more low-level operational semantics. Thus we end up with a semantics that is reasonably close to "the real thing" (where "reasonable" of course depends on your PoV^^) -- LLVM's memory is untyped and just a "bunch of bytes", which is important because we have to trust that the semantics match what really happens. And we still get high-level reasoning, but in a way that we know is sound because there is a proof connecting it to the low-level "reality" (just another model, of course, but hopefully one that is much easier to declare realistic).

RalfJ (Aug 31 2018 at 08:56, on Zulip):

In most of my work, when I am not interested in pointer arithmetic or layout concerns, we have a very high-level view of memory where the "value" you can store in a location can be arbitrarily big, contain nested sums and products and closures and whatnot. We put all the burden on implementing this on the compiler. But that is not so hard because there is no pointer arithmetic and no transmute, there is no way for the program to observe what the compiler is doing. But if you'd now want to also define, in the same language, what happens with byte-level accesses -- that's where I have yet to see a high-level model that I would actually trust to match actual program behavior. (And it is about trust here, this is the fundamental definition of the language and we do not have a verified compiler.) Robbert Krebber's thesis is the closest I know, and it is very far from "simple".

RalfJ (Aug 31 2018 at 09:01, on Zulip):

Sorry for the long rant, I felt I had to explain where I am coming from.^^

Do you know of a simple explanation of C++'s value-based model? Simple means it easily fits onto a dozen pages or so, but should still support some amount of both byte-level and high-level operations and give a precise description of what is UB. All I hear from my peers is hardly anyone tried formalizing C++ because it is so complicated.

Here's a rather simple but realistic untyped model for a subset of C. "Simple" because it fits in a paper -- it's almost 3 dozen pages but it's also giving lots of background and related work and so on. And "realistic" because it is the one used by CompCert to compile plenty of actual C code. It even provides some facilities for concurrent execution (just with locks, but still).

RalfJ (Aug 31 2018 at 09:14, on Zulip):

(The fact that there is hardly any formal work on reasoning about C++ also seems to contradict your statement that C++'s definition is well-suited for reasoning.)

Basile Desloges (Aug 31 2018 at 14:08, on Zulip):

@RalfJ I thought the C++ spec was written against an abstract machine (from https://www.youtube.com/watch?v=KoqY50HSuQg slide 18, I tried to retrieve the source directly but the page seems to have changed and I can't find it :/). Did something changed since that video or are you speaking of something else?

RalfJ (Aug 31 2018 at 14:12, on Zulip):

The C++ standard, at least every edition I have seen so far, is written like the C standard

RalfJ (Aug 31 2018 at 14:13, on Zulip):

and that's as far away from an abstract machine as it gets

RalfJ (Aug 31 2018 at 14:13, on Zulip):

but if I am wrong with that, that'd be great :D

Basile Desloges (Aug 31 2018 at 14:13, on Zulip):

You mean the way the standard is written then? Maybe it is written against an abstract machine, but not very clearly :)

rkruppe (Aug 31 2018 at 14:16, on Zulip):

@RalfJ may be referring to the difference between an axiomatic definition (list of guarantees, but without directly describing how the abstract machine behaves) and an operational definition (which directly gives you an algorithm you can execute)

Basile Desloges (Aug 31 2018 at 14:18, on Zulip):

Oh I see thanks. I think I didn't really understood the difference between those two :)

RalfJ (Aug 31 2018 at 14:19, on Zulip):

Well, I have not seen anything resembling a definition of an abstract machine, which starts by saying: here's the machine state, and then here's the possible execution steps and how they manipulate said state.

RalfJ (Aug 31 2018 at 14:19, on Zulip):

maybe the authors had an abstract machine in mind, but never told us about it. But for C, writing a conforming abstract machine didn't happen until <5 years ago in academia (it's still not normative, and it's huge and almost unusable for reasoning or any other purpose, really; but it's a start). And C++ is much harder.

Jake Goulding (Aug 31 2018 at 15:16, on Zulip):

It's a very abstract abstract machine

Nicole Mazzuca (Aug 31 2018 at 16:40, on Zulip):

I don't mean reasoning in a "you" sort of way, I mean reasoning in a "user" sort of way. Also, LLVM doesn't have the model of untyped memory; it has a C++ style memory + values model?

nikomatsakis (Aug 31 2018 at 16:41, on Zulip):

Here's a rather simple but realistic untyped model for a subset of C

Nice citation.

Nicole Mazzuca (Aug 31 2018 at 16:43, on Zulip):

Also, here's the base model of C++: http://eel.is/c++draft/basic.memobj

RalfJ (Aug 31 2018 at 16:43, on Zulip):

@Nicole Mazzuca I think reasoning in a "user" sort of way is misguided and likely incorrect if it cannot be put on solid formal foundations.

RalfJ (Aug 31 2018 at 16:43, on Zulip):

and no from all I see in the lang ref, LLVM's memory is entirely untyped

RalfJ (Aug 31 2018 at 16:44, on Zulip):

it doesnt "remember" at which type bytes were put unto it

RalfJ (Aug 31 2018 at 16:44, on Zulip):

also all the formal models I have seen for LLVM use untyped memory

RalfJ (Aug 31 2018 at 16:45, on Zulip):

whereas for C, Robbert Krebber's thesis is an example of a formal "kind-of typed"(?) memory. at least, structs and arrays are visible as such in memory.

RalfJ (Aug 31 2018 at 16:45, on Zulip):

whereas for LLVM, structs and arrays are only relevant to compute the address of an access. memory is "flat".

RalfJ (Aug 31 2018 at 16:46, on Zulip):

I will stash these C++ references somewhere for consumption after my vacation :)

Nicole Mazzuca (Aug 31 2018 at 16:46, on Zulip):
define i32 @main() {
start:
  %0 = i32 0
  ret i32 %0
}

^ note that %0 has type i32

Nicole Mazzuca (Aug 31 2018 at 16:46, on Zulip):

_memory_ doesn't have a type, but _values_ do

Nicole Mazzuca (Aug 31 2018 at 16:49, on Zulip):

(which is what I'm arguing for)

nikomatsakis (Aug 31 2018 at 17:09, on Zulip):

I don't mean reasoning in a "you" sort of way, I mean reasoning in a "user" sort of way.

I'm a bit curious about this — my impression is that many unsafe code authors (myself included) would prefer to have an underlying "untyped" model that they can reason about. The farther things get from how the underlying computer works (or at least the simple abstaction we have for it), the more confusing things are. Witness e.g. the common attempt to disable type-based aliasing etc, because it is hard to figure out if code is obeying those rules.

I don't have real data on this, of course, and you seem to have the exact opposite impression. I could easily imagine that there are people who have the reverse opinion. It might be interesting to take some "case study" examples that highlight the differences and try to see what falls about from thinking through them in both "styles". I'm not sure exactly what that means =)

RalfJ (Aug 31 2018 at 17:12, on Zulip):

_memory_ doesn't have a type, but _values_ do

Well that's already much less typed than I thought you would advocate. :)
my argument, then, is that the value on the types solely serves to select the right operation. operations are typed, because it makes a difference whether you multiply signed or unsigned integers.

RalfJ (Aug 31 2018 at 17:12, on Zulip):

but depending on what you mean by "values", they do not even exist very long -- in miri, "values" only exist to be fed into operations

RalfJ (Aug 31 2018 at 17:13, on Zulip):

but between two execution steps, there is no notion of a "value"

RalfJ (Aug 31 2018 at 17:13, on Zulip):

there is memory (allocations containing bytes), and a stack where the local variables are also just an allocation ID saying where in memory this local is stored.

RalfJ (Aug 31 2018 at 17:14, on Zulip):

so if the "values" you speak of are a mostly ephemeral concept, I think we are actually pretty close.

RalfJ (Aug 31 2018 at 17:14, on Zulip):

however, we do not really need anything other than primitive types to type those values, as primitive operations only act on primitive types

RalfJ (Aug 31 2018 at 17:15, on Zulip):

@nikomatsakis there is actually research that tries to find out what people think C is, that might be relevant... let me see if I can find it

nikomatsakis (Aug 31 2018 at 17:15, on Zulip):

ah yes I think I've seen it

nikomatsakis (Aug 31 2018 at 17:15, on Zulip):

but would like to see it again :)

RalfJ (Aug 31 2018 at 17:16, on Zulip):

I think that would be https://www.cl.cam.ac.uk/~pes20/cerberus/notes50-survey-discussion.html

RalfJ (Aug 31 2018 at 17:16, on Zulip):

and more general the Cerberus project is a good source of C-related information: https://www.cl.cam.ac.uk/~pes20/papers/topics.html#Cerberus

Nicole Mazzuca (Aug 31 2018 at 17:26, on Zulip):

yeah, you only should care about the type of values when you're using them; in that case, you do need full nominal types, if only because struct A { x: i32, y: f32 } and struct B { x: i32, y: f32 } aren't guaranteed to be layed out the same

Nicole Mazzuca (Aug 31 2018 at 17:28, on Zulip):

@nikomatsakis my experience is with C++ people. C people are very different from C++ people, ime

Nicole Mazzuca (Aug 31 2018 at 17:29, on Zulip):

@RalfJ did you read my initial idea for a model >.<

Nicole Mazzuca (Aug 31 2018 at 17:30, on Zulip):

I absolutely don't think that memory should be typed, and I've argued that we (in this case meaning the C++ community) should change the C++ model to have untyped memory as well

RalfJ (Aug 31 2018 at 20:47, on Zulip):

you never really do anything with "value" of struct type, other than copying them around, for which you only need the size. so those types are not actually relevant, all we need is size annotations at copy operations. the only place where we need full types is for primitive operations -- arithmetic and the like. (these types are extremely relevant for layout concerns, of course, but I see that as mostly separate -- layout fixes the offsets of all the fields; the main semantics and memory model then say what happens for a given assignment of offsets.)

I have seen an idea for a model of yours last year, yes -- but I did not get very far understanding it. :/ it started out introducing a huge host of terminology, and I was lost in that jungle of definitions before it even really started.

Nicole Mazzuca (Aug 31 2018 at 22:09, on Zulip):

@RalfJ that's not necessarily true - for example, Box<T> should always be non-zero when it's a value

rkruppe (Aug 31 2018 at 22:11, on Zulip):

what, specifically, do you mean by "when it's a value"? in particular, what does that cover that isn't covered by a non-null invariant on the bytes in memory, asserted e.g. when it's loaded?

Nicole Mazzuca (Aug 31 2018 at 22:14, on Zulip):

@rkruppe that's exactly what I mean - let x: Box<i32> = e; <- at the point when this statement is executed, e better evaluate to a non-null Box<i32> value

rkruppe (Aug 31 2018 at 22:19, on Zulip):

ok but that doesn't require defining values and deciding invariants for them, it seems to me you can just put enough things into memory that everything we care about is in memory at the time where we want to assert that it's valid

Nicole Mazzuca (Aug 31 2018 at 22:19, on Zulip):

wha..?

Nicole Mazzuca (Aug 31 2018 at 22:21, on Zulip):

however you want to put it, if you have define void @foo(i32 * nonnull noalias), that i32* value has to be both nonnull and noalias; there is no memory there

rkruppe (Aug 31 2018 at 22:21, on Zulip):

for example in let PAT = EXPR; evaluation of EXPR writes the result to a temporary memory locations, and the let copies parts of that into named (and possibly longer-lived) memory locations

rkruppe (Aug 31 2018 at 22:22, on Zulip):

once again, whatever we do at the LLVM IR level doesn't need to impact the language memory model, we only need to be able to justify the optimizations done on LLVM IR (including those made by the translation to LLVM IR)

Nicole Mazzuca (Aug 31 2018 at 22:24, on Zulip):

I don't think everything being memcpy's is especially understandable or useful, rather than just having values; it also ignores the reality of stuff like let x = &prvalue-expression;, imo

rkruppe (Aug 31 2018 at 22:27, on Zulip):

there is another angle to this -- we'll likely represent Box<T> just like any primitive pointer (e.g. as a scalar in miri), and can therefore do everything with this scalar that we do with other primitives, without introducing any value for "a struct containing a single pointer"

rkruppe (Aug 31 2018 at 22:28, on Zulip):

or in other words, keeping the data model of the abstract machine limited to "scalars + untyped memory"

Nicole Mazzuca (Aug 31 2018 at 22:29, on Zulip):

@rkruppe what about Rc<T>? What about some user-defined pointer type? what about &[T]?

rkruppe (Aug 31 2018 at 22:30, on Zulip):

the _language_ doesn't say anything about invariants of Rc, does it? beyond what follows from the components that library code uses to create Rc

Nicole Mazzuca (Aug 31 2018 at 22:30, on Zulip):

yes it does have invariants on Rc, that's what std::ptr::Shared is for

rkruppe (Aug 31 2018 at 22:31, on Zulip):

that's what i meant with "what follows from the components"

rkruppe (Aug 31 2018 at 22:31, on Zulip):

also idk what you mean with "the reality of stuff like let x = &prvalue-expression;" -- pretty much by definition that has to place the result of the prvalue-expression into memory?

Nicole Mazzuca (Aug 31 2018 at 22:32, on Zulip):

it doesn't do any memory copies tho

rkruppe (Aug 31 2018 at 22:33, on Zulip):

yes, what's the problem? just evaluate the expression into a longer-lived stack slot (living as long as x) than for temporaries that live shorter

rkruppe (Aug 31 2018 at 22:35, on Zulip):

i'll also note, since you apparently care about proximity to the compiled code more than I do, that evaluating an aggregate will indeed directly write the fields to memory (unless the aggregate is very small), or at least it will once we have proper RVO/move forwarding on MIR

Nicole Mazzuca (Aug 31 2018 at 22:35, on Zulip):

yes, but it won't then copy it?

Nicole Mazzuca (Aug 31 2018 at 22:35, on Zulip):

it'll just write it to the stack slot

rkruppe (Aug 31 2018 at 22:36, on Zulip):

ok apparently we're talking past each other

rkruppe (Aug 31 2018 at 22:36, on Zulip):

when and where do you think i'm saying what exactly is copied?

Nicole Mazzuca (Aug 31 2018 at 22:37, on Zulip):

and if we don't have a model beyond "builtins and untyped memory", what happens to struct X { y: Unique<T>, z: i32 }? fn foo(x: X) _requires_ that x.y is non-null

Nicole Mazzuca (Aug 31 2018 at 22:37, on Zulip):

evaluation of EXPR writes the result to a temporary memory locations, and the let copies parts of that into named (and possibly longer-lived) memory locations

rkruppe (Aug 31 2018 at 22:38, on Zulip):

oh i forgot i put it like that

rkruppe (Aug 31 2018 at 22:38, on Zulip):

sorry

Nicole Mazzuca (Aug 31 2018 at 22:39, on Zulip):

we very clearly have a value of type X getting passed to foo - we don't suddenly pass a memory buffer just because it's a UDT

rkruppe (Aug 31 2018 at 22:39, on Zulip):

i can only plead again to separate the rules we impose on the language from the implementation

rkruppe (Aug 31 2018 at 22:40, on Zulip):

also at least if you add another i32 field we will, in fact, create LLVM IR that passes the aggregate in memory

Nicole Mazzuca (Aug 31 2018 at 22:40, on Zulip):

yes, the implementation would differ based on what ABI you're targeting

Nicole Mazzuca (Aug 31 2018 at 22:40, on Zulip):

you would literally pass a pointer to an X object on x86

Nicole Mazzuca (Aug 31 2018 at 22:40, on Zulip):

but that _should not_ come into it

rkruppe (Aug 31 2018 at 22:41, on Zulip):

ok good, but then i am really confused about what you're arguing

Nicole Mazzuca (Aug 31 2018 at 22:41, on Zulip):

what's important is that, on the virtual machine, you're doing a load of an X value from memory, and that the invariants _must_ be valid

rkruppe (Aug 31 2018 at 22:42, on Zulip):

wait i thought you were advocating that in this scenario there's only values, nothing involving memory, least of all a load

Nicole Mazzuca (Aug 31 2018 at 22:43, on Zulip):

I mean, it depends on what you pass? I was assuming that you were loading from the stack when you're passing, but I guess you could also pass an rvalue expression in which case no loads happen

rkruppe (Aug 31 2018 at 22:43, on Zulip):

so that load you refer to would have happened in the caller, if at all?

Nicole Mazzuca (Aug 31 2018 at 22:44, on Zulip):

yep

rkruppe (Aug 31 2018 at 22:44, on Zulip):

and the reason why foo can assume x.z is non-null is a definition of validity on X values, recursing into the fields?

Nicole Mazzuca (Aug 31 2018 at 22:45, on Zulip):

yes - if you have a value of type T, that value must be valid

Nicole Mazzuca (Aug 31 2018 at 22:45, on Zulip):

if it's invalid in some way, then you've got UB

rkruppe (Aug 31 2018 at 22:48, on Zulip):

ok so for contrast: the nonnull attribute can be justified in an untyped-memory model if you put X in memory and pass it by reference (note: this is different from passing &X!) -- you enforce the validity invariant for X on the memory of the argument on entry

Nicole Mazzuca (Aug 31 2018 at 22:50, on Zulip):

yes, but then you have a weird model where you're enforcing validity invariants "at some points"

Nicole Mazzuca (Aug 31 2018 at 22:50, on Zulip):

instead of at exactly one point - any time there's an rvalue conversion

rkruppe (Aug 31 2018 at 22:50, on Zulip):

noalias gets into the stacked borrows proposal but there likewise you can recurse into memory you own (going through &_ is unlikely to fly for other reasons), you just have to do it "on memory" rather than having values that represent aggregates

rkruppe (Aug 31 2018 at 22:51, on Zulip):

in a model where rvalues are so pervasive, surely operations on them could result in invalid rvalues? when do those trigger UB? (and how do you even represent them in the data model)

Nicole Mazzuca (Aug 31 2018 at 22:54, on Zulip):

It makes it really clear why this code (https://rust-lang.zulipchat.com/#narrow/stream/136281-wg-unsafe-code-guidelines/subject/Accessing.20uninitialized.20field.20lvalues/near/133104916) is valid -

let ptr = alloc::alloc(layout) as *mut IdentifierInner;
std::ptr::write(&mut (*ptr).size, size as u32);

let mut buff = std::slice::from_raw_parts_mut((*ptr).mut_ptr(), size);
for ch in s.nfc() {
  let offset = ch.encode_utf8(buff).len();
  buff = &mut buff[offset..];
}

because you never do an rvalue conversion - (*ptr).size is an lvalue expression of type u32, and you immediately give that lvalue expression to an &mut operator, which converts the lvalue T to an rvalue &mut T. There's no read of the T

Nicole Mazzuca (Aug 31 2018 at 22:54, on Zulip):

similarly with buff[offset...] - the lvalue expression is immediately passed to the &mut operator

rkruppe (Aug 31 2018 at 22:56, on Zulip):

"there is no load of the u32 at (*ptr).size" is also true and also relevant in the style of model ralf has developed

Nicole Mazzuca (Aug 31 2018 at 22:57, on Zulip):

yes, because you never do an rvalue conversion - that's the literal "thing" that happens in Rust

Nicole Mazzuca (Aug 31 2018 at 22:58, on Zulip):

it feels like you're just saying "load" when you mean "rvalue conversion", except we pretend that that load goes directly into memory in between. You could definitely lower Rust into these memory semantics, but it seems odd?

rkruppe (Aug 31 2018 at 22:59, on Zulip):

in this example there's no interesting differences between scalars+memory vs values+objects, because everything's happening in terms of scalars (ints and slice references)

Nicole Mazzuca (Aug 31 2018 at 23:00, on Zulip):

and it means, imo, that you now have to do a mental lowering from Rust semantics to low-level memory semantics to figure out whether you're doing something bad, which seems unfortunate, as opposed to just using Rust semantics in the model definition.

rkruppe (Aug 31 2018 at 23:01, on Zulip):

ok i see where you are going

Nicole Mazzuca (Aug 31 2018 at 23:01, on Zulip):

also, it's not operations on scalars - they're operations on IdentifierInner

Nicole Mazzuca (Aug 31 2018 at 23:01, on Zulip):

the lvalue expressions are on an uninitialized IdentifierInner, which is what makes it interesting

Nicole Mazzuca (Aug 31 2018 at 23:03, on Zulip):

at least, imo

rkruppe (Aug 31 2018 at 23:03, on Zulip):

i can't think of any projections through lvalues that we've ever wanted to consider UB?

rkruppe (Aug 31 2018 at 23:04, on Zulip):

as in, expressions like &foo.bar are at most interesting wrt pointer arithmetic and rules about the resulting reference (which may in turn imply a bit about foo though it turns out probably not), not wrt validity of foo in itself

Nicole Mazzuca (Aug 31 2018 at 23:05, on Zulip):

right, exactly - we're projecting through an lvalue, which is distinct from accessing an rvalue? does that make sense?

like, the model in rust is that &foo.bar is an lvalue expression, which is distinct from say, bar(foo) which is an rvalue expression

rkruppe (Aug 31 2018 at 23:05, on Zulip):

it seems we're in violent agreement here, so moving on

Nicole Mazzuca (Aug 31 2018 at 23:05, on Zulip):

what? you can't disagree, that's literally the model in Rust

Nicole Mazzuca (Aug 31 2018 at 23:05, on Zulip):

that's how the typechecker works

Nicole Mazzuca (Aug 31 2018 at 23:06, on Zulip):

hahaha, fair

rkruppe (Aug 31 2018 at 23:06, on Zulip):

oops, agreement, not disagreement!!

rkruppe (Aug 31 2018 at 23:06, on Zulip):

that's not how this idiom works, brain

rkruppe (Aug 31 2018 at 23:07, on Zulip):

so regarding your point about "lowering" rust locations and rvalues to memory accesses - YES that is exactly the point, taking it all much lower level and breaking it down into simpler operations so that the spec is simpler and can be implemented much more directly (as in miri)

rkruppe (Aug 31 2018 at 23:07, on Zulip):

this does, as you say, place some burden on people looking at source code

Nicole Mazzuca (Aug 31 2018 at 23:07, on Zulip):

yeah - my interest is making it easier for people who are reading and writing source code, as opposed to those who are writing the spec

rkruppe (Aug 31 2018 at 23:08, on Zulip):

it's not just easier on the spec writers, it's also easier on rustc and miri developers

rkruppe (Aug 31 2018 at 23:08, on Zulip):

because it's literally how MIR works

Nicole Mazzuca (Aug 31 2018 at 23:08, on Zulip):

sure; they're not who I'm interested in making the lives of easier tho :)

rkruppe (Aug 31 2018 at 23:09, on Zulip):

ah but a better miri means you often won't have to manually reason your way through whether something is ub!

Nicole Mazzuca (Aug 31 2018 at 23:10, on Zulip):

whether something is UB is a runtime property dependent on data

Nicole Mazzuca (Aug 31 2018 at 23:10, on Zulip):

I doubt Miri's ability to solve the halting problem

rkruppe (Aug 31 2018 at 23:13, on Zulip):

very funny. many questions do boil down to a specific litmus test with only one or two relevant traces, though

rkruppe (Aug 31 2018 at 23:14, on Zulip):

such as your example from above -- the addresses and buffer length don't matter, you just run it once with one input, miri doesn't complain about accessing an unintialized size, and you're in the clear

rkruppe (Aug 31 2018 at 23:41, on Zulip):

one other thought -- it's not unheard of to have multiple specs in a different style and prove them equivalent (or give a sound mapping from one to another). if a c++ style model is significantly easier to reason about for programmers -- which i am still not really sure about -- it might nevertheless make sense for the UCG to produce a simple low-level spec, implement it in miri, etc. and (later) provide programmers with a c++ style explanation that is proven equivalent

however there is a very real chance that there will be differences (possibly only in edge cases) between the natural way to formulate these two models. in any case we won't know this until someone (cough) sits down and writes out a concrete proposal to compare and contrast

Nicole Mazzuca (Aug 31 2018 at 23:59, on Zulip):

it's possible nfc has a bug for specific characters, tho

Nicole Mazzuca (Aug 31 2018 at 23:59, on Zulip):

or size_utf8, or encode_utf8

RalfJ (Sep 01 2018 at 08:05, on Zulip):

@RalfJ that's not necessarily true - for example, Box<T> should always be non-zero when it's a value

Fair. Lang items are primitive types.

RalfJ (Sep 01 2018 at 08:06, on Zulip):

But this is just a validity invariant

RalfJ (Sep 01 2018 at 08:07, on Zulip):

which applies every time you "touch" data -- so again, the type comes from the operation. Doing assignment at type T? Copy the bytes, and enforce that they form a valid T. Everything is still bytes.

RalfJ (Sep 01 2018 at 08:09, on Zulip):

and I think passing values between functions through memory is a great way to side-step many annoying questions.

RalfJ (Sep 01 2018 at 08:11, on Zulip):

arguments, on the caller side, are an Operand, and on the callee side a Place. so the most natural thing to do for argument passing is "copy operand to place", an operation we have to define anyway. This is also what happens in miri, btw.

RalfJ (Sep 01 2018 at 08:11, on Zulip):

and every time we write anything to a place, that operation happens at a type. so we copy the bytes, and then enforce validity of them according to said type.

RalfJ (Sep 01 2018 at 08:11, on Zulip):

(I plan to implement this in miri when I come back from my vacation)

RalfJ (Sep 01 2018 at 08:20, on Zulip):

@Nicole Mazzuca for the record, I have been trying to avoid lvalue and rvalue and prvalue and the roughly one dozen kinds of *value C++ has, because I find this huge amount of terminology very confusing. And in fact the people defining MIR seem to think the same way. So MIR has places, which roughly correspond to lvalues, and Operands, which roughly correspond to rvalues.

(*ptr).size is indeed a place, and then for argument passing there is a _tmp = &mut (*ptr).size where &mut is the operation defining how the place of the left is to be filled; in this case, it is with the location (and metadata) computed on the right. There is actually no Operand/rvalue anywhere in this expression. MIR does not allow nesting &mut in other expressions, so we can avoid that complication. So indeed the data "pointed to" by the Place (*ptr).size is never touched, and does not have to be valid; just the place itself must be valid.

I see no need to introduce compound values here.

RalfJ (Sep 01 2018 at 08:24, on Zulip):

yeah - my interest is making it easier for people who are reading and writing source code, as opposed to those who are writing the spec

I think that is a luxury we can try to achieve once we have a spec. First of all, I disagree that a spec with compound values is easier for people who are reading and writing source code, except if you restrict yourself to safe code only where there is hardly ever any disagreement about what the code will do.
However, irrespective of that, given how hard it is to spec low-level languages, we should do everything we can to make the work simpler for the spec writers. Finally, given that the spec is trusted in the sense that any proof about any Rust program ever takes the spec as a given, and relies on the spec making sense, it is of paramount importance to make the spec as easy to read as we possibly can. Basically, everything the spec says should be immediately obviously sensible to anyone reading it, so that w can have as many eyes as possible on it. If we screw up the spec, the entire tower resting on top of it crumbles. the harder to read a spec is, the less useful it is as a foundation for a language.

RalfJ (Sep 01 2018 at 08:38, on Zulip):

(And with that I am leaving for my vacation, see you in two weeks!)

nikomatsakis (Sep 01 2018 at 10:33, on Zulip):

for example in let PAT = EXPR; evaluation of EXPR writes the result to a temporary memory locations, and the let copies parts of that into named (and possibly longer-lived) memory locations

I just want to point that this is exactly how it is desugared in MIR. For a non-trivial pattern, we first store EXPR into a temporary and then "move" that temporary into PAT. For a trivial pattern like x, we store directly into the memory for x.

rkruppe (Sep 01 2018 at 10:35, on Zulip):

that last part is quite important though, and my not mentioning it has caused confusion, mea culpa

nikomatsakis (Sep 01 2018 at 10:38, on Zulip):

that last part is quite important though, and my not mentioning it has caused confusion, mea culpa

it's just an optimization, though

nikomatsakis (Sep 01 2018 at 10:38, on Zulip):

we just wanted to avoid a temporary

nikomatsakis (Sep 01 2018 at 10:39, on Zulip):

I don't think it's significant in any other way?

Matthew Jasper (Sep 01 2018 at 10:40, on Zulip):

It's visible in cases like #53695

nikomatsakis (Sep 01 2018 at 10:40, on Zulip):

yeah - my interest is making it easier for people who are reading and writing source code, as opposed to those who are writing the spec

ps, @Nicole Mazzuca, I strongly agree with this ambition! However, I am not sure what is the best way to achieve it.

In particular, I think that it often happens that unsafe code authors start with a view of what they want the hardware to do, and then work up to how to express that in Rust. In this case, having the spec be expressed in terms of more low-level concepts is helpful.

nikomatsakis (Sep 01 2018 at 10:41, on Zulip):

It's visible in cases like #53695

hmm, I suppose that's true.

nikomatsakis (Sep 01 2018 at 10:41, on Zulip):

well, that depends

nikomatsakis (Sep 01 2018 at 10:42, on Zulip):

but yeah it depends on when the "pattern storing" takes place, and I guess that most logically it would take place after the block's data has been dropped

rkruppe (Sep 01 2018 at 10:42, on Zulip):

i meant it caused confusion about the mental model, or how we justify validating the scalar corresponding to Box<_> (that was the context in which i wrote that)

nikomatsakis (Sep 01 2018 at 10:42, on Zulip):

that's an interesting point @Matthew Jasper that perhaps affects my opinion about that issue :)

nikomatsakis (Sep 01 2018 at 10:43, on Zulip):

i meant it caused confusion about the mental model, or how we justify validating the scalar corresponding to Box<_> (that was the context in which i wrote that)

I don't see why

nikomatsakis (Sep 01 2018 at 10:43, on Zulip):

in particular, in both cases, we are storing into memory?

nikomatsakis (Sep 01 2018 at 10:43, on Zulip):

in one case, we store directly to the memory for the local varibale, in the other to a temporary?

nikomatsakis (Sep 01 2018 at 10:43, on Zulip):

I guess I have to re-read that part of your discussion, I got a bit lost there

nikomatsakis (Sep 01 2018 at 10:43, on Zulip):

anyway, I should go do weekend things too :beach: :P

rkruppe (Sep 01 2018 at 10:44, on Zulip):

hm yes, the core thing i missed was that we can and should validate scalars when storing, not just when loading

rkruppe (Sep 01 2018 at 10:46, on Zulip):

i was very incorrectly going with "scalars are validated on loads" and then the extra copy is significant

nikomatsakis (Sep 01 2018 at 10:47, on Zulip):

ah, I see, yes, that makes sense :)

nikomatsakis (Sep 01 2018 at 10:48, on Zulip):

well, I may also be mistaken, it's worth keeping in mind, particularly seeing as MIR does this optimization

nikomatsakis (Sep 01 2018 at 10:49, on Zulip):

(and I believe that our first efforts at defining UB semantics will probably begin from MIR)

nikomatsakis (Sep 01 2018 at 10:49, on Zulip):

though it of course means we have to define lowering into MIR to get a full round-trip

nikomatsakis (Sep 01 2018 at 10:49, on Zulip):

but then we sort of need that anyway, because otherwise we can't really spec out the borrow checker

rkruppe (Sep 01 2018 at 10:50, on Zulip):

i think it makes sense to define all evaluation wrt to a destination location, and then speccing this behavior is just a matter of passing along the right destination location in various contexts

nikomatsakis (Sep 01 2018 at 10:55, on Zulip):

it makes sense to look at MIR construction here — we have a few different "modes" of evaluation ultimately

nikomatsakis (Sep 01 2018 at 10:55, on Zulip):

but maybe those can be viewed as optimizations

nikomatsakis (Sep 01 2018 at 10:56, on Zulip):

(and we'd want to show then that they are valid)

nikomatsakis (Sep 01 2018 at 10:56, on Zulip):

e.g., we don't make a temporary value for something like foo(22) =)

nikomatsakis (Sep 01 2018 at 10:57, on Zulip):

still, it'd obviously be nice to keep the "reference model" as simple as possible

nikomatsakis (Sep 01 2018 at 10:57, on Zulip):

(and — in the early days of MIR — we did make temporaries for all those cases...)

rkruppe (Sep 01 2018 at 10:57, on Zulip):

it makes sense to look at MIR construction here — we have a few different "modes" of evaluation ultimately

this is a very good point, i am actually tragically uninformed about mir construction

nikomatsakis (Sep 01 2018 at 10:57, on Zulip):

the main other one is evaluating to a "place"

nikomatsakis (Sep 01 2018 at 10:57, on Zulip):

e.g. for things like &foo.bar

nikomatsakis (Sep 01 2018 at 10:57, on Zulip):

in cases like &bar() it creates a temporary and returns that

rkruppe (Sep 01 2018 at 10:58, on Zulip):

ah but this is 1. crucial for the semantics and 2. based on a syntactic classification of place expressions vs value expressions, right?

nikomatsakis (Sep 01 2018 at 10:58, on Zulip):

yes

nikomatsakis (Sep 01 2018 at 10:58, on Zulip):

I raised it because it is a distinction we cannot eliminate

rkruppe (Sep 01 2018 at 10:59, on Zulip):

ok enough of my edification, go do weekend things :)

Nicole Mazzuca (Sep 01 2018 at 13:47, on Zulip):

@nikomatsakis I absolutely disagree, I guess, on how people write unsafe code. as a C++ developer with access to a good standard, I do not go that way at all; I look at how to solve my problems in the abstract model of C++.

Nicole Mazzuca (Sep 01 2018 at 13:48, on Zulip):

similar to most other people I know in the C++ community

Nicole Mazzuca (Sep 01 2018 at 13:50, on Zulip):

and I want Rust to be that way. I don't want people to think "oh, I want to have a tagged pointer, I should give myself a pointer type and then tag the bottom bits, because that is valid on the hardware". I want someone to think "oh, I should have a usize which can be cast to a pointer, because rvalue-conversion on an invalid pointer is suspect at best"

Nicole Mazzuca (Sep 01 2018 at 14:00, on Zulip):

@Nicole Mazzuca for the record, I have been trying to avoid lvalue and rvalue and prvalue and the roughly one dozen kinds of *value C++ has, because I find this huge amount of terminology very confusing. And in fact the people defining MIR seem to think the same way. So MIR has places, which roughly correspond to lvalues, and Operands, which roughly correspond to rvalues.

(*ptr).size is indeed a place, and then for argument passing there is a _tmp = &mut (*ptr).size where &mut is the operation defining how the place of the left is to be filled; in this case, it is with the location (and metadata) computed on the right. There is actually no Operand/rvalue anywhere in this expression. MIR does not allow nesting &mut in other expressions, so we can avoid that complication. So indeed the data "pointed to" by the Place (*ptr).size is never touched, and does not have to be valid; just the place itself must be valid.

I see no need to introduce compound values here.

It's really not that hard, and it's kind of necessary to understand for purposes of typechecking Rust. There are 3 kinds of values in C++, and two categories, but one of those value kinds is unnecessary in Rust, and so the categories don't need to exist either.

There are two kinds of things, lvalues, and rvalues. An lvalue is any expression which refers to a specific place in memory, and you can only access one of these through the * or [] operators. An rvalue is any expression which is not an lvalue. An rvalue conversion is when you want an rvalue (for example, to initialize a stack-local) but all you have is an lvalue.

(note: There are three kinds of lvalues - read, read-write, and owned.)

This is why I say it's not that complicated, and also, you need to understand this stuff anyways if you want to understand some stuff that Rust does around Drop/non-Copy types, and lifetimes, so I think it's entirely valid to define a model in terms of these things that Rust programmers should inherently understand. We should not define the model in terms of MIR, a specific lowering of Rust _that changes_ depending on optimizations and versions of Rust. I want to know what a valid lowering of Rust to assembler is, which means we should define a model in terms of Rust.

alercah (Sep 06 2018 at 00:14, on Zulip):

I will admit, I have skimmed a lot of this conversation. I'm starting to think that the ideal model is to try to think of a value-level semantics in a way that becomes easy to translate things upward.

whether something is UB is a runtime property dependent on data

This stood out to me, in part because it's true, but in particular because I feel like we generally do not want UB to be dependent on the compiler. It ought not be possible to argue that a given interaction is fine simply because the compiler lets it happen sometimes. Any purely hardware-based or representation-based semantics would allow that, I think.

A purely object-value based semantics like C's and C++'s are also difficult, however, because they lead to a lot of confusion and disagreement and underspecifiedness.

alercah (Sep 06 2018 at 00:27, on Zulip):

So perhaps there's some middle ground where we could define a machine which is concrete enough to code and teach people by, but abstract enough not to let coincidences slip by? I guess my metric here would be "could we theoretically write a run-time sanitizer for this".

alercah (Sep 06 2018 at 00:30, on Zulip):

(I suspect the answer is that yes, we could do so, provided that the program doesn't move bytes outside the sanitizer's purview. But we could imagine a world where e.g. the VM's extra tagging could include the filesystem and networked applications and such. I think the same would apply to RalfJ's tags in the stacked borrowed model)

alercah (Sep 06 2018 at 06:17, on Zulip):

I had a nice conversation tonight with @Nicole Mazzuca on Discord and I think I'm pretty much in agreement with her. The stuff I wrote above seems wrong at the edges, so please don't nitpick it too deeply. ;)

alercah (Sep 06 2018 at 06:19, on Zulip):

Although I just mean that about the stuff I wrote above. I'm unsure about lvalue/rvalue terminology, except to say that if we agree on the concepts then I'm not too concerned for now about the actual terms for them.

nikomatsakis (Sep 06 2018 at 09:52, on Zulip):

@alercah link to conversation, perhaps?

Nicole Mazzuca (Sep 06 2018 at 17:19, on Zulip):

@nikomatsakis it was in a private message

alercah (Sep 08 2018 at 21:04, on Zulip):

(oh, nvm, already addressed)

RalfJ (Sep 17 2018 at 18:07, on Zulip):

@Nicole Mazzuca

This is why I say it's not that complicated, and also, you need to understand this stuff anyways if you want to understand some stuff that Rust does around Drop/non-Copy types, and lifetimes, so I think it's entirely valid to define a model in terms of these things that Rust programmers should inherently understand. We should not define the model in terms of MIR, a specific lowering of Rust _that changes_ depending on optimizations and versions of Rust. I want to know what a valid lowering of Rust to assembler is, which means we should define a model in terms of Rust.

Rust is way too high-level to be specified directly. Things like pattern matching, struct initializers and other complex compound expressions should be desugared to a core language. I think it is one of the big mistakes of C/C++ to not do this. For the same reason that compiling C, C++ or Rust directly to hardware is hard and full of redundancy, it is not a good idea to specify them directly. IRs are extremely useful as a specification device, not just as building block of a compiler. For example, the fact that miri can "specify" (implement, really) Rust behavior without any notion of "read, read/write and owned" values to me is a big sign saying that we do not actually need these notions to talk about Rust behavior.

That core language does not have to be MIR, but I think something close to MIR is a pretty good choice.

RalfJ (Sep 17 2018 at 18:10, on Zulip):

An lvalue is any expression which refers to a specific place in memory, and you can only access one of these through the * or [] operators. An rvalue is any expression which is not an lvalue.

Your notion of a "value" must be very far from mine. In my world (using the usual PL terminology), expressions evaluate to values, but most of the time they are not values. A value is a fully computed "thing" -- a number, or a memory location, or so. It is something static. An expression is a computation that results in a value, it is something dynamic (and then we can bikeshed about whether it can have sideffects or whether those are treated separately).

nikomatsakis (Sep 17 2018 at 18:20, on Zulip):

the key thing is that an "lvalue" is really not a value, but an expression

nikomatsakis (Sep 17 2018 at 18:20, on Zulip):

basically synonym for a MIR Place

nikomatsakis (Sep 17 2018 at 18:21, on Zulip):

or at least that's how I think of it...

nikomatsakis (Sep 17 2018 at 18:21, on Zulip):

...I am not sure perhaps the C++ Spec has a distinct notion

rkruppe (Sep 17 2018 at 18:23, on Zulip):

an lvalue or place is not a source language expression, although it includes field accesses and indexing and so on just as source language expressions do, so it's confusing to call it "expression" IMO. i prefer to think of them as something analogous to values, though with more complicated structure than e.g. u32

Jake Goulding (Sep 17 2018 at 19:15, on Zulip):

As a compiler user not author, I've always found the terms "lvalue" and "rvalue" extremely opaque and generally not useful. I don't know how much y'all want to try to stick to existing / established terms vs making it more accessible tho.

rkruppe (Sep 17 2018 at 19:16, on Zulip):

you're right, they're crap terms. we've already moved to "place" instead of "lvalue" in compiler internal identifiers and the reference, and i think i'm not the only one who would prefer to follow up by calling "rvalues" simply "values"

Jake Goulding (Sep 17 2018 at 20:16, on Zulip):

Still candidates for the hypothetical glossary I advocate for ;-)

avadacatavra (Sep 17 2018 at 22:12, on Zulip):

@Jake Goulding the glossary isn't hypothetical! it's just...not totally filled out (but i stuck it in the mdbook because i want it too!)

Jake Goulding (Sep 17 2018 at 22:12, on Zulip):

oh, neat... where might I find it?

Jake Goulding (Sep 17 2018 at 22:12, on Zulip):

s/hypothetical/unknown to me/

Nicole Mazzuca (Sep 18 2018 at 02:18, on Zulip):

@rkruppe they are definitely non-descriptive, and value/place are better terms.

Nicole Mazzuca (Sep 18 2018 at 02:19, on Zulip):

@RalfJ I just fundamentally disagree. You can write your models all you want, but what's useful to the average rust programmer is a model in terms of Rust, not a model in terms of an IR that is definitely not Rust.

RalfJ (Sep 18 2018 at 07:32, on Zulip):

@rkruppe

i think i'm not the only one who would prefer to follow up by calling "rvalues" simply "values"

MIR/miri call them "operands". Personally I find "value" a bit confusing there because operands can live in memory (in x = y;, when y is big, what is the value? we do a memcpy here, not a "first load, then store".
However, it is possible that this is just an optimization -- that we can see values as "chunks of bytes", and the fact that an operand "delays" the load is not actually observable. that would be a nice model indeed, I think, and then we can call them "values".

RalfJ (Sep 18 2018 at 07:33, on Zulip):

On the subject of places, I think the "value" content here is a memory address. that is more complicated than u32 indeed, but not very much so. In miri, a place is a triple of a pointer, an expected alignment, and metadata for unsized places.

RalfJ (Sep 18 2018 at 07:34, on Zulip):

@Jake Goulding seems the glossary is at https://github.com/rust-rfcs/unsafe-code-guidelines/blob/master/reference/src/introduction.md. Took me a bit to find it because the filename does not exactly give it away^^

RalfJ (Sep 18 2018 at 07:49, on Zulip):

@RalfJ I just fundamentally disagree. You can write your models all you want, but what's useful to the average rust programmer is a model in terms of Rust, not a model in terms of an IR that is definitely not Rust.

What I am arguing for is that the best and most useful way to describe a complicated thing is by describing it in terms of something simpler. Instead of doing one HUGE step -- a direct semantics for surface-level Rust -- we do two much smaller steps: A semantics for a language with few, easy to understand concepts; and a translation into that language. the translation takes care of concerns like evaluation order and other things I mentioned. I think this is much more useful to a programmer than a complex HUGE direct semantics, because there is actually a chance they might understand it. Sometimes, adding an intermediate abstraction provides an overall huge complexity reduction and thus makes things much easier to understand, and I strongly believe language semantics is one of these cases. A direct semantics for the surface language is near useless as it is too complex, not understandable.

I see a bunch of evidence for this:

nikomatsakis (Sep 18 2018 at 16:57, on Zulip):

So, I'd like to step back. I have a feeling that we all share the same endgoals:

My belief is that the best route to get there is to operate in layers. That is, defining UB for a "MIR-like" abstraction is hard enough. Assuming we find a model we are happy with, then we can define a mapping from Rust down to MIR, and we have a specification. Perhaps one that is not in the best form for universal consumption, but an actionable one. (And I tend to think that, while most Rust programmers aren't familiar with MIR per se, they have some kind of internal desugaring going on in their heads anyway — e.g., they eventually learn that foo.bar() is sugar for Foo::bar(&foo) and so forth.)

Now, given a specification, the task becomes to explain that spec in various ways such that it is understandable. I imagine that the best way to explain it will depend on the audience. e.g., C programmers might already have a kind of concrete abstraction for the machine and would prefer to start from a MIR-like level. People coming to Rust from some other background might prefer a different approach.

In terms of audiences, I do think that compiler authors hold a mildly privileged space, in that (as @RalfJ pointed out), they are the ones who have to apply the spec to decide whether a given optimization fits and so forth, as well as implementing dynamic checkers. For this reason, I think that having the "primary, normative" variant be done in terms of something quite close to the IR that they will be using for this (MIR) makes sense.

Jake Goulding (Sep 18 2018 at 17:03, on Zulip):

I'd expect "revelations" at one layer to propagate up and down the abstraction stack, informing decisions

nikomatsakis (Sep 18 2018 at 17:04, on Zulip):

I do think it's reasonable to hold off on declaring anything "stable" or "done" until we're confident it can be reasoned about from multiple "angles"

nikomatsakis (Sep 18 2018 at 17:05, on Zulip):

and also I should say that when I say "MIR" I don't literally mean the compiler's representation, necessarily, but perhaps a slightly cleaned up and idealized form of it...

avadacatavra (Sep 19 2018 at 11:04, on Zulip):

@Jake Goulding https://github.com/rust-rfcs/unsafe-code-guidelines/blob/master/reference/src/introduction.md
i stuck it in the intro for now (thinking, maybe we should define terms before using them) but that's mostly just so that we have something

Jake Goulding (Sep 19 2018 at 13:14, on Zulip):

@avadacatavra yup thanks! (@RalfJ linked it earlier and I submitted a PR to add some words I'd like defined)

Last update: Nov 19 2019 at 17:45UTC