Stream: t-compiler

Topic: AST("CST")/parsing evolution


eddyb (May 01 2019 at 10:55, on Zulip):

@matklad @nikomatsakis I started writing https://hackmd.io/zrZhb94HS6KxW3sguwXNqA btw, if you want to follow along as I do this :)

eddyb (May 01 2019 at 12:09, on Zulip):

wait what, where did the folder go?!

eddyb (May 01 2019 at 12:12, on Zulip):

@nnethercote @Vadim Petrochenkov IMO https://github.com/rust-lang/rust/pull/58061 is a serious step backward, I don't know how to make any progress without reverting it...

eddyb (May 01 2019 at 12:15, on Zulip):

so there's our first action item: reverting a PR 3 months after it landed because I didn't realize it fully replaced the existing thing

Vadim Petrochenkov (May 01 2019 at 12:26, on Zulip):

What is the problem exactly?
Everything doable in a folder should be doable in a mutable visitor as well?

Vadim Petrochenkov (May 01 2019 at 12:28, on Zulip):

Fold is representable as "clone + modify", and the result still can be allocated into an immutable arena if necessary, I think.

eddyb (May 01 2019 at 12:53, on Zulip):

@Vadim Petrochenkov see the hackmd link above, at the bottom (above the "generativity" bullet-point, I should find a way to hide that part)

eddyb (May 01 2019 at 12:55, on Zulip):

you want to be able to dynamically decide not to change anything, without a separate deep immutable visit

eddyb (May 01 2019 at 12:57, on Zulip):

so if you want to enforce mutation is detected, on top of that, you want to go as far away as possible from mutating nodes directly

eddyb (May 01 2019 at 12:57, on Zulip):

I consider unconditional deep cloning a non-starter

eddyb (May 01 2019 at 12:58, on Zulip):

while T -> T and &mut T -> () are somewhat equivalent, only the former lets you vary the input and output independently

eddyb (May 01 2019 at 13:00, on Zulip):

so that PR removed the freedom to change the interface, in the name of performance (which was measured on a broken implementation - today's rustc, wrt proc macros, that is - so I'm not sure it even counts!)

eddyb (May 01 2019 at 13:08, on Zulip):

I mean, sure, you can have all your signatures be &mut &'arena T -> (), but at that point, that's pretty silly, since &'arena T would be copyable, so you lose nothing, performance-wise, if you switch to &'arena T -> &'arena T

eddyb (May 01 2019 at 13:10, on Zulip):

@matklad I think I've covered most of the design space I've been thinking about lately, I'm curious if I missed anything

eddyb (May 01 2019 at 13:22, on Zulip):

anyway, I'll go have lunch now and then work on other things, but I can come back to this if there any remarks

nikomatsakis (May 01 2019 at 13:33, on Zulip):

(I broke this out into a separate topic, @eddyb -- I'll check out the hackmd soon)

eddyb (May 01 2019 at 13:36, on Zulip):

note that there's not much of an evolution written out, just a bunch of alternatives to how we do things today, and the pressing issue of AST -> TokenStream (which today is arguably broken)

matklad (May 01 2019 at 14:06, on Zulip):

@eddyb awesome write up! Seems like a good coverage indeed, left a couple of comments inline, which boil to:

Additionally (and which is totally not my area of expertise), I'd love to see motivation why "tracking original tokens" is a better solution in a given circumstances , than "fixing pretty-printing". I guess in theory we could fix pretty-printing to be 100% reliable, right? Long-term, this seems like a better approach, because it works with identity-less AST.

eddyb (May 01 2019 at 14:25, on Zulip):

you can cache the results, not sure why identity is a problem. it should never leak into e.g. proc macros

eddyb (May 01 2019 at 14:26, on Zulip):

you can't have 100% reliable pretty-printing

eddyb (May 01 2019 at 14:26, on Zulip):

without... a parse tree. which literally keeps the original tokens around!

eddyb (May 01 2019 at 14:27, on Zulip):

I really don't see how you can 100% remove identity

eddyb (May 01 2019 at 14:28, on Zulip):

oh one thing I didn't write is "relative/contextual identity" (which would let you reuse the same sub-tree in two places, e.g. when using $foo more than once in macro_rules!)

matklad (May 01 2019 at 14:35, on Zulip):

I have the following plan (not fully implemented) for id-less syntax trees:

This bit is ready, and more or less works in rust-analyzer.

For macro-expansion:

This bit is completely vaporware at the moment :-)

eddyb (May 01 2019 at 14:35, on Zulip):

@matklad oh, also, you haven't seen my (crazier) def ideas, which change some of the dynamics here :P (but I don't think enough to fully replace the AST)

eddyb (May 01 2019 at 14:35, on Zulip):

this write-up took priority, but I should do the "def" one too soon

matklad (May 01 2019 at 14:36, on Zulip):

That is, the single bit where we seem to really require IDs is macro expansion: we need to figure out which tokens in the expansion originated from which source tokens.

But this IDs are on the token trees, not on the AST.

eddyb (May 01 2019 at 14:37, on Zulip):

that's just implicit in rustc

eddyb (May 01 2019 at 14:37, on Zulip):

do you do name resolution on HIR?

eddyb (May 01 2019 at 14:37, on Zulip):

(for things other than macro invocations)

eddyb (May 01 2019 at 14:38, on Zulip):

if so, your "HIR" is our "AST" (or in the future, "Def"+"AST") :P

eddyb (May 01 2019 at 14:38, on Zulip):

tfw names stop being that useful

matklad (May 01 2019 at 14:38, on Zulip):

:D

eddyb (May 01 2019 at 14:40, on Zulip):

if you add that "Def" layer around AST/HIR then, yes, you can do nicer things. in particular, you can assign IDs very early during parsing that you can keep using all the way to the bank codegen, and even across crates

matklad (May 01 2019 at 14:40, on Zulip):

I run name resolution during translation from syntax trees to HIR.

The core thing is that macro expansion operates on stuff without identity

eddyb (May 01 2019 at 14:40, on Zulip):

and you can make the actually-trees of AST/HIR (Ty/Expr/Pat) far more value-like

eddyb (May 01 2019 at 14:40, on Zulip):

yeah, the name resolution algorithm we have doesn't really admit that

eddyb (May 01 2019 at 14:41, on Zulip):

because of all the stateful tracking

eddyb (May 01 2019 at 14:42, on Zulip):

rewriting that, backwards-compatibly (or even making a second implementation) will likely take a while, in a language like Rust

eddyb (May 01 2019 at 14:42, on Zulip):

and will need some amount of identity anyway

matklad (May 01 2019 at 14:42, on Zulip):

that is undeniably true :-) Working with existing architecture is the only way to fix immediate problems

eddyb (May 01 2019 at 14:43, on Zulip):

@matklad again, this is offtopic, but, this is the diagram for the write-up I haven't done yet: today: https://bit.ly/2GBLppR -> future: https://bit.ly/2L4HYxp

eddyb (May 01 2019 at 14:43, on Zulip):

I hope it helps even a bit :P

matklad (May 01 2019 at 14:44, on Zulip):

I read it as "Thing with ID is Def, and AST is a value-type attribute of DEF", and I like it :)

matklad (May 01 2019 at 14:45, on Zulip):

and will need some amount of identity anyway

Where exactly do we need identity?

eddyb (May 01 2019 at 14:49, on Zulip):

referring to an invocation in a way that reflects its scope

eddyb (May 01 2019 at 14:50, on Zulip):

unless you want to use some sort of inverted tree to represent scopes (my head hurts just thinking of that)

eddyb (May 01 2019 at 14:53, on Zulip):

also, you can't map from an identity-less AST back to source in any meaningful way (assuming you remove Span too) without either a tree that has the same shape as with an identity-based AST, or some relative scheme - I guess I could see an identity-less Parse Tree working, but since that would include all the tokens, I don't know... we'll have to explore this further in the future

eddyb (May 01 2019 at 14:54, on Zulip):

"AST is a value-type attribute of DEF",

Kinda! {ast,hir}::{Ty,Expr,Pat} would be hanging off the Defs (which would correspond to things that have DefIds today)

eddyb (May 01 2019 at 14:54, on Zulip):

oops, accidentally sent partial message

matklad (May 01 2019 at 14:54, on Zulip):

in ra, we refer to macro invocations by a pair of (FileId, InFileSpan), where FileId contains "identity" bits (So it's more like a current module, rathet than a file), and InFileSpan is a pointer into AST, which is a value type

eddyb (May 01 2019 at 14:55, on Zulip):

okay so that's just the relative scheme I was talking about

eddyb (May 01 2019 at 14:56, on Zulip):

where FileId is the container (which I want to focus around tokens, because that's what proc macros output)

matklad (May 01 2019 at 14:56, on Zulip):

yeah, with a twist that the node itself doesn't know it's id

eddyb (May 01 2019 at 14:56, on Zulip):

oh I also meant to say I replied to your comments

eddyb (May 01 2019 at 14:57, on Zulip):

if you're using pointers for identity, that's isomorphic to having an ID inside :P

eddyb (May 01 2019 at 14:57, on Zulip):

or to replacing the pointers (to children nodes) with IDs allocated in an array

matklad (May 01 2019 at 14:57, on Zulip):

if you're using pointers for identity, that's isomorphic to having an ID inside :P

Nope: macro call doesn't know FileId

eddyb (May 01 2019 at 14:58, on Zulip):

FileId is the contextual part, I'm referring to InFileSpan

matklad (May 01 2019 at 14:59, on Zulip):

I woldn't call InFileSpan and id: it's positional, and not identity based. Like 0..5 is not and ID of hello in "hello, world", because it works for any copy of "hello world"

matklad (May 01 2019 at 15:00, on Zulip):

so, this might be just a question of terminology :)

eddyb (May 01 2019 at 15:00, on Zulip):

so you're not really missing global identity (since (FileId, InFileSpan) is just that), it's just relative, which has benefits (if we can get it to work)

matklad (May 01 2019 at 15:01, on Zulip):

Yeah, I have global identity, but it is strictly outside the tree itself. I think it useful when implementing refactorings or other things which need to generate syntax on the fly.

If you generate syntax trees out of thin air, its good if you don't need to care how to assign identity to them

eddyb (May 01 2019 at 15:01, on Zulip):

I've talked before about the relative ID thing as being the ultimate evolution of "token ranges", I just don't know how to use it to optimize certain things correctly

eddyb (May 01 2019 at 15:01, on Zulip):

okay, yeah, something I should've pointed out is that the nodes don't need to know their identity

eddyb (May 01 2019 at 15:02, on Zulip):

you just need to be able to obtain it while visiting them, starting from some root

eddyb (May 01 2019 at 15:02, on Zulip):

let me reread that section and perhaps amend it, to make this clearer

eddyb (May 01 2019 at 15:04, on Zulip):

(sorry for getting a bit confused and assuming that you truly had no sort of global identity and instead were doing some things value-based)

matklad (May 01 2019 at 15:05, on Zulip):

also, you can't map from an identity-less AST back to source in any meaningful way

This actually works quite successfully in rust-analyzer. We use a "source-map" pattern for this: when we lower raw syntax to HIR, we produce a pair of (HIR, SourceMap), where HIR is an ID-based arena allocated graph, and SourceMap is an FxHashMap<HirId, SourceNode>.

The cool bit here is that HIr becomes completely independent of the source representation! So, if you edit the source such that HIR is the same, then only SourceMap component changes. Because almost no code looks at the source-map, that makes stuff super-incremental

matklad (May 01 2019 at 15:06, on Zulip):

Yeah, I could also have been more clear that I argue not against global ids per-se, but against syntax trees, which know their identity.

eddyb (May 01 2019 at 15:13, on Zulip):

yeah when I said identity-less I meant "it's really gone, completely"

eddyb (May 01 2019 at 15:15, on Zulip):

@matklad added a pagaraph to https://hackmd.io/zrZhb94HS6KxW3sguwXNqA?view#A-Global-node-identity-unique-precise-lossless

eddyb (May 01 2019 at 15:16, on Zulip):

(I can't think of a way to integrate it into the above bullet-point, I wish there was an easier way to describe multi-dimensional design spaces :P)

matklad (May 01 2019 at 15:17, on Zulip):

Can't you insert raw-html with position: absolute and appropriate z-index? /s :D

eddyb (May 01 2019 at 15:18, on Zulip):

you mean 3D CSS transforms :P?

eddyb (May 01 2019 at 15:20, on Zulip):

okay this has been a bit offtopic, but I guess it all overlaps enough that there's no reason to split the Zulip topic or w/e. also, I really wanted to go work on a few other things, so I'll get back to this maybe tomorrow or on Friday

matklad (May 03 2019 at 14:13, on Zulip):

@eddyb moved out the weeds out of steering meeting :D

eddyb (May 03 2019 at 14:14, on Zulip):

idk if these are weeds, and, like, we have the other topic already?

matklad (May 03 2019 at 14:16, on Zulip):

So, agree 100% that mem-management is imiportant, and, more generally, the lifetime of syntax trees (when in gets destroied). In rust-analyzer, syntax trees weight a ton, due to all of comments and whitespace :-( My plan is to make sure that only a small fraction of syntax trees is kept in memory, and that the are loaded and unloaded on-demand

matklad (May 03 2019 at 14:17, on Zulip):

that is, besides optimizing just the memory footprint of the datastructure, we might be able to save mem by a sort-of "overlay" system, where you look only at a section of a tree at a time, jsut to convert it to HIR

eddyb (May 03 2019 at 14:17, on Zulip):

well, specifically I am interested in the AST which destroys parse tree minutiae like that, and the transitionary period of "we still have some sort of AST mutation" as opposed to all expansion being done in a very "functional" manner

eddyb (May 03 2019 at 14:18, on Zulip):

I definitely agree with you, if we were to have a true parse tree (based on GLL or anything else)

matklad (May 04 2019 at 00:05, on Zulip):

Semi-related: I got curious about how Swift parser is set-up to produce an AST and a CST simultaneously. Swift doesn't build AST from CST, rather, both are produced at the same time for historical reasons which will probably apply to Rust as well, when we have a CST.

And turns out there's no magic: in each parsing function, there are two pieces of code: for AST and for CST. Here's how a simple expression is parsed (<expr-is>, I assume some kind of instanceof check):

Last update: Nov 16 2019 at 01:35UTC