Stream: t-compiler/wg-rls-2.0

Topic: Extracting the lexer


matklad (Apr 01 2019 at 18:16, on Zulip):

I’ve just realized that a relatively easy part of the compiler to extract would be the lexer. I imagine it could have a relatively pure interface which works for both rustc and rls2 (the current interface in rustc is very impure and, for example, depends on string interning). Note that in and of itself, extracting the lexer wouldn’t be beneficial: it’s a stable and easy part. But this would be a good precedent

matklad (Apr 01 2019 at 18:17, on Zulip):

Should we spend time on this?

detrumi (Apr 03 2019 at 11:21, on Zulip):

It might be a good starting point to get things to line up

detrumi (Apr 03 2019 at 11:23, on Zulip):

For details like string interning, should those line up as well, or can it be abstracted over?

detrumi (Apr 03 2019 at 11:24, on Zulip):

It feels like rustc does many things differently for performance, which makes it hard to extract. But maybe that's just a problem with the current implementation?

matklad (Apr 03 2019 at 11:27, on Zulip):

I hope there's a purely-functional interface to the lexer, which should achieve both good perf and flexibility, required for IDE

matklad (Apr 03 2019 at 11:28, on Zulip):

In particular, I think the lexer could return string slices, and leave the interning to the calling code

detrumi (Apr 03 2019 at 11:30, on Zulip):

makes sense
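[Editor's note: a minimal sketch of the "return slices, let the caller intern" idea discussed above. The `next_token`, `Token`, and `TokenKind` here are toy stand-ins for the proposed API (handling only identifiers and whitespace), not the real crate:]

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Clone, Copy)]
enum TokenKind { Ident, Whitespace, Unknown }

struct Token { kind: TokenKind, len: usize }

// Toy stand-in for the proposed `next_token`: a pure function of the
// input that returns only a kind and a byte length -- no interning.
fn next_token(src: &str) -> Token {
    assert!(!src.is_empty());
    let first = src.chars().next().unwrap();
    let (kind, pred): (TokenKind, fn(char) -> bool) = if first.is_alphabetic() {
        (TokenKind::Ident, |c| c.is_alphanumeric())
    } else if first.is_whitespace() {
        (TokenKind::Whitespace, |c| c.is_whitespace())
    } else {
        (TokenKind::Unknown, |_| false)
    };
    let len = src
        .chars()
        .take_while(|&c| pred(c))
        .map(char::len_utf8)
        .sum::<usize>()
        .max(first.len_utf8());
    Token { kind, len }
}

// The *caller* slices the source text and does the interning;
// the lexer itself never sees an interner.
fn intern_idents(src: &str) -> Vec<u32> {
    let mut interner: HashMap<&str, u32> = HashMap::new();
    let mut ids = Vec::new();
    let mut rest = src;
    while !rest.is_empty() {
        let tok = next_token(rest);
        if let TokenKind::Ident = tok.kind {
            let fresh = interner.len() as u32;
            ids.push(*interner.entry(&rest[..tok.len]).or_insert(fresh));
        }
        rest = &rest[tok.len..];
    }
    ids
}

fn main() {
    // "foo" and the second "foo" intern to the same symbol id.
    assert_eq!(intern_idents("foo bar foo"), vec![0, 1, 0]);
}
```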

matklad (Apr 04 2019 at 11:38, on Zulip):

So, I've looked closer at the lexer and I think we should do it. My plan is to produce a rust_lexer crate, which has zero/few dependencies (we might pull something in for XID_Start), with roughly the following interface:

pub enum LexerError {
    BareCRinComment { position: usize },
    ....
}

pub struct Token {
    kind: TokenKind,
    len: usize,
}

#[repr(u8)] // fieldless
pub enum TokenKind {
    Shebang,
    Comment,
    Whitespace,
    ....
    Error,
}

pub fn next_token(src: &str, is_file_start: bool, errors: &mut Vec<LexerError>) -> Token {
    assert!(!src.is_empty());
    ....
}

RLS2 will consume the tokens and build up a concrete syntax tree.

rustc will consume the tokens and build the rich rustc tokens, with spans, interned identifiers, escaped strings, etc.

the crate will live in the rust-lang/rust repo. Long term, I think we should add a /libs top-level dir, alongside /src, where we store the librarified compiler which you can build without x.py, LLVM and bootstrapping.

@WG-rls2.0 thoughts?
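[Editor's note: either consumer would drive this one-token-at-a-time interface in a loop, turning the relative `len` of each token into absolute spans. A sketch, again assuming a hypothetical stand-in lexer (splitting only on whitespace) rather than the real crate:]

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Kind { Word, Space }

// Hypothetical length-only token, shaped like the proposed interface.
struct Token { kind: Kind, len: usize }

fn next_token(src: &str) -> Token {
    assert!(!src.is_empty());
    let ws = src.chars().next().unwrap().is_whitespace();
    let len = src
        .chars()
        .take_while(|c| c.is_whitespace() == ws)
        .map(char::len_utf8)
        .sum();
    Token { kind: if ws { Kind::Space } else { Kind::Word }, len }
}

// Driver loop: the consumer, not the lexer, computes absolute
// (start, end) spans by accumulating token lengths.
fn spans(src: &str) -> Vec<(Kind, usize, usize)> {
    let mut out = Vec::new();
    let mut offset = 0;
    while offset < src.len() {
        let tok = next_token(&src[offset..]);
        out.push((tok.kind, offset, offset + tok.len));
        offset += tok.len;
    }
    out
}

fn main() {
    assert_eq!(
        spans("fn main"),
        vec![(Kind::Word, 0, 2), (Kind::Space, 2, 3), (Kind::Word, 3, 7)]
    );
}
```

This is the design point above: because the lexer returns only kinds and lengths, spans, interning, and tree-building all stay in the consumer, so rustc and rust-analyzer can each layer their own representation on top.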

matklad (Apr 04 2019 at 11:40, on Zulip):

Note that the immediate benefit from extracting the lexer would be negligible: it's a simple, stable bit of code which rarely needs modification. However, I find it important to actually start the "from the ground up" librarification, and this seems like a good case.

detrumi (Apr 04 2019 at 12:27, on Zulip):

In the short term, having it live inside rust-lang/rust would probably mean that it requires x.py and bootstrapping, which is a bit too much overhead for such a small library. And for the long term, that'd mean that x.py would also have to call cargo to build the /libs crates, wouldn't that complicate the build process further?

detrumi (Apr 04 2019 at 12:28, on Zulip):

But I guess moving the lexer to a separate repo (and to crates.io?) would make changes harder, since it would be spread across multiple repositories

matklad (Apr 04 2019 at 12:35, on Zulip):

Yeah, the plan is to publish it to crates.io, so that rust-analyzer can just pick it up. The /libs setup will complicate the build, but not too much: x.py already uses cargo to build everything

matklad (Apr 04 2019 at 19:09, on Zulip):

proposed API: https://github.com/rust-lang/rust/pull/59706

Last update: Nov 12 2019 at 15:45 UTC