Stream: t-compiler/wg-rls-2.0

Topic: Rowan memory-usage


Simon Vandel Sillesen (May 17 2020 at 01:22, on Zulip):

Hi

DHAT shows the Rowan trees to be the most memory-hungry while running analysis-stats. AFAICT Rowan nodes are only cached for a file, but this cache is not shared across files. Is this correct?

I would like to investigate ways to reduce Rowan memory usage. I have some ideas I would like to try, but unsure how to do it.

Any pointers? Ideas?

matklad (May 17 2020 at 08:34, on Zulip):

Yeah, sharing node cache should be an easy win!

The simplest thing to do would be to add a read-only cache with pre-interned Node, and than use it for all files

matklad (May 17 2020 at 08:35, on Zulip):

More complicated thing would be something like a thread-local cache, but I am afraid it can grow unbounded, so it also needs some kind of LRU...

Simon Vandel Sillesen (May 17 2020 at 13:31, on Zulip):

Regarding sharing node cache. I see Roslyn simply makes SyntaxNodeCache be a static class with a bound of 65536 elements. https://github.com/dotnet/roslyn/blob/f03f87fbb636dfba824947c0e19e779ee9a39626/src/Compilers/Core/Portable/Syntax/InternalSyntax/SyntaxNodeCache.cs#L113

I'll experiment with instantiating NodeCache thread locally statically. Let's try for now without any bounds

matklad (May 17 2020 at 13:50, on Zulip):

Yeah, actually making it just a global should also work. The only thing I am worrying about is that this might make parsing slower, as you'd need synchronisation for looking up nodes

Simon Vandel Sillesen (May 17 2020 at 14:31, on Zulip):

Yeah, I suspect the synchronization cost will be high (haven't benchmarked though). How parallel is rust analyzer currently? Analysis-stats seems serial AFAICT

matklad (May 17 2020 at 14:53, on Zulip):

The only really parallel bit is this one: https://github.com/rust-analyzer/rust-analyzer/blob/71e94b1d0bf7c58aa377513f010fcb3f56081f5f/crates/ra_ide_db/src/symbol_index.rs#L113

matklad (May 17 2020 at 14:54, on Zulip):

It'd be good to expose it as analysis-stats/ analysis-bench, as it's a good measure of how fast we can parse files in parallel

matklad (May 17 2020 at 14:55, on Zulip):

I haven't looked at the rowan PR yet (hopefully do tat on monday), but I think it makes sense to remove the cache from the public API completely, and just use a global singleton, provided that the synchronization cost is not too bad

Christopher Durham (May 18 2020 at 16:50, on Zulip):

(now I'm wondering about trying to build a lru cache on a dashmap (sharded hashmap) instead of a plain hashmap and seeing how that behaves)

Christopher Durham (May 18 2020 at 16:51, on Zulip):

If this ends up positive, I'll at least need to consider what the global cache would look like for sorbus

Christopher Durham (May 18 2020 at 16:52, on Zulip):

One tricky thing I'm doing in sorbus right now is deduplicating based on subtree identity rather than subtree contents

Christopher Durham (May 18 2020 at 16:53, on Zulip):

This however becomes problematic (in terms of ideal deduplication) if small nodes exit the cache and thus any new parents of them stop getting deduplicated

Christopher Durham (May 18 2020 at 16:57, on Zulip):

There's still the comment in Rowan's cache to avoid re-hashing subtrees. That's what the identity-based cache in sorbus is doing

Christopher Durham (May 18 2020 at 16:58, on Zulip):

Plus "how many direct children does this node have" is a terrible heuristic for "how deep is this node"

Christopher Durham (May 18 2020 at 16:58, on Zulip):

Where the second is the more interesting one for caching

Christopher Durham (May 18 2020 at 16:59, on Zulip):

How deep a node is a decent heuristic for how likely it is to have the same node duplicated elsewhere in the compilation unit

Christopher Durham (May 18 2020 at 17:00, on Zulip):

(as the deeper a tree gets the more chances it has to diverge)

Christopher Durham (May 18 2020 at 17:05, on Zulip):

Ok, I thought of an interesting way to bound the cache compute cost while being resilient to some cache evictions

Christopher Durham (May 18 2020 at 17:05, on Zulip):

Basically, do the identity equivalence some number of levels down, rather than initially

Christopher Durham (May 18 2020 at 17:06, on Zulip):

The question, though, is if that even helps in any common cases

Christopher Durham (May 18 2020 at 17:07, on Zulip):

Seeing as if a shallow node is evicted from the cache, its ancestors are not unlikely to also be on the way out

Simon Vandel Sillesen (May 19 2020 at 10:20, on Zulip):

I haven't looked into how sorbus does identity of a tree, so correct me where I am wrong in the following. To avoid hashing the complete subtree of a node(all descendants of a node), would it be enough to hash the kind + text size + child1 ptr... Child2 ptr?

Simon Vandel Sillesen (May 19 2020 at 15:39, on Zulip):

Actually if this child pointers are enough to give identity to a node, then text size is unnecessary as that information is already contained in the children

Christopher Durham (May 19 2020 at 17:16, on Zulip):

Here it is in the code: https://github.com/CAD97/sorbus/blob/2b5fcd42808fb99f41170f575bddea9e0f873878/src/green/builder.rs#L30

Christopher Durham (May 19 2020 at 17:17, on Zulip):

But yes, for the builder cache I'm hashing the node as the rough equivalent of hash(kind), hash(text_len), for child in children ptr::hash(Arc::as_ptr(child))

Christopher Durham (May 19 2020 at 17:18, on Zulip):

It would be possible to skip hashing the text length for a small decrease in work, though, yes

Christopher Durham (May 19 2020 at 17:20, on Zulip):

(this is only for the builder; regular PartialEq/Hash does full tree equivalence)

Simon Vandel Sillesen (May 19 2020 at 18:25, on Zulip):

AFAICT sorbus doesn’t have the condition to only cache nodes with <= 3 children. How come? I would think would help only caching nodes with a high chance of being seen again

Christopher Durham (May 20 2020 at 16:49, on Zulip):

The reason is that (so far) I don't think that's a good heuristic for what we actually want

Christopher Durham (May 20 2020 at 16:49, on Zulip):

Which would be the number of (transitive) children(?)

Christopher Durham (May 20 2020 at 16:49, on Zulip):

What we want is a decent heuristic for whether the node is likely to have duplicates that we can dedupe

Christopher Durham (May 20 2020 at 16:50, on Zulip):

It was important for Rowan to use some heuristic here because it recursively rehashed the entire subtree

Christopher Durham (May 20 2020 at 16:51, on Zulip):

Whereas sorbus from the start didn't do that, instead using child identity for deduping

Christopher Durham (May 20 2020 at 16:52, on Zulip):

It's important to note that Rowan's "3 or fewer direct children" would still rehash subtrees even if they had more than 3 direct children

Christopher Durham (May 20 2020 at 16:52, on Zulip):

So, abusing big O, Rowan's cache was roughly O(n) and sorbus's was O(log n)

Christopher Durham (May 20 2020 at 16:53, on Zulip):

(where N is the number of nodes in the tree, and log n is asymptotically the number of direct children of any one node)

Christopher Durham (May 20 2020 at 16:55, on Zulip):

Rowan's just had a gate on the O(n) that said you had to have 3 or fewer direct children to be part of that n

Christopher Durham (May 20 2020 at 16:57, on Zulip):

Basically, the fact that I no longer had to revisit subtrees to do the cache let me just always do node creation through the cache. But with a strong argument that some O(1) heuristic is a good approximation for "does this node have duplicates", I'd be perfectly happy to re-add that heuristic to sorbus

Christopher Durham (May 20 2020 at 17:01, on Zulip):

I'd have to double-check how exactly r-a parses them, but my gut says that some nodes that could definitely be cached end up not being cached because of this heuristic, like #[rustfmt::skip]

Christopher Durham (May 20 2020 at 17:02, on Zulip):

If anything, I think textual length is probably a better heuristic than number of direct children

Christopher Durham (May 20 2020 at 17:03, on Zulip):

But also one that isn't available before iterating the children to sum their length, so that'd be an O(children) minimum heuristic

Christopher Durham (May 20 2020 at 17:07, on Zulip):

Plus @Simon Vandel Sillesen, there's another big hole I just realized in caching based on identity while also caching based on direct children length:

Your patch will now decrease the number of cached nodes, because nodes now can only be cached if their children were

Christopher Durham (May 20 2020 at 17:08, on Zulip):

Which actually was a key reason in sorbus caching everything now that I think about it

Simon Vandel Sillesen (May 20 2020 at 22:25, on Zulip):

@Christopher Durham I have a few clarifying questions.

I read https://github.com/rust-analyzer/rowan/blob/40fe55c450c88f4da7cf315bf027bec3250a7352/src/green/builder.rs#L29 as Rowan only hashing if the direct children are <= 3. What am I missing?

Regarding caching based on identity with children count heuristic. I feel like I’m missing something obvious - my mental model is not yet accurate of parsing and trees. Why can nodes only be cached if the children were? I think an example would help me greatly.

Finding a nice heuristic will however help reducing the number of hashes. I’ll benchmark no limitations on cache next week when I get to a computer.

Simon Vandel Sillesen (May 20 2020 at 22:43, on Zulip):

So if I have node A 2+2 that is a node with 3 children (all tokens). In another branch of the tree i now want to build the node B 2+2 node. This second node B would have the same identity hash as the first node A, right? If token + of node A went out of cache, node A would still be in the cache .... ah okay, I think I realize ... so when constructing the token + for node B it will not be found in the cache and will therefore get a different ptr. So the identity of node A and B will not be the same

Simon Vandel Sillesen (May 20 2020 at 22:56, on Zulip):

Idea to find heuristic: setup caching based on identity with no limitations. Instrument the hashmap access with cache miss/hit and eprtintln the miss/hit along with kind, Len of node and a print of each child. It would be interesting to see which of these, if any, are a good predictor of a cache hit. Run this data collection on representative code

Simon Vandel Sillesen (May 20 2020 at 22:57, on Zulip):

A good predictor of a cache miss would be equally useful as those can be used to avoid hashing

Simon Vandel Sillesen (May 21 2020 at 00:11, on Zulip):

Actually it should be caching based on subtree hashing. Else we get misleading information on cachability of a node

Christopher Durham (May 22 2020 at 18:56, on Zulip):

@Simon Vandel Sillesen (sorry for slow replies, I'm trying to do too many things at the same time right now :sweat_smile: )

Christopher Durham (May 22 2020 at 18:56, on Zulip):

First off, something I just realized:

Christopher Durham (May 22 2020 at 18:57, on Zulip):

For example, all #[inline] in this file share the same green node!

Christopher Durham (May 22 2020 at 18:57, on Zulip):

This is actually not true if we gate caching to nodes that have <= 3 direct children

Christopher Durham (May 22 2020 at 18:57, on Zulip):
    ATTR@25..34
      POUND@25..26 "#"
      L_BRACK@26..27 "["
      PATH@27..33
        PATH_SEGMENT@27..33
          NAME_REF@27..33
            IDENT@27..33 "inline"
      R_BRACK@33..34 "]"
Christopher Durham (May 22 2020 at 18:58, on Zulip):

As #[inline] is a node with four direct children

Christopher Durham (May 22 2020 at 19:01, on Zulip):

If you do something like sorbus's cache where _every_ node is built with the cache and persisted in the cache, you know you have perfect deduplication and can still get perfect deduplication via identity

Christopher Durham (May 27 2020 at 02:48, on Zulip):

I just keep having more and more cursed ideas for how to optimize this.

In sorbus, we're storing the node's children's textual offset from the parent in the children array, so that going from &green::Node, (kind, offset) to &green::Node is ~O(log n) rather than O(n)

Christopher Durham (May 27 2020 at 02:48, on Zulip):

The first node is obviously at offset 0, though, so that's a wasted u32!

Christopher Durham (May 27 2020 at 02:50, on Zulip):

So we could "just" stick a depth counter there and have our O(1) depth check

Christopher Durham (May 27 2020 at 02:50, on Zulip):

(Should we? Probably not, it'd add a very annoying edge case to child iteration.)

Christopher Durham (May 27 2020 at 03:03, on Zulip):

If we care a little less about performance of a cache and a little more about proper caching while retaining a bounded size, then LFUDA or GDS caches sound enticing, especially for a shared cache

Christopher Durham (May 27 2020 at 03:08, on Zulip):

Other than stealing the first child's offset as space for storing depth, we could also just add another u32 to Node and flip the parity of the children array

Simon Vandel Sillesen (May 27 2020 at 14:40, on Zulip):

Shrinking node size would be a way to reduce the work that needs to be done on memory allocation. However a 4 byte decrease like your example might actually not be that significant. I guess it will depend on the memory allocator, and its bucket sizes.

An orthogonal, and I think it might be more fruitful, idea would be to add arena allocation of nodes. If er can add an api to basically batch allocate the complete tree, that would work well for crates.io dependencies that don't change.

Simon Vandel Sillesen (May 27 2020 at 14:41, on Zulip):

I'm currently working on a patch for rust-analyzer to reduce the amount of events allocated in the parser. I don't have any numbers yet, but i believe it should be helpful in reducing memory usage of parsing :+1:

Simon Vandel Sillesen (May 27 2020 at 14:55, on Zulip):

@Christopher Durham i just read the article on dynamics cache aging. It seems like something that could be useful in case of a global cache. A good cache policy could definitely improve performance a bunch

matklad (May 27 2020 at 14:57, on Zulip):

Hey, it seems like I am not able to follow this work to closely. I hope to catch up with it once things calm down (hehe). If you need something from me to unblock experimentation (publishing crates or what not), feel free to agressively ping :)

Simon Vandel Sillesen (May 27 2020 at 15:13, on Zulip):

No blockings on my side :slight_smile:

pksunkara (May 27 2020 at 16:10, on Zulip):

I am working on publishing crates. Chalk should start publishing soon and then I can move on to rust-analyzer

Christopher Durham (May 28 2020 at 00:07, on Zulip):

(I'm just dumping this cursed idea here so it's recorded, but I doubt it's actually a good idea:
It's theoretically possible to have both thread-local shared builder caches and sharing between threads. "Just" have a cache miss look at the other thread's caches before making a new node and inserting it to the local cache.)

Christopher Durham (May 28 2020 at 00:10, on Zulip):

@Simon Vandel Sillesen

arena allocation

Out of curiousity, do you know if sorbus's FAM-based nodes are inherently problematic for an arena?

Christopher Durham (May 28 2020 at 00:11, on Zulip):

From a theoretical point of view, arena-allocating nodes, at least for dependencies, makes sense

Christopher Durham (May 28 2020 at 00:11, on Zulip):

From a convenience point of view, owning trees makes everything _so much easier_.

Simon Vandel Sillesen (May 28 2020 at 08:19, on Zulip):

@Christopher Durham i believe roslyn uses this concept of a L1 and L2 cache system. Probably a good thing to write on that list of things to experiment with

Simon Vandel Sillesen (May 28 2020 at 08:21, on Zulip):

@Christopher Durham does Fam-based nodes mean nodes in a family tree? As in parents point to children, but not the opposite? I don't know on the top of my head, will need to research or think some time

Simon Vandel Sillesen (May 28 2020 at 08:24, on Zulip):

I think the api and ownership semantics would need to be different for arena-trees. As in ownership is only possible at the root level of a tree (the complete syntax tree)

Christopher Durham (May 28 2020 at 16:52, on Zulip):

FAM = Flexible Array Member

Christopher Durham (May 28 2020 at 16:54, on Zulip):

(wait, I did actually get Rowan using an older version of my FAM infra)

Christopher Durham (May 28 2020 at 16:55, on Zulip):

Basically,
Without FAM Node = { kind, len, childs: Vec<Child> }
With FAM Node = { kind, len, childs: [Child] }

Christopher Durham (May 28 2020 at 16:56, on Zulip):

It means Node (well, the heap part) is dynamically sized. This means a simple typed arena doesn't work.

Christopher Durham (May 28 2020 at 17:44, on Zulip):

By the way, AsRef<[T]> for vec::Drain and vec::IntoIter (the API required for no-alloc cache hits) are FCP-merge. Once they're in nightly, I can make a PR to Rowan to use them; I already have a proof-of-concept patch for sorbus (as well as a slightly less POC draft locally).

Simon Vandel Sillesen (May 28 2020 at 18:29, on Zulip):

@Christopher Durham

no-alloc cache hits

sounds good! Unfortunately rust analyzer is stable only, right?

FAM

Ah okay, I see. So Node is not Sized, right?
In any case, we could have different sized bins ala https://citybound.github.io/citybound/chunky/struct.MultiArena.html

Christopher Durham (May 28 2020 at 22:03, on Zulip):

Yeah, r-a compiles on stable and it'd be nice to keep it that way. The API in question is be stabilized by the PRs, though, so it'll probably be available in 1.46 (1.45 if it sneaks in before the beta cutoff next week)

Christopher Durham (May 28 2020 at 22:03, on Zulip):

https://github.com/rust-lang/rust/pull/72584, https://github.com/rust-lang/rust/pull/72584

Christopher Durham (May 28 2020 at 22:04, on Zulip):

vec::Drain is the more important one, as that's how the TreeBuilder works.

Christopher Durham (Jun 17 2020 at 01:45, on Zulip):

The necessary API for no-alloc cache hits are riding the trains to stable

Christopher Durham (Jun 17 2020 at 01:46, on Zulip):

I've done the requisite patch in sorbus: https://github.com/CAD97/sorbus/pull/45

Christopher Durham (Jun 17 2020 at 01:46, on Zulip):

If nobody has done the same for rowan by the time 1.46 hits stable I'll do the same for rowan

Christopher Durham (Jun 17 2020 at 01:47, on Zulip):

The weak_as_ptr feature is also stabilizing in 1.46

Christopher Durham (Jun 17 2020 at 01:47, on Zulip):

And that may potentially have an answer for the cache growth!

Christopher Durham (Jun 17 2020 at 01:50, on Zulip):

"Just" store weak pointers in the cache

Christopher Durham (Jun 17 2020 at 01:50, on Zulip):

And expose a .gc(&mut self) to walk the cache and clear out any dangling pointers

Christopher Durham (Jun 17 2020 at 01:53, on Zulip):

Alternatively: be even more clever, and expose the .gc(&mut self) that just purges any cached element with strong_count() == 1

Christopher Durham (Jun 17 2020 at 18:09, on Zulip):

https://github.com/CAD97/sorbus/pull/46 is an implementation in sorbus of a cache GC.

Christopher Durham (Jun 18 2020 at 19:14, on Zulip):

upstream hashbrown prereqs merged, so we have GC for the sorbus caches exposed now.
I went to port this to rowan only to be foiled by myself; thin_dst::ThinArc doesn't expose any way to get at Arc::strong_count.

Christopher Durham (Jun 18 2020 at 19:14, on Zulip):

I guess I need to look into actually updating rowan from using thin_dst to slice_dst and erasable then...

Christopher Durham (Jun 19 2020 at 02:14, on Zulip):

Simple cache gc support for rowan: https://github.com/rust-analyzer/rowan/pull/63

Christopher Durham (Jun 21 2020 at 05:23, on Zulip):

I'm doing evil things to hash tables

Christopher Durham (Jun 21 2020 at 05:23, on Zulip):

The current cache GC is roughly O(n log n)

Christopher Durham (Jun 21 2020 at 05:24, on Zulip):

That's why sorbus exposes a O(n) "GC turn" that only collects roughly one and a half "layers" of a freeable tree

Christopher Durham (Jun 21 2020 at 05:24, on Zulip):

But this is a way of getting a complete collection in just one pass

Christopher Durham (Jun 21 2020 at 05:26, on Zulip):

(m is the number of cached elements, n is the number of freeable elements)

Christopher Durham (Jun 21 2020 at 05:26, on Zulip):

(which isn't quite O(m), but should be O(m+n) (which... is O(m) because asymptotes and n⊂m))

Christopher Durham (Jun 21 2020 at 05:27, on Zulip):

But the evil part is it works via concurrent modification and iteration of the hash table

Christopher Durham (Jun 21 2020 at 05:27, on Zulip):

I think I found a way for it to actually be sound

Christopher Durham (Jun 21 2020 at 05:28, on Zulip):

But I'm asking hashbrown how abusive I'm allowed to be to their hash tables to be safe.

matklad (Aug 14 2020 at 21:52, on Zulip):

I am warming up to Roslyn/libsyntax idea of storing trivia in tokens more!

The idea is that, instead of storing comments, white space and skipped tokens as nodes, we store them as leading/trailing fields of tokens.

On the first sight, this seems like a royally bad idea, because most of the tree are tokens, and we are growing each token by two pointers.

However, I think we’ll actually save on memory here. Most of the tokens are interned (even if we take trivia into account), so +two pointers is not such a big deal. OTOH, we now store twice as few pointers in internal nodes, which would seem like a bigger win.

Christopher Durham (Aug 14 2020 at 23:13, on Zulip):

Hmm

Christopher Durham (Aug 14 2020 at 23:13, on Zulip):

That definitely does seem plausible

Christopher Durham (Aug 14 2020 at 23:13, on Zulip):

The number of unique tokens would go up, ofc

Christopher Durham (Aug 14 2020 at 23:14, on Zulip):

(as not every token with the same text would store the same lead/tail trivia)

Christopher Durham (Aug 14 2020 at 23:15, on Zulip):

But it does seem like it could reduce the number of pointers to any given trivia node

Mr Smeet (Aug 15 2020 at 14:23, on Zulip):

It seems to me that before making any decisions regarding changes, we should develop a benchmark. Because maybe we will win some memory but lose execution speed or that don't give any benefits

Last update: Sep 30 2020 at 16:15UTC