Stream: t-compiler/const-eval

Topic: copying undef optimization #58556


pnkfelix (Mar 12 2019 at 12:57, on Zulip):

hi @oli

pnkfelix (Mar 12 2019 at 12:58, on Zulip):

I wanted to double check something with you about this PR

pnkfelix (Mar 12 2019 at 12:58, on Zulip):

namely I wanted to understand the purpose of making the so-called "precomputed cache" here

pnkfelix (Mar 12 2019 at 12:58, on Zulip):

which I think is an attempt to make a compressed form of the input bitvector

pnkfelix (Mar 12 2019 at 12:59, on Zulip):

and then, if the result of the compression is actually trivial (i.e. we found out it was all zeros or all one bits), we sidestep the copy entirely

pnkfelix (Mar 12 2019 at 12:59, on Zulip):

and otherwise, I think we subsequently "decompress" the compressed form into the target buffer

pnkfelix (Mar 12 2019 at 13:01, on Zulip):

If the above interpretation is correct, then the main (and only?) benefit I can imagine is that you can fit more of the input buffer into your processor cache during the reading, and likewise during the subsequent writing

pnkfelix (Mar 12 2019 at 13:01, on Zulip):

rather than trying to dedicate half of your cache and the other half of the cache to the other if you were to do a more direct copy

pnkfelix (Mar 12 2019 at 13:02, on Zulip):

(this is all assuming that one could just have added a prepass to the original code just look if its all zeros or all ones, and skip the copy if that is the case.)

pnkfelix (Mar 12 2019 at 13:02, on Zulip):

(of course the cost/benefit of such a second pass is hard for me to determine)

oli (Mar 12 2019 at 13:35, on Zulip):

It's both a processor cache thing, and a way to get around the problem, that iterating over bits and setting them one by one in a big integer is super slow

oli (Mar 12 2019 at 13:36, on Zulip):

the set_range_inbounds method is optimized to do two bitshifts and a bitmask instead of iterating over all the bits

oli (Mar 12 2019 at 13:49, on Zulip):

So the reason I am not just computing this while copying is that we are copying a single value repeat times. This means that we'd be recomputing them repeat times if we just did it while copying

oli (Mar 12 2019 at 13:50, on Zulip):

The cache only computes the set once and then reuses it for all duplicates (or if the entire thing is homogenous, just does a single operation).

oli (Mar 12 2019 at 13:50, on Zulip):

We could even go as far as figuring out that as long as the sequence length is a divisor of 64, we can just fill in a u64 once and copy that one

pnkfelix (Mar 12 2019 at 14:29, on Zulip):

I guess my point is, I can imagine some input buffers where the "compressed form" is not efficient at all

pnkfelix (Mar 12 2019 at 14:30, on Zulip):

i.e. if you had an input buffer 01010101... then you almost certainly would have been better off with the original code (right?)

oli (Mar 12 2019 at 14:31, on Zulip):

oh yea

oli (Mar 12 2019 at 14:31, on Zulip):

hmm

pnkfelix (Mar 12 2019 at 14:31, on Zulip):

but I'm also willing to believe that in practice, you are seeing long ranges with the same value (0 or 1) repeated, and thus the run-length encoded ranges will be very compact and very fast to decompress into each of the dest_allocations

oli (Mar 12 2019 at 14:32, on Zulip):

we could detect such cases and just resort to the original code?

pnkfelix (Mar 12 2019 at 14:32, on Zulip):

or at least, I am assuming that you saw long ranges with the same value arise in practice, right?

oli (Mar 12 2019 at 14:32, on Zulip):

oh yea, mostly it's all bits defined

pnkfelix (Mar 12 2019 at 14:32, on Zulip):

okay. Then I'll just r+ this. :smiley:

oli (Mar 12 2019 at 14:32, on Zulip):

hehe ok

oli (Mar 12 2019 at 14:33, on Zulip):

I can revisit for more perf improvements if new cases come up

pnkfelix (Mar 12 2019 at 14:33, on Zulip):

I will admit that I overlooked the fact that you are repetitively writing the output repeat times

pnkfelix (Mar 12 2019 at 14:34, on Zulip):

I only noticed a single copy and so was not convinced it was worth making the run-length encoded form. But if you are generating multiple copies, I can believe it ends up paying off.

pnkfelix (Mar 12 2019 at 14:39, on Zulip):

(Actually, it would probably be really easy to add logic saying "if ranges is larger than undef_mask, then just use undef_mask as the source rather than spending time decompressing ranges, right?)

pnkfelix (Mar 12 2019 at 14:39, on Zulip):

Just something to consider. It doesn't need to be part of this PR

pnkfelix (Mar 12 2019 at 14:40, on Zulip):

(I'm basically just repeating what you said above when you said "we could detect such cases and just resort to the original code"; i just hadn't realized how trivially we might detect such cases.)

RalfJ (Mar 12 2019 at 20:02, on Zulip):

oh yea, mostly it's all bits defined

well, except for padding

Jake Goulding (Mar 12 2019 at 20:13, on Zulip):

What's a little padding between friends?

Last update: Nov 15 2019 at 20:10UTC