Stream: t-cargo/PubGrub

Topic: 0.2.x api is a Selection sort


Eh2406 (May 08 2021 at 17:47):

I was continuing the experiments with the Cargo->PubGrub script. It is a lot faster with the proposed 0.2.1 release, but it has still been running for over an hour, giving me lots of time to think. The profiling data claims that we are dominated by choose_package_version. Staring at my implementation, wondering if an allocation here or there was the problem, I realized that our API requires us to do a selection sort on the potential_packages. Before each decision we need to call choose_package_version to tell us which package to pick, and each call needs to look at all potential_packages. So the "look at" code is called ~ (#decisions ^ 2) / 2 times. I'd like to consider changing this for 0.3. What are your thoughts?
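
To make the shape of the problem concrete, here is a minimal self-contained sketch (hypothetical names, not pubgrub's real internals) of why this behaves like a selection sort:

```rust
// Every decision re-scans all remaining potential packages to find the one
// with the fewest candidate versions, so the "look at" work is
// 1 + 2 + ... + n ≈ n^2 / 2 for n decisions.
struct Pkg {
    name: &'static str,
    candidate_versions: usize,
}

fn main() {
    let mut potential = vec![
        Pkg { name: "a", candidate_versions: 3 },
        Pkg { name: "b", candidate_versions: 1 },
        Pkg { name: "c", candidate_versions: 2 },
    ];
    let mut look_at_calls = 0;

    while !potential.is_empty() {
        // Analogue of choose_package_version: scan everything that is still
        // undecided and pick the package with the fewest candidate versions.
        let best_idx = potential
            .iter()
            .enumerate()
            .inspect(|_| look_at_calls += 1)
            .min_by_key(|(_, p)| p.candidate_versions)
            .map(|(i, _)| i)
            .unwrap();
        let decided = potential.swap_remove(best_idx);
        println!("decided {}", decided.name);
    }
    println!("look-at calls: {look_at_calls}"); // 3 + 2 + 1 = 6 here
}
```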

Matthieu Pizenberg (May 10 2021 at 14:42):

I have not thought too much about that yet, and I'm not sure what you mean by "changing this". In theory, we don't need to look at all potential packages; we could simply pick the first one. But as you explained in the guide, picking the one with the lowest number of available versions is what requires looking at each potential package, yet it also provides a huge perf gain.

Matthieu Pizenberg (May 10 2021 at 14:47):

Once a new package is selected, all of its dependencies that have not been selected yet become "potential packages". This also means that a better heuristic might take into account not only the number of potential versions but also the number of dependencies a package has when taking a decision. That would lower the number of potential_packages for all the following decisions. Of course, the number of dependencies may change depending on the version, but a simple heuristic could just consider the version most likely to be chosen (newest / oldest). Maybe the tradeoff between more computation per decision (needing to check the number of dependencies of each package) and fewer accumulated "potential packages" during the search is worth it.

Matthieu Pizenberg (May 10 2021 at 14:51):

I'm not following where the "^2" is coming from. I'd say the "look at" code is called roughly #decisions * mean#potential_packages

Eh2406 (May 10 2021 at 16:17):

Each potential_package will become a decision in some later iteration, so mean#potential_packages ~= #decisions / 2, and summing over all decisions gives the ~ (#decisions ^ 2) / 2.
As to "changing this": I do want to keep the choice up to the user, and I have not figured it all out yet. I am thinking about having a priority queue, where the user provides a method prioritize<Priority: Ord>(package: P, range: Range<V>) -> (Priority, Option<V>). When we add a potential_package or change the range for a package, we call prioritize and put it in the queue. When we need to make a decision, we pop from the queue.

Eh2406 (May 16 2021 at 01:40):

As a proof of concept: https://github.com/pubgrub-rs/pubgrub/tree/priority-queue
It looks like (as implemented) it scales better, but does more hashes of P.
So elm is ~5% slower, as its cases are simple and str hashing is not free.
large_case is unchanged, as it is simple and u16 hashing is basically free.
zuse is ~21% faster, as it has more complicated cases.

Eh2406 (May 20 2021 at 19:13):

I built a stand-alone file based on the hobo crate. When I tried to benchmark it with the 0.2.1 candidate, I got:

Benchmarking large_cases/hobo_str_SemanticVersion.ron: Collecting 100 samples in estimated 8161.2 s

When I ran it with this branch, I got:

Benchmarking large_cases/hobo_str_SemanticVersion.ron: Collecting 100 samples in estimated 712.05 s

So there exist real inputs for which this is a ~91% improvement.

Matthieu Pizenberg (May 21 2021 at 15:59):

8161.2s

That must have been quite painful! ^^

Matthieu Pizenberg (May 21 2021 at 16:03):

If you have time, could you also try the rather simple heuristic change of choosing the next package with the lowest number_of_versions + number_of_new_dependencies or number_of_versions * (1 + number_of_new_dependencies)?
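
For illustration, a small sketch of those two scores (hypothetical helper names; lower score means picked first, and number_of_new_dependencies is counted for the version that would be chosen):

```rust
// Two candidate scores for ordering potential packages; lower is better.
fn score_additive(number_of_versions: usize, number_of_new_dependencies: usize) -> usize {
    number_of_versions + number_of_new_dependencies
}

fn score_multiplicative(number_of_versions: usize, number_of_new_dependencies: usize) -> usize {
    number_of_versions * (1 + number_of_new_dependencies)
}

fn main() {
    // (package, number_of_versions, number_of_new_dependencies)
    let potential = [("a", 1, 5), ("b", 4, 1), ("c", 3, 3)];

    let pick_add = potential
        .iter()
        .min_by_key(|(_, v, d)| score_additive(*v, *d))
        .unwrap();
    let pick_mul = potential
        .iter()
        .min_by_key(|(_, v, d)| score_multiplicative(*v, *d))
        .unwrap();

    // The two scores can prefer different packages: "b" for the additive
    // score (4 + 1 = 5) and "a" for the multiplicative one (1 * 6 = 6).
    println!("additive picks {}, multiplicative picks {}", pick_add.0, pick_mul.0);
}
```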

Matthieu Pizenberg (May 21 2021 at 16:05):

I have a feeling that this simple heuristic change could result in dramatic improvements in worst-case scenarios because it has the following effects:

Matthieu Pizenberg (May 21 2021 at 16:09):

number_of_versions + number_of_new_dependencies

where number_of_new_dependencies would be the number of new dependencies of the version that would be chosen if that package were picked. This implies that we would most likely want to cache that information (the dependencies of that package at that version) to avoid re-fetching it when the package is eventually picked.
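
A minimal sketch of that caching idea (hypothetical types; fetch_dependencies stands in for whatever actually retrieves dependency metadata):

```rust
use std::collections::HashMap;

type Package = String;
type Version = u32;

// Remember the dependency list fetched while scoring a (package, version),
// so it is not fetched again when that package is eventually picked.
struct DepCache {
    deps: HashMap<(Package, Version), Vec<Package>>,
}

impl DepCache {
    fn new() -> Self {
        DepCache { deps: HashMap::new() }
    }

    fn dependencies_of(&mut self, package: &Package, version: Version) -> &[Package] {
        self.deps
            .entry((package.clone(), version))
            .or_insert_with(|| fetch_dependencies(package, version))
    }
}

// Stand-in for whatever actually retrieves dependency metadata
// (registry index, lock file, network, ...).
fn fetch_dependencies(_package: &Package, _version: Version) -> Vec<Package> {
    vec!["some-dep".to_string()]
}

fn main() {
    let mut cache = DepCache::new();
    let pkg = "a".to_string();
    // The first call fetches; the second is a cache hit.
    let first = cache.dependencies_of(&pkg, 1).to_vec();
    let second = cache.dependencies_of(&pkg, 1).to_vec();
    assert_eq!(first, second);
}
```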

Eh2406 (May 21 2021 at 19:16):

8161.2s

To be clear, I did not actually let the benchmark run for either of them. I thought the runtime estimate was more than enough to demonstrate the point. :-P

Eh2406 (May 21 2021 at 19:20):

I will be happy to try different heuristics! I would love to find a better one! But even if it is better, I don't think it will change the improvement that comes from evaluating the heuristic less often.

Matthieu Pizenberg (May 21 2021 at 19:21):

Indeed those are two orthogonal things

