Stream: t-cargo/PubGrub

Topic: crates.io index


view this post on Zulip Matthieu Pizenberg (Nov 20 2020 at 08:57):

Just had a quick look at zuse index, what are the meaning of the differents parts here for example?

cfg-if-wasm-bindgen-Minor(2)->=0.1.9, <0.2.0/default
Links:openssl

view this post on Zulip Matthieu Pizenberg (Nov 20 2020 at 10:03):

I think the way I'd go about dealing with crates, is I'd not use OfflineDependencyProvider with all cargo features embedded as data (encoded in package names that are all stored). I'd rather create a new OfflineCrateProvider similar to OfflineDependencyProvider but with cargo features embedded as logic.

I don't know how you deal with everything but concretely, for features for example I'd do something like this:

type Index = Map<Crate, Map<V, CrateDeps>>;

type CrateDeps = Map<Feature, Deps>;
type Feature = String;
type Deps = Map<Crate, Dep>;
type Crate = String;
type Dep = (Range<V>, Set<Feature>);

// the id P would be a string uniquely identifying
// the crate with its activated features.
to_crate_id : (Crate, Set<Feature>) -> P;
from_crate_id : P -> (Crate, Set<Feature>);

// internally uses from_crate_id and to_crate_id.
get_dependencies : (P, V) -> [(P, Range<V>)]

view this post on Zulip Eh2406 (Nov 20 2020 at 10:53):

I do have a Dependency provider that is used to generate this. I look forward to polishing it up so I can share it and we can find ways to improve it. I don't want to have this several hundred line program in our test suite. One of the things it can do is turn each type safe P into a unique string. That lets our test suite only need to know how to read OfflineDependencyProvider.

view this post on Zulip Eh2406 (Nov 20 2020 at 10:54):

(and no I don't know what I am doing up at this time.)

view this post on Zulip Eh2406 (Nov 20 2020 at 10:55):

Links:openssl There can only be one thing that links to a clib, or the linker gets mad, so Links:openssl means that that crate will link to openssl and so nothing else can.

view this post on Zulip Eh2406 (Nov 20 2020 at 11:08):

cfg-if-wasm-bindgen-Minor(2)->=0.1.9, <0.2.0/default this is a lot to unpack.
/default means that whatever we select we need to enable the default set of features.
bindgen-Minor(2) means a the crate bindgen but only if it is 0.2.x. We need that info in the package name as we can also have a 0.1.x and a 1.x.y in the same resolution.
cfg-if ... ->=0.1.9, <0.2.0 means that we depend on cfg-ifin the range >=0.1.9, <0.2.0 in this case that is the same as cfg-if-Minor(1), but if it was the range >=0.1.9, <0.3.0then it would let us pick between cfg-if-Minor(1) and cfg-if-Minor(2). We can have different versions selected for different parents.
So all together it is:
bindgen-Minor(2) depends on cfg-if ... ->=0.1.9, <0.2.0 (even if that spans incompatible versions) and needs the default set of features.

view this post on Zulip Matthieu Pizenberg (Nov 20 2020 at 13:29):

I must admit I am lost ahah

view this post on Zulip Alex Tokarev (Nov 20 2020 at 13:42):

That's because all the things we haven't implemented for Cargo specifically are encoded in this format

view this post on Zulip Alex Tokarev (Nov 20 2020 at 13:43):

Multiple major versions are allowed, features can be enabled, and "links" attributes are in there as well

view this post on Zulip Alex Tokarev (Nov 20 2020 at 13:45):

0.2 has additional dependency on wasm-bindgen, while 0.1 does not, so it's encoded in the name too

view this post on Zulip Eh2406 (Nov 20 2020 at 14:18):

So 3 lessons from this try:

view this post on Zulip Eh2406 (Nov 20 2020 at 14:37):

Let's say for argument purposes my current code is:

It is still an interesting benchmark on it own with a very high Time/MB, and a realistic example in that it code that someone new to using PubGrub actually wrote.

view this post on Zulip Eh2406 (Nov 20 2020 at 14:39):

I.E. we have to goles 1 figure out what is needed for Cargo, 2 benchmark PubGrub.
It still can work for 2 even if it is a WIP for 1.

view this post on Zulip Matthieu Pizenberg (Nov 20 2020 at 15:16):

"It depends". For example, the ratio of valid / invalid dependencies is a lot lower for one thing.

solutions: 1909
errors: 279

And the code that builds the derivation tree has not been optimized at all, so that may be part of the problem.

Another thing to factor is is that if 1 package is very long to build, it's usually also because it's dependencies are very long to build. So when building a benchmark that iterates over all packages of a single very long to build package, it's normal to end up with hight time/mb

view this post on Zulip Matthieu Pizenberg (Nov 20 2020 at 16:01):

And the code that builds the derivation tree has not been optimized at all, so that may be part of the problem

I'll agree though that it's a good benchmark to optimize the code building the data structure used for error reporting (currently called DerivationTree) :)

view this post on Zulip Eh2406 (Nov 20 2020 at 16:29):

Another thing to factor is is that if 1 package is very long to build, it's usually also because it's dependencies are very long to build.

pubgrub#16 links to https://github.com/rust-lang/cargo/issues/8539#issuecomment-671074837 witch is a case (with -Zminimal-versions) where either of self_update = "*" and solana-metrics = "*" solve instantly but both takes over 50 min. So the really interesting slow cases do not have the shape you described, but most slow cases do have that shape. And your point still stands, "build all" has O(n^2) even without pathologies.

view this post on Zulip Matthieu Pizenberg (Nov 20 2020 at 17:16):

@Eh2406 I just tried using crates_index and cant figure out how to make it work. The example in the docs (https://docs.rs/crates-index/0.16.2/crates_index/index.html) gives an empty iterator it seems. Any idea of what might go wrong?

view this post on Zulip Matthieu Pizenberg (Nov 20 2020 at 17:23):

Ok actually, the problem seems to be linked to new_cargo_default(), if I use this, it does not figure out how to load the index used by cargo itself. Weird. If I change the path by using something like new("_index") it works.

view this post on Zulip Matthieu Pizenberg (Nov 20 2020 at 18:10):

Are "build" dependencies of your dependencies necessary to cargo build your own crate?

view this post on Zulip Eh2406 (Nov 20 2020 at 19:39):

"build" dependencies are "dev" dependencies are not. Sorry I have not had time to work on my script today. I will hopefully get it in shape to share tomorrow.

view this post on Zulip Matthieu Pizenberg (Nov 20 2020 at 22:15):

no worry I was just trying few things and wanted to understand those details. I've never used build dependencies in rust yet

view this post on Zulip Matthieu Pizenberg (Nov 21 2020 at 11:20):

@Eh2406 to parse the versions requirements did you use the semver crate? It does not seem possible to extract the different segments of a VersionReq with the public API. Did you copy-pasted the code and added conversion functions to range? (or maybe just rolled you own parser)
I see there can be multiple requirements separated by a comma in the dependencies of cargo toml. I've even seen one with 2 commas:

<0.3, >=0.1, >=0.2.0-beta.0

view this post on Zulip Eh2406 (Nov 21 2020 at 20:23):

Lost some time to a headache today. I will follow up with this when I am feeling better.
Ya the semver crate is a odd beest. What I did was keep a list (semver::Version, pubgrub::Semver). Then I can filter using VersionReq.matches(Version). Then I can make a Range::between(min, max). It is a hack but it works.

view this post on Zulip Matthieu Pizenberg (Nov 21 2020 at 20:25):

Don't worry about it, I did something forking the crate. Currently WIP but I'll say more soon

view this post on Zulip Eh2406 (Nov 21 2020 at 20:26):

I look forward to comparing our aproteches!

view this post on Zulip Matthieu Pizenberg (Nov 22 2020 at 01:01):

Alright, I've pushed my work to the "crates" branch. You should read the example examples/crates/main.rs.
First comment the line written // THEN DO THIS, and uncomment the line // DO THIS FIRST and run

cargo run --release --features=serde --example crates

It will create the ron file in temp/index.ron.
This files is the serialization of my adaptation of the index, which store features in a rather efficient way.
Then you can toggle the commenting you just did on examples/crates/main.rs and re-run the examples. It will try to solve dependencies of all packages (with only mandatory dependencies). At the time of downloading the index for me, there was 51673 crates, 313123 unique versions, 301536 not yanked ones. From this, I build my index with only crates that do not specify specific targets (like windows and such) and only normal versions (not pre-release) so the index is down to 280906 versions.

From that index of 280906 versions, I solve dependencies of all of these with mandatory dependencies. The resolution was successful for 216794 of those 280906 versions and took 4 minutes on my computer.

view this post on Zulip Eh2406 (Nov 22 2020 at 03:22):

This is a great start! Thank you for sharing! Eventually I think this should probably live in https://github.com/pubgrub-rs/advanced_dependency_providers. But we can figure that out later. I have three thoughts/things to work on next:

  1. Cargo's resolver ( for now ) always includes target-specific dependencies, that field gets used later in the pipeline. So it is more accurate and less code to keep them in. On the other hand dev dependencies can be completely ignored, there is some debate but it may just be a bug that we add them to the index.
  2. I think having features trigger there optional dependencies, is going to make solving the ron file significantly more complicated.
  3. Another source of complexity is one-per-mager. let's say we have:
a:
  1.0
 1.1
 2.0
 2.1
b:
  1.0 -> a="<=1.1"
c:
  1.0 -> a=">=2.1"
d:
  1.0 -> a=">=1.1,<=2.0"
e:
  1.0 -> b="*" c="*" d="*"

Cargo thinks the set {e:1.0, d:1.0, c:1.0,b:1.0, a:1.1, a:2.1 } is a valid solution, I don't think this code handels that yet.
Making sure that that 2 and 3 work well together adds yet more complexity.
But this is a great start and definitely better organized then my code.

view this post on Zulip Matthieu Pizenberg (Nov 22 2020 at 11:37):

I think having features trigger there optional dependencies, is going to make solving the ron file significantly more complicated

I don't think so. With that design everything is ready already to include features. There is just two places to modify. First in complete the featured_from function like mandatory_from (https://github.com/pubgrub-rs/pubgrub/blob/crates/examples/crates/index.rs#L96). Second is extending the dependencies returned by get_dependencies here https://github.com/pubgrub-rs/pubgrub/blob/crates/examples/crates/index.rs#L207. Both are rather straightforward but I did not need it to make this PoC.

It may take a bit longer and the ron file might be a bit bigger but I don't think it will change a lot.

view this post on Zulip Matthieu Pizenberg (Nov 22 2020 at 11:39):

Another source of complexity is one-per-mager

That was indeed not included in this PoC. I think it should be studied on its own to find a good solution, in isolation to all other additional complexity like features and such.

view this post on Zulip Eh2406 (Nov 22 2020 at 15:14):

Looking again in the morning is there something that makes sure that a:feat1 and a:feat2 get the same version?

view this post on Zulip Eh2406 (Nov 22 2020 at 15:19):

Most of the complexity of cfg-if-wasm-bindgen-Minor(2)->=0.1.9, <0.2.0/default comes from the intersection of one-per-mager and feature related dependencies. There are a number of good encodings of features. There is at least one good encoding of one-per-mager. But when you put them together there is more complexity than the sum of the parts.

view this post on Zulip Matthieu Pizenberg (Nov 22 2020 at 15:39):

Looking again in the morning is there something that makes sure that a:feat1 and a:feat2 get the same version?

You're right I didn't encode this. I think you mentioned a possible elegant solution to this in our past discussions. get_dependencies for a:feat1 at version v1 could return the feat1 dependencies of v1 plus a dependency to a raw a at v1. In turn get_dependencies for a would return mandatory dependencies of a. As such every feature such as a:featN @ v would have an exact dependency to a @ v. This will force selection of the same version.

view this post on Zulip Eh2406 (Nov 22 2020 at 15:41):

Sounds like a plan, just wanted to be sure your encoding did not mean there was a more elegant solution that I had missed when reading the code.

view this post on Zulip Matthieu Pizenberg (Nov 22 2020 at 15:42):

when you put them together there is more complexity than the sum of the parts

Right again. I still think it would be very useful to find a clean solution for the "multiple-versions" problem first, before trying to find a solution to the combined problem. At least because it's nice problem on its own. It also maybe corresponds to the behavior of some registries (NPM works like that right?)

view this post on Zulip Matthieu Pizenberg (Nov 22 2020 at 15:44):

I'm trying to put on paper some thoughts, will see if something useful come out of this

view this post on Zulip Matthieu Pizenberg (Nov 22 2020 at 15:46):

Question related to the "multiple versions" case. If a depends on d1, b depends on d2 and c depends on d1 or d2, do you know what is done for c?

view this post on Zulip Eh2406 (Nov 22 2020 at 15:48):

It will get the highest one that works with the other requirements. See my example from last night.

view this post on Zulip Eh2406 (Nov 22 2020 at 15:54):

One solution to "multiple versions" on its own. (The one I have been working with.)
So a semver req like >=2.0, < 4.0 can be expressed as Range::between(2.0, 3.0) Or Range::between(3.0, 4.0). We already have a way to express Or. Or is just multiple versions of a package. So we change req like >=2.0, < 4.0into a new package with two versions one depends on Range::between(2.0, 3.0) and the other depends on Range::between(3.0, 4.0).

view this post on Zulip Eh2406 (Nov 22 2020 at 15:58):

So what do we do with >=2.0, < 999.0are we going to make a package with 997 versions? We could but it wouldn't be efficient. What we can do is look at the versions that are actually available and only add the relevant versions.

view this post on Zulip Matthieu Pizenberg (Nov 22 2020 at 16:02):

So if I call [2, 3[ or [3, 4[ a "bucket" ([0.1, 0.2[ would also be a bucket) you create one version per existing bucket.

view this post on Zulip Matthieu Pizenberg (Nov 22 2020 at 16:03):

And an issue is that it grows in n2 with the number of dependencies that would require multiple buckets right?

view this post on Zulip Eh2406 (Nov 22 2020 at 16:04):

Now we need to tell it that the buckets are compatible. So we add it to the name. serde-major(1) is the name for serde Range::between(1.0, 2.0).

view this post on Zulip Eh2406 (Nov 22 2020 at 16:05):

And an issue is that it grows in n2 with the number of dependencies that would require multiple buckets right?

I don't think I see your point, can you elaborate?

view this post on Zulip Matthieu Pizenberg (Nov 22 2020 at 16:06):

nah not clear enought for me, I thought I had something and then it vanished from my mind ahah

view this post on Zulip Eh2406 (Nov 22 2020 at 16:06):

That happens a lot to me.

view this post on Zulip Matthieu Pizenberg (Nov 22 2020 at 16:07):

I mean I've to think a little more, but I'll make a cookie pause now ^^

view this post on Zulip Eh2406 (Nov 22 2020 at 16:10):

In theory we could notice in advance that >=2.0, < 3.0 will only have one bucket, and skip an indirection , but the code complexity has not been worth it so far.

view this post on Zulip Eh2406 (Nov 22 2020 at 19:38):

I pushed a copy of what I've bean making: https://github.com/pubgrub-rs/pubgrub/compare/dev...crates2
It is repetitive and has bad naming, but if you feel like taking a look, there it is.

view this post on Zulip Matthieu Pizenberg (Nov 22 2020 at 23:05):

I will certainly take a look during the week. I've also tried to formalize your "multiple-versions" method combined with my "on-the-file" packages id generators for an implementation of a dynamic "multiple-versions" dependency provider. Here is the explanation, let me know if you think this is clear.


For every package, we pre-compute "buckets", which are intervals in which
there can only be 1 version picked. So for example, if existing versions are
(0.1.0, 0.1.1, 0.2.0, 1.0.0, 1.0.1, 3.3.0), there are 4 buckets which are
{0.1.0, 0.1.1}, {0.2.0}, {1.0.0, 1.0.1}, {3.3.0}.
The rules of semantic versioning mean that before version 1.0.0, a bucket corresponds to
a minor version, and after it, a bucket corresponds to a major version.
But in general, buckets can be anything as long as it means for the implementer
that only 1 version per bucket can be selected.
With semantic versioning, we can easily identify buckets
by the first version of the corresponding bucket range.
So the bucket identifier of the range [0.1.0, 0.2.0[ is 0.1.0 while
the bucket identifier of the range [2.0.0, 3.0.0[ is 2.0.0

If "p" @ v depends on "q" in range r, the base idea of bucketted dependency resolver is:

Let's take a concrete example with the following dependencies:

root @ 1 depends on a * and d *
a @ 1 depends on
    b @ [2.1, 4.5[
    c @ [0.4, 0.6[
d @ 1 depends on b @ [1, 2[
b @ 1.0 has no dependency
b @ 2.1 has no dependency
b @ 3.0 has no dependency
c @ 0.4 has no dependency
c @ 0.5 has no dependency

The typical solution would be:
a @ 1, b @ 1.0, b @ 3.0, c @ 0.5, d @ 1, root @ 1

Let's precompute existing buckets for each package.
There is only one bucket for each one of "root", "a" and "d" since those
only have one existing version.
Package "b" has buckets 1.0.0, 2.0.0 and 3.0.0.
Package "c" has buckets 0.4.0 and 0.5.0.

So now let's see what are the dependencies that would be generated
on the fly by a dependency provider with bucketted logic.

"root" @ 1.0.0 depends on "root@1.0.0#a" @ * and "root@1.0.0#d" @ *
"root@1.0.0#a" @ 1.0.0 depends on "a$1.0.0" @ [1.0.0,2.0.0[
"root@1.0.0#d" @ 1.0.0 depends on "d$1.0.0" @ [1.0.0,2.0.0[

"a" @ 1.0.0 depends on "a@1.0.0#b" @ * and "a@1.0.0#c" @ *
"a@1.0.0#b" @ 2.0.0 depends on "b$2.0.0" @ [2.1.0,3.0.0[
"a@1.0.0#b" @ 3.0.0 depends on "b$3.0.0" @ [3.0.0,4.0.0[
"a@1.0.0#c" @ 0.4.0 depends on "c$0.4.0" @ [0.4.0,0.5.0[
"a@1.0.0#c" @ 0.5.0 depends on "c$0.5.0" @ [0.5.0,0.6.0[

"d" @ 1.0.0 depends on "d@1.0.0#b" @ *
"d@1.0.0#b" @ 1.0.0 depends on "b$1.0.0" @ [1.0.0,2.0.0[

view this post on Zulip Eh2406 (Nov 23 2020 at 00:37):

That makes sence and I think is what my implementation does.

view this post on Zulip Eh2406 (Nov 23 2020 at 03:33):

Force pushed a change that uses a little of the nomenclature you suggested.

view this post on Zulip Eh2406 (May 20 2021 at 14:20):

I just modified my script to run all versions of all packages, and it successfully ran! it took 21.6 hours and the resulting file is 1.17GB. The slowest case is hobo taking 571 seconds followed by lots of solana-* crates that take longer then 180 seconds. I have reason to believe that they would be helped by https://rust-lang.zulipchat.com/#narrow/stream/260232-t-cargo.2FPubGrub/topic/0.2E2.2Ex.20api.20is.20a.20Selection.20sort but have not measured by how much. This is amazingly good news, there are no cases in crates.io that triggers exponential blow up!

Perf is very sensitive to P::Clone, I had to use an arena to get things to run this fast and to avoid a stack overflow.

view this post on Zulip Matthieu Pizenberg (May 21 2021 at 16:00):

Wow congrats! that is good new indeed!


Last updated: Oct 21 2021 at 21:32 UTC