Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incoming: Arborist #64

Closed
kemitchell opened this issue Nov 27, 2019 · 21 comments
Closed

Incoming: Arborist #64

kemitchell opened this issue Nov 27, 2019 · 21 comments

Comments

@kemitchell
Copy link
Member

Isaac was kind enough to confirm that read-package-tree will probably be deprecated in favor of Arborist. Arborist will expose a promise API.

I have watched release on the Arborist repo. When the time comes, licensee will want to switch. But I'd strongly recommend that we hold off doing any substantial development until that decision is taken.

@kemitchell
Copy link
Member Author

@isaacs
Copy link

isaacs commented Jul 28, 2020

Would you be open to a PR that changes this to use Arborist instead of using read-package-tree or shelling out to npm ls? I think it's about ready to start poking at, and having our licensee test meaningfully passing in CI would be nice.

@kemitchell
Copy link
Member Author

@isaacs I've been strongly considering rewriting Licensee as a new package, dropping old Node support, using Arborist, offering substantial additional functionality, and charging for closed and commercial use. The code would remain public. I'm sure I'd run CI.

I've probably spent more time on all sides of licensing in npm than most anyone I know, but it's not exactly fun (or risk free) coding package metadata corrections, staying on top of SPDX upstream, writing generators to spit out due diligence tables, and so on. Especially after doing it for years.

If Arborist fells ready, I can put this on my weekend list.

@isaacs
Copy link

isaacs commented Jul 28, 2020

Yeah, it's definitely close. We'll probably have a v7 beta in a few weeks. I'd set the dependency to ~0.0.9 || 1 for now, and then bump it to just 1.x once it's out. It'll have a few more bugfixes I'm sure (just found a new one today, actually), but the API surface isn't expected to change between now and 1.x. The biggest shortcoming right now is that the docs are woefully incomplete.

I'm happy to help with all the tree handling stuff, but I'm not eager to wade into the actual license logic :)

@jsha
Copy link

jsha commented May 16, 2022

FYI it looks like Arborist is stable now, and read-package-tree says:

The functionality that this package provided is now in @npmcli/arborist

One small reason it would be nice to update: read-package-tree depends on path-is-absolute, which is also deprecated:

This package is no longer relevant as Node.js 0.12 is unmaintained.

Since npm depends on licensee, that means npm indirectly depends on a pair of deprecated packages.

@kemitchell
Copy link
Member Author

@jsha thanks for the heads-up. I still think that Licensee should get in line with npm CLI on Arborist.

Any personal interest here? I'd certainly welcome a PR. I'll get to it eventually, but can't say when.

PS: Three cheers for Let's Encrypt! I've more than a few LE certs myself.

@jsha
Copy link

jsha commented May 16, 2022

I'm pretty rusty on the Node ecosystem lately - I noticed the issue because I just installed npm for the first time in a while and wanted to look at its dependencies. So the migration is probably more than I can handle right now, but who knows!

PS: Three cheers for Let's Encrypt! I've more than a few LE certs myself.

Aw, thanks! That's always lovely to hear.

@kemitchell
Copy link
Member Author

Poking around with Arborist for the license checker a little.

Good News: Arborist can spit out a nice, flat structure representing the current node_modules tree.

Bad News:

  • AFAICT, the nodes of the trees Arborist spits out do not have extended metadata from package.json, such as the sanitized license value or homepage. We are going to have to load package.jsons on our own, though of course there's a package for that.

  • Arborist does a lot more than old read-package-tree used to do, including all the npm install-related diffing. It's a bigger dep with a bunch of direct dependencies for functionality behind that Licensee simply isn't gonna need.

@ljharb
Copy link
Member

ljharb commented Jun 2, 2022

The former seems likely for arborist to add, assuming that data is in the packument.

@kemitchell
Copy link
Member Author

Arborist is no longer a separate repo. They've folded it into the larger CLI repo.

It's clearly processing license metadata. There's a subcommand of the arborist CLI that prints package counts by license: https://github.com/npm/cli/blob/2db3eff44d7bb86b956207cc63d279806fd14ed0/workspaces/arborist/bin/license.js

@isaacs
Copy link

isaacs commented Jun 2, 2022

Last I was involved in this, I believe that the blocker to having license info in Arborist by default is for it to be included in the minified packuments served by the registry. This is still the only blocker:

$ curl -s https://registry.npmjs.org/abbrev -H accept:application/vnd.npm.install-v1+json | npx json versions["'1.1.1'"]
{
  "name": "abbrev",
  "version": "1.1.1",
  "devDependencies": {
    "tap": "^10.1"
  },
  "dist": {
    "integrity": "sha512-nne9/IiQ/hzIhY6pdDnbBtz7DjPTKrY00P/zvPSm5pOFkl6xuGrGnXn/VtTNNfNtAfZ9/1RtehkszU9qcTii0Q==",
    "shasum": "f8f2c887ad10bf67f634f005b6987fed3179aac8",
    "tarball": "https://registry.npmjs.org/abbrev/-/abbrev-1.1.1.tgz",
    "signatures": [
      {
        "keyid": "SHA256:jl3bwswu80PjjokCgh0o2w5c2U4LhQAE57gj9cz1kzA",
        "sig": "MEYCIQDvQCH2XtwIWIVnBSH4P51+UstW+ybuYvlEWwSQoGW7fgIhAJleZ3eJj+NTABBRNuW2xhR8pQwFRPSd9cFjP/aS3RrE"
      }
    ]
  }
}

Once the registry starts including license info in the minified packument, Arborist will track it, as it is on the list of fields that Arborist would record in the lockfile and add to Node objects.

@kemitchell
Copy link
Member Author

It looks like sometimes I'm getting data on node.package.license, and sometimes I'm not.

@isaacs, sorry to bug you here---I'm a bit afraid of wading into npm/cli issues. Is Arborist still going to rely on minified pakuments from the registry if we always just ask for the local tree with loadActual?

@isaacs
Copy link

isaacs commented Jun 2, 2022

No worries about bugging me, I'm the right one to bug, and you've earned it ;)

Tl;dr - If you do rm node_modules/.package-lock.json and then arborist.loadActual(), you'll reliably get license metadata included in the tree metadata.

Full explanation:

Arborist doesn't exactly rely on the minified packuments for doing a loadActual. However, there are some shortcuts that it takes to avoid extra work, and that is the reason why sometimes you see full package.json contents in the tree, and sometimes it's the minified data.

Arborist saves the state of the tree to a lockfile at node_modules/.package-lock.json (aka the "hidden lockfile"), so that it doesn't ever read more than it needs to. If nothing has changed, then it's the same data, so any package in node_modules that isn't newer than the hidden lockfile, it doesn't bother reading. This can drop loadActual time down by about 90-99%, because it's one file read instead of thousands.

If you have a full node_modules, and no hidden lockfile, then loadActual will write this file with what it finds, so that the next time it's fast. And, because it builds the tree in that case from actual package.json data, it'll include everything that arborist considers valuable (including license).

When it does a reify(), Arborist says, well, I have the tree, I know that at the end of this process, the tree is going to match what I have here, so it just writes that ideal tree data to the hidden lockfile. (This might not always match the contents of package-lock.json. For example, optional and development deps are written to package-lock.json, even if they are excluded from the actual reification, but the hidden lockfile is intended to match the contents of node_modules exactly, so it's written after the prune.)

The problem is, reify() starts with the root package.json dependencies, and the minified data from the registry. (And the actual lockfile if present, but of course, that was typically built up from the registry metadata previously, so it's effectively the same thing.) At no point does it read the package.json files in node_modules (unless the lockfile is old (pre-v7) or ancient (pre-v6), since these lack data it needs), because why would it? That's several thousand reads, and it already has the tree. Unfortunately, the registry metadata doesn't include the license identifiers.

So, the tree it's reifying doesn't have license identifiers, and thus, the tree it writes to the hidden lockfile doesn't either. The next loadActual takes the shortcut, and you get no licenses.

@kemitchell
Copy link
Member Author

@isaacs, thanks so much.

Might it be possible to get an Arborist flag to ignore existing second lockfiles? I suppose our alternative for Licensee is looking for a second lockfile, squirreling it away to a temp dir if present, doing our analysis on a fresh free, then replacing. Feels wrong.

@isaacs
Copy link

isaacs commented Jun 2, 2022

Deleting the hidden lockfile is perfectly safe. (There are a number of situations where npm will ignore or clobber it anyway, and users routinely delete node_modules, update the tree with yarn, etc., so it can never be more than a cautiously-trusted nice-to-have enhancement.) But yes, it does feel wrong, and this is exposing the wrongness of a design constraint of Arborist (Virtual Trees and Actual Trees are Interchangeable for All Relevant Purposes) is not possible, due to an expected contract with the registry being violated (Minified Packuments Have All That npm Needs).

The fix that won't feel wrong is to get license added to the include list for minified packuments. That was the intent in 2020, but I guess it didn't ever happen. Maybe @nlf or @darcyclarke could shed some light on the feasibility of getting that done? Even if that's done, though, you'll have millions of package trees out in the wild that lack this metadata until it flows in via dependency updates.

In any event, you'd probably like to be able to run licensee on the virtual tree expressed in package-lock.json, so getting it into the minified registry metadata is worthwhile.

@kemitchell
Copy link
Member Author

Forgive me. Why would we want the "virtual tree" as opposed to a reflection of what's on disk?

Sounds like the way forward for now is detecting and deleting node_modules/.package-lock.json if present, then invoking Arborist.

@isaacs
Copy link

isaacs commented Jun 2, 2022

Forgive me. Why would we want the "virtual tree" as opposed to a reflection of what's on disk?

Let's say that you're using chokidar, which has an optional dep on fsevents, which only installs on macOS systems. One day, fsevents changes its license to something objectionable for your organization. You run licensee as part of your CI process, which executes on a Linux VM.

If you run licensee against the actual tree on a linux machine, then fsevents won't be present, and licensee won't complain. But it is in the "ideal" tree as captured in package-lock.json.

Flip it the other direction, and it gets potentially even more hazardous. Say licensee is run on local macOS dev boxes only, and as a result you ship something to production with a license that allows dev usage, but not inclusion in a for-profit application.

@darcyclarke
Copy link

darcyclarke commented Aug 12, 2022

@isaacs I don't think we have any plans to add more blessed fields/data to corgi docs at the moment (I also can't remember when/why we considered license back in 2020 but didn't actually add it... you'll have to refresh my memory there 🤔).

Funny enough though, @nlf happened to "fix" this the other day as he was digging into another bug where npm query was also missing package.json contents when hidden lockfiles were present (we're basically doing the same querying of the inventory but with fancy CSS selectors - ref. https://github.com/npm/cli/blob/latest/workspaces/arborist/lib/query-selector-all.js / https://docs.npmjs.org/cli/v8/using-npm/dependency-selectors).

The way we "fixed" this was by introducing a new forceActual option to loadActual() which replicates the experience of manually deleting the hidden lockfile (ref. npm/cli#5263); @kemitchell if you set that to be true now you should be in business.

@isaacs
Copy link

isaacs commented Aug 12, 2022

Ah, too bad it's not an option to add fields to the corgis, we tried hard to avoid having to force a loadActual except when it's absolutely necessary, but sounds like the right call.

@kemitchell
Copy link
Member Author

Thanks, @darcyclarke and @isaacs. I have marked #77 ready for review.

@isaacs
Copy link

isaacs commented Aug 23, 2022

Left a few drive-by comments from a retired maintainer, hopefully they're helpful :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants