Implement reuse scanner plugin#11299
maennchen wants to merge 5 commits into oss-review-toolkit:main
SPDX documents without packages are valid per the specification. This is needed to support REUSE tool output which only contains files. Signed-off-by: Jonatan Männchen <[email protected]>
utils/spdx-document/src/test/kotlin/SpdxTagValueParserTest.kt
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@             Coverage Diff              @@
##               main   #11299      +/-   ##
============================================
+ Coverage     57.41%   57.44%   +0.03%
- Complexity     1705     1708       +3
============================================
  Files           346      346
  Lines         12875    12874       -1
  Branches       1228     1227       -1
============================================
+ Hits           7392     7396       +4
+ Misses         5005     5002       -3
+ Partials        478      476       -2
Add SpdxTagValueParser to parse SPDX Tag:Value format and return an SpdxDocument. Signed-off-by: Jonatan Männchen <[email protected]>
Add REUSE tool (version 6.2.0) to the Docker image via pip. Document the REUSE tool license (GPL-3.0-or-later) in /opt/licenses to comply with GPL requirements. The tool is installed unmodified from pip. Signed-off-by: Jonatan Männchen <[email protected]>
Add a scanner plugin that uses the REUSE tool to detect license and copyright information in source code projects following the REUSE specification. The plugin runs `reuse spdx` to generate an SPDX bill of materials and uses SpdxTagValueParser to extract license and copyright findings. Signed-off-by: Jonatan Männchen <[email protected]>
The REUSE scanner test fixtures contain files with custom license identifiers (e.g., LicenseRef-BS) that are intentionally used to test the scanner's behavior. Using `precedence = "override"` prevents REUSE from requiring corresponding license files in the root LICENSES directory for these test-only identifiers. Signed-off-by: Jonatan Männchen <[email protected]>
d3842ec to aa6f6aa
require(dataLicense.isNotBlank()) { "The data license must not be blank." }
require(packages.isNotEmpty()) { "At least one package must be listed in packages" }
> SPDX documents without packages are valid per the specification.
For reference, please add a link to the spec to support that "claim". I'd probably use https://spdx.github.io/spdx-spec/v2.3/composition-of-an-SPDX-document/#522-package-information-section, with an emphasis on "If" in "If SPDX information is being used to describe packages, then one instance of the package information per package being described shall exist".
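The relaxed validation discussed here can be sketched as follows. This is a minimal illustration only: `SketchSpdxDocument` is a hypothetical, trimmed-down stand-in and not ORT's actual `SpdxDocument` class, and its field names and types are assumptions. It keeps the data license check but no longer requires at least one package, so a files-only document (as produced for a REUSE project) is accepted.

```kotlin
// Hypothetical stand-in for illustration; field names and types are assumptions,
// not the real ORT SpdxDocument model.
data class SketchSpdxDocument(
    val spdxId: String,
    val dataLicense: String,
    val packages: List<String> = emptyList(),
    val files: List<String> = emptyList()
) {
    init {
        // This check stays: a blank data license is rejected.
        require(dataLicense.isNotBlank()) { "The data license must not be blank." }

        // Deliberately no require(packages.isNotEmpty()) here, so that a document
        // describing only files passes validation.
    }
}
```

With this shape, constructing a document that lists only files succeeds, while a blank `dataLicense` still throws an `IllegalArgumentException`.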
@@ -0,0 +1,226 @@
/*
 * Copyright (C) 2017 The ORT Project Copyright Holders <https://github.com/oss-review-toolkit/ort/blob/main/NOTICE>
Wrong Copyright year, I believe.
@@ -0,0 +1,226 @@
/*
> Add SpdxTagValueParser to parse SPDX Tag:Value format and return an SpdxDocument.
Please don't just repeat in the commit message what's more or less obvious from the diff. Rather explain why this is being added, what's the planned use-case?
In general, I'm a bit reluctant to add more custom code for an SPDX parser, as we decided to go in the direction of giving the new upstream SPDX Java library a try. So could that be used instead, with https://github.com/spdx/spdx-java-tagvalue-store?
Sure, I'll give this a try. I'll put the PR in draft and pick it up at a later point when I have time for it.
@@ -0,0 +1,4 @@
BS License
Before I review this commit in more detail, please document the motivation to add REUSE as a scanner in the commit message. Questions I'd like to see answered include:
- Why do none of the other scanner implementations suffice? It seems that REUSE is a very simple "scanner", and other scanners, like Askalono or Licensee, probably cover the same feature set and are fast as well. Is it maybe so that for comparison or whatever reason you rely on the specifics of how REUSE reports things?
- Letting REUSE produce SPDX in order to parse it and then create a scan result is a bit like taking the long way. But it indeed seems like SPDX, and in tag-value format (!), is the only way to get license and copyright findings out of the REUSE tool. So in a way, I wonder whether https://codeberg.org/fsfe/reuse-tool/issues/394 should be implemented first? But then again, that REUSE issue does not seem to have moved forward for 4 years.
> Why do none of the other scanner implementations suffice?
The main reason is that this is less about generic license detection and more about semantic compatibility and trust.
Many projects already rely on `reuse lint` in their CI, and a lot of developers run it locally because it’s lightweight and easy to use. For those projects, REUSE is effectively the source of truth for licensing and copyright compliance.
REUSE is also intentionally simple. It checks compliance against a very explicit specification (SPDX headers, .license files, dep5, etc.) and reports deterministically whether a project follows those rules. It does not try to infer licenses from file contents. Tools like Askalono or Licensee solve a different problem by doing best-effort license detection via matching or classification.
Using the same tool inside ORT matters because it guarantees identical results. Even small semantic differences in how headers or edge cases are handled would lead to confusing discrepancies between `reuse lint` and ORT.
> Letting REUSE produce SPDX in order to parse it and then create a scan result is a bit like taking the long way.
I agree that going via SPDX tag/value looks like the long way around. Unfortunately, at the moment it’s also the only supported way to extract structured findings from REUSE. I would much prefer a JSON output as well.
That said, my intent here is deliberately narrow: I want to use REUSE only as a scanner for ORT. REUSE is not trying to replicate what ORT provides around license policy evaluation, vulnerabilities, or dependency-wide analysis. ORT remains the system that aggregates results and applies compliance rules.
REUSE also has a clearly defined scope: it only works if a project (and optionally its dependencies) actually implements REUSE. If that’s not the case, this scanner simply isn’t applicable, which is expected.
Thanks for raising the comment, I'd like to see the reaction before proceeding here.
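To make the "via SPDX tag/value" route concrete, here is a rough sketch of how per-file findings could be pulled out of tag-value text. It is not the `SpdxTagValueParser` from this PR and not the upstream SPDX Java library. `FileFinding` and `parseFileFindings` are hypothetical names, and the sketch assumes that every file section starts with a `FileName:` line and that each value fits on a single line (multi-line `<text>` blocks are not handled).

```kotlin
// Hypothetical result type for this sketch; not an ORT class.
data class FileFinding(
    val path: String,
    val licenses: List<String>,
    val copyrights: List<String>
)

// Collects the license and copyright tags of each file section in SPDX tag-value text.
// Assumptions: a "FileName:" line opens a file section, and values are single-line.
fun parseFileFindings(tagValue: String): List<FileFinding> {
    val findings = mutableListOf<FileFinding>()
    val licenses = mutableListOf<String>()
    val copyrights = mutableListOf<String>()
    var path: String? = null

    // Stores the section collected so far (if any) and resets the buffers.
    fun flush() {
        path?.let { findings.add(FileFinding(it, licenses.toList(), copyrights.toList())) }
        licenses.clear()
        copyrights.clear()
    }

    for (line in tagValue.lineSequence()) {
        // Split at the first colon; lines without a colon yield an empty tag and are ignored.
        val tag = line.substringBefore(':', "").trim()
        val value = line.substringAfter(':', "").trim()
            .removePrefix("<text>").removeSuffix("</text>").trim()

        when (tag) {
            "FileName" -> {
                flush()
                path = value
            }

            "LicenseInfoInFile" -> if (path != null) licenses.add(value)

            // Skip the SPDX placeholder values that carry no copyright statement.
            "FileCopyrightText" ->
                if (path != null && value != "NONE" && value != "NOASSERTION") copyrights.add(value)
        }
    }

    flush()
    return findings
}
```

Feeding it a document with two `FileName:` sections yields two `FileFinding` entries, each with the `LicenseInfoInFile` values of its section. Mapping those entries to ORT scan results is left out of the sketch.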
Adds a scanner plugin for reuse-tool. See individual commits.