Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidate configuration of scan storages #8721

Open
mnonnenmacher opened this issue Jun 2, 2024 · 9 comments
Open

Consolidate configuration of scan storages #8721

mnonnenmacher opened this issue Jun 2, 2024 · 9 comments
Labels
enhancement Issues that are considered to be enhancements scanner About the scanner tool

Comments

@mnonnenmacher
Copy link
Member

mnonnenmacher commented Jun 2, 2024

The scanner supports storing multiple types of data in storages for reuse in subsequents runs or other tools:

  • Scan results
    • Package based
    • Provenance based
  • Provenance resolutions results
    • For packages
    • For nested provenances
  • File archives
  • File lists

Currently the storage backends can be configured separately for each of those four data types. While this is very flexible, in practice it provides little value. For example, if scan results are stored in a Postgres database, there is little reason to store provenance results in a different place. Or if file archives are stored in S3, there is little reason to store file lists in a different place.

This flexibility makes the configuration complex: The default settings store all data in a local directory which is usually not desired in a production setup, so to store the data remotely four storage backend configurations are required. This often confuses users and can also cause performance issues for users not knowing how the scanner works internally, for example, by forgetting to configure a provenance storage which leads to unnecessary repetition of the provenance resolution.

To simplify the configuration, the proposal is to consolidate the configuration to just two types of data:

  • Structured data
    • Scan results
    • Provenance resolution results
  • Binary data
    • File archives
    • File lists (at least currently, it could be more efficient to treat file lists as structured data as well)

The implementation proposal is:

  • Rename ScanStorage and all related classes to ScanResultStorage
    • When introduced, the name ScanStorage was chosen because ScanResultStorage was already taken, but this is not the case anymore.
  • Make two interfaces providing the functions for storing structured data and binary data
    • ScanStorage
    • BinaryStorage
  • Make implementations of those interfaces that reuse the existing implementations
    • For example, PostgresScanStorage uses ProvenanceBasedPostgresStorage, PostgresPackageProvenanceStorage and PostgresNestedProvenanceStorage.
  • Adapt the configuration to require only configuration for those two types of storages.
  • Make the new interfaces plugins (see Turn scan storages into plugins #6603)
    • This ensures that new implementations provide all required features. For example, a new MariaDbScanStorage should not only provide a way to store scan results, but also to store provenance resolution results.
@mnonnenmacher mnonnenmacher added scanner About the scanner tool to triage Issues that need triaging labels Jun 2, 2024
@sschuberth
Copy link
Member

For reference, this also relates to the pending draft at #5516 which tries to address the storage configuration complexity by centralizing and reusing it.

@sschuberth
Copy link
Member

To simplify the configuration, the proposal is to consolidate the configuration to just two types of data:

That distinction makes sense to me. And maybe the interfaces should be named accordingly: For binary data that's already the case, but maybe ScanStorage should be more generally StructuredStorage. Thinking about it, maybe BlobStorage would then be a better contrast to that than BinaryStorage.

@fviernau fviernau changed the title Consolidate Storages Scan Storages Consolidate Scan Storages Jun 4, 2024
@sschuberth sschuberth added enhancement Issues that are considered to be enhancements and removed to triage Issues that need triaging labels Aug 13, 2024
@sschuberth sschuberth changed the title Consolidate Scan Storages Consolidate configuration of scan storages Nov 26, 2024
@mnonnenmacher
Copy link
Member Author

In the context of this issue and #6603 I would also like to further simplify the scan storage implementation:

Currently, an arbitrary number of storages can be configured, and they can act as either reader, writer, or both. I would like to change that so that only a single implementation can be configured which is always reader and writer.
Also, we support package and provenance based storages. I would like to remove the support for package based storages because this would significantly reduce the complexity of the scanner logic.

For the separate reader/writer configuration I am aware of only two use cases:

  1. The ClearlyDefinedStorage which can only read scan results. My proposal would be to separate that functionality from the scan result storage, for example by adding a new interface like ScanResultProvider, especially as ClearlyDefined only supports lookup by package, not by provenance.
  2. When migrating to a different storage implementation, the old one can still be configured as read-only to not have to re-scan everything. Instead, I would propose to add a migration command for this, but only if it is really required.

The following implementations would be affected by removing support for package based storages:

  1. ClearlyDefinedStorage: See above.
  2. The old package based PostgreSQL and file storages: For those we have provenance based replacements.
  3. Sw360Storage: I have no idea if anyone is using this. Ideally, we can remove it.

@oss-review-toolkit/core-devs I would be very interested in your feedback.

@sschuberth
Copy link
Member

sschuberth commented Jan 14, 2025

we support package and provenance based storages. I would like to remove the support for package based storages

I'm fine with removing support for package-based storages.

Just as a reminder: I believe we do still have plans to add file-based storages / query provenance-based storages on a file-level. This should be loosely kept in mind for relevant design choices.

adding a new interface like ScanResultProvider, especially as ClearlyDefined only supports lookup by package, not by provenance.

Sounds reasonable to me. I foresee more such "read-only" scan result storages coming up in the context of https://github.com/aboutcode-org, and such results would probably be queried by purl (so, also not by provenance; that is, unless you encode the repository_url etc. as part of the qualifier).

Sw360Storage: I have no idea if anyone is using this. Ideally, we can remove it.

Are you, @heliocastro?

I would like to change that so that only a single implementation can be configured which is always reader and writer.

Just thinking out aloud here: I still wish for an IPFS scan storage. Maybe such a storage would be slow, so maybe you'd want to first use a faster storage and only use IPFS as the fallback. Such a thing would not be possible with the new implementation.

Also I'm wondering: Does the current complexity mainly come from allowing readers and writers to be different, or from allowing multiple readers and writers? Would it make sense to maybe only remove the more complex of the aforementioned features, but keeping the other one?

@mnonnenmacher
Copy link
Member Author

mnonnenmacher commented Jan 14, 2025

Just as a reminder: I believe we do still have plans to add file-based storages / query provenance-based storages on a file-level. This should be loosely kept in mind for relevant design choices.

Yes, that's another argument for simplifying the current implementation.

Just thinking out aloud here: I still wish for #1328. Maybe such a storage would be slow, so maybe you'd want to first use a faster storage and only use IPFS as the fallback. Such a thing would not be possible with the new implementation.

So the faster storage would act as a cache for the IPFS storage? Maybe it's better to solve this in the IPFS implementation itself instead of on a higher level.

Also I'm wondering: Does the current complexity mainly come from allowing readers and writers to be different, or from allowing multiple readers and writers? Would it make sense to maybe only remove the more complex of the aforementioned features, but keeping the other one?

In the scanner code it's mainly the support for both package and provenance based storages. In the configuration it's mainly the support for multiple kinds of storages at the same time, separate readers and writers, and the fact that the configuration is scattered.

What I would like is to separate ORT's internal storage logic from external services, because they do not use the same concepts, and then find other ways to integrate the external services.

@heliocastro
Copy link
Contributor

heliocastro commented Jan 15, 2025

@mnonnenmacher @sschuberth We already overhaul the entire SW360 backend and from next release, 20.x it will be very different, so the current storage model will work only for version up to 18.x and i honestly think that this one should go for good.
We want to not anymore store artifacts, but instead add links in the respective components to some storage defined, so what i think is actually reasonable:

  • Deprecate / remove the current sw360 storage, but before do some inquiry on the community meeting
  • Provide a way to when we use the sw360 provider component update, be able to use the current storage info to add as a reference link to

@mnonnenmacher
Copy link
Member Author

@heliocastro So if I understand you correctly you are not using the Sw360Storage, correct?

* Provide a way to when we use the sw360 provider component update, be able to use the current storage info to add as a reference link to

Sorry, I do not understand what you mean, could you elaborate?

@heliocastro
Copy link
Contributor

Yep, not using Sw360Storage at all, correct.

For the second point, the "upload-result-to-sw360" utility today create projects and components based on the information coming from the ORT results.
So, i plan to extend this to add a link/info to the referred content, so i will need to have a way to retrieve information about the storage if exists.

Basically, we are eliminating the Sw360Storage, but adding a functionality to upload-result to add a reference to whatever storage is defined on ORT config

@mnonnenmacher
Copy link
Member Author

Basically, we are eliminating the Sw360Storage, but adding a functionality to upload-result to add a reference to whatever storage is defined on ORT config

That sounds good, it's in line with my goal to separate the integration of external services from ORT's internal storage implementation.

So, i plan to extend this to add a link/info to the referred content, so i will need to have a way to retrieve information about the storage if exists.

I think currently there is no easy way to get this, could you give an example what this reference should look like, for example, if results are stored in a PostgreSQL database?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Issues that are considered to be enhancements scanner About the scanner tool
Projects
None yet
Development

No branches or pull requests

3 participants