chore: restrict malware check to python projects; apply general fixes… (#790)

Signed-off-by: Ben Selwyn-Smith <benselwynsmith@googlemail.com>
benmss authored Jul 15, 2024
1 parent 2e9784a commit a61906e
Showing 25 changed files with 505 additions and 233 deletions.
32 changes: 16 additions & 16 deletions src/macaron/malware_analyzer/README.md
Original file line number Diff line number Diff line change
@@ -5,11 +5,11 @@
We schedule the heuristics sequentially:

1. **Empty Project Link**: If the package contains project links (e.g., documentation, Git Repositories),
the analyzer will further operate the heuristic `Unreachable Project Links` to analyze if all the project links are not reachable.
the analyzer will further operate the heuristic `Unreachable Project Links` to analyze if all the project links are unreachable.
2. **One Release**: Checks if there is only one release of the package. If the package contains multiple
releases, the checker will further check the release frequency through `High Release Frequency` and
`Unchanged Release` to see if the maintainers release multiple times in a short timeframe (threshold) and
whether the released contents are identical.
`Unchanged Release` to see if the maintainers release multiple times in a short timeframe (threshold), and
whether the contents of the releases are identical.
3. **Closer Release Join Date**: Considers the date when the maintainer registered their account (if
available). The checker will calculate the gap between the latest release date and the maintainer's account
registration date.
@@ -18,24 +18,24 @@ encryption and `requests` for data exfiltration.

## Supported Ecosystem: PyPI

Define Seven Heuristics: `False` means suspicious and `True` means non-suspicious. `SKIP` means some metadata are missing, and the checker skips the heuristic.
Define Seven Heuristics: `False` means suspicious and `True` means benign. `SKIP` means some metadata is missing, and the checker will skip the heuristic.

1. **Empty Project Link**
- **Description**: Checks whether the package contains any project links (e.g., documents or Git
Repositories). Many malicious activities do not include any project link.
Repositories). Many malicious activities do not include any project links.
- **Rule**: Return `FALSE` when there are no project links; otherwise, return `TRUE`.

2. **Unreachable Project Links**
- **Description**: Checks the accessibility of the project links. This is considered an auxiliary
heuristic since no cases have met this heuristic.
- **Rule**: Return `FALSE` if all project links are not reachable; otherwise, return `TRUE`.
- **Rule**: Return `FALSE` if all project links are unreachable; otherwise, return `TRUE`.

3. **One Release**
- **Description**: Checks whether the package has only one release.
- **Rule**: Return `FALSE` if the package contains only one release; otherwise, return `TRUE`.

4. **High Release Frequency**
- **Description**: Checks if the package released multiple versions within a short period. We calculate
- **Description**: Checks if the package released multiple versions within a short timeframe. We calculate
the release frequency and define a default frequency threshold of 2 days.
- **Rule**: Return `FALSE` if the frequency is higher than the threshold; otherwise, return `TRUE`.

@@ -49,19 +49,19 @@ Define Seven Heuristics: `False` means suspicious and `True` means non-suspiciou
- **Rule**: Return `FALSE` if the gap is less than the threshold; otherwise, return `TRUE`.

7. **Suspicious Setup**
- **Description**: Checks the `setup.py` to see if there are suspicious imported modules and the
`install_requires` packages installed during the package installation process. We define two suspicious
- **Description**: Checks the `setup.py` to see if there are suspicious imported modules, or
`install_requires` packages that are installed during the package installation process. We define two suspicious
keywords as the blacklist.
- **Rule**: Return `FALSE` if the package name contains suspicious keywords; otherwise, return `TRUE`.

## Heuristics-Based Analyzer: Scanning 1167 Packages from Trusted Organizations

| Heuristic Name | Count |
| ------------------ | ----- |
| Lower Release | 102 |
| Empty Link | 45 |
| Links Missing | 24 |
| Frequent Release | 14 |
| Suspicious Setup | 5 |
| Heuristic Name | Count |
|------------------| ----- |
| One Release | 102 |
| Empty Link | 45 |
| Links Missing | 24 |
| Frequent Release | 14 |
| Suspicious Setup | 5 |

**The result is used as a reference for the confidence score to lower the false positive rate.**
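The sequential scheduling described in this README can be sketched roughly as follows. This is a minimal illustration, not Macaron's implementation: the `DEPENDS_ON` table, the heuristic name strings, and `run_in_order` are hypothetical, and the assumption that both release heuristics depend on `One Release` passing is read off the README text rather than the code.

```python
from __future__ import annotations

from enum import Enum
from typing import Callable


class HeuristicResult(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    SKIP = "SKIP"


# Hypothetical dependency table: heuristic -> (dependency, result it must have).
# A dependent heuristic only runs when its dependency produced that result.
DEPENDS_ON: dict[str, tuple[str, HeuristicResult]] = {
    "unreachable_project_links": ("empty_project_link", HeuristicResult.PASS),
    "high_release_frequency": ("one_release", HeuristicResult.PASS),
    "unchanged_release": ("one_release", HeuristicResult.PASS),
}


def run_in_order(
    heuristics: list[tuple[str, Callable[[], HeuristicResult]]],
) -> dict[str, HeuristicResult]:
    """Run heuristics sequentially, skipping those whose dependency did not match."""
    results: dict[str, HeuristicResult] = {}
    for name, analyze in heuristics:
        dependency = DEPENDS_ON.get(name)
        if dependency is not None:
            dep_name, expected = dependency
            if results.get(dep_name) != expected:
                results[name] = HeuristicResult.SKIP
                continue
        results[name] = analyze()
    return results


# A package with no project links but several releases in quick succession.
example = run_in_order(
    [
        ("empty_project_link", lambda: HeuristicResult.FAIL),
        ("unreachable_project_links", lambda: HeuristicResult.PASS),
        ("one_release", lambda: HeuristicResult.PASS),
        ("high_release_frequency", lambda: HeuristicResult.FAIL),
    ]
)
```

In this example the link reachability heuristic is skipped because there are no links to probe, while the release frequency heuristic still runs because the package has more than one release.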
2 changes: 1 addition & 1 deletion src/macaron/malware_analyzer/__init__.py
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
# Copyright (c) 2022 - 2024, Oracle and/or its affiliates. All rights reserved.
# Copyright (c) 2024 - 2024, Oracle and/or its affiliates. All rights reserved.
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.
2 changes: 1 addition & 1 deletion src/macaron/malware_analyzer/checks/__init__.py
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
# Copyright (c) 2022 - 2024, Oracle and/or its affiliates. All rights reserved.
# Copyright (c) 2024 - 2024, Oracle and/or its affiliates. All rights reserved.
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.
Original file line number Diff line number Diff line change
@@ -201,7 +201,7 @@ def _should_skip(
Returns
-------
bool
bool
Returns True if any result of the dependency heuristic does not match the expected result.
Otherwise, returns False.
"""
@@ -238,6 +238,7 @@ def run_heuristics(
if should_skip:
results[analyzer.heuristic] = HeuristicResult.SKIP
continue

result, detail_info = analyzer.analyze(api_client)
if analyzer.heuristic:
results[analyzer.heuristic] = result
@@ -257,8 +258,9 @@ def run_check(self, ctx: AnalyzeContext) -> CheckResultData:
CheckResultData
The result of the check.
"""
purl = ctx.component.purl
parsed_purl = PackageURL.from_string(purl)
parsed_purl = PackageURL.from_string(ctx.component.purl)
if parsed_purl.type != "pypi":
return CheckResultData(result_tables=[], result_type=CheckResultType.UNKNOWN)
package = parsed_purl.name
result_tables: list[CheckFacts] = []
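
The early return added in `run_check` restricts the malware check to PyPI packages by looking at the type component of the package URL. The sketch below illustrates the same guard with a simplified, standard-library-only stand-in for `PackageURL.from_string`; `purl_type` and `should_analyze` are hypothetical helpers, and the parsing is deliberately naive (it does not validate the rest of the purl).

```python
from __future__ import annotations


def purl_type(purl: str) -> str | None:
    """Return the lowercase type of a purl such as 'pkg:pypi/requests@2.31.0'."""
    scheme, sep, remainder = purl.partition(":")
    if not sep or scheme.lower() != "pkg":
        return None
    # Tolerate the 'pkg://type/...' spelling by dropping leading slashes.
    package_type = remainder.lstrip("/").partition("/")[0]
    return package_type.lower() or None


def should_analyze(purl: str) -> bool:
    """Only PyPI packages are handled by the heuristics in this sketch."""
    return purl_type(purl) == "pypi"
```

For any other ecosystem the check in the diff returns `CheckResultType.UNKNOWN` with empty result tables rather than running the PyPI heuristics.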

2 changes: 1 addition & 1 deletion src/macaron/malware_analyzer/datetime_parser.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Copyright (c) 2023 - 2024, Oracle and/or its affiliates. All rights reserved.
# Copyright (c) 2024 - 2024, Oracle and/or its affiliates. All rights reserved.
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.

"""This module provides a helper function for safely parsing datetime strings."""
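
The body of the helper is not shown in this diff. As a rough idea of what "safely parsing" can mean, the sketch below wraps `datetime.strptime` and returns `None` instead of raising; `safe_parse_datetime` is a hypothetical name, and the default format is the one used for release upload times elsewhere in this commit.

```python
from __future__ import annotations

from datetime import datetime


def safe_parse_datetime(value: str, fmt: str = "%Y-%m-%dT%H:%M:%S") -> datetime | None:
    """Parse value with fmt, returning None when the string does not match."""
    try:
        return datetime.strptime(value, fmt)
    except (ValueError, TypeError):
        # ValueError: the string does not match the format.
        # TypeError: the value is not a string at all (e.g. None).
        return None
```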
2 changes: 1 addition & 1 deletion src/macaron/malware_analyzer/pypi_heuristics/__init__.py
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
# Copyright (c) 2022 - 2024, Oracle and/or its affiliates. All rights reserved.
# Copyright (c) 2024 - 2024, Oracle and/or its affiliates. All rights reserved.
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.
6 changes: 3 additions & 3 deletions src/macaron/malware_analyzer/pypi_heuristics/heuristics.py
Original file line number Diff line number Diff line change
@@ -51,6 +51,6 @@ class HeuristicResult(Enum):
join date or release information, is missing or unavailable.
"""

PASS = "PASS" # nosec
FAIL = "FAIL" # nosec
SKIP = "SKIP" # nosec
PASS = "PASS" # nosec B105
FAIL = "FAIL"
SKIP = "SKIP"
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
# Copyright (c) 2022 - 2024, Oracle and/or its affiliates. All rights reserved.
# Copyright (c) 2024 - 2024, Oracle and/or its affiliates. All rights reserved.
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.
Original file line number Diff line number Diff line change
@@ -15,10 +15,7 @@
class CloserReleaseJoinDateAnalyzer(BaseHeuristicAnalyzer):
"""Analyzer checks the heuristic.
Note
----
If any maintainer's date duration is larger than threshold,
we consider it as "PASS".
If any maintainer's date duration is larger than threshold, we consider it as "PASS".
"""

def __init__(self) -> None:
@@ -32,21 +29,25 @@ def _load_defaults(self) -> int:
section_name = "heuristic.pypi"
if defaults.has_section(section_name):
section = defaults[section_name]
return int(section.get("timedelta_threshold_of_join_release"))
return section.getint("timedelta_threshold_of_join_release")
return 5

def _get_maintainers_join_date(self, api_client: PyPIRegistry) -> list[datetime] | None:
"""Get the join date of the maintainers.
Each package might have multiple maintainers.
Parameters
----------
api_client: PyPIRegistry
The API client.
Returns
-------
list[datetime] | None: Maintainers join date.
Note
----
Each package might have multiple maintainers.
list[datetime] | None
The maintainers' join date.
"""
maintainers: list | None = api_client.get_maintainer_of_package()
maintainers: list | None = api_client.get_maintainers_of_package()
if maintainers is None:
return None

@@ -55,29 +56,40 @@ def _get_maintainers_join_date(self, api_client: PyPIRegistry) -> list[datetime]
maintainer_join_date = api_client.get_maintainer_join_date(maintainer)
if maintainer_join_date is not None:
join_dates.append(maintainer_join_date)

return join_dates

def _get_latest_release_date(self, api_client: PyPIRegistry) -> datetime | None:
"""Get package's latest release date.
Parameters
----------
api_client: PyPIRegistry
The API client.
Returns
-------
datetime | None: Package's latest release date.
datetime | None
The package's latest release date.
"""
upload_time: str | None = api_client.get_latest_release_upload_time()
if upload_time:
datetime_format: str = "%Y-%m-%dT%H:%M:%S"
res: datetime | None = parse_datetime(upload_time, datetime_format)
if res:
return res
return None
if not upload_time:
return None
datetime_format: str = "%Y-%m-%dT%H:%M:%S"
return parse_datetime(upload_time, datetime_format)

def analyze(self, api_client: PyPIRegistry) -> tuple[HeuristicResult, dict]:
"""Check whether the maintainers' join date is close to the package's latest release date.
Parameters
----------
api_client: PyPIRegistry
The API client.
Returns
-------
tuple[HeuristicResult, dict]: Result and confidence.
tuple[HeuristicResult, dict]
The result and details.
"""
maintainers_join_date: list[datetime] | None = self._get_maintainers_join_date(api_client)
latest_release_date: datetime | None = self._get_latest_release_date(api_client)
@@ -97,4 +109,5 @@ def analyze(self, api_client: PyPIRegistry) -> tuple[HeuristicResult, dict]:

if difference >= threshold_delta:
return HeuristicResult.PASS, detail_info

return HeuristicResult.FAIL, detail_info
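
The `_load_defaults` change in this file swaps `int(section.get(...))` for `section.getint(...)`. The standalone sketch below shows how the two behave with the standard library's `configparser`. The section and key names are copied from the diff; the `defaults` object here is a local parser built for illustration and may not match how Macaron constructs its own defaults.

```python
import configparser

# Local stand-in for the project's defaults object (an assumption).
defaults = configparser.ConfigParser()
defaults.read_string(
    """
[heuristic.pypi]
timedelta_threshold_of_join_release = 5
"""
)
section = defaults["heuristic.pypi"]

# getint converts the stored string to an int in one step.
present = section.getint("timedelta_threshold_of_join_release")

# For a missing key, getint on a section proxy returns None by default,
# whereas int(section.get("no_such_key")) would raise TypeError on int(None).
missing = section.getint("no_such_key")

# An explicit fallback avoids the None case entirely.
with_fallback = section.getint("no_such_key", fallback=2)
```

Since a missing key yields `None` rather than an exception, callers that need an `int` may want to pass `fallback=` explicitly.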
Original file line number Diff line number Diff line change
@@ -17,15 +17,22 @@ def __init__(self) -> None:
def analyze(self, api_client: PyPIRegistry) -> tuple[HeuristicResult, dict]:
"""Check whether the PyPI package has no project link.
Parameters
----------
api_client: PyPIRegistry
The API client.
Returns
-------
tuple[HeuristicResult, dict]: Result and project links if they exist. Otherwise, return an empty dictionary
tuple[HeuristicResult, dict]
The result and project links if they exist. Otherwise, return an empty dictionary
"""
project_links: dict[str, str] | None = api_client.get_project_links()

if project_links is None:
return HeuristicResult.SKIP, {}

if len(project_links) == 0: # total
if len(project_links) == 0: # Total.
return HeuristicResult.FAIL, {}

return HeuristicResult.PASS, {"project_links": project_links}
Original file line number Diff line number Diff line change
@@ -32,15 +32,21 @@ def _load_defaults(self) -> int:
section_name = "heuristic.pypi"
if defaults.has_section(section_name):
section = defaults[section_name]
return int(section.get("releases_frequency_threshold"))
return section.getint("releases_frequency_threshold")
return 2

def analyze(self, api_client: PyPIRegistry) -> tuple[HeuristicResult, dict]:
"""Check whether the release frequency is high.
Parameters
----------
api_client: PyPIRegistry
The API client.
Returns
-------
tuple[HeuristicResult, Confidence | None]: Confidence and result.
tuple[HeuristicResult, dict]
The result and details.
"""
version_to_releases: dict | None = api_client.get_releases()
if version_to_releases is None or len(version_to_releases) == 1:
@@ -70,4 +76,5 @@ def analyze(self, api_client: PyPIRegistry) -> tuple[HeuristicResult, dict]:

if frequency <= self.average_gap_threshold:
return HeuristicResult.FAIL, {"frequency": frequency}

return HeuristicResult.PASS, {"frequency": frequency}
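
The middle of `analyze` is collapsed in this diff, so the exact frequency computation is not visible. The comparison `frequency <= self.average_gap_threshold` suggests that `frequency` is an average gap between releases. The sketch below works under that assumption: `average_release_gap_days` and `classify_release_frequency` are hypothetical helpers, the string results stand in for `HeuristicResult`, and treating fewer than two releases as `SKIP` is also an assumption.

```python
from __future__ import annotations

from datetime import datetime


def average_release_gap_days(upload_times: list[datetime]) -> float | None:
    """Return the mean gap in days between consecutive releases, if computable."""
    if len(upload_times) < 2:
        return None
    ordered = sorted(upload_times)
    gaps = [
        (later - earlier).total_seconds() / 86400
        for earlier, later in zip(ordered, ordered[1:])
    ]
    return sum(gaps) / len(gaps)


def classify_release_frequency(upload_times: list[datetime], threshold_days: int = 2) -> str:
    """Flag packages whose releases are, on average, at most threshold_days apart."""
    gap = average_release_gap_days(upload_times)
    if gap is None:
        return "SKIP"
    return "FAIL" if gap <= threshold_days else "PASS"
```

With the default threshold of 2 days, three releases on consecutive days give an average gap of one day and are flagged, while two releases a month apart are not.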
Original file line number Diff line number Diff line change
@@ -18,14 +18,21 @@ def __init__(self) -> None:
def analyze(self, api_client: PyPIRegistry) -> tuple[HeuristicResult, dict]:
"""Check the releases' total is one.
Parameters
----------
api_client: PyPIRegistry
The API client.
Returns
-------
tuple[HeuristicResult, dict]: Result and confidence.
tuple[HeuristicResult, dict]
The result and details.
"""
releases: dict | None = api_client.get_releases()
if releases is None:
return HeuristicResult.SKIP, {"releases": {}}

if len(releases) == 1:
return HeuristicResult.FAIL, {"releases": releases} # Higher false positive, so we keep it MEDIUM

return HeuristicResult.PASS, {"releases": releases}
Original file line number Diff line number Diff line change
@@ -27,31 +27,44 @@ def __init__(self) -> None:
def _get_digests(self, api_client: PyPIRegistry) -> list[str] | None:
"""Get all digests of the releases.
Parameters
----------
api_client: PyPIRegistry
The API client.
Returns
-------
list[str] | None
Digests.
The digests.
"""
releases: dict | None = api_client.get_releases()
if releases is None:
return None

digests: list[str] = []
for _, metadata in releases.items():
if metadata:
digest: str | None = json_extract(metadata[0], ["digests", self.hash_algo], str)
if digest is None:
continue
digests.append(digest)
for metadata in releases.values():
if not metadata:
continue

digest: str | None = json_extract(metadata[0], ["digests", self.hash_algo], str)
if digest is None:
continue
digests.append(digest)

return digests

def analyze(self, api_client: PyPIRegistry) -> tuple[HeuristicResult, dict]:
"""Check the content of releases keep updating.
Parameters
----------
api_client: PyPIRegistry
The API client.
Returns
-------
tuple[HeuristicResult, dict]: Result and relevant metadata.
tuple[HeuristicResult, dict]
The result and relevant metadata.
"""
digests: list[str] | None = self._get_digests(api_client)
if digests is None:
@@ -61,4 +74,5 @@ def analyze(self, api_client: PyPIRegistry) -> tuple[HeuristicResult, dict]:
highest_frequency = max(frequency.values())
if highest_frequency > 1: # At least two releases are identical.
return HeuristicResult.FAIL, {}

return HeuristicResult.PASS, {}
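
The final check in `analyze` fails the heuristic when any digest occurs more than once. The line that builds `frequency` is collapsed in this diff; the sketch below assumes it is a count per digest and uses `collections.Counter` for that. `has_repeated_digest` is a hypothetical helper name.

```python
from collections import Counter


def has_repeated_digest(digests: list[str]) -> bool:
    """Return True if at least two releases share the same content digest."""
    if not digests:
        # max() on an empty sequence would raise, so handle this case explicitly.
        return False
    frequency = Counter(digests)
    return max(frequency.values()) > 1
```

A repeated digest means two releases shipped byte-identical artifacts under different version numbers, which is the situation the `Unchanged Release` heuristic is meant to flag.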
Original file line number Diff line number Diff line change
@@ -31,9 +31,15 @@ def __init__(self) -> None:
def analyze(self, api_client: PyPIRegistry) -> tuple[HeuristicResult, dict]:
"""Analyze the package.
Parameters
----------
api_client: PyPIRegistry
The API client.
Returns
-------
tuple[HeuristicResult, dict]: Result type and relevant metadata.
tuple[HeuristicResult, dict]
The result type and relevant metadata.
"""
project_links: dict | None = api_client.get_project_links()

@@ -48,4 +54,5 @@ def analyze(self, api_client: PyPIRegistry) -> tuple[HeuristicResult, dict]:
except requests.exceptions.RequestException as error:
logger.debug(error)
continue

return HeuristicResult.FAIL, {}
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
# Copyright (c) 2022 - 2024, Oracle and/or its affiliates. All rights reserved.
# Copyright (c) 2024 - 2024, Oracle and/or its affiliates. All rights reserved.
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.