Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a crypto handler #26

Merged
merged 9 commits into from
Oct 12, 2022
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 15 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,14 +24,24 @@ pip install matrix-content-scanner
Copy and edit the [sample configuration file](https://github.com/matrix-org/matrix-content-scanner-python/blob/main/config.sample.yaml).
Each key is documented in this file.

Then run the content scanner (from within your virtual environment if one was created):
Before running the Matrix Content Scanner for the first time (if you are not [migrating
from the legacy Matrix Content Scanner](#migrating-from-the-legacy-matrix-content-scanner)),
run (from within your virtual environment if one was created):

```commandline
python -m matrix_content_scanner.mcs -c CONFIG_FILE
python -m matrix_content_scanner.mcs -c CONFIG_FILE --generate-secrets
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mhmm, I'm not entirely sure about this approach. My experience with synapse (which does similar things with key files) is that requirements like this make it a bit of a pain to run in kubernetes-like environments, and led to matrix-org/synapse#13615.

I realise this comment is in conflict with my earlier comment. I'm just not sure what is the best approach.

I suggest you stick with what you have, to avoid further vacillation. Just warning you that you might need to revisit this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'm also on the fence. I initially had it automatically generate secrets because that's what the javascript content scanner does, but there are good arguments in both directions tbh.

I suggest you stick with what you have, to avoid further vacillation. Just warning you that you might need to revisit this.

Actually, the issue you linked above seems to indicate this approach would cause issues with k8s-style deployments. I know there's an intent to deploy the content scanner on EMS (and actually most of the infrastructure is already in place for this afaik) so I'd rather revert to generating it automatically.


Where `CONFIG_FILE` is the path to your configuration file.

This will generate the cryptographic secrets required for the content scanner to run.

Then run the content scanner:

```commandline
python -m matrix_content_scanner.mcs -c CONFIG_FILE
```

## API

See [the API documentation](/docs/api.md) for information about how clients are expected
Expand All @@ -56,9 +66,9 @@ deployment instructions) is the configuration format:
* `acceptedMimeType` is renamed `scan.allowed_mimetypes`
* `requestHeader` is renamed `download.additional_headers` and turned into a dictionary.

Note that the format of the cryptographic pickle file and key are compatible between this
project and the legacy Matrix Content Scanner. If no file exist at that path one will be
created automatically.
Note that the format of the cryptographic pickle file file and key are compatible between
this project and the legacy Matrix Content Scanner. If no file exist at that path one will
be created automatically.
babolivier marked this conversation as resolved.
Show resolved Hide resolved

## Development

Expand Down
7 changes: 5 additions & 2 deletions config.sample.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -100,10 +100,13 @@ download:

# Configuration for decrypting Olm-encrypted request bodies.
crypto:
# The path to the Olm pickle file.
# The path to the Olm pickle file. This file contains the key pair to use when
# encrypting and decrypting encrypted POST request bodies.
# This file needs to be created with the --generate-secrets command line argument
# for the Matrix Content Scanner to run, see README.md for more information.
babolivier marked this conversation as resolved.
Show resolved Hide resolved
# Required.
pickle_path: "./pickle"

# The key to the pickle.
# The key to use to decode the pickle file.
# Required.
pickle_key: "this_is_a_secret"
125 changes: 125 additions & 0 deletions matrix_content_scanner/crypto.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,125 @@
# Copyright 2022 The Matrix.org Foundation C.I.C.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import logging
from typing import TYPE_CHECKING

from olm.pk import PkDecryption, PkDecryptionError, PkMessage

from matrix_content_scanner.config import MatrixContentScannerConfig
from matrix_content_scanner.utils.constants import ErrCode
from matrix_content_scanner.utils.errors import ConfigError, ContentScannerRestError
from matrix_content_scanner.utils.types import JsonDict

if TYPE_CHECKING:
from matrix_content_scanner.mcs import MatrixContentScanner


logger = logging.getLogger(__name__)


class CryptoHandler:
"""Handler for handling Olm-encrypted request bodies."""

def __init__(self, mcs: "MatrixContentScanner") -> None:
key = mcs.config.crypto.pickle_key
path = mcs.config.crypto.pickle_path

# Try reading the pickle file from disk.
try:
with open(path, "r") as fp:
pickle = fp.read()
except OSError as e:
raise ConfigError(
"Failed to open the pickle file configured at crypto.pickle_path (%s): %s"
% (path, e)
)

# Create a PkDecryption object with the content and key.
try:
self._decryptor: PkDecryption = PkDecryption.from_pickle(
pickle=pickle.encode("ascii"),
passphrase=key,
)
except PkDecryptionError as e:
# If we failed to extract the key pair from the pickle file, it's likely
# because the key is incorrect, or there's an issue with the file's content.
raise ConfigError(
"Configured value for crypto.pickle_key is incorrect or pickle file is"
" corrupted (Olm error code: %s)" % e
)

logger.info("Loaded Olm key pair from pickle file %s", path)

self.public_key = self._decryptor.public_key

@staticmethod
def generate_and_store_key_pair(config: MatrixContentScannerConfig) -> None:
"""Generates a new Olm key pair, and store it in the configured pickle file.

Args:
config: The content scanner config.

Raises:
ConfigError if we failed to write the file.
"""
path = config.crypto.pickle_path

logger.info(
"Generating a new Olm key pair and storing it in pickle file %s", path
)

# Generate a new key pair and turns it into a pickle.
decryptor = PkDecryption()
pickle_bytes = decryptor.pickle(passphrase=config.crypto.pickle_key)

# Try to write the pickle's content into a file.
try:
with open(path, "w+") as fp:
fp.write(pickle_bytes.decode("ascii"))
except OSError as e:
raise ConfigError(
"Failed to write the pickle file at the location configured for"
" crypto.pickle_path (%s): %s" % (path, e)
)

def decrypt_body(self, ciphertext: str, mac: str, ephemeral: str) -> JsonDict:
"""Decrypts an Olm-encrypted body.

Args:
ciphertext: The encrypted body's ciphertext.
mac: The encrypted body's MAC.
ephemeral: The encrypted body's ephemeral key.

Returns:
The decrypted body, parsed as JSON.
"""
try:
decrypted = self._decryptor.decrypt(
message=PkMessage(
ephemeral_key=ephemeral,
mac=mac,
ciphertext=ciphertext,
)
)
babolivier marked this conversation as resolved.
Show resolved Hide resolved
except PkDecryptionError as e:
logger.error("Failed to decrypt encrypted body: %s", e)
raise ContentScannerRestError(
http_status=400,
reason=ErrCode.FAILED_TO_DECRYPT,
info=str(e),
)

# We know that `decrypted` parses as a JsonDict.
return json.loads(decrypted) # type: ignore[no-any-return]
36 changes: 34 additions & 2 deletions matrix_content_scanner/mcs.py
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@

from matrix_content_scanner import logutils
from matrix_content_scanner.config import MatrixContentScannerConfig
from matrix_content_scanner.crypto import CryptoHandler
from matrix_content_scanner.httpserver import HTTPServer
from matrix_content_scanner.scanner.file_downloader import FileDownloader
from matrix_content_scanner.scanner.scanner import Scanner
Expand Down Expand Up @@ -58,9 +59,12 @@ def file_downloader(self) -> FileDownloader:
def scanner(self) -> Scanner:
return Scanner(self)

@cached_property
def crypto_handler(self) -> CryptoHandler:
return CryptoHandler(self)

def start(self) -> None:
"""Start the HTTP server and start the reactor."""
setup_logging()
http_server = HTTPServer(self)
http_server.start()
self.reactor.run()
Expand Down Expand Up @@ -88,6 +92,8 @@ def setup_logging() -> None:


if __name__ == "__main__":
setup_logging()

parser = argparse.ArgumentParser(
description="A web service for scanning media hosted by a Matrix media repository."
)
Expand All @@ -97,6 +103,11 @@ def setup_logging() -> None:
required=True,
help="The YAML configuration file.",
)
parser.add_argument(
"--generate-secrets",
action="store_true",
help="Generate secrets such as cryptographic key pairs needed for the content scanner to run.",
)

args = parser.parse_args()

Expand All @@ -109,6 +120,27 @@ def setup_logging() -> None:
logger.error("Failed to read configuration file: %s", e)
sys.exit(1)

# Start the content scanner.
# If required by the command-line arguments, generate and store the secrets needed for
# the program to run.
if args.generate_secrets:
try:
CryptoHandler.generate_and_store_key_pair(cfg)
except ConfigError as e:
logger.error("Failed to generate secrets: %s", e)
sys.exit(1)

sys.exit(0)

# Create the content scanner.
mcs = MatrixContentScanner(cfg)

# Construct the crypto handler early on, so we can make sure we can load the Olm key
# pair from the pickle file.
try:
_ = mcs.crypto_handler
except ConfigError as e:
logger.error(e)
babolivier marked this conversation as resolved.
Show resolved Hide resolved
sys.exit(1)

# Start the content scanner.
mcs.start()
43 changes: 43 additions & 0 deletions tests/test_crypto.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
# Copyright 2022 The Matrix.org Foundation C.I.C.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import unittest

from olm.pk import PkEncryption

from tests.testutils import get_content_scanner


class CryptoHandlerTestCase(unittest.TestCase):
def setUp(self) -> None:
self.crypto_handler = get_content_scanner().crypto_handler

def test_decrypt(self) -> None:
"""Tests that an Olm-encrypted payload is successfully decrypted."""
payload = {"foo": "bar"}

# Encrypt the payload with PkEncryption.
pke = PkEncryption(self.crypto_handler.public_key)
encrypted = pke.encrypt(json.dumps(payload))

# Decrypt the payload with the crypto handler.
decrypted = self.crypto_handler.decrypt_body(
encrypted.ciphertext,
encrypted.mac,
encrypted.ephemeral_key,
)

# Check that the decrypted payload is the same as the original one before
# encryption.
self.assertEqual(decrypted, payload)
8 changes: 7 additions & 1 deletion tests/testutils.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@
from twisted.web.http_headers import Headers

from matrix_content_scanner.config import MatrixContentScannerConfig
from matrix_content_scanner.crypto import CryptoHandler
from matrix_content_scanner.mcs import MatrixContentScanner
from matrix_content_scanner.utils.types import JsonDict

Expand Down Expand Up @@ -123,4 +124,9 @@ def get_content_scanner(config: Optional[JsonDict] = None) -> MatrixContentScann
# all required settings in that section.
default_config.update(config)

return MatrixContentScanner(MatrixContentScannerConfig(default_config))
parsed_config = MatrixContentScannerConfig(default_config)

# Generate the Olm key pair and store them in a pickle file.
CryptoHandler.generate_and_store_key_pair(parsed_config)

return MatrixContentScanner(parsed_config)
8 changes: 8 additions & 0 deletions tox.ini
Original file line number Diff line number Diff line change
Expand Up @@ -33,5 +33,13 @@ commands =

extras = dev

# The current version of python-olm that's on PyPI does not include a types marker.
# Hopefully that's something we can fix at some point, but in the mean time let's not
# block things on this and instead use the wheels on gitlab.matrix.org's repository (which
# do have a type marker). We use --index-url (and not --extra-index-url) so that pip does
# not try to download the python-olm that's on pypi.org. This is fine because GitLab will
# redirect requests for packages it doesn't know about to pypi.org.
install_command = python -m pip install --index-url=https://gitlab.matrix.org/api/v4/projects/27/packages/pypi/simple {opts} {packages}
Comment on lines +36 to +42
Copy link
Contributor Author

@babolivier babolivier Oct 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means we don't run mypy with the same version of Olm as the one we actually use for running the content scanner and its tests (pypi -> 3.1.3; gitlab -> 3.2.13), which seems to be fine but isn't ideal.


commands =
mypy matrix_content_scanner tests