Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a crypto handler #26

Merged
merged 9 commits into from
Oct 12, 2022
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 15 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,14 +24,24 @@ pip install matrix-content-scanner
Copy and edit the [sample configuration file](https://github.com/matrix-org/matrix-content-scanner-python/blob/main/config.sample.yaml).
Each key is documented in this file.

Then run the content scanner (from within your virtual environment if one was created):
Before running the Matrix Content Scanner for the first time (if you are not [migrating
from the legacy Matrix Content Scanner](#migrating-from-the-legacy-matrix-content-scanner)),
run (from within your virtual environment if one was created):

```commandline
python -m matrix_content_scanner.mcs -c CONFIG_FILE
python -m matrix_content_scanner.mcs -c CONFIG_FILE --generate-secrets
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mhmm, I'm not entirely sure about this approach. My experience with synapse (which does similar things with key files) is that requirements like this make it a bit of a pain to run in kubernetes-like environments, and led to matrix-org/synapse#13615.

I realise this comment is in conflict with my earlier comment. I'm just not sure what is the best approach.

I suggest you stick with what you have, to avoid further vacillation. Just warning you that you might need to revisit this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'm also on the fence. I initially had it automatically generate secrets because that's what the javascript content scanner does, but there are good arguments in both directions tbh.

I suggest you stick with what you have, to avoid further vacillation. Just warning you that you might need to revisit this.

Actually, the issue you linked above seems to indicate this approach would cause issues with k8s-style deployments. I know there's an intent to deploy the content scanner on EMS (and actually most of the infrastructure is already in place for this afaik) so I'd rather revert to generating it automatically.


Where `CONFIG_FILE` is the path to your configuration file.

This will generate the cryptographic secrets required for the content scanner to run.

Then run the content scanner:

```commandline
python -m matrix_content_scanner.mcs -c CONFIG_FILE
```

## API

See [the API documentation](/docs/api.md) for information about how clients are expected
Expand All @@ -56,9 +66,9 @@ deployment instructions) is the configuration format:
* `acceptedMimeType` is renamed `scan.allowed_mimetypes`
* `requestHeader` is renamed `download.additional_headers` and turned into a dictionary.

Note that the format of the cryptographic pickle file and key are compatible between this
project and the legacy Matrix Content Scanner. If no file exist at that path one will be
created automatically.
Note that the format of the cryptographic pickle file file and key are compatible between
this project and the legacy Matrix Content Scanner. If no file exist at that path one will
be created automatically.
babolivier marked this conversation as resolved.
Show resolved Hide resolved

## Development

Expand Down
7 changes: 5 additions & 2 deletions config.sample.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -100,10 +100,13 @@ download:

# Configuration for decrypting Olm-encrypted request bodies.
crypto:
# The path to the Olm pickle file.
# The path to the Olm pickle file. This file contains the key pair to use when
# encrypting and decrypting encrypted POST request bodies.
# This file needs to be created with the --generate-secrets command line argument
# for the Matrix Content Scanner to run, see README.md for more information.
babolivier marked this conversation as resolved.
Show resolved Hide resolved
# Required.
pickle_path: "./pickle"

# The key to the pickle.
# The key to use to decode the pickle file.
# Required.
pickle_key: "this_is_a_secret"
28 changes: 0 additions & 28 deletions matrix_content_scanner/config.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,8 +11,6 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from pathlib import Path
from typing import Any, Dict, List, Optional, Union

import attr
Expand Down Expand Up @@ -62,29 +60,6 @@ def _parse_size(size: Optional[Union[str, float]]) -> Optional[float]:
raise ConfigError(e)


def _check_file_path(raw_path: str) -> None:
"""Check if the file at the given path exists and is readable. If it's not, check if
the parent directory exists and is writeable.

Args:
raw_path: The path to check.

Raises:
ConfigError if the file does not exist and the parent directory cannot be written
into or does not exist.
"""
path = Path(raw_path).resolve()

if os.access(path, os.R_OK):
return

if not os.access(path.parent, os.W_OK):
raise ConfigError(
"File %s does not exist and parent directory is not writeable or does not exist"
% raw_path
)


# Schema to validate the raw configuration dictionary against.
_config_schema = {
"type": "object",
Expand Down Expand Up @@ -206,6 +181,3 @@ def __init__(self, config_dict: Dict[str, Any]):
self.crypto = CryptoConfig(**(config_dict.get("crypto") or {}))
self.download = DownloadConfig(**(config_dict.get("download") or {}))
self.result_cache = ResultCacheConfig(**(config_dict.get("result_cache") or {}))

# Check that we can read the pickle file, or create it if it doesn't exist.
_check_file_path(self.crypto.pickle_path)
60 changes: 47 additions & 13 deletions matrix_content_scanner/crypto.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,8 +17,9 @@

from olm.pk import PkDecryption, PkDecryptionError, PkMessage

from matrix_content_scanner.config import MatrixContentScannerConfig
from matrix_content_scanner.utils.constants import ErrCode
from matrix_content_scanner.utils.errors import ContentScannerRestError
from matrix_content_scanner.utils.errors import ConfigError, ContentScannerRestError
from matrix_content_scanner.utils.types import JsonDict

if TYPE_CHECKING:
Expand All @@ -34,31 +35,64 @@ class CryptoHandler:
def __init__(self, mcs: "MatrixContentScanner") -> None:
key = mcs.config.crypto.pickle_key
path = mcs.config.crypto.pickle_path

# Try reading the pickle file from disk.
try:
# Try reading the pickle from disk.
with open(path, "r") as fp:
pickle = fp.read()
except OSError as e:
raise ConfigError(
"Failed to open the pickle file configured at crypto.pickle_path (%s): %s"
% (path, e)
)

# Create a PkDecryption object with the content and key.
# Create a PkDecryption object with the content and key.
try:
self._decryptor: PkDecryption = PkDecryption.from_pickle(
pickle=pickle.encode("ascii"),
passphrase=key,
)

logger.info("Loaded Olm pickle from %s", path)
except FileNotFoundError:
# If the pickle file doesn't exist, try creating it.
logger.info(
"Olm pickle not found, generating one and saving it at %s", path
except PkDecryptionError as e:
# If we failed to extract the key pair from the pickle file, it's likely
# because the key is incorrect, or there's an issue with the file's content.
raise ConfigError(
"Configured value for crypto.pickle_key is incorrect or pickle file is"
" corrupted (Olm error code: %s)" % e
)

self._decryptor = PkDecryption()
pickle_bytes = self._decryptor.pickle(passphrase=key)
logger.info("Loaded Olm key pair from pickle file %s", path)

self.public_key = self._decryptor.public_key

@staticmethod
def generate_and_store_key_pair(config: MatrixContentScannerConfig) -> None:
"""Generates a new Olm key pair, and store it in the configured pickle file.

Args:
config: The content scanner config.

Raises:
ConfigError if we failed to write the file.
"""
path = config.crypto.pickle_path

logger.info(
"Generating a new Olm key pair and storing it in pickle file %s", path
)

# Generate a new key pair and turns it into a pickle.
decryptor = PkDecryption()
pickle_bytes = decryptor.pickle(passphrase=config.crypto.pickle_key)

# Try to write the pickle's content into a file.
try:
with open(path, "w+") as fp:
fp.write(pickle_bytes.decode("ascii"))

self.public_key = self._decryptor.public_key
except OSError as e:
raise ConfigError(
"Failed to write the pickle file at the location configured for"
" crypto.pickle_path (%s): %s" % (path, e)
)

def decrypt_body(self, ciphertext: str, mac: str, ephemeral: str) -> JsonDict:
"""Decrypts an Olm-encrypted body.
Expand Down
31 changes: 29 additions & 2 deletions matrix_content_scanner/mcs.py
Original file line number Diff line number Diff line change
Expand Up @@ -65,7 +65,6 @@ def crypto_handler(self) -> CryptoHandler:

def start(self) -> None:
"""Start the HTTP server and start the reactor."""
setup_logging()
http_server = HTTPServer(self)
http_server.start()
self.reactor.run()
Expand Down Expand Up @@ -93,6 +92,8 @@ def setup_logging() -> None:


if __name__ == "__main__":
setup_logging()

parser = argparse.ArgumentParser(
description="A web service for scanning media hosted by a Matrix media repository."
)
Expand All @@ -102,6 +103,11 @@ def setup_logging() -> None:
required=True,
help="The YAML configuration file.",
)
parser.add_argument(
"--generate-secrets",
action="store_true",
help="Generate secrets such as cryptographic key pairs needed for the content scanner to run.",
)

args = parser.parse_args()

Expand All @@ -114,6 +120,27 @@ def setup_logging() -> None:
logger.error("Failed to read configuration file: %s", e)
sys.exit(1)

# Start the content scanner.
# If required by the command-line arguments, generate and store the secrets needed for
# the program to run.
if args.generate_secrets:
try:
CryptoHandler.generate_and_store_key_pair(cfg)
except ConfigError as e:
logger.error("Failed to generate secrets: %s", e)
sys.exit(1)

sys.exit(0)

# Create the content scanner.
mcs = MatrixContentScanner(cfg)

# Construct the crypto handler early on, so we can make sure we can load the Olm key
# pair from the pickle file.
try:
_ = mcs.crypto_handler
except ConfigError as e:
logger.error(e)
babolivier marked this conversation as resolved.
Show resolved Hide resolved
sys.exit(1)

# Start the content scanner.
mcs.start()
8 changes: 7 additions & 1 deletion tests/testutils.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@
from twisted.web.http_headers import Headers

from matrix_content_scanner.config import MatrixContentScannerConfig
from matrix_content_scanner.crypto import CryptoHandler
from matrix_content_scanner.mcs import MatrixContentScanner
from matrix_content_scanner.utils.types import JsonDict

Expand Down Expand Up @@ -123,4 +124,9 @@ def get_content_scanner(config: Optional[JsonDict] = None) -> MatrixContentScann
# all required settings in that section.
default_config.update(config)

return MatrixContentScanner(MatrixContentScannerConfig(default_config))
parsed_config = MatrixContentScannerConfig(default_config)

# Generate the Olm key pair and store them in a pickle file.
CryptoHandler.generate_and_store_key_pair(parsed_config)

return MatrixContentScanner(parsed_config)
12 changes: 6 additions & 6 deletions tox.ini
Original file line number Diff line number Diff line change
Expand Up @@ -34,12 +34,12 @@ commands =
extras = dev

# The current version of python-olm that's on PyPI does not include a types marker.
# Ideally we'd just specify a custom pip install command that uses an --index-url that
# points to Olm's PyPI repo, but it looks like the packaging does not include py.typed in
# wheels. https://gitlab.matrix.org/matrix-org/olm/-/merge_requests/62 is an attempt at
# fixing this.
deps =
git+https://gitlab.matrix.org/babolivier/olm.git@babolivier/py.typed_manifest#egg=python-olm&subdirectory=python
# Hopefully that's something we can fix at some point, but in the mean time let's not
richvdh marked this conversation as resolved.
Show resolved Hide resolved
# block things on this and instead use the wheels on gitlab.matrix.org's repository (which
# do have a type marker). We use --index-url (and not --extra-index-url) so that pip does
# not try to download the python-olm that's on pypi.org. This is fine because GitLab will
# redirect requests for packages it doesn't know about to pypi.org.
install_command = python -m pip install --index-url=https://gitlab.matrix.org/api/v4/projects/27/packages/pypi/simple {opts} {packages}

commands =
mypy matrix_content_scanner tests