Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DPE-4656] add TLS CA rotation routine #353

Merged

Conversation

reneradoi
Copy link
Contributor

@reneradoi reneradoi commented Jul 8, 2024

Issue

When a new TLS certificate authority (CA) certificate is issued, the opensearch-operator should add this new CA to all its units and request new certificates. The new certificates (including the CA certificate) should be distributed to all OpenSearch nodes in a rolling restart manner, without downtime to the entire cluster.

Due to limitations on the self-signed-certificates operator it is not possible to:

  • get a notice if a CA certificate is about to expire
  • request a new CA when the current one is about to or has expired
  • request an intermediate CA and sign future certificates with it

There is currently no support for renewing a root / CA certificate on the self-signed-certificates operator. A new root / CA certificate will only be generated and issued if the common_name of the CA changes.

We have decided to implement the logic in that way that we check each certificate if it includes a new CA. If so, we store the new CA and initiate the CA rotation workflow on OpenSearch.

Solution

This PR implements the following workflow:

  • check each CertificateAvailableEvent if it includes a new CA
  • add the new CA to the truststore
  • add a notice tls_ca_renewing to the unit's peer data
  • initiate a restart of OpenSearch (using the locking mechanism to coordinate cluster availability during the restart)
  • after restarting, add a notice tls_ca_renewed to the unit's peer data
  • when the restart is done on all of the cluster nodes, request new TLS certificates and apply them to the node

During the phase of renewing the CA, all incoming CertificateAvailableEvents will be deferred in order to avoid incompatibilites in communication between the nodes.

Please also see the flow of events and actions that has been documented here: https://github.com/canonical/opensearch-operator/wiki/TLS-CA-rotation-flow

Notes

reneradoi added 30 commits July 8, 2024 14:55
…he new CA cert will be stored in a separate secret `Intermediate CA`, and it will be added to the truststore.
…ts on expiry as this workflow is currently not supported by self-signed-certificates-operator
…th the old and the new ca during ca rotation
…-implement-CA-replacement-routine

# Conflicts:
#	lib/charms/opensearch/v0/opensearch_base_charm.py
…with TLS v1.2 to mitigate issues with `SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment`
@juditnovak
Copy link
Contributor

juditnovak commented Sep 3, 2024

Unittests are added :-)

I'm waiting for the ongoing issue resolved (the one that @skourta have identified on certificate-available event being forever defer()-red). Since that's the very same part of the code where the double app-admin certificate fetch is happening (see the issue highlighted in the unittests), it may be a good idea to fix that part of the workflow as well.

@skourta
Copy link
Contributor

skourta commented Sep 3, 2024

I met with @reneradoi and explained the issue and its source to him. He's on it.

@reneradoi reneradoi changed the base branch from main to feature/TLS-CA-rotation September 3, 2024 09:57
@skourta skourta linked an issue Sep 3, 2024 that may be closed by this pull request
Copy link
Contributor

@juditnovak juditnovak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderful work, congrats @reneradoi :-)

I'm checking locally if #417 may still hold (I suspect it should not as the code went through related changes since the issue was raised).

Regardless, this is a big chunk of work, strongly tested in numerous ways (incl. a lot of manual testging, against 2 TLS charms).

High time to merge and if any bugs are to be fixed those can be addressed in smaller (clearer :-) ) tasks.

lib/charms/opensearch/v0/opensearch_tls.py Outdated Show resolved Hide resolved
@reneradoi reneradoi requested a review from juditnovak September 5, 2024 13:16
@reneradoi reneradoi merged commit 38de8e1 into feature/TLS-CA-rotation Sep 6, 2024
31 of 34 checks passed
@reneradoi reneradoi deleted the DPE-4656-implement-CA-replacement-routine branch September 6, 2024 06:17
@juditnovak
Copy link
Contributor

Important to mention withing the highlights of earlier discussions.
With the new unittests, instead of the original 55% coverage we've ended up at a 62% unittes coverage :-) :-) :-)
image

skourta pushed a commit that referenced this pull request Sep 18, 2024
When a new TLS certificate authority (CA) certificate is issued, the
opensearch-operator should add this new CA to all its units and request
new certificates. The new certificates (including the CA certificate)
should be distributed to all OpenSearch nodes in a rolling restart
manner, without downtime to the entire cluster.

Due to limitations on the self-signed-certificates operator it is not
possible to:
- get a notice if a CA certificate is about to expire
- request a new CA when the current one is about to or has expired
- request an intermediate CA and sign future certificates with it

There is currently no support for renewing a root / CA certificate on
the self-signed-certificates operator. A new root / CA certificate will
only be generated and issued if the common_name of the CA changes.

We have decided to implement the logic in that way that we check each
certificate if it includes a new CA. If so, we store the new CA and
initiate the CA rotation workflow on OpenSearch.

This PR implements the following workflow:
- check each `CertificateAvailableEvent` if it includes a new CA
- add the new CA to the truststore
- add a notice `tls_ca_renewing` to the unit's peer data
- initiate a restart of OpenSearch (using the locking mechanism to
coordinate cluster availability during the restart)
- after restarting, add a notice `tls_ca_renewed` to the unit's peer
data
- when the restart is done on all of the cluster nodes, request new TLS
certificates and apply them to the node

During the phase of renewing the CA, all incoming
`CertificateAvailableEvents` will be deferred in order to avoid
incompatibilites in communication between the nodes.

Please also see the flow of events and actions that has been documented
here:
https://github.com/canonical/opensearch-operator/wiki/TLS-CA-rotation-flow

- There is a dependency to
#367 because during
the rolling restart when the CA is rotated it is very likely that the
voting exclusion issue shows up (at least in 3-node-clusters). Therefore
the integration test is currently running only with two nodes. Once the
voting exclusions issue is resolved, this can be updated to the usual
three nodes.
- Due to an upstream bug with JDK it is necessary to use TLS v1.2 (more
details see opensearch-project/security#3299).
- This PR introduces a method to append configuration to the jvm options
file of OpenSearch (used to set TLS config to v1.2).

---------

Co-authored-by: Mehdi Bendriss <bendrissmehdi@gmail.com>
Co-authored-by: Judit Novak <judit.novak@canonical.com>
skourta pushed a commit that referenced this pull request Sep 18, 2024
When a new TLS certificate authority (CA) certificate is issued, the
opensearch-operator should add this new CA to all its units and request
new certificates. The new certificates (including the CA certificate)
should be distributed to all OpenSearch nodes in a rolling restart
manner, without downtime to the entire cluster.

Due to limitations on the self-signed-certificates operator it is not
possible to:
- get a notice if a CA certificate is about to expire
- request a new CA when the current one is about to or has expired
- request an intermediate CA and sign future certificates with it

There is currently no support for renewing a root / CA certificate on
the self-signed-certificates operator. A new root / CA certificate will
only be generated and issued if the common_name of the CA changes.

We have decided to implement the logic in that way that we check each
certificate if it includes a new CA. If so, we store the new CA and
initiate the CA rotation workflow on OpenSearch.

This PR implements the following workflow:
- check each `CertificateAvailableEvent` if it includes a new CA
- add the new CA to the truststore
- add a notice `tls_ca_renewing` to the unit's peer data
- initiate a restart of OpenSearch (using the locking mechanism to
coordinate cluster availability during the restart)
- after restarting, add a notice `tls_ca_renewed` to the unit's peer
data
- when the restart is done on all of the cluster nodes, request new TLS
certificates and apply them to the node

During the phase of renewing the CA, all incoming
`CertificateAvailableEvents` will be deferred in order to avoid
incompatibilites in communication between the nodes.

Please also see the flow of events and actions that has been documented
here:
https://github.com/canonical/opensearch-operator/wiki/TLS-CA-rotation-flow

- There is a dependency to
#367 because during
the rolling restart when the CA is rotated it is very likely that the
voting exclusion issue shows up (at least in 3-node-clusters). Therefore
the integration test is currently running only with two nodes. Once the
voting exclusions issue is resolved, this can be updated to the usual
three nodes.
- Due to an upstream bug with JDK it is necessary to use TLS v1.2 (more
details see opensearch-project/security#3299).
- This PR introduces a method to append configuration to the jvm options
file of OpenSearch (used to set TLS config to v1.2).

---------

Co-authored-by: Mehdi Bendriss <bendrissmehdi@gmail.com>
Co-authored-by: Judit Novak <judit.novak@canonical.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 node opensearch cluster fails to configure TLS
4 participants