Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(cdc): Prepare the self hosted environment for the Change Data Capture pipeline #938

Merged
merged 16 commits into from
May 25, 2021

Conversation

fpacifici
Copy link
Contributor

This is an RFC on how to set up Change Data Capture.
We will use Change Data Capture to stream WAL updates from postgres into clickhouse so that features like issue search will be able to join event data and metadata (from postgres) through Snuba.

This requires the followings:

This PR is preparing postgres to stream updates via the replication log.
The idea is to

  • download the the replication log plugin binary during install.sh
  • mount a volume with the binary when starting postgres
  • providing a new entrypoint to postgres that ensures everything is correctly configured.

There is a difference between how this is set up and how we do the same in the development environment.
In the development environment we download the library from the entrypoint itself and store it in a persistent volume, so we do not have to download it every time.
Unfortunately this does not work here as the postgres image is postgres:9.6 while it is postgres:9.6-alpine. This one does not come with either wget or curl. I don't think installing that in the entrypoint would be a good idea, so the download happens in install.sh. I actually think this way is safer so we never depend on connectivity for postgres to start properly.

@fpacifici fpacifici requested a review from BYK April 29, 2021 05:45
@fpacifici fpacifici requested a review from a team April 30, 2021 16:09
Copy link
Member

@BYK BYK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Left a few comments to verify certain things. Otherwise I think we can merge this.

- type: bind
read_only: true
source: ./postgres/init_hba.sh
target: /docker-entrypoint-initdb.d/init_hba.sh
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this script in place, do we still need line 83 above?

POSTGRES_HOST_AUTH_METHOD: "trust"

Also, since we now control the entrypoint, does it have to live under this specific directory? Because if not, then why not just mount ./postgres under /opt/sentry and use everything from there? This would reduce the number of bind mounts to 1 and would avoid file-mounts which we had issues with sometimes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need POSTGRES_HOST_AUTH_METHOD: "trust". That is not about replication while the changes to pg_hba are only about the replication connections.

POSTGRES_HOST_AUTH_METHOD changes the standard connections we use to run queries.


if [[ $WAL2JSON_VERSION == "latest" ]]; then
VERSION=$(
wget "https://api.github.com/repos/getsentry/wal2json/releases/latest" -O - |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We haven't relied on the availability of wget so far (with the exception of our experimental --minimize-downtime flag). We haven't received any complaints about that but just wanted to double check on whether this is a safe assumption to make or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What bout curl ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same. The only assumptions we make are bash, docker, docker-compose, sed, awk, and grep. I think we can find/use a curl or wget docker image to be safe :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out I don't didn't have wget. 😊

▶ Downloading and installing wal2json ...
install-wal2json.sh: line 13: wget: command not found
An error occurred, caught SIGERR on line 20
Cleaning up...
$

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like macOS does ship with curl. If we can't find a Docker image easily I think curl would be a safe choice. Do you know of any Linux that doesn't have it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now relying on curlimages/curl

@fpacifici fpacifici requested a review from BYK April 30, 2021 22:52

if [[ $WAL2JSON_VERSION == "latest" ]]; then
VERSION=$(
wget "https://api.github.com/repos/getsentry/wal2json/releases/latest" -O - |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same. The only assumptions we make are bash, docker, docker-compose, sed, awk, and grep. I think we can find/use a curl or wget docker image to be safe :)


if [[ $WAL2JSON_VERSION == "latest" ]]; then
VERSION=$(
wget "https://api.github.com/repos/getsentry/wal2json/releases/latest" -O - |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out I don't didn't have wget. 😊

▶ Downloading and installing wal2json ...
install-wal2json.sh: line 13: wget: command not found
An error occurred, caught SIGERR on line 20
Cleaning up...
$


if [[ $WAL2JSON_VERSION == "latest" ]]; then
VERSION=$(
wget "https://api.github.com/repos/getsentry/wal2json/releases/latest" -O - |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like macOS does ship with curl. If we can't find a Docker image easily I think curl would be a safe choice. Do you know of any Linux that doesn't have it?

@BYK BYK changed the title feat(cdc) Prepare the self hosted environment for the Change Data Capture pipeline feat(cdc): Prepare the self hosted environment for the Change Data Capture pipeline May 5, 2021
@fpacifici
Copy link
Contributor Author

fpacifici commented May 6, 2021

Ok, we have a problem.
I just noticed that that wal2json.so depends on libc, but postgres alpine (on the dev environment and where we build the binary) has libc.musl instead.

On postgres-alpine

bash-5.1# ldd /wal2json/latest.so 
        /lib/ld-musl-x86_64.so.1 (0x7effc8ba9000)
        libc.musl-x86_64.so.1 => /lib/ld-musl-x86_64.so.1 (0x7effc8ba9000)

On postgres

root@1bca1c71f616:/usr/lib/postgresql/9.6/lib# ldd wal2json.so 
        linux-vdso.so.1 (0x00007ffd30348000)
        libc.musl-x86_64.so.1 => not found

So questions @BYK :

  1. What is the reason we use postgres-alpine in dev and postgres in self hosted ?
  2. What if, on self hosted, we cloned the sentry wal2json repo and built the library from source during the install process instead of using the binary?

I doubt linking libc statically would be a good idea, I don't even think it is possible as the build process is orchestrated by postgres.

@BYK
Copy link
Member

BYK commented May 6, 2021

What is the reason we use postgres-alpine in dev and postgres in self hosted ?

Probably someone chose "the lighter" version on dev and we stuck with it. For self-hosted, we tend to use the most common distributions, which is not alpine :)

What if, on self hosted, we cloned the sentry wal2json repo and built the library from source during the install process instead of using the binary?

That sounds like a step backwards and would introduce a lot of complexity that should live inside the wal2json repo. I've noticed that we use the postgres-alpine base for the builder Docker image. I'm pretty sure if we switch that up to postgres we wouldn't have issues.

This means now wal2json produces two binaries based on build args again:

  1. For libc.musl w/ Alpine
  2. For regular libc

Thoughts?

@fpacifici
Copy link
Contributor Author

Yes using postgres standard is supposed to work. I did not test it, but there should be no reason for that.

Sure I can deliver two binaries in the release, I was not convinced distributing multiple binaries per required implementation of libc would be a common practice. But that may be just because I never noticed it.

@fpacifici
Copy link
Contributor Author

fpacifici commented May 6, 2021

Building two binaries is possible, and it works correctly when importing the right one here.
getsentry/wal2json#12

@fpacifici fpacifici merged commit 8dc8460 into master May 25, 2021
@fpacifici fpacifici deleted the feat/cdc_onprem branch May 25, 2021 00:51
@github-actions github-actions bot locked and limited conversation to collaborators Jun 9, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants