Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: Falling back to primary PG server not reliable on Rails 7.1 #71

Merged
merged 1 commit into from
Jul 3, 2024

Conversation

tgxworld
Copy link
Contributor

@tgxworld tgxworld commented Jul 2, 2024

This commit adds the use of ActiveRecord::ConnectionAdatpers::PostgreSQLAdatper#verify!
when a new connection is retrieved from the connection pool. This is because ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#active?
was changed in Rails 7.1 to return false if the connection has not be used to do
something meaningful. Ref:
rails/rails@8551e64.
Due to this change, our fallback to primary checker will keep thinking
that the primary PG server is down. Note that before Rails 7.1, ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#active?
will always execute a fake query to check if the query can be executed. By calling ActiveRecord::ConnectionAdatpers::PostgreSQLAdatper#verify!, before ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#active?, we ensure that the connection to the Postgres
server has been made first.

This commit also updates the ActiveRecord tests to be more reliable by
reducing the Unicorn worker processes to 1 so that we don't have to rely
on flooding the app with requests to get all the Unicorn processes to
failover/fallback.

Co-authored-by: Loïc Guitaut loic@discourse.org

@Flink
Copy link
Contributor

Flink commented Jul 2, 2024

I guess we should probably apply what I did in https://github.com/discourse/rails_failover/pull/70/files#diff-9d30ba9a0a8fdc4fb424d5ee1f338be98c9e183c7a8407340f3935c7adecdf33 too, since we could have the same problem then. (We won’t need a new PR for Rails 7.2 support in the end 😅)

@tgxworld tgxworld force-pushed the fix_connectioN_active_check branch 4 times, most recently from 8450f1e to 97f9d2b Compare July 3, 2024 02:17
@tgxworld tgxworld force-pushed the fix_connectioN_active_check branch 23 times, most recently from a5cfb29 to 19878a8 Compare July 3, 2024 06:24
@tgxworld tgxworld force-pushed the fix_connectioN_active_check branch from 19878a8 to e25e98f Compare July 3, 2024 06:28
@@ -107,7 +107,6 @@ To install this gem onto your local machine, run `bundle exec rake install`. To
The ActiveRecord failover tests are run against a dummy Rails server. Run the following commands to run the test:

1. `make setup_pg`
2. `make start_pg`
Copy link
Contributor Author

@tgxworld tgxworld Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now managed by the RSpec test itself so that we can ensure that each test starts with the Postgres servers being in the same state.

Comment on lines +56 to +65
start_pg_primary # Start the fallback process

flood_get("/posts", times: 100) do |res|
expect(res.code.to_i).to eq(200)
expect(res.body).to include("reading")
expect(res.body).to include(EXPECTED_POSTS_COUNT)
end
ensure
system("make restart_pg_primary")
sleep 0.05 # Wait for fallback to complete

response = get("/posts")

expect(response.body).to eq(<<~BODY.chomp)
Posts count: #{EXPECTED_POSTS_COUNT}
role: writing
BODY
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was missing which is why our integration test suite did not catch the problem. We were not testing the fallback process.

@@ -1,2 +0,0 @@
<%= @role %>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is not required since we render plain: now.

@@ -1,4 +1,4 @@
worker_processes (ENV["UNICORN_WORKERS"] || 5).to_i
worker_processes (ENV["UNICORN_WORKERS"] || 1).to_i
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reduced number of unicorn workers to 1 to make the tests easier to reason about. It does reduce our coverage for multiple processes but I think this is a fine trade-off for now.

Comment on lines -11 to -21
setup_dummy_rails_server:
@cd spec/support/dummy_app && BUNDLE_GEMFILE=Gemfile bundle install --quiet && BUNDLE_GEMFILE=Gemfile RAILS_ENV=production $(BUNDLER_BIN) exec rails db:create db:migrate db:seed

start_dummy_rails_server:
@cd spec/support/dummy_app && BUNDLE_GEMFILE=Gemfile SECRET_KEY_BASE=somekey bundle exec unicorn -c config/unicorn.conf.rb -D -E production

stop_dummy_rails_server:
@kill -TERM $(shell cat spec/support/dummy_app/tmp/pids/unicorn.pid)

teardown_dummy_rails_server:
@cd spec/support/dummy_app && (! (bundle check > /dev/null 2>&1) || BUNDLE_GEMFILE=Gemfile DISABLE_DATABASE_ENVIRONMENT_CHECK=1 RAILS_ENV=production $(BUNDLER_BIN) exec rails db:drop)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved these to be managed by the RSpec tests so that things are easier to reason about.

Comment on lines -44 to -46
restart_pg_primary:
@$(PG_BIN_DIR)/pg_ctl --silent --log /dev/null -w -D $(PG_PRIMARY_DATA_DIR) -o "-p $(PG_PRIMARY_PORT)" -o "-k $(PG_PRIMARY_RUN_DIR)" restart

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This command is no longer needed.

@tgxworld tgxworld force-pushed the fix_connectioN_active_check branch from e25e98f to ae760aa Compare July 3, 2024 06:33
@tgxworld
Copy link
Contributor Author

tgxworld commented Jul 3, 2024

I guess we should probably apply what I did in https://github.com/discourse/rails_failover/pull/70/files#diff-9d30ba9a0a8fdc4fb424d5ee1f338be98c9e183c7a8407340f3935c7adecdf33 too

@Flink Interesting, the build now fails on Rails 7.1 if the above fix is not applied. Maybe the change to active? was released in a patch version of Rails 7.1? This would explain why the test suite was not failing previously.

@@ -64,7 +64,8 @@ def initiate_fallback_to_primary
spec_name,
role: handler_key,
)
connection_active = connection.active?

connection_active = connection.verify!
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actual fix is here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably keep the call to connection.active? and use .tap(&:verify!) on line 66 because of the difference in implementations of #verify!. Indeed, with Rails 7.1, this will return true but the code from 7.0 and 6.1 is the following:

reconnect! unless active?

So here the result from reconnect! could be something else than a truthy value and if, for whatever reason, the connection is already active, it will return nil.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I'm thinking if we should just rely on a more reliable API where the implementation does not differ between versions. Executing a query seems to be the most stable IMO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#verify! is a reliable API. It will establish the connection if it’s necessary, and does nothing if it’s already there. Then you just call #active? and it will work as expected (under the hood it’s executing a query).
We should not rely on #verify! to give us a meaningful return value, that’s all 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good 👍 I've updated the PR.

@@ -11,7 +11,7 @@ class Railtie < ::Rails::Railtie
app.config.active_record_rails_failover = true
::ActiveSupport.on_load(:active_record) do
begin
::ActiveRecord::Base.connection
::ActiveRecord::Base.connection.verify!
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actual fix is also here.

@Flink
Copy link
Contributor

Flink commented Jul 3, 2024

@Flink Interesting, the build now fails on Rails 7.1 if the above fix is not applied. Maybe the change to active? was released in a patch version of Rails 7.1? This would explain why the test suite was not failing previously.

I just checked, and the code hasn’t changed between the various releases of Rails 7.1, weird 🤔

This commit drops the reliance on
`ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#active?` in favor
or using `ActiveRecord::ConnectionAdatpers::PostgreSQLAdatper#execute`
to check if the app can connect to the primary PG server.

This is because `ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#active?`
was changed in Rails 7.1 to return `false` if the connection has not be used to do
something meaningful. Ref:
rails/rails@8551e64.
Due to this change, our fallback to primary checker will keep thinking
that the primary PG server is down. Note that before Rails 7.1, `ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#active?`
will always execute a fake query to check if the query can be executed.

Instead of relying on ActiveRecord's internal API, we will instead rely
on `ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#execute` to
execute a blank query as our means of verifying if the PG server is up
and ready to receive connections.

This commit also updates the ActiveRecord tests to be more reliable by
reducing the Unicorn worker processes to 1 so that we don't have to rely
on flooding the app with requests to get all the Unicorn processes to
failover/fallback.

Co-authored-by: Loïc Guitaut <loic@discourse.org>
@tgxworld tgxworld force-pushed the fix_connectioN_active_check branch from ae760aa to cb2038c Compare July 3, 2024 09:11
@tgxworld tgxworld merged commit 92c1f7e into main Jul 3, 2024
13 checks passed
@tgxworld tgxworld deleted the fix_connectioN_active_check branch July 3, 2024 22:12
@tgxworld
Copy link
Contributor Author

tgxworld commented Jul 3, 2024

Thank you for revieiwng @Flink 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants