-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX: Falling back to primary PG server not reliable on Rails 7.1 #71
Conversation
I guess we should probably apply what I did in https://github.com/discourse/rails_failover/pull/70/files#diff-9d30ba9a0a8fdc4fb424d5ee1f338be98c9e183c7a8407340f3935c7adecdf33 too, since we could have the same problem then. (We won’t need a new PR for Rails 7.2 support in the end 😅) |
8450f1e
to
97f9d2b
Compare
a5cfb29
to
19878a8
Compare
19878a8
to
e25e98f
Compare
@@ -107,7 +107,6 @@ To install this gem onto your local machine, run `bundle exec rake install`. To | |||
The ActiveRecord failover tests are run against a dummy Rails server. Run the following commands to run the test: | |||
|
|||
1. `make setup_pg` | |||
2. `make start_pg` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now managed by the RSpec test itself so that we can ensure that each test starts with the Postgres servers being in the same state.
start_pg_primary # Start the fallback process | ||
|
||
flood_get("/posts", times: 100) do |res| | ||
expect(res.code.to_i).to eq(200) | ||
expect(res.body).to include("reading") | ||
expect(res.body).to include(EXPECTED_POSTS_COUNT) | ||
end | ||
ensure | ||
system("make restart_pg_primary") | ||
sleep 0.05 # Wait for fallback to complete | ||
|
||
response = get("/posts") | ||
|
||
expect(response.body).to eq(<<~BODY.chomp) | ||
Posts count: #{EXPECTED_POSTS_COUNT} | ||
role: writing | ||
BODY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was missing which is why our integration test suite did not catch the problem. We were not testing the fallback process.
@@ -1,2 +0,0 @@ | |||
<%= @role %> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is not required since we render plain:
now.
@@ -1,4 +1,4 @@ | |||
worker_processes (ENV["UNICORN_WORKERS"] || 5).to_i | |||
worker_processes (ENV["UNICORN_WORKERS"] || 1).to_i |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reduced number of unicorn workers to 1 to make the tests easier to reason about. It does reduce our coverage for multiple processes but I think this is a fine trade-off for now.
setup_dummy_rails_server: | ||
@cd spec/support/dummy_app && BUNDLE_GEMFILE=Gemfile bundle install --quiet && BUNDLE_GEMFILE=Gemfile RAILS_ENV=production $(BUNDLER_BIN) exec rails db:create db:migrate db:seed | ||
|
||
start_dummy_rails_server: | ||
@cd spec/support/dummy_app && BUNDLE_GEMFILE=Gemfile SECRET_KEY_BASE=somekey bundle exec unicorn -c config/unicorn.conf.rb -D -E production | ||
|
||
stop_dummy_rails_server: | ||
@kill -TERM $(shell cat spec/support/dummy_app/tmp/pids/unicorn.pid) | ||
|
||
teardown_dummy_rails_server: | ||
@cd spec/support/dummy_app && (! (bundle check > /dev/null 2>&1) || BUNDLE_GEMFILE=Gemfile DISABLE_DATABASE_ENVIRONMENT_CHECK=1 RAILS_ENV=production $(BUNDLER_BIN) exec rails db:drop) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved these to be managed by the RSpec tests so that things are easier to reason about.
restart_pg_primary: | ||
@$(PG_BIN_DIR)/pg_ctl --silent --log /dev/null -w -D $(PG_PRIMARY_DATA_DIR) -o "-p $(PG_PRIMARY_PORT)" -o "-k $(PG_PRIMARY_RUN_DIR)" restart | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This command is no longer needed.
e25e98f
to
ae760aa
Compare
@Flink Interesting, the build now fails on Rails 7.1 if the above fix is not applied. Maybe the change to |
@@ -64,7 +64,8 @@ def initiate_fallback_to_primary | |||
spec_name, | |||
role: handler_key, | |||
) | |||
connection_active = connection.active? | |||
|
|||
connection_active = connection.verify! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actual fix is here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably keep the call to connection.active?
and use .tap(&:verify!)
on line 66 because of the difference in implementations of #verify!
. Indeed, with Rails 7.1, this will return true
but the code from 7.0 and 6.1 is the following:
reconnect! unless active?
So here the result from reconnect!
could be something else than a truthy value and if, for whatever reason, the connection is already active, it will return nil
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I'm thinking if we should just rely on a more reliable API where the implementation does not differ between versions. Executing a query seems to be the most stable IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#verify!
is a reliable API. It will establish the connection if it’s necessary, and does nothing if it’s already there. Then you just call #active?
and it will work as expected (under the hood it’s executing a query).
We should not rely on #verify!
to give us a meaningful return value, that’s all 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good 👍 I've updated the PR.
@@ -11,7 +11,7 @@ class Railtie < ::Rails::Railtie | |||
app.config.active_record_rails_failover = true | |||
::ActiveSupport.on_load(:active_record) do | |||
begin | |||
::ActiveRecord::Base.connection | |||
::ActiveRecord::Base.connection.verify! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actual fix is also here.
I just checked, and the code hasn’t changed between the various releases of Rails 7.1, weird 🤔 |
This commit drops the reliance on `ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#active?` in favor or using `ActiveRecord::ConnectionAdatpers::PostgreSQLAdatper#execute` to check if the app can connect to the primary PG server. This is because `ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#active?` was changed in Rails 7.1 to return `false` if the connection has not be used to do something meaningful. Ref: rails/rails@8551e64. Due to this change, our fallback to primary checker will keep thinking that the primary PG server is down. Note that before Rails 7.1, `ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#active?` will always execute a fake query to check if the query can be executed. Instead of relying on ActiveRecord's internal API, we will instead rely on `ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#execute` to execute a blank query as our means of verifying if the PG server is up and ready to receive connections. This commit also updates the ActiveRecord tests to be more reliable by reducing the Unicorn worker processes to 1 so that we don't have to rely on flooding the app with requests to get all the Unicorn processes to failover/fallback. Co-authored-by: Loïc Guitaut <loic@discourse.org>
ae760aa
to
cb2038c
Compare
Thank you for revieiwng @Flink 👍 |
This commit adds the use of ActiveRecord::ConnectionAdatpers::PostgreSQLAdatper#verify!
when a new connection is retrieved from the connection pool. This is because
ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#active?
was changed in Rails 7.1 to return
false
if the connection has not be used to dosomething meaningful. Ref:
rails/rails@8551e64.
Due to this change, our fallback to primary checker will keep thinking
that the primary PG server is down. Note that before Rails 7.1,
ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#active?
will always execute a fake query to check if the query can be executed. By calling
ActiveRecord::ConnectionAdatpers::PostgreSQLAdatper#verify!
, beforeActiveRecord::ConnectionAdapters::PostgreSQLAdapter#active?
, we ensure that the connection to the Postgresserver has been made first.
This commit also updates the ActiveRecord tests to be more reliable by
reducing the Unicorn worker processes to 1 so that we don't have to rely
on flooding the app with requests to get all the Unicorn processes to
failover/fallback.
Co-authored-by: Loïc Guitaut loic@discourse.org