
During rolling deploy it is possible for the old application pod to interact with the updated database #1867

Open · jobara opened this issue Jul 24, 2023 · 17 comments · May be fixed by #1981

@jobara
Collaborator

jobara commented Jul 24, 2023

Prerequisites

Describe the bug

In our current rolling deploy system, as new pods are being deployed, an old pod sticks around until the new ones are ready for use. However, there is a single shared database that all of the pods connect to. The issue is that a user may be interacting with the old pod after the database has already been migrated to a new structure. This can lead to data corruption and/or 500 errors for the user, because the old application's expectations of the data no longer match the current database schema.

Expected behavior

We should minimize or eliminate the possibility of the old application and the new database interacting with each other.

@jobara jobara added bug Something isn't working help wanted Extra attention is needed labels Jul 24, 2023
@jobara jobara added this to the 1.2.0 milestone Jul 24, 2023
@colleenskemp

This ticket captures the following related sub-tickets:
#1728
#1686
#1550

@colleenskemp

@jobara - We understand that this is not a priority at this time. Is that right? Our team's sense is that we could turn off rolling updates, but then we would have downtime for each deployment. This might not be worth our time.

Do you agree?

@jobara
Collaborator Author

jobara commented Aug 2, 2023

@colleenskemp I'll have to think some more on this. I'll check in with @michelled when she's back.

@jobara
Collaborator Author

jobara commented Sep 27, 2023

At the dev check-in meeting with @JureUrsic, @peterhebert, and @michelled we discussed using Laravel's maintenance mode for this. When the deploy starts, the script would call `php artisan down`; after the deploy finishes, it would call `php artisan up`. Any users accessing the site during the maintenance window would see a maintenance page.
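For reference, a minimal sketch of what wrapping the migration step in maintenance mode could look like as an Artisan command. The `deploy:safely` command name is hypothetical; in this project the equivalent logic would presumably live in the existing global deploy command (DeployGlobal.php):

```php
<?php

namespace App\Console\Commands;

use Illuminate\Console\Command;
use Illuminate\Support\Facades\Artisan;

// Hypothetical sketch; not the project's actual deploy command.
class DeploySafely extends Command
{
    protected $signature = 'deploy:safely';

    protected $description = 'Run migrations while the site is in maintenance mode';

    public function handle(): int
    {
        // Put the application into maintenance mode before touching the schema.
        Artisan::call('down', ['--retry' => 60]);

        // --force runs the migrations non-interactively in production.
        Artisan::call('migrate', ['--force' => true]);

        // Take the site out of maintenance mode once the schema matches the new code.
        Artisan::call('up');

        return Command::SUCCESS;
    }
}
```

Whether `up` should also run when the migration fails is a design choice; leaving the site down on failure may be safer, since the code and schema would otherwise be out of sync.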

@jobara
Collaborator Author

jobara commented Oct 25, 2023

@JureUrsic I was thinking about this today and wondering when/where it should run. I was thinking it could go around the migration step in DeployGlobal.php, but I'm not sure, because wouldn't the old web head need to come down before we take the site out of maintenance mode? Also, are you able to take on this task?

@JureUrsic
Contributor

@jobara it should go into the "local" command, at the start and at the end

@JureUrsic
Contributor

I can run some tests on dev; just give me the commands to run.

@jobara
Collaborator Author

jobara commented Oct 26, 2023

> I can run some tests on dev; just give me the commands to run.

@JureUrsic thanks, you can use the `php artisan down` and `php artisan up` commands. See Laravel's maintenance mode documentation for more information.

@jobara
Collaborator Author

jobara commented Nov 14, 2023

@JureUrsic the other day I manually reset the database in the dev deploy. As part of that, I put the site in maintenance mode. However, after bringing the site back up using `php artisan up`, the site was taken out of maintenance mode, but for several minutes it remained inaccessible and returned a 500 error (from nginx, I believe). So the site actually looked broken for a while. I'm not sure if this will happen with the plans we have for this ticket, but it's something to look into along with it.

@SantiagoG-Colab

@marvinroman

@marvinroman
Contributor

So the problem with maintenance mode currently is that the health check on the pods also receives the maintenance-mode response, so the pod is considered unhealthy and the load balancer doesn't forward connections to it.

We will take the following actions to fix:

  • Create a health check that will bypass maintenance mode.
  • Put the `php artisan down`/`up` calls in the `php artisan deploy:global` command.
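A minimal sketch of the first item above, assuming the default Laravel middleware layout and a hypothetical `/healthcheck` path (the actual route in the branch may differ). Paths listed in `$except` skip the maintenance-mode response, so the load balancer's probe keeps receiving a 200 while the rest of the site returns 503:

```php
<?php

namespace App\Http\Middleware;

use Illuminate\Foundation\Http\Middleware\PreventRequestsForMaintenanceMode as Middleware;

class PreventRequestsForMaintenanceMode extends Middleware
{
    /**
     * The URIs that should remain reachable while maintenance mode is enabled.
     *
     * @var array<int, string>
     */
    protected $except = [
        'healthcheck', // hypothetical path polled by the pod's readiness probe
    ];
}
```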

@marvinroman
Contributor

@jobara I've made the necessary changes in the branch associated with this issue. Let me know if you'd like me to create a PR for it.

@jobara
Collaborator Author

jobara commented Nov 16, 2023

@marvinroman thanks for working on this. Yes, please file a PR for the changes.

@jobara
Collaborator Author

jobara commented Nov 16, 2023

> So the problem with maintenance mode currently is that the health check on the pods also receives the maintenance-mode response, so the pod is considered unhealthy and the load balancer doesn't forward connections to it.
>
> We will take the following actions to fix:
>
>   • Create a health check that will bypass maintenance mode.
>   • Put the `php artisan down`/`up` calls in the `php artisan deploy:global` command.

Regarding the health check, glancing at your branch, it looks like it checks the DB now. But I guess that won't really tell us whether the website itself is actually being served properly. Is there a way to check different things depending on whether or not the site is in maintenance mode?

Regarding turning maintenance mode on/off in the global deploy, will that affect the original instance as well and not just the two new ones that are in the process of spinning up?
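For illustration, a minimal sketch of a maintenance-mode-aware check, assuming a hypothetical `/healthcheck` route that is excluded from the maintenance-mode middleware: it verifies the database connection and reports maintenance mode separately rather than letting it fail the probe.

```php
<?php

// routes/web.php — hypothetical /healthcheck route; it must also be listed in the
// maintenance-mode middleware's $except array so it stays reachable while the site is down.

use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Route;

Route::get('/healthcheck', function () {
    try {
        // getPdo() throws if the pod cannot reach the database.
        DB::connection()->getPdo();
        $database = 'ok';
    } catch (\Throwable $e) {
        $database = 'unreachable';
    }

    return response()->json([
        'database' => $database,
        // Reported for observability; maintenance mode does not mark the pod unhealthy.
        'maintenance' => app()->isDownForMaintenance(),
    ], $database === 'ok' ? 200 : 503);
});
```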

@jobara
Collaborator Author

jobara commented Nov 16, 2023

@marvinroman also, in your branch I noticed that it brings the site back up after 5 minutes. These kinds of timers are always risky, since we don't know whether the task is still running or finished some time earlier. Is it possible to get a hook into when the new pods are actually being used, and/or when the old pods have all been removed?

@marvinroman
Contributor

> > So the problem with maintenance mode currently is that the health check on the pods also receives the maintenance-mode response, so the pod is considered unhealthy and the load balancer doesn't forward connections to it.
> >
> > We will take the following actions to fix:
> >
> >   • Create a health check that will bypass maintenance mode.
> >   • Put the `php artisan down`/`up` calls in the `php artisan deploy:global` command.
>
> Regarding the health check, glancing at your branch, it looks like it checks the DB now. But I guess that won't really tell us whether the website itself is actually being served properly. Is there a way to check different things depending on whether or not the site is in maintenance mode?
>
> Regarding turning maintenance mode on/off in the global deploy, will that affect the original instance as well and not just the two new ones that are in the process of spinning up?

This is a health check of the pod, not the site, used to decide whether the load balancer should forward connections to the pod. In other words, it checks whether the pod's services are running properly. We have an external check that determines site health and will notify us of site issues.

When maintenance mode is activated, it applies across all the pods.

@marvinroman
Contributor

I agree that there are risks associated with a timer, but we haven't found an alternative at this time.

We have determined that lifecycle hooks aren't possible to use in our infrastructure at this time.
