Empty server_metadata.json blocks agent from start #19720

Closed
soupdiver opened this issue Nov 22, 2023 · 14 comments · Fixed by #19935

@soupdiver
Contributor

Overview of the Issue

If server_metadata.json is empty at startup, the agent will not start.

2023-11-22T13:20:27.353Z [ERROR] agent: startup error: error="error reading server metadata: unexpected end of JSON input"

It seems this can happen when Consul stops abruptly. In my case it's running in a container.

I guess fixing the corruption and dealing with an empty file are different concerns here, even if related.
But if you can simply delete that file and the server works afterwards, why can't Consul handle this itself?

Related: #1221


Reproduction Steps

Start the agent when there is an empty server_metadata.json
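
The "unexpected end of JSON input" part of the error is what Go's encoding/json returns for zero-length input, so the failure mode can be illustrated without Consul. This is a minimal sketch, not Consul's code: the `metadata` struct, its field, and `parseMetadata` are hypothetical names chosen for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metadata is a hypothetical stand-in for the contents of
// server_metadata.json; the real struct is not shown in this thread.
type metadata struct {
	LastSeenUnix int64 `json:"last_seen_unix"`
}

// parseMetadata decodes the raw file contents and wraps any decode
// error with the same prefix that appears in the agent log.
func parseMetadata(raw []byte) (*metadata, error) {
	var md metadata
	if err := json.Unmarshal(raw, &md); err != nil {
		return nil, fmt.Errorf("error reading server metadata: %w", err)
	}
	return &md, nil
}

func main() {
	// An empty file yields zero bytes, which is not valid JSON.
	_, err := parseMetadata([]byte{})
	fmt.Println(err)
}
```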

Consul info for both Client and Server

Version: 1.15.4

Operating system and Environment details

Docker image: docker.io/library/consul:1.15

Log Fragments

2023-11-22T13:20:27.353Z [ERROR] agent: startup error: error="error reading server metadata: unexpected end of JSON input"

@huikang
Collaborator

huikang commented Nov 29, 2023

@soupdiver, I can't reproduce this issue since server_metadata.json can't be empty once a server agent is started.

It seems this can happen when Consul stops abruptly. In my case it's running in a container.

Could you explain under what circumstances you have an empty server_metadata.json? Thanks.

@soupdiver
Contributor Author

soupdiver commented Nov 30, 2023

Could you explain under what circumstances you have an empty server_metadata.json? Thanks.

I think I did in my post. It seems this can happen when Consul stops abruptly. In my case it's running in a container.

Consul runs in a container. When the host or container stops abruptly, the file is empty on the next start and Consul won't start.

I can't reproduce this issue since server_metadata.json can't be empty once a server agent is started.

Yea, maybe in an ideal situation but clearly it can happen.

@huikang
Collaborator

huikang commented Nov 30, 2023

Could you provide any details on what caused the Consul container to stop? Before consul is restarted, what files/directories are in its data directory (e.g., ls /path/to/consul-data-dir)?

@soupdiver
Contributor Author

Could you provide any details on what caused the Consul container to stop?

I stop the VM that runs the container.

Before consul is restarted, what files/directories are in its data directory

From what I can tell, it's just all "normal" Consul data. Raft info, services etc.
The only issue is that (for some reason) sometimes the server_metadata.json turns out empty.

Simply removing that file makes everything work again. So, if Consul itself fails to parse the file or if it's empty, I think Consul should be able to handle this case by itself.

@yateya

yateya commented Dec 7, 2023

Same error randomly happened to me while using docker image: docker.io/hashicorp/consul:1.16.1 . I am trying to reproduce the error again.

@soupdiver
Contributor Author

Same error randomly happened to me while using docker image: docker.io/hashicorp/consul:1.16.1 . I am trying to reproduce the error again.

For me it happens when the host shuts down and seems not to properly stop the container.
I think there are 2 different problems here:

  1. Corrupt file on shutdown
  2. Consul being unable to start with an empty/corrupt server_metadata.json

I guess 1 is harder to investigate, but 2 should be relatively easy, since deleting the file "solves" the problem.

@huikang

Could you provide any details on what caused the Consul container to stop?

Shutting down the host caused the container to stop.

Before consul is restarted, what files/directories are in its data directory

Just the "normal" Consul stuff: services, raft info, etc. Nothing crazy here. Only the metadata file causes problems.

@yateya

yateya commented Dec 7, 2023

I checked the Consul code. The only suspicious lines are in the function persistServerMetadata, where OpenServerMetadata opens the file with O_TRUNC, which truncates it to an empty file: os.OpenFile(filename, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0600). WriteServerMetadata is then called to write the contents.

If WriteServerMetadata failed OR for some reason consul crashed between those two steps, the file could be empty as we saw.

@soupdiver
Contributor Author

OR for some reason consul crashed between those two steps, the file could be empty as we saw.

To me this sounds like something that can happen in case of an unexpected and unclean shutdown. The container host shuts down abruptly, doesn't give the container time to shut down cleanly, etc.

@mustafamg

mustafamg commented Dec 12, 2023

Is there any progress on this issue, @huikang? As @soupdiver mentioned, it has two parts. One is easy: catching the error so that an empty server_metadata.json file is tolerated. That can be fast.
The second, finding out why the file became empty, may take time and can be a separate task.
This bug is critical as it makes the client unreliable.

@huikang
Collaborator

huikang commented Dec 13, 2023

@mustafamg, sorry about the late response. I will open a PR with a fix today or tomorrow.

@benlperkins

Maybe something could be learned from this fix that went into consul for a similar issue in years past?

#2240

@mustafamg

Thanks for the fix, @huikang. When do you expect it to be released?

huikang added a commit that referenced this issue Jan 5, 2024
huikang added a commit that referenced this issue Jan 5, 2024
@huikang
Collaborator

huikang commented Jan 5, 2024

@mustafamg, the change will be included in the next patch releases of 1.15, 1.16, and 1.17, which I believe should be out by the end of January.

@alexgornov

Can you please tell me which versions include these changes? I don't see them in the release notes.
