Celestia-bridge stuck 2 times during 24 hours #4045

Closed
Sebby83 opened this issue Jan 14, 2025 · 5 comments
Labels: bug (Something isn't working), external (Issues created by non node team members)


Sebby83 commented Jan 14, 2025

Summary of Bug

Hi,

Our Celestia-bridge node got stuck twice within the past 24 hours.

  1. First Occurrence Jan 13 at 13:33:27 GMT:

The service logs indicated:

"underlying subscription is stuck"

More logs are available on gist.
The only way to restore the service was by deleting the database, after which the Celestia bridge took a couple of hours to resync.

  2. Second Occurrence:

This issue started again today at approximately 20:15 GMT.

Common log messages included:

    "listener: subscriber error, resubscribing..."
    "server: request height not found"

The full logs for this event are available on gist.

We've ruled out an RPC issue, as the node successfully received block height updates.
Proof of this is available here

The server has sufficient free space on the NVMe drive and the storage appears to be healthy, so we suspect database corruption rather than a disk problem.

Additional Context:
We made a configuration change two days ago, increasing the max receive message size with the following command, but we are not sure if this is related:

sed -i.bak 's/^max-recv-msg-size = "10485760"/max-recv-msg-size = "20971520"/' ~/.celestia-app/config/app.toml
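One caveat with a substitution like the one above: it only fires if the existing line is exactly `max-recv-msg-size = "10485760"`. Otherwise sed exits successfully without changing anything. A minimal sketch of a check, assuming the key sits at the start of a line in the quoted form shown in the command; the function names are hypothetical and only for illustration:

```shell
# Print the numeric value of max-recv-msg-size from the given TOML file.
# Prints nothing if the key is missing or not in the quoted form.
get_max_recv_msg_size() {
  sed -n 's/^max-recv-msg-size = "\([0-9][0-9]*\)".*/\1/p' "$1"
}

# Apply the substitution from the report, then verify the new value.
# Returns non-zero if the file does not end up with 20971520 (20 MiB).
bump_max_recv_msg_size() {
  sed -i.bak 's/^max-recv-msg-size = "10485760"/max-recv-msg-size = "20971520"/' "$1" || return 1
  [ "$(get_max_recv_msg_size "$1")" = "20971520" ]
}

# Usage (path taken from the report):
# bump_max_recv_msg_size ~/.celestia-app/config/app.toml || echo "value not updated" >&2
```

If the check fails, the previous value was probably not the default 10485760 (10 MiB), and the line needs to be edited by hand.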

Let us know if you need further information.

Version

Semantic version: v0.20.4
Commit: 51b7943
Build Date: Thu Dec 19 19:25:05 GMT 2024
System version: amd64/linux
Golang version: go1.23.3

Steps to Reproduce

The issue manifests intermittently, and unfortunately, I was unable to reproduce it consistently.

Sebby83 added the bug label Jan 14, 2025
github-actions bot added the external label Jan 14, 2025
@rootulp
Contributor

rootulp commented Jan 15, 2025

Since this is a bridge node, moving this issue to celestia-node (the repo for DA nodes).

rootulp transferred this issue from celestiaorg/celestia-app Jan 15, 2025
@Wondertan
Member

@rootulp, when I first saw this issue I also immediately thought it was for the node, but when I looked deeper, the whole issue is about the stuck subscription, which we import from core. We have observed this issue quite often ourselves and had to add detection for this stuck state plus automatic resubscription. In some sense, I see merit in keeping this issue in core or app instead of node.

Although, as we are soon moving to gRPC, I think it doesn't matter much, and we can simply wait for that to land and solve this.

@Wondertan
Member

@Sebby83, thanks for reporting. We are soon going to move to a gRPC-based subscription between consensus and bridge nodes. This should resolve the issue. Until then, restarting either of the nodes usually helps.
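Until the gRPC change lands, the restart workaround could be automated with a small watchdog. The following is only a sketch under stated assumptions: the bridge runs as a systemd unit named `celestia-bridge` (a hypothetical name), its logs go to journald, and the message "underlying subscription is stuck" quoted in the report is a reliable signal. The function names are made up for illustration.

```shell
# Count the log lines on stdin that contain the stuck-subscription
# message quoted in the report. Always prints a number, 0 if none.
count_stuck_lines() {
  grep -c -F 'underlying subscription is stuck' || true
}

# Usage: restart_if_stuck THRESHOLD COMMAND [ARGS...]
# Reads log text on stdin. If the stuck message appears at least
# THRESHOLD times, runs COMMAND and returns its status.
# Returns 1 without running anything when below the threshold.
restart_if_stuck() {
  threshold="$1"
  shift
  hits="$(count_stuck_lines)"
  if [ "$hits" -ge "$threshold" ]; then
    "$@"
    return $?
  fi
  return 1
}

# Example wiring, e.g. from a cron job or systemd timer
# (unit name "celestia-bridge" is an assumption):
# journalctl -u celestia-bridge --since "5 minutes ago" --no-pager \
#   | restart_if_stuck 3 sudo systemctl restart celestia-bridge
```

Passing the restart command as arguments keeps the detection logic testable without touching a real service. The threshold avoids restarting on a single transient message, since the node already tries to resubscribe on its own.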

@Sebby83
Author

Sebby83 commented Jan 15, 2025

@Wondertan
Thanks a lot for the prompt reply. If it happens again, I'll follow your instructions.

Sebby83 closed this as completed Jan 15, 2025
@Sebby83
Author

Sebby83 commented Jan 15, 2025

Workaround provided by @Wondertan
