Celestia-bridge stuck 2 times during 24 hours #4045

Sebby83 · 2025-01-14T23:15:16Z

Summary of Bug

Hi,

Our Celestia-bridge node got stuck twice within the past 24 hours.

First Occurrence Jan 13 at 13:33:27 GMT:

The service logs indicated:

"underlying subscription is stuck"

More logs are available in a gist.
The only way to restore the service was by deleting the database, after which the Celestia bridge took a couple of hours to resync.

Second Occurrence:

This issue started again today at approximately 20:15 GMT.

Common log messages included:

    "listener: subscriber error, resubscribing..."
    "server: request height not found"

The full logs for this event are available in a gist.

We've ruled out an RPC issue, as the node successfully received block height updates.
Proof of this is available here

We suspect this may be related to database corruption, as the server has sufficient free space on the NVMe drive and the storage appears to be healthy.

Additional Context:
We made a configuration change two days ago, increasing the max receive message size with the following command, but we are not sure if this is related:

sed -i.bak 's/^max-recv-msg-size = "10485760"/max-recv-msg-size = "20971520"/' ~/.celestia-app/config/app.toml

Let us know if you need further information

Version

Semantic version: v0.20.4
Commit: 51b7943
Build Date: Thu Dec 19 19:25:05 GMT 2024
System version: amd64/linux
Golang version: go1.23.3

Steps to Reproduce

The issue manifests intermittently, and unfortunately, I was unable to reproduce it consistently.


rootulp · 2025-01-15T14:05:30Z

Since this is a bridge node, moving this issue to celestia-node (the repo for DA nodes).

Wondertan · 2025-01-15T16:37:17Z

@rootulp, when I first saw this issue I also thought it belonged to the node, but on a closer look the whole issue is about the stuck subscription, which we import from core. We have observed this issue quite often ourselves and had to add detection for this stuck state along with automatic resubscription. In some sense, I see merit in keeping this issue in core or app instead of node.

That said, since we are moving to gRPC soon, I think it doesn't matter much, and we can simply wait for that change to land and resolve this.

Wondertan · 2025-01-15T16:39:05Z

@Sebby83, thanks for reporting. We are soon going to move to a gRPC-based subscription between consensus and bridge nodes. This should resolve the issue. Until then, restarting either of the nodes usually helps.

Sebby83 · 2025-01-15T17:02:53Z

@Wondertan
Thanks a lot for the prompt reply. When it happens again, I'll follow your instructions.

Sebby83 · 2025-01-15T17:03:11Z

Workaround provided by @Wondertan

Sebby83 added the bug Something isn't working label Jan 14, 2025

github-actions bot added the external Issues created by non node team members label Jan 14, 2025

rootulp transferred this issue from celestiaorg/celestia-app Jan 15, 2025

rootulp mentioned this issue Jan 15, 2025

Investigate impact of Mocha spam celestiaorg/celestia-app#4230

Open

Sebby83 closed this as completed Jan 15, 2025
