Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ODistributedRecordLockedException Timeout (100ms) on acquiring lock on record #xx:-1 on server 'xxx'. It is locked by request null #8974

Closed
freeart opened this issue Aug 2, 2019 · 32 comments
Assignees
Labels
Milestone

Comments

@freeart
Copy link

freeart commented Aug 2, 2019

OrientDB Version: 3.1.0 from develop branch

Java Version: docker openjdk:8-jdk-alpine

OS: docker openjdk:8-jdk-alpine

We have cluster of 2 orientdb in production and we get an error

com.orientechnologies.orient.server.distributed.task.ODistributedRecordLockedException: Timeout (100ms) on acquiring lock on record #82:-1 on server 'odb1'. It is locked by request null

It happens 2 times in 1 month
Id records is always #82:-1 or #83:-1 and request is null
After that message the table is blocked for inserting until restart orientdb server
We did kill -5 before restart on both of dbs
The incident was about 1pm we think
In attachment you can find our config, docker-compose file, docker logs and kill -5 logs
issue.zip

@jonsalvas
Copy link

jonsalvas commented Aug 6, 2019

Hi guys,

we are facing the same issue with 3.0.22. The warning itself is not the biggest issue as we implemented a retry logic to retry x times (as described in the documentation) until the lock is released and the transaction can be committed. The main issue is that after a certain amount of time (and after continuously inserting data into the database with JMeter and 10 threads in parallel) the database is completely locking for some time while the above WARN msg appear in log. Until the response from the database there is no way to read from or write into the database at all!!! Even if we try to update/insert other vertices. It is expected behavior that the update transaction of that specific vertex is locking others trying to write into that vertex but it seems that it is simply locking everything ! After a certain amount of time (may take up to one minute) the database is accepting requests again is however still logging the above message. By the way it makes no difference whether we are using "synchronous" or "asynchronous" or "undefined", the error still appears. In our preproduction database we also noticed some deadlock errors which we unfortunately were unable to reproduce locally.

We wrote a minimal script to reproduce the issue. Prerequisite is a schema with Class "TEST" and one Vertex with attribute with value=5. We attached the minimal javascript, some thread dumps, the used configs, logs and the JMeter Test for deeper analysis. As little side information: We are running the recent orientdb 3.0.22 image in kubernetes (using a helm chart from https://github.com/helm/charts/tree/master/incubator/orientdb).

debugging-orientdb.zip

@lvca @luigidellaquila Could you please assist with this issue as we have a production release approaching in a couple of weeks and this issue is a real blocker for us running orientdb in distributed mode.

There are several similar issues such as #8691 #7856 #8663 #8742 but they don't seem to be fixed or adressed.

@freeart
Copy link
Author

freeart commented Aug 8, 2019

@luigidellaquila We use orientdb in the public health system in a big EU country and we can't just restart db by every problem because of the regulatory agency. When it is happening the table is blocked to insert. It is very critical issue.

@freeart
Copy link
Author

freeart commented Aug 20, 2019

@luigidellaquila I wrote something stupid? Why nobody answers? We still have this issue and can't use orientdb literally as production product

@wolf4ood
Copy link
Member

Hi @freeart

sorry for the delay, first of all i see that you are you using 3.1 from develop branch.

Is there a reason for that?

Are you using OrientDB from Java client?

@luigidellaquila
Copy link
Member

Hi @freeart

Sorry for my late reply, I was away for two weeks and I still have some backlog...
Let me check it, I'll give you a feedback ASAP

Thanks

Luigi

@freeart
Copy link
Author

freeart commented Aug 20, 2019

@wolf4ood Hi, thanks for your reply. If you look how many problems I found in the stable branch you can imagine why we use develop version. Issues what I found
I'm using the rest api because of many problems in nodejs driver for 3.x version. I just wrote my tiny rest wrapper on orientdb rest api.
@luigidellaquila Thanks. Bosses want to "kill" me for my choice of db=) Maybe it's our fault in config, who knows. We have no strong devops.

@wolf4ood
Copy link
Member

Hi @freeart

3.0.x as 3.1.x contains always the latest fix. 3.1.x . could be general more unstable since it's development of the new version of ODB.

Which kind of issue did you get with the nodejs driver?

Are you using some kind of load balancing for the 2 nodes?

Thanks

@freeart
Copy link
Author

freeart commented Aug 20, 2019

@wolf4ood I started use orientdb 3.x from beta and nodejs driver was super unstable and didn't know about new features 3.x and I was tired to read stackoverflow every problem I found in driver and I wrote my own.
Things like those are fundamental to me and I couldn't wait the stable because of lag between the develop branch and the stable snapshots:
#8662
#8658
#8647
#8646
First month I tried to use stable branch but issues was so critical like I needed to implement something in 2 days but db didn't do properly like #8543 I needed to do two queries insert and select instead of single insert return.

We didn't use load balancing. Should we use it?

@wolf4ood
Copy link
Member

@freeart

i just wanted to know if you are writing concurrently in each node

Thanks

@freeart
Copy link
Author

freeart commented Aug 20, 2019

@wolf4ood I understand what you mean. I will ask our devops

@freeart
Copy link
Author

freeart commented Aug 20, 2019

@wolf4ood I confirm. We have orientdb in cluster on two real servers (there are in the same network) and we are writing to the first one and reading as well.

@wolf4ood
Copy link
Member

@freeart

rids #82:-1 or #83:-1 means that there is some problem with insert operation. since those are temporary record id.

The strange thing is the deadlock after the exception. The exception itself it can happen while waiting for getting the lock on a resource. Usually it is solved by using a retry mechanism.

are you able to reproduce this on a test cluster?

@freeart
Copy link
Author

freeart commented Aug 20, 2019

@wolf4ood No, we can't reproduce it. But on production it happens often. I attached logs and seems like nodes of cluster have lost connection between them for 2 days and has happened deadlock after and I don't know it was cause or not.

@freeart
Copy link
Author

freeart commented Aug 20, 2019

Maybe @jonsalvas can help us. Seems like he wrote a test scenario. I don't understand exactly what is it in his attachment

@wolf4ood
Copy link
Member

@jonsalvas use case is slightly different since it can be cause due load balancing which trigger concurrent writes on each node, but we exclude that for your use case.

@wolf4ood
Copy link
Member

@freeart

one info that can help us identifying the deadlock is that when it happens, execute a thread dump of both of the nodes and share it with us

https://dzone.com/articles/how-to-take-thread-dumps-7-options

@freeart
Copy link
Author

freeart commented Aug 20, 2019

@wolf4ood Ok I will. Can you explain me when I write to first node and has happened network problem between nodes. What happened after it has connected? The second one needs to get new records from the first one to sync? If we are still writing to the first one can it be cause of issue?

@wolf4ood
Copy link
Member

Hi @freeart

Not sure about the network problem, Did you experience some physical networking issue?

It should not cause this issue, writes are blocked meanwhile the records are being transfert, at least with community edition.

@freeart
Copy link
Author

freeart commented Aug 20, 2019

@wolf4ood I don't know what cause is it. Just see some times in logs
WARNI [odb2]->[odb1] Server 'odb1' did not respond to the gossip message (db=nvd, timeout=10000ms), but cannot be set OFFLINE by configuration [OClusterHealthChecker]
Some times: WARNI [odb2] Timeout (45001ms) on waiting for synchronous responses from nodes=[odb1] responsesSoFar=[] request=(id=0.107015 task=upd_db_status) [ODistributedDatabaseImpl]
So, next time when deadlock happens I will do jstack -l pid > ./threadDump.txt on both docker machines

@jonsalvas
Copy link

This morning I received the message in our Production environment:

What happened:

  • one of the nodes (in distributed cluster with 3 nodes) fell down due to OOM
  • everything worked fine, data was still received by the two other nodes
  • I restarted the node
  • Now one of the nodes started sending chunks to the restarted node. Meanwhile the restarted node already received data and threw a lot of 'Node is not online yet' in the log (which is fine I guess).
  • When finished the restarted node reported "ONLINE" to all others but received " Updated node status to 'ONLINE' [OHazelcastPlugin]Cannot route TX operation against distributed node
    java.lang.NullPointerException"
  • Finally I got a lot of messages like: com.orientechnologies.orient.server.distributed.task.ODistributedRecordLockedException: Timeout (100ms) on acquiring lock on record Add trigger feature [moved] #3:-1 on server 'xxxx-odb-orientdb-1'. It is locked by request null
  • Regarding "retry": we have a retry logic implemented, which does not help (it is retrying 100 times) and then failing and requeing in our rabbitmq.
  • The database is not accepting any data anymore it seems that the database is in a deadlock situation from which it is not able to recover from. I have to restart everything which is really really bad!

I attached the log.

logs-from-orientdb-in-prd-xxx-odb-orientdb-1.txt

@freeart
Copy link
Author

freeart commented Aug 23, 2019

@jonsalvas Did you do the memory dump via jstack -l pid > ./threadDump.txt?

@jonsalvas
Copy link

@freeart No sorry forgot to do that. I was too nervous about getting our production database running again :-). Next time it happens I will definitely do it.

@freeart
Copy link
Author

freeart commented Aug 23, 2019

@jonsalvas We stopped to use cluster mode on production and we are looking for an another db as "plan B" solution. I don't want to wait for the "next time", because we can lose our project.

@jonsalvas
Copy link

@freeart That's sad to hear. To be honest: We were also thinking about switching to ArangoDB as Plan B. The problem is that we now have a buy-in due to orient-specific code which we would need to migrate. We therefore hope we can fix our problems asap.

@wolf4ood
Copy link
Member

Hi @jonsalvas

We did a couple of fix on distributed, which could lead to the deadlock situation.

They should be available in 3.0.23.
Let me know if you are able to upgrade it and feedback if this issue is solved

Thanks

@ygyg70
Copy link

ygyg70 commented Aug 31, 2019

I updated to 3.0.23, still experiencing one of the issues mentioned here:
#8742
See also
#8994
I am currently blocked

@ahill134
Copy link

ahill134 commented Sep 4, 2019

Also was running into similar issues and upgraded to 3.0.23. I am no longer receiving the acquiring lock issue however i am now receiving this:

INFO: Caught Network I/O errors on SERVERNAME:2424/INDEXNAME, trying an automatic reconnection... (error: com.orientechnologies.orient.server.distributed.task.ODistributedRecordLockedException; local class incompatible: stream classdesc serialVersionUID = 7723318576895821763, local class serialVersionUID = -6271099426689941140) com.orientechnologies.common.io.OIOException: com.orientechnologies.orient.server.distributed.task.ODistributedRecordLockedException; local class incompatible: stream classdesc serialVersionUID = 7723318576895821763, local class serialVersionUID = -6271099426689941140 at com.orientechnologies.orient.client.remote.OStorageRemote.baseNetworkOperation(OStorageRemote.java:450) at com.orientechnologies.orient.client.remote.OStorageRemote.networkOperationRetryTimeout(OStorageRemote.java:331) at com.orientechnologies.orient.client.remote.OStorageRemote.networkOperationNoRetry(OStorageRemote.java:361) at com.orientechnologies.orient.client.remote.OStorageRemote.execute(OStorageRemote.java:1080) at com.orientechnologies.orient.core.db.document.ODatabaseDocumentRemote.execute(ODatabaseDocumentRemote.java:434) at com.optum.tech360.orient.OrientDB3Util.executeQuery(OrientDB3Util.java:202) at com.optum.crispr.util.EsUtil.buildRecordsList(EsUtil.java:92) at com.optum.crispr.util.EsUtil.getIndexRecords(EsUtil.java:117) at com.optum.crispr.ThreadedOrientLoader.runOrientLoader(ThreadedOrientLoader.java:68) at com.optum.crispr.ThreadedOrientLoader.run(ThreadedOrientLoader.java:37) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.InvalidClassException: com.orientechnologies.orient.server.distributed.task.ODistributedRecordLockedException; local class incompatible: stream classdesc serialVersionUID = 7723318576895821763, local class serialVersionUID = -6271099426689941140 at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1885) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1751) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431) at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.deserializeException(OChannelBinaryAsynchClient.java:322) at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.handleStatus(OChannelBinaryAsynchClient.java:284) at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.handleStatus(OChannelBinaryAsynchClient.java:307) at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.beginResponse(OChannelBinaryAsynchClient.java:196) at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.beginResponse(OChannelBinaryAsynchClient.java:158) at com.orientechnologies.orient.client.remote.OStorageRemote.beginResponse(OStorageRemote.java:1953) at com.orientechnologies.orient.client.remote.OStorageRemote.lambda$networkOperationRetryTimeout$2(OStorageRemote.java:348) at com.orientechnologies.orient.client.remote.OStorageRemote.baseNetworkOperation(OStorageRemote.java:407) ... 14 more

Java Version:
openjdk version "1.8.0_201"
OpenJDK Runtime Environment (build 1.8.0_201-b09)
OpenJDK 64-Bit Server VM (build 25.201-b09, mixed mode)

OS:
NAME="Red Hat Enterprise Linux Server"
VERSION="7.6 (Maipo)"

Orient DB Version: 3.0.23

After receiving this error I cannot insert into the database at all but I can create new classes/properties. It is very similar symptoms to the acquiring lock/deadlock issue, so i believe it is coming from the same source.

@freeart
Copy link
Author

freeart commented Oct 2, 2019

I got the issue about deadlock in single node configuration. Seems like it's not a problem about communication between nodes @wolf4ood

@wolf4ood
Copy link
Member

wolf4ood commented Oct 2, 2019

@freeart

can you capture the thread dump when is in deadlock and send it to me at e.risa@sap.com ?

You can use on of the following methods

https://dzone.com/articles/how-to-take-thread-dumps-7-options

Thanks

@freeart
Copy link
Author

freeart commented Oct 8, 2019

@wolf4ood I can't capture the dump because of docker

PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
  1     0 root     S    5835m  71%   3   0% /usr/lib/jvm/java-1.8-openjdk/bin/java -server -Xms4G -Xmx4G -Djna.nosys=true -XX:+HeapDumpOnOutOfMemoryError -Djava.awt.headless=true -Dfile.encoding=UTF8 -D

/orientdb # jstack -l 1 > ./dump.txt
1: Unable to get pid of LinuxThreads manager thread

@jonsalvas
Copy link

jonsalvas commented Nov 21, 2019

@jonsalvas use case is slightly different since it can be cause due load balancing which trigger concurrent writes on each node, but we exclude that for your use case.

@wolf4ood Could you please elaborate on this? How is this intended in a HA environment? Do we have to send all our write operations to a single node only, instead of a load balancer? How to handle HA, if the node goes down? Should the client handle that? etc. etc. Would be nice to see a best practice architecture. The documentation is not really helpful in this case.

We are currently testing 3.0.25 in production. It seems to run more stable, but we keep monitoring. As soon as the issue reappears I will attach the heap dump

@tglman tglman self-assigned this Jun 1, 2020
@tglman tglman added the bug label Jun 1, 2020
@tglman tglman added this to the 3.1.x milestone Jun 1, 2020
@tglman tglman modified the milestones: 3.1.x, 3.1.16 Feb 23, 2022
@ozgursucu
Copy link

ozgursucu commented Feb 26, 2022

@tglman what is the current situation of this problem? We are also facing this problem frequently in our production environment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

9 participants