
insert fail on distributed mode #6093

Closed
rayxai opened this issue May 6, 2016 · 11 comments

rayxai commented May 6, 2016

After running stably for 3 months, we suddenly started seeing many of the following exceptions. It seems node1 cannot accept insert operations, although update/query works fine, and inserts work on node2.
For now we have disabled distributed mode.
Thanks for any ideas.

version: 2.1.7
"executionMode": "synchronous"
data size: vert 2500k, edge 2500k

Caused by: com.orientechnologies.orient.server.distributed.ODistributedException: Error on executing distributed request (id=5022941 from=node1 task=record_create(#11:-1 v.0) user=#5:0) against database 'orion.[orion_v]' to nodes [node1, node2]
at com.orientechnologies.orient.server.hazelcast.OHazelcastDistributedDatabase.send2Nodes(OHazelcastDistributedDatabase.java:189)
at com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.sendRequest(OHazelcastPlugin.java:360)
at com.orientechnologies.orient.server.distributed.ODistributedStorage.createRecord(ODistributedStorage.java:547)
at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.executeSaveRecord(ODatabaseDocumentTx.java:1999)
at com.orientechnologies.orient.core.tx.OTransactionNoTx.saveRecord(OTransactionNoTx.java:159)
at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.save(ODatabaseDocumentTx.java:2568)
at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.save(ODatabaseDocumentTx.java:2409)
at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.save(ODatabaseDocumentTx.java:121)
at com.orientechnologies.orient.server.network.protocol.binary.OBinaryNetworkProtocolAbstract.createRecord(OBinaryNetworkProtocolAbstract.java:367)
at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.createRecord(ONetworkProtocolBinary.java:1569)
at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.executeRequest(ONetworkProtocolBinary.java:365)
at com.orientechnologies.orient.server.network.protocol.binary.OBinaryNetworkProtocolAbstract.execute(OBinaryNetworkProtocolAbstract.java:223)
at com.orientechnologies.common.thread.OSoftThread.run(OSoftThread.java:77)
Caused by: com.orientechnologies.orient.server.distributed.ODistributedException: Quorum 2 not reached for request (id=5022941 from=node1 task=record_create(#11:-1 v.0) user=#5:0). Elapsed=6ms Servers in timeout/conflict are:
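For readers hitting the same message, the quorum logic behind it can be sketched as follows: a distributed write is acknowledged only when at least `quorum` replicas report success, and any node that answers with an exception (or times out) counts against the quorum, after which the coordinator rolls the change back. This is a minimal, self-contained illustration under those assumptions; the class and method names are hypothetical, not the OrientDB internals:

```java
import java.util.List;

// Hypothetical sketch of the quorum decision behind the error above.
// Each replica answers a record-create request either with the new record
// id (success) or with an error string; quorum is reached only when at
// least `quorum` replicas succeeded. Names are illustrative only.
public class QuorumSketch {

    public record NodeResponse(String node, String rid, String error) {
        boolean ok() { return error == null; }
    }

    // Count successful replies and compare against the configured quorum.
    public static boolean quorumReached(List<NodeResponse> responses, int quorum) {
        long successes = responses.stream().filter(NodeResponse::ok).count();
        return successes >= quorum;
    }
}
```

In the log above, node1 succeeded but node2 answered with `ORecordDuplicatedException`, so only 1 of 2 replicas succeeded and quorum 2 failed.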


rayxai commented May 6, 2016

First we tried restarting the db, but the exceptions reappeared several minutes after the restart. We then restarted it as a standalone server, and they disappeared.


rayxai commented May 6, 2016

Logs before the exception:

2016-05-01 20:09:03:219 INFO  []:2434 [orientdb] [3.5.3] processors=2, physical.memory.total=3.7G, physical.memory.free=106.2M, swap.space.total=7.8G, swap.space.free=7.8G, heap.memory.used=1.5G, heap.memory.free=426.2M, heap.memory.total=1.9G, heap.memory.max=1.9G, heap.memory.used/total=78.54%, heap.memory.used/max=78.54%, minor.gc.count=72210, minor.gc.time=3056529ms, major.gc.count=521, major.gc.time=556340ms, load.process=1.00%, load.system=1.00%, load.systemAverage=0.01, thread.count=98, thread.peakCount=154, cluster.timeDiff=0, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operation.size=0, executor.q.priorityOperation.size=0, executor.q.response.size=0, operations.remote.size=3, operations.running.size=0, operations.pending.invocations.count=3, operations.pending.invocations.percentage=0.00%, proxy.count=18, clientEndpoint.count=0, connection.active.count=1, client.connection.count=0, connection.count=1 [HealthMonitor]
2016-05-01 20:09:23:506 WARNI [node1] detected 1 node(s) in timeout or in conflict and quorum (2) has not been reached, rolling back changes for request (id=5015367 from=node1 task=record_create(#11:-1 v.0) user=#5:0) [ODistributedResponseManager]
2016-05-01 20:09:23:506 WARNI [node1] Quorum 2 not reached for request (id=5015367 from=node1 task=record_create(#11:-1 v.0) user=#5:0). Elapsed=6ms Servers in timeout/conflict are:
  node2: com.orientechnologies.orient.core.storage.ORecordDuplicatedException: Cannot index record orion_v{pk:150001346477,pk_type:3}: found duplicated key '150001346477' in index 'idx_orion_v_pk' previously assigned to the record #11:2658222 RID=#11:2658222
Received: {node1=#11:2658285 v.1, node2=com.orientechnologies.orient.core.storage.ORecordDuplicatedException: Cannot index record orion_v{pk:150001346477,pk_type:3}: found duplicated key '150001346477' in index 'idx_orion_v_pk' previously assigned to the record #11:2658222 RID=#11:2658222} [ODistributedResponseManager]
2016-05-01 20:09:23:506 WARNI [node1] sending undo message (record_delete(#11:2658285 delayed=false)) for request (id=5015367 from=node1 task=record_create(#11:-1 v.0) user=#5:0) to server node1 [ODistributedResponseManager]
2016-05-01 20:09:25:116 WARNI [node1] detected 1 node(s) in timeout or in conflict and quorum (2) has not been reached, rolling back changes for request (id=5015370 from=node1 task=record_create(#11:-1 v.0) user=#5:0) [ODistributedResponseManager]


rayxai commented May 6, 2016

I noticed that for an insert, the normal responses from send2Nodes should have the same RID on all nodes, right?

responses:{node1=#11:2675912 v.1, node2=#11:2675912 v.1}

But in our case they have different RIDs, which causes the exception (all inserts failed after that point).

Received: {node1=#11:2660169 v.1, node2=#11:2660136 v.1}

What causes the inconsistency? It seems the db cannot automatically recover from it.
Thanks for any hints.
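The symptom described here (divergent RIDs in otherwise successful responses) amounts to a simple invariant check on the coordinator's side. A hypothetical sketch, assuming each node reports the RID it assigned (the class and method names are illustrative, not an OrientDB API):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical invariant check for the symptom above: on a distributed
// insert, every replica should report the same newly assigned RID.
// A divergence like {node1=#11:2660169, node2=#11:2660136} means the
// replicas are no longer allocating record ids in lockstep.
public class RidConsistency {

    public static boolean sameRidEverywhere(Map<String, String> ridByNode) {
        // Collapse the per-node RIDs into a set; more than one distinct
        // value means the replicas disagree.
        Set<String> distinct = new HashSet<>(ridByNode.values());
        return distinct.size() <= 1;
    }
}
```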


lvca commented May 6, 2016

This exception:

node2=com.orientechnologies.orient.core.storage.ORecordDuplicatedException: Cannot index record orion_v{pk:150001346477,pk_type:3}: found duplicated key '150001346477' in index 'idx_orion_v_pk' previously assigned to the record #11:2658222 RID=#11:2658222} [ODistributedResponseManager]

is the cause. We had an issue with indexes, so I guess you have 2 records with the same key. Could you please resolve the duplication and upgrade to the latest 2.1.x? @tglman, is this fix in v2.1.x?

@lvca lvca assigned tglman and unassigned lvca May 6, 2016

tglman commented May 6, 2016

Hi,

Yes, all the fixes based on cleanup-on-exception are in 2.1.x.

Regards


rayxai commented May 6, 2016

Thanks for the quick reply. Currently we use a unique index to avoid duplication. What do you mean by "resolve the duplication"? Do we need to check for duplicates before insert, and delete existing duplicates manually, to resolve the inconsistency?

thanks.


tglman commented May 6, 2016

Hi @ray3888,

Well, this is actually what OrientDB does in the non-Tx case. The point is that some of this behavior was fixed in a hotfix more recent than 2.1.7, so we suggest updating to the latest hotfix (currently 2.1.16).

Regards
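The manual deduplication being discussed can be sketched as follows, assuming you can list each record's RID together with the indexed key it carries. The helper below is illustrative only, not an OrientDB API; it merely shows the grouping step, after which the surplus records would be inspected and deleted by hand before the unique index is rebuilt:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the cleanup step: given record ids and the
// indexed key each record carries, report every key that is held by
// more than one record. Names are illustrative, not an OrientDB API.
public class DuplicateFinder {

    public static Map<String, List<String>> duplicates(Map<String, String> ridToKey) {
        Map<String, List<String>> byKey = new HashMap<>();
        ridToKey.forEach((rid, key) ->
                byKey.computeIfAbsent(key, k -> new ArrayList<>()).add(rid));
        // Keep only keys that appear on two or more records.
        byKey.values().removeIf(rids -> rids.size() < 2);
        return byKey;
    }
}
```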


rayxai commented May 6, 2016

Thanks for the advice. We will upgrade soon.

Regards


rayxai commented May 20, 2016

We upgraded to 2.1.16 two days ago, enabled distributed mode, and changed from noTx to Tx just in case. It has worked well up to now.


tglman commented May 20, 2016

Hi @ray3888,

Great, can we close this now?


rayxai commented May 20, 2016

Sure.

@rayxai rayxai closed this as completed May 20, 2016
@robfrank robfrank modified the milestones: 2.1.x (next hotfix), 2.1.18 May 24, 2016