"Detected race condition" #13
This is "expected" but "unexpected". Let me explain. It is expected in that it is a condition the software is built to expect and self-remedy; that message is just for debug purposes. It is unexpected in that it is an edge case of the algorithm that could cause errors if left uncorrected. And in the course of writing this message, I noticed a bug (which would lead to too many of these messages being generated), so I'm glad you brought it up.

That message is supposed to be triggered when Node A receives state information from Node B, but Node B updated its state tables after Node B sent the message. This prevents a race condition in which Node B's state changes after it sends the state to Node A but before Node A properly "joins" the cluster and is, therefore, informed of changes to Node B's state.

What actually happens is that the message is triggered when Node A receives state information from Node B, but Node A updated its state tables after Node B sent the message. That is decidedly more chatty and generates more false positives, but eventually leads to the same result.

I need to do some serious work on Nodes joining the cluster and communicating Node state as part of #4 and #10, so I'll fix this bug when I do that. Essentially, I need Nodes to "formally" announce their presence to the cluster and trigger the race condition check at that point, not before. This will bring the implementation in line with the paper.

Thanks for raising the issue. Sorry for the problem. :(
As I was working on #10, I came across the part in the paper that specifically calls for this function:
Of course, as #4 demonstrates, relying on clock time in a distributed system is a Bad Idea™, so we don't want to use a timestamp for that. The original suggestion I heard was to use a vector clock, which would certainly do the trick. I think, however, based on the usage, it may be overkill.

Really, all we're trying to do is determine whether the state a node based its state tables on changed before the node officially joined the cluster and began receiving state updates. To that end, I'm starting to think a simple version number for each state table would suffice: the number is incremented whenever the state table is altered and is included when sending the state table to other nodes. Those nodes then include that number when they announce their presence to the cluster, and if there's a mismatch, the node in the cluster sends the new state to the joining node.

Any thoughts or objections?
I can't comment with any authority, not knowing the design of Wendy, but there is a vector clock implementation for Go that could be useful if you want to go that route.
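For comparison, the core of a vector clock fits in a few lines. This is a generic illustration of the concept, not the API of the library mentioned here:

```go
package main

import "fmt"

// A minimal vector clock, for comparison with a single uint64 per
// table: each clock keeps a counter per node ID, so it can tell
// "one history extends the other" apart from genuinely concurrent
// changes. Generic illustration only; not any particular library.
type vectorClock map[string]uint64

// tick records one local event at the given node.
func (vc vectorClock) tick(nodeID string) { vc[nodeID]++ }

// happenedBefore reports whether a is causally before b: every
// counter in a is <= the matching counter in b, and b has at least
// one strictly larger counter.
func happenedBefore(a, b vectorClock) bool {
	strictlyLess := false
	for id, av := range a {
		if av > b[id] {
			return false
		}
		if av < b[id] {
			strictlyLess = true
		}
	}
	for id, bv := range b {
		if _, seen := a[id]; !seen && bv > 0 {
			strictlyLess = true // b has events a never saw
		}
	}
	return strictlyLess
}

func main() {
	a := vectorClock{"nodeA": 1}
	b := vectorClock{"nodeA": 2}
	c := vectorClock{"nodeB": 1}
	fmt.Println(happenedBefore(a, b)) // true: b extends a's history
	fmt.Println(happenedBefore(a, c)) // false: concurrent
	fmt.Println(happenedBefore(c, a)) // false: concurrent
}
```

For the join check being discussed, a lone uint64 per table carries strictly less information (it can't detect concurrency), which is exactly the trade-off being weighed.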
Yeah, it just hasn't been brought up to speed to work with Go 1 yet, so I'd have to update it and submit the patch, which isn't the end of the world. I just think a vector clock may be more information than is strictly necessary for this; versioning the state tables with a uint64 is sufficient, I think. The biggest change is working on the joining algorithm, but I've got that all whiteboarded out, and I think it should be easy enough to implement. I've got some stuff that needs doing for my job, so I can't focus on this as much as I'd like, but I'll get the change in ASAP.
This has been resolved as of the beta1 release. |
I get the following error quite often when a new node joins an existing cluster:
This is during development, when both nodes are running on the same machine.
Is this expected? If so, perhaps using a less "scary" term than "race condition" might be good.