Possibility of losing commits when polling in high concurrency scenarios with SQL Server #21
Comments
I've also seen this issue on PostgreSQL. |
@mgevans In SQL Server, the scenario is probably very unlikely to occur, as the window of opportunity is very, very small if you use the right isolation level (lock-based I don't really know PostgreSQL, but if there is a similar way of using lock-based |
Hi Peter,
Do you happen to know whether the READ COMMITTED SNAPSHOT option was on or
off on your production database? I was under the impression (based on my
analysis and tests) that the behavior you describe only applies with READ
COMMITTED SNAPSHOT being on.
With READ COMMITTED SNAPSHOT on, a reader in your example will see the row
inserted by Thread B even if Thread A hasn't committed yet. The problem
becomes more likely, as you say, with a transaction scope.
With READ COMMITTED SNAPSHOT off, a reader in your example will usually
block as soon as Thread A has inserted a row (because it blocks on the lock
taken by the insert). There is a very tiny race condition, where the reader
doesn't see the row because the lock isn't taken yet. This, however, is not
affected by how long it takes to commit the transaction; the race is only
between the two INSERTs and the reader.
Thread A generates an IDENTITY value 1.
Thread B generates an IDENTITY value 2.
The reader reads past the place where CommitSequence 1 will be placed.
Thread A inserts the commit (and takes a lock) with CommitSequence: 1.
Thread B inserts the commit (and takes a lock) with CommitSequence: 2.
The reader now tries to read the row with CommitSequence 2.
...
So, did you see the behavior with READ COMMITTED SNAPSHOT on or off? I'm
asking because I know it's extremely likely to occur with READ COMMITTED
SNAPSHOT on; but I'm still trying to determine how likely the problem is to
occur with RCS off :)
About the TransactionScope: I think that scope inside ExecuteCommand is
there mostly to suppress ambient transactions if the
EnlistInAmbientTransaction option is set to false (which it usually is). In
fact, as soon as EnlistInAmbientTransaction is enabled, the scope becomes
a) unnecessary and b) a problem (#12). If you remove it, NEventStore would,
however, enlist in ambient transactions by default, which is probably not a
good change.
About your fix: There are a few use cases for running NEventStore with SQL
Persistence with EnlistInAmbientTransaction enabled; so IMO any change in
there should keep those use cases in mind. In particular, taking an
application lock for the whole transaction is probably not the best idea if
EnlistInAmbientTransaction is used because the transaction could live
quite "long" and the application lock would thus reduce concurrency a lot.
I think that as long as READ COMMITTED SNAPSHOT is off, it should be enough
to hold the app lock around the INSERT statement. I'm not sure if this
could increase the likelihood of deadlocks, though.
On Wed, Nov 30, 2016 at 3:31 PM, Peter Stephenson wrote:
Hi,
We've hit this problem several times in production and we think we have a
potential fix for SQL Server at least (I'm afraid I don't know the other
technologies very well!)
The issue seems to be down to a race condition when generating IDs (SQL
Server does this just prior to the insert when using an identity) and
completing the transaction. It's possible for the polling client to read
up to the latest commit while some earlier commits are still missing.
2 simultaneous threads operate as follows:
Thread A inserts a commit with CommitSequence: 1
Thread B inserts a commit with CommitSequence: 2
Thread B completes its transaction
Thread A completes its transaction
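A minimal Python sketch of how a checkpoint-based poller can lose a commit in the sequence above. It assumes the poller asks for everything above its last checkpoint and then advances the checkpoint to the highest value it saw. The names poll, visible and simulate_skip are hypothetical stand-ins for illustration, not NEventStore code:

```python
def poll(visible, checkpoint):
    """Return the visible sequence numbers above checkpoint and the new checkpoint."""
    batch = sorted(seq for seq in visible if seq > checkpoint)
    new_checkpoint = batch[-1] if batch else checkpoint
    return batch, new_checkpoint


def simulate_skip():
    visible = set()   # rows the reader is able to see (assumption: committed rows only)
    checkpoint = 0
    seen = []

    # Thread A was given CommitSequence 1 and Thread B was given 2,
    # but Thread B completes its transaction first.
    visible.add(2)
    batch, checkpoint = poll(visible, checkpoint)
    seen.extend(batch)

    # Thread A completes afterwards; its row sits below the checkpoint.
    visible.add(1)
    batch, checkpoint = poll(visible, checkpoint)
    seen.extend(batch)
    return seen, checkpoint


if __name__ == "__main__":
    print(simulate_skip())
```

In this model the second poll asks for rows above 2, so the row with CommitSequence 1 is never returned.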
The transaction scope used around all commands actually makes the issue
far more likely (as opposed to it not being there) as the time between the
ID generation and the row being committed is increased substantially. READ
COMMITTED wouldn't help in this case...
We've done some experimentation with SQL server directly and noted that
adding an exclusive application lock for the duration of the write (thus
serialising write operations) completely resolves this issue, however with
the transaction scope in play it can take a while to release the lock.
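To illustrate why a serialising lock helps, here is a rough Python simulation (not SQL Server or NEventStore code). The threading.Lock named app_lock is an assumed stand-in for an exclusive application lock held from ID generation until the row becomes visible, and run_writers is a hypothetical helper:

```python
import itertools
import random
import threading
import time


def run_writers(n, serialize):
    """Run n concurrent writers and return the order in which rows became visible."""
    ids = itertools.count(1)
    id_lock = threading.Lock()    # ID allocation itself is atomic
    app_lock = threading.Lock()   # stand-in for an exclusive application lock
    log_lock = threading.Lock()
    commit_log = []

    def writer():
        if serialize:
            app_lock.acquire()
        try:
            with id_lock:
                seq = next(ids)
            # Window between ID generation and the row being committed.
            time.sleep(random.uniform(0, 0.005))
            with log_lock:
                commit_log.append(seq)
        finally:
            if serialize:
                app_lock.release()

    threads = [threading.Thread(target=writer) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return commit_log


if __name__ == "__main__":
    print("serialised:  ", run_writers(10, serialize=True))
    print("unserialised:", run_writers(10, serialize=False))
```

With serialize=True the commit order always matches the ID order, because no second writer can obtain an ID until the first one has committed. Without it the order depends on timing and will usually differ between runs. The sketch also shows the cost: the longer the window, the longer every other writer waits.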
As each of the providers seems to use only a single statement for this
operation, I would argue the transaction scope could be removed, the
transaction moved to the server side code and a serialising lock taken. I'm
not familiar enough with the other technologies to complete this fix for
all of them, however.
The rough idea can be seen here:
https://github.com/ASOS/NEventStore.Persistence.SQL/tree/fix-mssql-race-condition
I'm willing to pull request this across, but have a few questions:
- Do you think the TransactionScope is definitely necessary?
- Is anybody able to suggest implementations for the other providers?
|
Hi, We actually realised that Read Committed Snapshot was on shortly after I posted that and I removed the post as it was misleading, sorry! |
See #14 (comment) about |
We are experiencing this in our production environment. Are there any suggestions for fixing this? We've implemented the READCOMMITTEDLOCK change (ea686a1#diff-9ec321712e759f586acc6e7a5b53661f1f7cadf2e6b6faade5e6e7db28f98f30). For example, here we pulled out 3227030099 and 3227030104, but not 3227030103, which is in the same bucket.
Update: We ended up implementing something similar to the suggestion in this comment. So far it's working well. |
@cecilyth Would it be possible for you to share the code of your solution? I think if Also, I'd be interested: Are you 100% sure the |
Here's the narrative for how our journey with this bug went. We're a summer camp registration company, so we have very heavy spurts of load during a registration event, and the Azure SQL database for our eventstore is just over 3 TB with almost 2 billion rows in the Commits table.
We originally released our eventstore upgrade (going from version 3 of the original Eventstore repo to the latest version of NEventStore) in late August. We fairly quickly saw the pollers skipping commits, implemented the
Unfortunately, shortly thereafter we realized we had an issue with the primary key on the Commits table. Our primary key included the
In late November we changed the primary key on the
Sequencing the commits on the client side wasn't really an option for us, as we're a multi-tenant system and each commit
I implemented new functionality in
We deployed to production last week and haven't seen any skipped commits since. Finally, we seem to be in a stable place with the primary key change, read committed lock for the
in MsSqlDialect:
in SqlPersistenceEngine:
|
Wow, thanks a lot for the detailed description! ❤️ So, you were still rarely seeing skipped commits with your very high load even after introducing the
It also makes some sense to me that the skip likelihood increased a lot after removing the
(BTW, there's usually another UNIQUE index called
Thanks also for the details about your workaround! The bad news is I think it still doesn't solve the original issue in theory, but in practice it's probably enough to get a stable solution.
In case you're still interested: The theoretical issue (as I understand it) is the following sequence of events, caused by a race involving (at least) three parallel threads. Threads A and B are writing commits, thread C is pulling commits, with a query like
As Thread C is already blocked on value 18 in step 5, it will skip the value 17 written in step 7.
With your solution, you still have the race condition in theory: if thread C performs its second (read uncommitted) query before Thread A gets to insert its row, it will return the same count as in step 5 and still skip value 17. The problem is inherent because the next
The only other fix I know that can solve this issue is to pause when you see a gap while pulling events, but even that can only be done heuristically (how long to wait?), and it will negatively affect performance if you have lots of aborted transactions (e.g., due to high concurrency on the same streams) :( . |
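The "pause when you see a gap" idea could look roughly like the Python sketch below. It is only an illustration under assumptions: fetch is a hypothetical callable returning the sorted sequence numbers above a checkpoint, and max_retries and interval are made-up tuning knobs standing in for the "how long to wait?" question. It is not part of NEventStore.

```python
import time


def has_gap(batch, checkpoint):
    """True if batch is not exactly checkpoint+1, checkpoint+2, ... without holes."""
    return any(seq != checkpoint + i for i, seq in enumerate(batch, start=1))


def poll_with_gap_wait(fetch, checkpoint, max_retries=10, interval=0.05,
                       sleep=time.sleep):
    """Re-query while the batch has a gap, up to max_retries times.

    After the retries are used up, the gap is assumed to be permanent
    (e.g. an aborted transaction) and the batch is accepted as it is.
    """
    batch = fetch(checkpoint)
    retries = 0
    while has_gap(batch, checkpoint) and retries < max_retries:
        sleep(interval)
        retries += 1
        batch = fetch(checkpoint)
    new_checkpoint = batch[-1] if batch else checkpoint
    return batch, new_checkpoint


if __name__ == "__main__":
    calls = []

    def fetch(checkpoint):
        # First query sees only 18; 17 becomes visible by the second query.
        calls.append(checkpoint)
        rows = [18] if len(calls) == 1 else [17, 18]
        return [seq for seq in rows if seq > checkpoint]

    print(poll_with_gap_wait(fetch, 16))
```

The trade-off is the one described above: every gap left behind by an aborted transaction costs up to max_retries * interval of waiting, and a writer that stalls for longer than that is still skipped.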
It's a good theory about that UNIQUE index somehow being corrupt or missing somewhere! I'll look into that. I totally understand your example and based on all the talk of the issue being very very rare, I was suspicious of it being the cause of our problems. One thing I'm not totally getting yet is why we see the mismatch in count when querying uncommitted, and why re-querying that range results in more commits getting pulled. I guess in your example, it's because by the time we count the uncommitted, we've hit step 5/thread C has taken the rowlock on 17? I really appreciate your response! |
As explained in this dba.stackexchange.com answer, the IDENTITY-based checkpointing mechanism used by the SqlPersistenceEngine with MsSqlDialect is not safe: commits could be skipped when using GetFrom or CommitPollingClient under high load. The probability of running into this scenario is probably very low, but it does exist.
See also NEventStore/NEventStore#425 for a discussion of possible solutions.