The JDBC database interface layer constructs incomplete scan query #171

arnaudsjs · 2014-05-20T07:39:03Z

When a scan query is composed in the JDBC database interface layer, the resulting query has to following shape:

SELECT * FROM "TABLE_NAME" WHERE "PRIMARY_KEY" >= "START_KEY"

This implementation causes the scan operation to return more and more results when the amount of entries in the database raises during the benchmark. As a consequence the latencies of the scan operation raise too over time (this is how the bug was detected).

I suggest the implementation should be something like the following query:

SELECT * FROM "TABLE_NAME" WHERE "PRIMARY_KEY" >= "START_KEY" ORDER BY "PRIMARY_KEY" LIMIT "MAX_SCAN_LENGTH"

This implementation takes into account the maxscanlength property and only selects the first maxscanlength properties after the startkey (order by primary key). Experiments using the MySQL database show that this implementation makes the latencies of the scan operation constant over time.

renep123 · 2014-11-17T07:30:29Z

Hi
I am facing a similar issue while performing a scan on the mysql db when the returned result is too big and causes a Java memory out of space error. Can you please help me out with how you modified the scan implementation to take into account the MAX_SCAN_LENGTH limit.

Thanks in advance!

arnaudsjs · 2014-11-17T08:36:55Z

You have to adjust the query construction within the following file YCSB/jdbc/src/main/java/com/yahoo/ycsb/db/JdbcDBClient.java (createAndCacheScanStatement and scan method) in the same way as shown here: arnaudsjs/YCSB-mysql-cluster-binding@eaf9542. Of course ignoring the setReadOnly calls. The LIMIT keyword selects only a limited amount of rows, which will prevent the "Java memory out of space error".

Let me know if there are any further issues.
Good luck!

renep123 · 2014-11-17T17:33:02Z

Thanks a lot for your reply Arnaud!!

I changed the file JdbcClient.java as suggested by you and recompiled the ycsb from source. When I run the scan workloade now, I can see that the mysql query is now modified using the ORDER BY and LIMIT Keywords.

But I am not able to understand how is the LIMIT parameter decided in the code, ideally it should be the maxscanlength parameter.

Also, my workload output (ycsb's client) looks as follows:-
YCSB Client 0.1
Command line: -db com.yahoo.ycsb.db.JdbcDBClient -P workloads/workloademod -p db.driver=com.mysql.jdbc.Driver -p db.url=jdbc:mysql://localhost:3306/testycsb -p db.user=username -p recordcount=1000 -p operationcount=1000 -p maxexecutiontime=1800 -t
Adding shard node URL: jdbc:mysql://localhost:3306/testycsb
Using 1 shards
[OVERALL], RunTime(ms), 27828.0
[OVERALL], Throughput(ops/sec), 35.93502946672416
[CLEANUP], Operations, 1
[CLEANUP], AverageLatency(us), 1514.0
[CLEANUP], MinLatency(us), 1514
[CLEANUP], MaxLatency(us), 1514
[CLEANUP], 95thPercentileLatency(ms), 1
[CLEANUP], 99thPercentileLatency(ms), 1
[CLEANUP], 0, 0
[CLEANUP], 1, 1

What's cleanup message above, I would imagine it to be all scans, I am not sure if I am doing the right thing.

Thanks again for your help.

renep123 · 2014-11-17T19:45:33Z

Actually I can see a bunch of scan messages following the cleanup messages. What do the cleanup messages correspond to?

Thanks

busbey · 2015-10-31T00:26:36Z

Any idea if this is still a problem @kruthar ?

kruthar · 2015-10-31T01:10:17Z

@busbey, I'm looking into this one and a few other open JDBC issues. If I can reproduce this I'll look into a fix, but at this point I'm not sure if it is still an issue.

kruthar · 2015-10-31T03:00:23Z

So, this is an interesting issue. First off, it does seem to be a problem still, increasing record count increases scan latency dramatically:

100 records: [SCAN], AverageLatency(us), 1482.5882352941176
1000 records: [SCAN], AverageLatency(us), 2339.490566037736
10000 records: [SCAN], AverageLatency(us), 6464.980392156863
100000 records: [SCAN], AverageLatency(us), 42759.489795918365

At first I thought why try and support a scan in SQL in the first place? SQL doesn't really do scans perse, I guess you can simulate one visually, but it is not the same as a scan in a NoSQL database like HBase. But...sounds like people are using it I guess.

In most NoSQL systems scans are useful for retrieving sets of data that are physically adjacent to each other. You just move to the start key and advance across adjacent records. To simulate grabbing adjacent keys you have to do like proposed above, GROUP BY and LIMIT. (This is another reason why it is not really a scan, you have to apply the GROUP BY to way more records then you are actually receiving back, just doesn't seem right.) Either way, that fix suggested will not work quite as expected:

mysql> select * from usertable where YCSB_KEY >= 'user4' group by YCSB_KEY limit 5;
+----------+-----------------+-----------------+-----------------+
| YCSB_KEY | FIELD1          | FIELD2          | FIELD3          |
+----------+-----------------+-----------------+-----------------+
| user4    | user4:field2:18 | user4:field1:18 | user4:field0:18 |
| user40   | user40:field2:2 | user40:field1:2 | user40:field0:2 |
| user400  | user400:field2: | user400:field1: | user400:field0: |
| user401  | user401:field2: | user401:field1: | user401:field0: |
| user402  | user402:field2: | user402:field1: | user402:field0: |
+----------+-----------------+-----------------+-----------------+

As you can see we're dealing with string sorting not numeric sorting. We can do something hackish such as:

mysql> select * from usertable where YCSB_KEY >= 'user4' group by substring(YCSB_KEY, 5, 100)*1 limit 5;
+----------+-----------------+-----------------+-----------------+
| YCSB_KEY | FIELD1          | FIELD2          | FIELD3          |
+----------+-----------------+-----------------+-----------------+
| user4    | user4:field2:18 | user4:field1:18 | user4:field0:18 |
| user5    | user5:field2:91 | user5:field1:91 | user5:field0:91 |
| user6    | user6:field2:-1 | user6:field1:-1 | user6:field0:-1 |
| user7    | user7:field2:77 | user7:field1:77 | user7:field0:77 |
| user8    | user8:field2:-1 | user8:field1:-1 | user8:field0:-1 |
+----------+-----------------+-----------------+-----------------+

Those SQL examples were made with insertorder=ordered, if that was not true then we would just be grabbing keys that ended up next to each other after hashing. In context that scan result would not really mean anything.

So, that was sort of a convoluted way of saying, my preference would be to actually leave scan() as unimplemented, SQL systems don't really have a scan functionality. But we can visually simulate a scan, I think I would use a query like:

select * from usertable where YCSB_KEY >= 'user4' group by substring(YCSB_KEY, 5, 100)*1 limit 5;

so that we grab numerically sorted keys that are adjacent to each other. The relevancy of the result would depend on insertorder but since we are sort of arbitrarily sorting anyway, I don't think it matters at that point.

busbey · 2015-10-31T09:53:41Z

I believe there are other datastore bindings that don't implement scan. I think they just throw UnsupportedOperationException. Can you put something together to have JDBC behave the same?

kruthar · 2015-11-01T00:29:30Z

Yes, I would have thought that would be the proper way to do it too. There are about ten or so bindings that do not implement scan, they are not consistent on handling it but it looks like the general consensus is to log a relevant message then return failure.

This seems ok to me, the user is informed in the log as well as in the final results that something is wrong with scan.

I'm happy to make this change, I thought it might be considered a breaking change, since people are/were using it, but then again it is buggy so...

PR to follow.

cmccoy · 2015-11-01T16:29:51Z

Out of curiosity, do you have an index on YCSB_KEY?

kruthar · 2015-11-01T16:59:51Z

In the README I pushed as well as in the example schema files in the resources folder YCSB_KEY is specified as the PRIMARY KEY which gives it an index as far as I know.

cmccoy · 2015-11-01T17:01:53Z

Oh, of course. Thanks.

kruthar · 2015-11-02T14:26:51Z

A little more time to think on this one. A lot of my reasoning for wanting to make the scan operation unimplemented were about contextual differences an implementation and what the results returned would really mean to an application. I'm not sure that is the right way to think about this. Since YCSB is a performance testing benchmark if the scan operation CAN be implemented it probably should.

SELECT * is essentially a scan so just limiting it with LIMIT as per early suggestions on this thread is probably the best course to go here. Since we are only ever scanning over the primary keys which are indexed, we may not need any type of GROUP BY or ORDER BY, just a LIMIT.

In context this is essentially a random seek of records which may not mean anything to an application. But that part isn't important, what's important is that the performance of that action can be measured, so that a user has an idea of how a slightly altered operation that does makes sense to an application will perform.

Going to drop my current PR in favor of a new one reflecting this.

Now properly enforcing a scan of the given number of documents in the scan operation using the LIMIT modifier in the cached statement. Fixes brianfrankcooper#171.

cmatser · 2015-11-02T22:55:52Z

I concur with this. select * with a limit feels like an appropriate implementation.

Now properly enforcing a scan of the given number of documents in the scan operation using the LIMIT modifier in the cached statement. Fixes brianfrankcooper#171.

westonplatter added the jdbc label May 30, 2014

kruthar mentioned this issue Oct 31, 2015

YCSB needs to output error correct when running into OutofMemory errors #176

Closed

This was referenced Nov 1, 2015

[jdbc] made scan operation unimplemented #463

Closed

YCSB java out of memory error with mysql jdbc scan #196

Closed

very low throughputs of sharded mysql with YCSB #139

Closed

kruthar added a commit to kruthar/YCSB that referenced this issue Nov 2, 2015

[jdbc] added LIMIT to JDBC scan operation

cffe84a

Now properly enforcing a scan of the given number of documents in the scan operation using the LIMIT modifier in the cached statement. Fixes brianfrankcooper#171.

kruthar mentioned this issue Nov 2, 2015

[jdbc] added LIMIT to JDBC scan operation #468

Merged

cmccoy closed this as completed in #468 Nov 3, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The JDBC database interface layer constructs incomplete scan query #171

The JDBC database interface layer constructs incomplete scan query #171

arnaudsjs commented May 20, 2014

renep123 commented Nov 17, 2014

arnaudsjs commented Nov 17, 2014

renep123 commented Nov 17, 2014

renep123 commented Nov 17, 2014

busbey commented Oct 31, 2015

kruthar commented Oct 31, 2015

kruthar commented Oct 31, 2015

busbey commented Oct 31, 2015

kruthar commented Nov 1, 2015

cmccoy commented Nov 1, 2015

kruthar commented Nov 1, 2015

cmccoy commented Nov 1, 2015

kruthar commented Nov 2, 2015

cmatser commented Nov 2, 2015

The JDBC database interface layer constructs incomplete scan query #171

The JDBC database interface layer constructs incomplete scan query #171

Comments

arnaudsjs commented May 20, 2014

renep123 commented Nov 17, 2014

arnaudsjs commented Nov 17, 2014

renep123 commented Nov 17, 2014

renep123 commented Nov 17, 2014

busbey commented Oct 31, 2015

kruthar commented Oct 31, 2015

kruthar commented Oct 31, 2015

busbey commented Oct 31, 2015

kruthar commented Nov 1, 2015

cmccoy commented Nov 1, 2015

kruthar commented Nov 1, 2015

cmccoy commented Nov 1, 2015

kruthar commented Nov 2, 2015

cmatser commented Nov 2, 2015