-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The JDBC database interface layer constructs incomplete scan query #171
Comments
Hi Thanks in advance! |
You have to adjust the query construction within the following file YCSB/jdbc/src/main/java/com/yahoo/ycsb/db/JdbcDBClient.java (createAndCacheScanStatement and scan method) in the same way as shown here: arnaudsjs/YCSB-mysql-cluster-binding@eaf9542. Of course ignoring the setReadOnly calls. The LIMIT keyword selects only a limited amount of rows, which will prevent the "Java memory out of space error". Let me know if there are any further issues. |
Thanks a lot for your reply Arnaud!! I changed the file JdbcClient.java as suggested by you and recompiled the ycsb from source. When I run the scan workloade now, I can see that the mysql query is now modified using the ORDER BY and LIMIT Keywords. But I am not able to understand how is the LIMIT parameter decided in the code, ideally it should be the maxscanlength parameter. Also, my workload output (ycsb's client) looks as follows:- What's cleanup message above, I would imagine it to be all scans, I am not sure if I am doing the right thing. Thanks again for your help. |
Actually I can see a bunch of scan messages following the cleanup messages. What do the cleanup messages correspond to? Thanks |
Any idea if this is still a problem @kruthar ? |
@busbey, I'm looking into this one and a few other open JDBC issues. If I can reproduce this I'll look into a fix, but at this point I'm not sure if it is still an issue. |
So, this is an interesting issue. First off, it does seem to be a problem still, increasing record count increases scan latency dramatically: 100 records: [SCAN], AverageLatency(us), 1482.5882352941176 At first I thought why try and support a scan in SQL in the first place? SQL doesn't really do scans perse, I guess you can simulate one visually, but it is not the same as a scan in a NoSQL database like HBase. But...sounds like people are using it I guess. In most NoSQL systems scans are useful for retrieving sets of data that are physically adjacent to each other. You just move to the start key and advance across adjacent records. To simulate grabbing adjacent keys you have to do like proposed above, GROUP BY and LIMIT. (This is another reason why it is not really a scan, you have to apply the GROUP BY to way more records then you are actually receiving back, just doesn't seem right.) Either way, that fix suggested will not work quite as expected: mysql> select * from usertable where YCSB_KEY >= 'user4' group by YCSB_KEY limit 5;
+----------+-----------------+-----------------+-----------------+
| YCSB_KEY | FIELD1 | FIELD2 | FIELD3 |
+----------+-----------------+-----------------+-----------------+
| user4 | user4:field2:18 | user4:field1:18 | user4:field0:18 |
| user40 | user40:field2:2 | user40:field1:2 | user40:field0:2 |
| user400 | user400:field2: | user400:field1: | user400:field0: |
| user401 | user401:field2: | user401:field1: | user401:field0: |
| user402 | user402:field2: | user402:field1: | user402:field0: |
+----------+-----------------+-----------------+-----------------+ As you can see we're dealing with string sorting not numeric sorting. We can do something hackish such as: mysql> select * from usertable where YCSB_KEY >= 'user4' group by substring(YCSB_KEY, 5, 100)*1 limit 5;
+----------+-----------------+-----------------+-----------------+
| YCSB_KEY | FIELD1 | FIELD2 | FIELD3 |
+----------+-----------------+-----------------+-----------------+
| user4 | user4:field2:18 | user4:field1:18 | user4:field0:18 |
| user5 | user5:field2:91 | user5:field1:91 | user5:field0:91 |
| user6 | user6:field2:-1 | user6:field1:-1 | user6:field0:-1 |
| user7 | user7:field2:77 | user7:field1:77 | user7:field0:77 |
| user8 | user8:field2:-1 | user8:field1:-1 | user8:field0:-1 |
+----------+-----------------+-----------------+-----------------+ Those SQL examples were made with insertorder=ordered, if that was not true then we would just be grabbing keys that ended up next to each other after hashing. In context that scan result would not really mean anything. So, that was sort of a convoluted way of saying, my preference would be to actually leave scan() as unimplemented, SQL systems don't really have a scan functionality. But we can visually simulate a scan, I think I would use a query like: select * from usertable where YCSB_KEY >= 'user4' group by substring(YCSB_KEY, 5, 100)*1 limit 5; so that we grab numerically sorted keys that are adjacent to each other. The relevancy of the result would depend on insertorder but since we are sort of arbitrarily sorting anyway, I don't think it matters at that point. |
I believe there are other datastore bindings that don't implement scan. I think they just throw UnsupportedOperationException. Can you put something together to have JDBC behave the same? |
Yes, I would have thought that would be the proper way to do it too. There are about ten or so bindings that do not implement scan, they are not consistent on handling it but it looks like the general consensus is to log a relevant message then return failure. This seems ok to me, the user is informed in the log as well as in the final results that something is wrong with scan. I'm happy to make this change, I thought it might be considered a breaking change, since people are/were using it, but then again it is buggy so... PR to follow. |
Out of curiosity, do you have an index on |
In the README I pushed as well as in the example schema files in the resources folder YCSB_KEY is specified as the PRIMARY KEY which gives it an index as far as I know. |
Oh, of course. Thanks. |
A little more time to think on this one. A lot of my reasoning for wanting to make the scan operation unimplemented were about contextual differences an implementation and what the results returned would really mean to an application. I'm not sure that is the right way to think about this. Since YCSB is a performance testing benchmark if the scan operation CAN be implemented it probably should. SELECT * is essentially a scan so just limiting it with LIMIT as per early suggestions on this thread is probably the best course to go here. Since we are only ever scanning over the primary keys which are indexed, we may not need any type of GROUP BY or ORDER BY, just a LIMIT. In context this is essentially a random seek of records which may not mean anything to an application. But that part isn't important, what's important is that the performance of that action can be measured, so that a user has an idea of how a slightly altered operation that does makes sense to an application will perform. Going to drop my current PR in favor of a new one reflecting this. |
Now properly enforcing a scan of the given number of documents in the scan operation using the LIMIT modifier in the cached statement. Fixes brianfrankcooper#171.
I concur with this. select * with a limit feels like an appropriate implementation. |
Now properly enforcing a scan of the given number of documents in the scan operation using the LIMIT modifier in the cached statement. Fixes brianfrankcooper#171.
Now properly enforcing a scan of the given number of documents in the scan operation using the LIMIT modifier in the cached statement. Fixes brianfrankcooper#171.
When a scan query is composed in the JDBC database interface layer, the resulting query has to following shape:
SELECT * FROM "TABLE_NAME" WHERE "PRIMARY_KEY" >= "START_KEY"
This implementation causes the scan operation to return more and more results when the amount of entries in the database raises during the benchmark. As a consequence the latencies of the scan operation raise too over time (this is how the bug was detected).
I suggest the implementation should be something like the following query:
SELECT * FROM "TABLE_NAME" WHERE "PRIMARY_KEY" >= "START_KEY" ORDER BY "PRIMARY_KEY" LIMIT "MAX_SCAN_LENGTH"
This implementation takes into account the maxscanlength property and only selects the first maxscanlength properties after the startkey (order by primary key). Experiments using the MySQL database show that this implementation makes the latencies of the scan operation constant over time.
The text was updated successfully, but these errors were encountered: