Add --sqlFile argument to support user-provided SQL query #66
This looks like a big change. I will have a look later this week.
But have you considered using SQL views instead of a SQL file (i.e. pointing --table to the view name)?
I think @anish749 can also have a look, since it touches the parallel queries parts.
pom.xml
@@ -126,7 +126,7 @@
<jackson.version>2.9.8</jackson.version>
<slf4j.version>1.7.25</slf4j.version>
<auto-value.version>1.5.3</auto-value.version>
-<guava.version>20.0</guava.version>
+<guava.version>21.0</guava.version>
Is that needed? I think we need to be careful to use version compatible with Beam SDK.
Let's keep this change in a separate PR.
Addressed.
@labianchin The use case I am trying to cover is when a user cannot create view(s) in a DB and has rights to execute a SELECT query only.
/**
 * Wrapper class for raw SQL query (SELECT statement).
 */
public class SqlQueryWrapper implements Serializable {
Generic comment, since this PR changes the way the query is constructed:
Would a query builder pattern make sense?
I am thinking of DbeamQueryBuilder as a class; fromTable and fromQueryFile as smart constructors; withLimit, withSplitColumn and withPartitionColumn as modifiers for the query; and getExtractQueries and getMinMaxSplitQuery to trigger building the actual query and return a String.
The primary aim is to have a better way to compose complex queries by adding things together.
In the implementation, every query modifier forms a new subquery.
val source = OneOf(userProvidedQuery, tableName, currentQueryState) // delegated logic for finding out the source.
withPartitionColumnModifier means SELECT * FROM ($source) s WHERE $partitionColumnModifier
with.... means SELECT * FROM ($source) s WHERE $....
The source is a disjunction which can itself be another query. So the final query would look like:
SELECT * FROM (
  SELECT * FROM (
    user_provided_query
  ) x WHERE partitionColumn > '....'
) y WHERE min(splitColumn) > ... and max(splitColumn) <...
Adding a query modifier puts the previous query in a subquery and queries from it.
This would allow a more structured way to manage modifiers and sources.
Another way might be to use CTEs.
Since we are dealing with RDBMSs here, I think it is safe to assume that query planners would push down the predicates as needed.
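To make the subquery idea above concrete, here is a minimal Java sketch. It is not the code in this PR: the class name SubqueryBuilderSketch, the where and withLimit methods, and the q0, q1 aliases are all hypothetical, and it assumes the source query carries no trailing semicolon. Each modifier wraps the current query in a new derived table.

```java
/** Hypothetical sketch of the "every modifier forms a new subquery" idea. */
final class SubqueryBuilderSketch {
  private final String query; // the query built so far
  private final int depth;    // used only to generate distinct aliases

  private SubqueryBuilderSketch(String query, int depth) {
    this.query = query;
    this.depth = depth;
  }

  /** Source is a plain table. */
  static SubqueryBuilderSketch fromTable(String tableName) {
    return new SubqueryBuilderSketch("SELECT * FROM " + tableName, 0);
  }

  /** Source is a user-provided query (assumed to have no trailing semicolon). */
  static SubqueryBuilderSketch fromSqlQuery(String sql) {
    return new SubqueryBuilderSketch(sql, 0);
  }

  /** Wraps the current query in a subquery and filters it. */
  SubqueryBuilderSketch where(String condition) {
    String wrapped = "SELECT * FROM (" + query + ") q" + depth + " WHERE " + condition;
    return new SubqueryBuilderSketch(wrapped, depth + 1);
  }

  /** Wraps the current query in a subquery and limits the row count. */
  SubqueryBuilderSketch withLimit(long limit) {
    String wrapped = "SELECT * FROM (" + query + ") q" + depth + " LIMIT " + limit;
    return new SubqueryBuilderSketch(wrapped, depth + 1);
  }

  String build() {
    return query;
  }

  public static void main(String[] args) {
    System.out.println(
        fromSqlQuery("SELECT id FROM users").where("id >= 10").withLimit(5).build());
  }
}
```

The derived-table aliases are there because some databases (MySQL, for example) require them, and LIMIT is not portable to every SQL dialect, so a real implementation would likely need per-driver handling.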
@anish749 I addressed some comments.
This PR adds some nice and useful functionality, which is great. It does touch the query builder logic, so we need to test it well. I have mostly skimmed through this PR and need to take a closer look at the logic.
The way we have historically approached this problem is with views that abstract out joins and filters, prune the result to only the columns that need to be extracted, and are then exposed for extraction. In other words, the SQL query doesn't live in a file but in the database as code, which is version managed, and the CI/CD system that manages database migrations also manages the code of the view. That system feels easier to deal with.
I do understand the need for a sqlFile, and it is much more flexible when it comes to extracting: it allows dbeam to be used for ad hoc queries as well. I am not sure it would make version management and deployment of SQL in files easier, though. When we think about deployment, the files would need to go into custom Docker images, since only the file path is passed as an argument.
Hey, thanks for fixing these. @labianchin, can you take a look? Otherwise I can take a look next week. I am away this week with no laptop!
On Sun, 16 Jun 2019 at 16:03, Ruslan Altynnikov wrote:
@anish749 I addressed some of the comments.
private DbeamQueryBuilder(final String sqlQuery) {
  String uppSql = sqlQuery.toUpperCase();
  if (!uppSql.startsWith("SELECT")) {
    throw new IllegalArgumentException("Sql query should start with SELECT");
This check might be too restrictive.
Some databases permit adding settings at the query level, which influence how a query is executed.
E.g. in Postgres:
SET work_mem = '256MB';
SELECT * FROM users ... INNER JOIN .... ;
RESET work_mem;
Ideally we shouldn't be parsing or validating SQL queries.
Also, one can use CTEs, which means the query will most likely not start with SELECT.
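A small Java sketch of why the prefix check rejects valid input. The class and method names are hypothetical; only the `toUpperCase().startsWith("SELECT")` test mirrors the constructor shown in the diff.

```java
/** Hypothetical reproduction of the prefix check under review. */
final class SelectPrefixCheckSketch {
  /** True only if the upper-cased text begins with SELECT, as in the reviewed constructor. */
  static boolean startsWithSelect(String sql) {
    return sql.toUpperCase().startsWith("SELECT");
  }

  public static void main(String[] args) {
    String plain = "SELECT * FROM users";
    String cte = "WITH recent AS (SELECT * FROM users) SELECT * FROM recent";
    String commented = "-- daily export\nSELECT * FROM users";
    String withSettings = "SET work_mem = '256MB';\nSELECT * FROM users;";

    // Only the first query passes; the other three are rejected by the check.
    System.out.println(startsWithSelect(plain));
    System.out.println(startsWithSelect(cte));
    System.out.println(startsWithSelect(commented));
    System.out.println(startsWithSelect(withSettings));
  }
}
```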
Right. Addressed.
I had a few comments. I think we should iterate a little more on this,
have some integration testing on at least postgres and mysql before merging.
    throw new IllegalArgumentException("Sql query missing FROM clause");
  }

  // TODO: may be check that LIMIT is not present;
One way to tackle this is to enclose the user-submitted query inside a subquery or a CTE.
E.g.:
SELECT * FROM ($user_query) u WHERE 1 = 1
Note that we might want to strip off the trailing ; before using this in a subquery and then add the ; again later.
Agree.
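The strip-then-wrap step discussed in this thread could look roughly like the following Java sketch. The class and method names are hypothetical. The pattern here is `[\\s;]+$`; note that inside a character class `|` is a literal pipe, so a pattern written as `[\\s|;]+$` would also strip trailing pipe characters.

```java
/** Hypothetical sketch: strip the trailing semicolon, wrap, then re-append it. */
final class TrailingSemicolonSketch {
  /** Removes trailing whitespace and semicolons so the query can be nested. */
  static String stripTrailing(String sql) {
    return sql.replaceAll("[\\s;]+$", "");
  }

  /** Encloses the user query in a subquery and terminates the outer statement. */
  static String wrap(String userQuery, String condition) {
    return "SELECT * FROM (" + stripTrailing(userQuery) + ") u WHERE " + condition + ";";
  }

  public static void main(String[] args) {
    // A LIMIT in the user query stays inside the subquery and does not clash
    // with anything appended to the outer query.
    System.out.println(wrap("SELECT * FROM users LIMIT 10 ;\n", "1 = 1"));
  }
}
```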
public DbeamQueryBuilder withPartitionCondition(
    String partitionColumn, String startPointIncl, String endPointExcl) {
  sqlBuilder.append(createSqlPartitionCondition(partitionColumn, startPointIncl, endPointExcl));
If we go for subqueries, another way to handle this is:
SELECT * FROM ($current_query) xyz WHERE ${createSqlPartitionCondition(partitionColumn, startPointIncl, endPointExcl)}
  return sqlQuery.replaceAll("[\\s|;]+$", "");
}

public DbeamQueryBuilder withLimit(Optional<Integer> limitOpt) {
We might want to use Long for LIMIT?
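As a rough illustration of the Long suggestion, a limit above Integer.MAX_VALUE (2147483647) cannot be represented with Optional<Integer>. The sketch below uses a hypothetical static helper and is not the builder method from this PR.

```java
import java.util.Optional;

/** Hypothetical sketch of a LIMIT modifier that accepts a Long. */
final class LimitSketch {
  /** Appends a LIMIT clause when a limit is present, otherwise returns the query unchanged. */
  static String withLimit(String sql, Optional<Long> limitOpt) {
    return limitOpt.map(limit -> sql + " LIMIT " + limit).orElse(sql);
  }

  public static void main(String[] args) {
    // 3,000,000,000 rows does not fit in an int.
    System.out.println(withLimit("SELECT * FROM events", Optional.of(3_000_000_000L)));
    System.out.println(withLimit("SELECT * FROM events", Optional.empty()));
  }
}
```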
Sure.
dbeam-core/src/main/java/com/spotify/dbeam/args/DbeamQueryBuilder.java
 * Generates a new query to get MIN/MAX values for splitColumn.
 *
 * @param splitColumn column to use
 * @param minSplitColumnName MIN() column value alias
I have a feeling we can do away with passing minSplitColumnName / maxSplitColumnName and just have it fixed here. I think there should be only one caller for this.
We use them later in the code.
@@ -38,6 +38,11 @@

void setTable(String value);

@Description("A local file containing the SQL SELECT query.")
I think the file reader we have would allow us to read the file from GCS. If so we should update this description.
Addressed.
Updated.
private static Optional<String> resolveSqlQueryParameter(JdbcExportPipelineOptions options)
    throws IOException {
  if (options.getSqlFile() != null) {
    return Optional.ofNullable(new ParameterFileReader().readAsResource(options.getSqlFile()));
Do we need Optional.ofNullable here? When does readAsResource return null? I think a failure should be reported to the user instead of the exception getting swallowed.
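The difference between the two factory methods can be shown with a short Java sketch. The readOrNull method is a hypothetical stand-in for a reader that returns null on failure; whether readAsResource actually behaves that way is exactly the open question in this comment.

```java
import java.util.Optional;

/** Hypothetical sketch contrasting Optional.ofNullable with Optional.of. */
final class OptionalNullSketch {
  /** Stand-in for a reader that signals failure by returning null (an assumption). */
  static String readOrNull(String path) {
    return null;
  }

  /** ofNullable turns the null into an empty Optional, so the failure goes unnoticed. */
  static boolean ofNullableHidesFailure() {
    return !Optional.ofNullable(readOrNull("query.sql")).isPresent();
  }

  /** of throws NullPointerException immediately, so the failure surfaces at the call site. */
  static boolean ofFailsFast() {
    try {
      Optional.of(readOrNull("query.sql"));
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(ofNullableHidesFailure());
    System.out.println(ofFailsFast());
  }
}
```

With ofNullable, a missing file would silently fall back to the behaviour for "no --sqlFile given", which is probably not what the user intended.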
Addressed.
}

@Test
public void testItRemovesTrailingSymbols() {
👍
@@ -78,7 +104,7 @@ class JdbcExportArgsTest extends FlatSpec with Matchers {
}
}
it should "parse correctly with missing password parameter" in {
-val options = optionsFromArgs("--connectionUrl=jdbc:postgresql://some_db --table=some_table")
+val options = optionsFromArgs("--connectionUrl=jdbc:postgresql://some_db --table=" + "some_table")
unintended change?
Yes.
}

boolean isContainsWhere = false;
if (uppSql.toUpperCase().contains("WHERE ")) {
toUpperCase() on something already upper case.
Addressed.
This seems to be in a good direction. Thanks @ra1861 !!
I was initially a bit skeptical on how this feature would behave with other parameters. From your tests I can now see that it can work well.
I left a few comments. Also, please rebase (there are some changes to the h2 driver that required me to change/fix tests, see 0882999).
/**
 * Wrapper class for raw SQL query.
 */
public class DbeamQueryBuilder implements Serializable {
Nice abstraction! It is great to decouple how queries are built. In the future, another DbeamQueryBase can be extended for very specific JDBC drivers (e.g. MS SQL Server).
One nitpick: we don't need to prefix the class names with Dbeam; the package already provides namespacing.
Done.
    .setTableName(tableName)
    .setBaseSqlQuery(baseSqlQuery)
    .setPartitionPeriod(Days.ONE)
    .build();
Could this be like the following:
return create(tableName, Optional.empty());
So we avoid a bit of duplication.
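The delegation suggested here could be sketched in Java as follows. QueryArgsSketch is a hypothetical plain class standing in for the real args type and its builder; only the shape of the two create overloads is the point.

```java
import java.util.Optional;

/** Hypothetical stand-in for the args class, showing one overload delegating to the other. */
final class QueryArgsSketch {
  final String tableName;
  final Optional<String> baseSqlQuery;

  private QueryArgsSketch(String tableName, Optional<String> baseSqlQuery) {
    this.tableName = tableName;
    this.baseSqlQuery = baseSqlQuery;
  }

  /** Table-only variant: delegates instead of repeating the construction logic. */
  static QueryArgsSketch create(String tableName) {
    return create(tableName, Optional.empty());
  }

  /** Full variant: the single place where defaults would be applied. */
  static QueryArgsSketch create(String tableName, Optional<String> baseSqlQuery) {
    return new QueryArgsSketch(tableName, baseSqlQuery);
  }

  public static void main(String[] args) {
    System.out.println(create("some_table").baseSqlQuery.isPresent());
  }
}
```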
Right, done.
-private long[] findInputBounds(Connection connection, String tableName, String partitionCondition,
-    String splitColumn)
+private long[] findInputBounds(
+    Connection connection, DbeamQueryBuilder baseSqlQuery, String splitColumn)
baseSqlQuery -> baseSqlQueryBuilder?
Done.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

-class PasswordReader {
+class PasswordReader extends ParameterFileReader {
No need for extends here.
Just update the calls to ParameterFileReader.readFromFile(...), given this is a static method.
Right, reverted. No need to have this functionality here.
private static Optional<String> resolveSqlQueryParameter(JdbcExportPipelineOptions options)
    throws IOException {
  if (options.getSqlFile() != null) {
    return Optional.of(new ParameterFileReader().readAsResource(options.getSqlFile()));
I am confused here...
Does using readAsResource() mean that users have to bundle the sqlFile into the JAR?
Would it make more sense to use Beam's FileSystems, so that users can point to a gs://some-bucket/some-object.sql in the same way as the password file?
Right, changed code to adhere to existing methods.
    Assert.assertEquals(expected, actual);
  }
}
Thanks for keeping tests extensive! Some other scenarios we could consider:
- Multi line queries
- First line with comments (before SELECT)
- Queries with CTE (don't think we will be able to support)
- ...
Added.
private def queriesForBounds2(
    min: Long, max: Long, parallelism: Int, splitColumn: String, queryFormat: String): java.util.List[String] = {
  val queries = QueryBuilderArgs.queriesForBounds(min, max, parallelism, splitColumn, DbeamQueryBuilder.fromTablename(tablename))
  val q2 = queries.asScala.map(x => x.toString()).toList.asJava
Is this asScala ... asJava necessary? Isn't queries / QueryBuilderArgs.queriesForBounds() already java.util.List[String]?
Tests fixed. Create a dedicated class for sqlQuery. Apply google code formatting.
Create a dedicated class for sqlQuery. Move all SQL mangling into one file. Tests fixed. Apply google code formatting.
Add Builder/like methods withX()/build(). More unit-tests. Refactoring of SQL parameters handling logic.
Add more tests. Restore 'maven-enforcer-plugin.version' value.
Change type of 'limit' parameter. Unit-tests are rewritten.
Add javadocs.
Adjust test to work with absolute file paths.
Create TestHelper class.
Codecov Report
@@ Coverage Diff @@
## master #66 +/- ##
============================================
+ Coverage 89.7% 89.81% +0.11%
- Complexity 177 202 +25
============================================
Files 22 23 +1
Lines 680 766 +86
Branches 52 53 +1
============================================
+ Hits 610 688 +78
- Misses 47 53 +6
- Partials 23 25 +2
The idea was to minimize changes to existing code.
The --table argument remains mandatory, so a table name must be supplied even when the user provides their own query.