Add --sqlFile argument to support user-provided SQL query #66

rulle-io · 2019-06-11T16:24:58Z

The idea was to minimize changes to the existing code.
The --tableName argument remains mandatory, so a table name must be supplied even when the user provides their own query.

rulle-io · 2019-06-12T08:20:10Z

@labianchin

labianchin

This looks like some big changes. I will have a look later this week.

But have you considered using SQL views instead of a SQL file (i.e. pointing --table to the view name)?

I think @anish749 can also have a look since it touches the parallel queries parts.

labianchin · 2019-06-12T09:38:17Z

pom.xml

@@ -126,7 +126,7 @@
        <jackson.version>2.9.8</jackson.version>
        <slf4j.version>1.7.25</slf4j.version>
        <auto-value.version>1.5.3</auto-value.version>
-        <guava.version>20.0</guava.version>
+        <guava.version>21.0</guava.version>


Is that needed? I think we need to be careful to use a version compatible with the Beam SDK.

Let's keep this change in a separate PR.

rulle-io · 2019-06-13T12:16:41Z

@labianchin The use case I am trying to cover is when a user cannot create view(s) in a DB and has rights to execute a SELECT query only.
Moreover, this approach is much more flexible:

- one can join multiple tables
- add custom selection criteria
- select only certain columns

anish749 · 2019-06-13T15:02:04Z

dbeam-core/src/main/java/com/spotify/dbeam/args/SqlQueryWrapper.java

+/**
+ * Wrapper class for raw SQL query (SELECT statement).
+ */
+public class SqlQueryWrapper implements Serializable {


Generic Comment, since this PR changes the way the query is being constructed:

Would a query builder pattern make sense?
I am thinking:
DbeamQueryBuilder as a class,
fromTable and fromQueryFile as Smart Constructors.

withLimit, withSplitColumn, withPartitionColumn as modifiers for the query.

getExtractQueries, and getMinMaxSplitQuery to trigger building the actual query and returning a String.

The primary aim is to have a better way to compose complex queries by adding things together.

In the implementation, every query modifier forms a new subquery.

val source = OneOf(userProvidedQuery, tableName, currentQueryState) // delegated logic for finding out the source.

withPartitionColumnModifier means SELECT * FROM ($source) s WHERE $partitionColumnModifier
with.... means SELECT * FROM ($source) s WHERE $....

The source is a disjunction which can itself be another query, so the final query would look like:

SELECT * FROM ( SELECT * FROM ( user_provided_query ) x WHERE partitionColumn > '....' ) y WHERE min(splitColumn) > ... and max(splitColumn) <...

Adding a query modifier puts the previous query in a subquery and queries from it.
This would allow a more structured way to manage modifiers and sources.

Another way might be to use CTEs.
Since we are dealing with RDBMSs here, I think it is safe to assume that query planners would push down the predicates as needed.
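
To make the idea concrete, here is a minimal sketch of the subquery composition described above. It is not the class in this PR: `SubqueryBuilder`, its method names, and the generated aliases (`s0`, `s1`, ...) are hypothetical, and it assumes the user query has no trailing semicolon.

```java
// Hypothetical sketch: every modifier wraps the current query in a new subquery.
final class SubqueryBuilder {
  private final String query; // current query state
  private final int depth;    // used only to generate a unique alias per subquery

  private SubqueryBuilder(String query, int depth) {
    this.query = query;
    this.depth = depth;
  }

  // Smart constructor for a plain table source.
  static SubqueryBuilder fromTable(String tableName) {
    return new SubqueryBuilder("SELECT * FROM " + tableName, 0);
  }

  // Smart constructor for a user-provided query (assumed to have no trailing ';').
  static SubqueryBuilder fromQuery(String userQuery) {
    return new SubqueryBuilder(userQuery, 0);
  }

  // Adds a predicate by selecting from the previous query as a subquery.
  SubqueryBuilder where(String condition) {
    return new SubqueryBuilder(
        "SELECT * FROM (" + query + ") s" + depth + " WHERE " + condition, depth + 1);
  }

  // LIMIT syntax is not portable across all databases; used here for illustration.
  SubqueryBuilder withLimit(long limit) {
    return new SubqueryBuilder(
        "SELECT * FROM (" + query + ") s" + depth + " LIMIT " + limit, depth + 1);
  }

  String build() {
    return query;
  }

  public static void main(String[] args) {
    System.out.println(
        fromQuery("SELECT id FROM users").where("id > 10").withLimit(5L).build());
  }
}
```

Because each modifier returns a new immutable builder, the same base query could be reused to derive both the extract queries and the min/max query.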

@anish749 I addressed some comments.

anish749 · 2019-06-13T15:11:50Z

This PR adds some nice and useful functionality, which is great.

It does touch the query builder logic, so we might need to test it out well. I need to take a closer look at the logic. I've mostly skimmed through this PR.

The way we have historically approached this problem is by having views that abstract out joins and filters and select only the columns that need to be extracted, and then exposing that view. In other words, the SQL query doesn't live in a file but in the database as code, which is version managed, and the CI/CD system that manages database migrations also manages the code of the view. That system feels easier to deal with.

I do understand the need for a sqlFile, and it is much more flexible when it comes to extracting. It also allows dbeam to be used for ad hoc queries.

I am not sure it would make versioning and deploying SQL in files easier. When we think about deployment, the files would need to go into custom Docker images, since only the file path is passed as an argument.

anish749 · 2019-06-17T06:21:02Z

Hey, thanks for fixing these. @labianchin can you take a look, or I can take a look next week. I am away this week with no laptop!

anish749 · 2019-06-28T08:24:51Z

dbeam-core/src/main/java/com/spotify/dbeam/args/DbeamQueryBuilder.java

+  private DbeamQueryBuilder(final String sqlQuery) {
+    String uppSql = sqlQuery.toUpperCase();
+    if (!uppSql.startsWith("SELECT")) {
+      throw new IllegalArgumentException("Sql query should start with SELECT");


This check might be too restrictive.
Some databases permit adding settings at a query level, which influence how a query is executed.

E.g. in Postgres:

SET work_mem = '256MB'; SELECT * FROM users ... INNER JOIN .... ; RESET work_mem;

Ideally we shouldn't be parsing / validating SQL queries.

Also, one can use CTEs, which means the query will most likely not start with SELECT.
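
A small sketch of why a prefix check like the one above rejects valid input. `PrefixCheck` is a hypothetical stand-in for the constructor check, and the sample queries are made up for illustration.

```java
import java.util.Locale;

// Hypothetical stand-in for the check under review.
final class PrefixCheck {
  // Returns true only when the statement text begins with SELECT (case-insensitive).
  static boolean startsWithSelect(String sql) {
    return sql.trim().toUpperCase(Locale.ROOT).startsWith("SELECT");
  }

  public static void main(String[] args) {
    String plain = "select * from users";
    String cte = "WITH recent AS (SELECT * FROM users) SELECT * FROM recent";
    String withSetting = "SET work_mem = '256MB'; SELECT * FROM users";
    System.out.println(startsWithSelect(plain));       // plain SELECT passes
    System.out.println(startsWithSelect(cte));         // CTE is rejected by the prefix check
    System.out.println(startsWithSelect(withSetting)); // SET ...; SELECT is rejected as well
  }
}
```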

Right. Addressed.

anish749

I had a few comments. I think we should iterate a little more on this
and have some integration testing on at least Postgres and MySQL before merging.

anish749 · 2019-06-28T08:27:39Z

dbeam-core/src/main/java/com/spotify/dbeam/args/DbeamQueryBuilder.java

+      throw new IllegalArgumentException("Sql query missing FROM clause");
+    }
+
+    // TODO: may be check that LIMIT is not present;


One way to tackle this is to enclose the user-submitted query inside a subquery or a CTE.

Eg:

SELECT * FROM ($user_query) u WHERE 1 = 1

Note that we might want to strip off the trailing ; before using this in a sub query and then add the ; again later.
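
A minimal sketch of that suggestion, assuming a hypothetical helper class `UserQueryWrapper` (not part of this PR): strip trailing whitespace and semicolons first, then wrap the result in a subquery.

```java
// Hypothetical helper illustrating the "wrap in a subquery" suggestion.
final class UserQueryWrapper {
  // Removes any run of whitespace and ';' characters at the very end of the query.
  static String stripTrailing(String sql) {
    return sql.replaceAll("[\\s;]+$", "");
  }

  // Wraps the cleaned user query so further conditions can be appended after WHERE 1 = 1.
  static String wrap(String userQuery) {
    return "SELECT * FROM (" + stripTrailing(userQuery) + ") u WHERE 1 = 1";
  }

  public static void main(String[] args) {
    System.out.println(wrap("SELECT * FROM users LIMIT 10;\n"));
  }
}
```

With this shape, a LIMIT inside the user query stays inside the subquery and no longer collides with clauses appended by dbeam.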

anish749 · 2019-06-28T08:29:15Z

dbeam-core/src/main/java/com/spotify/dbeam/args/DbeamQueryBuilder.java

+
+  public DbeamQueryBuilder withPartitionCondition(
+          String partitionColumn, String startPointIncl, String endPointExcl) {
+    sqlBuilder.append(createSqlPartitionCondition(partitionColumn, startPointIncl, endPointExcl));


Another way to handle this, if we go for subqueries, is:

SELECT * FROM ($current_query) xyz WHERE ${createSqlPartitionCondition(partitionColumn, startPointIncl, endPointExcl)}

anish749 · 2019-06-28T08:30:21Z

dbeam-core/src/main/java/com/spotify/dbeam/args/DbeamQueryBuilder.java

+    return sqlQuery.replaceAll("[\\s|;]+$", "");
+  }
+
+  public DbeamQueryBuilder withLimit(Optional<Integer> limitOpt) {


We might want to use Long for LIMIT?

anish749 · 2019-06-28T08:34:03Z

dbeam-core/src/main/java/com/spotify/dbeam/args/DbeamQueryBuilder.java

+   * Generates a new query to get MIN/MAX values for splitColumn.  
+   * 
+   * @param splitColumn column to use
+   * @param minSplitColumnName MIN() column value alias


I have a feeling we can do away with passing minSplitColumnName / maxSplitColumnName and just have it fixed here. I think there should be only one caller for this.
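
One way this could look, as a sketch only: the aliases live in constants that both the query generator and any later reader of the result set can share. `MinMaxQuery`, `MIN_ALIAS`, `MAX_ALIAS`, and the alias values are hypothetical names, not taken from this PR.

```java
// Hypothetical sketch: fixed aliases shared through constants instead of parameters.
final class MinMaxQuery {
  static final String MIN_ALIAS = "min_s";
  static final String MAX_ALIAS = "max_s";

  // Builds a MIN/MAX query over splitColumn, selecting from the base query as a subquery.
  static String forColumn(String baseQuery, String splitColumn) {
    return String.format(
        "SELECT MIN(%s) AS %s, MAX(%s) AS %s FROM (%s) q",
        splitColumn, MIN_ALIAS, splitColumn, MAX_ALIAS, baseQuery);
  }

  public static void main(String[] args) {
    System.out.println(forColumn("SELECT * FROM t", "id"));
  }
}
```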

We use them later in the code.

anish749 · 2019-06-28T08:43:46Z

dbeam-core/src/main/java/com/spotify/dbeam/options/DBeamPipelineOptions.java

@@ -38,6 +38,11 @@

  void setTable(String value);

+  @Description("A local file containing the SQL SELECT query.")


I think the file reader we have would allow us to read the file from GCS. If so, we should update this description.

anish749 · 2019-06-28T08:45:29Z

dbeam-core/src/main/java/com/spotify/dbeam/options/JdbcExportArgsFactory.java

+  private static Optional<String> resolveSqlQueryParameter(JdbcExportPipelineOptions options)
+      throws IOException {
+    if (options.getSqlFile() != null) {
+      return Optional.ofNullable(new ParameterFileReader().readAsResource(options.getSqlFile()));


Do we need Optional.ofNullable here? When does readAsResource return null? I think it should be reported to the user instead of the exception getting swallowed.
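
A sketch of the behaviour being asked for, under stated assumptions: `SqlFileResolver` is hypothetical and reads a local path with `java.nio.file.Files` rather than the reader used in this PR. The point is that a missing option maps to `Optional.empty()`, while a read failure propagates as an `IOException` instead of turning into an empty value.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical stand-in for resolveSqlQueryParameter, using local files only.
final class SqlFileResolver {
  // sqlFilePath is null when the option was not provided.
  static Optional<String> resolve(String sqlFilePath) throws IOException {
    if (sqlFilePath == null) {
      return Optional.empty();
    }
    // readAllBytes throws an IOException for a missing or unreadable file,
    // so the failure reaches the caller rather than being swallowed.
    byte[] bytes = Files.readAllBytes(Paths.get(sqlFilePath));
    return Optional.of(new String(bytes, StandardCharsets.UTF_8));
  }

  public static void main(String[] args) throws IOException {
    System.out.println(resolve(null).isPresent());
  }
}
```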

anish749 · 2019-06-28T08:49:35Z

dbeam-core/src/test/java/com/spotify/dbeam/args/DbeamQueryBuilderTest.java

+  }
+
+  @Test
+  public void testItRemovesTrailingSymbols() {


anish749 · 2019-06-28T08:50:33Z

dbeam-core/src/test/scala/com/spotify/dbeam/args/JdbcExportArgsTest.scala

@@ -78,7 +104,7 @@ class JdbcExportArgsTest extends FlatSpec with Matchers {
    }
  }
  it should "parse correctly with missing password parameter" in {
-    val options = optionsFromArgs("--connectionUrl=jdbc:postgresql://some_db --table=some_table")
+    val options = optionsFromArgs("--connectionUrl=jdbc:postgresql://some_db --table=" + "some_table")


unintended change?

anish749 · 2019-06-28T08:51:59Z

dbeam-core/src/main/java/com/spotify/dbeam/args/DbeamQueryBuilder.java

+    }
+
+    boolean isContainsWhere = false;
+    if (uppSql.toUpperCase().contains("WHERE ")) {


toUpperCase() on something already upper case.

labianchin

This seems to be heading in a good direction. Thanks @ra1861 !!

I was initially a bit skeptical about how this feature would behave with other parameters. From your tests I can now see that it can work well.

I left a few comments. Also, please rebase: there are some changes to the h2 driver for which I had to change/fix tests (see 0882999).

labianchin · 2019-07-08T11:35:49Z

dbeam-core/src/main/java/com/spotify/dbeam/args/DbeamQueryBuilder.java

+/**
+ * Wrapper class for raw SQL query.
+ */
+public class DbeamQueryBuilder implements Serializable {


Nice abstraction! It is great to decouple how queries are built. In the future another DbeamQueryBase can be extended for very specific JDBC drivers (e.g. MS SQL server).

One nitpick: we don't need to prefix the class names with Dbeam, the package already provides namespacing.

labianchin · 2019-07-08T11:38:49Z

dbeam-core/src/main/java/com/spotify/dbeam/args/QueryBuilderArgs.java

+            .setTableName(tableName)
+            .setBaseSqlQuery(baseSqlQuery)
+            .setPartitionPeriod(Days.ONE)
+            .build();


Could this be like the following:

return create(tableName, Optional.empty());

So we avoid a bit of duplication.
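
As a sketch of that delegation pattern, with a hypothetical, heavily simplified `QueryArgsSketch` class in place of the real `QueryBuilderArgs` (which has more fields and uses a generated builder):

```java
import java.util.Optional;

// Hypothetical, simplified stand-in for QueryBuilderArgs.
final class QueryArgsSketch {
  final String tableName;
  final Optional<String> sqlQuery;

  private QueryArgsSketch(String tableName, Optional<String> sqlQuery) {
    this.tableName = tableName;
    this.sqlQuery = sqlQuery;
  }

  // The one-argument overload only delegates, so construction logic lives in one place.
  static QueryArgsSketch create(String tableName) {
    return create(tableName, Optional.empty());
  }

  static QueryArgsSketch create(String tableName, Optional<String> sqlQuery) {
    return new QueryArgsSketch(tableName, sqlQuery);
  }

  public static void main(String[] args) {
    System.out.println(create("some_table").sqlQuery.isPresent());
  }
}
```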

Right, done.

labianchin · 2019-07-08T11:40:08Z

dbeam-core/src/main/java/com/spotify/dbeam/args/QueryBuilderArgs.java

-  private long[] findInputBounds(Connection connection, String tableName, String partitionCondition,
-      String splitColumn)
+  private long[] findInputBounds(
+      Connection connection, DbeamQueryBuilder baseSqlQuery, String splitColumn)


baseSqlQuery -> baseSqlQueryBuilder?

labianchin · 2019-07-08T11:44:20Z

dbeam-core/src/main/java/com/spotify/dbeam/options/PasswordReader.java

 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

-class PasswordReader {
+class PasswordReader extends ParameterFileReader {


No need for extends here.

Just update the calls to ParameterFileReader.readFromFile(...) given this is a static method.

Right, reverted. No need to have this functionality here.

labianchin · 2019-07-08T11:45:43Z

dbeam-core/src/main/java/com/spotify/dbeam/options/JdbcExportArgsFactory.java

+  private static Optional<String> resolveSqlQueryParameter(JdbcExportPipelineOptions options)
+      throws IOException {
+    if (options.getSqlFile() != null) {
+      return Optional.of(new ParameterFileReader().readAsResource(options.getSqlFile()));


I am confused here...

By using readAsResource(), does it mean that users have to bundle the sqlFile into the JAR?

Would it make more sense to use Beam's FileSystems, so that users can point to a gs://some-bucket/some-object.sql in the same way as for the password file?

Right, changed code to adhere to existing methods.

labianchin · 2019-07-08T11:47:49Z

dbeam-core/src/test/java/com/spotify/dbeam/args/DbeamQueryBuilderTest.java

+
+    Assert.assertEquals(expected, actual);
+  }
+}


Thanks for keeping tests extensive! Some other scenarios we could consider:

- Multi-line queries
- First line with comments (before SELECT)
- Queries with CTE (don't think we will be able to support)
- ...

labianchin · 2019-07-08T11:51:28Z

dbeam-core/src/test/scala/com/spotify/dbeam/args/ParallelQueriesTest.scala

+  private def queriesForBounds2(
+      min: Long, max: Long, parallelism: Int, splitColumn: String, queryFormat: String): java.util.List[String] = {
+    val queries = QueryBuilderArgs.queriesForBounds(min, max, parallelism, splitColumn, DbeamQueryBuilder.fromTablename(tablename))
+    val q2 = queries.asScala.map(x => x.toString()).toList.asJava


Is this asScala ... asJava necessary? Isn't queries / QueryBuilderArgs.queriesForBounds() already java.util.List[String]?

codecov · 2019-07-18T22:09:03Z

Codecov Report

Merging #66 into master will increase coverage by 0.11%.
The diff coverage is 93.18%.

@@             Coverage Diff              @@
##             master      #66      +/-   ##
============================================
+ Coverage      89.7%   89.81%   +0.11%     
- Complexity      177      202      +25     
============================================
  Files            22       23       +1     
  Lines           680      766      +86     
  Branches         52       53       +1     
============================================
+ Hits            610      688      +78     
- Misses           47       53       +6     
- Partials         23       25       +2

rulle-io closed this Jun 11, 2019

rulle-io reopened this Jun 11, 2019

labianchin reviewed Jun 12, 2019

anish749 reviewed Jun 13, 2019

anish749 reviewed Jun 28, 2019

labianchin reviewed Jul 8, 2019

rulle-io added 18 commits July 18, 2019 23:51

Introduce new parameter. Test do not work yet.

1704c18

Introduce sqlFIle parameter.

5be5675

Tests fixed. Create a dedicated class for sqlQuery. Apply google code formatting.

Introduce sqlFIle parameter.

b4f82bb

Create a dedicated class for sqlQuery. Move all SQL mangling into one file. Tests fixed. Apply google code formatting.

Move all SQL string manipulation onto one file.

19b6bc9

Add Builder/like methods withX()/build(). More unit-tests. Refactoring of SQL parameters handling logic.

Rename Builder class.

f9eae98

Add more tests. Restore 'maven-enforcer-plugin.version' value.

Add comments.

4413445

Create two different types of query.

ec4f192

Change type of 'limit' parameter. Unit-tests are rewritten.

More tests are added.

24af8ce

Remove Nullable().

ea52797

Add javadocs.

Fix javadocs.

044e113

Fix README.

f03e6e2

Addressed some comments.

5dff4f3

Restored formatting.

fd84108

Adapt to code changes after rebase (spotify@0882999).

8ea4c18

Remove ParameterFileReader class.

415365f

Adjust test to work with absolute file paths.

Rename DbeamQueryBuilder to QueryBuilder.

5fb65bf

Address various PR feedback.

5564f18

Move test SQL statements into temp outside files.

95e1347

Create TestHelper class.

rulle-io force-pushed the master branch from 5d53cf7 to 95e1347 Compare July 18, 2019 22:03

labianchin merged commit 95e1347 into spotify:master Aug 5, 2019

rulle-io mentioned this pull request Aug 6, 2019

Add support for custom WHERE clause to the queries #48

Closed
