Update readme for parallel queries (#54)
* update readme for parallel queries

* Update README.md

* add parallel query recommendations.

* be specific about dedicated replicas for parallel
anish749 authored and labianchin committed Apr 16, 2019
1 parent 3d3378f commit 0bc9335
- `--partitionPeriod`: the period in which dbeam runs, used to filter based on the current partition and to check whether the execution is running for a partition that is too old
- `--skipPartitionCheck`: when the partition column is not specified, by default fail when the partition parameter is not too old; use this to avoid that behavior
- `--minPartitionPeriod`: the minimum partition required for the job not to fail (when partition column is not specified), by default `now() - 2*partitionPeriod`
- `--queryParallelism`: number of queries to generate and run in parallel. Generates a single query if nothing is specified. Must be used with `--splitColumn` defined.
- `--splitColumn`: a long / integer column used to determine the bounds for generating parallel queries. Must be used with `--queryParallelism` defined.
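As a worked example of the partition check above, the default `--minPartitionPeriod` of `now() - 2*partitionPeriod` can be sketched as follows (a minimal sketch assuming a daily `partitionPeriod`; the helper name is hypothetical, not part of dbeam):

```python
from datetime import date, timedelta

def default_min_partition(today: date, partition_period_days: int = 1) -> date:
    """Default --minPartitionPeriod: now() - 2 * partitionPeriod."""
    return today - timedelta(days=2 * partition_period_days)

# With a daily period, an execution on 2019-04-16 requires the
# partition to be no older than 2019-04-14.
print(default_min_partition(date(2019, 4, 16)))  # -> 2019-04-14
```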

#### DBeam Parallel Mode

This is a pre-alpha feature currently under development and experimentation.

The read queries dbeam uses to extract data generally don't place any locks, so multiple read queries
can run in parallel. When running in parallel mode with `--queryParallelism` specified, dbeam uses the
`--splitColumn` argument to find the minimum and maximum values in that column. These are then used
as range bounds for generating `queryParallelism` queries, which run in parallel to read the data.
Because the `splitColumn` is used to calculate the query bounds, and dbeam needs to calculate intermediate
bounds for each query, the column must be of long / integer type.

It is assumed that the values in the `splitColumn` are distributed sufficiently evenly and sequentially. For example, if the range between the minimum and maximum of the split column is divided equally into `queryParallelism` parts, each part should contain approximately the same number of records. Skew in this distribution results in straggling queries and hence won't provide much improvement. Sequentially stored records help the queries run faster by reducing random disk seeks.
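The bounds splitting described above can be sketched as follows (a minimal illustration, not dbeam's actual implementation; the table and column names are placeholders):

```python
def split_queries(table: str, split_column: str, min_val: int,
                  max_val: int, parallelism: int) -> list:
    """Divide [min_val, max_val] into `parallelism` ranges and emit one
    bounded read query per range."""
    step = max((max_val - min_val) // parallelism, 1)
    queries, lower = [], min_val
    for i in range(parallelism):
        last = i == parallelism - 1
        # The last range is closed on max_val so no rows are dropped.
        upper = max_val if last else lower + step
        op = "<=" if last else "<"
        queries.append(f"SELECT * FROM {table} WHERE {split_column} >= {lower}"
                       f" AND {split_column} {op} {upper}")
        lower = upper
    return queries

# 4 roughly equal ranges over ids 0..100
for q in split_queries("events", "id", 0, 100, 4):
    print(q)
```

With an evenly distributed `id`, each of the four queries reads about a quarter of the rows; with a skewed distribution, one range would hold most of the rows and its query would straggle, as noted above.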

Recommended usage:
Beam runs each query generated by DBeam on one dedicated vCPU (when running with the Dataflow runner), so for best performance the total number of vCPUs available to a given job should equal the specified `queryParallelism`. Hence, if the Dataflow `workerMachineType` is `n1-standard-w` and `numWorkers` is `n`, then `queryParallelism` `q` should be a multiple of `n*w`, and the job is fastest when `q = n * w`.
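The sizing rule above comes down to simple arithmetic (the machine type and worker count below are illustrative, not recommendations):

```python
def total_vcpus(num_workers: int, vcpus_per_worker: int) -> int:
    # For best performance, queryParallelism should equal (or be a
    # multiple of) the job's total vCPU count: q = n * w.
    return num_workers * vcpus_per_worker

# e.g. 4 workers of n1-standard-4 -> queryParallelism of 16
print(total_vcpus(num_workers=4, vcpus_per_worker=4))  # -> 16
```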

For an export of a table from a dedicated PostgreSQL replica, we have seen the best performance, in both vCPU time and wall time, with a `queryParallelism` of 16. Increasing `queryParallelism` further raises the vCPU time without offering much gain in the wall time of the complete export. It is probably good to start experimenting with a `queryParallelism` of 16 or less.

## Building
