Update readme for parallel queries (#54)
* update readme for parallel queries

* Update README.md

* add parallel query recommendations.

* be specific about dedicated replicas for parallel
anish749 authored and labianchin committed Apr 16, 2019
1 parent 3d3378f commit 0bc9335
- `--partitionPeriod`: the period in which dbeam runs, used to filter based on the current partition and to check whether the execution is running for a partition that is too old
- `--skipPartitionCheck`: when the partition column is not specified, by default fail when the partition parameter is not too old; use this to avoid that behavior
- `--minPartitionPeriod`: the minimum partition required for the job not to fail (when partition column is not specified), by default `now() - 2*partitionPeriod`
- `--queryParallelism`: number of queries to generate and run in parallel. Generates a single query if nothing is specified. Must be used with `--splitColumn` defined.
- `--splitColumn`: a long / integer column used to determine the bounds for generating parallel queries. Must be used with `--queryParallelism` defined.
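As a worked example of the partition check above, the default `--minPartitionPeriod` of `now() - 2*partitionPeriod` can be sketched as follows (a minimal sketch assuming a daily `partitionPeriod`; the helper name is hypothetical, not part of dbeam):

```python
from datetime import date, timedelta

def default_min_partition(today: date, partition_period_days: int = 1) -> date:
    """Default --minPartitionPeriod: now() - 2 * partitionPeriod."""
    return today - timedelta(days=2 * partition_period_days)

# With a daily period, an execution on 2019-04-16 requires the
# partition to be no older than 2019-04-14.
print(default_min_partition(date(2019, 4, 16)))  # -> 2019-04-14
```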

#### DBeam Parallel Mode

This is a pre-alpha feature currently under development and experimentation.

The read queries dbeam uses to extract data generally don't place any locks, so multiple read queries
can run in parallel. When running in parallel mode with `--queryParallelism` specified, dbeam uses the
`--splitColumn` argument to find the minimum and maximum values in that column. These are then used
as range bounds for generating `queryParallelism` queries, which run in parallel to read the data.
Because the `splitColumn` is used to calculate the query bounds, and dbeam needs to calculate intermediate
bounds for each query, the column must be of long / integer type.

It is assumed that the values in the `splitColumn` are distributed sufficiently evenly and sequentially. For example, if the range between the minimum and maximum of the split column is divided equally into `queryParallelism` parts, each part should contain approximately the same number of records. Skew in this distribution results in straggling queries and hence won't provide much improvement. Sequentially stored records help the queries run faster by reducing random disk seeks.
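The bounds splitting described above can be sketched as follows (a minimal illustration, not dbeam's actual implementation; the table and column names are placeholders):

```python
def split_queries(table: str, split_column: str, min_val: int,
                  max_val: int, parallelism: int) -> list:
    """Divide [min_val, max_val] into `parallelism` ranges and emit one
    bounded read query per range."""
    step = max((max_val - min_val) // parallelism, 1)
    queries, lower = [], min_val
    for i in range(parallelism):
        last = i == parallelism - 1
        # The last range is closed on max_val so no rows are dropped.
        upper = max_val if last else lower + step
        op = "<=" if last else "<"
        queries.append(f"SELECT * FROM {table} WHERE {split_column} >= {lower}"
                       f" AND {split_column} {op} {upper}")
        lower = upper
    return queries

# 4 roughly equal ranges over ids 0..100
for q in split_queries("events", "id", 0, 100, 4):
    print(q)
```

With an evenly distributed `id`, each of the four queries reads about a quarter of the rows; with a skewed distribution, one range would hold most of the rows and its query would straggle, as noted above.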

Recommended usage:
Beam runs each query generated by DBeam on one dedicated vCPU (when running with the Dataflow runner), so for best performance the total number of vCPUs available to a given job should equal the specified `queryParallelism`. Hence, if the Dataflow `workerMachineType` is `n1-standard-w` and `numWorkers` is `n`, then `queryParallelism` `q` should be a multiple of `n*w`, and the job is fastest when `q = n * w`.
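The sizing rule above comes down to simple arithmetic (the machine type and worker count below are illustrative, not recommendations):

```python
def total_vcpus(num_workers: int, vcpus_per_worker: int) -> int:
    # For best performance, queryParallelism should equal (or be a
    # multiple of) the job's total vCPU count: q = n * w.
    return num_workers * vcpus_per_worker

# e.g. 4 workers of n1-standard-4 -> queryParallelism of 16
print(total_vcpus(num_workers=4, vcpus_per_worker=4))  # -> 16
```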

For an export of a table from a dedicated PostgreSQL replica, we have seen the best performance, in both vCPU time and wall time, with a `queryParallelism` of 16. Increasing `queryParallelism` further raises the vCPU time without offering much gain in the wall time of the complete export. It is probably good to start experimenting with a `queryParallelism` of 16 or less.

## Building
