Slicer controlled reader #21

Open

kstaken opened this issue Nov 27, 2017 · 1 comment

kstaken commented Nov 27, 2017

Goals

  • Put control of slice allocation and offset management in the slicer (a rough sketch follows this list)
  • Enable retries under the control of the slicer
  • Make kafka jobs recoverable
  • Enable the reader to work for both once and persistent jobs
  • Put partition rebalancing under the control of the slicer
    • Ideally we can handle rebalance scenarios without disrupting all the workers.
  • Maintain the behavior of sticky partitions so that all work for a partition goes to the same worker.
    • This should be the default, but if a job doesn't have this requirement it should be possible to disable it for maximum throughput.
  • Ideally, have the option for the job to use more workers than partitions.
    • This will only be useful if the partition -> worker mapping is optional and the job doesn't care about the order of the data.
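
A minimal sketch of how the slice-allocation and sticky-partition goals might fit together, assuming a hypothetical `KafkaSlicer` that tracks per-partition watermarks, hands out offset-range slices, and makes the partition -> worker mapping optional. None of these names come from the Teraslice or Kafka APIs:

```typescript
// Hypothetical sketch only: a slicer that owns partition -> worker assignment
// and hands out offset-range slices instead of letting Kafka manage offsets.

interface Slice {
    topic: string;
    partition: number;
    startOffset: number;   // inclusive
    endOffset: number;     // exclusive; actual record count may be smaller on compacted topics
}

interface PartitionState {
    partition: number;
    nextOffset: number;    // next offset the slicer will hand out
    highWatermark: number; // latest known end of the partition
    workerId?: string;     // sticky assignment; undefined until the first slice is handed out
}

class KafkaSlicer {
    private partitions = new Map<number, PartitionState>();

    constructor(private topic: string, private sliceSize: number) {}

    updatePartition(partition: number, lowWatermark: number, highWatermark: number): void {
        const existing = this.partitions.get(partition);
        if (existing) {
            existing.highWatermark = highWatermark;
        } else {
            this.partitions.set(partition, { partition, nextOffset: lowWatermark, highWatermark });
        }
    }

    // Produce the next slice for a worker. With sticky partitions enabled the
    // same worker always receives slices for the partitions it already owns.
    nextSlice(workerId: string, sticky = true): Slice | null {
        for (const state of this.partitions.values()) {
            if (sticky && state.workerId && state.workerId !== workerId) continue;
            if (state.nextOffset >= state.highWatermark) continue;

            state.workerId = state.workerId ?? workerId;
            const startOffset = state.nextOffset;
            const endOffset = Math.min(startOffset + this.sliceSize, state.highWatermark);
            state.nextOffset = endOffset;
            return { topic: this.topic, partition: state.partition, startOffset, endOffset };
        }
        return null; // nothing available for this worker right now
    }
}
```

With a shape like this, disabling the sticky mapping is just a matter of calling `nextSlice(workerId, false)`, which is what would let a job run more workers than partitions when ordering doesn't matter.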

Assumptions

  • Offset storage will be under Teraslice's control
    • Ideally stored in the state record for each slice (a possible record shape is sketched after this list).
  • Slice sizes may not be exact, especially with compacted topics.
  • If offset storage is in state, then restarting a job and picking up where it left off will require the use of _recover instead of _state.
    • This is something we need to consider carefully, as it will be very easy to incorrectly restart a job and have it unintentionally begin processing from the beginning.
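
If offsets do end up in slice state records, the record might look something like the following. The field names are illustrative, not the actual Teraslice state schema; the point is that a resume can be derived from completed slices alone:

```typescript
// Hypothetical shape of a per-slice state record when offsets live in
// Teraslice state rather than in Kafka consumer offsets.
interface SliceStateRecord {
    slice_id: string;
    status: 'pending' | 'completed' | 'failed';
    topic: string;
    partition: number;
    start_offset: number;
    end_offset: number;     // requested range; real record count may differ on compacted topics
    records_read?: number;  // filled in on completion
    _created: string;
    _updated: string;
}

// On a resume (not a recover) the slicer would scan for the highest completed
// end_offset per partition and continue from there, ignoring failed slices.
function resumeOffsets(records: SliceStateRecord[]): Map<number, number> {
    const resume = new Map<number, number>();
    for (const rec of records) {
        if (rec.status !== 'completed') continue;
        const current = resume.get(rec.partition) ?? 0;
        resume.set(rec.partition, Math.max(current, rec.end_offset));
    }
    return resume;
}
```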

Concerns

  • Handling of worker failures / lockups.
    • The slicer will have to be able to detect various failure scenarios and rebalance the partition assignments to other workers.
  • Finding the offsets to process.
    • Kafka doesn't make it easy to know what the offsets are without reading data, and in the case of compacted topics it's not possible to know how many records will actually exist for a given offset range.
    • For regular topics it should be possible to get the high/low watermarks (see the sketch after this list).
  • Slicer reading data.
    • It's unclear if the Kafka library can work without actually reading data from the cluster. It would be very undesirable for the slicer to hold buffers of data even if it doesn't use them.
  • Reading from missing/invalid offsets.
    • Offsets can expire, so if a job is stopped for a while there may be invalid stored offsets when it restarts, and the slicer will need to reset cleanly.
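
A hedged sketch of how the slicer could find the processable range and reset cleanly on expired offsets, assuming node-rdkafka's `queryWatermarkOffsets()` (which reports the low/high watermarks for a partition). The clamping logic itself is just an illustration, not an existing Teraslice behavior:

```typescript
import * as Kafka from 'node-rdkafka';

// Promisified lookup of the low/high watermarks for a single partition.
function getWatermarks(
    client: Kafka.KafkaConsumer,
    topic: string,
    partition: number
): Promise<{ lowOffset: number; highOffset: number }> {
    return new Promise((resolve, reject) => {
        client.queryWatermarkOffsets(topic, partition, 5000, (err, offsets) => {
            if (err) return reject(err);
            resolve(offsets);
        });
    });
}

// If the stored offset has expired (fallen below the low watermark), reset
// cleanly to the earliest still-available offset instead of failing the job.
async function clampStoredOffset(
    client: Kafka.KafkaConsumer,
    topic: string,
    partition: number,
    storedOffset: number
): Promise<number> {
    const { lowOffset, highOffset } = await getWatermarks(client, topic, partition);
    if (storedOffset < lowOffset) return lowOffset;   // offset expired, reset to earliest
    if (storedOffset > highOffset) return highOffset; // ahead of the log, wait for new data
    return storedOffset;
}
```

Note that watermarks only bound the offset range; on compacted topics they still say nothing about how many records actually remain inside that range.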

Teraslice Change Dependencies

kstaken commented Nov 29, 2017

A realization I just had ... if the standard process to restart a job becomes based around calling _recover, then any time a job is restarted it will try to replay any failed slices. This is definitely not desirable in most circumstances, especially on persistent jobs. Recover is something you generally only want to do in specific circumstances, and replaying all failed slices on a very long-running job could cause lots of unintended issues.

The default behavior we want on restarting a persistent job is to just resume where it left off but not automatically try to recover failed slices.
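
To make the distinction concrete, a small illustrative sketch (names hypothetical, not the Teraslice API): a plain restart resumes from stored offsets and leaves failed slices alone, while an explicit _recover is the only path that replays them:

```typescript
type RestartMode = 'resume' | 'recover';

interface SliceStatus {
    slice_id: string;
    status: 'pending' | 'completed' | 'failed';
}

function slicesToReplay(records: SliceStatus[], mode: RestartMode): SliceStatus[] {
    if (mode === 'resume') {
        // Default for persistent jobs: don't touch failed slices, just continue.
        return [];
    }
    // Explicit recover: replay anything that never completed.
    return records.filter((rec) => rec.status !== 'completed');
}
```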
