Goals
Put control of slice allocation and offset management in the slicer (a sketch of what slicer-managed slices could look like follows this list)
Enable retries under the control of the slicer
Make kafka jobs recoverable
Enable the reader to work for both once and persistent jobs
Put partition rebalancing under the control of the slicer
Ideally we can handle rebalance scenarios without disrupting all the workers.
Maintain the behavior of sticky partitions so that all work for a partition goes to the same worker.
This should be the default, but if a job doesn't have this requirement it should be possible to disable it for maximum throughput.
Ideally, have the option for the job to use more workers than partitions.
This will only be useful if the partition -> worker mapping is optional and the job doesn't care about the order of the data.
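As a rough illustration of the first goal, the slicer could hand out explicit partition/offset ranges as slice requests, so offset management never depends on Kafka consumer-group commits. Everything below (the `KafkaSliceRequest` shape, the `PartitionSlicer` class, and its field names) is a hypothetical sketch, not an existing Teraslice or kafka reader API.

```ts
// Hypothetical sketch: the slicer owns the offsets and emits explicit
// partition/offset ranges as slice requests.
interface KafkaSliceRequest {
    topic: string;
    partition: number;
    startOffset: number;   // first offset the worker should read
    count: number;         // requested record count (may be short on compacted topics)
}

class PartitionSlicer {
    private nextOffset = new Map<number, number>();

    constructor(private topic: string, private size: number) {}

    // Seed the starting offset for a partition, e.g. from the low watermark
    // or from the last recorded slice in the state store.
    seed(partition: number, offset: number): void {
        this.nextOffset.set(partition, offset);
    }

    // Produce the next slice for a partition and advance the tracked offset.
    slice(partition: number): KafkaSliceRequest {
        const start = this.nextOffset.get(partition) ?? 0;
        this.nextOffset.set(partition, start + this.size);
        return { topic: this.topic, partition, startOffset: start, count: this.size };
    }
}
```

With slices shaped like this, sticky partitions become a simple assignment table inside the slicer (partition -> worker); disabling stickiness for maximum throughput would just mean handing the next slice for any partition to whichever worker asks.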
Assumptions
Offset storage will be under Teraslice's control
Ideally stored in the state record for each slice (an illustrative record shape follows this list).
Slice sizes may not be exact, especially with compacted topics.
If offset storage is in state then restarting a job and picking up where it left off will require the use of _recover instead of _state.
This is something we need to carefully consider as it will be very easy to incorrectly restart a job and have it unintentionally begin processing from the beginning.
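To make the offsets-in-state assumption concrete, a slice's state record could carry the request itself plus a completion status. The field names below are assumptions for illustration, not Teraslice's actual state schema.

```ts
// Illustrative shape only; not Teraslice's real state schema.
interface SliceStateRecord {
    slice_id: string;
    state: 'pending' | 'completed' | 'error';
    request: {
        topic: string;
        partition: number;
        startOffset: number;  // first offset covered by this slice
        count: number;        // requested size; actual records may be fewer on compacted topics
    };
}
```

With records like this, a restarted slicer could seed its starting offsets from the highest completed slice per partition rather than relying on Kafka's committed offsets, which is exactly why the _recover vs _state question above matters.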
Concerns
Handling of worker failures / lockups.
The slicer will have to be able to detect various failure scenarios and rebalance the partition assignments to other workers.
Finding the offsets to process.
Kafka doesn't make it easy to know what the offsets are without reading data, and in the case of compacted topics it isn't really possible to know how many records will actually exist for a given offset range.
For regular topics it should be possible to get the high/low watermarks for each partition (see the sketch after this list).
Slicer reading data.
It's unclear whether the kafka library can work without actually reading data from the cluster. It would be very undesirable for the slicer to hold buffers of data even if it doesn't use them.
Reading from missing/invalid offsets.
Offsets can expire, so if a job is stopped for a while there may be invalid offsets stored when it restarts, and the slicer will need to reset cleanly (also covered in the sketch after this list).
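A hedged sketch of the last two concerns, assuming the reader keeps using node-rdkafka: a connected client can query the high/low watermarks for a partition, and a stored offset that has fallen below the low watermark can be clamped so the slicer resets cleanly. The broker address, topic, group id, and stored offset below are placeholders.

```ts
import * as Kafka from 'node-rdkafka';

const client = new Kafka.KafkaConsumer({
    'metadata.broker.list': 'localhost:9092',  // placeholder broker
    'group.id': 'slicer-offset-probe',         // placeholder group id
}, {});

client.on('ready', () => {
    // queryWatermarkOffsets returns { lowOffset, highOffset } for the partition.
    // On compacted topics the gap between them overstates the real record count.
    client.queryWatermarkOffsets('example-topic', 0, 5000, (err, offsets) => {
        if (err) throw err;
        const stored = 1234; // offset recovered from slice state (illustrative)
        // If the stored offset has expired (fallen below the low watermark),
        // clamp it so the slicer resets cleanly instead of failing.
        const start = Math.max(stored, offsets.lowOffset);
        console.log(`partition 0: resume at ${start}, high watermark ${offsets.highOffset}`);
        client.disconnect();
    });
});

client.connect();
```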
A realization I just had ... if the standard process to restart a job becomes based around calling _recover, then any time a job is restarted it will try to replay any failed slices. This is definitely not desirable in most circumstances, especially on persistent jobs. Recover is something you generally want to do only in specific circumstances, and replaying all failed slices on a very long-running job could cause lots of unintended issues.
The default behavior we want on restarting a persistent job is to just resume where it left off, not automatically try to recover failed slices (a sketch of computing that resume point follows).
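As an illustration of that default, the resume point can be derived purely from prior slice state records, skipping failed slices instead of replaying them. The record shape and helper below are hypothetical and assume offsets are stored per slice as described under Assumptions.

```ts
// Hypothetical helper: compute per-partition resume offsets from prior slice
// state records, leaving failed slices alone rather than replaying them.
interface PriorSlice {
    state: string;  // e.g. 'completed', 'error'
    request: { partition: number; startOffset: number; count: number };
}

function resumeOffsets(records: PriorSlice[]): Map<number, number> {
    const resume = new Map<number, number>();
    for (const r of records) {
        if (r.state !== 'completed') continue;  // skip failed/pending slices
        // Resume after the offset range this slice covered, even if fewer
        // records actually existed in it (compacted topics).
        const end = r.request.startOffset + r.request.count;
        if (end > (resume.get(r.request.partition) ?? 0)) {
            resume.set(r.request.partition, end);
        }
    }
    return resume;
}
```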