Slicer controlled reader #21

Open

kstaken opened this issue Nov 27, 2017 · 1 comment

kstaken commented Nov 27, 2017

Goals

  • Put control of slice allocation and offset management in the slicer (a rough sketch follows this list)
  • Enable retries under the control of the slicer
  • Make kafka jobs recoverable
  • Enable the reader to work for both once and persistent jobs
  • Put partition rebalancing under the control of the slicer
    • Ideally we can handle rebalance scenarios without disrupting all the workers.
  • Maintain the behavior of sticky partitions so that all work for a partition goes to the same worker.
    • This should be the default, but if a job doesn't have this requirement it should be possible to disable it for maximum throughput.
  • Ideally, have the option for the job to use more workers than partitions.
    • This will only be useful if the partition -> worker mapping is optional and the job doesn't care about the order of the data.
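
A minimal sketch of how the slice-allocation and sticky-partition goals might fit together, assuming a hypothetical `KafkaSlicer` that tracks per-partition watermarks, hands out offset-range slices, and makes the partition -> worker mapping optional. None of these names come from the Teraslice or Kafka APIs:

```typescript
// Hypothetical sketch only: a slicer that owns partition -> worker assignment
// and hands out offset-range slices instead of letting Kafka manage offsets.

interface Slice {
    topic: string;
    partition: number;
    startOffset: number;   // inclusive
    endOffset: number;     // exclusive; actual record count may be smaller on compacted topics
}

interface PartitionState {
    partition: number;
    nextOffset: number;    // next offset the slicer will hand out
    highWatermark: number; // latest known end of the partition
    workerId?: string;     // sticky assignment; undefined until the first slice is handed out
}

class KafkaSlicer {
    private partitions = new Map<number, PartitionState>();

    constructor(private topic: string, private sliceSize: number) {}

    updatePartition(partition: number, lowWatermark: number, highWatermark: number): void {
        const existing = this.partitions.get(partition);
        if (existing) {
            existing.highWatermark = highWatermark;
        } else {
            this.partitions.set(partition, { partition, nextOffset: lowWatermark, highWatermark });
        }
    }

    // Produce the next slice for a worker. With sticky partitions enabled the
    // same worker always receives slices for the partitions it already owns.
    nextSlice(workerId: string, sticky = true): Slice | null {
        for (const state of this.partitions.values()) {
            if (sticky && state.workerId && state.workerId !== workerId) continue;
            if (state.nextOffset >= state.highWatermark) continue;

            state.workerId = state.workerId ?? workerId;
            const startOffset = state.nextOffset;
            const endOffset = Math.min(startOffset + this.sliceSize, state.highWatermark);
            state.nextOffset = endOffset;
            return { topic: this.topic, partition: state.partition, startOffset, endOffset };
        }
        return null; // nothing available for this worker right now
    }
}
```

With a shape like this, disabling the sticky mapping is just a matter of calling `nextSlice(workerId, false)`, which is what would let a job run more workers than partitions when ordering doesn't matter.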

Assumptions

  • Offset storage will be under Teraslice's control
    • Ideally stored in the state record for each slice (a possible record shape is sketched after this list).
  • Slice sizes may not be exact, especially with compacted topics.
  • If offset storage is in state, then restarting a job and picking up where it left off will require the use of _recover instead of _state.
    • This is something we need to consider carefully, as it will be very easy to incorrectly restart a job and have it unintentionally begin processing from the beginning.
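
If offsets do end up in slice state records, the record might look something like the following. The field names are illustrative, not the actual Teraslice state schema; the point is that a resume can be derived from completed slices alone:

```typescript
// Hypothetical shape of a per-slice state record when offsets live in
// Teraslice state rather than in Kafka consumer offsets.
interface SliceStateRecord {
    slice_id: string;
    status: 'pending' | 'completed' | 'failed';
    topic: string;
    partition: number;
    start_offset: number;
    end_offset: number;     // requested range; real record count may differ on compacted topics
    records_read?: number;  // filled in on completion
    _created: string;
    _updated: string;
}

// On a resume (not a recover) the slicer would scan for the highest completed
// end_offset per partition and continue from there, ignoring failed slices.
function resumeOffsets(records: SliceStateRecord[]): Map<number, number> {
    const resume = new Map<number, number>();
    for (const rec of records) {
        if (rec.status !== 'completed') continue;
        const current = resume.get(rec.partition) ?? 0;
        resume.set(rec.partition, Math.max(current, rec.end_offset));
    }
    return resume;
}
```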

Concerns

  • Handling of worker failures / lockups.
    • The slicer will have to be able to detect various failure scenarios and rebalance the partition assignments to other workers.
  • Finding the offsets to process.
    • Kafka doesn't make it easy to know what the offsets are without reading data, and in the case of compacted topics it's not possible to know how many records will actually exist for a given offset range.
    • For regular topics it should be possible to get the high/low watermarks (see the sketch after this list).
  • Slicer reading data.
    • It's unclear if the Kafka library can work without actually reading data from the cluster. It would be very undesirable for the slicer to hold buffers of data even if it doesn't use them.
  • Reading from missing/invalid offsets.
    • Offsets can expire, so if a job is stopped for a while there may be invalid stored offsets when it restarts, and the slicer will need to reset cleanly.
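
A hedged sketch of how the slicer could find the processable range and reset cleanly on expired offsets, assuming node-rdkafka's `queryWatermarkOffsets()` (which reports the low/high watermarks for a partition). The clamping logic itself is just an illustration, not an existing Teraslice behavior:

```typescript
import * as Kafka from 'node-rdkafka';

// Promisified lookup of the low/high watermarks for a single partition.
function getWatermarks(
    client: Kafka.KafkaConsumer,
    topic: string,
    partition: number
): Promise<{ lowOffset: number; highOffset: number }> {
    return new Promise((resolve, reject) => {
        client.queryWatermarkOffsets(topic, partition, 5000, (err, offsets) => {
            if (err) return reject(err);
            resolve(offsets);
        });
    });
}

// If the stored offset has expired (fallen below the low watermark), reset
// cleanly to the earliest still-available offset instead of failing the job.
async function clampStoredOffset(
    client: Kafka.KafkaConsumer,
    topic: string,
    partition: number,
    storedOffset: number
): Promise<number> {
    const { lowOffset, highOffset } = await getWatermarks(client, topic, partition);
    if (storedOffset < lowOffset) return lowOffset;   // offset expired, reset to earliest
    if (storedOffset > highOffset) return highOffset; // ahead of the log, wait for new data
    return storedOffset;
}
```

Note that watermarks only bound the offset range; on compacted topics they still say nothing about how many records actually remain inside that range.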

Teraslice Change Dependencies

kstaken commented Nov 29, 2017

A realization I just had ... if the standard process to restart a job becomes based around calling _recover, then any time a job is restarted it will try to replay any failed slices. This is definitely not desirable in most circumstances, especially on persistent jobs. Recover is something you generally only want to do in specific circumstances, and replaying all failed slices on a very long-running job could cause lots of unintended issues.

The default behavior we want on restarting a persistent job is to just resume where it left off but not automatically try to recover failed slices.
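
To make the distinction concrete, a small illustrative sketch (names hypothetical, not the Teraslice API): a plain restart resumes from stored offsets and leaves failed slices alone, while an explicit _recover is the only path that replays them:

```typescript
type RestartMode = 'resume' | 'recover';

interface SliceStatus {
    slice_id: string;
    status: 'pending' | 'completed' | 'failed';
}

function slicesToReplay(records: SliceStatus[], mode: RestartMode): SliceStatus[] {
    if (mode === 'resume') {
        // Default for persistent jobs: don't touch failed slices, just continue.
        return [];
    }
    // Explicit recover: replay anything that never completed.
    return records.filter((rec) => rec.status !== 'completed');
}
```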
