Make KafkaItemWriter extensible and document its thread-safety #3970
Comments
Thank you for opening this issue. Instead of using
I recommend disabling auto-flush in your tests as well; otherwise your tests won't mimic the real-world behaviour of flushing items in bulk. |
Is that a good idea? CopyOnWriteArrayList duplicates its entire contents every time it is written to. It's more useful for heavy read applications than heavy write applications. It would turn the simple I feel like that would also break the entire flush/clear logic, since COWAL only gives you a snapshot when you iterate it. If N threads are trying to iterate the list, to wait for listenableFutures, and call I'd feel much more comfortable if threads didn't have their data thrown into a shared data structure unless they're never going to read from it again (like KafkaProducer's own record buffers, which are read asynchronously by other threads).
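The snapshot concern can be seen in a small, single-threaded sketch. It uses only `java.util.concurrent` and plain strings as stand-ins for the futures; the `add` in the middle of the loop simulates another thread's write, and the final `clear()` simulates the clear step. The class and method names are hypothetical and not part of Spring Batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class CowSnapshotDemo {

    // Iterates a CopyOnWriteArrayList while it is modified mid-iteration.
    // Returns the elements the iteration actually observed.
    static List<String> iterateWhileMutating() {
        List<String> futures = new CopyOnWriteArrayList<>(List.of("a", "b"));
        List<String> seen = new ArrayList<>();
        for (String f : futures) {       // the iterator works on a snapshot: [a, b]
            if (seen.isEmpty()) {
                futures.add("c");        // stands in for a concurrent write()
            }
            seen.add(f);                 // stands in for waiting on a future
        }
        futures.clear();                 // "c" is dropped although nobody waited on it
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(iterateWhileMutating()); // prints [a, b]
    }
}
```

No exception is thrown, but the element added during iteration is never visited and is then cleared away, which is the kind of silent loss described above.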
Yeah, probably a good idea in general, but in this case I think it would only mask the issue. In production we process about 2 million items per job, and this issue hasn't shown up yet, likely because delegating to KafkaTemplate is a pretty quick process, so it's unlikely that two |
About the test, it's a pretty big one because it tests everything in succession and Kafka is only the last step. But it boils down to a Step that uses a JdbcPagingItemReader to grab IDs from the DB, a Processor that turns IDs into JPA objects and JPA objects into Avro objects, and then the KafkaItemWriter that writes them to Kafka. The IT only runs the whole thing (with pageSize, chunkSize, concurrency all == 2, and auto-flush == true), and later expects that there should be 11 records in Kafka. In prod it would be (pageSize = chunkSize = 1000, concurrency = 10, auto-flush = false). Something like this:
|
Thank you for your feedback, I understand the use case better now. You are right about the My concern with trying to introduce thread-safety techniques (whatever the technique is: thread locals, concurrent data structures, synchronization constructs, etc.) is that, in addition to making the code complex, if we do it for the I believe we should document the class as being non-thread-safe because this is missing now (it was thread-safe until #3773) and leave the synchronization aspect to the user with a decorator for consistency. You mentioned the |
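For illustration, the decorator approach might look roughly like the following self-contained sketch. `SimpleWriter`, `CollectingWriter` and `SynchronizedWriter` are hypothetical stand-ins and not Spring Batch types; the only point is that a `synchronized` wrapper serializes calls to a delegate that is not thread safe on its own.

```java
import java.util.ArrayList;
import java.util.List;

class SynchronizedDecoratorSketch {

    // Hypothetical minimal writer contract (assumption, not the Spring Batch ItemWriter).
    interface SimpleWriter<T> {
        void write(List<? extends T> items);
    }

    // Not thread safe on its own: appends to a plain ArrayList.
    static class CollectingWriter<T> implements SimpleWriter<T> {
        final List<T> written = new ArrayList<>();

        @Override
        public void write(List<? extends T> items) {
            written.addAll(items);
        }
    }

    // Decorator: only one thread at a time may call the delegate.
    static class SynchronizedWriter<T> implements SimpleWriter<T> {
        private final SimpleWriter<T> delegate;

        SynchronizedWriter(SimpleWriter<T> delegate) {
            this.delegate = delegate;
        }

        @Override
        public synchronized void write(List<? extends T> items) {
            delegate.write(items);
        }
    }

    // Writes chunks of two items from several threads and returns the total item count.
    static int writeFromThreads(int threads, int chunksPerThread) {
        CollectingWriter<Integer> target = new CollectingWriter<>();
        SimpleWriter<Integer> safe = new SynchronizedWriter<>(target);
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread worker = new Thread(() -> {
                for (int c = 0; c < chunksPerThread; c++) {
                    safe.write(List.of(1, 2));
                }
            });
            workers.add(worker);
            worker.start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return target.written.size();
    }

    public static void main(String[] args) {
        System.out.println(writeFromThreads(4, 1000));
    }
}
```

The trade-off is the one raised in this thread: the wrapper is simple and consistent, but every `write` call is serialized, so concurrent threads no longer write in parallel.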
Unfortunately, if we do nothing, it's also inconsistent, because some ItemWriters are thread safe and some aren't. For example, FlatFileItemWriter is thread safe, KafkaItemWriter isn't, JpaItemWriter and JdbcBatchItemWriter are thread safe again. I can't see any structure to this that would let me tell which item writer is thread safe and which isn't without reading the source. Some are obviously documented like Jpa/Jdbc, but if I just assume any undocumented writers are not thread safe, then I wouldn't know that FlatFileItemWriter is actually thread safe. I think going through and documenting thread safety would be quite helpful to users, because otherwise they need to (like me) go through every reader they want to use and check the internals for thread safety, which isn't the greatest user experience and is somewhat error prone (what if I used to rely on KafkaItemWriter to be thread safe, until #3741 broke that, like you said). I also couldn't find any ItemWriter that handles its own concurrency (like the proposed KafkaItemWriter change). All the ones I've found that are thread safe delegate this to the next layer down. For FlatFileItemWriter, this is the BufferedWriter, for Jpa it's the EntityManager, for Jdbc I think it's probably the SessionFactory that's deep inside NamedParameterJdbcTemplate somewhere. Those next layers handle concurrency somehow, and the ItemWriter just makes sure not to get two calls to I don't think just slapping So I've got these ideas:
|
Hi, I used ConcurrentLinkedQueue for this case. What do you think about that, should I use ThreadLocal instead? Has there been another update for this case during this time or have you changed your method after that? |
Yes, and for consistency with other implementations, there is no plan to introduce any concurrency construct in the implementation of I will proceed with the following actions:
|
I would like for `KafkaItemWriter` to be thread safe, considering `KafkaTemplate` is also thread safe.
`FlatFileItemWriter`/`AbstractFileItemWriter` behaves similarly, doing nothing to ensure thread safety itself (beyond not sharing variables across thread boundaries), and instead relying on the underlying `BufferedWriter` to handle synchronization. The same thing should be possible for `KafkaItemWriter`, since `KafkaTemplate` is thread safe.
`KafkaItemWriter` currently shares a `List<ListenableFuture<SendResult<K, T>>>` across threads, which is written to during `write()` and later `clear()`ed again. I could wrap the whole thing with a `SynchronizedItemStreamWriter`, but it would be a shame to lose multithreading capabilities, especially since `KafkaTemplate` can handle that anyway. Instead I propose to simply make the `listenableFutures` a `ThreadLocal<...> listenableFutures = ThreadLocal.withInitial(ArrayList::new)`, which should take care of the problem.
Context: we noticed `ConcurrentModificationException`s in `KafkaItemWriter::flush` in our integration tests, which set `KafkaTemplate` to auto-flush, unlike the live application, which does not. This causes the `write()` call to take longer, and that allows 2 threads to interfere. The live application has 10 threads and chunk size 1000, processing about 1k items/s in total. This works out to about 100 items/s/thread or each thread executing `write()` every 10s or so. Since there is no auto-flush outside of tests, `write()` only dumps the items onto the `KafkaProducer`'s RecordAccumulator and immediately exits.
It seems that these calls mostly miss each other, but the problem should still be there. We would rather not synchronize everything, so for now we are using the following modified ThreadSafeKafkaItemWriter:
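The `ThreadLocal` proposal can be illustrated without any Kafka dependency. The sketch below is not the ThreadSafeKafkaItemWriter referred to above; it is a hypothetical stand-in that uses `CompletableFuture` in place of `ListenableFuture<SendResult<K, T>>` and a fake `send` method in place of `KafkaTemplate`. It also assumes that `write` and `flush` are always called on the same thread for a given chunk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class ThreadLocalFuturesSketch {

    // One list of pending futures per thread, so no list is ever shared.
    private final ThreadLocal<List<CompletableFuture<String>>> pending =
            ThreadLocal.withInitial(ArrayList::new);

    // Hypothetical stand-in for an asynchronous send; completes immediately.
    private CompletableFuture<String> send(String item) {
        return CompletableFuture.completedFuture("sent:" + item);
    }

    // Records one future per item in the calling thread's own list.
    public void write(List<String> items) {
        for (String item : items) {
            pending.get().add(send(item));
        }
    }

    // Waits for the calling thread's futures, clears them, and returns how many there were.
    public int flush() {
        List<CompletableFuture<String>> mine = pending.get();
        for (CompletableFuture<String> future : mine) {
            future.join();
        }
        int count = mine.size();
        mine.clear();
        return count;
    }

    // Two threads share one writer; each flush only sees that thread's own futures.
    static int[] runTwoThreads() {
        ThreadLocalFuturesSketch writer = new ThreadLocalFuturesSketch();
        int[] flushed = new int[2];
        Thread first = new Thread(() -> {
            writer.write(List.of("a", "b"));
            flushed[0] = writer.flush();
        });
        Thread second = new Thread(() -> {
            writer.write(List.of("c", "d", "e"));
            flushed[1] = writer.flush();
        });
        first.start();
        second.start();
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return flushed;
    }

    public static void main(String[] args) {
        int[] flushed = runTwoThreads();
        System.out.println(flushed[0] + " " + flushed[1]); // prints 2 3
    }
}
```

Because each thread only ever touches its own list, iteration and `clear()` cannot interleave with another thread's `add`, and the sends themselves still run concurrently.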