I encountered this issue while using kafka-python version 2.0.2. What are the potential risks of addressing it by swapping the order in which the two locks are acquired in base.py, as follows?
Modification approach:
In the methods `ensure_coordinator_ready`, `ensure_active_group`, and `maybe_leave_group`, change the lock acquisition to `with self._lock, self._client._lock:`.
In the method `_run_once`, swap the lock acquisition order from `with self.coordinator._client._lock, self.coordinator._lock:` to `with self.coordinator._lock, self.coordinator._client._lock:`.
Would such modifications pose any risks? Is there a better solution to resolve this issue?
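For readers following along, here is a minimal sketch of the ordering rule the proposal above relies on: every thread takes the coordinator lock first and the client lock second. `StubClient`, `StubCoordinator`, `heartbeat_run_once`, and `run_demo` are hypothetical stand-ins written for this illustration, not kafka-python classes, and the use of `threading.RLock` for both locks is an assumption. It says nothing about whether the change is safe inside the real library.

```python
import threading


class StubClient:
    """Hypothetical stand-in for KafkaClient; only its lock matters here."""

    def __init__(self):
        self._lock = threading.RLock()  # assumption: a re-entrant lock
        self.polls = 0

    def poll(self):
        # Re-entrant, so a caller that already holds the lock can call this.
        with self._lock:
            self.polls += 1


class StubCoordinator:
    """Hypothetical stand-in for the coordinator defined in base.py."""

    def __init__(self, client):
        self._client = client
        self._lock = threading.RLock()

    def ensure_coordinator_ready(self):
        # Proposed order: coordinator lock first, then client lock.
        with self._lock, self._client._lock:
            self._client.poll()


def heartbeat_run_once(coordinator):
    # Same order as the main thread, so neither thread can end up holding
    # one lock while waiting for the other.
    with coordinator._lock, coordinator._client._lock:
        coordinator._client.poll()


def run_demo(iterations=1000):
    client = StubClient()
    coordinator = StubCoordinator(client)

    def main_loop():
        for _ in range(iterations):
            coordinator.ensure_coordinator_ready()

    def heartbeat_loop():
        for _ in range(iterations):
            heartbeat_run_once(coordinator)

    threads = [
        threading.Thread(target=main_loop, daemon=True),
        threading.Thread(target=heartbeat_loop, daemon=True),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=10)
    stuck = any(t.is_alive() for t in threads)
    return client.polls, stuck
```

With a single global order, both loops always run to completion. The cost is that the main thread now holds the client lock for the whole of each guarded section, which may delay the heartbeat thread; whether that is acceptable in the real library is exactly the open question above.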
I believe that I have found the reason for the deadlock that has been alluded to in a few other issues on the board.
dpkp#2373
dpkp#2099
dpkp#2042
dpkp#1989
The offset commit appears to block by design, on the assumption that the operation will resume without issue once the underlying network problem has been resolved. The issue appears to be that the consumer does not hold an exclusive client lock while it is waiting. Because the two locks are then not always acquired in the same order, this leads to a race condition between the main thread and the heartbeat thread.
The order of operations is as follows:
It is admittedly a very tight window for a race condition, but it does exist, based on my own experience as well as that of others in the community. The problem can be avoided by giving the consumer exclusive access to the KafkaClient while it tries to commit the offset, or by ensuring that the heartbeat thread has exclusive access to the client while it performs its checks.
It should also be noted that, while I have only spelled out the race condition as it exists between the commit and heartbeat operations, I wouldn't be surprised if the heartbeat was also interfering with other operations because of this issue.
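The lock-ordering failure described above can be reproduced in isolation. The sketch below is not kafka-python code: `demonstrate_inversion` and both locks are hypothetical names. It assumes the main thread takes the coordinator lock first and then needs the client lock, while the heartbeat thread takes them in the opposite order. Barriers force the tight window open every time, and timeouts stand in for what would otherwise be a permanent hang.

```python
import threading


def demonstrate_inversion(timeout=0.2):
    """Force two threads to take two locks in opposite orders."""
    coordinator_lock = threading.Lock()  # stand-in for the coordinator lock
    client_lock = threading.Lock()       # stand-in for the client lock
    both_hold_first = threading.Barrier(2)
    both_tried_second = threading.Barrier(2)
    results = {}

    def worker(name, first, second):
        with first:
            # Wait until the other thread also holds its first lock.
            both_hold_first.wait()
            # With no timeout, this call would block forever.
            got = second.acquire(timeout=timeout)
            results[name] = got
            # Keep holding `first` until the other attempt has finished too.
            both_tried_second.wait()
            if got:
                second.release()

    threads = [
        threading.Thread(target=worker,
                         args=("main", coordinator_lock, client_lock)),
        threading.Thread(target=worker,
                         args=("heartbeat", client_lock, coordinator_lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Each thread reports `False` for its second acquisition, because the lock it wants is held by a thread that is itself waiting. In the real consumer there are no barriers, so the same interleaving depends on timing, which would be consistent with the tight window mentioned above.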