Inconsistent "There are .. active alarms" messages and alarm state after Kafka disconnects #2267

Closed
kasemir opened this issue May 20, 2022 · 0 comments

Comments

@kasemir
Copy link
Collaborator

kasemir commented May 20, 2022

The alarm server communicates alarm state changes (PV going into alarm, alarm being acknowledged etc) via Kafka.
New clients read the most recent state from Kafka on startup. Online clients receive state updates from Kafka.

If Kafka is down or temporarily inaccessible, clients and servers can get out of sync. For example, assume a PV goes into alarm and then recovers. The alarm server latches the alarm and sends the associated updates to Kafka. With Kafka inaccessible, online clients miss the updates. New clients, assuming they can eventually connect to Kafka, also miss the latched alarm, because it never reached Kafka and is therefore not included in the compacted alarm topic.

A possible symptom of this situation: the annunciator emits "There are N active alarms" every 15 minutes, while operators viewing the alarm table see a different number of active alarms, typically zero, because they have been handling all known active alarms. Checking active alarms in the alarm server console via ls -active shows N active alarms, consistent with the annunciation, while checking the Kafka alarm state via monitor_topic.sh Accelerator provides a different alarm state, consistent with the alarm table GUI.
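How a missed update survives (or rather, fails to survive) log compaction can be sketched as follows. This is a minimal illustration of Kafka's keep-latest-value-per-key compaction, not the actual Phoebus alarm topic schema; the keys and state strings are made up:

```python
# Sketch of why a latched alarm that never reached Kafka is absent
# from the compacted topic. Keys and values are illustrative only.

def compact(messages):
    """Log compaction keeps only the most recent value per key."""
    state = {}
    for key, value in messages:
        state[key] = value
    return state

# Updates the alarm server *tried* to send for one PV:
intended = [
    ("pv:flow", "MAJOR/latched"),  # PV goes into alarm, server latches it
    ("pv:flow", "OK/latched"),     # PV recovers, but the latch remains active
]

# Kafka was unreachable, so neither update arrived; the topic
# still holds only the pre-outage state:
received = [("pv:flow", "OK")]

print(compact(intended))   # state clients should see: latched alarm
print(compact(received))   # state a newly started client actually reads
```

A new client reading the compacted topic thus reconstructs a state without the latched alarm, matching the alarm table GUI but not the alarm server.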

In the past, the only way to recover was a restart of the alarm server. The underlying technical issue lies in the design of Kafka, which deliberately tolerates temporary network issues. The KafkaConsumer API, see
https://kafka.apache.org/090/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html,
"will transparently handle the failure of servers".
In case of network problems, clients like the alarm GUI simply stop receiving new messages while otherwise remaining unaware of the issue. To detect this, the alarm system emits periodic idle messages, allowing the alarm table and alarm tree GUIs to indicate a missing server connection. The alarm server, however, did not check for Kafka disconnects, so it was unaware that its alarm state updates were going nowhere.
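The idle-message scheme boils down to a watchdog timer on the client side: if neither state updates nor idle messages arrive within some multiple of the idle period, the connection is considered lost. A minimal sketch, with illustrative timeout values and a hypothetical class name (the real Phoebus implementation is in Java and differs in detail):

```python
import time

IDLE_PERIOD = 10.0          # seconds between server idle messages (illustrative)
TIMEOUT = 3 * IDLE_PERIOD   # declare disconnect after missing several of them

class ConnectionMonitor:
    """Flags a missing server connection when no Kafka message,
    not even an idle message, arrives within the timeout."""

    def __init__(self, now=time.monotonic):
        self._now = now          # injectable clock, eases testing
        self._last = now()

    def message_received(self):
        """Call for every received message, including idle messages."""
        self._last = self._now()

    def connected(self):
        return self._now() - self._last < TIMEOUT
```

A GUI would call message_received() from its Kafka consumer loop and poll connected() to decide whether to show the "no server" indication.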

With #2265, the alarm server checks for Kafka disconnects, and once communication is restored, the server publishes a complete update of the alarm tree.
In tests, this has allowed bridging Kafka outages and fully re-syncing alarm clients and server.
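The re-sync step amounts to walking the complete alarm tree and re-publishing every node's current state, so the compacted topic again reflects the server's view. A sketch under the assumption of a simple nested-dict tree (the actual server publishes its real per-node alarm state, not this toy structure):

```python
def republish(node, publish):
    """Depth-first walk of an alarm tree, re-sending every node's
    current state so clients can fully re-sync after an outage.
    'node' is a dict with 'path', 'state' and 'children' (illustrative)."""
    publish(node["path"], node["state"])
    for child in node.get("children", []):
        republish(child, publish)

tree = {"path": "Accelerator", "state": "OK", "children": [
    {"path": "Accelerator/Vacuum", "state": "MAJOR/latched", "children": []},
]}
republish(tree, lambda path, state: print(path, state))
```

Because the topic is compacted, re-publishing every key is idempotent: clients that were in sync see no change, while out-of-sync clients converge to the server's state.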

In addition, a new "resend" command in the alarm server shell could help debug similar issues in the future.

Side note: Placing the alarm server and Kafka on the same host can minimize the chance of disconnects, but to be more robust, you also need to avoid host names. In an operational example, both the alarm server and Kafka were located on a host "alarmhost". The org.phoebus.applications.alarm/server setting for all Phoebus tools was set to alarmhost:9092 so that clients anywhere on the network could reach Kafka, while the alarm server was actually running on alarmhost itself and was thus assumed to be immune to network issues. It turns out the Kafka libraries don't always keep the TCP connection between the clients and Kafka open. When the name server was down and the Kafka client library in the alarm server tried to connect to alarmhost:9092, that connection attempt failed.
To be most robust, the alarm server and Kafka thus need to be on the same host, and the alarm server must use localhost:9092 to connect, while alarm clients elsewhere on the network naturally need to use alarmhost:9092.
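Under that scheme, the split configuration might look like the following sketch. The preference name matches the org.phoebus.applications.alarm/server setting mentioned above; the file layout and host names are illustrative:

```ini
# Settings for the alarm server, running on alarmhost itself:
# connect via localhost so a name-server outage cannot break the link.
org.phoebus.applications.alarm/server=localhost:9092

# Settings for alarm clients elsewhere on the network:
# org.phoebus.applications.alarm/server=alarmhost:9092
```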
