problems on using a topology #471

Closed
BerndKrischok opened this issue Feb 18, 2022 · 3 comments · Fixed by #472

@BerndKrischok

When trying to contact nodes via a topology configuration, I run into problems with a more deeply nested topology, while a flat topology seems to work.
Example topology file 'topo':
[routes]
n061901: n061902,n062001
n061902: n062002,n143501
n062001: n183601,n192201
n062002: n193201,n072602
n143501: n072701,n072702
n183601: n072801,n072802
n192201: n072901,n072902

N=n061901,n061902,n062001,n061902,n062002,n143501,n062001,n183601,n192201,n062002,n193201,n072602,n143501,n072701,n072702,n183601,n072801,n072802,n192201,n072901,n072902
clush -w $N --topology=topo hostname
^Cn061902: n061902
n061901: n061901
n062001: n062001
n062002: n062002
n143501: n143501
n192201: n192201
n183601: n183601
Keyboard interrupt.

(The interrupt is needed because the command hangs forever.)

The tree in the debug looks like this:

n061901
|- n061902
|  |- n062002
|  |  `- n[072602,193201]
|  `- n143501
|     `- n[072701-072702]
`- n062001
   |- n183601
   |  `- n[072801-072802]
   `- n192201
      `- n[072901-072902]
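
To see which nodes sit at which depth of this tree, the routes can be walked with a small standalone script. This is only a sketch using the Python standard library, not ClusterShell's own topology parser; it assumes plain comma-separated node names (no bracketed ranges), and the helper names `parse_routes` and `depths` are made up for illustration.

```python
import configparser

# The nested topology from the report, embedded for a self-contained run.
TOPO = """
[routes]
n061901: n061902,n062001
n061902: n062002,n143501
n062001: n183601,n192201
n062002: n193201,n072602
n143501: n072701,n072702
n183601: n072801,n072802
n192201: n072901,n072902
"""


def parse_routes(text):
    """Return {parent: [children]} from a [routes] section (plain names only)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        parent: [child.strip() for child in children.split(",")]
        for parent, children in parser["routes"].items()
    }


def depths(routes, root):
    """Map every node reachable from root to its hop count from root."""
    result = {root: 0}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in routes.get(node, []):
            result[child] = result[node] + 1
            stack.append(child)
    return result


routes = parse_routes(TOPO)
hops = depths(routes, "n061901")

# Group the nodes by depth and print one line per level.
by_depth = {}
for node, depth in hops.items():
    by_depth.setdefault(depth, []).append(node)
for depth in sorted(by_depth):
    print(f"depth {depth}: {', '.join(sorted(by_depth[depth]))}")
```

Under these assumptions, the eight nodes at depth 3 are exactly the ones missing from the output of the hanging run above, while every node at depth 0 to 2 answered.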

Maybe I am doing something wrong. Does anyone have an idea what is going wrong?
(I have checked SSH connectivity between all of the nodes listed above.)

With a flatter topology I see no problems:
[routes]
n061901: n061902,n062001,n062002
n061902: n143501,n183601,n192201,n193201
n062001: n072801,n072802,n072901,n072902
n062002: n072602,n072701,n072702

clush -w $N --topology=topo hostname
n061902: n061902
n061901: n061901
n062002: n062002
n062001: n062001
n193201: n193201
n192201: n192201
n072602: n072602
n072701: n072701
n072702: n072702
n183601: n183601
n143501: n143501
n072902: n072902
n072801: n072801
n072901: n072901
n072802: n072802

Thank you
Bernd

@martinetd
Collaborator

I can reproduce this.

I just changed the node names to something easier to follow:

$ clush -d -w d[1-4] hostname
DEBUG:root:clush: STARTING DEBUG
Changing max open files soft limit from 1024 to 8192
User interaction: True
Create STDIN worker: False
clush: enabling tree topology (6 gateways)
clush: nodeset=d[1-4] fanout=64 [timeout conn=15.0 cmd=0.0] command="hostname"
---------------
rootnode
|- a1
|  |- b1
|  |  `- d[1-2]
|  `- b2
|     `- d[3-4]
`- a2
   |- b3
   |  `- d[5-6]
   `- b4
      `- d[7-8]
---------------
DEBUG:ClusterShell.Worker.Tree:stderr=True
DEBUG:ClusterShell.Worker.Tree:TreeWorker._launch on d[1-4] (fanout=64)
DEBUG:ClusterShell.Worker.Tree:next_hops=[('a1', 'd[1-4]')]
DEBUG:ClusterShell.Worker.Tree:trying gateway a1 to reach d[1-4]
DEBUG:ClusterShell.Worker.Tree:_execute_remote gateway=a1 cmd=hostname targets=d[1-4]
DEBUG:ClusterShell.Task:pchannel: creating new channel <ClusterShell.Propagation.PropagationChannel object at 0x7fe888841780>
SSHCLIENT: ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes a1 python3 -m ClusterShell.Gateway -Bu
DEBUG:ClusterShell.Engine.Engine:set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7fe888989bd0> not registered
DEBUG:ClusterShell.Engine.Engine:set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7fe888989bd0> not registered
DEBUG:ClusterShell.Engine.Engine:set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7fe888989bd0> not registered
DEBUG:ClusterShell.Engine.Engine:set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7fe888989bd0> not registered
DEBUG:ClusterShell.Engine.Engine:set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7fe888989bd0> not registered
DEBUG:ClusterShell.Propagation:shell nodes=d[1-4] timeout=-1 worker=140636699497168 remote=True
DEBUG:ClusterShell.Propagation:send_queued: 0
DEBUG:ClusterShell.Worker.Tree:TreeWorker: _check_ini (0, 0)
a1: b'<?xml version="1.0" encoding="utf-8"?>'
a1: b'<channel version="1.8.4"><message type="ACK" msgid="2" ack="0"></message>'
DEBUG:ClusterShell.Propagation:recv: Message CHA (type: CHA, msgid: 2)
DEBUG:ClusterShell.Propagation:channel started (version 1.8.4 on remote gateway)
DEBUG:ClusterShell.Propagation:recv: Message ACK (type: ACK, msgid: 2, ack: 0)
DEBUG:ClusterShell.Propagation:recv_cfg
DEBUG:ClusterShell.Propagation:CTL - connection with gateway fully established
DEBUG:ClusterShell.Propagation:dequeuing sendq: Message CTL (type: CTL, msgid: 1, srcid: 140636699497168, action: shell, target: d[1-4])
a1: b'<message type="ACK" msgid="8" ack="1"></message>'
DEBUG:ClusterShell.Propagation:recv: Message ACK (type: ACK, msgid: 8, ack: 1)
DEBUG:ClusterShell.Propagation:got ack (ACK)
DEBUG:ClusterShell.Propagation:ev_close gateway=a1 <ClusterShell.Propagation.PropagationChannel object at 0x7fe888841780>
DEBUG:ClusterShell.Propagation:ev_close rc=0
$ cat topology.conf
[routes]
rootnode: a1,a2
a1: b1,b2
a2: b3,b4
b1: d1,d2
b2: d3,d4
b3: d5,d6
b4: d7,d8

With this I have no problem reaching nodes two levels deep (b[1-4]), but I can't seem to reach any of the d nodes three levels deep. The debug output shows that the a1 gateway closes too early; presumably it thinks it is done after the b-level ACK when it shouldn't...
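
The `next_hops=[('a1', 'd[1-4]')]` line in the debug output shows how targets are grouped by the gateway that leads to them. The following is a simplified model of that grouping for the topology.conf above, not ClusterShell's actual router code; `ROUTES`, `subtree` and `next_hops` are names chosen for this sketch, and node sets are plain Python lists rather than folded nodesets.

```python
# Routes from topology.conf above, written out as a plain dict.
ROUTES = {
    "rootnode": ["a1", "a2"],
    "a1": ["b1", "b2"],
    "a2": ["b3", "b4"],
    "b1": ["d1", "d2"],
    "b2": ["d3", "d4"],
    "b3": ["d5", "d6"],
    "b4": ["d7", "d8"],
}


def subtree(routes, node):
    """Return the set of all nodes below `node` (excluding `node` itself)."""
    found = set()
    stack = list(routes.get(node, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(routes.get(current, []))
    return found


def next_hops(routes, here, targets):
    """Group `targets` by the direct child of `here` that leads to them."""
    hops = {}
    for child in routes.get(here, []):
        reachable = subtree(routes, child) | {child}
        via = sorted(t for t in targets if t in reachable)
        if via:
            hops[child] = via
    return hops


# Each level only needs to know its own next hop.
print(next_hops(ROUTES, "rootnode", ["d1", "d2", "d3", "d4"]))
print(next_hops(ROUTES, "a1", ["d1", "d4"]))
print(next_hops(ROUTES, "b2", ["d4"]))
```

In this model every d node needs three hops (a gateway, then a second gateway, then the node itself), so an intermediate gateway such as a1 has to act as a gateway and as a parent of further gateways at the same time.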

@martinetd
Collaborator

Running with CLUSTERSHELL_GW_LOG_LEVEL=debug, here are the logs of the first-level gateway (a1):

2022-02-19 11:35:55,545 ClusterShell.Gateway DEBUG Starting task
2022-02-19 11:35:55,545 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f2065deb340> not registered
2022-02-19 11:35:55,545 ClusterShell.Gateway DEBUG ready to accept channel communication
2022-02-19 11:35:55,546 ClusterShell.Gateway DEBUG handling incoming message: Message CHA (type: CHA, msgid: 0)
2022-02-19 11:35:55,546 ClusterShell.Gateway DEBUG got start message Message CHA (type: CHA, msgid: 0)
2022-02-19 11:35:55,546 ClusterShell.Gateway DEBUG channel started (version 1.8.3 on remote end)
2022-02-19 11:35:55,546 ClusterShell.Gateway DEBUG handling incoming message: Message CFG (type: CFG, msgid: 0, gateway: a1)
2022-02-19 11:35:55,546 ClusterShell.Gateway DEBUG got channel configuration
2022-02-19 11:35:55,546 ClusterShell.Gateway DEBUG using gateway node name a1
2022-02-19 11:35:55,546 ClusterShell.Gateway DEBUG gw name a1 does not match system hostname myhostname
2022-02-19 11:35:55,547 ClusterShell.Gateway DEBUG decoded propagation tree
2022-02-19 11:35:55,547 ClusterShell.Gateway DEBUG 
myhostname
|- a1
|  |- b1
|  |  `- d[1-2]
|  `- b2
|     `- d[3-4]
`- a2
   |- b3
   |  `- d[5-6]
   `- b4
      `- d[7-8]

2022-02-19 11:35:55,549 ClusterShell.Gateway DEBUG handling incoming message: Message CTL (type: CTL, msgid: 1, srcid: 139997685073808, action: shell, target: d[1,4])
2022-02-19 11:35:55,549 ClusterShell.Gateway DEBUG GatewayChannel._state_ctl
2022-02-19 11:35:55,549 ClusterShell.Gateway DEBUG decoded gw invoke (PYTHONPATH=/home/shared/clustershell/lib CLUSTERSHELL_GW_LOG_LEVEL=debug python3 -m ClusterShell.Gateway -Bu)
2022-02-19 11:35:55,549 ClusterShell.Gateway DEBUG assigning task infos ({'debug': True, 'fanout': 64, 'grooming_delay': 0.25, 'connect_timeout': 15.0, 'command_timeout': 0.0})
2022-02-19 11:35:55,549 ClusterShell.Gateway DEBUG inherited fanout value=64
2022-02-19 11:35:55,549 ClusterShell.Gateway DEBUG launching execution/enter gathering state
2022-02-19 11:35:55,549 ClusterShell.Gateway DEBUG TreeWorkerResponder initialized grooming=0.250000
2022-02-19 11:35:55,550 ClusterShell.Worker.Tree DEBUG stderr=True
2022-02-19 11:35:55,550 ClusterShell.Worker.Tree DEBUG TreeWorker._launch on d[1,4] (fanout=64)
2022-02-19 11:35:55,551 ClusterShell.Worker.Tree DEBUG next_hops=[('b1', 'd1'), ('b2', 'd4')]
2022-02-19 11:35:55,551 ClusterShell.Worker.Tree DEBUG trying gateway b1 to reach d1
2022-02-19 11:35:55,551 ClusterShell.Worker.Tree DEBUG _execute_remote gateway=b1 cmd=hostname targets=d1
2022-02-19 11:35:55,551 ClusterShell.Task DEBUG pchannel: creating new channel <ClusterShell.Propagation.PropagationChannel object at 0x7f2065ad1d50>
2022-02-19 11:35:55,552 ClusterShell.Gateway DEBUG SSHCLIENT: ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes b1 PYTHONPATH=/home/shared/clustershell/lib CLUSTERSHELL_GW_LOG_LEVEL=debug python3 -m ClusterShell.Gateway -Bu
2022-02-19 11:35:55,553 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f2065deb340> not registered
2022-02-19 11:35:55,554 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f2065deb340> not registered
2022-02-19 11:35:55,554 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f2065deb340> not registered
2022-02-19 11:35:55,554 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f2065deb340> not registered
2022-02-19 11:35:55,555 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f2065deb340> not registered
2022-02-19 11:35:55,555 ClusterShell.Propagation DEBUG shell nodes=d1 timeout=-1 worker=139777121523344 remote=True
2022-02-19 11:35:55,555 ClusterShell.Propagation DEBUG send_queued: 0
2022-02-19 11:35:55,555 ClusterShell.Worker.Tree DEBUG trying gateway b2 to reach d4
2022-02-19 11:35:55,555 ClusterShell.Worker.Tree DEBUG _execute_remote gateway=b2 cmd=hostname targets=d4
2022-02-19 11:35:55,556 ClusterShell.Task DEBUG pchannel: creating new channel <ClusterShell.Propagation.PropagationChannel object at 0x7f2065ad2440>
2022-02-19 11:35:55,556 ClusterShell.Gateway DEBUG SSHCLIENT: ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes b2 PYTHONPATH=/home/shared/clustershell/lib CLUSTERSHELL_GW_LOG_LEVEL=debug python3 -m ClusterShell.Gateway -Bu
2022-02-19 11:35:55,557 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f2065deb340> not registered
2022-02-19 11:35:55,558 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f2065deb340> not registered
2022-02-19 11:35:55,558 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f2065deb340> not registered
2022-02-19 11:35:55,558 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f2065deb340> not registered
2022-02-19 11:35:55,559 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7f2065deb340> not registered
2022-02-19 11:35:55,559 ClusterShell.Propagation DEBUG shell nodes=d4 timeout=-1 worker=139777121523344 remote=True
2022-02-19 11:35:55,560 ClusterShell.Propagation DEBUG send_queued: 0
2022-02-19 11:35:55,560 ClusterShell.Worker.Tree DEBUG TreeWorker: _check_ini (0, 0)
2022-02-19 11:35:55,560 ClusterShell.Gateway DEBUG TreeWorkerResponder: ev_start
2022-02-19 11:35:55,560 ClusterShell.Gateway DEBUG TreeWorker scheduled
2022-02-19 11:35:55,891 ClusterShell.Gateway DEBUG b1: b'<?xml version="1.0" encoding="utf-8"?>'
2022-02-19 11:35:55,893 ClusterShell.Gateway DEBUG b1: b'<channel version="1.8.3"><message type="ACK" msgid="2" ack="4"></message>'
2022-02-19 11:35:55,893 ClusterShell.Propagation DEBUG recv: Message CHA (type: CHA, msgid: 9)
2022-02-19 11:35:55,894 ClusterShell.Propagation DEBUG channel started (version 1.8.3 on remote gateway)
2022-02-19 11:35:55,894 ClusterShell.Propagation DEBUG recv: Message ACK (type: ACK, msgid: 2, ack: 4)
2022-02-19 11:35:55,894 ClusterShell.Propagation DEBUG recv_cfg
2022-02-19 11:35:55,894 ClusterShell.Propagation DEBUG CTL - connection with gateway fully established
2022-02-19 11:35:55,894 ClusterShell.Propagation DEBUG dequeuing sendq: Message CTL (type: CTL, msgid: 5, srcid: 139777121523344, action: shell, target: d1)
2022-02-19 11:35:55,900 ClusterShell.Gateway DEBUG b1: b'<message type="ACK" msgid="4" ack="5"></message>'
2022-02-19 11:35:55,901 ClusterShell.Propagation DEBUG recv: Message ACK (type: ACK, msgid: 4, ack: 5)
2022-02-19 11:35:55,901 ClusterShell.Propagation DEBUG got ack (ACK)
2022-02-19 11:35:55,933 ClusterShell.Gateway DEBUG b2: b'<?xml version="1.0" encoding="utf-8"?>'
2022-02-19 11:35:55,935 ClusterShell.Gateway DEBUG b2: b'<channel version="1.8.3"><message type="ACK" msgid="2" ack="6"></message>'
2022-02-19 11:35:55,935 ClusterShell.Propagation DEBUG recv: Message CHA (type: CHA, msgid: 12)
2022-02-19 11:35:55,935 ClusterShell.Propagation DEBUG channel started (version 1.8.3 on remote gateway)
2022-02-19 11:35:55,935 ClusterShell.Propagation DEBUG recv: Message ACK (type: ACK, msgid: 2, ack: 6)
2022-02-19 11:35:55,936 ClusterShell.Propagation DEBUG recv_cfg
2022-02-19 11:35:55,936 ClusterShell.Propagation DEBUG CTL - connection with gateway fully established
2022-02-19 11:35:55,936 ClusterShell.Propagation DEBUG dequeuing sendq: Message CTL (type: CTL, msgid: 7, srcid: 139777121523344, action: shell, target: d4)
2022-02-19 11:35:55,942 ClusterShell.Gateway DEBUG b2: b'<message type="ACK" msgid="4" ack="7"></message>'
2022-02-19 11:35:55,943 ClusterShell.Propagation DEBUG recv: Message ACK (type: ACK, msgid: 4, ack: 7)
2022-02-19 11:35:55,943 ClusterShell.Propagation DEBUG got ack (ACK)
2022-02-19 11:35:56,169 ClusterShell.Gateway DEBUG b2: b'<message type="OUT" msgid="5" srcid="139777121523344" nodes="d4">gASVGAAAAAAAAABDFGZlbnJpci5jb2Rld3JlY2sub3JnlC4=</message>'
2022-02-19 11:35:56,169 ClusterShell.Propagation DEBUG recv: Message OUT (type: OUT, msgid: 5, srcid: 139777121523344, nodes: d4)
2022-02-19 11:35:56,170 ClusterShell.Gateway DEBUG b2: b'<message type="RET" msgid="6" srcid="139777121523344" retcode="0" nodes="d4"></message>'
2022-02-19 11:35:56,170 ClusterShell.Propagation DEBUG recv: Message RET (type: RET, msgid: 6, srcid: 139777121523344, retcode: 0, nodes: d4)
2022-02-19 11:35:56,170 ClusterShell.Worker.Tree DEBUG _on_remote_node_close d4 0 via gw b2
2022-02-19 11:35:56,171 ClusterShell.Worker.Tree DEBUG check_fini 1 2
2022-02-19 11:35:56,171 ClusterShell.Worker.Tree DEBUG TreeWorker._check_fini <ClusterShell.Worker.Tree.TreeWorker object at 0x7f2065ad1690> call pchannel_release for gw b2
2022-02-19 11:35:56,171 ClusterShell.Task DEBUG pchannel_release b2 <ClusterShell.Worker.Tree.TreeWorker object at 0x7f2065ad1690>
2022-02-19 11:35:56,171 ClusterShell.Task DEBUG pchannel_release: destroying channel <ClusterShell.Propagation.PropagationChannel object at 0x7f2065ad2440>
2022-02-19 11:35:56,172 ClusterShell.Propagation DEBUG ev_close gateway=b2 <ClusterShell.Propagation.PropagationChannel object at 0x7f2065ad2440>
2022-02-19 11:35:56,172 ClusterShell.Propagation DEBUG ev_close rc=None
2022-02-19 11:35:56,172 ClusterShell.Propagation DEBUG error on gateway b2 (setup=True)
2022-02-19 11:35:56,173 ClusterShell.Gateway DEBUG GatewayChannel: ev_close
2022-02-19 11:35:56,176 ClusterShell.Engine.Engine DEBUG Traceback (most recent call last):
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 723, in run
    self.runloop(timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Engine/EPoll.py", line 157, in runloop
    client._handle_read(sname)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 192, in _handle_read
    node_msgline(key, msg, sname)  # handle full msg line
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 166, in _on_nodeset_msgline
    self.worker._on_node_msgline(nodes, msg, sname)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Worker.py", line 279, in _on_node_msgline
    self.eh.ev_read(self, node, sname, msg)
  File "/home/shared/clustershell/lib/ClusterShell/Communication.py", line 258, in ev_read
    self.recv(msg)
  File "/home/shared/clustershell/lib/ClusterShell/Propagation.py", line 270, in recv
    self.recv_ctl(msg)
  File "/home/shared/clustershell/lib/ClusterShell/Propagation.py", line 376, in recv_ctl
    metaworker._on_remote_node_close(node, rc, self.gateway)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Tree.py", line 446, in _on_remote_node_close
    self._check_fini(gateway)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Tree.py", line 499, in _check_fini
    self.task._pchannel_release(gateway, self)
  File "/home/shared/clustershell/lib/ClusterShell/Task.py", line 1367, in _pchannel_release
    chanworker.abort()
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 360, in abort
    client.abort()
  File "/home/shared/clustershell/lib/ClusterShell/Worker/EngineClient.py", line 438, in abort
    engine.remove(self, abort=True)
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 495, in remove
    self._remove(client, abort, did_timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 483, in _remove
    client._close(abort=abort, timeout=did_timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 142, in _close
    self.worker._check_fini()
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 383, in _check_fini
    _eh_sigspec_invoke_compat(self.eh.ev_close, 2, self,
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Worker.py", line 52, in _eh_sigspec_invoke_compat
    return method(*args)
  File "/home/shared/clustershell/lib/ClusterShell/Propagation.py", line 411, in ev_close
    self.task.router.mark_unreachable(gateway)
AttributeError: 'NoneType' object has no attribute 'mark_unreachable'

2022-02-19 11:35:56,177 ClusterShell.Propagation DEBUG ev_close gateway=b1 <ClusterShell.Propagation.PropagationChannel object at 0x7f2065ad1d50>
2022-02-19 11:35:56,177 ClusterShell.Propagation DEBUG ev_close rc=None
2022-02-19 11:35:56,177 ClusterShell.Propagation DEBUG error on gateway b1 (setup=True)
2022-02-19 11:35:56,177 ClusterShell.Gateway ERROR Gateway failure: 'NoneType' object has no attribute 'mark_unreachable'
Traceback (most recent call last):
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 723, in run
    self.runloop(timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Engine/EPoll.py", line 157, in runloop
    client._handle_read(sname)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 192, in _handle_read
    node_msgline(key, msg, sname)  # handle full msg line
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 166, in _on_nodeset_msgline
    self.worker._on_node_msgline(nodes, msg, sname)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Worker.py", line 279, in _on_node_msgline
    self.eh.ev_read(self, node, sname, msg)
  File "/home/shared/clustershell/lib/ClusterShell/Communication.py", line 258, in ev_read
    self.recv(msg)
  File "/home/shared/clustershell/lib/ClusterShell/Propagation.py", line 270, in recv
    self.recv_ctl(msg)
  File "/home/shared/clustershell/lib/ClusterShell/Propagation.py", line 376, in recv_ctl
    metaworker._on_remote_node_close(node, rc, self.gateway)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Tree.py", line 446, in _on_remote_node_close
    self._check_fini(gateway)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Tree.py", line 499, in _check_fini
    self.task._pchannel_release(gateway, self)
  File "/home/shared/clustershell/lib/ClusterShell/Task.py", line 1367, in _pchannel_release
    chanworker.abort()
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 360, in abort
    client.abort()
  File "/home/shared/clustershell/lib/ClusterShell/Worker/EngineClient.py", line 438, in abort
    engine.remove(self, abort=True)
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 495, in remove
    self._remove(client, abort, did_timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 483, in _remove
    client._close(abort=abort, timeout=did_timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 142, in _close
    self.worker._check_fini()
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 383, in _check_fini
    _eh_sigspec_invoke_compat(self.eh.ev_close, 2, self,
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Worker.py", line 52, in _eh_sigspec_invoke_compat
    return method(*args)
  File "/home/shared/clustershell/lib/ClusterShell/Propagation.py", line 411, in ev_close
    self.task.router.mark_unreachable(gateway)
AttributeError: 'NoneType' object has no attribute 'mark_unreachable'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/shared/clustershell/lib/ClusterShell/Task.py", line 772, in _resume
    self._run(self.timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Task.py", line 400, in _run
    self._engine.run(timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 732, in run
    self.clear()
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 534, in clear
    self._remove(client, True, did_timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 483, in _remove
    client._close(abort=abort, timeout=did_timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Worker.py", line 452, in _close
    _eh_sigspec_invoke_compat(self.worker.eh.ev_close, 2, self, timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Worker.py", line 52, in _eh_sigspec_invoke_compat
    return method(*args)
  File "/home/shared/clustershell/lib/ClusterShell/Gateway.py", line 311, in ev_close
    self.worker.task.abort()
  File "/home/shared/clustershell/lib/ClusterShell/Task.py", line 920, in abort
    self._abort(kill)
  File "/home/shared/clustershell/lib/ClusterShell/Task.py", line 206, in taskfunc
    return f(task, *fargs, **kwargs)
  File "/home/shared/clustershell/lib/ClusterShell/Task.py", line 909, in _abort
    self._engine.abort(kill)
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 775, in abort
    raise EngineAbortException(kill)
ClusterShell.Engine.Engine.EngineAbortException

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/shared/clustershell/lib/ClusterShell/Gateway.py", line 368, in gateway_main
    task.resume()
  File "/home/shared/clustershell/lib/ClusterShell/Task.py", line 809, in resume
    self._resume()
  File "/home/shared/clustershell/lib/ClusterShell/Task.py", line 776, in _resume
    self._terminate(exc.kill)
  File "/home/shared/clustershell/lib/ClusterShell/Task.py", line 945, in _terminate
    self._engine.clear(clear_ports=kill)
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 534, in clear
    self._remove(client, True, did_timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Engine/Engine.py", line 483, in _remove
    client._close(abort=abort, timeout=did_timeout)
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 142, in _close
    self.worker._check_fini()
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Exec.py", line 383, in _check_fini
    _eh_sigspec_invoke_compat(self.eh.ev_close, 2, self,
  File "/home/shared/clustershell/lib/ClusterShell/Worker/Worker.py", line 52, in _eh_sigspec_invoke_compat
    return method(*args)
  File "/home/shared/clustershell/lib/ClusterShell/Propagation.py", line 411, in ev_close
    self.task.router.mark_unreachable(gateway)
AttributeError: 'NoneType' object has no attribute 'mark_unreachable'
2022-02-19 11:35:56,179 ClusterShell.Gateway DEBUG -------- The End --------
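
When reading logs like the one above, it can help to pull the attributes out of the individual `<message>` elements. The snippet below is a rough sketch that parses two complete message lines copied from the log with the standard library; it treats each line as a standalone XML element, which is an assumption made for convenience and not a description of how ClusterShell itself reads the channel stream. `summarize` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Two complete message elements copied from the a1 gateway log above.
LINES = [
    b'<message type="ACK" msgid="4" ack="7"></message>',
    b'<message type="RET" msgid="6" srcid="139777121523344" '
    b'retcode="0" nodes="d4"></message>',
]


def summarize(raw):
    """Return (message type, remaining attributes) for one message element."""
    elem = ET.fromstring(raw)
    attrs = dict(elem.attrib)
    return attrs.pop("type"), attrs


for line in LINES:
    msg_type, attrs = summarize(line)
    print(msg_type, attrs)
```

The RET message is the interesting one here: it carries `retcode="0"` for `nodes="d4"`, meaning b2 reported a successful command before a1 hit the exception.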

And here are the logs of one of the deeper gateways (b2):

2022-02-19 11:35:55,932 ClusterShell.Gateway DEBUG Starting task
2022-02-19 11:35:55,932 ClusterShell.Engine.Engine DEBUG set_events: client <ClusterShell.Engine.EPoll.EngineEPoll object at 0x7fc625aa38e0> not registered
2022-02-19 11:35:55,933 ClusterShell.Gateway DEBUG ready to accept channel communication
2022-02-19 11:35:55,933 ClusterShell.Gateway DEBUG handling incoming message: Message CHA (type: CHA, msgid: 0)
2022-02-19 11:35:55,933 ClusterShell.Gateway DEBUG got start message Message CHA (type: CHA, msgid: 0)
2022-02-19 11:35:55,934 ClusterShell.Gateway DEBUG channel started (version 1.8.3 on remote end)
2022-02-19 11:35:55,934 ClusterShell.Gateway DEBUG handling incoming message: Message CFG (type: CFG, msgid: 6, gateway: b2)
2022-02-19 11:35:55,934 ClusterShell.Gateway DEBUG got channel configuration
2022-02-19 11:35:55,934 ClusterShell.Gateway DEBUG using gateway node name b2
2022-02-19 11:35:55,934 ClusterShell.Gateway DEBUG gw name b2 does not match system hostname myhostname
2022-02-19 11:35:55,934 ClusterShell.Gateway DEBUG decoded propagation tree
2022-02-19 11:35:55,935 ClusterShell.Gateway DEBUG 
myhostname
|- a1
|  |- b1
|  |  `- d[1-2]
|  `- b2
|     `- d[3-4]
`- a2
   |- b3
   |  `- d[5-6]
   `- b4
      `- d[7-8]

2022-02-19 11:35:55,938 ClusterShell.Gateway DEBUG handling incoming message: Message CTL (type: CTL, msgid: 7, srcid: 139777121523344, action: shell, target: d4)
2022-02-19 11:35:55,938 ClusterShell.Gateway DEBUG GatewayChannel._state_ctl
2022-02-19 11:35:55,938 ClusterShell.Gateway DEBUG decoded gw invoke (PYTHONPATH=/home/shared/clustershell/lib CLUSTERSHELL_GW_LOG_LEVEL=debug python3 -m ClusterShell.Gateway -Bu)
2022-02-19 11:35:55,938 ClusterShell.Gateway DEBUG assigning task infos ({'debug': True, 'fanout': 64, 'grooming_delay': 0.25, 'connect_timeout': 15.0, 'command_timeout': 0.0})
2022-02-19 11:35:55,938 ClusterShell.Gateway DEBUG inherited fanout value=64
2022-02-19 11:35:55,938 ClusterShell.Gateway DEBUG launching execution/enter gathering state
2022-02-19 11:35:55,938 ClusterShell.Gateway DEBUG TreeWorkerResponder initialized grooming=0.250000
2022-02-19 11:35:55,938 ClusterShell.Worker.Tree DEBUG stderr=True
2022-02-19 11:35:55,939 ClusterShell.Worker.Tree DEBUG TreeWorker._launch on d4 (fanout=64)
2022-02-19 11:35:55,939 ClusterShell.Worker.Tree DEBUG next_hops=[('d4', 'd4')]
2022-02-19 11:35:55,939 ClusterShell.Worker.Tree DEBUG task.shell cmd=hostname source=None nodes=d4 timeout=-1 remote=True
2022-02-19 11:35:55,939 ClusterShell.Gateway DEBUG SSHCLIENT: ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes d4 hostname
2022-02-19 11:35:55,941 ClusterShell.Worker.Tree DEBUG MetaWorkerEventHandler: ev_start
2022-02-19 11:35:55,941 ClusterShell.Worker.Tree DEBUG TreeWorker: _check_ini (1, 1)
2022-02-19 11:35:55,941 ClusterShell.Gateway DEBUG TreeWorkerResponder: ev_start
2022-02-19 11:35:55,942 ClusterShell.Worker.Tree DEBUG added child worker <ClusterShell.Worker.Ssh.WorkerSsh object at 0x7fc625786110> count=1
2022-02-19 11:35:55,942 ClusterShell.Worker.Tree DEBUG TreeWorker: _check_ini (1, 1)
2022-02-19 11:35:55,942 ClusterShell.Gateway DEBUG TreeWorkerResponder: ev_start
2022-02-19 11:35:55,942 ClusterShell.Gateway DEBUG TreeWorker scheduled
2022-02-19 11:35:56,166 ClusterShell.Gateway DEBUG d4: b'myhostname'
2022-02-19 11:35:56,167 ClusterShell.Worker.Tree DEBUG _on_node_close d4 0 (0)
2022-02-19 11:35:56,168 ClusterShell.Worker.Tree DEBUG MetaWorkerEventHandler: ev_close, timedout=False
2022-02-19 11:35:56,168 ClusterShell.Worker.Tree DEBUG check_fini 1 1
2022-02-19 11:35:56,168 ClusterShell.Gateway DEBUG TreeWorkerResponder: ev_close timedout=False
2022-02-19 11:35:56,168 ClusterShell.Gateway DEBUG iter(stdout): d4: 20 bytes
2022-02-19 11:35:56,169 ClusterShell.Gateway DEBUG iter(rc): d4: rc=0
2022-02-19 11:35:56,172 ClusterShell.Gateway DEBUG GatewayChannel: ev_close
2022-02-19 11:35:56,172 ClusterShell.Gateway DEBUG Task performed
2022-02-19 11:35:56,173 ClusterShell.Gateway DEBUG -------- The End --------

So from the second log we can see that the command actually ran successfully on d4; the result just couldn't make it back up because of the failure on a1.

I've fixed that error in
https://review.gerrithub.io/c/cea-hpc/clustershell/+/533465
and with that change the command now runs normally. There might still be some missing fallbacks if a lower-level gateway is unreachable, though; that would require some testing...

@BerndKrischok
Author

Hi Dominique,

Many thanks for this fix. It works, great.
Bernd

thiell added a commit to thiell/clustershell that referenced this issue Mar 20, 2022
Intermediate gateways create a TreeWorker with a topology passed by
the parent node and a new root node, and instantiate their own
PropagationNodeRouter object.

When such TreeWorker is scheduled, we should set the task's default
router from it properly.

The error seen was:
...
File "ClusterShell/Propagation.py", line 411, in ev_close
    self.task.router.mark_unreachable(gateway)
    AttributeError: 'NoneType' object has no attribute 'mark_unreachable'

Fixes cea-hpc#471
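
The failure mode described in this commit message can be illustrated with a toy model. None of the classes below are ClusterShell's real classes, and this is not the actual patch; `Router`, `Task`, `Channel`, `TreeWorker` and the `set_default_router` flag are hypothetical stand-ins that only mirror the shape of the problem: a close handler calls `task.router.mark_unreachable(...)` while the task's router is still `None`.

```python
class Router:
    """Hypothetical stand-in for a propagation router."""

    def __init__(self):
        self.unreachable = set()

    def mark_unreachable(self, gateway):
        self.unreachable.add(gateway)


class Task:
    """Hypothetical task whose default router starts out unset."""

    def __init__(self):
        self.router = None


class Channel:
    """Hypothetical channel whose close handler consults the task router."""

    def __init__(self, task, gateway):
        self.task = task
        self.gateway = gateway

    def ev_close(self):
        # Same shape as the failing call in the traceback.
        self.task.router.mark_unreachable(self.gateway)


class TreeWorker:
    """Hypothetical worker that owns its own router."""

    def __init__(self, router):
        self.router = router

    def schedule(self, task, set_default_router):
        # Idea of the fix: hand the worker's router to the task on scheduling.
        if set_default_router and task.router is None:
            task.router = self.router


def close_channel(apply_fix):
    """Schedule a worker, then close a channel to gateway b2."""
    task = Task()
    worker = TreeWorker(Router())
    worker.schedule(task, set_default_router=apply_fix)
    channel = Channel(task, "b2")
    try:
        channel.ev_close()
    except AttributeError as exc:
        return f"failed: {exc}"
    return f"ok, unreachable={sorted(task.router.unreachable)}"


print(close_channel(apply_fix=False))
print(close_channel(apply_fix=True))
```

Without the router assignment the close handler raises the same kind of AttributeError as in the traceback; with it, the handler finds a router object to call.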
thiell added a commit that referenced this issue Jun 18, 2022
@thiell thiell modified the milestones: 1.8.5, 1.9 Jun 28, 2022