Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ssh task comms method broken #2463

Closed
hjoliver opened this issue Oct 26, 2017 · 19 comments
Closed

ssh task comms method broken #2463

hjoliver opened this issue Oct 26, 2017 · 19 comments
Assignees
Labels
bug Something is wrong :(
Milestone

Comments

@hjoliver
Copy link
Member

hjoliver commented Oct 26, 2017

The ssh task communication method is broken in 7.5.0 and current master.

For remote command re-invocation by ssh, "--env" options are added to the command line at the remote end, to pass environment variables. In the case of ssh task messaging, this passes crucial elements of the task job environment to "cylc message" on the suite host. These options were being processed by bin/cylc but were omitted from recent refactor (Python to Bash) of that script.

This did not show up in our tests because if ssh task comms fails we automatically fall back to the normal HTTPS method, which works fine at our sites - so aside from some messages in task job.err from the initial failed messaging attempt everything works as expected.

Bug reported by @MartinDix

@hjoliver hjoliver added the bug Something is wrong :( label Oct 26, 2017
@hjoliver hjoliver added this to the next release milestone Oct 26, 2017
@hjoliver
Copy link
Member Author

A note on the reason for our general ssh client command re-invocation on remote hosts (reminding myself): it is required to support ssh task messaging, because task jobs may need to execute other client commands beyond just "cylc message" - and if messaging can't go back to the suite host via HTTPS then neither can anything else.

(At this point I somewhat regret the ssh comms method - it might have been more sensible to simply say if you want to use Cylc then get the right ports opened!)

@matthewrmshin
Copy link
Contributor

Sorry for the breakage. I suppose:

  • Our current test is not adequate. I remember adding logic to prevent the test from using Pyro communication via backdoor. I guess the logic no longer works in our latest setup?
  • Unfortunately, SSH communication is still relevant because cylc suites use a range of ports, so this communication method is the only single port solution for inter-host communication. It is easy to ask a site to open up a single port, but may be more difficult to ask them to open a range of ports.

@matthewrmshin
Copy link
Contributor

See also #2214 and #2302.

@hjoliver
Copy link
Member Author

The fallback logic is in the program code. If you added logic to the ssh comms tests to somehow prevent fallback, then another possibility is that no one ran remote host tests locally before the 7.5.0 release - we're relying too much on Travis CI...

@matthewrmshin
Copy link
Contributor

There should be a test. I run remote tests on our site regularly. I cannot fully trust Travis tests as we have very different environment.

@hjoliver
Copy link
Member Author

hjoliver commented Oct 27, 2017 via email

@matthewrmshin
Copy link
Contributor

Just looked at the test. We used to rely on the passphrase being installed manually on remote job hosts or by Rose. It is now done automatically by cylc job management. The comment in the test suggests that we rely on the old behaviour for it to test the right thing, (hence don't install the passphrase file). Definitely need a better test.

@matthewrmshin
Copy link
Contributor

I wonder if we should combine #2214 (with essence of #2302) into the fix for this.

  • Can we assume that we don't need to support SSH communication from hosts with shared file system with suite host? This will make it very easy to do client commands in tasks: detect ssh comms #2214.
  • We should move the SSH+HTTP(S) communication logic from the messaging module to the network API client module. We'll introduce a simple command on the suite host to handle any method in the API client.
    • The command will be expected to be invoked via SSH (although not necessarily), perhaps with STDIN being used to accept the API method arguments encoded in JSON. (Note: not quite Suite API - On-The-Fly #2123!)
    • In theory, this should allow even the GUI on a remote host to do SSH+HTTP(S) communication with a suite.

@matthewrmshin matthewrmshin self-assigned this Oct 30, 2017
@matthewrmshin
Copy link
Contributor

(I have decided on a basic fix to restore the functionality for now.)

@hjoliver
Copy link
Member Author

(yes, agreed that a basic fix is best for next release)

@hjoliver
Copy link
Member Author

Can we assume that we don't need to support SSH communication from hosts with shared file system with suite host?

How do you figure that? Implement a new task comms method that uses the filesystem?

@hjoliver
Copy link
Member Author

I like the suggested shift to network API

@matthewrmshin
Copy link
Contributor

What I really meant was that, could we assume that a host with a shared file system with the suite host would be able to use HTTP(S) communication? (So we don't get into all sort of complicated scenarios discussed in #2214.)

@hjoliver
Copy link
Member Author

I guess we could recommend if you have a shared FS between suite and task host, do not use ssh task comms. But obviously presence of a shared FS does not actually have any bearing on whether or or the HTTPS comms ports are open between the hosts. Or are you suggesting that it implies the two hosts are in the same security zone and thus getting the ports opened shouldn't be a problem?

@matthewrmshin
Copy link
Contributor

Yes. In other word, can we get away with supporting SSH+HTTP(S) communication only in the case where the remote host does not have a shared file system with the suite host? Otherwise, we'll need to find ways to get around the complication discussed in #2214.

(Otherwise, we should consider ways to make HTTP(S) communication friendlier to sites that do not want to open up a range of ports.)

@hjoliver
Copy link
Member Author

OK, I understand why we'd rather not support ssh comms if there is a shared filesystem, I'm just concerned that a shared filesystem does not automatically imply HTTPS comms will be allowed (and so so we can't assume that). However, I think it is more likely to be allowed at such sites, and it should be fine to simply document why we don't support ssh comms for that case.

@hjoliver
Copy link
Member Author

(unless we can think of an easy way to support ssh comms even with a shared filesystem)

@matthewrmshin
Copy link
Contributor

I suppose we can have a system such that when SSH+HTTP(S) communication is specified in the contact file, it will be used regardless unless the current host is the suite host itself. It may overkill (or not work?) on a mixed network, but that should be very unusual.

@hjoliver
Copy link
Member Author

If the solution is overly complex (with ongoing support implications etc.) - and especially if it fails on a mixed network - then I'd actually rather not do it, and just document to users why they cannot use ssh comms in this situtation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is wrong :(
Projects
None yet
Development

No branches or pull requests

2 participants