zmq: support single port configuration #3077

Merged
merged 3 commits into from
May 2, 2019

oliver-sanders
Member

closes #3075

Also adds a nice error message in the event that ZMQ fails to bind to the specified port(s).

@oliver-sanders oliver-sanders added the bug Something is wrong :( label Apr 5, 2019
@oliver-sanders oliver-sanders added this to the cylc-8.0a1 milestone Apr 5, 2019
@oliver-sanders oliver-sanders self-assigned this Apr 5, 2019
@oliver-sanders oliver-sanders requested a review from kinow April 5, 2019 13:45
@oliver-sanders
Member Author

@matthewrmshin small change and a test added since your review.

@kinow
Member

kinow commented Apr 5, 2019

Unit tests got stuck in Travis. Checked out the branch locally and the same happened:

lib/cylc/tests/cycling/test_integer.py::TestIntegerSequence::test_multiple_exclusions_integer_sequence_weird_valid_formatting PASSED [ 47%]
lib/cylc/tests/cycling/test_integer.py::TestIntegerSequence::test_multiple_exclusions_simple PASSED [ 48%]
lib/cylc/tests/cycling/test_integer.py::TestIntegerSequence::test_simple PASSED [ 48%]
lib/cylc/tests/cycling/test_iso8601.py::TestISO8601Sequence::test_advanced_exclusions_partial_datetime1 

Member

@kinow kinow left a comment

I think there's a hanging thread/socket somewhere. I wasn't going to use any Linux tool to check files, but @hjoliver taught me (thanks!) about the shortcuts in htop, so I quickly inspected the pytest process and found that it hung while waiting on a stream-type file descriptor (I believe these are used by PyZMQ under the hood). 💥

Screenshot from 2019-04-06 10-54-47

(fd=14 is the file descriptor it's using with poll)

Screenshot from 2019-04-06 10-57-01

So I suspected PyZMQ was not closing its sockets properly. After reading some of its docs (opening the classes via reference in the IDE), I tried the following, which worked:

diff --git a/lib/cylc/network/server.py b/lib/cylc/network/server.py
index 44e860979..26e6506c5 100644
--- a/lib/cylc/network/server.py
+++ b/lib/cylc/network/server.py
@@ -105,6 +105,9 @@ class ZMQServer(object):
                 self.port = self.socket.bind_to_random_port(
                     'tcp://*', min_port, max_port)
         except zmq.error.ZMQError as exc:
+            if self.socket:
+                self.socket.close()
+            del self.context
             raise CylcError('could not start Cylc ZMQ server: %s' % str(exc))
 
         # start accepting requests
@@ -120,6 +123,8 @@ class ZMQServer(object):
         LOG.debug('stopping zmq server...')
         self.queue.put('STOP')
         self.thread.join()  # wait for the listener to return
+        self.socket.close()
+        del self.context
         LOG.debug('...stopped')
 
     def register_endpoints(self):
diff --git a/lib/cylc/tests/test_zmq.py b/lib/cylc/tests/test_zmq.py
index b609365e3..4727274a6 100644
--- a/lib/cylc/tests/test_zmq.py
+++ b/lib/cylc/tests/test_zmq.py
@@ -33,6 +33,6 @@ def test_single_port():
 
     with pytest.raises(CylcError) as exc:
         serv2.start(port, port)
-        assert 'Address already in use' in str(exc)
+    assert 'Address already in use' in str(exc)
 
     serv1.stop()

Hope that helps
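
The test change at the end of the diff above moves the assertion out of the `with` block. A minimal sketch of why that matters, using the standard library's `unittest` in place of pytest so that it runs without extra packages; `StartError` and `start` are hypothetical stand-ins for `CylcError` and the server's start method:

```python
import unittest


class StartError(Exception):
    """Hypothetical stand-in for CylcError."""


def start(port):
    """Hypothetical stand-in for a server start that fails to bind."""
    raise StartError('could not start server: Address already in use')


case = unittest.TestCase()
reached = []

with case.assertRaises(StartError) as ctx:
    start(43001)
    reached.append('inside')  # skipped: start() raised on the line above

# Code after the block runs once the expected exception has been caught.
reached.append('outside')
assert 'Address already in use' in str(ctx.exception)
```

Any statement placed after the raising call inside the block is silently skipped, so an assertion there can never fail. With pytest the same reasoning applies, and the caught exception is available as `exc.value`.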

@@ -278,7 +286,7 @@ def __init__(self, schd):
             self,
             encrypt,
             decrypt,
-            lambda: get_secret(schd.suite)
+            partial(get_secret, schd.suite)
Member

Is that for performance? I think lambdas are a bit easier to follow?

Member Author

TLDR; lambdas are considered by some (incl. Guido) to have been a mistake, one of "Python's glitches"!

There are some benefits of partial including:

But really it's about lambdas falling from favour rather than partial delivering superior functionality; nowadays Python has turned against lambdas:

  • They were never properly implemented (e.g. can only be defined on one line)
  • Named lambdas break pycodestyle (the argument being you should just use a function instead)
  • Late binding issues cause endless confusion (for example).

https://stackoverflow.com/a/3252425
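
The late binding point in the list above can be sketched with a small loop. This is an illustrative example only; `echo` is a hypothetical helper and not part of Cylc:

```python
from functools import partial


def echo(value):
    """Hypothetical helper that simply returns its argument."""
    return value


# Each lambda looks up `i` when it is called, after the loop has finished.
late = [lambda: echo(i) for i in range(3)]

# Each partial stores the value of `i` at the moment it is created.
early = [partial(echo, i) for i in range(3)]

late_results = [f() for f in late]
early_results = [f() for f in early]

print(late_results)   # [2, 2, 2]
print(early_results)  # [0, 1, 2]
```

All three lambdas share the same variable and see its final value, whereas each partial keeps the value it was given.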

Member

Interesting! I've never used functools.partial 🤔

Member

Gotcha! But in this case lambdas and partials deliver pretty much the same feature, right?

In Java/Scala lambdas and closures are still being used, and the equivalent of partials is currying in Scala (supported out of the box by the language by simply not providing all parameters) and, in Java (as verbose as always), creating objects that instantiate classes from java.util.function.

I think both are still used: in some cases developers prefer lambdas for clarity/simplicity, while others prefer currying to be closer to FP. And there are cases where one or the other can be misused, causing unwanted bugs, performance regressions, etc. But I think none are discouraged yet.

If I understood correctly, in this case, we have something similar to this code:

from functools import partial


class Scheduler(object):

    def __init__(self):
        self.suite = "sugar"
        self.testa = lambda: self.echovalue(self.suite)
        self.testb = partial(self.echovalue, self.suite)

    def echovalue(self, value):
        print(value)


s = Scheduler()
s.testa()  # sugar
s.testb()  # sugar

s.suite = "salt"
s.testa()  # salt
s.testb()  # sugar, because partial is similar to a... FP applicative functor? keeping the value inside it.. I think...

So I think it means that if we ever touch sched.suite and change its value, then the partial object we passed would be outdated, and we would actually have to create a new one. I think?

So much to learn in Python 😟 and quite late here, so sorry if wrote anything silly ☕

Member Author

So much to learn in Python 😟 and quite late here, so sorry if wrote anything silly ☕

Nothing silly whatsoever.

Gotcha! But in this case lambdas and partials deliver pretty much the same feature, right?

Exactly!

In Java/Scala lambdas and closures are still being used

Lambdas can be nice, but as a concept they have never really been properly integrated into Python. Their scope is limited, only one expression and no line breaks permitted.

But I think none are discouraged yet.

Lambdas are discouraged in Python to some extent if not entirely:

  • List comprehensions and generator expressions are preferred to lambdas where possible removing most lambda use cases.
  • Named lambdas are actively discouraged.
  • The reason that the map function is discouraged is actually to do with the lambda use case e.g. map(lambda x: x**2, [1, 2, 3]).
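
As a small illustration of the first and last points, here is the same computation written with `map` plus a lambda and with comprehension syntax. The variable names are made up for this sketch:

```python
numbers = [1, 2, 3]

# map() takes the function first and the iterable second.
squares_map = list(map(lambda x: x ** 2, numbers))

# The list comprehension expresses the same thing without a lambda.
squares_comp = [x ** 2 for x in numbers]

# A generator expression avoids building the intermediate list.
total = sum(x ** 2 for x in numbers)

print(squares_map, squares_comp, total)
```

Both forms produce the same list; which one reads better is largely a matter of taste and house style.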

So I think it means that if we ever touch sched.suite and change its value, then the partial object we passed would be outdated

Yep, to me that's the correct behaviour!

Late closures can be useful but I think they are likely to catch you out more often than they are to come in handy.

Member

Yep, to me that's the correct behaviour!

Well, I'm -0 here. Fine if others (cc @cylc/core) prefer the partial. For me if I need to change the .suite property of Scheduler, I wouldn't want to have to remember that I also need to update the encode and the decode method.

Late closures can be useful but I think they are likely to catch you out more often than they are to come in handy.

Agreed, but in this case I think I would stick with the lambda to prevent a possible bug. -0 and not -1 because we don't change the value of .suite now. As far as I can tell, renaming a suite causes that suite to be re-registered, creating a new instance of Scheduler. But if that changed later to actually reuse the Scheduler (for reasons), then we would have a bug where the suite was not able to authenticate, I think.

            else:
                self.port = self.socket.bind_to_random_port(
                    'tcp://*', min_port, max_port)
        except zmq.error.ZMQError as exc:
Member

@kinow kinow Apr 5, 2019

I think it's better to catch ZMQBindError... ZMQError - strangely - is not the parent class of all errors in PyZMQ... so ZMQBindError extends ZMQBaseError... ZMQError also extends ZMQBaseError.

bind_to_random_port may raise a ZMQBindError I think... my IDE shows the documentation of the API used, and luckily PyZMQ documents return/args/raises/etc.

image

Member Author

ZMQError - strangely - is not the parent class of all errors in PyZMQ

Strange, good spot!

Member Author

@oliver-sanders oliver-sanders Apr 8, 2019

OK, just tried this, turns out that ZMQ raises a ZMQError rather than the more specific ZMQBindError documented!

Member Author

From the pyzmq code:

        for i in range(max_tries):
            try:
                port = random.randrange(min_port, max_port)
                self.bind('%s:%s' % (addr, port))
            except ZMQError as exception:
                en = exception.errno
                if en == zmq.EADDRINUSE:
                    continue
                elif sys.platform == 'win32' and en == errno.EACCES:
                    continue
                else:
                    raise
            else:
                return port
        raise ZMQBindError("Could not bind socket to random port.")

So it raises ZMQError if it fails to bind to a specific port and ZMQBindError if it fails to find a free port to bind to.

Member Author

Will catch both...
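
A rough sketch of what catching both could look like. To stay runnable without PyZMQ installed, the exception classes below are stand-ins that only mirror the hierarchy described in this thread; `ServerStartError`, `start`, `busy_bind` and `free_bind` are hypothetical names, and the random-port loop is a simplified version of the quoted one (it treats every `ZMQError` as "port busy"). The single-port branch that binds directly is an assumption about how the server handles `min_port == max_port`:

```python
import random


class ZMQBaseError(Exception):
    """Stand-in for zmq.error.ZMQBaseError."""


class ZMQError(ZMQBaseError):
    """Stand-in for zmq.error.ZMQError (e.g. address already in use)."""


class ZMQBindError(ZMQBaseError):
    """Stand-in for zmq.error.ZMQBindError (no free port found)."""


class ServerStartError(Exception):
    """Hypothetical application error, playing the role of CylcError."""


def bind_to_random_port(bind, min_port, max_port, max_tries=100):
    """Simplified version of the quoted pyzmq retry loop."""
    for _ in range(max_tries):
        port = random.randrange(min_port, max_port)
        try:
            bind(port)
        except ZMQError:
            continue  # port busy, try another one
        return port
    raise ZMQBindError('Could not bind socket to random port.')


def start(bind, min_port, max_port):
    """Bind to a single port or to a random port in a range."""
    try:
        if min_port == max_port:
            bind(min_port)  # a failure here surfaces as ZMQError
            return min_port
        # exhausting the retries surfaces as ZMQBindError
        return bind_to_random_port(bind, min_port, max_port)
    except (ZMQError, ZMQBindError) as exc:
        raise ServerStartError('could not start server: %s' % exc) from exc


def busy_bind(port):
    """Fake bind() that behaves as if every port were taken."""
    raise ZMQError('Address already in use')


def free_bind(port):
    """Fake bind() that always succeeds."""
```

Since neither exception is a subclass of the other, listing both in the `except` clause (or catching their shared base class) is needed to report either failure through the same error message.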

@oliver-sanders
Member Author

Thanks Bruno for your excellent spotting! Tests pass without hanging for me, how is it at your end?

@kinow
Member

kinow commented Apr 8, 2019

Code looks good, Travis seems to be going to sleep (must be in NZ time), so just kicked its unit test stage. It failed with:

lib/cylc/tests/cycling/test_integer.py::TestIntegerSequence::test_multiple_exclusions_integer_sequence2 
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received
The build has been terminated

Will take another look in the morning and hopefully merge it! 🎉

@oliver-sanders
Member Author

oliver-sanders commented Apr 8, 2019

TB has timed out again. Strange, I don't get that locally, do you?

I had put in all your diffs except these lines:

+        del self.context

Which seemed a little harsh to poor ZMQ.

I'll put them in now and see if that fixes it.

Member

@kinow kinow left a comment

I think we need to update one use of partial as per comment to get Travis build passing.

@oliver-sanders
Member Author

Poke.

@kinow
Member

kinow commented May 2, 2019

Sorry for missing this one! Merging now! 🎉 🚀

@kinow kinow merged commit 6469c13 into cylc:master May 2, 2019
Member

@hjoliver hjoliver left a comment

This looks fine to me (I went through it and tested it about a week ago, but got somewhat distracted by testing the ZMQ random port grabbing functionality - for a multi-port range, as opposed to the point of this PR - which didn't seem to work as I recall ... I always got a failure on trying to start a 2nd suite ... need to revisit that though). Anyway, long story short ... I'll approve this, but do we have agreement on the partial vs lambda discussion?

@hjoliver
Member

hjoliver commented May 2, 2019

Haha, @kinow beat me to it! 👍

@kinow
Member

kinow commented May 2, 2019

I'll approve this, but do we have agreement on the partial vs lambda discussion?

Sorry for merging it before you had time to comment!

but do we have agreement on the partial vs lambda discussion?

Not a consensus I think. No strong opinion on enforcing lambdas, was more curious why one was used replacing the other. Both would work here I think, and some libraries (e.g. graphql) have code & examples with lambdas (sure we can find some with partial too, and the ones with lambdas could possibly be re-written with partial). Possibly like for vs. list comprehension, where the best could be finding the balance between readability and simplicity?

@oliver-sanders oliver-sanders deleted the zmq-single-port branch May 2, 2019 12:53
Successfully merging this pull request may close these issues.

Not able to run Cylc with a single port
4 participants