Protocol addition to bulk register jobs (x CAN_DO -> 1 MASS_DO) #6

Open
hashar opened this issue Jul 21, 2016 · 1 comment

@hashar

hashar commented Jul 21, 2016

OpenStack Zuul relies on Gearman to run jobs on a farm of 800 nodes, each registering 20000 functions. The massive number of CAN_DO packets stresses the server. @jeblair extended the Gearman protocol so that a worker can register multiple functions with a single MASS_DO command.

Their software is written in Python and the protocol extension is handled via a child class: openstack-infra/zuul@d437159. For the worker:

import gear

class GearWorker(gear.Worker):
    # Packet type number for the non-standard MASS_DO extension
    MASS_DO = 101

    def sendMassDo(self, functions):
        # Join all function names with NUL bytes into a single payload
        data = b'\x00'.join([gear.convert_to_bytes(x) for x in functions])
        self.broadcast_lock.acquire()
        try:
            p = gear.Packet(gear.constants.REQ, self.MASS_DO, data)
            self.broadcast(p)
        finally:
            self.broadcast_lock.release()

On the server side:

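    def handleMassDo(self, packet):
        # Replace any functions previously registered on this connection
        packet.connection.functions = set()
        # The payload is a NUL-separated list of function names
        for name in packet.data.split(b'\x00'):
            self.log.debug("Adding function %s to %s" % (
                name, packet.connection))
            packet.connection.functions.add(name)
            self.functions.add(name)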

So MASS_DO merely joins the payloads of several CAN_DO packets with \x00 into a single packet, which saves all the per-packet handling overhead.
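To make the saving concrete, here is a minimal standalone sketch that does not depend on the gear library. It assumes the standard Gearman binary framing (a 4-byte magic, a 4-byte big-endian type and a 4-byte big-endian size, followed by the payload) and the type number 101 from the Zuul extension above. The helper names `frame`, `encode_mass_do` and `decode_mass_do`, and the `build:job-N` function names, are made up for illustration.

```python
import struct

REQ_MAGIC = b"\x00REQ"  # magic prefix of Gearman request packets
CAN_DO = 1              # standard Gearman packet type
MASS_DO = 101           # non-standard type used by the Zuul extension
HEADER_SIZE = 12        # magic (4) + type (4) + payload size (4)


def frame(packet_type, payload):
    """Build a Gearman binary packet: magic, type, size, then payload."""
    return REQ_MAGIC + struct.pack(">II", packet_type, len(payload)) + payload


def encode_mass_do(functions):
    """Join function names with NUL bytes into one MASS_DO packet."""
    return frame(MASS_DO, b"\x00".join(functions))


def decode_mass_do(packet):
    """Split a MASS_DO packet back into a set of function names."""
    magic, packet_type, size = struct.unpack(">4sII", packet[:HEADER_SIZE])
    if magic != REQ_MAGIC or packet_type != MASS_DO:
        raise ValueError("not a MASS_DO request packet")
    return set(packet[HEADER_SIZE:HEADER_SIZE + size].split(b"\x00"))


# Hypothetical function names, 20000 per worker as in the report above.
functions = [b"build:job-%d" % i for i in range(20000)]

# Bytes on the wire: one CAN_DO packet per function vs. one MASS_DO packet.
can_do_bytes = sum(len(frame(CAN_DO, name)) for name in functions)
mass_do_packet = encode_mass_do(functions)

print(can_do_bytes, len(mass_do_packet))
```

Each CAN_DO carries its own 12-byte header, while MASS_DO pays for one header plus one separator byte between names. The byte count is only part of the picture: the server also handles one packet instead of 20000.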

Would that be a feature that could be added to the reference Gearman protocol?

A bulk equivalent for CAN_DO_TIMEOUT could be of interest as well.

@SpamapS
Member

SpamapS commented Aug 21, 2016

This sounds awesome. I'm a zuul user myself and I can see how this could happen. Could you prepare a patch to the protocol docs and maybe a patch for the C/C++ server as well? Thanks!

wmfgerrit pushed a commit to wikimedia/integration-zuul that referenced this issue Jun 15, 2020
Zuul 2.5 Ansible launcher registered tens of thousands of functions on
each node which, when done serially, took a while.  To alleviate that
issue the Gear protocol had been extended with a custom MASS_DO packet
to register several functions in a single call (see d437159).

The Ansible launcher has been superseded by the executor server
removing the sole use of MASS_DO.  The extended Gear.Server had not been
cleaned up though.

Replace custom zuul.lib.gearserver.GearServer() with gear.Server() and
remove code.

For posterity, the MASS_DO idea is captured in Gearman upstream issue
tracker:
gearman/gearmand#6

Change-Id: Ifc57f9b7a17d1d9291a535eb0d9f5e1da3713241