Running tile-reduce across containers #108

Open

jakepruitt opened this issue Mar 9, 2017 · 6 comments

@jakepruitt

The current internal workings of tile-reduce make it very good for running on a single machine, but `require('os').cpus().length` does not accurately report the number of CPUs available inside a Docker container, so tile-reduce can be very resource hungry when running in a container.
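
For illustration, one container-aware alternative is to read the cgroup CPU quota directly. A minimal sketch, assuming cgroup v1 mount paths (cgroup v2 exposes `cpu.max` instead):

```js
// Derive the CPU budget from the CFS quota files rather than
// os.cpus().length, which reports the host's core count.
const fs = require('fs');
const os = require('os');

function availableCpus() {
  try {
    const quota = parseInt(fs.readFileSync('/sys/fs/cgroup/cpu/cpu.cfs_quota_us', 'utf8'), 10);
    const period = parseInt(fs.readFileSync('/sys/fs/cgroup/cpu/cpu.cfs_period_us', 'utf8'), 10);
    // quota is -1 when the container has no CPU limit
    if (quota > 0 && period > 0) return Math.max(1, Math.floor(quota / period));
  } catch (err) {
    // not running under a CPU-limited cgroup, or files unavailable
  }
  return os.cpus().length;
}
```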

Rather than working with a single-machine, process-forking model for distributing work, could we write a version of tile-reduce whose workers run as their own short-lived containers and communicate with a master container over HTTP or TCP? For use on ECS, we could use https://github.com/mapbox/ecs-watchbot to manage the orchestration of launching new containers on an AWS ECS cluster, and we could add affordances for other orchestration tools (like Mesos or Kubernetes) if people want them.
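
As a rough illustration of the worker-pull idea (the `/next-tile` endpoint and job format here are hypothetical, not anything tile-reduce provides today):

```js
// Hypothetical master: hands out [x, y, z] tile jobs over HTTP to
// short-lived worker containers that poll for work.
const http = require('http');

const tiles = [[669, 1583, 12], [670, 1583, 12]]; // pending jobs

http.createServer((req, res) => {
  if (req.url === '/next-tile') {
    const tile = tiles.shift();
    if (!tile) {
      res.writeHead(204); // nothing left; workers can exit
      return res.end();
    }
    res.writeHead(200, {'Content-Type': 'application/json'});
    res.end(JSON.stringify({tile: tile}));
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(3000);
```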

cc/ @nickcordella @mourner @tcql @rclark

@mourner (Member) commented Mar 9, 2017

You can specify the `maxWorkers` option to limit the number of workers (e.g. to 1 or 2).
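
For example (the map script and source here are placeholders):

```js
const path = require('path');
const tilereduce = require('tile-reduce');

tilereduce({
  zoom: 12,
  map: path.join(__dirname, 'map.js'),
  sources: [{name: 'osm', mbtiles: 'latest.planet.mbtiles'}],
  maxWorkers: 2 // cap forked processes instead of using every reported CPU
})
.on('end', function () { console.log('done'); });
```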

Also, didn't we have a watchbot repo that split and managed tile-reduce jobs? I can't find it right now, though.

@tcql (Contributor) commented Mar 9, 2017

It seems cool on paper, but tbh I'm not sure what this would gain you that you can't already do with watchbot.

I'd love to be proven wrong, but I suspect it'd be easiest to use ecs-watchbot in reduce mode, storing worker output data in a known location on S3 that the reducer can pull from. You'd need to orchestrate a queueing script that generates the list of tiles to process, as well as replicate logic for reading mbtiles / s3 tiles with tilelive in the workers, but that's relatively minimal effort and could be abstracted.
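
A sketch of what that queueing script might look like, assuming tile-cover for enumerating tiles and an SQS queue that watchbot consumes (the queue URL, message shape, and area of interest are all assumptions):

```js
const cover = require('@mapbox/tile-cover');
const AWS = require('aws-sdk');

const sqs = new AWS.SQS({region: 'us-east-1'});

// Area of interest; tile-cover returns the [x, y, z] tiles covering it.
const geometry = {
  type: 'Polygon',
  coordinates: [[[-122.5, 37.7], [-122.3, 37.7], [-122.3, 37.8], [-122.5, 37.8], [-122.5, 37.7]]]
};
const tiles = cover.tiles(geometry, {min_zoom: 12, max_zoom: 12});

Promise.all(tiles.map(function (tile) {
  return sqs.sendMessage({
    QueueUrl: process.env.WORK_QUEUE_URL, // hypothetical watchbot work queue
    MessageBody: JSON.stringify({tile: tile})
  }).promise();
})).then(function () {
  console.log('queued ' + tiles.length + ' tiles');
});
```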

@rclark commented Mar 9, 2017

> I suspect it'd be easiest to use ecs-watchbot in reduce mode, storing worker output data in a known location on S3 that the reducer can pull from.

I agree... but only from the perspective of someone who has worked pretty extensively with both libraries. I think the primary motivation for work like this would be to ease transition of systems currently running tile-reduce on EC2s over to ECS. Does that seem worthwhile? Or is it more worthwhile to have developers learn the ins and outs of ecs-watchbot to transition these stacks?

@nickcordella

> I think the primary motivation for work like this would be to ease transition of systems currently running tile-reduce on EC2s over to ECS.

As someone who recently lived through the use case this is meant to address, I lean toward not dedicating concerted resources to a custom tile-reduce version. It's easy to stack-parameterize MaxWorkers and tie it to reservation.cpu as a way of getting a tile-reduce job up and running on ECS. If that framework is not too offensive to the platform team, I'd suggest calling it out somewhere in the docs as a strategy to ease the transition a bit, while also acknowledging that fully leveraging ECS Watchbot is the more elegant approach, whenever the developer feels comfortable.
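
Concretely, the application side of that workaround is small (the MAX_WORKERS environment variable, wired up from the stack parameter, is an assumption):

```js
const tilereduce = require('tile-reduce');

tilereduce({
  zoom: 12,
  map: __dirname + '/map.js',
  sources: [{name: 'osm', mbtiles: 'latest.planet.mbtiles'}],
  // MaxWorkers stack parameter -> container env var -> worker cap
  maxWorkers: parseInt(process.env.MAX_WORKERS || '1', 10)
});
```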

@rclark commented Mar 9, 2017

> If that framework is not too offensive to the platform team

It is not, except

  • there's not going to be any hard-and-fast rule about where you should set MaxWorkers, but we may come harass you if we find your tasks being bad neighbors
  • you're basically throttling yourself by insisting that workers operate as part of the same ECS task. Without this limitation you could be scaling to thousands of workers spread across our cluster, rather than dozens bunched up on a single EC2.

@morganherlocker (Contributor)

An HTTP/TCP-based MapReduce implementation is going to have very different performance characteristics from tile-reduce. The communication protocol has a major effect on the architecture considerations for any particular job, such as tile zoom or memory usage. I think something like this is worthwhile, but it's a very different problem from the one tile-reduce is trying to solve (closer to Hadoop).

If the noisy-neighbor problem is a common footgun, though, I would be in favor of making `maxWorkers` a required parameter with no default. We can document an example that sets `maxWorkers` to `os.cpus().length`, so the normal "I'm running this on a laptop" use case is easy to achieve. In general, I do think it is always best practice to carefully consider how many workers you want to use, even on a laptop (for other reasons, like RAM constraints).
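
The documented laptop example could be as simple as this sketch (assuming `maxWorkers` became required):

```js
const os = require('os');
const tilereduce = require('tile-reduce');

tilereduce({
  zoom: 12,
  map: __dirname + '/map.js',
  sources: [{name: 'osm', mbtiles: 'latest.planet.mbtiles'}],
  maxWorkers: os.cpus().length // explicit, even for the local case
});
```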
