Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix resource bookkeeping bug with acquiring unknown resource. #4945

Merged
merged 5 commits into from
Jun 8, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 12 additions & 2 deletions python/ray/tests/test_basic.py
Original file line number Diff line number Diff line change
Expand Up @@ -1754,7 +1754,7 @@ def f(n):
def g(n):
time.sleep(n)

time_buffer = 0.5
time_buffer = 2

start_time = time.time()
ray.get([f.remote(0.5), g.remote(0.5)])
Expand Down Expand Up @@ -1878,13 +1878,23 @@ def test(self):
def test_zero_cpus(shutdown_only):
ray.init(num_cpus=0)

# We should be able to execute a task that requires 0 CPU resources.
@ray.remote(num_cpus=0)
def f():
return 1

# The task should be able to execute.
ray.get(f.remote())

# We should be able to create an actor that requires 0 CPU resources.
@ray.remote(num_cpus=0)
class Actor(object):
def method(self):
pass

a = Actor.remote()
x = a.method.remote()
ray.get(x)


def test_zero_cpus_actor(ray_start_cluster):
cluster = ray_start_cluster
Expand Down
6 changes: 3 additions & 3 deletions src/ray/raylet/node_manager.cc
Original file line number Diff line number Diff line change
Expand Up @@ -1814,9 +1814,9 @@ bool NodeManager::AssignTask(const Task &task) {
cluster_resource_map_[my_client_id].Acquire(spec.GetRequiredResources());

if (spec.IsActorCreationTask()) {
// Check that we are not placing an actor creation task on a node with 0 CPUs.
RAY_CHECK(cluster_resource_map_[my_client_id].GetTotalResources().GetResourceMap().at(
kCPU_ResourceLabel) != 0);
// Check that the actor's placement resource requirements are satisfied.
RAY_CHECK(spec.GetRequiredPlacementResources().IsSubset(
cluster_resource_map_[my_client_id].GetTotalResources()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change looks like it would allow placing zero CPU actor creation tasks on a node with zero CPUs. The current behavior is to disallow it. Is this what we want?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the current behavior is to crash according to #4892. I view this more as a bug fix and not a change in behavior. The requirement of needing a CPU for placing the actor only applies to actors that do not acquire lifetime resources. The actors that do acquire lifetime resources can be placed as long as they can acquire those resources.

worker->SetLifetimeResourceIds(acquired_resources);
} else {
worker->SetTaskResourceIds(acquired_resources);
Expand Down
15 changes: 11 additions & 4 deletions src/ray/raylet/scheduling_resources.cc
Original file line number Diff line number Diff line change
Expand Up @@ -76,7 +76,11 @@ ResourceSet::ResourceSet() {}

ResourceSet::ResourceSet(
const std::unordered_map<std::string, FractionalResourceQuantity> &resource_map)
: resource_capacity_(resource_map) {}
: resource_capacity_(resource_map) {
for (auto const &resource_pair : resource_map) {
RAY_CHECK(resource_pair.second > 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This prevents the ability to create a resource set with any zero capacity resource. One of the use cases we supported in the past is a local laptop with zero CPU allocation to prevent work assignment/force all work assignment to the remote cluster.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, this change would break backward compatibility

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's actually still supported. This PR shouldn't have any API implications. The resources with quantity 0 are filtered out before getting to this constructor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just not comfortable with a constructor that breaks the whole raylet if you pass it a fairly reasonable zero resource capacity attribute. This restricts the programmatic expressivity of the ResourceSet. This should also be extensively tested with dynamic custom resources (ESCHER), while allowing some custom resource to drop down to zero capacity. There's a possibility that somewhere in the code we make a copy of the ResourceSet that's dynamically updated to \vec{r} = [..., 0, ...] and trigger this check. I.e., I'm not as worried about the static resource case as I am about the dynamic resource case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We maintain the invariant that all quantities in a ResourceSet are positive. That means that there is a unique internal representation for a given set of resource capacities and reduces the number of cases we have to check. We implemented those semantics in #4555.

This change makes sure to enforce those semantics more clearly (and the absence of this change allowed the bug to creep in). Note that the constructor below has the same check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think as long as other methods of ResourceSet that update capacity are correctly implemented, dynamic custom resources should be fine even when capacity drops to zero. As @robertnishihara pointed out, any illegal updates must be filtered out before this code is invoked. If they're not, raising a check fail here seems helpful for isolating bugs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robertnishihara , I see, I suppose it makes sense to be consistent with the second constructor, which is already performing this check. Longer term, not having the ability to represent a zero in a system seems broken to me. There was this major breakthrough in mathematics when people invented zero...

}
}

ResourceSet::ResourceSet(const std::unordered_map<std::string, double> &resource_map) {
for (auto const &resource_pair : resource_map) {
Expand Down Expand Up @@ -169,7 +173,8 @@ void ResourceSet::SubtractResourcesStrict(const ResourceSet &other) {
const std::string &resource_label = resource_pair.first;
const FractionalResourceQuantity &resource_capacity = resource_pair.second;
RAY_CHECK(resource_capacity_.count(resource_label) == 1)
<< "Attempt to acquire unknown resource: " << resource_label;
<< "Attempt to acquire unknown resource: " << resource_label << " capacity "
<< resource_capacity.ToDouble();
resource_capacity_[resource_label] -= resource_capacity;

// Ensure that quantity is positive. Note, we have to have the check before
Expand Down Expand Up @@ -233,8 +238,10 @@ FractionalResourceQuantity ResourceSet::GetResource(

const ResourceSet ResourceSet::GetNumCpus() const {
ResourceSet cpu_resource_set;
cpu_resource_set.resource_capacity_[kCPU_ResourceLabel] =
GetResource(kCPU_ResourceLabel);
const FractionalResourceQuantity cpu_quantity = GetResource(kCPU_ResourceLabel);
if (cpu_quantity > 0) {
cpu_resource_set.resource_capacity_[kCPU_ResourceLabel] = cpu_quantity;
}
return cpu_resource_set;
}

Expand Down