Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

undefined method `[]' for #<Consul::Async::ConsulTemplateNodes:0x000055744a631640> #74

Closed
jeromegn opened this issue Aug 16, 2021 · 10 comments

Comments

@jeromegn
Copy link
Contributor

We're seeing sporadic

undefined method `[]' for #<Consul::Async::ConsulTemplateNodes:0x000055744a631640>

in our logs.

This is coming from a node(node_id)["Services"] call.

We're currently on consul-templaterb version 1.29.0, running consul 1.9.1.

Since this works 99% of the time, I'm not sure where the issue is coming from. Perhaps it hasn't resolved to the consul response yet?

@jeromegn
Copy link
Contributor Author

Looking at more logs, I'm also seeing a few:

undefined method `map' for #<Consul::Async::ConsulTemplateNodes:0x000055996a095ed0>

I assume it's the same issue.

@pierresouchay
Copy link
Contributor

pierresouchay commented Aug 16, 2021

Hello @jeromegn,

Well, there is a reason (which sucks) for this behavior.

If you list nodes using nodes() and then try to call node(my_node_name), there might be a race condition in such code, because it is possible that the call to nodes() would return node1, node2, node3, but just after the value is returned, node3 is deregistered in the Consul cluster. In such case, the value for node(node3) returned by Consul API would be null (not even {}).

In such case, the value read by consul-templaterb would be nil, which does not contains [], hence, the error you have.

This can be reproduced trivially with this ERB code:

<%
   node('a_nodes_that_does_not_exists')['Services']
%>

This sucks a bit, I agree, in theory, we should have an helper to check this.

How to avoid this?

There is no way the root cause could be prevented easily because the operations inside the clusters would be hard to get in a "transactional way".

However, it is possible to do the following:

<%=
  res = node('a_nodes_that_does_not_exists')
  unless res.result.is_a?(Hash)
    {} # alternative, return nil if you want to decide between node without service and node that does not exists
  else
    res['Services']
  end
%>

if you want to use this in many places, you can also have a function to do that:

<%
def get_services_for_node(node_name)
  res = node(node_name)
  return res['Services'] if res.result.is_a?(Hash)
  {} # services is actually an Hash
end
%>
<%=
  get_services_for_node("a_node_that_does_not_exists")
%>

Other Considerations

To be fair, we did not used that much this function on our clusters as it gives poor performance with many nodes since you have to interact with every node (and this creates lots of calls).

I also found that actually, there is a little bug on init time where node() -> is initialized with [] by default instead of {}, I'll try to fix this soon

Regards

pierresouchay pushed a commit to pierresouchay/consul-templaterb that referenced this issue Aug 16, 2021
Provide new helpers for node() result

Also fix default valud to be an Hash, not an array
@pierresouchay
Copy link
Contributor

@jeromegn Fixed by #75

@jeromegn
Copy link
Contributor Author

Our nodes are never deregistered really, they never change. Unless a short restart might cause this? We very rarely restart them, but I'm seeing this across the board from time to time.

Could it also be caused by a transient network issue?

@pierresouchay
Copy link
Contributor

pierresouchay commented Aug 17, 2021

Yes, restart of a node might explain it maybe, but not a network transient issue (Consul is supposed to be protected against that)

@jeromegn
Copy link
Contributor Author

To be fair, we did not used that much this function on our clusters as it gives poor performance with many nodes since you have to interact with every node (and this creates lots of calls).

Is there a better way to get all services (with all their data)? Using /catalog/services only gives service names and their tags. We have tens of thousands of different services. Querying per node seemed more efficient since we don't add as many nodes as we add services.

@jeromegn
Copy link
Contributor Author

Perhaps we should've named all our services with the same name (or the same few names) and used the service(name) method?

@pierresouchay
Copy link
Contributor

We I used to work at Criteo, that's what we did:

we had 4000+ services, with around 260k+ instances, we did list all services to have an aggregated view of all services (this is exactly what is done in https://github.com/criteo/consul-templaterb/blob/master/samples/consul-ui/consul_services.json.erb). The big bonus is that you also have the status of all the services at the same time

kamaradclimber added a commit that referenced this issue Aug 23, 2021
Added new utility methods to fix #74
@pierresouchay
Copy link
Contributor

Thank you @kamaradclimber for merging this!

Do you think you might be able to do a release as well?

@jeromegn
Copy link
Contributor Author

FWIW: we're going to be slowly transitioning to non-unique service names and this should help a lot with Consul issues we've been having (including this one).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants