Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix topic error parsing in MetadataResponse #1997

Merged
merged 1 commit into from
Feb 16, 2020

Conversation

jtribble
Copy link

@jtribble jtribble commented Feb 11, 2020

It looks like AdminClient.describe_topics is failing because of how we're parsing the response for topic errors:

# In Java, the error fieldname is inconsistent:
# - CreateTopicsResponse / CreatePartitionsResponse uses topic_errors
# - DeleteTopicsResponse uses topic_error_codes
# So this is a little brittle in that it assumes all responses have
# one of these attributes and that they always unpack into
# (topic, error_code) tuples.
topic_error_tuples = (response.topic_errors if hasattr(response, 'topic_errors')
else response.topic_error_codes)

Here's an example of what happens when I run describe_topics before the change:

$ ./balance-partitions --bootstrap-servers localhost:9092 --topic dummy_topic
=> admin_client = KafkaAdminClient(bootstrap_servers=args.bootstrap_servers)
=> admin_client.describe_topics([args.topic])
Traceback (most recent call last):
  File "./balance-partitions", line 16, in <module>
    main()
  File "~/kafka-tools/lib/commands/balance_partitions.py", line 44, in main
    pprint(admin_client.describe_topics([args.topic]))
  File "~/kafka-tools/vendor/kafka/admin/client.py", line 521, in describe_topics
    metadata = self._get_cluster_metadata(topics=topics, use_controller=True)
  File "~/kafka-tools/vendor/kafka/admin/client.py", line 506, in _get_cluster_metadata
    return self._send_request_to_controller(request)
  File "~/kafka-tools/vendor/kafka/admin/client.py", line 384, in _send_request_to_controller
    else response.topic_error_codes)
AttributeError: 'MetadataResponse_v2' object has no attribute 'topic_error_codes'

And after the change:

$ ./balance-partitions --bootstrap-servers localhost:9092 --topic dummy_topic
=> admin_client = KafkaAdminClient(bootstrap_servers=args.bootstrap_servers)
=> admin_client.describe_topics([args.topic])
[{'error_code': 0,
  'is_internal': False,
  'partitions': [{'error_code': 0,
                  'isr': [10003, 20001, 30001],
                  'leader': 10003,
                  'partition': 1,
                  'replicas': [30001, 20001, 10003]},
                 {'error_code': 0,
                  'isr': [10001, 20001, 30001],
                  'leader': 20001,
                  'partition': 0,
                  'replicas': [10001, 30001, 20001]}],
  'topic': 'dummy_topic'}]

This change is Reviewable

@jtribble jtribble requested review from jeffwidman and dpkp February 11, 2020 22:43
@jeffwidman
Copy link
Collaborator

Thanks for this.

I remember this being tricky because there's both top-level error codes and per-topic error codes. I'll need to dig into the protocol docs before pulling this in to doublecheck some things.

Also, I kinda wonder if we should be checking the type of request/response rather than just looking for the error code... ie, this hack scales a bit but as more request/response pairs come we may need to switch to another model such as a callback where the request/response pair itself maintains the code check of whether or not there's a failure.

But in the meantime, this may be a simple hack for another little bit, I just want to doublecheck first.

@jtribble
Copy link
Author

@jeffwidman I agree, definitely worth a double-check and I like your idea about relying on the response type rather than hasattr 😅 .

If we're not sure about the semantics of the error codes in MetadataResponse.topics, maybe a safer temporary fix would be:

Before:

topic_error_tuples = (response.topic_errors if hasattr(response, 'topic_errors')
        else response.topic_error_codes)

After:

topic_error_tuples = []
if hasattr(response, 'topic_errors'):
    topic_error_tuples.extend(response.topic_errors)
elif hasattr(response, 'topic_error_codes'):
    topic_error_tuples.extend(response.topic_error_codes)

Just depends if you're more interested in a short-term bug fix or would rather dig into this more.

Copy link
Collaborator

@jeffwidman jeffwidman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked into this a bit more and this is good for now.

It's less brittle in that now it allows responses that use other fields for error codes without blowing up, although as a con it no longer blows up so this may not get patched to watch errors on the new fields (IE, not sure this PR would have happened if it was elif rather than else).

This whole thing is a bit hacky, it'd be better if the caller exposed a callback or something that _send_to_controller() could use to determine if a NotControllerError was thrown to refresh. That way we keep the error-parsing logic bundled together with the specific protocol parsing method rather than jumbled around. But that'd be a fairly large refactor so that can wait til someone has the time/inclination. Maybe when that happens this can have unit tests to help verify the request/response parsing.

So merging.

Thank you again!

@jeffwidman jeffwidman merged commit 7195f03 into dpkp:master Feb 16, 2020
@jeffwidman
Copy link
Collaborator

jeffwidman commented Feb 16, 2020

I couldn't figure out why this wasn't reported previously, until I realized that there are two paths for sending metadata request/responses.

Normally we send to the least_loaded_node which has no error checking. But for the recently added describe_topics()(#1993) we have to send to the controller (#1995) so that's when this is hit because it's checking to see if the currently cached controller is correct.

Given that, I'd probably prefer to also check the protocol family / error code pairing as the number of pairs should be relatively small. And then blow up if it's not one of the expected pairs. That way we keep our error surface relatively constrained. Will leave for future enhancement if anyone is interested.

@jeffwidman
Copy link
Collaborator

Actually, thinking about this one more time... I wonder if Metadata response ever even throws the NotControllerError? Since it is valid to be sent to any broker unless we care about describing the topics...

If it doesn't, then send_to_controller should silently let it through w/o error checking since it would never reset the cached controller ID...

Need to also investigate what the expected failure is if describe_topics() is sent to a non-controller, but the client thinks its the controller... is there any indication to the client? Does the client get outdated topic info or is the topic info simply missing?

So this may need a bit more looking at.

@jeffwidman
Copy link
Collaborator

I dug a bit further and realized that Java changed to querying the least-loaded-node rather than the controller for describe_topics(), so this code path isn't even relevant anymore. Implemented in #2000.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants