
fix(consumer): duplicate ID2 breaking DB-mode #1007

Closed
mflendrich wants to merge 6 commits into next from fix/duplicate-creds

Conversation

mflendrich (Contributor) commented Jan 19, 2021

fixes #729

Before this PR, KIC failed to apply config to DB-backed Kong when the config
contained a Consumer with two credentials having equivalent ID2 fields
(Key for key-auth, ClientID for OAuth2, etc.).

This commit de-duplicates credentials of each type by replacing the
Consumer's credential store with a map (ID2 -> credential object),
and updates the relevant tests.

Also, this commit implements consistent credential ordering.
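
For illustration, here is a minimal Go sketch of the de-duplication idea; the `Credential` type, its fields, and the `dedupeByID2` helper are made up for this example and are not the actual kongstate types:

```go
package main

import "fmt"

// Credential is a stand-in for a single credential of one type (key-auth, oauth2, ...).
type Credential struct {
	ID2  string // the uniqueness-carrying field, e.g. Key or ClientID
	Data string // remaining credential fields, collapsed for brevity
}

// dedupeByID2 keeps at most one credential per ID2 value; a later duplicate
// overwrites the earlier entry instead of producing two conflicting objects.
func dedupeByID2(creds []Credential) map[string]Credential {
	out := make(map[string]Credential, len(creds))
	for _, c := range creds {
		if _, seen := out[c.ID2]; seen {
			fmt.Printf("warning: duplicate credential with ID2=%q, keeping the last one\n", c.ID2)
		}
		out[c.ID2] = c
	}
	return out
}

func main() {
	creds := []Credential{
		{ID2: "secret", Data: "RSApubkeyA"},
		{ID2: "secret", Data: "RSApubkeyB"}, // duplicate ID2: only one survives
	}
	fmt.Println(dedupeByID2(creds))
}
```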


refactor(controller): implement sortByString

This REVERSES the sorting order from reverse to normal.

The history behind reverse order is purely accidental. There are no
known reasons in favor of one or the other. What matters is the order
being deterministic.
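
For reference, a minimal sketch of the ascending comparison; the actual sortByString helper in internal/ingress/controller/kong.go may be shaped differently:

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	keys := []string{"oauth2", "jwt", "key-auth"}

	// Ascending ("normal") order; the previous, accidental behavior was the
	// reverse comparison keys[i] > keys[j].
	sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })

	fmt.Println(keys) // [jwt key-auth oauth2]
}
```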


test(controller): toDeckContent sorts creds


test(kongstate): deduplicates credentials

mflendrich changed the base branch from main to next on January 19, 2021 14:21
mflendrich force-pushed the fix/duplicate-creds branch 2 times, most recently from c2b8086 to 3169a69 on January 20, 2021 13:16
mflendrich marked this pull request as ready for review on January 20, 2021 13:31
mflendrich requested a review from shaneutt on January 20, 2021 13:31
rainest (Contributor) commented Jan 20, 2021

How does this handle sort order for credentials that have the same ID2? It's effectively random, correct?

I think this introduces a footgun that allows you to inadvertently replace in-use configuration, e.g.

  1. UserA creates a JWT credential with key secret and RSApubkeyA. Their application starts using this credential.
  2. UserB, having not consulted with UserA, creates another JWT credential with key secret and RSApubkeyB.
  3. The controller arbitrarily chooses UserB's credential after sorting, rejects UserA's credential as a duplicate, and updates state to include UserB's credential only.
  4. UserA's application starts receiving 401s.

This is somewhat mitigated because this only handles credentials for a single consumer, not credentials on separate consumers (we'll still attempt to apply those and hit the same admin API error). We'd need to further tiebreak by credential creation time to avoid that.

shaneutt (Contributor) left a comment

Generally LGTM, just a couple of small comments 👍

internal/ingress/controller/kong.go (resolved)
internal/ingress/controller/kong.go (resolved)
internal/ingress/controller/kong.go (resolved)
internal/ingress/controller/kong_test.go (resolved)
internal/ingress/controller/kong_test.go (outdated, resolved)
mflendrich requested a review from shaneutt on January 20, 2021 21:19
shaneutt previously approved these changes Jan 20, 2021

shaneutt (Contributor) left a comment

Good explanations in the code review, thank you.

LGTM ✔️

mflendrich (Contributor, Author)

> How does this handle sort order for credentials that have the same ID2? It's effectively random, correct?

The order is directly as returned from informer stores. Therefore, we need to treat it as effectively not guaranteed to be anything meaningful.

> I think this introduces a footgun that allows you to inadvertently replace in-use configuration, e.g.
>
> 1. UserA creates a JWT credential with key `secret` and RSApubkeyA. Their application starts using this credential.
> 2. UserB, having not consulted with UserA, creates another JWT credential with key `secret` and RSApubkeyB.
> 3. The controller arbitrarily chooses UserB's credential after sorting, rejects UserA's credential as a duplicate, and updates state to include UserB's credential only.
> 4. UserA's application starts receiving 401s.

That is correct, and this is part of a bigger problem: given two conflicting entities that made it through our admission validations, KIC needs to distill some valid config from them and apply it to Kong. Otherwise the only option is for KIC's sync loop to lock up, which is terrible user experience.

The problem is that a fair "tie breaker" for conflicting configs is an unsolved problem. The best we can do today is to drop an arbitrary subset so that the remaining config is no longer nonsensical.

> This is somewhat mitigated because this only handles credentials for a single consumer, not credentials on separate consumers (we'll still attempt to apply those and hit the same admin API error). We'd need to further tiebreak by credential creation time to avoid that.

I would like to avoid using creation time because it introduces an external factor to KIC's semantics. That external factor is likely to change unexpectedly if the user, for example, restores their k8s objects from a backup or from a gitops repository. Therefore I don't see a clear benefit of using creation time over picking an arbitrary resource. I'm all in for deterministic and well-defined conflict resolution, but I would rather see something that relies on configuration the user controls (e.g., in case of conflicting resources, drop the one that has the lexicographically greater namespace+name).
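
To make that suggestion concrete, here is a purely illustrative sketch of such a tiebreak; the `ref` type and `pickWinner` function are invented for this example and exist nowhere in KIC:

```go
package main

import "fmt"

// ref identifies a Kubernetes resource by namespace and name.
type ref struct{ Namespace, Name string }

func (r ref) key() string { return r.Namespace + "/" + r.Name }

// pickWinner keeps the resource whose namespace/name sorts lexicographically
// lower and drops the other one.
func pickWinner(a, b ref) (keep, drop ref) {
	if a.key() <= b.key() {
		return a, b
	}
	return b, a
}

func main() {
	keep, drop := pickWinner(
		ref{Namespace: "team-b", Name: "jwt-cred"},
		ref{Namespace: "team-a", Name: "jwt-cred"},
	)
	fmt.Printf("keeping %s, dropping %s\n", keep.key(), drop.key())
}
```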

I recommend leaving this as-is and creating an issue to be solved later (deterministic conflict resolution based on user-controlled factors). WDYT?

mflendrich and others added 6 commits January 21, 2021 12:42
Co-authored-by: Shane Utt <shaneutt@users.noreply.github.com>
codecov bot commented Jan 21, 2021

Codecov Report

Merging #1007 (70dbb1a) into next (daa9afa) will increase coverage by 1.13%.
The diff coverage is 64.70%.


@@            Coverage Diff             @@
##             next    #1007      +/-   ##
==========================================
+ Coverage   49.56%   50.70%   +1.13%     
==========================================
  Files          32       32              
  Lines        3198     3205       +7     
==========================================
+ Hits         1585     1625      +40     
+ Misses       1483     1440      -43     
- Partials      130      140      +10     
Impacted Files Coverage Δ
...l/ingress/controller/parser/kongstate/kongstate.go 20.49% <0.00%> (ø)
internal/ingress/controller/kong.go 16.25% <56.52%> (+9.58%) ⬆️
...al/ingress/controller/parser/kongstate/consumer.go 73.25% <74.07%> (-1.75%) ⬇️
internal/ingress/controller/parser/parser.go 85.26% <0.00%> (+1.33%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update daa9afa...70dbb1a.

mflendrich (Contributor, Author)

Solution blocked by a discrepancy between the DB-backed and DB-less modes of Kong (as described in Kong/kong#6732). The semantics of conflict resolution are to be defined once we reach agreement with the Kong/kong maintainers on the desired handling of duplicate entries.

rainest (Contributor) commented Jan 28, 2021

Re discussion elsewhere on whether to use creation timestamp or something else, I'd like to step back and reject the underlying premise:

> This is part of a bigger problem: given two conflicting entities that made it through our admission validations, KIC needs to distill some valid config from them and apply it to Kong.

This is hamstrung by a lack of any valid configuration in this scenario. If you've created two credentials that collide, we can only apply one, and which we choose to apply affects the proxy behavior. I originally suggested creation timestamp because it seemed least likely to result in a change in proxy behavior, since the older resource would have a greater chance of being applied first, but it's not a great solution either.

Any tiebreaker we choose is effectively arbitrary, but user-controlled tiebreakers pose an odd question: if you know that two resources collide, why would you set the tiebreaker such that one resource is used rather than either deleting one of the resources or choosing non-colliding values? If you don't know, you could try at random until your configuration wins, but I don't think it's something we'd want to encourage.

> Otherwise the only option is for KIC's sync loop to lock up, which is terrible user experience.

Agreed, but it at least fails relatively safely, since your proxy will continue using existing configuration. Assuming you're not aware of the conflict (again, if you are, you should resolve it by other means), using a tiebreaker results in one of the following:

  • New configuration loses the tiebreaker. The controller doesn't actually apply your configuration, and reports it applied configuration successfully. You don't know why your configuration isn't working.
  • New configuration wins the tiebreaker. The controller reports success, and as far as you're concerned, everything is working great. However, someone else's configuration has spontaneously stopped working, and they probably aren't happy. They don't know why this happened, because the controller just reports a successful sync, and their configuration wasn't deleted.

These are both bad user experience, and potentially more disruptive than a blocked sync, since the second case obliterates in-use configuration.

The real issue with the sync loop blocking is that we cannot discard configuration piecemeal: it'd be a much smaller problem if it did not lock up entirely and prevent sync of other (valid) configuration added after. This is hard to solve, but I don't think attempting to mask collisions is a good way to handle it. Absent some way to report (and ideally reject) the problem configuration, I think the only safe option is to let the status quo (sync block) stand.

mflendrich (Contributor, Author)

> Any tiebreaker we choose is effectively arbitrary, but user-controlled tiebreakers pose an odd question: if you know that two resources collide, why would you set the tiebreaker such that one resource is used rather than either deleting one of the resources or choosing non-colliding values? If you don't know, you could try at random until your configuration wins, but I don't think it's something we'd want to encourage.

I fully agree that we don't want to encourage feeding KIC invalid configurations.

That said, as a matter of fact, users will keep feeding invalid configurations to KIC. What can we do about it? I see three layers of defense here:

  1. prevent invalid configurations from being fed to KIC altogether
  2. given invalid configuration, warn the user visibly
  3. given invalid configuration, fail gracefully

(1) is achieved by CRD schemas and webhook validations. Neither of these is capable of validating uniqueness constraints across all resources in the cluster, other than on resource name. There are measures we can theoretically take to prevent some of these cases, but conflicts will still fall through the cracks. That's outside the scope of this issue, I think, but we can agree here to implement such a solution separately to improve UX.

(2) KIC does this today by printing warning log entries. A better way would be to report these in the status fields of the affected resources. A perfect way would also be to give KIC a Kubernetes resource for the generated configuration (likely happening as part of #702 - @shaneutt) and put the overall success/failure/errors/warnings in its status.
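
For illustration only, a status-field report could look something like the sketch below; the condition type and reason names are made up, and nothing in KIC sets such conditions today:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Conditions that would live under the affected resource's .status.
	var conditions []metav1.Condition

	// Record that this resource was skipped because it conflicts with another one.
	conditions = append(conditions, metav1.Condition{
		Type:               "Programmed",            // hypothetical condition type
		Status:             metav1.ConditionFalse,
		Reason:             "ConflictingCredential", // hypothetical reason
		Message:            "credential skipped: ID2 collides with another credential on this consumer",
		LastTransitionTime: metav1.Now(),
	})

	fmt.Printf("%+v\n", conditions)
}
```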

(3) is the crux here.

It's quite clear that if the KIC config contains 2 or more conflicting entities, there's no theoretical possibility of applying a working and correct config for that subset of the configuration, because there's no "correct" way to do it. This is obviously an unrecoverable failure scenario.

In an unrecoverable failure scenario one thing that we have under control is the scope of that failure. And:

  • if KIC stops syncing config altogether, then that's equivalent to the entire KIC failing. The user loses the ability to configure any other workloads and consumers, possibly causing an outage elsewhere. Imagine a multi-tenant k8s cluster with one ingress controller where:
    • tenant A accidentally creates a route on a critical workload that causes a disaster (e.g. directs all traffic to a backend service that should be cached instead)
    • tenant B creates a consumer conflict in their toy app
    • tenant A reverts the route change, but the revert has no effect because the sync loop is blocked, and they see their application melt down further
  • if KIC contains the failure within the conflicting resources, operations for all other resources continue without degradation

So, the question here is whether, in case of failure, we choose to degrade only to the extent of the conflicted resources, or to degrade everything.

Here are the concerns about the proposed solution I've understood from #1007 (comment) (please correct me if I missed something):

  1. KIC syncing new configuration after a conflict was introduced can cause the user to overlook the conflict
  2. adding new conflicting config may obliterate the old conflicting config
  3. we cannot discard configuration piecemeal

On (1): I believe that we can solve this by providing better feedback about configuration errors, using status fields on (existing and future) resources. As soon as we have a resource for the entire rendered configuration, its status would be the most natural place to post such errors, as well as in the status of the conflicted resources themselves.

On (2): the premise that old config is more meaningful than new config, and should always win, is incompatible with the Kubernetes configuration model, which is based on eventual consistency and has no temporal relationships between entities.
Furthermore, I view preserving old configuration by locking up the entire sync loop as throwing the baby out with the bathwater. Alas, I don't see any better way to preserve old configuration.

On (3): why not? This is a serious question. Maybe I'm missing some hidden complexity. My thinking is: "walk through all the resources in a deterministic, well-known order. If the currently processed resource conflicts with the already-seen state, warn loudly and assume the fiction that it didn't exist in the config".
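
A sketch of that thinking, purely illustrative; the `resource` type and `buildConfig` function are invented here and nothing in this PR implements them:

```go
package main

import (
	"fmt"
	"sort"
)

// resource is a stand-in for any KIC-managed object carrying a uniqueness
// constraint (here, the ID2 field of a credential).
type resource struct {
	Namespace, Name string
	ID2             string
}

func (r resource) key() string { return r.Namespace + "/" + r.Name }

// buildConfig walks resources in a deterministic order and skips any resource
// that conflicts with the already-accepted state, warning loudly about it.
func buildConfig(resources []resource) []resource {
	// Deterministic, well-known order: lexicographic by namespace/name.
	sort.Slice(resources, func(i, j int) bool { return resources[i].key() < resources[j].key() })

	seen := map[string]resource{} // ID2 -> first accepted resource
	var accepted []resource
	for _, r := range resources {
		if prev, ok := seen[r.ID2]; ok {
			// Conflict: warn and act as if this resource didn't exist in the config.
			fmt.Printf("warning: skipping %s, ID2 %q already used by %s\n", r.key(), r.ID2, prev.key())
			continue
		}
		seen[r.ID2] = r
		accepted = append(accepted, r)
	}
	return accepted
}

func main() {
	out := buildConfig([]resource{
		{Namespace: "team-b", Name: "jwt-b", ID2: "secret"}, // skipped: sorts after team-a/jwt-a
		{Namespace: "team-a", Name: "jwt-a", ID2: "secret"}, // wins: sorts first
	})
	fmt.Println(out)
}
```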

Therefore I propose that:

  • we do improve user feedback by utilizing status fields to expose conflicts,
  • we do minimize the configuration impact of a configuration error by leaving everything but the conflicted resources unaffected,
  • we drop the idea of breaking ties by entity age (unless we find a way that doesn't use timestamps and doesn't lock up too many resources)

WDYT @rainest?

mflendrich (Contributor, Author)

Removed the "blocked" label because the blocking premise (need to rely on Kong/kong maintainers' input about the discrepancy between db and dbless) is obsolete.

mflendrich (Contributor, Author)

Closing this PR: the migration of KIC to Railgun will call for a different solution.

mflendrich closed this on Feb 11, 2021