-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rpc: refuse incoming connections unless dialback succeeds #84289
Comments
This is related to #70111, in that we don't currently attempt to form a full mesh network -- a node will only connect to a different node if some component needs to reach it. I don't know exactly if or how this impacts the proposal here, but it's possible that we may find ourself in a situation where a node connects to a peer but the peer does not attempt to connect back (unless we explicitly make it so). |
100% agree, hence the proposal above to:
|
Do we have ETA for this one? |
As seen in https://github.com/cockroachlabs/support/issues/1875, 22.2 includes two additional protections that may help here, by preventing lease transfers to nodes that aren't able to apply the lease:
These are particularly relevant for the case where a lease is transferred to a node which is able to establish outbound connections (and thus can successfully heartbeat), but can't receive inbound connections (in which case it won't catch up on the Raft log and can't apply the lease). Previously, this would cause the range to become unavailable. However, this doesn't help when leases are already present on the semi-partitioned node, so the proposal here is still relevant (in particular by preventing the node from heartbeating, assuming the liveness leaseholder is also one-way partitioned from the node).
No concrete ETA I'm afraid, but we're very aware of this class of problem. |
A simpler thing than described here which feels worth doing is finding a way to at least observe and identify when one-way partitions do happen. Today we can't tell. The latency tracking code is happy to use just one direction of connection to update its view. |
Technically both sides of the connection maintain the latency metric for "their side" and we can see both in the UI (across the diagonal). We'd lose information if we merged them together. |
In addition to this, we should require that all connection classes can be established. We have seen cases where nodes were able to establish one but not the other connection class due to a bug in the RPC circuit breakers. |
Looking at solutions for this (both the initial and ongoing partitions), there are two general ways it could be done:
The second since it allows for differentiating why the heartbeats are not making it back and has more information such as the status of each connection class. It is not exactly clear what we should do if some connection classes are able to connect backward and others can't but at least we would have the information for logging/debugging purposes. Regarding the gossip of the data and its usage for blocking. I agree it is useful to put the data in gossip, but I'm less clear about exactly how it should be used. If we are blocking certain connections based on "hearsay" then it seems possible that a bad node can cause a larger outage than it should (in the example, what if X is connected to both Y and Z and both report the other through gossip)? I'm not exactly sure about the conditions where this would happen, but it seems possible. Unless anyone feels strongly I wasn't going to use this "proxy" information initially, but it will be available in the logs / gossip dump. |
I think we can start simple with 1 and not send ping responses. We can log an error on the rejecting node, which I think should be sufficient and wouldn't need any protocol changes. As for the gossip aspect, let's defer that. This is related to global failure detection, and there are multiple conversations going on about this, so let's not implement anything until we have a clear picture of how failure detection should work in general (in particular wrt. leases). |
The implementation of this was a little more complicated than I originally thought due to the behavior of the
But this doesn't work because they are deadlocked. I was hoping to examine the state of the connection that is being used since there are really three "important" states
However, there does not appear to be a reliable way to tell after either 1 or 2 is complete with the way our connections are set up today. The approach I am looking at now is the following:
The other alternative is to send an error back immediately on the first PingRequest but that has the unfortunate impact of also failing the first request sent and also not sending another PingResponse for a while. |
Previously the latency to remote nodes was tracked by address rather than the node's id. This could result in a few problems. First, the remote address could be reused across nodes. This could result in incorrect information. Additionally, places that used this information (such as the allocator) needed to unnecessarily map the node id to address just to do a lookup. Finally in preparation for dialback on heartbeat cockroachdb#84289 the use of the OriginAddr field in the PingRequest will change to be the actual address that a node should use to dial back. Currently this field is not set correctly. Epic: none Release note: None
Previously the latency to remote nodes was tracked by address rather than the node's id. This could result in a few problems. First, the remote address could be reused across nodes. This could result in incorrect information. Additionally, places that used this information (such as the allocator) needed to unnecessarily map the node id to address just to do a lookup. Finally in preparation for dialback on heartbeat cockroachdb#84289 the use of the OriginAddr field in the PingRequest is deprecated. That PR will add back another field that can be used to lookup the Locality correct field to use. Epic: none Release note: None
Previously the latency to remote nodes was tracked by address rather than the node's id. This could result in a few problems. First, the remote address could be reused across nodes. This could result in incorrect information. Additionally, places that used this information (such as the allocator) needed to unnecessarily map the node id to address just to do a lookup. Finally in preparation for dialback on heartbeat cockroachdb#84289 the use of the OriginAddr field in the PingRequest is deprecated. That PR will add back another field that can be used to lookup the Locality correct field to use. Epic: none Release note: None
94438: sql,storage: add support for COL_BATCH_RESPONSE scan format r=yuzefovich a=yuzefovich This commit introduces a new `COL_BATCH_RESPONSE` scan format for Scans and ReverseScans which results only in needed columns to be returned from the KV server. In other words, this commit introduces the ability to perform the KV projection pushdown. The main idea of this feature is to use the injected decoding logic from SQL in order to process each KV and keep only the needed parts (i.e. necessary SQL columns). Those needed parts are then propagated back to the KV client as coldata.Batch'es (serialized in the Apache Arrow format). Here is the outline of all components involved: ``` ┌────────────────────────────────────────────────┐ │ SQL │ │________________________________________________│ │ colfetcher.ColBatchDirectScan │ │ │ │ │ ▼ │ │ row.txnKVFetcher │ │ (behind the row.KVBatchFetcher interface) │ └────────────────────────────────────────────────┘ │ ▼ ┌────────────────────────────────────────────────┐ │ KV Client │ └────────────────────────────────────────────────┘ │ ▼ ┌────────────────────────────────────────────────┐ │ KV Server │ │________________________________________________│ │ colfetcher.cFetcherWrapper │ │ (behind the storage.CFetcherWrapper interface) │ │ │ │ │ ▼ │ │ colfetcher.cFetcher │ │ │ │ │ ▼ │ │ storage.mvccScanFetchAdapter ────────┐│ │ (behind the storage.NextKVer interface) ││ │ │ ││ │ ▼ ││ │ storage.pebbleMVCCScanner ││ │ (which put's KVs into storage.singleResults) <┘│ └────────────────────────────────────────────────┘ ``` On the KV client side, `row.txnKVFetcher` issues Scans and ReverseScans with the `COL_BATCH_RESPONSE` format and returns the response (which contains the columnar data) to the `colfetcher.ColBatchDirectScan`. On the KV server side, we create a `storage.CFetcherWrapper` that asks the `colfetcher.cFetcher` for the next `coldata.Batch`. The `cFetcher`, in turn, fetches the next KV, decodes it, and keeps only values for the needed SQL columns, discarding the rest of the KV. The KV is emitted by the `mvccScanFetchAdapter` which - via the `singleResults` struct - exposes access to the current KV that the `pebbleMVCCScanner` is pointing at. Note that there is an additional "implicit synchronization" between components that is not shown on this diagram. In particular, `storage.singleResults.maybeTrimPartialLastRow` must be in sync with the `colfetcher.cFetcher` which is achieved by - the `cFetcher` exposing access to the first key of the last incomplete SQL row via the `FirstKeyOfRowGetter`, - the `singleResults` using that key as the resume key for the response, - and the `cFetcher` removing that last partial SQL row when `NextKV()` returns `partialRow=true`. This "upstream" link (although breaking the layering a bit) allows us to avoid a performance penalty for handling the case with multiple column families. (This case is handled by the `storage.pebbleResults` via tracking offsets into the `pebbleResults.repr`.) This code structure deserves some elaboration. First, there is a mismatch between the "push" mode in which the `pebbleMVCCScanner` operates and the "pull" mode that the `NextKVer` exposes. The adaption between two different modes is achieved via the `mvccScanFetcherAdapter` grabbing (when the control returns to it) the current unstable KV pair from the `singleResults` struct which serves as a one KV pair buffer that the `pebbleMVCCScanner` `put`s into. Second, in order be able to use the unstable KV pair without performing a copy, the `pebbleMVCCScanner` stops at the current KV pair and returns the control flow (which is exactly what `pebbleMVCCScanner.getOne` does) back to the `mvccScanFetcherAdapter`, with the adapter advancing the scanner only when the next KV pair is needed. There are multiple scenarios which are currently not supported: - SQL cannot issue Get requests (likely will support in 23.1) - `TraceKV` option is not supported (likely will support in 23.1) - user-defined types other than enums are not supported (will _not_ support in 23.1) - non-default key locking strength as well as SKIP LOCKED wait policy are not supported (will _not_ support in 23.1). The usage of this feature is currently disabled by default, but I intend to enable it by default for multi-tenant setups. The rationale is that currently there is a large performance hit when enabling it for single-tenant deployments whereas it offers significant speed up in the multi-tenant world. The microbenchmarks [show](https://gist.github.com/yuzefovich/669c295a8a4fdffa6490532284c5a719) the expected improvement in multi-tenant setups when the tenant runs in a separate process whenever we don't need to decode all of the columns from the table. The TPCH numbers, though, don't show the expected speedup: ``` Q1: before: 11.47s after: 8.84s -22.89% Q2: before: 0.41s after: 0.29s -27.71% Q3: before: 7.89s after: 9.68s 22.63% Q4: before: 4.48s after: 4.52s 0.86% Q5: before: 10.39s after: 10.35s -0.29% Q6: before: 33.57s after: 33.41s -0.48% Q7: before: 23.82s after: 23.81s -0.02% Q8: before: 3.78s after: 3.76s -0.68% Q9: before: 28.15s after: 28.03s -0.42% Q10: before: 5.00s after: 4.98s -0.42% Q11: before: 2.44s after: 2.44s 0.22% Q12: before: 34.78s after: 34.65s -0.37% Q13: before: 3.20s after: 2.94s -8.28% Q14: before: 3.13s after: 3.21s 2.43% Q15: before: 16.80s after: 16.73s -0.38% Q16: before: 1.60s after: 1.65s 2.96% Q17: before: 0.85s after: 0.96s 13.04% Q18: before: 16.39s after: 15.47s -5.61% Q19: before: 13.76s after: 13.01s -5.45% Q20: before: 55.33s after: 55.12s -0.38% Q21: before: 24.31s after: 24.31s -0.00% Q22: before: 1.28s after: 1.41s 10.26% ``` At the moment, `coldata.Batch` that is included into the response is always serialized into the Arrow format, but I intend to introduce the local fastpath to avoid that serialization. That work will be done in a follow-up and should be able to reduce the perf hit for single-tenant deployments. A quick note on the TODOs sprinkled in this commit: - `TODO(yuzefovich)` means that this will be left for 23.2 or later. - `TODO(yuzefovich, 23.1)` means that it should be addressed in 23.1. A quick note on testing: this commit randomizes the fact whether the new infrastructure is used in almost all test builds. Introducing some unit testing (say, in `storage` package) seems rather annoying since we must create keys that are valid SQL keys (i.e. have TableID / Index ID prefix) and need to come with the corresponding `fetchpb.IndexFetchSpec`. Not having unit tests in the `storage` seems ok to me given that the "meat" of the work there is still done by the `pebbleMVCCScanner` which is exercised using the regular Scans. End-to-end testing is well covered by all of our existing tests which now runs randomly. I did run the CI multiple times with the new feature enabled by default with no failure, so I hope that it shouldn't become flaky. Addresses: #82323. Informs: #87610. Epic: CRDB-14837 Release note: None 95701: gossip: Track latency by nodeID rather than addr r=kvoli,erikgrinaker a=andrewbaptist Previously the latency to remote nodes was tracked by address rather than the node's id. This could result in a few problems. First, the remote address could be reused across nodes. This could result in incorrect information. Additionally, places that used this information (such as the allocator) needed to unnecessarily map the node id to address just to do a lookup. Finally in preparation for dialback on heartbeat #84289 the use of the OriginAddr field in the PingRequest will change to be the actual address that a node should use to dial back. Currently this field is not set correctly. Epic: none Release note: None 95796: ui: add CPU Time chart do statement details r=maryliag a=maryliag This commit adds a new chart for CPU time on Statement Details page. Part Of #87213 <img width="1508" alt="Screen Shot 2023-01-24 at 6 01 07 PM" src="https://user-images.githubusercontent.com/1017486/214440274-c48d3bb6-ecbe-47a2-861a-0a8407d219c4.png"> Release note (ui change): Add CPU Time chart to Statement Details page. 95832: cdc: remove 'nonsensitive' tag from changefeed description in telemetry logs r=jayshrivastava a=jayshrivastava Previously, the description field in changefeed telemetry logs was marked as `nonsensitive`. This is incorrect because the description field may contain an SQL statement which is not safe to report. This change removes the `nonsensitive` tag so the field is redacted by default. Fixes: #95823 Epic: none Release note: none 95838: logictest: remove smallEngineBlocks randomization r=yuzefovich a=yuzefovich This metamorphic randomization has caused some flakiness (due to a subset of tests taking very long time) so is now removed. This feature should be tested in a more targeted fashion. Fixes: #95799. Fixes: #95829 Release note: None 95840: opt: replace make with dev in test instructions r=mgartner a=mgartner Epic: None Release note: None 95842: roachtest: fix parameters passed to require.NoError r=yuzefovich,srosenberg,herkolategan a=renatolabs When context is passed to an assertion, the parameters *must* be a string format, followed by arguments (as you would in a call to `fmt.Sprintf`). The previous code would panic trying to cast int to string. Informs #95416 Release note: None Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Andrew Baptist <baptist@cockroachlabs.com> Co-authored-by: maryliag <marylia@cockroachlabs.com> Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com> Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com> Co-authored-by: Renato Costa <renato@cockroachlabs.com>
Hi @andrewbaptist, please add branch-* labels to identify which branch(es) this release-blocker affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Fixes: cockroachdb#84289 Previously one-way partitions where a node could initiate a successful TCP connection in one direction, but the reverse connection fails which causes problems. The node that initiates outgoing connections can acquire leases and cause failures for reads and writes to those ranges. This is particularly a problem if it acquires the liveness range leases, but is a problem even for other ranges. This commit adds an additional check during server-to-server communication where the recipient of a new PingRequest first validates that it is able to open a reverse connection to the initiator before responding. Additionally, it will monitor whether it has a successful reverse connection over time and asynchronously validate reverse connections to the sender. The ongoing validation is asynchronous to avoid adding delays to PingResponses as they are used for measuring clock offsets. Also the onlyOnceDialer prevents retrying after a dial error, however this can get into a state where it continually retries for certain network connections. This is not easy to reproduce in a unit tests as it requires killing the connection using iptables (normal closes don't cauuse this). After this change the onlyOnceDialer will no longer repeatedly retry to reconnect after a broken connection during setup. Release note (bug fix): RPC connections between nodes now require RPC connections to be established in both directions, otherwise the connection will be closed. This is done to prevent asymmetric network partitions where nodes are able to send outbound messages but not receive inbound messages, which could result in persistent unavailability. This behavior can be disabled by the cluster setting rpc.dialback.enabled. Epic: CRDB-2488
Fixes: cockroachdb#99104 Informs cockroachdb#84289 As part of the previous fix for partition handling, we tracked the state of a previous attempt and use that result on the next attempt. However if there were multiple connections, we may only block system traffic connections and not default class connections. This change addresses that by ensuring a failed dialback attempt is remembered until we are able to successfully connect back to the pinging node. Epic: none Release note: None
Fixes: cockroachdb#99104 Informs cockroachdb#84289 As part of the previous fix for partition handling, we tracked the state of a previous attempt and use that result on the next attempt. However if there were multiple connections, we may only block system traffic connections and not default class connections. This change addresses that by ensuring a failed dialback attempt is remembered until we are able to successfully connect back to the pinging node. Epic: none Release note: None
99429: ccl/sqlproxyccl: add PROXY protocol support via CLI flag to sqlproxy r=pjtatlow a=jaylim-crl This commits adds a new `require-proxy-protocol` flag to `mt start-proxy`, and that changes the sqlproxy's behavior to support the PROXY protocol. When the flag is set, the protocol will be enforced on the SQL listener, and supported on a best-effort basis on the HTTP listener. If the PROXY protocol isn't used, but is enforced, the connection will be rejected. The rationale behind doing best-effort basis on the HTTP listener is that some healthcheck systems don't support the protocol. This work is needed for the AWS PrivateLink work in CockroachCloud, which requires the use of the PROXY protocol. Release note: None Epic: none Release justification: SQL Proxy change only. Changes are needed for the AWS PrivateLink work in CockroachCloud. 99840: rpc: Fix blackhole recv r=erikgrinaker a=andrewbaptist Fixes: #99104 Informs #84289 As part of the previous fix for partition handling, we tracked the state of a previous attempt and use that result on the next attempt. However if there were multiple connections, we may only block system traffic connections and not default class connections. This change addresses that by ensuring a failed dialback attempt is remembered until we are able to successfully connect back to the pinging node. Epic: none Release note: None Co-authored-by: Jay <jay@cockroachlabs.com> Co-authored-by: Andrew Baptist <baptist@cockroachlabs.com>
Fixes: #99104 Informs #84289 As part of the previous fix for partition handling, we tracked the state of a previous attempt and use that result on the next attempt. However if there were multiple connections, we may only block system traffic connections and not default class connections. This change addresses that by ensuring a failed dialback attempt is remembered until we are able to successfully connect back to the pinging node. Epic: none Release note: None
Previously the latency to remote nodes was tracked by address rather than the node's id. This could result in a few problems. First, the remote address could be reused across nodes. This could result in incorrect information. Additionally, places that used this information (such as the allocator) needed to unnecessarily map the node id to address just to do a lookup. Finally in preparation for dialback on heartbeat cockroachdb#84289 the use of the OriginAddr field in the PingRequest is deprecated. That PR will add back another field that can be used to lookup the Locality correct field to use. Epic: none Release note: None
Is your feature request related to a problem? Please describe.
Followup to #49220 / cockroachlabs/support#1690
We know of multiple situations where a cluster can find itself into an asymmetric partition, whtiich causes all kinds of symptoms (#49220), these include at least:
It would be good if we had some automation to exclude nodes which appear to be partially partitioned away. (and require operator attention)
Describe the solution you'd like
We could have a couple relatively simple mechanisms to protect a cluster:
Two point-to-point mechanisms, to protect against pairwise partions:
When receiving an incoming connection from another node, refuse to respond to heartbeats until we get a successful outgoing dial to the same node. In other words: when n2 connects to n1, have n1 refuse the conn until it can dial back to n2 successfully.
When an outgoing dial fails, or a heartbeat fails, keep a timestamp of the failure for the remote node ID, and when receiving a heartbeat from that ID, refuse to respond to the heartbeat if there's a recent failure (and possibly actively close the connection). In other words, start refusing a heartbeat from n2 to n1, if n1 has failed to connect/heartbeat to n2 recently.
Then a cluster-wisde mechanism, to protect against global partitions (e.g. n1-n2 and n2-n3 can connect, but not n1-n3)
Jira issue: CRDB-17572
gz#13169
Epic CRDB-2488
The text was updated successfully, but these errors were encountered: