
multi: pong enforcement #7828

Merged — 8 commits, Oct 19, 2023

Conversation

@ProofOfKeags (Collaborator) commented Jul 12, 2023

Change Description

This change introduces pong enforcement: see #3003

  • Introduce subsystem for low latency querying of global blockchain state
  • Remove blockchain subscription from pingHandler
  • Record most recent numPongBytes on Brontide connection
  • Add verification logic to readHandler
  • Launch timeout function during queuing to race the verification logic in the readHandler
  • Handle API misuse of CurrentChainStateTracker
  • Take some action on pong failure (disconnect?)
  • Add tests to verify correctness of this code
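The verification step in the list above can be sketched in isolation. This is a hypothetical illustration, not lnd's actual readHandler code — `recordPing`, `verifyPong`, and the single-outstanding-ping assumption are mine: the peer records the `num_pong_bytes` it asked for in its last ping, then checks the next pong against it.

```go
package main

import (
	"errors"
	"fmt"
)

// expectedPongBytes records the num_pong_bytes from the most recent
// ping; -1 means no ping is outstanding. (Sketch only; lnd keeps this
// state on the Brontide connection.)
var expectedPongBytes = -1

// recordPing notes the pong size we expect the peer to echo back.
func recordPing(numPongBytes uint16) {
	expectedPongBytes = int(numPongBytes)
}

// verifyPong checks a received pong payload against the recorded
// expectation, in the spirit of the readHandler verification logic.
func verifyPong(pongBytes []byte) error {
	if expectedPongBytes < 0 {
		return errors.New("unsolicited pong")
	}
	if len(pongBytes) != expectedPongBytes {
		return fmt.Errorf("pong size mismatch: got %d, want %d",
			len(pongBytes), expectedPongBytes)
	}
	// Clear the outstanding expectation once satisfied.
	expectedPongBytes = -1

	return nil
}

func main() {
	recordPing(4)
	fmt.Println(verifyPong(make([]byte, 4)))
	fmt.Println(verifyPong(make([]byte, 4)))
}
```

On a mismatch or timeout, the action taken on failure (per the list above) is disconnecting the peer.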

Steps to Test

make unit

Pull Request Checklist

Testing

  • Your PR passes all CI checks.
  • Tests covering the positive and negative (error paths) are included.
    - [ ] Bug fixes contain tests triggering the bug to prevent regressions.

Code Style and Documentation

📝 Please see our Contribution Guidelines for further guidance.



@ProofOfKeags ProofOfKeags force-pushed the pong-enforcement branch 2 times, most recently from 2afa841 to 15265ca Compare July 12, 2023 23:29
@ProofOfKeags ProofOfKeags linked an issue Jul 14, 2023 that may be closed by this pull request
1 task
@ProofOfKeags ProofOfKeags added this to the v0.17.1 milestone Jul 14, 2023
@ProofOfKeags ProofOfKeags added the p2p Code related to the peer-to-peer behaviour label Jul 14, 2023
@ProofOfKeags ProofOfKeags self-assigned this Jul 14, 2023
@ProofOfKeags ProofOfKeags requested review from Roasbeef and saubyk July 14, 2023 17:38
@ProofOfKeags ProofOfKeags force-pushed the pong-enforcement branch 2 times, most recently from 5383103 to b0806fc Compare July 14, 2023 18:08
@ProofOfKeags ProofOfKeags changed the title Pong enforcement multi: pong enforcement Jul 14, 2023
@ProofOfKeags ProofOfKeags force-pushed the pong-enforcement branch 2 times, most recently from f0fef56 to 34fea86 Compare July 14, 2023 20:37
@ProofOfKeags ProofOfKeags marked this pull request as ready for review July 17, 2023 18:47
@ProofOfKeags ProofOfKeags force-pushed the pong-enforcement branch 6 times, most recently from 5570eba to 9a85614 Compare July 18, 2023 23:40
@Roasbeef (Member) left a comment:

Nice work! I dig the attention to test coverage, and overall the testability of the new sub-systems and related goroutines. Completed an initial pass with a few comments on: style, simplifying the deps of the new sub-system, and re-using some standard library functions throughout the diff.

chainntnfs/current_chain_view.go:
return err
}
t.changeStream = changeStream
t.quit = make(chan struct{})
Member:

Can these variables just be created by the NewChainStateTracker constructor?

Collaborator (author):

This allows the tracker to be started, stopped, then started again. If we place the channel construction in the constructor itself then it would complicate the logic of the Start() function or we would introduce the limitation that it cannot be started again once it is stopped. Perhaps this is OK.

Member:

If we place the channel construction in the constructor itself then it would complicate the logic of the Start() function or we would introduce the limitation that it cannot be started again once it is stopped. Perhaps this is OK.

Gotcha, yeah usually most of these are only ever started once, as their lifetime tracks the lifetime of the peer connection. So I think it'd be safe to move all the initialization into the main constructor.

Collaborator:

yes agreed - should be moved to main constructor 👍

@ProofOfKeags (Collaborator, author) Aug 31, 2023:

Should we remove the Start function then? I don't think it makes sense to do some initialization in the constructor and some initialization in a Start method

Collaborator:

The constructor is responsible for building the object. The start method is responsible for actually doing things and kicking off goroutines.
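That split — the constructor builds all state (including channels), while Start only kicks off goroutines — can be sketched generically. The `tracker` type here is hypothetical, not the actual CurrentChainStateTracker; it also shows the `sync.Once` idiom commonly used to make Start/Stop idempotent:

```go
package main

import (
	"fmt"
	"sync"
)

// tracker is a hypothetical subsystem following the pattern discussed:
// every field is initialized in the constructor; Start only spins up
// goroutines.
type tracker struct {
	quit    chan struct{}
	wg      sync.WaitGroup
	started sync.Once
	stopped sync.Once
}

// newTracker constructs the fully built object. The tradeoff noted in
// the thread applies: with the quit channel made here, the tracker can
// only be started once per construction.
func newTracker() *tracker {
	return &tracker{quit: make(chan struct{})}
}

// Start does no allocation; it only launches the event loop.
func (t *tracker) Start() {
	t.started.Do(func() {
		t.wg.Add(1)
		go func() {
			defer t.wg.Done()
			<-t.quit // stand-in for the real event loop
		}()
	})
}

// Stop signals the goroutine and waits for it to exit.
func (t *tracker) Stop() {
	t.stopped.Do(func() {
		close(t.quit)
		t.wg.Wait()
	})
}

func main() {
	t := newTracker()
	t.Start()
	t.Stop()
	fmt.Println("stopped cleanly")
}
```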

peer/brontide.go:
return uint16(
// We don't need cryptographic randomness here.
/* #nosec */
rand.Intn(65532),
Member:

So we'll maximally end up here with ~1 kB/s

Collaborator (author):

The goal with this was to reach maximum variance to minimize accidental collisions of pings. It's possibly overkill and I can shrink it down but I decided to cover the entire valid input space.
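For context on "the entire valid input space": BOLT #1 only obliges a peer to reply with a pong when the ping's num_pong_bytes is less than 65532, so the satisfiable sizes are 0..65531, and rand.Intn(65532) draws uniformly over exactly that range. A sketch of what such a helper might look like (the function name follows the PR's randPongSize, but this body is illustrative):

```go
package main

import (
	"fmt"
	"math/rand"
)

// randPongSize draws a pong size uniformly from the full range a peer
// is required to honor. Per BOLT #1, a pong is only mandated when
// num_pong_bytes < 65532, so the valid space is 0..65531.
func randPongSize() uint16 {
	// We don't need cryptographic randomness here.
	/* #nosec */
	return uint16(rand.Intn(65532))
}

func main() {
	// Each ping gets a fresh expected pong size, maximizing variance
	// to minimize accidental collisions between consecutive pings.
	for i := 0; i < 5; i++ {
		fmt.Println(randPongSize())
	}
}
```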

@Roasbeef (Member) left a comment:

Reviewed 8 of 8 files at r1, all commit messages.
Reviewable status: all files reviewed, 21 unresolved discussions (waiting on @ProofOfKeags and @saubyk)

@ProofOfKeags ProofOfKeags force-pushed the pong-enforcement branch 7 times, most recently from 3b38c05 to ba21f3c Compare July 27, 2023 23:17
@lightninglabs-deploy:
@morehouse: review reminder
@ellemouton: review reminder
@ProofOfKeags, remember to re-request review from reviewers when ready

@ellemouton (Collaborator):

@ProofOfKeags - there is a race in TestPingManager (see CI)

@ellemouton (Collaborator) left a comment:

Getting there!

Here is a proposal for how to simplify things quite a bit:

pong.patch

peer/brontide.go, lines 687-688:
pingManager, pingChan, pongChan, failChain := NewPingManager(
newPingPayload, randPongSize, pingInterval, pingTimeout)
Collaborator:

construction should happen in the Brontide constructor.

Then call pingManager.Stop() from the Brontide Disconnect method rather than passing in the quit channel. The ping manager should have its own quit channel instead.

Collaborator (author):

Then call pingManager.Stop() from the Brontide Disconnect method rather than passing in the quit channel. The ping manager should have its own quit channel instead.

Adjusted this.

construction should happen in the Brontide constructor.

While I understand that the pattern throughout the codebase is this, I want to point out the difficulty with doing that in this case.

On the input side we allocate a couple of stack variables to help with some optimizations. We can drop the optimizations, or we will have to put those stack variables into the Brontide struct and I'm reticent to further bloat the Brontide. Dropping the optimizations I'm much more amenable to.

On the output side we would need to also store the ping, pong, and fail channels on the brontide and my goal with the "receive only" and "send only" type casts here is to enlist the go type system to help make sure that these things don't get misused. It is also why I made the pingManagerLiason and the readHandler take these channels as arguments as well, to prevent those values from being made freely available to other goroutines.

While I'm sympathetic to following patterns to make code more uniform, I worry about progressively piling more and more state into the Brontide, which in this context is serving as a global namespace and I don't think it's that controversial to say that large global namespaces are bad for reasoning about code.

How do you think about this tradeoff?
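The send-only and receive-only downcasts being defended here work like this in a generic Go sketch (not the PR's code): a bidirectional `chan T` converts implicitly to `chan<- T` or `<-chan T` at a call site, after which the compiler rejects use of the wrong end.

```go
package main

import "fmt"

// producer only ever sees a send-only view of the channel; attempting
// to receive from `out` would be a compile-time error.
func producer(out chan<- int) {
	for i := 0; i < 3; i++ {
		out <- i
	}
	close(out)
}

// consumer only ever sees a receive-only view; a send like
// `in <- 42` would fail to compile.
func consumer(in <-chan int) int {
	sum := 0
	for v := range in {
		sum += v
	}
	return sum
}

func main() {
	ch := make(chan int)
	go producer(ch) // chan int converts implicitly to chan<- int
	fmt.Println(consumer(ch)) // 3
}
```

This is the mechanism by which threading directional channels as arguments keeps other goroutines from misusing them.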

@ellemouton (Collaborator) Oct 13, 2023:

On the output side we would need to also store the ping, pong, and fail channels on the brontide and my goal with the "receive only" and "send only" t

Did you take a look at the patch I sent yesterday? It doesn't require having these channels at all, and I would argue that things are way more self-contained with the patch implementation, where no "liaison" is needed.

peer/brontide.go:
blockHeader = epoch.BlockHeader
headerBuf := bytes.NewBuffer(pingPayload[0:0])
err := blockHeader.Serialize(headerBuf)
case ping := <-pingChan:
Collaborator:

I think we can instead pass callbacks to the ping manager. See proposed patch

Collaborator (author):

I took a look at the proposed patch and I think it removes some of the structure I wanted. This was designed to take a channels-first approach, which as far as I understand is the preferred Go model of thinking. Secondly, I think it's important not to pollute the Brontide struct with additional state that only pertains to parts of it, hence my choice to thread state via arguments and to ensure proper message flow by downcasting channel types into their send-only and receive-only counterparts. I'm amenable to making some adjustments, but I'd like it to be understood why some of these choices were made before we throw them out.

Collaborator:

designed to take a channels first approach to things which as far as I understand is the preferred go model of thinking

Not necessarily. Channels are nice within a subsystem when used in a "closed" way, such that how the system is used from the outside doesn't affect things. With the implementation as it currently stands, there is the whole "ping manager liaison" thing, which IMO leaks implementation details into the Brontide system. I opt for keeping things contained within the ping manager.

I think it's important to not pollute the Brontide struct with additional state

where is the polluted state?

Comment on lines 57 to 68
// pingChan is the channel on which the pingManager will write Ping
// messages it wishes to send out
pingChan chan<- lnwire.Ping

// pongChan is the channel on which the pingManager will write Pong
// messages it is evaluating
pongChan <-chan lnwire.Pong

// failChan is the channel on which the pingManager will report ping
// failures. This will happen if the received Pong message deviates
// from what is acceptable, or if it times out.
failChan chan<- struct{}
@ellemouton (Collaborator) Oct 13, 2023:

With the patch I sent, you no longer need these channels.
With these channels, this subsystem (PingManager) depends on Brontide correctly reading and writing to these channels. So implementation details leak.

The patch I sent is a pattern we use often throughout the codebase where the calling system just provides some callbacks.

@ellemouton (Collaborator):

Reposting the suggested adjusted files instead of the patch so that it is easier to view and apply.

Suggested ping manager. Note the use of PingManagerConfig with callbacks instead of passing back channels and depending on Brontide using them correctly:

(I've removed all the comments to keep this comment as small as possible.)

package peer

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"

	"github.com/lightningnetwork/lnd/lnwire"
)

type PingManagerConfig struct {
	NewPingPayload func() []byte
	NewPongSize func() uint16
	IntervalDuration time.Duration
	TimeoutDuration time.Duration
	SendPing func(ping *lnwire.Ping)
	OnNoPongReceived func()
}

type PingManager struct {
	cfg *PingManagerConfig
	pingTime atomic.Pointer[time.Duration]
	pingLastSend *time.Time
	outstandingPongSize int32
	pingTicker *time.Ticker
	pingTimeout *time.Timer
	pongChan chan *lnwire.Pong

	started sync.Once
	stopped sync.Once

	mu   sync.Mutex
	quit chan struct{}
	wg   sync.WaitGroup
}

func NewPingManager(cfg *PingManagerConfig) *PingManager {
	m := PingManager{
		cfg:                 cfg,
		outstandingPongSize: -1,
		pongChan:            make(chan *lnwire.Pong, 1),
		quit:                make(chan struct{}),
	}

	return &m
}

func (m *PingManager) ReceivedPong(msg *lnwire.Pong) {
	select {
	case m.pongChan <- msg:
	case <-m.quit:
	}

	return
}

func (m *PingManager) Start() error {
	var err error
	m.started.Do(func() {
		err = m.start()
	})

	return err
}

func (m *PingManager) start() error {
	m.pingTicker = time.NewTicker(m.cfg.IntervalDuration)

	m.pingTimeout = time.NewTimer(0)
	defer m.pingTimeout.Stop()

	if !m.pingTimeout.Stop() {
		<-m.pingTimeout.C
	}

	m.wg.Add(1)
	go func() {
		defer m.wg.Done()
		for {
			select {
			case <-m.pingTicker.C:
				if m.outstandingPongSize >= 0 {
					m.cfg.OnNoPongReceived()
					return
				}

				pongSize := m.cfg.NewPongSize()
				ping := &lnwire.Ping{
					NumPongBytes: pongSize,
					PaddingBytes: m.cfg.NewPingPayload(),
				}

				if err := m.setPingState(pongSize); err != nil {
					m.cfg.OnNoPongReceived()

					return
				}

				m.cfg.SendPing(ping)

			case <-m.pingTimeout.C:
				m.resetPingState()

				m.cfg.OnNoPongReceived()

				return

			case pong := <-m.pongChan:
				pongSize := int32(len(pong.PongBytes))
				expected := m.outstandingPongSize
				lastPing := m.pingLastSend
				m.resetPingState()
				if pongSize != expected {
					m.cfg.OnNoPongReceived()

					return
				}

				if lastPing != nil {
					rtt := time.Since(*lastPing)
					m.pingTime.Store(&rtt)
				}
			case <-m.quit:
				return
			}
		}
	}()

	return nil
}

func (m *PingManager) Stop() error {
	close(m.quit)
	m.wg.Wait()

	m.pingTicker.Stop()
	m.pingTimeout.Stop()

	return nil
}

func (m *PingManager) setPingState(pongSize uint16) error {
	t := time.Now()
	m.pingLastSend = &t
	m.outstandingPongSize = int32(pongSize)
	if m.pingTimeout.Reset(m.cfg.TimeoutDuration) {
		return fmt.Errorf(
			"impossible: ping timeout reset when already active",
		)
	}

	return nil
}

func (m *PingManager) resetPingState() {
	m.pingLastSend = nil
	m.outstandingPongSize = -1
	if !m.pingTimeout.Stop() {
		select {
		case <-m.pingTimeout.C:
		default:
		}
	}
}

func (m *PingManager) GetPingTimeMicroSeconds() int64 {
	rtt := m.pingTime.Load()

	if rtt == nil {
		return -1
	}

	return rtt.Microseconds()
}

Test:

package peer

import (
	"testing"
	"time"

	"github.com/lightningnetwork/lnd/lnwire"
	"github.com/stretchr/testify/require"
)

func TestPingManager(t *testing.T) {
	t.Parallel()

	testCases := []struct {
		name     string
		delay    int
		pongSize uint16
		result   bool
	}{
		{
			name:     "Happy Path",
			delay:    0,
			pongSize: 4,
			result:   true,
		},
		{
			name:     "Bad Pong",
			delay:    0,
			pongSize: 3,
			result:   false,
		},
		{
			name:     "Timeout",
			delay:    2,
			pongSize: 4,
			result:   false,
		},
	}

	payload := make([]byte, 4)
	for _, test := range testCases {
		test := test

		t.Run(test.name, func(t *testing.T) {
			t.Parallel()

			// Set up PingManager.
			onPing := make(chan struct{})
			disconnected := make(chan struct{})
			mgr := NewPingManager(&PingManagerConfig{
				NewPingPayload: func() []byte {
					return payload
				},
				NewPongSize: func() uint16 {
					return 4
				},
				IntervalDuration: time.Second * 2,
				TimeoutDuration:  time.Second,
				SendPing: func(ping *lnwire.Ping) {
					close(onPing)
				},
				OnNoPongReceived: func() {
					close(disconnected)
				},
			})

			require.NoError(t, mgr.Start())

			// Wait for initial Ping.
			<-onPing

			// Wait for pre-determined time before sending Pong
			// response.
			time.Sleep(time.Duration(test.delay) * time.Second)

			// Send Pong back.
			res := &lnwire.Pong{
				PongBytes: make([]byte, test.pongSize),
			}
			mgr.ReceivedPong(res)

			// Evaluate result
			select {
			case <-time.NewTimer(time.Second / 2).C:
				require.True(t, test.result)
			case <-disconnected:
				require.False(t, test.result)
			}

			require.NoError(t, mgr.Stop())
		})
	}
}

Resulting brontide.go changes (I had to change the suffix to .txt to get GitHub to accept the file type):

brontide.txt

@ProofOfKeags (Collaborator, author):

@ellemouton all feedback has been integrated. Let's get this shipped :)

@ellemouton (Collaborator):

@ProofOfKeags - can you re-push the branch to try get the CI to run? 🙏

@ProofOfKeags (Collaborator, author):

While it looks like the unit-race check is still failing, it appears to me that this is unrelated to the PR itself and is an artifact of the rebase.

@ellemouton (Collaborator) left a comment:

LGTM 🔥 left a last set of comments

Comment on lines 147 to 148
if pongSize != expected {
m.cfg.OnPongFailure()
Collaborator:

Since OnPongFailure is called for a variety of reasons, perhaps it makes sense to give the callback an error param so we can provide more info which Brontide can then log.
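A sketch of that suggestion: the OnPongFailure callback (the field name comes from the diff above) takes an error, so Brontide can log why the ping exchange failed before disconnecting. The error text here is illustrative, not lnd's actual messages.

```go
package main

import (
	"fmt"
	"log"
)

// PingManagerConfig sketch showing only the field relevant to the
// suggestion above.
type PingManagerConfig struct {
	// OnPongFailure is invoked with the reason the exchange failed:
	// for example a size mismatch or a timeout.
	OnPongFailure func(error)
}

func main() {
	cfg := &PingManagerConfig{
		OnPongFailure: func(err error) {
			// Brontide can log the reason, then disconnect.
			log.Printf("disconnecting peer: %v", err)
		},
	}

	// Simulate a failure path reporting a size mismatch.
	cfg.OnPongFailure(fmt.Errorf(
		"pong size mismatch: got %d, want %d", 3, 4,
	))
}
```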

@ProofOfKeags (Collaborator, author):

LGTM 🔥 left a last set of comments

All comments addressed.

@saubyk (Collaborator) left a comment:

Ack

@ellemouton (Collaborator):

@ProofOfKeags - The linter is failing and the race check failed on TestPingManager

This change adds a new subsystem that is responsible for providing an up-to-date view of some global chain-state parameters.

This commit takes the best block tracker, adds it to the ChainControl objects, and appropriately initializes it on ChainControl creation.

This change makes the generation of the ping payload a no-arg closure parameter, relieving the pingHandler of having to directly monitor the chain state. This makes use of the BestBlockView that was introduced in earlier commits.
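The no-arg closure pattern this commit describes can be sketched as follows. This is a simplified stand-in: it encodes only a block height, whereas the PR serializes the latest block header into the payload, and `bestHeightSource`/`newPingPayloadFunc` are hypothetical names for the BestBlockView-backed generator.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// bestHeightSource is a stand-in for a BestBlockView-style dependency:
// something that can cheaply report current chain state.
type bestHeightSource func() uint32

// newPingPayloadFunc returns a no-arg closure suitable for handing to
// the ping manager. Each call snapshots chain state into the payload,
// so the ping logic needs no chain subscription of its own.
func newPingPayloadFunc(height bestHeightSource) func() []byte {
	return func() []byte {
		payload := make([]byte, 4)
		binary.BigEndian.PutUint32(payload, height())
		return payload
	}
}

func main() {
	h := uint32(800000)
	gen := newPingPayloadFunc(func() uint32 { return h })
	fmt.Println(gen()) // payload reflecting the current height
	h++
	fmt.Println(gen()) // the closure sees the updated chain state
}
```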
This commit refactors some of the bookkeeping around the ping logic
inside of the Brontide. If the pong response is noncompliant with
the spec or if it times out, we disconnect from the peer.
@guggero guggero merged commit 15f4213 into lightningnetwork:master Oct 19, 2023

Successfully merging this pull request may close these issues.

peer: enforce pong responses
7 participants