monitor: reduce overhead #2037

Merged
merged 4 commits into cilium:master from reduce-overhead
Nov 22, 2017

Conversation

aalemayhu
Contributor

Reduce the number of allocations from the monitor.

@aalemayhu aalemayhu requested a review from a team November 14, 2017 16:44
@aalemayhu aalemayhu requested review from a team as code owners November 14, 2017 16:44
Contributor

@rlenglet rlenglet left a comment

This is a good idea, thanks!

@@ -77,14 +77,20 @@ func (m *Monitor) Run(npages int) {
break
}
}
if todo > 0 {
if todo > 0 && numListeners() > 0 {
Contributor

I would revert this change. Even if you have no listeners you must keep reading from the perf buffer; otherwise you'll start losing events and incorrectly report a high lost event count when the first reader connects.
Plus, it's always interesting to collect and report the "lost event" metric even when there are no readers.

However, you could optimize the no-listener case by not decoding the monitor events: just read and discard them.
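
As a concrete illustration of that suggestion, a minimal sketch (assuming the receiveEvent callback, the numListeners() helper, and the payload fields mentioned in this thread; the send step is a hypothetical helper):

func (m *Monitor) receiveEvent(data []byte, cpu int) {
	// The perf buffer is still drained, but with no listeners we skip decoding and fan-out.
	if numListeners() == 0 {
		return
	}
	pl := payload.Payload{Data: data, CPU: cpu, Type: payload.EventSample}
	m.send(&pl) // hypothetical helper that writes the payload to every listener
}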

Member

You want to disconnect from the perf ring buffer to stop triggering event generation in the data path. We don't want to pay the datapath cost as a penalty if no-one is listening.

@aalemayhu aalemayhu requested a review from a team as a code owner November 15, 2017 06:29
@@ -1058,6 +1058,9 @@ definitions:
unknown:
description: Number of unknown samples.
type: integer
numlisteners:
Contributor

How about renaming it to nlisteners for consistency with npages above, or just listeners for consistency with cpus, lost, and unknown above?

Contributor Author

nlisteners sounds good.

log.WithError(err).Warn("Error received while reading from perf buffer")

if numListeners() > 0 {
if todo > 0 {
Contributor

@rlenglet rlenglet Nov 15, 2017

epoll is used in its default level-triggered mode here. Therefore, if you don't call monitorEvents.ReadAll every time monitorEvents.Poll returns todo > 0, the next call to monitorEvents.Poll will not block, because there's still data to be read from the file descriptor. As a result, this loop will spin and consume 100% of one CPU, and the perf ring buffer will also fill up.

In short, this if todo > 0 test CAN'T be conditioned on anything, e.g. numListeners() > 0.
You must revert this change.

You should either:

  • Move your optimization into m.receiveEvent, where you could just ignore read events (don't decode them) if there are no listeners, as I previously suggested.
  • Or close the file descriptor until numListeners() > 0, as @tgraf suggested.
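
For concreteness, a sketch of the loop shape being described, reusing names that appear in this PR (pollTimeout and the exact signatures are assumptions): ReadAll runs whenever Poll reports pending data, and any listener check lives inside the handlers instead.

for {
	todo, err := monitorEvents.Poll(pollTimeout)
	if err != nil {
		log.WithError(err).Error("Error in Poll")
		if err == syscall.EBADF {
			break
		}
	}
	if todo > 0 {
		// Always drain; otherwise Poll keeps returning immediately and the loop spins.
		if err := monitorEvents.ReadAll(m.receiveEvent, m.lostEvent); err != nil {
			log.WithError(err).Warn("Error received while reading from perf buffer")
		}
	}
}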

Contributor Author

Please wait till I request a new review. Thanks!

@aalemayhu aalemayhu added the wip label Nov 15, 2017
eventConfig.NumPages = npages

main_event_loop:
for ok := <-waitForListeners; !ok; {
Member

What is the purpose for this channel?

Contributor Author

To wait for listeners, which are added in handleConnection. Am I using it wrong?

Contributor

There are a few patterns you might use that would be more canonical. The most useful would be to make a buffered channel (make(chan bool, 1000)); then the writes you have to it below wouldn't need the goroutine wrappers. Since you're using main_event_loop as a normal label, instead of one for the loop, you can replace the for ok := <-waitForListeners; !ok; with a simple for <-waitForListeners, but then only write to the channel when you would have written a true. You currently always write true in handleConnection, so that doesn't change, and you would have to use an if statement in the send case.
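
A self-contained sketch of that buffered-channel pattern (the capacity and the demo values are illustrative only):

package main

import "fmt"

func main() {
	// Buffered channel: sends don't block, so no goroutine wrappers are needed.
	waitForListeners := make(chan bool, 1000)

	// handleConnection would send true when a listener is added.
	waitForListeners <- true
	// A false is sent here only so this demo loop terminates.
	waitForListeners <- false

	// The event loop blocks until a value arrives and keeps running while it is true.
	for <-waitForListeners {
		fmt.Println("listener present, run one iteration of the event loop")
	}
}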

Contributor Author

Looks cleaner. The only other change I made was dropping the loop because it seemed to lock forever. Please take one more pass in case I misunderstood you.

Contributor Author

Or wait, it looks like I missed something.

Contributor

Sort of. My goal was to avoid the busy loop on the waitForListeners channel. The thing is, we're mixing concerns a bit: we want the loop to pause until there are listeners, but the only way we have to do that is by blocking on the channel every iteration; thus we also need to bypass this channel read :/

Would it be simpler to coordinate this a different way?

  • Spawn a goroutine that runs the main_event_loop logic when handleConnection takes us from numListeners==0 to numListeners==1 (see the sketch below).
  • In this main_event_loop goroutine, exit completely when there are no more listeners
    after purgeInactiveListeners (or just check at the start/end of the loop).
  • We can also defer closing sockets and other cleanup in this goroutine, so we know the cleanups happen at the right time.
  • This version would not have this waitForListeners channel at all.

My reasoning is that it is simple to not run the perf reading parts at all if there are no listeners. I think this was the motivation for these changes right?
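
A rough sketch of that goroutine-based option, assuming the surrounding package (listeners, purgeInactiveListeners, numListeners) as discussed in this PR; startEventLoop and the net.Conn element type are assumptions:

func handleConnection(conn net.Conn) {
	first := listeners.Len() == 0
	listeners.PushBack(conn)
	if first {
		// First listener: start the event loop in its own goroutine.
		go startEventLoop()
	}
}

func startEventLoop() {
	// defer cleanup of the perf event side and other resources here
	for {
		purgeInactiveListeners()
		if numListeners() == 0 {
			return // no listeners left, exit completely
		}
		// ... Poll() and ReadAll() as in the existing main_event_loop ...
	}
}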

If you want to stick with what you have now, then it might be easier to split the gotos and labels:

  • Add a new label waiting_for_listeners. Jump to it when we have no more listeners (so, where we currently jump to main_event_loop).
  • main_event_loop is moved to after the waitForListeners check. We jump to it in the case where there are still listeners and we need to poll again and keep reading.

Contributor Author

You are right. Will do it as proposed, just need to test it a little bit locally and then I'll push it out. Thanks, Ray.

@@ -287,7 +287,9 @@ start:
}
}

if pl.Type == payload.EventSample {
if meta.Size == 0 {
Member

When there are no monitors, is that what produces an API response with size 0 here?

Contributor Author

The intention of writing an empty meta is to let us detect a write failure when any of the known connections are gone or inactive. If we are not aware of any connections, we never write anything and these lines are not reached. The connections are added to a list of listeners, but the Go API gives us no good way to know when a connection is inactive, so we need to attempt a write in order to detect a failure. Does this answer your question?
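
A sketch of that probe, assuming listeners is the container/list of connections used elsewhere in this PR and that each element is a net.Conn (the element type is an assumption):

func purgeInactiveListeners() {
	metaBuf := generateHealthCheckMsg() // marshalled zero-sized payload.Meta
	for e := listeners.Front(); e != nil; {
		next := e.Next()
		conn := e.Value.(net.Conn)
		// A failed write means the connection is gone; drop it from the list.
		if _, err := conn.Write(metaBuf); err != nil {
			conn.Close()
			listeners.Remove(e)
		}
		e = next
	}
}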

Member

The explanation makes sense; can you add it as a comment in the code? It's not obvious otherwise.

Contributor Author

sure

log.WithError(err).Error("Error in Poll")
if err == syscall.EBADF {
break
select {
Member

        t := time.NewTicker(5 * time.Second)
        for range t.C {
                if monitorEvents != nil {
                        m.dumpState()
                }
        }

Contributor Author

Fixed

@@ -295,21 +297,19 @@ func (e *PerfEvent) Disable() error {
return nil
}

C.free(e.state)
Member

Is e.state guaranteed to never be NULL?

Contributor Author

No. We are using malloc, so there is a chance it can return NULL. What is the proper way to handle this? Do you want to do a null check first?

Contributor

I guess we'd have to check on line 293 and return an error there. To be safe, we could make this a no-op when e.state is nil too.

Contributor Author

@aalemayhu aalemayhu Nov 21, 2017

Will leave this as is unless @tgraf really wants it to change.

Member

I'm taking my comment back; calling free() on NULL is perfectly fine, so we can keep this.


if nm := sr.NodeMonitor; nm != nil {
fmt.Fprintf(w, "NodeMonitor:\tListening for events on %d CPUs with %dx%d of shared memory\n",
nm.Cpus, nm.Npages, nm.Pagesize)
/* If there are no listeners, let it appear as if the node
Member

Convert this to // comments

// dumpStat prints out the monitor status in JSON.
func (m *Monitor) dumpStat() {
c := int64(monitorEvents.Cpus)
n := int64(monitorEvents.Npages)
p := int64(monitorEvents.Pagesize)
l, u := monitorEvents.Stats()
ms := models.MonitorStatus{Cpus: c, Npages: n, Pagesize: p, Lost: int64(l), Unknown: int64(u)}
num := int64(numListeners())
ms := models.MonitorStatus{Cpus: c, Npages: n, Pagesize: p, Lost: int64(l), Unknown: int64(u), Nlisteners: num}
Member

Expand this to multiple lines; it will become easier to read and you won't need to store num first.
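
For example, the multi-line form would look roughly like this (values taken from the surrounding diff):

l, u := monitorEvents.Stats()
ms := models.MonitorStatus{
	Cpus:       int64(monitorEvents.Cpus),
	Npages:     int64(monitorEvents.Npages),
	Pagesize:   int64(monitorEvents.Pagesize),
	Lost:       int64(l),
	Unknown:    int64(u),
	Nlisteners: int64(numListeners()),
}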

listeners.PushBack(conn)
log.WithField("count.listener", listeners.Len()).Info("New monitor connected.")
/* Since this is our first listener that means the event loop
Member

Convert to //

}

func purgeInactiveListeners() {
meta := &payload.Meta{Size: uint32(0)}
Member

Move this into a function:

func generateHealthCheckMsg() []byte {
	meta := &payload.Meta{Size: uint32(0)}
	metaBuf, err := meta.MarshalBinary()
	if err != nil {
		log.WithError(err).Fatal("meta encode")
	}

	return metaBuf
}

meta := &payload.Meta{Size: uint32(0)}
metaBuf, err := meta.MarshalBinary()
if err != nil {
log.WithError(err).Fatal("meta encode")
Member

Can we safely continue here?

Contributor Author

We should continue; I'll change it to an info.

Contributor

@raybejjani raybejjani left a comment

Mostly small things that you can ignore, with one logic order thing that might be worth revisiting.

@@ -287,7 +287,9 @@ start:
}
}

if pl.Type == payload.EventSample {
if meta.Size == 0 {
Contributor

This if-else chain can be a switch statement, if you want. That would allow you to handle the last pl.Type == payload.RecordLost case explicitly and then have default report an error or whatever.

	switch {
	case meta.Size == 0:
		continue
	case pl.Type == payload.EventSample:
		receiveEvent(pl.Data, pl.CPU)
	case pl.Type == payload.RecordLost:
		lostEvent(pl.Lost, pl.CPU)
	default:
		log.WithError(err).Fatal("unsupported type in meta object")
	}

Contributor Author

Looks better

c.NumPages = npages
eventConfig.NumPages = npages
t := time.NewTicker(5 * time.Second)
for range t.C {
Contributor

I personally prefer to create the Ticker as you have here, but if you want a more compact form you can do:

for range time.Tick(5 * time.Second)

The assumption here is that this function is never intended to return (which it can't, since we loop infinitely and have no returns or breaks), and so the Ticker behind the channel returned by Tick does not need to be stopped (it would normally leak otherwise, see https://golang.org/pkg/time/#Tick).
If you keep the explicit t := form, could you also add a defer t.Stop()? It isn't necessary, but if someone ever refactors this code so that it returns, we won't leak the ticker.
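
Putting the suggestion together with the existing ticker (a sketch; note that time.Ticker's method is Stop, not Close):

t := time.NewTicker(5 * time.Second)
defer t.Stop() // harmless today, avoids a leak if the function ever gains a return path
for range t.C {
	if monitorEvents != nil {
		m.dumpState()
	}
}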

Contributor Author

Will add a comment but otherwise fine as is.

/* Since this is our first listener that means the event loop
* is not running and has to be started.
*/
if first {
Contributor

This could be `if first := listeners.Len() == 0; first {` since we don't use first otherwise.

Contributor Author

This does not do what you think it does. We only want to start the event loop if this is really our first listener. If you move the initialization further down, after the list has been modified, the meaning of first is not the same anymore.

Contributor

Oh! I see, yes. We also can't start the main loop since it would then exit because there are no listeners. You are totally right!

monitorEvents = me

for {
purgeInactiveListeners()
Contributor

Should we do the numListeners() == 0 check (the else on line 151 below) here? This then means we can return before the Poll if there are no listeners, and also means that if we do the Poll call we always read data in the todo > 0 case (I think this was rlenglet's concern earlier).
The general structure of the function would then look like:

bpf.NewCpuEvents()
for {
	purgeInactiveListeners()
	if numListeners == 0 {
		CloseAll()
		return
	}
	Poll()
	if todo > 0 {
		ReadAll()
	}
}

An alternative would be to defer monitorEventsCloseAll() and the error handling for it, and, if numListeners == 0, simply return from the function / break the loop to allow the function to exit.

Contributor Author

Will change it to the suggested structure, but deferring the closing of the perf event side might lead to an invalid memory address or nil pointer dereference, so we can't do that.

log.WithError(err).Error("Error while starting monitor")
return
}
monitorEvents = me
Contributor

A nit, but you could assign monitorEvents directly on line 128 (I suspect this is a holdover from a complicated if statement that didn't create monitorEvents in the outer scope).

@@ -105,6 +105,11 @@ func ReadMetaPayload(r io.Reader, meta *Meta, pl *Payload) error {
return err
}

// If there is no payload just abort
if meta.Size == 0 {
return nil
Contributor

The implied behaviour here is that payload's contents are undefined in the meta.Size==0 case. We can zero out payload or add a comment about this in the function comment (line 102)
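
A sketch of the zero-out variant inside ReadMetaPayload (assuming Payload can be reset by plain assignment):

// If there is no payload, reset pl so callers never see stale data, then abort.
if meta.Size == 0 {
	*pl = Payload{}
	return nil
}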

}
}
}

func generateHealthCheckMsg() []byte {
Contributor

Oh, it just occurred to me that since we just keep sending this same empty meta, and only in purgeInactiveListeners, we can store the []byte and just keep reusing that. It's not a big deal, though, this isn't a big memory use or anything.
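
A minimal sketch of that reuse (the variable name is hypothetical): marshal the zero-sized Meta once and hand the same slice to purgeInactiveListeners on every pass.

// Built once, reused for every liveness probe instead of re-marshalling each time.
var healthCheckMsg = generateHealthCheckMsg()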

@aalemayhu aalemayhu requested review from rlenglet and tgraf November 22, 2017 12:49
@aalemayhu aalemayhu dismissed rlenglet’s stale review November 22, 2017 12:51

Dismissing review, since there have been changes since last time

The `TotalAlloc` average was `824380337.5` and after this change it is
`685498361.9`. The numbers were retrieved by sending a signal to the node
monitor every 10 seconds while running the connection tracking tests
(tests/01-ct.sh).

What is `TotalAlloc`? From the Go docs[0]

        // TotalAlloc is cumulative bytes allocated for heap objects.
        //
        // TotalAlloc increases as heap objects are allocated, but
        // unlike Alloc and HeapAlloc, it does not decrease when
        // objects are freed.
        TotalAlloc uint64

[0]: https://golang.org/pkg/runtime/#MemStats

Related: #1926 (monitor: Reduce overhead)
Suggested-by: Ray Bejjani <ray@covalent.io>
Signed-off-by: Alexander Alemayhu <alexander@alemayhu.com>
Before, the total allocation was `778696759`; now it's `20105978.88`.
These numbers are after reducing state allocations.

Related: #1926 (monitor: Reduce overhead)
Suggested-by: Ray Bejjani <ray@covalent.io>
Signed-off-by: Alexander Alemayhu <alexander@alemayhu.com>
This will help with debugging the monitor and is also useful for seeing
other info messages.

Closes: #1931 (agent: runtime option to show monitor logs in the agent logs)
Signed-off-by: Alexander Alemayhu <alexander@alemayhu.com>
This should reduce the overhead on small instances that need all the CPU
and RAM available when not actively monitoring. Reading the perf events
is now dynamically managed: when a new monitor connection arrives we
enable reading, and we stop as soon as there are no more connections.

We now routinely check for inactive connections and remove them. Before,
a write failure would only surface if we got an event while trying to
write the meta or payload. Readers should check the meta size and ignore
the message if the size is 0. This way we can update the listener count
earlier, but the delay for the polling timeout is still there.

Also update the output when listening and in status.

Related: #1926 (monitor: Reduce overhead)
Closes: #1927 (monitor: improve handling of write failure)
Signed-off-by: Alexander Alemayhu <alexander@alemayhu.com>
@tgraf tgraf merged commit 1a33ccc into cilium:master Nov 22, 2017
@aalemayhu aalemayhu deleted the reduce-overhead branch November 22, 2017 16:17
@tgraf tgraf added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Dec 4, 2017
christarazi added a commit to christarazi/cilium that referenced this pull request Jul 17, 2024
As the previous commit mentions, this does not occur on main, v1.16, and
v1.15 due to cilium#29036.

However, in v1.14 and v1.13, we need to take special care because the
NameManager and SelectorCache lock can be taken while the Endpoint lock
is held during Endpoint deletion.

Here are the relevant stacktraces regarding the deadlock:

```
1: sync.Mutex.Lock [75 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086]
    sync            sema.go:77                   runtime_SemacquireMutex(*uint32(0x5), false, 43690)
    sync            mutex.go:171                 (*Mutex).lockSlow(*Mutex(cilium#1733))
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(*RWMutex(0xb0))
    fqdn            name_manager.go:70           (*NameManager).Lock(0xffffffffffffffff)
    policy          selectorcache.go:964         (*SelectorCache).RemoveSelectors(cilium#1088, {cilium#28569, 0xb, 1}, {cilium#643, cilium#32582})
    policy          l4.go:810                    (*L4Filter).removeSelectors(cilium#32582, cilium#29992)
    policy          l4.go:817                    (*L4Filter).detach(cilium#719, cilium#29993)
    policy          l4.go:988                    L4PolicyMap.Detach(...)
    policy          l4.go:1179                   (*L4Policy).Detach(cilium#20318, cilium#1383)
    policy          resolve.go:103               (*selectorPolicy).Detach(...)
    policy          distillery.go:81             (*PolicyCache).delete(cilium#1354, cilium#19354)
    policy          distillery.go:138            (*PolicyCache).LocalEndpointIdentityRemoved(cilium#523, cilium#1163)
    identitymanager manager.go:167               (*IdentityManager).remove(cilium#706, cilium#19354)
    identitymanager manager.go:147               (*IdentityManager).Remove(cilium#706, cilium#19354)
    identitymanager manager.go:52                Remove(...)
    endpoint        endpoint.go:1146             (*Endpoint).leaveLocked(cilium#1883, cilium#12221, {0x30, 0})
    endpoint        endpoint.go:2192             (*Endpoint).Delete(cilium#1883, {0x80, 0xaa})
    endpointmanager manager.go:380               (*EndpointManager).removeEndpoint(cilium#1161, 0, {0xff, 0xff})
    endpointmanager manager.go:394               (*EndpointManager).RemoveEndpoint(...)
    cmd             endpoint.go:684              (*Daemon).deleteEndpointQuiet(...)
    cmd             endpoint.go:666              (*Daemon).deleteEndpoint(cilium#1155, cilium#1883)
    cmd             endpoint.go:713              (*Daemon).DeleteEndpoint(cilium#1155, {cilium#27399, cilium#8108})
    cmd             endpoint.go:770              (*deleteEndpointID).Handle(cilium#21193, {cilium#2452, {cilium#27399, 0x4d}})
    endpoint        delete_endpoint_id.go:66     (*DeleteEndpointID).ServeHTTP(cilium#5934, {cilium#666, cilium#2242}, cilium#2452)
    middleware      operation.go:28              (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, cilium#2242}, cilium#2452)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#130), func{cilium#2242, 0x3})
    middleware      router.go:78                 NewRouter.func1({cilium#666, cilium#2242}, cilium#2451)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#718), func{cilium#2242, #59})
    middleware      redoc.go:72                  Redoc.func1({cilium#666, cilium#2242}, cilium#1251)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#4920), func{cilium#2242, #45})
    middleware      spec.go:46                   Spec.func1({cilium#666, cilium#2242}, cilium#4921)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#10532), func{cilium#2242, cilium#23015})
    metrics         middleware.go:64             (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, cilium#10533}, cilium#2451)
    api             apipanic.go:42               (*APIPanicHandler).ServeHTTP(cilium#722, {cilium#668, cilium#10533}, cilium#4922)
    http            server.go:2938               serverHandler.ServeHTTP(*Server(cilium#8105), cilium#668, cilium#10533, 0x6)
    http            server.go:2009               (*conn).serve(*conn(cilium#16005), Context{cilium#673, cilium#1554})
8: sync.Mutex.Lock [74 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086]
    sync            sema.go:77                   runtime_SemacquireMutex(*, 0x47, cilium#1154)
    sync            mutex.go:171                 (*Mutex).lockSlow(cilium#706)
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(*)
    identitymanager manager.go:99                (*IdentityManager).RemoveOldAddNew(cilium#706, 0, cilium#1154)
    identitymanager manager.go:123               RemoveOldAddNew(...)
    endpoint        policy.go:852                (*Endpoint).SetIdentity(*, cilium#1154, 0)
    endpoint        endpoint.go:1932             (*Endpoint).identityLabelsChanged(*, {cilium#674, *}, 1)
    endpoint        endpoint.go:1780             (*Endpoint).runIdentityResolver(*, {cilium#674, *}, 1, 1)
    endpoint        endpoint.go:1720             (*Endpoint).UpdateLabels(*, {cilium#674, *}, *, *, 8)
    cmd             endpoint.go:477              (*Daemon).createEndpoint(cilium#1155, {cilium#673, *}, {cilium#683, cilium#1155}, *)
    cmd             endpoint.go:542              (*putEndpointID).Handle(cilium#21192, {*, *, {*, 0xe}})
    endpoint        put_endpoint_id.go:58        (*PutEndpointID).ServeHTTP(cilium#3961, {cilium#666, *}, *)
    middleware      operation.go:28              (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    middleware      router.go:78                 NewRouter.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    middleware      redoc.go:72                  Redoc.func1({cilium#666, *}, cilium#1251)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, #45)
    middleware      spec.go:46                   Spec.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    metrics         middleware.go:64             (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, *}, *)
    api             apipanic.go:42               (*APIPanicHandler).ServeHTTP(#49, {cilium#668, *}, *)
    http            server.go:2938               serverHandler.ServeHTTP({cilium#653}, {cilium#668, *}, 6)
    http            server.go:2009               (*conn).serve(*, {cilium#673, cilium#1554})
5: sync.Mutex.Lock [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229]
    sync            sema.go:77                   runtime_SemacquireMutex(cilium#142, 0xe8, *)
    sync            mutex.go:171                 (*Mutex).lockSlow(cilium#1733)
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(0x68)
    fqdn            name_manager.go:70           (*NameManager).Lock(*)
    policy          selectorcache.go:798         (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, *}, {{*, 0x4d}, {0, 0}})
    policy          l4.go:628                    (*L4Filter).cacheFQDNSelector(...)
    policy          l4.go:623                    (*L4Filter).cacheFQDNSelectors(*, {*, 4, cilium#193}, cilium#536)
    policy          l4.go:725                    createL4Filter({cilium#680, *}, {*, 1, 1}, 0, {cilium#660, *}, {{*, 4}, ...}, ...)
    policy          l4.go:879                    createL4EgressFilter(...)
    policy          rule.go:717                  mergeEgressPortProto({cilium#680, *}, #44, {*, 0xa, 0}, *, {cilium#660, *}, {{*, ...}, ...}, ...)
    policy          rule.go:672                  mergeEgress.func1({cilium#660, *})
    api             l4.go:284                    PortRules.Iterate({*, 1, cilium#546}, *)
    policy          rule.go:624                  mergeEgress({cilium#680, *}, *, {*, 1, 1}, 0, {cilium#661, *}, {cilium#662, ...}, ...)
    policy          rule.go:753                  (*rule).resolveEgressPolicy(*, {cilium#680, *}, *, *, *, {0, 0, 0}, {0, ...})
    policy          rules.go:103                 ruleSlice.resolveL4EgressPolicy({*, *, *}, {cilium#680, *}, *)
    policy          repository.go:718            (*Repository).resolvePolicyLocked(cilium#1089, *)
    policy          distillery.go:119            (*PolicyCache).updateSelectorPolicy(cilium#1354, *)
    policy          distillery.go:153            (*PolicyCache).UpdatePolicy(...)
    endpoint        policy.go:262                (*Endpoint).regeneratePolicy(*)
    endpoint        bpf.go:744                   (*Endpoint).runPreCompilationSteps(*, *, *)
    endpoint        bpf.go:589                   (*Endpoint).regenerateBPF(*, *)
    endpoint        policy.go:457                (*Endpoint).regenerate(*, *)
    endpoint        events.go:53                 (*EndpointRegenerationEvent).Handle(*, *)
    eventqueue      eventqueue.go:245            (*EventQueue).run.func1()
    sync            once.go:74                   (*Once).doSlow(*, *)
    sync            once.go:65                   (*Once).Do(...)
    eventqueue      eventqueue.go:233            (*EventQueue).run(*)
1: select [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229]
    semaphore       semaphore.go:60              (*Weighted).Acquire(cilium#1092, {cilium#671, cilium#722}, cilium#766)
    lock            semaphored_mutex.go:30       (*SemaphoredMutex).Lock(...)
    ipcache         ipcache.go:140               (*IPCache).Lock(...)
    ipcache         cidr.go:56                   (*IPCache).AllocateCIDRs(cilium#1316, {0, 0, cilium#2037}, {0, 0, 0}, 0)
    ipcache         cidr.go:103                  (*IPCache).AllocateCIDRsForIPs(0, {0, cilium#697, 0}, 0xffffffffffffffff)
    cmd             identity.go:114              cachingIdentityAllocator.AllocateCIDRsForIPs(...)
    policy          selectorcache.go:509         (*SelectorCache).allocateIdentityMappings(cilium#1088, {{0, 0}, {cilium#5036, 0x2b}}, #45)
    policy          selectorcache.go:843         (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, cilium#27811}, {{0, 0}, {cilium#5036, 0x2b}})
    policy          l4.go:628                    (*L4Filter).cacheFQDNSelector(...)
    policy          l4.go:623                    (*L4Filter).cacheFQDNSelectors(cilium#27811, {#42936, 0x287, cilium#193}, cilium#536)
    policy          l4.go:725                    createL4Filter({cilium#680, cilium#22826}, {cilium#22828, 1, 1}, 0, {cilium#660, cilium#24510}, {{cilium#4487, 3}, ...}, ...)
    policy          l4.go:879                    createL4EgressFilter(...)
    policy          rule.go:717                  mergeEgressPortProto({cilium#680, cilium#22826}, #44, {cilium#22828, 0xa, #78536}, #79633, {cilium#660, cilium#24510}, {{cilium#4487, ...}, ...}, ...)
    policy          rule.go:672                  mergeEgress.func1({cilium#660, cilium#24510})
    api             l4.go:284                    PortRules.Iterate({cilium#24510, 1, cilium#546}, cilium#11741)
    policy          rule.go:624                  mergeEgress({cilium#680, cilium#22826}, cilium#18687, {cilium#22828, 1, 1}, 0, {cilium#661, cilium#5624}, {cilium#662, ...}, ...)
    policy          rule.go:753                  (*rule).resolveEgressPolicy(cilium#24575, {cilium#680, cilium#22826}, cilium#18687, cilium#29345, cilium#4782, {0, 0, 0}, {0, ...})
    policy          rules.go:103                 ruleSlice.resolveL4EgressPolicy({cilium#10690, 0xb, 5}, {cilium#680, cilium#22826}, cilium#18687)
    policy          repository.go:718            (*Repository).resolvePolicyLocked(cilium#1089, cilium#18461)
    policy          distillery.go:119            (*PolicyCache).updateSelectorPolicy(cilium#1354, cilium#18461)
    policy          distillery.go:153            (*PolicyCache).UpdatePolicy(...)
    endpoint        policy.go:262                (*Endpoint).regeneratePolicy(cilium#1748)
    endpoint        bpf.go:744                   (*Endpoint).runPreCompilationSteps(cilium#1748, cilium#27542, cilium#4781)
    endpoint        bpf.go:589                   (*Endpoint).regenerateBPF(cilium#1748, cilium#27542)
    endpoint        policy.go:457                (*Endpoint).regenerate(cilium#1748, cilium#27542)
    endpoint        events.go:53                 (*EndpointRegenerationEvent).Handle(cilium#18609, cilium#703)
    eventqueue      eventqueue.go:245            (*EventQueue).run.func1()
    sync            once.go:74                   (*Once).doSlow(*Once(#45414), func(cilium#69))
    sync            once.go:65                   (*Once).Do(...)
    eventqueue      eventqueue.go:233            (*EventQueue).run(cilium#6023)
```

Generated from pp tool: https://github.com/maruel/panicparse

Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi added a commit to christarazi/cilium that referenced this pull request Aug 7, 2024
christarazi added a commit to christarazi/cilium that referenced this pull request Aug 13, 2024