Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestGRPCConnection unit test fails #969

Closed
dswarbrick opened this issue Sep 18, 2022 · 13 comments
Closed

TestGRPCConnection unit test fails #969

dswarbrick opened this issue Sep 18, 2022 · 13 comments

Comments

@dswarbrick
Copy link
Contributor

Running go test ./... on a checkout of v0.22.0, the TestGRPCConnection test consistently fails as follows:

--- FAIL: TestGRPCConnection (0.00s)
    grpc_test.go:73: GRPC probe failed
panic: Fail in goroutine after TestGRPCConnection has completed

goroutine 579 [running]:
testing.(*common).Fail(0xc000318820)
        /usr/lib/go-1.19/src/testing/testing.go:824 +0xe5
testing.(*common).Errorf(0xc000318820, {0xc2f5b0?, 0xc0002df8f0?}, {0xc000333fc0?, 0x0?, 0xc000601820?})
        /usr/lib/go-1.19/src/testing/testing.go:941 +0x65
github.com/prometheus/blackbox_exporter/prober.TestGRPCConnection.func1()
        /home/daniel/Work/Debian/prometheus-blackbox-exporter/prober/grpc_test.go:56 +0x6e
created by github.com/prometheus/blackbox_exporter/prober.TestGRPCConnection
        /home/daniel/Work/Debian/prometheus-blackbox-exporter/prober/grpc_test.go:54 +0x325
FAIL    github.com/prometheus/blackbox_exporter/prober  0.013s
FAIL

Tested with Go 1.19.1 on linux/amd64.

@dswarbrick
Copy link
Contributor Author

dswarbrick commented Jan 2, 2023

This still fails identically with v0.23.0 (and master branch, as of writing).

What's more, isolating the tests to just the GRPC tests fails in a completely different way:

$ go test ./...
ok      github.com/prometheus/blackbox_exporter (cached)
ok      github.com/prometheus/blackbox_exporter/config  (cached)
--- FAIL: TestGRPCConnection (0.00s)
    grpc_test.go:73: GRPC probe failed
panic: Fail in goroutine after TestGRPCConnection has completed

goroutine 596 [running]:
testing.(*common).Fail(0xc000301d40)
        /usr/lib/go-1.19/src/testing/testing.go:824 +0xe5
testing.(*common).Errorf(0xc000301d40, {0xc68a18?, 0xc000391218?}, {0xc000521fc0?, 0x12b4c70?, 0xc000529a90?})
        /usr/lib/go-1.19/src/testing/testing.go:941 +0x65
github.com/prometheus/blackbox_exporter/prober.TestGRPCConnection.func1()
        /tmp/blackbox_exporter/prober/grpc_test.go:56 +0x6e
created by github.com/prometheus/blackbox_exporter/prober.TestGRPCConnection
        /tmp/blackbox_exporter/prober/grpc_test.go:54 +0x325
FAIL    github.com/prometheus/blackbox_exporter/prober  0.012s
FAIL

vs.

$ go test -run TestGRPC ./...
ok      github.com/prometheus/blackbox_exporter (cached) [no tests to run]
ok      github.com/prometheus/blackbox_exporter/config  (cached) [no tests to run]
--- FAIL: TestGRPCConnection (0.00s)
    grpc_test.go:73: GRPC probe failed
--- FAIL: TestGRPCTLSConnection (0.14s)
    grpc_test.go:237: GRPC probe failed
--- FAIL: TestGRPCServiceNotFound (0.00s)
    utils_test.go:48: Expected: probe_grpc_status_code: 5, got: probe_grpc_status_code: 0
--- FAIL: TestGRPCHealthCheckUnimplemented (0.00s)
    utils_test.go:48: Expected: probe_grpc_status_code: 12, got: probe_grpc_status_code: 0
FAIL
FAIL    github.com/prometheus/blackbox_exporter/prober  0.148s
FAIL

In fact, the latter is inconsistent. Sometimes the gRPC server error message is reported:

--- FAIL: TestGRPCTLSConnection (0.06s)
    grpc_test.go:237: GRPC probe failed
    grpc_test.go:218: failed to serve: grpc: the server has been stopped

Other times, the test panics as it does when running the full suite of tests:

$ go test -run TestGRPC ./...
ok      github.com/prometheus/blackbox_exporter (cached) [no tests to run]
ok      github.com/prometheus/blackbox_exporter/config  (cached) [no tests to run]
--- FAIL: TestGRPCConnection (0.00s)
    grpc_test.go:73: GRPC probe failed
--- FAIL: TestGRPCTLSConnection (0.05s)
    grpc_test.go:237: GRPC probe failed
panic: Fail in goroutine after TestGRPCTLSConnection has completed

goroutine 53 [running]:
testing.(*common).Fail(0xc0002b6000)
        /usr/lib/go-1.19/src/testing/testing.go:824 +0xe5
testing.(*common).Errorf(0xc0002b6000, {0xc68a18?, 0xc000281098?}, {0xc0004087c0?, 0x0?, 0x0?})
        /usr/lib/go-1.19/src/testing/testing.go:941 +0x65
github.com/prometheus/blackbox_exporter/prober.TestGRPCTLSConnection.func1()
        /tmp/blackbox_exporter/prober/grpc_test.go:218 +0x6e
created by github.com/prometheus/blackbox_exporter/prober.TestGRPCTLSConnection
        /tmp/blackbox_exporter/prober/grpc_test.go:216 +0x805
FAIL    github.com/prometheus/blackbox_exporter/prober  0.055s
FAIL

Something smells racy here.

@dswarbrick
Copy link
Contributor Author

@averzicco As the author of the GRPC tests, are you able to reproduce the failures that I see here? I'm surprised this made it past CI when being merged.

@averzicco
Copy link
Contributor

hi @dswarbrick,

I've quickly checkout the code, all test are successful:

$ git clone git@github.com:prometheus/blackbox_exporter.git
Cloning into 'blackbox_exporter'...
remote: Enumerating objects: 5159, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 5159 (delta 4), reused 2 (delta 1), pack-reused 5149
Receiving objects: 100% (5159/5159), 6.95 MiB | 7.05 MiB/s, done.
Resolving deltas: 100% (2912/2912), done.

$ cd blackbox_exporter/

$ make test
>> running all tests
go test -race  ./...
go: downloading github.com/miekg/dns v1.1.50
go: downloading github.com/go-kit/log v0.2.1
go: downloading github.com/prometheus/common v0.39.0
go: downloading github.com/prometheus/client_golang v1.14.0
go: downloading golang.org/x/net v0.4.0
go: downloading google.golang.org/grpc v1.51.0
go: downloading github.com/prometheus/client_model v0.3.0
go: downloading github.com/prometheus/exporter-toolkit v0.8.2
go: downloading golang.org/x/sys v0.3.0
go: downloading golang.org/x/oauth2 v0.3.0
go: downloading github.com/matttproud/golang_protobuf_extensions v1.0.4
go: downloading golang.org/x/text v0.5.0
go: downloading github.com/coreos/go-systemd/v22 v22.4.0
go: downloading golang.org/x/sync v0.1.0
ok  	github.com/prometheus/blackbox_exporter	0.042s
ok  	github.com/prometheus/blackbox_exporter/config	0.081s
ok  	github.com/prometheus/blackbox_exporter/prober	2.860s

$ go test ./...
ok  	github.com/prometheus/blackbox_exporter	0.006s
ok  	github.com/prometheus/blackbox_exporter/config	0.014s
ok  	github.com/prometheus/blackbox_exporter/prober	1.906s

$ go test -run TestGRPC ./...
ok  	github.com/prometheus/blackbox_exporter	0.011s [no tests to run]
ok  	github.com/prometheus/blackbox_exporter/config	0.009s [no tests to run]
ok  	github.com/prometheus/blackbox_exporter/prober	0.130s
$ go version
go version go1.19.3 linux/amd64

@dswarbrick
Copy link
Contributor Author

dswarbrick commented Jan 3, 2023

@averzicco I get inconsistent results, depending on whether I run tests with -race or not.

$ go test -race ./...
ok      github.com/prometheus/blackbox_exporter (cached)
ok      github.com/prometheus/blackbox_exporter/config  (cached)
--- FAIL: TestGRPCConnection (0.00s)
    grpc_test.go:73: GRPC probe failed
--- FAIL: TestMultipleGRPCservices (0.00s)
    grpc_test.go:140: GRPC probe failed for service1
--- FAIL: TestGRPCTLSConnection (0.13s)
    grpc_test.go:237: GRPC probe failed
--- FAIL: TestNoTLSConnection (0.00s)
    utils_test.go:48: Expected: probe_grpc_status_code: 14, got: probe_grpc_status_code: 0
--- FAIL: TestGRPCServiceNotFound (0.00s)
    utils_test.go:48: Expected: probe_grpc_status_code: 5, got: probe_grpc_status_code: 0
--- FAIL: TestGRPCHealthCheckUnimplemented (0.00s)
    utils_test.go:48: Expected: probe_grpc_status_code: 12, got: probe_grpc_status_code: 0
2023/01/03 22:40:07 http: TLS handshake error from 127.0.0.1:43152: remote error: tls: bad certificate
level=info msg="Resolving target address" target=ipv6.google.com ip_protocol=ip4
level=info msg="Resolved target address" target=ipv6.google.com ip=2404:6800:4006:804::200e
level=info msg="Resolving target address" target=ipv6.google.com ip_protocol=ip4
level=error msg="Resolution with IP protocol failed" target=ipv6.google.com ip_protocol=ip4 err="lookup ipv6.google.com: no such host"
FAIL
FAIL    github.com/prometheus/blackbox_exporter/prober  1.455s
FAIL
$ go test ./...
ok      github.com/prometheus/blackbox_exporter (cached)
ok      github.com/prometheus/blackbox_exporter/config  (cached)
--- FAIL: TestGRPCConnection (0.00s)
    grpc_test.go:73: GRPC probe failed
panic: Fail in goroutine after TestGRPCConnection has completed

goroutine 617 [running]:
testing.(*common).Fail(0xc00023c820)
        /usr/lib/go-1.19/src/testing/testing.go:824 +0xe5
testing.(*common).Errorf(0xc00023c820, {0xc6b41c?, 0xc00038a378?}, {0xc0002a9fc0?, 0x12b9b70?, 0xc0000b7cc0?})
        /usr/lib/go-1.19/src/testing/testing.go:941 +0x65
github.com/prometheus/blackbox_exporter/prober.TestGRPCConnection.func1()
        /tmp/blackbox_exporter/prober/grpc_test.go:56 +0x6e
created by github.com/prometheus/blackbox_exporter/prober.TestGRPCConnection
        /tmp/blackbox_exporter/prober/grpc_test.go:54 +0x325
FAIL    github.com/prometheus/blackbox_exporter/prober  0.012s
FAIL

That's on Ubuntu 22.10 (amd64) with Go 1.19.2 (and also 1.19.4, installed from Ubuntu lunar repos), Intel Core i7 (16 cores). The same results are observed when testing in a Debian sid schroot on that hardware.

Running the tests in a Debian sid schroot on the same hardware, or on a Debian testing (bookworm) Intel Core i3, the tests pass.

@averzicco
Copy link
Contributor

averzicco commented Jan 3, 2023

Based on the output you posted the result is not inconsistent, the test failed in both cases.

I've re-run the test with the -race flag (which is used when running) and are still successful:

$ go test -race ./...
ok  	github.com/prometheus/blackbox_exporter	0.050s
ok  	github.com/prometheus/blackbox_exporter/config	0.065s
ok  	github.com/prometheus/blackbox_exporter/prober	3.121s

$ go test -race -run TestGRPC ./...
ok  	github.com/prometheus/blackbox_exporter	0.039s [no tests to run]
ok  	github.com/prometheus/blackbox_exporter/config	0.043s [no tests to run]
ok  	github.com/prometheus/blackbox_exporter/prober	0.200s

I've also tried running the test on a 32 cores debian 10 machine, test are successful. Not sure what can be the issue here, since you can reproduce it locally, could you try to run the test in debug mode to check what is failing?

@dswarbrick
Copy link
Contributor Author

@averzicco By "inconsistent" I mean that the test are failing in different ways - but yes, they consistently fail.

A small correction to my earlier comment - the tests pass in the Debian sid schroot, yet fail when run natively on the Ubuntu 22.10 which hosts the schroot envs. That would seem to suggest something weird with Ubuntu's Go toolchain, which I find a bit hard to swallow.

Also, despite the tests passing in a Debian sid schroot on the Core i7, and natively on a Debian testing Core i3, these tests have been quite flaky on the Debian CI infrastructure which builds the prometheus-blackbox-exporter package (for a variety of archs), e.g. https://tests.reproducible-builds.org/debian/rbuild/unstable/amd64/prometheus-blackbox-exporter_0.23.0-1.rbuild.log.gz

@averzicco
Copy link
Contributor

for what i see in the log you linked, it seems something is wrong with the grpc build-deps (the TestGRPCConnection and TestMultipleGRPCservices gets skipped), maybe the TestGRPCTLSConnection doesn't get skipped because the server is created with TLS enabled? I guess that the important part is that something is trying to skip all gRPC tests but doesn't skip TestGRPCTLSConnection

=== RUN   TestGRPCConnection
    grpc_test.go:38: Skipping unsupported test; gRPC build-deps too oudated.
--- SKIP: TestGRPCConnection (0.00s)
=== RUN   TestMultipleGRPCservices
    grpc_test.go:104: Skipping unsupported test; gRPC build-deps too oudated.
--- SKIP: TestMultipleGRPCservices (0.00s)
=== RUN   TestGRPCTLSConnection
    grpc_test.go:239: GRPC probe failed
--- FAIL: TestGRPCTLSConnection (1.20s)
=== RUN   TestNoTLSConnection
panic: Fail in goroutine after TestGRPCTLSConnection has completed

@dswarbrick
Copy link
Contributor Author

@averzicco Disregard the skipped tests; they were previously patched out due to older grpc build-deps in Debian, unable to support grpc.WithTransportCredentials(insecure.NewCredentials())) in the grpc prober code. That has now been patched to grpc.WithInsecure(), which is supported by the build-deps.

I think the tests are racy when executed on slower / busy systems (which Debian CI infrastructure often is). Look closely at the failure:

--- FAIL: TestGRPCConnection (0.00s)
    grpc_test.go:73: GRPC probe failed
panic: Fail in goroutine after TestGRPCConnection has completed

goroutine 612 [running]:
testing.(*common).Fail(0xc00023ad00)
        /usr/lib/go-1.19/src/testing/testing.go:824 +0xe5
testing.(*common).Errorf(0xc00023ad00, {0xc68a18?, 0xc000314390?}, 
{0xc00030dfc0?, 0x0?, 0xc000394ea0?})
        /usr/lib/go-1.19/src/testing/testing.go:941 +0x65
github.com/prometheus/blackbox_exporter/prober.TestGRPCConnection.func1()
        /tmp/blackbox_exporter/prober/grpc_test.go:56 +0x6e
created by github.com/prometheus/blackbox_exporter/prober.TestGRPCConnection
        /tmp/blackbox_exporter/prober/grpc_test.go:54 +0x325

I think it is due to the inherent raciness caused by tests which spin up http servers, tcp servers etc in goroutines within the same test.

    s := grpc.NewServer()
    healthServer := health.NewServer()
    healthServer.SetServingStatus("service", grpc_health_v1.HealthCheckResponse_SERVING)
    grpc_health_v1.RegisterHealthServer(s, healthServer)

    go func() {
        if err := s.Serve(ln); err != nil {
            t.Errorf("failed to serve: %v", err)
            return
        }   
    }() 
    defer s.GracefulStop()

    testCTX, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    registry := prometheus.NewRegistry()

    result := ProbeGRPC(testCTX, "localhost:"+port,
        config.Module{Timeout: time.Second, GRPC: config.GRPCProbe{
            IPProtocolFallback: false,
        },  
        }, registry, log.NewNopLogger())

    if !result {
        t.Fatalf("GRPC probe failed")
    }

I suspect that the grpc server in the goroutine is not ready in time, so ProbeGRPC() fails, and the deferred s.GracefulStop() is called as the test exits. However, the goroutine is still in the s.Serve() loop, so... panic.

It wouldn't be the first time that I've needed to add a small delay after the goroutine to give tcp / http server time to initialize before a request is fired at it.

@dswarbrick
Copy link
Contributor Author

Putting aside the goroutine panic for a moment, I found out why the other type of test failure occurs, e.g.

--- FAIL: TestGRPCConnection (0.00s)
    grpc_test.go:73: GRPC probe failed
--- FAIL: TestGRPCTLSConnection (0.14s)
    grpc_test.go:237: GRPC probe failed
--- FAIL: TestGRPCServiceNotFound (0.00s)
    utils_test.go:48: Expected: probe_grpc_status_code: 5, got: probe_grpc_status_code: 0
--- FAIL: TestGRPCHealthCheckUnimplemented (0.00s)
    utils_test.go:48: Expected: probe_grpc_status_code: 12, got: probe_grpc_status_code: 0

It seems that the lookup of "localhost" is failing in ProbeGRPC due to the probe module DNS resolution options.

address localhost: no suitable address found

Setting IPProtocolFallback: true fixes that. Even trying to probe "127.0.0.1:"+port in the test fails, unless PreferredIPProtocol: "ip4" (or simply "ip") is set. Digging through the code, it seems that blackbox_exporter defaults to "ip6" for name resolution. So even if we were to probe the listener address directly (e.g. 127.0.0.1:54321), the name resolution would still fail if it were forced to treat that as IPv6.

@averzicco Out of curiosity, are you setting IPProtocolFallback: false for a specific reason? The other tests in this repo set it to true, and the blackbox_exporter config docs also state that it is true by default.

I don't yet fully understand why a little test program produces these results on Ubuntu:

net.ResolveIPAddr("ip", "localhost"): 127.0.0.1
net.ResolveIPAddr("ip4", "localhost"): 127.0.0.1
net.ResolveIPAddr("ip6", "localhost"): err: address localhost: no suitable address found

... despite this:

$ host localhost
localhost has address 127.0.0.1
localhost has IPv6 address ::1

Running the same little test program on Debian produces the expected results:

net.ResolveIPAddr("ip", "localhost"): 127.0.0.1
net.ResolveIPAddr("ip4", "localhost"): 127.0.0.1
net.ResolveIPAddr("ip6", "localhost"): ::1

@dswarbrick
Copy link
Contributor Author

I tracked the strange failure of Go's (and Python's) inability to resolve "localhost" to "::1" on Ubuntu to the lack of an entry for it in /etc/hosts.

It seems that an out-of-the-box /etc/hosts on Ubuntu contains the following entries for IPv6:

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

... whereas a default Debian /etc/hosts contains (as per netcfg.h):

::1     localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

I'm not sure yet what generates the default /etc/hosts on Ubuntu, but I don't think it is netcfg anymore.

So, on Ubuntu, "localhost" was only resolving to "127.0.0.1" in net.Listen("tcp", "localhost:0"), and the combination of IPProtocolFallback: false with the default PreferredIPProtocol: "ip6" meant that blackbox_exporter's chooseProtocol function was failing to resolve "localhost" to an IPv6 address.

@averzicco
Copy link
Contributor

Sorry, more than one year has passed since I wrote that code and I don't remember why I've used IPProtocolFallback:false in the test maybe there was a good reason or I just tried with that and worked...

@electron0zero
Copy link
Member

I was able to reproduce it on master with make test, or via go test ./... on my local machine (running go version go1.21.4 linux/amd64)

but it works fine when I run it the way we run it in CI,

we run it like this in CI:

export DOCKER_TEST_IMAGE_NAME=quay.io/prometheus/golang-builder:1.20-base
# setup ipv6, add following to `/etc/docker/daemon.json`
{
              "ipv6": true,
              "fixed-cidr-v6": "2001:db8:1::/64"
}
# restart docker deamon
sudo service docker restart
# run it
docker run --rm -t -v "$(pwd):/app" "${DOCKER_TEST_IMAGE_NAME}" -i github.com/prometheus/blackbox_exporter -T

this makes me think that there is something different we are doing in CI, and we need to do that in local.

I don't have time to dig down further right now so just sharing my findings.

@electron0zero
Copy link
Member

these tests fail when IPv6 is not enabled. if you are running these inside docker, you need to enable IPv6 (as shown in the comment above), or skip tests that needs IPv6. you can set CI=true env var to skip these tests.

I fixed these in #1342, and I am hoping that fixes this as well.

please reopen if this is not fixed, and you are able to reproduce this on master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants