
Integration tests consistently fail on first run #34495

Closed

hughesjj opened this issue Aug 7, 2024 · 8 comments
Labels: bug (Something isn't working), ci-cd (CI, CD, testing, build issues), priority:p1 (High)

Comments


hughesjj commented Aug 7, 2024

Component(s)

ci/cd

What happened?

Description

Integration tests don't fail the parent workflow if an integration test fails

Steps to Reproduce

Example job

At least two tests in this spot check fail, but the workflow still succeeds. Search for "jmx" in the build logs; the relevant output is also included below.

Expected Result

I expect the failing test to fail the workflow

Actual Result

(screenshot attached in the original issue)

Collector version

latest (main)

Environment information

No response

OpenTelemetry Collector configuration

No response

Log output

DONE 42 tests in 6.975s
Running target 'mod-integration-test' in module 'receiver/jmxreceiver' as part of group 'receiver-1'
make --no-print-directory -C receiver/jmxreceiver mod-integration-test
go: downloading go.opentelemetry.io/collector/receiver/otlpreceiver v0.106.1
running go integration test ./... in /home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/receiver/jmxreceiver
/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/.tools/gotestsum --rerun-fails=1 --packages="./..." -- -race -timeout 360s -parallel 4 -tags=integration,""
∅  internal/metadata
✖  internal/subprocess (776ms)
✖  . (28.327s)

However, on rerun, it passes more or less immediately

∅  internal/metadata
✖  internal/subprocess (769ms)
✖  . (33.44s)

DONE 51 tests, 2 failures in 34.640s

✓  . (cached)
✓  internal/subprocess (cached)

DONE 2 runs, 51 tests in 35.369s

Another receiver:


Running target 'mod-integration-test' in module 'receiver/googlecloudspannerreceiver' as part of group 'receiver-1'
make --no-print-directory -C receiver/googlecloudspannerreceiver mod-integration-test
go: downloading cloud.google.com/go/spanner v1.65.0
go: downloading github.com/mitchellh/hashstructure v1.1.0
go: downloading github.com/ReneKroon/ttlcache/v2 v2.11.0
go: downloading github.com/GoogleCloudPlatform/grpc-gcp-go/grpcgcp v1.5.0
go: downloading github.com/envoyproxy/go-control-plane v0.12.0
go: downloading github.com/cncf/xds/go v0.0.0-20240423153145-555b57ec207b
go: downloading github.com/envoyproxy/protoc-gen-validate v1.0.4
go: downloading cel.dev/expr v0.15.0
go: downloading github.com/census-instrumentation/opencensus-proto v0.4.1
running go integration test ./... in /home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/receiver/googlecloudspannerreceiver
/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/.tools/gotestsum --rerun-fails=1 --packages="./..." -- -race -timeout 360s -parallel 4 -tags=integration,""
✖  internal/filter (567ms)
✓  internal/filterfactory (1.069s)
✓  internal/datasource (1.183s)
✓  internal/metadata (1.129s)
✓  internal/metadataconfig (1.105s)
✓  . (3.911s)
✓  internal/metadataparser (1.064s)
✓  internal/statsreader (1.193s)

DONE 288 tests, 1 skipped, 1 failure in 23.004s


Additional context

No response
@hughesjj added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Aug 7, 2024

hughesjj commented Aug 7, 2024

/label ci-cd

@crobert-1 added the ci-cd (CI, CD, testing, build issues) label on Aug 7, 2024
@atoulme added the priority:p1 (High) label and removed the needs triage (New item requiring triage) label on Aug 7, 2024

odubajDT commented Aug 8, 2024

I'm going to look into this issue. It seems that gotestsum, which we use to run both the unit and the integration tests, is not propagating the exit code of the underlying go test invocation; gotestsum is just a wrapper around it. I'll investigate the possible options here.


hughesjj commented Aug 8, 2024

@odubajDT I'll assign this to you; feel free to assign it back to me if you don't have the time to take it up. (Actually, I don't have assignment permissions, sorry.)

Adding some useful links for context...

Thoughts on adding a canary to the integration tests? (e.g., replace a random file in one of the above integration targets with /dev/random or a known-failing file and check whether the whole workflow fails?)
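For illustration, such a canary could be as simple as an unconditionally failing test added to one of the integration targets; the package placement and names below are hypothetical, a sketch of the idea rather than an actual proposal file:

```go
//go:build integration

package jmxreceiver // hypothetical target; any module in the integration matrix would do

import "testing"

// TestWorkflowCanary fails unconditionally. If the parent CI workflow still
// reports success with this test present in an integration target, then test
// failures are not being propagated, which is what this issue describes.
func TestWorkflowCanary(t *testing.T) {
	t.Fatal("canary: an integration test failure should fail the workflow")
}
```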

hughesjj commented:

Hey @odubajDT, I'd like to take a stab at this tomorrow if you don't have a fix/root cause for it yet; I wanted to give you the opportunity to share anything before I jump the gun.

odubajDT commented:

> Hey @odubajDT, I'd like to take a stab at this tomorrow if you don't have a fix/root cause for it yet; I wanted to give you the opportunity to share anything before I jump the gun.

Hey @hughesjj, sorry for not responding; I was offline for a week. Sure, take a look if you're interested.


odubajDT commented Aug 19, 2024

From what I can see, the tests actually pass (on the retry), so the status is reported correctly:

running go integration test ./... in /home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/receiver/jmxreceiver
/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/.tools/gotestsum --rerun-fails=1 --packages="./..." -- -race -timeout 360s -parallel 4 -tags=integration,""
∅  internal/metadata
✖  internal/subprocess (776ms)
✖  . (28.327s)

DONE 51 tests, 2 failures in 30.608s

✓  . (1.047s)
✓  internal/subprocess (1.01s)

DONE 2 runs, 51 tests in 34.243s

As we can see, the second run of internal/subprocess passes, so the correct status is reported.

The tests are re-run because of the --rerun-fails=1 parameter passed to gotestsum.

This PR showcases it: #34729

One additional (unrelated) issue I spotted: it seems that in certain cases unit tests are run as part of the integration test run (and maybe vice versa).

For example, the internal/filter tests in receiver-1 run in both the integration-tests and unit-tests jobs. This happens because the tests are not tagged correctly.
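For reference, a minimal sketch of the tagging convention, assuming the usual `//go:build integration` constraint that pairs with the `-tags=integration` flag visible in the logs above (the package and test names below are placeholders, not the actual mis-tagged files):

```go
//go:build integration

package filter_test // hypothetical package, for illustration only

import "testing"

// Because of the build constraint above, this test is compiled only when
// -tags=integration is passed (as the mod-integration-test target does).
// Without the constraint, the same test also runs in the unit-test job,
// which is the mis-tagging described above.
func TestFilterAgainstRealBackend(t *testing.T) {
	t.Skip("sketch only: exercise the real dependency here")
}
```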

hughesjj commented:

No worries, I jinxed myself by saying I'd have time to do something 😅

Just sharing info from my side:

You're 100% correct. It's interesting how consistently this fails (at least for the jmx cases), and it's super annoying when using GoLand for integration tests -- they fail every time with the "default" run configuration, so working with breakpoints gets annoying.

For jmx in particular, I was getting complaints about a Ryuk container not terminating in GoLand.

I'll rename this issue to reflect the current understanding of things and see about removing the p1 label.

@hughesjj changed the title from "Integration tests don't fail the workflow" to "Integration tests consistently fail on first run" on Aug 21, 2024
odubajDT commented:

> No worries, I jinxed myself by saying I'd have time to do something 😅
>
> Just sharing info from my side:
>
> You're 100% correct. It's interesting how consistently this fails (at least for the jmx cases), and it's super annoying when using GoLand for integration tests -- they fail every time with the "default" run configuration, so working with breakpoints gets annoying.
>
> For jmx in particular, I was getting complaints about a Ryuk container not terminating in GoLand.
>
> I'll rename this issue to reflect the current understanding of things and see about removing the p1 label.

Thanks! The PR should be ready for review if you have time.

andrzej-stencel pushed a commit that referenced this issue Oct 9, 2024
**Description:**
- removing the re-running parameter from gotestsum
- fixing tests/code with goroutine leaks
- resolving race conditions (mostly caused by parallel tests)
- placing goleak ignorers for goroutine leaks from external libraries (see the sketch below)

**Link to tracking Issue:** #34495

---------

Signed-off-by: odubajDT <ondrej.dubaj@dynatrace.com>
Co-authored-by: Alex Boten <223565+codeboten@users.noreply.github.com>
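The last bullet of that commit message mentions goleak ignorers. As a rough sketch of the pattern (the package name and the ignored function below are placeholders, not the actual entries from the PR), a package-level TestMain can verify that no goroutines are left running after the tests finish while ignoring known leaks from external libraries:

```go
package googlecloudspannerreceiver_test // hypothetical; the PR touches several modules

import (
	"testing"

	"go.uber.org/goleak"
)

// TestMain lets goleak check that no goroutines are still running once the
// package's tests complete. IgnoreTopFunction entries whitelist known leaks
// that originate in external libraries and cannot be fixed in this repo.
func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m,
		// Placeholder ignore entry; the real PR lists the specific offending
		// functions from the external dependencies involved.
		goleak.IgnoreTopFunction("github.com/example/extlib.(*Watcher).loop"),
	)
}
```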
jmichalek132 pushed a commit to jmichalek132/opentelemetry-collector-contrib that referenced this issue Oct 10, 2024
sbylica-splunk pushed a commit to sbylica-splunk/opentelemetry-collector-contrib that referenced this issue Dec 17, 2024