
fix: refactor azureeventhubrehydrationreceiver to stream blobs as to not lock up on larger environments (BPOP-831) #2098

Merged: 21 commits merged into release/v1.69.0 from feat/paginate-azure-eventhub-rehydration on Jan 17, 2025

Conversation

@schmikei (Contributor) commented Jan 7, 2025

Proposed Change

  • Stream the results of the blob listing rather than requesting all blobs in one lump sum; the previous lump-sum approach caused the collector to stall during large rehydration efforts using Azure Event Hub. A rough sketch of the streaming approach is shown below.
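
The sketch below uses the Azure SDK's flat-listing pager; the connection string, container name, page size, and printing are placeholders, not the PR's actual implementation.

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/Azure/azure-sdk-for-go/sdk/storage/azblob"
)

func main() {
    // Placeholder connection string and container name.
    client, err := azblob.NewClientFromConnectionString("<connection-string>", nil)
    if err != nil {
        log.Fatal(err)
    }

    // Roughly what a batch_size/page_size option would control.
    pageSize := int32(1000)
    pager := client.NewListBlobsFlatPager("my-container", &azblob.ListBlobsFlatOptions{
        MaxResults: &pageSize,
    })

    // Each page is handled as soon as it arrives (and could be pushed onto a
    // channel), so memory stays bounded even for very large containers.
    for pager.More() {
        page, err := pager.NextPage(context.Background())
        if err != nil {
            log.Fatal(err)
        }
        for _, blob := range page.Segment.BlobItems {
            fmt.Println(*blob.Name)
        }
    }
}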
Checklist
  • Changes are tested
  • CI has passed

@schmikei changed the title from "pre tests working; refactor to stream blobs as to not lock up" to "fix: refactor azureeventhubrehydrationreceiver to stream blobs as to not lock up on larger environments (BPOP-831)" on Jan 7, 2025
@schmikei marked this pull request as ready for review January 9, 2025 21:41
@schmikei requested review from dpaasman00 and a team as code owners January 9, 2025 21:41
@schmikei (Contributor, Author) commented Jan 9, 2025

Opening this up for review, but I'm going to see if I can add some more tests/benchmarks before merging anything.

@dpaasman00 (Contributor) left a comment:

Some nits and a context thing.

receiver/azureblobrehydrationreceiver/README.md (outdated)
receiver/azureblobrehydrationreceiver/config.go (outdated)
receiver/azureblobrehydrationreceiver/receiver.go (outdated)
receiver/azureblobrehydrationreceiver/receiver.go (outdated)
receiver/azureblobrehydrationreceiver/config.go (outdated)
@jsirianni self-assigned this Jan 14, 2025
@jsirianni (Member) left a comment:

Working great. I did run into upgrade issues, left a comment around that.

Comment on lines 71 to 77
if c.PollInterval != 0 {
return errors.New("poll_interval is no longer supported and batch_size/page_size should be used instead")
}

if c.PollTimeout != 0 {
return errors.New("poll_timeout is no longer supported and batch_size/page_size should be used instead")
}
Member:

This is a breaking change. Can we log warnings and ignore these options instead of failing?

{
  "level":"fatal",
  "ts":"2025-01-14T10:50:55.721-0500",
  "caller":"collector/main.go:121",
  "msg":"RunService returned error",
  "error":"failed to start service: error during OpAmp connection: collector failed to start: invalid configuration: receivers::azureblobrehydration/source1_01JHJQHCEX5NHT9HN2RS9FCRHX: poll_interval is no longer supported and batch_size/page_size should be used instead"}

I get that error when using Bindplane's Azure Blob source.

receivers:
    azureblobrehydration/source1_01JHJQHCEX5NHT9HN2RS9FCRHX:
        connection_string: redacted
        container: test
        delete_on_read: false
        ending_time: 2025-01-15T00:00
        poll_interval: 1m
        starting_time: 2025-01-01T00:00

This will allow us some time to get the Bindplane source updated, and provide users with upgrade compatibility. The source docs could include some disclaimers around agent versions and supported config options.

Contributor Author:

Yep, I can look into that; it was something originally brought up here, which I discussed with @dpaasman00: #2098 (comment)

@dpaasman00 I think I'm leaning towards what Joe is suggesting here: just logging a warning within Start for a couple of releases in order to maintain backward compatibility. Something like the sketch below.
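
A rough sketch of that warn-and-ignore approach (not the PR's actual code; the struct and Config fields mirror the snippet above but are simplified stand-ins):

package rehydrationsketch

import (
    "context"
    "time"

    "go.opentelemetry.io/collector/component"
    "go.uber.org/zap"
)

// Simplified stand-ins for the receiver's real config and struct.
type Config struct {
    PollInterval time.Duration
    PollTimeout  time.Duration
}

type rehydrationReceiver struct {
    logger *zap.Logger
    cfg    *Config
}

// Start warns about the deprecated options and otherwise ignores them, so
// existing Bindplane-managed configs keep starting while the source is updated.
func (r *rehydrationReceiver) Start(_ context.Context, _ component.Host) error {
    if r.cfg.PollInterval != 0 {
        r.logger.Warn("poll_interval is no longer used; use batch_size/page_size instead")
    }
    if r.cfg.PollTimeout != 0 {
        r.logger.Warn("poll_timeout is no longer used; use batch_size/page_size instead")
    }
    // ... existing start logic ...
    return nil
}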

Contributor:

Yea that works for me!

@jsirianni (Member) left a comment:

Edit: the Debug exporter samples its output, which explains my findings :)

Still working great; however, I am having some performance issues.

Without a batch processor, I seem to get five logs per request. I get a burst of requests and then a ~7 second delay before the next "burst".

{"level":"info","ts":"2025-01-14T16:01:09.776-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":5,"log records":5}
{"level":"info","ts":"2025-01-14T16:01:10.919-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":5,"log records":5}
{"level":"info","ts":"2025-01-14T16:01:10.999-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":5,"log records":5}
{"level":"info","ts":"2025-01-14T16:01:11.080-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":5,"log records":5}
{"level":"info","ts":"2025-01-14T16:01:11.161-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":5,"log records":5}
{"level":"info","ts":"2025-01-14T16:01:11.241-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":5,"log records":5}
{"level":"info","ts":"2025-01-14T16:01:11.322-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":5,"log records":5}
{"level":"info","ts":"2025-01-14T16:01:11.402-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":5,"log records":5}
{"level":"info","ts":"2025-01-14T16:01:11.484-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":5,"log records":5}
{"level":"info","ts":"2025-01-14T16:01:11.563-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":5,"log records":5}
{"level":"info","ts":"2025-01-14T16:01:11.644-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":5,"log records":5}

DELAY

{"level":"info","ts":"2025-01-14T16:01:19.787-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":5,"log records":5}
{"level":"info","ts":"2025-01-14T16:01:20.934-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":5,"log records":5}
{"level":"info","ts":"2025-01-14T16:01:21.015-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":5,"log records":5}
{"level":"info","ts":"2025-01-14T16:01:21.097-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":5,"log records":5}
{"level":"info","ts":"2025-01-14T16:01:21.179-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":5,"log records":5}
{"level":"info","ts":"2025-01-14T16:01:21.265-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":5,"log records":5}
{"level":"info","ts":"2025-01-14T16:01:21.351-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":5,"log records":5}
{"level":"info","ts":"2025-01-14T16:01:21.432-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":5,"log records":5}
{"level":"info","ts":"2025-01-14T16:01:21.512-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":5,"log records":5}
{"level":"info","ts":"2025-01-14T16:01:21.592-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":5,"log records":5}
{"level":"info","ts":"2025-01-14T16:01:21.675-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":5,"log records":5}

When I use a batch processor with a 2 second send interval, the logs seem to be more steady.

{"level":"info","ts":"2025-01-14T16:01:40.796-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":60,"log records":60}
{"level":"info","ts":"2025-01-14T16:01:42.798-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":105,"log records":105}
{"level":"info","ts":"2025-01-14T16:01:44.799-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":105,"log records":105}
{"level":"info","ts":"2025-01-14T16:01:46.800-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":120,"log records":120}
{"level":"info","ts":"2025-01-14T16:01:48.801-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":125,"log records":125}
{"level":"info","ts":"2025-01-14T16:01:50.803-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":100,"log records":100}
{"level":"info","ts":"2025-01-14T16:01:52.805-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":125,"log records":125}
{"level":"info","ts":"2025-01-14T16:01:54.806-0500","msg":"Logs","kind":"exporter","data_type":"logs","name":"debug/debug","resource logs":115,"log records":115}

@jsirianni (Member) left a comment:

General question: With the concurrency changes, we should make sure the shutdown logic allows the receiver to process all in flight batches before shutting down or persisting the marker to storage.

It may already be doing this, just want to be sure.

checkpointStore: rehydration.NewNopStorage(),
startingTime: startingTime,
endingTime: endingTime,
ctx: ctx,
cancelFunc: cancel,
blobChan: make(chan *azureblob.BlobResults),
Member:

I think we should consider making this a buffered channel. I think it would be okay for the receiver to slow to a crawl if the channel is full and we are waiting on blobs to be processed and sent down the pipeline.

Contributor Author:

Good idea 👍
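
A minimal, self-contained sketch of the buffered-channel backpressure idea (blobChanSize is the constant from the PR; everything else is a stand-in, not the PR's code):

package main

import "fmt"

// Stand-in for *azureblob.BlobResults.
type blobBatch []string

// blobChanSize comes from the PR; the buffer lets the lister run a few batches
// ahead, then blocks until consumers catch up instead of holding every batch in memory.
const blobChanSize = 5

func main() {
    blobChan := make(chan blobBatch, blobChanSize)

    // Producer: in the receiver this would be the pager loop pushing each page.
    go func() {
        defer close(blobChan)
        for i := 0; i < 20; i++ {
            blobChan <- blobBatch{fmt.Sprintf("blob-%d", i)}
        }
    }()

    // Consumer: in the receiver this would download each blob and send the
    // parsed OTLP data down the pipeline.
    for batch := range blobChan {
        fmt.Println("processing", batch)
    }
}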

Comment on lines 221 to 222
r.logger.Warn("No blobs processed for 3 consecutive polls, assuming no more blobs to process")
return
Member:

I am still curious about this. It feels awkward to me to have the receiver stop processing logs despite continuing to have the process run.

I think it would be nice to progressively back off, starting with some short interval in the seconds and maxing out in the minutes. If someone leaves the agent running too long and it is backing off to retrying every 5m or 10m, I would not expect their Azure bill to be impacted much.
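
A rough sketch of what that progressive backoff could look like (illustrative only; the intervals and the pollOnce helper are stand-ins, not anything in the PR):

package pollsketch

import (
    "context"
    "time"
)

// pollLoop polls until the context is cancelled, backing off progressively when
// a poll handles no blobs and resetting whenever progress is made. pollOnce is a
// stand-in for one listing/processing pass that reports how many blobs it handled.
func pollLoop(ctx context.Context, pollOnce func(context.Context) int) {
    const (
        minBackoff = 5 * time.Second
        maxBackoff = 10 * time.Minute
    )
    backoff := minBackoff

    for {
        if processed := pollOnce(ctx); processed > 0 {
            backoff = minBackoff // reset whenever we make progress
        } else {
            backoff *= 2
            if backoff > maxBackoff {
                backoff = maxBackoff
            }
        }

        select {
        case <-ctx.Done():
            return
        case <-time.After(backoff):
        }
    }
}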

Contributor Author:

I would generally agree that a backoff approach could be taken; I was mostly just trying to keep the old behavior of the receiver stopping requests when nothing is being processed, as a cost-saving measure. I think Corbin was focusing on that for this use case rather than on a backoff approach.

@dpaasman00 do you have any thoughts on this, since you've looked at S3 rehydration?

Member:

Okay, I'm okay with keeping it the same, but would like to hear Ryan's take on it.

Member:

It could certainly be a future change.

Contributor:

It looks like this functionality was removed, but I think that's okay since we're now using a better pagination process. We shouldn't need to worry about getting empty polls anymore.

@schmikei (Contributor, Author) commented:

General question: With the concurrency changes, we should make sure the shutdown logic allows the receiver to process all in flight batches before shutting down or persisting the marker to storage.

It may already be doing this, just want to be sure.

It should be doing this because of the wait group but I'll write a test to make sure!
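
For reference, a rough sketch of that shutdown ordering (cancelFunc and makeCheckpoint appear in the PR; the WaitGroup field and the stub below are assumptions for the sketch):

package shutdownsketch

import (
    "context"
    "sync"
)

type rehydrationReceiver struct {
    cancelFunc context.CancelFunc
    wg         sync.WaitGroup
}

func (r *rehydrationReceiver) Shutdown(ctx context.Context) error {
    // Stop producing new work first...
    r.cancelFunc()
    // ...then wait for in-flight batches so the checkpoint reflects everything
    // that was actually consumed...
    r.wg.Wait()
    // ...and only then persist the marker.
    return r.makeCheckpoint(ctx)
}

// makeCheckpoint stands in for the real method that persists the rehydration marker.
func (r *rehydrationReceiver) makeCheckpoint(_ context.Context) error { return nil }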

Comment on lines +104 to +106
// factor of buffered channel size
// number of blobs to process at a time is blobChanSize * batchSize
const blobChanSize = 5
Member:

Cool, this seems like a good size; it's performing well for me. In the future maybe we can expose it. I don't see a need to right now.

@jsirianni (Member) left a comment:

This is working well for me. I'd like @dpaasman00 to approve as well; he knows more of the collector's inner workings.

@schmikei (Contributor, Author) commented:

@dpaasman00 was noticing some shutdown hanging, but with my most recent commit I think this is ready for another look if you have a moment!

@dpaasman00 (Contributor) left a comment:

Some comments, nothing major.

@dpaasman00 (Contributor) left a comment:

Left a few more comments about nits, but this looks good. Merge when you're happy with it.

2. The receiver will parse each blob's path to determine if it matches a path created by the [Azure Blob Exporter](../../exporter/azureblobexporter/README.md#blob-path).
3. If the blob path is from the exporter, the receiver will parse the timestamp represented by the path.
4. If the timestamp is within the configured range the receiver will download the blob and parse its contents into OTLP data.

a. The receiver will process both uncompressed JSON blobs and blobs compressed with gzip.

> Note: There is no current way of specifying a time range to rehydrate so any blobs outside fo the time range still need to be retrieved from the API in order to filter via the `starting_time` and `ending_time` configuration.
Contributor:

Suggested change:

- > Note: There is no current way of specifying a time range to rehydrate so any blobs outside fo the time range still need to be retrieved from the API in order to filter via the `starting_time` and `ending_time` configuration.
+ > Note: There is no current way of specifying a time range to rehydrate so any blobs outside of the time range still need to be retrieved from the API in order to filter via the `starting_time` and `ending_time` configuration.

var marker *string
var errs error
if err := r.makeCheckpoint(shutdownCtx); err != nil {
r.logger.Error("Error while saving checkpoint", zap.Error(err))
Contributor:

I think this log can get removed since we're in shutdown and the error will get reported. Pretty minor nit though, so up to you.

// Go through each blob and parse it's path to determine if we should consume it or not
r.logger.Debug("received a batch of blobs, parsing through them to determine if they should be rehydrated", zap.Int("num_blobs", len(blobs)))
Contributor:

Suggested change:

- r.logger.Debug("received a batch of blobs, parsing through them to determine if they should be rehydrated", zap.Int("num_blobs", len(blobs)))
+ r.logger.Debug("Received a batch of blobs, parsing through them to determine if they should be rehydrated", zap.Int("num_blobs", len(blobs)))

@schmikei merged commit 9a225bf into release/v1.69.0 on Jan 17, 2025
15 checks passed
@schmikei deleted the feat/paginate-azure-eventhub-rehydration branch January 17, 2025 15:01