
Add batchprocessor support for client metadata #7325

Closed

Conversation

@jmacd (Contributor) commented on Mar 6, 2023:

Description: Add support for batching by metadata keys. We are aware that this has been frequently requested.

This work was prioritized on our team because it is a prerequisite for the OTel-Arrow compression bridge; see open-telemetry/community#1332 and open-telemetry/oteps#171.

Link to tracking Issue: #4544

Testing: One new test was added.

Documentation: The README was updated.
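To make the description concrete, here is a minimal sketch of the routing idea, assuming the configuration supplies a list of metadata keys: each incoming request is assigned to the batcher for its combination of values for those keys. This is an illustrative sketch only, not the PR's code; in the actual processor the values come from the client metadata attached to the request context, and the names batcherKey and requestMetadata are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// batcherKey builds a canonical identity for one combination of metadata
// values, iterating the configured keys in a fixed order so that equal
// combinations always map to the same batcher.
func batcherKey(metadataKeys []string, requestMetadata map[string][]string) string {
	parts := make([]string, 0, len(metadataKeys))
	for _, k := range metadataKeys {
		// An absent key contributes an empty value rather than an error.
		parts = append(parts, k+"="+strings.Join(requestMetadata[k], ";"))
	}
	return strings.Join(parts, ",")
}

func main() {
	keys := []string{"tenant-id", "cluster-name"}
	a := batcherKey(keys, map[string][]string{"tenant-id": {"acme"}, "cluster-name": {"eu-1"}})
	b := batcherKey(keys, map[string][]string{"cluster-name": {"eu-1"}, "tenant-id": {"acme"}})
	fmt.Println(a == b) // true: same combination, same batcher
	fmt.Println(a)      // tenant-id=acme,cluster-name=eu-1
}
```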

@jmacd requested review from a team and @jpkrohling on March 6, 2023, 19:50.
codecov bot commented on Mar 6, 2023:

Codecov Report

Patch coverage: 88.05% and project coverage change: -0.03 ⚠️

Comparison is base (5a55ff8) 91.10% compared to head (9ab2157) 91.07%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7325      +/-   ##
==========================================
- Coverage   91.10%   91.07%   -0.03%     
==========================================
  Files         295      295              
  Lines       14373    14475     +102     
==========================================
+ Hits        13094    13183      +89     
- Misses       1011     1021      +10     
- Partials      268      271       +3     
Impacted Files Coverage Δ
processor/batchprocessor/metrics.go 85.21% <80.00%> (-1.05%) ⬇️
processor/batchprocessor/batch_processor.go 89.70% <88.05%> (-0.78%) ⬇️
processor/batchprocessor/config.go 100.00% <100.00%> (ø)
processor/batchprocessor/factory.go 100.00% <100.00%> (ø)


@bogdandrutu (Member) left a comment:

Question: Do you need to batch by metadata, or do you need the batch to preserve the metadata? Are both needed, or just the preservation?

(Resolved review thread: processor/batchprocessor/batch_processor.go)
@jmacd (Contributor Author) commented on Mar 13, 2023:

> Question: Do you need to batch by metadata, or do you need the batch to preserve the metadata? Are both needed, or just the preservation?

I need both for the OTel-Arrow compression bridge (see open-telemetry/community#1332 and open-telemetry/oteps#171), because I want the use of the bridge to be transparent with regard to the selected metadata.

@jmacd (Contributor Author) commented on Mar 13, 2023:

@bogdandrutu would you like a boolean setting to control whether the batchprocessor's output context sets the metadata values? I don't have any applications in mind for batching without propagating the metadata.

@ericmustin (Contributor) left a comment:

Broadly this looks good to me. I had a question, for my own edification, about the behavior when the key isn't present in the Metadata.

It's also worth noting that Metadata itself is marked as Experimental (see:

// Metadata is the request metadata from the client connecting to this connector.
// Experimental: *NOTE* this structure is subject to change or removal in the future.
Metadata Metadata

). Would it be appropriate to also mark this configuration option as Experimental as a result? Or should Metadata have its Experimental designation removed?
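For what it's worth, with the client.Metadata API quoted above (go.opentelemetry.io/collector/client, as of this PR's timeframe), Get returns an empty result for a key that isn't present, so an absent key contributes no values rather than an error; whether the configuration option should inherit the Experimental marking is a separate question. A small illustrative check, not part of the PR:

```go
package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/collector/client"
)

func main() {
	// Build a request context carrying client metadata with a single key.
	md := client.NewMetadata(map[string][]string{"tenant-id": {"acme"}})
	ctx := client.NewContext(context.Background(), client.Info{Metadata: md})
	info := client.FromContext(ctx)

	fmt.Println(info.Metadata.Get("tenant-id"))    // [acme]
	fmt.Println(info.Metadata.Get("cluster-name")) // []  (no values for an absent key)
}
```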

@codeboten (Contributor) left a comment:

Just a couple of non-blocking comments on my end, PTAL.

(Resolved review threads: processor/batchprocessor/README.md, processor/batchprocessor/factory.go)
@codeboten (Contributor) commented:
@bogdandrutu @dmitryax @astencel-sumo @ericmustin please review and approve if your comments have been addressed

@ericmustin (Contributor) left a comment:

lgtm

(Resolved review threads: processor/batchprocessor/README.md, and four on processor/batchprocessor/batch_processor.go)
func (bp *batchProcessor) currentMetadataCardinality() int {
	bp.lock.Lock()
A reviewer (Member) commented:

How about a read lock here?

@jmacd (Contributor Author) replied:

This lock will be contended by writers on lines 294-295 and will only be read occasionally, so I don't think that would be an improvement. I've added a comment about how maintaining a map of attribute sets while avoiding lock contention and allocations, and also supporting recycling, is a tricky problem, one also faced by a metrics SDK.
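For context, the read-lock variant being suggested would look roughly like the sketch below; this is a standalone illustration with hypothetical names (metadataBatchers, batchers), not the PR's code. As noted above, with writers dominating the lock, an RWMutex gains little over a plain Mutex here.

```go
package main

import (
	"fmt"
	"sync"
)

// metadataBatchers stands in for the processor's map of per-metadata batchers.
type metadataBatchers struct {
	lock     sync.RWMutex
	batchers map[string]struct{}
}

// cardinality takes only the read lock, the alternative suggested in review.
func (m *metadataBatchers) cardinality() int {
	m.lock.RLock()
	defer m.lock.RUnlock()
	return len(m.batchers)
}

func main() {
	m := &metadataBatchers{batchers: map[string]struct{}{"tenant-id=acme": {}}}
	fmt.Println(m.cardinality()) // 1
}
```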

(Resolved review threads: processor/batchprocessor/batch_processor.go, processor/batchprocessor/batch_processor_test.go, processor/batchprocessor/config.go)
	defaultSendBatchSize = uint32(8192)
	defaultTimeout = 200 * time.Millisecond
	defaultMetadataCardinalityLimit = 100
@jpkrohling (Member) commented:
I can see users asking already for guidance on setting this value. Is there a reason why this is 100 and not 1000? I have a feeling that 100 is too low, but I don't know the impact of having 1000 as the default. Have you played with the settings, and can you write a couple of sentences on picking the right value?

@jmacd (Contributor Author) replied:
I am not sure what kind of user will think 100 is too low. In the scenario that Lightstep cares about, the number of metadata combinations is small: it is the number of projects per customer. The customer might have multiple projects in use, but they'll know how many that is. I wrote:

	// defaultMetadataCardinalityLimit should be set to the number
	// of metadata configurations the user expects to submit to
	// the collector.

@jpkrohling (Member) replied:
My view needs to be adjusted to the reality of common users of this, but when I think about metadata combinations, I think of tenant-id and cluster-name, for instance. In that case, 5 clusters and 20 tenants would already be 100 potential combinations. Going back to what I had originally in mind: is there guidance we can provide on the costs of 1000 vs. 100? Is it mostly about memory? There might also be a processing/scheduling penalty, but I believe this will be mostly memory, right?
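As background for the cost question: each distinct metadata combination retains its own batcher state (at minimum a pending batch and timer), so the limit primarily bounds memory, with some per-batcher scheduling overhead. The sketch below shows the kind of guard such a cardinality limit implies; the names and the behavior at the limit (reject vs. fall back to a shared batcher) are illustrative assumptions, not necessarily what this PR implements.

```go
package main

import (
	"errors"
	"fmt"
)

var errTooManyCombinations = errors.New("too many distinct metadata combinations")

// batcher is a stand-in for the per-combination state (pending batch, timer, ...).
type batcher struct{ key string }

type router struct {
	limit    int
	batchers map[string]*batcher
}

// get returns the batcher for a metadata combination, creating one until the
// configured cardinality limit is reached; beyond that, new combinations error.
func (r *router) get(key string) (*batcher, error) {
	if b, ok := r.batchers[key]; ok {
		return b, nil
	}
	if len(r.batchers) >= r.limit {
		return nil, errTooManyCombinations
	}
	b := &batcher{key: key}
	r.batchers[key] = b
	return b, nil
}

func main() {
	r := &router{limit: 2, batchers: map[string]*batcher{}}
	for _, key := range []string{"tenant=a", "tenant=b", "tenant=c"} {
		_, err := r.get(key)
		fmt.Println(key, err) // the third combination exceeds the limit
	}
}
```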

@jmacd (Contributor Author) commented on Apr 5, 2023:

I will process all the feedback above today. Thanks, reviewers!

@jmacd (Contributor Author) commented on Apr 5, 2023:

@jpkrohling I want to draw a connection between your remarks:

Here's why I think it's OK to go with a simple approach for now and defer complex discussions about how expensive the process is and how quickly it can recycle memory when batchers go idle. I expect this batching process to be performed on the customer side of an OTel collector pipeline. The customer will be responsible for setting this number to at most the number of distinct metadata keys that they are using themselves, for it is they who benefit from batching before sending data across an expensive network connection. In this case, the customer will have as many distinct metadata sets as they have teams, or possibly environments*teams. This should be a fixed number for the customer.

This is not a feature meant to help a vendor install a batching process in their own pipeline, where the number of distinct attribute sets might be determined by the number of customers or expected to be large and variable.

@jpkrohling (Member) left a comment:

LGTM, perhaps a specific readme doc can be improved, but this is not blocking.


github-actions bot commented:
This PR was marked stale due to lack of activity. It will be closed in 14 days.

@jmacd (Contributor Author) commented on Apr 26, 2023:

I have re-opened this PR as #7578. I had accidentally force-pushed the branch, and it was easier to create a new PR, sorry. It contains identical code to what this PR had through 9ab2157, followed by a merge with main, conflict resolution, and then e7a7919.
