slices: add Chunk function to divide []T into [][]T chunks #53987
Comments
A small correction to your given implementation (see https://github.com/golang/go/wiki/SliceTricks#batching-with-minimal-allocation): it's important to use the three-index slice form when chunking; otherwise appending to a chunk could unintentionally overwrite elements of the next chunk. |
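To illustrate the point about the three-index form, here is a minimal sketch. The helper name firstChunk is made up for this example and is not part of any proposal; it only contrasts s[:size] with s[:size:size].

```go
package main

import "fmt"

// firstChunk is a hypothetical helper for illustration only.
// With clip=false it returns s[:size], which keeps the capacity of the
// rest of s. With clip=true it returns s[:size:size], whose capacity is
// limited to its length.
func firstChunk(s []int, size int, clip bool) []int {
	if clip {
		return s[:size:size]
	}
	return s[:size]
}

func main() {
	s := []int{1, 2, 3, 4}

	// Two-index form: append reuses the backing array and overwrites s[2].
	loose := firstChunk(s, 2, false)
	loose = append(loose, 99)
	fmt.Println(s) // [1 2 99 4]

	s[2] = 3 // restore the original value

	// Three-index form: append has no spare capacity, so it copies
	// the chunk to a new backing array and s is left untouched.
	clipped := firstChunk(s, 2, true)
	clipped = append(clipped, 99)
	fmt.Println(s) // [1 2 3 4]
}
```

The clipped chunk costs nothing extra to create; the copy only happens if someone actually appends to it.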
I just needed this functionality last week and almost opened an issue. I think it would be useful. |
I think |
I wonder if this might be better to wait on a general iterator API for. If one of the main intentions is to use it with for start := 0; start < len(slice); start += chunkSize {
end := min(start + chunkSize, len(slice))
chunk := slice[start:end:end]
// ...
} Is there a common use for this that doesn't involve immediately proceeding to loop over it? |
I've needed this function for 2 out of the last 3 projects I've worked on. An iterator based approach would not have had any benefit in either of these use cases. |
But would an iterator-based approach have been worse, @icholy? The inner slices don't need to be reallocated in either case, and the outer slice does need to be newly allocated in either case, so I don't think that an iterator-based variant would essentially ever be worse. If there's some |
You probably want to be explicit about your panic condition and also panic in one additional case, where size == 0: if size < 1 {
panic("chunk size cannot be less than 1")
} Here's an alternate implementation that some may like more func Chunk[T any](slice []T, size int) (chunks [][]T) {
if size < 1 {
panic("chunk size cannot be less than 1")
}
for i := 0; ; i++ {
next := i * size
if len(slice[next:]) > size {
end := next + size
chunks = append(chunks, slice[next:end:end])
} else {
chunks = append(chunks, slice[i*size:])
return
}
}
} |
Ah, nice to see a proposal for this. I've just been using my personal implementation: func Chunk[E any](values []E, size int) [][]E {
if size <= 0 {
panic("size must be > 0")
}
var chunks [][]E
for remaining := len(values); remaining > 0; remaining = len(values) {
if remaining < size {
size = remaining
}
chunks = append(chunks, values[:size:size])
values = values[size:]
}
return chunks
} |
A group of Go developers at my company badly needs this function. I run into this frequently when calling web-based APIs, and hand-written implementations are error-prone: it is easy to introduce an out-of-range or off-by-one error. My use case: the Azure Monitor API has a metricNames parameter, but it is limited to a maximum of 20 names per request (not yet documented). Also, in the case of the legacy (binary) APNs-like protocol, multiple request packets may be chunked and written to the network socket at once to improve performance. |
If we were going to add this function, I think we'd want to use the new iterator functions and write it as
(See #61897 for the definition of iter.Seq.) |
@rsc thanks for chiming in on this old issue that existed before the new iterator proposal! your comment got me thinking - what about an additional
Do you think this would be useful too? I think chunking on a stream has immediate uses - a batch handler for a message queue for instance |
This proposal has been added to the active column of the proposals project |
Finishing this proposal discussion is blocked on #61405. |
I often run into a very similar but slightly different use case -- the 2nd argument is "n" (for n slices of roughly the same size) instead of "size" (with possibly one odd slice at the end). Something like this -- // Chunks breaks the given slice into n slices of (almost) the same size.
func Chunks[E any](s []E, n int, yield func(int, []E)) {
if n <= 0 {
return
}
var (
size, remainder = len(s) / n, len(s) % n
start, end int
)
for i := 0; i < n; i++ {
start, end = end, end+size
if remainder > 0 {
remainder--
end++
}
yield(i, s[start:end])
}
} |
@avamsi that sounds more like a |
Chunk and Partition are both useful but in different scenarios. Chunk can work on an iterator, whereas Partition needs a slice. I think they both come up enough to be added. |
I think this could be generalized further: func Split[T any](seq iter.Seq[T], when func(len int, a, b T) bool) iter.Seq[iter.Seq[T]] {
//...
} The Then Chunk is just: func Chunk[T any](maxLen int, seq iter.Seq[T]) iter.Seq[iter.Seq[T]] {
return Split(seq, func(len int, _, _ T) bool {
return len == maxLen
})
} It's also easy to write a function that chunks the input based on a cost function instead of length: func SplitByCost[T any](maxCost uint, seq iter.Seq[T], costOf func(T) uint) iter.Seq[iter.Seq[T]] {
var cost uint
return Split(seq, func(len int, a, _ T) bool {
if len == 1 {
cost = 0
}
cost += costOf(a)
return cost >= maxCost
})
} Or to split if some distance threshold is crossed: func CloseTogether(threshold int, seq iter.Seq[int]) iter.Seq[iter.Seq[int]] {
return Split(seq, func(_, a, b int) bool {
return b - a > threshold
})
} Partition would be a bit more involved but still simple to fit into this scheme. It might be worth it to have something similar for the special case of dividing a slice into subslices of the same backing store as that would be much cheaper. |
Iterators are unblocked for Go 1.23 so let's figure out what to do with this. It seems like #53987 (comment) is the current winner: still a simple API, but not having to allocate the entire slice-of-slices. Do I have that right?
More complex splitting logic is possible and sometimes useful, of course, but the max-n chunking is very common and seems like it can stand on its own. |
Is that a typo for "All but the last sub-slice will have size n"? |
Sorry for coming in late. Some time back, I created https://github.com/veggiemonk/batch. Do you think I should open a separate proposal for Batch, or can I piggyback on this one?
|
@veggiemonk, I think what you're suggesting is the same as #53987 (comment)? @icholy (and @earthboundkid) suggested it be called |
@avamsi yes it is. I'm not really familiar with how other languages do it, but here is https://docs-lodash.com/v4/partition/ So far, Batch is the only term that seems to fit, but my first language is not English, so I'll leave it to a native speaker to decide. |
Okay, I opened a new issue: #65523. |
Based on the discussion above, this proposal seems like a likely accept.
|
Change https://go.dev/cl/562935 mentions this issue: |
No change in consensus, so accepted. 🎉
|
Edit: updated return On the CL (https://go-review.googlesource.com/c/go/+/562935), @earthboundkid mentioned that the signature should be generalized further to permit the slice's type and underlying elements to differ, as is the case with other functions: // Chunk returns an iterator over consecutive sub-slices of up to n elements of s.
// All but the last sub-slice will have size n.
// All sub-slices are clipped to have no capacity beyond the length.
// If s is empty, the sequence is empty: there is no empty slice in the sequence.
func Chunk[S ~[]E, E any](s S, n int) iter.Seq[S] { } Presumably this slightly updated API is also acceptable given that it is the common pattern already used throughout |
Wouldn't it make sense to use func Chunk[S ~[]E, E any](s S, n int) iter.Seq[S] { } |
Yes it does, I will update based on other functions in slices: https://pkg.go.dev/slices#Grow. |
Now that |
I just needed slices.Chunk and found this... and discovered it didn't quite fit the shape of my problem, because I also wanted the start index of every chunk. Think of printing output like:
I can of course maintain my own offset counter, but I wonder whether it mightn't be better to return in |
(I know this is very last minute, but figured I should mention anyway...) |
A problem I've run into a fair amount is dealing with APIs which only accept a maximum number of inputs at once, though I may have more than that number of inputs that I would like to ultimately process.
For example, Amazon S3 can only delete up to 1000 objects in a single DeleteObjects API request (https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html). I've run into similar issues in the past when injecting a large number of routes into the Linux kernel via route netlink, or when modifying a large number of WireGuard peers at once.
To deal with this situation in a generic way, I've come up with the following:
Which is then used as follows:
If this proposal is accepted, I'd be happy to send a CL for the above. Thanks for your time.
Edit 1: updated second parameter name to size int as inspired by Rust's chunks method: https://doc.rust-lang.org/std/primitive.slice.html#method.chunks.
Edit 2: set capacity for each chunk per next comment.