Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

slices: add Chunk function to divide []T into [][]T chunks #53987

Closed
mdlayher opened this issue Jul 21, 2022 · 40 comments
Closed

slices: add Chunk function to divide []T into [][]T chunks #53987

mdlayher opened this issue Jul 21, 2022 · 40 comments

Comments

@mdlayher
Copy link
Member

mdlayher commented Jul 21, 2022

A problem I've run into a fair amount is dealing with APIs which only accept a maximum number of inputs at once, though I may have more than that number of inputs that I would like to ultimately process.

For example, Amazon S3 can only delete up to 1000 objects in a single DeleteObjects API request (https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html). I've run into similar issues in the past when injecting a large number of routes into the Linux kernel via route netlink, or when modifying a large number of WireGuard peers at once.

To deal with this situation in a generic way, I've come up with the following:

package slices

// Chunk batches []T into [][]T in groups of size. The final chunk of []T will be
// smaller than size if the input slice cannot be chunked evenly. It does not
// make any copies of slice elements.
//
// As an example, take a slice of 5 integers and create chunks of 2 integers
// each (the final value creates a short chunk):
//   slices.Chunk([]int{1, 2, 3, 4, 5}, 2) = [][]int{{1, 2}, {3, 4}, {5}}
func Chunk[T any](slice []T, size int) [][]T {
	var chunks [][]T
	for i := 0; i < len(slice); {
		// Clamp the last chunk to the slice bound as necessary.
		end := size
		if l := len(slice[i:]); l < size {
			end = l
		}

		// Set the capacity of each chunk so that appending to a chunk does not
		// modify the original slice.
		chunks = append(chunks, slice[i:i+end:i+end])
		i += end
	}

	return chunks
}

Which is then used as follows:

	objs, err := c.allObjects(ctx)
	if err != nil {
		return nil, err
	}

	// Begin deleting the objects concurrently in batches of 1000 (the
	// maximum limit size supported by S3).
	const limit = 1000

	eg, ctx := errgroup.WithContext(ctx)
	eg.SetLimit(10)

	for _, chunk := range slices.Chunk(objs, limit) {
		chunk := chunk
		eg.Go(func() error { /* deletion logic */ })
	}

If this proposal is accepted, I'd be happy to send a CL for the above. Thanks for your time.

Edit 1: updated second parameter name to size int as inspired by Rust's chunks method: https://doc.rust-lang.org/std/primitive.slice.html#method.chunks.

Edit 2: set capacity for each chunk per next comment.

@gopherbot gopherbot added this to the Proposal milestone Jul 21, 2022
@JeremyLoy
Copy link

JeremyLoy commented Jul 22, 2022

A small correction to your given implementation:

https://github.com/golang/go/wiki/SliceTricks#batching-with-minimal-allocation

it’s important to use the 3 arg form when chunking, otherwise appending to a chunk could overwrite unintentionally

https://go.dev/play/p/Zp0Dh5cB1rB

@sten4eg
Copy link

sten4eg commented Jul 22, 2022

@earthboundkid
Copy link
Contributor

I just needed this functionality last week and almost opened an issue. I think it would be useful.

@icholy
Copy link

icholy commented Jul 25, 2022

I think Batch is a better name.

@DeedleFake
Copy link

DeedleFake commented Jul 25, 2022

I wonder if this might be better to wait on a general iterator API for. If one of the main intentions is to use it with range, this is a fairly inefficient way to do it, but its existence will encourage people to use it anyways. Map() and Filter() were dropped from slices for the same reason. This could be fairly easily done manually with a for loop in an efficient way, however, if there was some easier way to clamp that last index:

for start := 0; start < len(slice); start += chunkSize {
  end := min(start + chunkSize, len(slice))
  chunk := slice[start:end:end]

  // ...
}

Is there a common use for this that doesn't involve immediately proceeding to loop over it?

@icholy
Copy link

icholy commented Jul 31, 2022

Ive needed this function for 2 out of the last 3 projects I've worked on. An iterator based approach would not have had any benefit in either of these use cases.

@DeedleFake
Copy link

But would an iterator-based approach have been worse, @icholy? The inner slices don't need to be reallocated in either case, and the outer slice does need to be newly allocated in either case, so I don't think that an iterator-based variant would essentially ever be worse. If there's some iters.ToSlice() function, then it would only take one extra function call to get a slice if that would be easier to work with for a specific use-case.

@rsc rsc moved this to Incoming in Proposals Aug 10, 2022
@rsc rsc added this to Proposals Aug 10, 2022
@kf6nux
Copy link

kf6nux commented Aug 23, 2022

You probably want to be explicit about your panic condition and panic in one additional case where size == 0

if size < 1 {
	panic("chunk size cannot be less than 1")
}

Here's an alternate implementation that some may like more

func Chunk[T any](slice []T, size int) (chunks [][]T) {
	if size < 1 {
		panic("chunk size cannot be less than 1")
	}
	for i := 0; ; i++ {
		next := i * size
		if len(slice[next:]) > size {
			end := next + size
			chunks = append(chunks, slice[next:end:end])
		} else {
			chunks = append(chunks, slice[i*size:])
			return
		}
	}
}

@dcormier
Copy link
Contributor

dcormier commented Jan 6, 2023

Ah, nice to see a proposal for this. I've just been using my personal implementation:

func Chunk[E any](values []E, size int) [][]E {
	if size <= 0 {
		panic("size must be > 0")
	}

	var chunks [][]E
	for remaining := len(values); remaining > 0; remaining = len(values) {
		if remaining < size {
			size = remaining
		}

		chunks = append(chunks, values[:size:size])
		values = values[size:]
	}

	return chunks
}

@ianlancetaylor ianlancetaylor changed the title proposal: x/exp/slices: add Chunk function to divide []T into [][]T chunks proposal: slices: add Chunk function to divide []T into [][]T chunks Apr 7, 2023
@lufia
Copy link
Contributor

lufia commented Jul 19, 2023

A group of Go developers at my company much need this function.

In my opinion, I run into this frequently when calling web-based API, but the downside is that the implementation is error-prone, such as an out of range error or an off-by-one error.

My usecase:

For example, the Azure Monitor API has a metricNames parameter, but it is limited to a maximum number of 20 names(undocumented yet) in a request.

Also, in case of the legacy (binary) APNs-like protocol, to improve performance, chunked multiple request packets may be written to the network via a socket at once.

@golang golang deleted a comment from jimwei Jul 19, 2023
@rsc
Copy link
Contributor

rsc commented Aug 9, 2023

If we were going to add this function, I think we'd want to use the new iterator functions and write it as

func Chunk[T any](slice []T, size int) iter.Seq[[]T]

(See #61897 for the definition of iter.Seq.)

@JeremyLoy
Copy link

@rsc thanks for chiming in on this old issue that existed before the new iterator proposal!

your comment got me thinking - what about an additional chunk function that accepted a sequence of T?

func ChunkSeq[T any](seq iter.Seq[T], size int) iter.Seq[[]T]

Do you think this would be useful too? I think chunking on a stream has immediate uses - a batch handler for a message queue for instance

@rsc
Copy link
Contributor

rsc commented Aug 9, 2023

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this from Incoming to Active in Proposals Aug 9, 2023
@rsc
Copy link
Contributor

rsc commented Aug 30, 2023

Finishing this proposal discussion is blocked on #61405.

@avamsi
Copy link

avamsi commented Aug 31, 2023

I often run into a very similar but slightly different use case -- the 2nd argument is "n" (for n slices of roughly the same size) instead of "size" (with possibly one odd slice at the end).

Something like this --

// Chunks breaks the given slice into n slices of (almost) the same size.
func Chunks[E any](s []E, n int, yield func(int, []E)) {
	if n <= 0 {
		return
	}
	var (
		size, remainder = len(s) / n, len(s) % n
		start, end      int
	)
	for i := 0; i < n; i++ {
		start, end = end, end+size
		if remainder > 0 {
			remainder--
			end++
		}
		yield(i, s[start:end])
	}
}

@icholy
Copy link

icholy commented Aug 31, 2023

@avamsi that sounds more like a Partition function.

@earthboundkid
Copy link
Contributor

Chunk and Partition are both useful but in different scenarios. Chunk can work on an iterator, whereas Partition needs a slice. I think they both come up enough to be added.

@jimmyfrasche
Copy link
Member

I think this could be generalized further:

func Split[T any](seq iter.Seq[T], when func(len int, a, b T) bool) iter.Seq[iter.Seq[T]] {
  //...
}

The when predicate returns true if it should start a new subsequence after a and before b. len is the length of the current subsequence and len > 0 as a will have already been yielded (yelt?). If the original sequence yields less than two values, when is never called.

Then Chunk is just:

func Chunk[T any](maxLen int, seq iter.Seq[T]) iter.Seq[iter.Seq[T]] {
  return Split(seq, func(len int, _, _ T) bool {
    return len == maxLen
  })
}

It's also easy to write a function that chunks the input based on a cost function instead of length:

func SplitByCost[T any](maxCost int, seq iter.Seq[T], costOf func(T) uint) iter.Seq[iter.Seq[T]] {
  var cost uint
  return Split(seq, func(len int, a, _ T) bool {
    if len == 1 {
      cost = 0
    }
    cost += costOf(a)
    return cost >= maxCost
  })
}

Or to split if some distance threshold is crossed:

func CloseTogether(threshold int, seq iter.Seq[int]) iter.Seq[iter.Seq[int]] {
  return Split(seq, func(_, a, b int) bool {
    return b - a > threshold
  })
}

Partition would be a bit more involved but still simple to fit into this scheme.

It might be worth it to have something similar for the special case of dividing a slice into subslices of the same backing store as that would be much cheaper.

@rsc
Copy link
Contributor

rsc commented Jan 24, 2024

Iterators are unblocked for Go 1.23 so let's figure out what to do with this. It seems like #53987 (comment) is the current winner: still a simple API, but not having to allocate the entire slice-of-slices. Do I have that right?

// Chunk returns an iterator over consecutive sub-slices of up to n elements of x.
// All but the last sub-slice will have size n.
// If x is empty, the sequence is empty: there is no empty slice in the sequence.
func Chunk[T any](x []T, n int) iter.Seq[[]T]

More complex splitting logic is possible and sometimes useful, of course, but the max-n chunking is very common and seems like it can stand on its own.

@earthboundkid
Copy link
Contributor

// Chunk returns an iterator over consecutive sub-slices of up to n elements of x.
// All but the first sub-slice will have size n.
// If x is empty, the sequence is empty: there is no empty slice in the sequence.
func Chunk[T any](x []T, n int) iter.Seq[[]T]

Is that a typo for "All but the last sub-slice will have size n"?

@veggiemonk
Copy link

veggiemonk commented Feb 5, 2024

Sorry for coming in late.
Is there way to make it clear that the n int parameter is the max size of a chunk (besides reading the doc) and not the number of chunks ?

Some time back, I created https://github.com/veggiemonk/batch
In this repository I explained why, for most use cases, one might want evenly distributed slices.
In any case, this is not mutually exclusive.

Do you think I should add a proposal for Batch or can I piggy back on this one ?
Naming might need some work, I need help on that, which is why I ask here. Batch <> Slices <> Chunks ...

// Batch returns an iterator over consecutive sub-slices evenly n number of batches.
// The size of each batch never deviates more than 1 from the average batch size.
func Batch[T any](x []T, n int) iter.Seq[[]T] 

@avamsi
Copy link

avamsi commented Feb 5, 2024

@veggiemonk, I think what you're suggesting is the same as #53987 (comment)? @icholy (and @earthboundkid) suggested it be called Partition (does some other language / library use this terminology?), I renamed it to Shard in my kitchen sink library (https://github.com/avamsi/ergo/blob/bc8d6f210e71d95a8e63c513c49ed5f18d8e9710/slices/slices.go#L6)

@veggiemonk
Copy link

@avamsi yes it is.

I'm not really familiar how other languages do it but here is https://docs-lodash.com/v4/partition/
indeed partition is one name considered but it seems more like a predicate/function is needed to define the split. This is very generic and goes beyond the simple use case we have.

So far, Batch is only term that seems to fit but my first language is not English so I'll leave it to someone native to decide.

@earthboundkid
Copy link
Contributor

Okay, I opened a new issue: #65523.

@rsc
Copy link
Contributor

rsc commented Feb 8, 2024

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

// Chunk returns an iterator over consecutive sub-slices of up to n elements of x.
// All but the last sub-slice will have size n.
// All sub-slices are clipped to have no capacity beyond the length.
// If x is empty, the sequence is empty: there is no empty slice in the sequence.
func Chunk[T any](x []T, n int) iter.Seq[[]T]

@rsc rsc moved this from Active to Likely Accept in Proposals Feb 8, 2024
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/562935 mentions this issue: slices: add Chunk

@rsc rsc moved this from Likely Accept to Accepted in Proposals Feb 14, 2024
@rsc
Copy link
Contributor

rsc commented Feb 14, 2024

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

// Chunk returns an iterator over consecutive sub-slices of up to n elements of x.
// All but the last sub-slice will have size n.
// All sub-slices are clipped to have no capacity beyond the length.
// If x is empty, the sequence is empty: there is no empty slice in the sequence.
func Chunk[T any](x []T, n int) iter.Seq[[]T]

@rsc rsc changed the title proposal: slices: add Chunk function to divide []T into [][]T chunks slices: add Chunk function to divide []T into [][]T chunks Feb 14, 2024
@rsc rsc modified the milestones: Proposal, Backlog Feb 14, 2024
@mdlayher
Copy link
Member Author

mdlayher commented Feb 15, 2024

Edit: updated return []E to S

On the CL (https://go-review.googlesource.com/c/go/+/562935), @earthboundkid mentioned that the signature should be generalized further to permit the slice's type and underlying elements to differ, as is the case with other functions:

// Chunk returns an iterator over consecutive sub-slices of up to n elements of s.
// All but the last sub-slice will have size n.
// All sub-slices are clipped to have no capacity beyond the length.
// If s is empty, the sequence is empty: there is no empty slice in the sequence.
func Chunk[S ~[]E, E any](s S, n int) iter.Seq[S] { }

Presumably this slightly updated API is also acceptable given that it is the common pattern already used throughout slices?

@icholy
Copy link

icholy commented Feb 15, 2024

Wouldn't it make sense to use S as the return type too?

func Chunk[S ~[]E, E any](s S, n int) iter.Seq[S] { }

@mdlayher
Copy link
Member Author

Yes it does, I will update based on other functions in slices: https://pkg.go.dev/slices#Grow.

@mdlayher
Copy link
Member Author

mdlayher commented May 7, 2024

Now that iter.Seq[S] is available to the stdlib at tip, I've updated my CL and have marked it ready for review: https://go-review.googlesource.com/c/go/+/562935

@josharian
Copy link
Contributor

I just need slices.Chunk and found this...and discovered it didn't quite fit the shape of my problem, because I also wanted the start index of every chunk. Think of printing output like:

0: a b c d
4: e f g h
8: i j

I can of course maintain my own offset counter, but I wonder whether it mightn't be better to return in iter.Seq2[int, S], where the first yielded value is the start offset. This would also make it a more natural extensive of plain old iteration.

@josharian
Copy link
Contributor

(I know this is very last minute, but figured I should mention anyway...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Accepted
Development

No branches or pull requests