Optimize large array creation in const-eval #120069

Merged · 1 commit · Jan 19, 2024
31 changes: 19 additions & 12 deletions compiler/rustc_const_eval/src/interpret/memory.rs
@@ -1209,21 +1209,28 @@ impl<'mir, 'tcx: 'mir, M: Machine<'mir, 'tcx>> InterpCx<'mir, 'tcx, M> {
                     throw_ub_custom!(fluent::const_eval_copy_nonoverlapping_overlapping);
                 }
             }
         }

-        for i in 0..num_copies {
-            ptr::copy(
-                src_bytes,
-                dest_bytes.add((size * i).bytes_usize()), // `Size` multiplication
-                size.bytes_usize(),
-            );
+        let size_in_bytes = size.bytes_usize();
+        // For particularly large arrays (where this is perf-sensitive) it's common that
+        // we're writing a single byte repeatedly. So, optimize that case to a memset.
+        if size_in_bytes == 1 && num_copies >= 1 {
saethlin (Member) commented on Jan 17, 2024:

Can you generalize this to handle arrays of other types by extending this `size_in_bytes == 1` to any value where all bytes are equal? It would be nice if large arrays of usize or u32, or even an enum, could also benefit.
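
A minimal sketch of that generalization, as a hypothetical standalone helper (editorial illustration, not code from this PR; it deliberately ignores the init-mask and padding questions raised below):

/// Hypothetical helper (not in this PR): if every byte of the source element
/// is identical, the whole repeat can be lowered to a single memset.
///
/// SAFETY (assumed contract): `dest` is valid for `src.len() * num_copies`
/// bytes of writes and does not overlap `src`.
unsafe fn repeat_via_memset(src: &[u8], dest: *mut u8, num_copies: usize) -> bool {
    // `size_in_bytes == 1` is the degenerate case: a single byte is
    // trivially "all bytes equal".
    let Some((&first, rest)) = src.split_first() else { return false };
    if !rest.iter().all(|&b| b == first) {
        return false; // bytes differ; the caller falls back to the copy loop
    }
    unsafe { dest.write_bytes(first, src.len() * num_copies) };
    true
}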

Member Author commented:

Most data-less enums would already benefit (since they are 1 byte in size)... and if they have data, it seems super unlikely the bytes are all the same.

For the cases where this could work (e.g., 0u32), I'm wary of #120069 (comment) -- not sure if there's a good, cheap way to check "is there padding in there?"

It might also be worth noting that anything significantly fancier here probably merits replicating the algorithm in slice::repeat -- this was pointed out in earlier optimization work on this code, but not implemented at the time. For larger types I think that generalizes much better than trying to detect whether a memset is possible...
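
For reference, a sketch of the doubling strategy `slice::repeat` uses, adapted to raw pointers (editorial illustration, not code from the PR; it assumes `src` and `dest` do not overlap, so it would only fit the `copy_nonoverlapping` branch):

use std::ptr;

/// Doubling repeat à la `slice::repeat`: O(log num_copies) memcpys instead of
/// one per element. Seed one element, then keep copying the already-filled
/// prefix after itself.
///
/// SAFETY (assumed contract): `src` is valid for `elem_size` bytes of reads,
/// `dest` for `elem_size * num_copies` bytes of writes, and they don't overlap.
unsafe fn repeat_by_doubling(src: *const u8, dest: *mut u8, elem_size: usize, num_copies: usize) {
    if elem_size == 0 || num_copies == 0 {
        return;
    }
    unsafe {
        // Seed the destination with one copy of the element.
        ptr::copy_nonoverlapping(src, dest, elem_size);
        let total = elem_size * num_copies;
        let mut filled = elem_size;
        while filled < total {
            // Source [0, n) and target [filled, filled + n) are disjoint
            // because n <= filled, so copy_nonoverlapping is fine.
            let n = filled.min(total - filled);
            ptr::copy_nonoverlapping(dest, dest.add(filled), n);
            filled += n;
        }
    }
}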

Member commented:

> not sure if there's a good, cheap way to check "is there padding in there?"

🤷 That probably deserves a comment noting that we've checked in a way that only works because the value is a single byte. If I'm reading the logic right, as written it's "are any bytes init", not "are all bytes init". I wouldn't argue against doing that specifically because this is one byte, but it's subtle.

InitMask::is_range_initialized seems like the better function here, but the API has a gnarly error path... that only one caller uses. It might be worth splitting the function in two, so that the one error path that wants to know where the uninit bytes are can call a new function.
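
A self-contained sketch of that proposed split, using a toy one-flag-per-byte mask rather than rustc's block-based `InitMask` (only the API shape is the point):

/// Toy stand-in for rustc's `InitMask`, for illustration only.
struct ToyInitMask {
    initialized: Vec<bool>, // one flag per byte
}

impl ToyInitMask {
    /// Hot path: a plain boolean query, with no error payload to construct.
    fn is_range_initialized(&self, start: usize, end: usize) -> bool {
        self.initialized[start..end].iter().all(|&init| init)
    }

    /// Cold path, for the one diagnostic caller: locate the first
    /// uninitialized subrange so the error can point at it.
    fn find_uninit_range(&self, start: usize, end: usize) -> Option<(usize, usize)> {
        let first = (start..end).find(|&i| !self.initialized[i])?;
        let run_end = (first..end).find(|&i| self.initialized[i]).unwrap_or(end);
        Some((first, run_end))
    }
}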

Member Author commented:

Yeah, a comment seems like a good idea. I've added one.

> it's "are any bytes init" not "are all bytes init"

Yes, but if there's only one byte, those are the same thing, right?

> InitMask::is_range_initialized seems like the better function here, but the API has a gnarly error path... that only one caller uses. It might be worth it to split the function into two, so that the one error path that wants to know where there are uninit bytes can call some new function.

Yeah, that seems like the way to go for a more complex change here. I'm still worried that adding extra checks will lead to regressions -- or at least more code to ship for little value. But 🤷, I could see it being a win. This PR already makes every case a very tight loop around a memcpy/memmove call per value, which is probably quite fast for anything not excessively large -- and those are almost always byte-like, I suspect.

RalfJung marked this conversation as resolved.
+            // SAFETY: `src_bytes` would be read from anyway by copies below (num_copies >= 1).
+            // Since size_in_bytes = 1, then the `init.no_bytes_init()` check above guarantees
+            // that this read at type `u8` is OK -- it must be an initialized byte.
RalfJung (Member) commented on Jan 22, 2024:

This comment seems confused. SAFETY comments are about the safety of the interpreter itself. The bytes in question are always initialized in that sense. The check above is about whether they are initialized in the emulated interpreter state, i.e., whether the interpreted program initialized them.

Member Author commented:

Happy to edit it! I didn't find any obvious guarantee that the interpreter always initializes backing allocations before passing them around (at minimum, profiling suggests that isn't happening for the destination buffer, so wouldn't it be bad to read it at type u8 here?).

I would have expected that the emulated interpreter state having the bytes initialized implies the interpreter's own state also has them initialized.

Member commented:

The u8 buffer backing interpreter allocations is always initialized. It has type u8 rather than MaybeUninit<u8>, so documenting that seems redundant.
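
To illustrate the type-level point being made (editorial example, not from the thread): a `[u8]` buffer is initialized by construction, and only a `[MaybeUninit<u8>]` buffer can legally hold uninitialized bytes:

use std::mem::MaybeUninit;

fn initialized_by_type() {
    // `Box<[u8]>` can only be built from initialized bytes; reading from it
    // never requires `unsafe`.
    let zeroed: Box<[u8]> = vec![0u8; 16].into_boxed_slice();
    let _first: u8 = zeroed[0];

    // `Box<[MaybeUninit<u8>]>` is where genuinely uninitialized memory lives;
    // every read must go through an explicit `assume_init` promise, and doing
    // so before anything was written would be UB.
    let raw: Box<[MaybeUninit<u8>]> = Box::new_uninit_slice(16);
    let _ = raw;
}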

Member commented:

Profiling seems to be misleading then. I think we are calling calloc (via try_new_zeroed_slice), which might do the zero-init via efficient copy-on-write tricks or the like.

Member Author commented:

Perhaps! I'm 99% sure we're not touching every byte until we hit this code. I'll look into changing some things here, then.

Member commented:

New allocations get created here:

    fn uninit_inner<R>(size: Size, align: Align, fail: impl FnOnce() -> R) -> Result<Self, R> {
        // We raise an error if we cannot create the allocation on the host.
        // This results in an error that can happen non-deterministically, since the memory
        // available to the compiler can change between runs. Normally queries are always
        // deterministic. However, we can be non-deterministic here because all uses of const
        // evaluation (including ConstProp!) will make compilation fail (via hard error
        // or ICE) upon encountering a `MemoryExhausted` error.
        let bytes = Bytes::zeroed(size, align).ok_or_else(fail)?;

        Ok(Allocation {
            bytes,
            provenance: ProvenanceMap::new(),
            init_mask: InitMask::new(size, false),
            align,
            mutability: Mutability::Mut,
            extra: (),
        })
    }

    /// Try to create an Allocation of `size` bytes, failing if there is not enough memory
    /// available to the compiler to do so.
    pub fn try_uninit<'tcx>(size: Size, align: Align) -> InterpResult<'tcx, Self> {
        Self::uninit_inner(size, align, || {
            ty::tls::with(|tcx| tcx.dcx().delayed_bug("exhausted memory during interpretation"));
            InterpError::ResourceExhaustion(ResourceExhaustionInfo::MemoryExhausted).into()
        })
    }

    /// Try to create an Allocation of `size` bytes, panics if there is not enough memory
    /// available to the compiler to do so.
    ///
    /// Example use case: To obtain an Allocation filled with specific data,
    /// first call this function and then call write_scalar to fill in the right data.
    pub fn uninit(size: Size, align: Align) -> Self {
        match Self::uninit_inner(size, align, || {
            panic!("Allocation::uninit called with panic_on_fail had allocation failure");
        }) {
            Ok(x) => x,
            Err(x) => x,
        }
    }

That ends up calling this:

    fn zeroed(size: Size, _align: Align) -> Option<Self> {
        let bytes = Box::<[u8]>::try_new_zeroed_slice(size.bytes_usize()).ok()?;
        // SAFETY: the box was zero-allocated, which is a valid initial value for Box<[u8]>
        let bytes = unsafe { bytes.assume_init() };
        Some(bytes)
    }

I am 100% sure that all these u8 are initialized.

+            let value = *src_bytes;
Member Author commented:

I think this is OK. We've checked above that the bytes are initialized (`if !init.no_bytes_init() {`), so reading the byte should be safe.

+            dest_bytes.write_bytes(value, (size * num_copies).bytes_usize());
+        } else if src_alloc_id == dest_alloc_id {
+            let mut dest_ptr = dest_bytes;
+            for _ in 0..num_copies {
+                ptr::copy(src_bytes, dest_ptr, size_in_bytes);
+                dest_ptr = dest_ptr.add(size_in_bytes);
+            }
         } else {
-            for i in 0..num_copies {
-                ptr::copy_nonoverlapping(
-                    src_bytes,
-                    dest_bytes.add((size * i).bytes_usize()), // `Size` multiplication
-                    size.bytes_usize(),
-                );
+            let mut dest_ptr = dest_bytes;
+            for _ in 0..num_copies {
+                ptr::copy_nonoverlapping(src_bytes, dest_ptr, size_in_bytes);
+                dest_ptr = dest_ptr.add(size_in_bytes);
             }
         }
     }
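
To make the effect concrete, an editorial example of the workloads this PR targets (illustrative only; no specific numbers are claimed): const evaluation of large array-repeat expressions, where a 1-byte element now becomes a single memset while multi-byte elements take the tightened per-element copy loop.

// One initialized byte, repeated: eligible for the new memset fast path.
const FILLED: [u8; 1 << 20] = [0x2A; 1 << 20];

// Multi-byte element: still one memcpy per element, but now a tight loop
// with the element size and destination pointer hoisted out.
const WORDS: [u32; 1 << 20] = [0xDEAD_BEEF; 1 << 20];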