-
-
Notifications
You must be signed in to change notification settings - Fork 647
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
"Permission denied" while creating sandbox with immutable inputs #13899
Comments
This is most likely related to #13848: it's possible that we have overlapping immutable input digests somewhere. |
Happened again in the lint shard for a different PR. Same error. |
And again:
|
I continue seeing this frequently in the |
Previously we used the double-checked-cell-async crate (See: https://github.com/chrislearn/double-checked-cell-async/blob/46cd3b04eddddbe279282143fe8a936d5854588c/src/lib.rs#L228-L260), which performed the second check of the double check lock with relaxed ordering. The relevant code is snipped below and annotated: ```rust pub struct DoubleCheckedCell<T> { value: UnsafeCell<Option<T>>, initialized: AtomicBool, lock: Mutex<()>, } impl<T> DoubleCheckedCell<T> { pub fn new() -> DoubleCheckedCell<T> { DoubleCheckedCell { value: UnsafeCell::new(None), initialized: AtomicBool::new(false), lock: Mutex::new(()), } } pub async fn get_or_try_init<Fut, E>(&self, init: Fut) -> Result<&T, E> where Fut: Future<Output = Result<T, E>> { // 1.) 1st load & check. if !self.initialized.load(Ordering::Acquire) { // 2.) Lock. let _lock = self.lock.lock().await; // 3.) 2nd load & check. if !self.initialized.load(Ordering::Relaxed) { { // 4.) Critical section. let result = init.await?; let value = unsafe { &mut *self.value.get() }; value.replace(result); } // 5.) Store with lock held. self.initialized.store(true, Ordering::Release); } } let value = unsafe { &*self.value.get() }; Ok(unsafe { value.as_ref().unchecked_unwrap() }) } } ``` Per the C++11 memory model used by Rust (See: https://en.cppreference.com/w/cpp/language/memory_model), this would seem to indicate the second load could be reordered to occur anywhere after the 1st load with acquire ordering and anywhere before the store with released ordering. If that second load was reordered to occur before the lock was acquired, two threads could enter the critical section in serial and the second thread would try to materialize to paths that already were created and marked read-only. Switch to the async-oncecell crate which performs both loads of the double-checked lock with acquire ordering, ensuring they are not re-ordered with respect to the interleaved non-atomics code. Also fixup the materialization process to be atomic. We now cleanup materialization chroots when materialization fails and only move their contents to the destination path if the full materialization has succeeded. Fixes pantsbuild#13899 [ci skip-build-wheels]
I'm going to unassign and release this back to the pool since I won't be hacking again until Jan 7th. |
Previously we used the double-checked-cell-async crate (See: https://github.com/chrislearn/double-checked-cell-async/blob/46cd3b04eddddbe279282143fe8a936d5854588c/src/lib.rs#L228-L260), which performed the second check of the double check lock with relaxed ordering. The relevant code is snipped below and annotated: ```rust pub struct DoubleCheckedCell<T> { value: UnsafeCell<Option<T>>, initialized: AtomicBool, lock: Mutex<()>, } impl<T> DoubleCheckedCell<T> { pub fn new() -> DoubleCheckedCell<T> { DoubleCheckedCell { value: UnsafeCell::new(None), initialized: AtomicBool::new(false), lock: Mutex::new(()), } } pub async fn get_or_try_init<Fut, E>(&self, init: Fut) -> Result<&T, E> where Fut: Future<Output = Result<T, E>> { // 1.) 1st load & check. if !self.initialized.load(Ordering::Acquire) { // 2.) Lock. let _lock = self.lock.lock().await; // 3.) 2nd load & check. if !self.initialized.load(Ordering::Relaxed) { { // 4.) Critical section. let result = init.await?; let value = unsafe { &mut *self.value.get() }; value.replace(result); } // 5.) Store with lock held. self.initialized.store(true, Ordering::Release); } } let value = unsafe { &*self.value.get() }; Ok(unsafe { value.as_ref().unchecked_unwrap() }) } } ``` Per the C++11 memory model used by Rust (See: https://en.cppreference.com/w/cpp/language/memory_model), this would seem to indicate the second load could be reordered to occur anywhere after the 1st load with acquire ordering and anywhere before the store with released ordering. If that second load was reordered to occur before the lock was acquired, two threads could enter the critical section in serial and the second thread would try to materialize to paths that already were created and marked read-only. Switch to the async-oncecell crate which performs both loads of the double-checked lock with acquire ordering, ensuring they are not re-ordered with respect to the interleaved non-atomics code. Also fixup the materialization process to be atomic. We now cleanup materialization chroots when materialization fails and only move their contents to the destination path if the full materialization has succeeded. Fixes pantsbuild#13899 [ci skip-build-wheels]
Previously we used the double-checked-cell-async crate (See: https://github.com/chrislearn/double-checked-cell-async/blob/46cd3b04eddddbe279282143fe8a936d5854588c/src/lib.rs#L228-L260), which performed the second check of the double check lock with relaxed ordering. The relevant code is snipped below and annotated: ```rust pub struct DoubleCheckedCell<T> { value: UnsafeCell<Option<T>>, initialized: AtomicBool, lock: Mutex<()>, } impl<T> DoubleCheckedCell<T> { pub fn new() -> DoubleCheckedCell<T> { DoubleCheckedCell { value: UnsafeCell::new(None), initialized: AtomicBool::new(false), lock: Mutex::new(()), } } pub async fn get_or_try_init<Fut, E>(&self, init: Fut) -> Result<&T, E> where Fut: Future<Output = Result<T, E>> { // 1.) 1st load & check. if !self.initialized.load(Ordering::Acquire) { // 2.) Lock. let _lock = self.lock.lock().await; // 3.) 2nd load & check. if !self.initialized.load(Ordering::Relaxed) { { // 4.) Critical section. let result = init.await?; let value = unsafe { &mut *self.value.get() }; value.replace(result); } // 5.) Store with lock held. self.initialized.store(true, Ordering::Release); } } let value = unsafe { &*self.value.get() }; Ok(unsafe { value.as_ref().unchecked_unwrap() }) } } ``` Per the C++11 memory model used by Rust (See: https://en.cppreference.com/w/cpp/language/memory_model), this would seem to indicate the second load could be reordered to occur anywhere after the 1st load with acquire ordering and anywhere before the store with released ordering. If that second load was reordered to occur before the lock was acquired, two threads could enter the critical section in serial and the second thread would try to materialize to paths that already were created and marked read-only. Switch to the async-oncecell crate which performs both loads of the double-checked lock with acquire ordering, ensuring they are not re-ordered with respect to the interleaved non-atomics code. Also fixup the materialization process to be atomic. We now cleanup materialization chroots when materialization fails and only move their contents to the destination path if the full materialization has succeeded. Fixes pantsbuild#13899 [ci skip-build-wheels]
@asherf there actually was interesting, but unfortunately unenlightening, data from your failure / @stuhood's debug addition. The perms on the pre-existing dir were 0o40755 (16877); so it was writable. That means it either could be a remnant that was partially materialized or a DCL bug where racing materializations were observable before completion:
|
Previously we used the double-checked-cell-async crate (See: https://github.com/chrislearn/double-checked-cell-async/blob/46cd3b04eddddbe279282143fe8a936d5854588c/src/lib.rs#L228-L260), which performed the second check of the double check lock with relaxed ordering. The relevant code is snipped below and annotated: ```rust pub struct DoubleCheckedCell<T> { value: UnsafeCell<Option<T>>, initialized: AtomicBool, lock: Mutex<()>, } impl<T> DoubleCheckedCell<T> { pub fn new() -> DoubleCheckedCell<T> { DoubleCheckedCell { value: UnsafeCell::new(None), initialized: AtomicBool::new(false), lock: Mutex::new(()), } } pub async fn get_or_try_init<Fut, E>(&self, init: Fut) -> Result<&T, E> where Fut: Future<Output = Result<T, E>> { // 1.) 1st load & check. if !self.initialized.load(Ordering::Acquire) { // 2.) Lock. let _lock = self.lock.lock().await; // 3.) 2nd load & check. if !self.initialized.load(Ordering::Relaxed) { { // 4.) Critical section. let result = init.await?; let value = unsafe { &mut *self.value.get() }; value.replace(result); } // 5.) Store with lock held. self.initialized.store(true, Ordering::Release); } } let value = unsafe { &*self.value.get() }; Ok(unsafe { value.as_ref().unchecked_unwrap() }) } } ``` Per the C++11 memory model used by Rust (See: https://en.cppreference.com/w/cpp/language/memory_model), this would seem to indicate the second load could be reordered to occur anywhere after the 1st load with acquire ordering and anywhere before the store with released ordering. If that second load was reordered to occur before the lock was acquired, two threads could enter the critical section in serial and the second thread would try to materialize to paths that already were created and marked read-only. Switch to the async-oncecell crate which performs both loads of the double-checked lock with acquire ordering, ensuring they are not re-ordered with respect to the interleaved non-atomics code. Also fixup the materialization process to be atomic. We now cleanup materialization chroots when materialization fails and only move their contents to the destination path if the full materialization has succeeded. Fixes #13899
This resurfaced in
|
also seeing this w/ the golang backend enabled:
|
coursier_wrapper_script.sh
I wrote a unit test that reliably triggers the error. #15140 |
I'm inclined to fix the issue by appending a sequence number to the digest's final path. This would prevent path conflicts if there were a partially materialized digest already in the output directory from an earlier materialization of that digest in the same run potentially failing. |
FWIW -still. happens on pants 2.11 rc5:
|
Seeing something similar relatively often on CI with remote cache enabled (never saw it in CI before we enabled it). Pants 2.11.0.
|
Oh! Thanks @jyggen for that hint about remote caching! Notably Pants and Toolchain both use remote caching. Pantsbuild/pants sees this error quite frequently |
Thanks @jyggen ... that gives me a fairly good idea of what this might be then. Cancellation, causing the materialization of directories to be interrupted. i.e.: the various async-value crates we've used must not be cancellation safe. Will take a look at this one. |
… digest. (pantsbuild#15652) Fixes pantsbuild#13899. [ci skip-build-wheels]
… digest. (pantsbuild#15652) Fixes pantsbuild#13899. [ci skip-build-wheels]
Seen in CI in the
lint
job:The text was updated successfully, but these errors were encountered: