Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix immutable inputs DCL bug. #14016

Merged
merged 3 commits into from
Jan 8, 2022
Merged

Fix immutable inputs DCL bug. #14016

merged 3 commits into from
Jan 8, 2022

Conversation

jsirois
Copy link
Contributor

@jsirois jsirois commented Dec 28, 2021

Previously we used the double-checked-cell-async crate (See:
https://github.com/chrislearn/double-checked-cell-async/blob/46cd3b04eddddbe279282143fe8a936d5854588c/src/lib.rs#L228-L260),
which performed the second check of the double check lock with relaxed
ordering. The relevant code is snipped below and annotated:

pub struct DoubleCheckedCell<T> {
    value: UnsafeCell<Option<T>>,
    initialized: AtomicBool,
    lock: Mutex<()>,
}

impl<T> DoubleCheckedCell<T> {
    pub fn new() -> DoubleCheckedCell<T> {
        DoubleCheckedCell {
            value: UnsafeCell::new(None),
            initialized: AtomicBool::new(false),
            lock: Mutex::new(()),
        }
    }

    pub async fn get_or_try_init<Fut, E>(&self, init: Fut) -> Result<&T, E>
    where
        Fut: Future<Output = Result<T, E>>
    {
        // 1.) 1st load & check.
        if !self.initialized.load(Ordering::Acquire) {
            // 2.) Lock.
            let _lock = self.lock.lock().await;
            // 3.) 2nd load & check.
            if !self.initialized.load(Ordering::Relaxed) {
                {
                    // 4.) Critical section.
                    let result = init.await?;
                    let value = unsafe { &mut *self.value.get() };
                    value.replace(result);
                }
                // 5.) Store with lock held.
                self.initialized.store(true, Ordering::Release);
            }
        }
        let value = unsafe { &*self.value.get() };
        Ok(unsafe { value.as_ref().unchecked_unwrap() })
    }
}

Per the C++11 memory model used by Rust (See:
https://en.cppreference.com/w/cpp/atomic/memory_order), this would
seem to indicate the second load could be reordered to occur anywhere
after the 1st load with acquire ordering and anywhere before the store
with released ordering. If that second load was reordered to occur
before the lock was acquired, two threads could enter the critical
section in serial and the second thread would try to materialize to
paths that already were created and marked read-only. Switch to the
async-oncecell crate which performs both loads of the double-checked
lock with acquire ordering, ensuring they are not re-ordered with
respect to the interleaved non-atomics code.

Also fixup the materialization process to be atomic. We now cleanup
materialization chroots when materialization fails and only move their
contents to the destination path if the full materialization has
succeeded.

Fixes #13899

[ci skip-build-wheels]

@jsirois
Copy link
Contributor Author

jsirois commented Dec 28, 2021

Einteresting. The test failure calls BS on me.

@jsirois
Copy link
Contributor Author

jsirois commented Dec 29, 2021

I've unassigned #13899 and released back to the pool since I still don't understand this and my fix continues to be broken in the same way the current code is.

@jsirois jsirois closed this Dec 29, 2021
@jsirois jsirois reopened this Jan 4, 2022
@jsirois
Copy link
Contributor Author

jsirois commented Jan 4, 2022

Re-opened just for some CI debugging, not for review.

let materialized_chroot = chroot.into_path();
let digest_str = digest.hash.to_hex();
let path = self.workdir.path().join(&digest_str);
tokio::fs::rename(&materialized_chroot, &path)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it turns out you cannot rename a 555 dir under macOS; thus the perm denied on macOS only in the tests right now. Who knew? I <3 macOS.

Previously we used the double-checked-cell-async crate (See:
https://github.com/chrislearn/double-checked-cell-async/blob/46cd3b04eddddbe279282143fe8a936d5854588c/src/lib.rs#L228-L260),
which performed the second check of the double check lock with relaxed
ordering. The relevant code is snipped below and annotated:

```rust
pub struct DoubleCheckedCell<T> {
    value: UnsafeCell<Option<T>>,
    initialized: AtomicBool,
    lock: Mutex<()>,
}

impl<T> DoubleCheckedCell<T> {
    pub fn new() -> DoubleCheckedCell<T> {
        DoubleCheckedCell {
            value: UnsafeCell::new(None),
            initialized: AtomicBool::new(false),
            lock: Mutex::new(()),
        }
    }

    pub async fn get_or_try_init<Fut, E>(&self, init: Fut) -> Result<&T, E>
    where
        Fut: Future<Output = Result<T, E>>
    {
        // 1.) 1st load & check.
        if !self.initialized.load(Ordering::Acquire) {
            // 2.) Lock.
            let _lock = self.lock.lock().await;
            // 3.) 2nd load & check.
            if !self.initialized.load(Ordering::Relaxed) {
                {
                    // 4.) Critical section.
                    let result = init.await?;
                    let value = unsafe { &mut *self.value.get() };
                    value.replace(result);
                }
                // 5.) Store with lock held.
                self.initialized.store(true, Ordering::Release);
            }
        }
        let value = unsafe { &*self.value.get() };
        Ok(unsafe { value.as_ref().unchecked_unwrap() })
    }
}
```

Per the C++11 memory model used by Rust (See:
https://en.cppreference.com/w/cpp/language/memory_model), this would
seem to indicate the second load could be reordered to occur anywhere
after the 1st load with acquire ordering and anywhere before the store
with released ordering. If that second load was reordered to occur
before the lock was acquired, two threads could enter the critical
section in serial and the second thread would try to materialize to
paths that already were created and marked read-only. Switch to the
async-oncecell crate which performs both loads of the double-checked
lock with acquire ordering, ensuring they are not re-ordered with
respect to the interleaved non-atomics code.

Also fixup the materialization process to be atomic. We now cleanup
materialization chroots when materialization fails and only move their
contents to the destination path if the full materialization has
succeeded.

Fixes pantsbuild#13899

[ci skip-build-wheels]
Also provide more complete error reporting should the switch to
async-oncecell and use of atomic renames to ensure no incomplete
materializations does not do the trick.

[ci skip-build-wheels]
@jsirois
Copy link
Contributor Author

jsirois commented Jan 7, 2022

Ok, reviewers - PTAL. There was a bugfix for macOS perms oddity and I've added more failure debug info in case my DCL bug fix (switch to OnceCell) was wrong. That debug info includes all the ways I could think of getting a perm denied in the wild and should help narrow this down more quickly if the issue persists.

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
Copy link
Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for persisting here!

dest: impl AsRef<Path>,
map_rename_err: impl Fn(std::io::Error) -> String,
) -> Result<(), String> {
// If you try to rename a read-only directory (mode 0o555) under masOS you get permission
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// If you try to rename a read-only directory (mode 0o555) under masOS you get permission
// If you try to rename a read-only directory (mode 0o555) under macOS you get permission

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to save a tree since this is a comment and you've proved its understandable despite typos.

Comment on lines +15 to +19
async fn rename_readonly_directory(
src: impl AsRef<Path>,
dest: impl AsRef<Path>,
map_rename_err: impl Fn(std::io::Error) -> String,
) -> Result<(), String> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we end up needing to do this for any other materialized directories, we might be able to bake it into materialize_directory itself, before the root ends up marked readonly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe. The materialize and move case that this use has would need a callback or an expclict parameter for a final dest.

@jsirois jsirois merged commit 1fe755b into pantsbuild:main Jan 8, 2022
@jsirois jsirois deleted the issues/13899 branch January 8, 2022 01:31
@stuhood stuhood added this to the 2.9.x milestone Jan 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"Permission denied" while creating sandbox with immutable inputs
4 participants