Spawning more processes than a channel can handle seems to result in them locking up #539

Closed
yorickpeterse opened this issue May 22, 2023 · 5 comments
Labels: accepting contributions (issues suitable to be worked on by anybody, not just maintainers), bug (defects, unintended behaviour, etc.), runtime (changes related to the Rust-based runtime library)
Milestone: 0.12.0

@yorickpeterse (Collaborator)

Please describe the bug

Reading through https://pkolaczk.github.io/memory-consumption-of-async/, I decided to implement something similar in Inko (but using a two-second sleep to get the results faster):

import std::env::(arguments)
import std::process::(sleep)
import std::time::Duration

class async Sleeper {
  fn async run(output: Channel[Nil]) {
    sleep(Duration.from_secs(2))
    output.send(nil)
  }
}

class async Main {
  fn async main {
    let chan = Channel.new(size: 512)
    let num =
      arguments.opt(0).then fn (v) { Int.from_base10(v) }.unwrap_or(1_000)

    num.times fn (_) { Sleeper {}.run(chan) }
    num.times fn (_) { chan.receive }
  }
}

When running this with 340 or more processes (inko run test.inko 340), the program seems to lock up in the chan.receive loop. Oddly enough, it works just fine when 339 is used as the argument.

My guess is that we're deadlocking on the channels somewhere, or that too many threads are fighting over the mutex.

Please list the exact steps necessary to reproduce the bug

Save the above script to test.inko and run it using inko run test.inko 340.

Operating system

Fedora Silverblue 38

Inko version

main

Rust version

1.68.2

yorickpeterse added the accepting contributions, bug, and runtime labels on May 22, 2023
yorickpeterse added this to the 0.12.0 milestone on May 22, 2023
@yorickpeterse (Collaborator, Author)

Digging deeper, it seems the channel size is a red herring: if we remove the sleep, the program finishes just fine, even with far more processes than the channel size.

My guess is that the timeout worker either deadlocks somehow, or loses processes to reschedule.

@yorickpeterse (Collaborator, Author) commented May 23, 2023

Some notes from debugging this:

  • How long we keep the lock around in inko_process_suspend is irrelevant.
  • Sleeping for up to 1 second when the timeout worker has no timeouts (instead of sleeping until woken up) makes no difference.
  • If I instrument the runtime to print "sleeping" when processes go to sleep, and "rescheduling" when they are rescheduled by the timeout worker, the numbers even out (800 in this case), suggesting all of them are rescheduled. However, we only seem to receive a value 311 times.
  • Out of the 800 rescheduled processes, only 455 seem to actually run afterwards, and all 455 of them write to the channel.
  • It's not an issue with the run lock either, as the number of attempts to acquire it equals the number of times it is acquired (see the sketch after this list).
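A minimal sketch of how such a run-lock check can be done, assuming a plain Mutex and two global atomic counters (with_run_lock, ATTEMPTS, and ACQUIRED are made-up names for illustration, not the actual runtime API):

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

static ATTEMPTS: AtomicUsize = AtomicUsize::new(0);
static ACQUIRED: AtomicUsize = AtomicUsize::new(0);

fn with_run_lock<T>(lock: &Mutex<()>, work: impl FnOnce() -> T) -> T {
    // Count the attempt before blocking on the lock...
    ATTEMPTS.fetch_add(1, Ordering::Relaxed);
    let _guard = lock.lock().unwrap();

    // ...and the success once the lock is held.
    ACQUIRED.fetch_add(1, Ordering::Relaxed);
    work()
}

fn main() {
    let lock = Mutex::new(());
    with_run_lock(&lock, || ());

    // Prints "run lock: 1 attempts, 1 acquired".
    println!(
        "run lock: {} attempts, {} acquired",
        ATTEMPTS.load(Ordering::Relaxed),
        ACQUIRED.load(Ordering::Relaxed)
    );
}

If the two counters diverge, some thread is stuck waiting on the lock; equal counts are what rules the run lock out here.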

The script I'm using:

import std::env::(arguments)
import std::process::(sleep)
import std::time::Duration
import std::stdio::STDOUT

class async Sleeper {
  fn async run(output: Channel[Nil]) {
    let out = STDOUT.new

    out.write_string("sleeping\n")
    sleep(Duration.from_secs(5))
    out.write_string("sending\n")
    output.send(nil)
    out.write_string("sent\n")
  }
}

class async Main {
  fn async main {
    let chan = Channel.new(size: 2048)
    let num =
      arguments.opt(0).then fn (v) { Int.from_base10(v) }.unwrap_or(1_000)

    let out = STDOUT.new

    num.times fn (_) { Sleeper {}.run(chan) }
    num.times fn (_) {
      chan.receive
      out.write_string("received\n")
    }
  }
}

And with this runtime patch:

diff --git a/rt/src/scheduler/timeout_worker.rs b/rt/src/scheduler/timeout_worker.rs
index 2bcc6fa1..dd9a49e4 100644
--- a/rt/src/scheduler/timeout_worker.rs
+++ b/rt/src/scheduler/timeout_worker.rs
@@ -149,6 +149,10 @@ impl TimeoutWorker {
         let (expired, time_until_expiration) =
             inner.timeouts.processes_to_reschedule();
 
+        for _ in 0..expired.len() {
+            println!("rescheduling");
+        }
+
         scheduler.schedule_multiple(expired);
         time_until_expiration
     }

@yorickpeterse (Collaborator, Author)

The problem doesn't appear to be OS threads going to sleep while work is available, as disabling sleeping entirely (so the threads just spin) makes no difference.

@yorickpeterse (Collaborator, Author)

I think I found the problem: in Thread.steal_from_global we break if the local queue is full, but we don't push the remaining stolen jobs back onto the global queue. As a result, those jobs are lost forever.
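To illustrate the pattern, here is a standalone Rust sketch of a work-stealing thread that loses jobs this way, next to the fix the comment above implies (pushing the leftovers back). The types and names (GlobalQueue, Thread, steal_from_global_*) are mock-ups for illustration, not the actual rt/src scheduler code:

use std::collections::VecDeque;
use std::sync::Mutex;

// Stand-in for the scheduler's shared queue of runnable processes.
struct GlobalQueue {
    jobs: Mutex<VecDeque<u64>>,
}

// Stand-in for an OS thread with a bounded local run queue.
struct Thread<'a> {
    local: VecDeque<u64>,
    local_capacity: usize,
    global: &'a GlobalQueue,
}

impl<'a> Thread<'a> {
    // Buggy pattern: once the local queue is full we simply stop, and the
    // jobs still left in `stolen` are dropped, so those processes never run.
    fn steal_from_global_buggy(&mut self) {
        let mut stolen: VecDeque<u64> =
            self.global.jobs.lock().unwrap().drain(..).collect();

        while let Some(job) = stolen.pop_front() {
            if self.local.len() == self.local_capacity {
                break; // the rest of `stolen` is lost forever
            }
            self.local.push_back(job);
        }
    }

    // Fixed pattern: anything that doesn't fit locally goes back onto the
    // global queue so another thread can pick it up later.
    fn steal_from_global_fixed(&mut self) {
        let mut stolen: VecDeque<u64> =
            self.global.jobs.lock().unwrap().drain(..).collect();

        while let Some(job) = stolen.pop_front() {
            if self.local.len() == self.local_capacity {
                let mut global = self.global.jobs.lock().unwrap();
                global.push_back(job);
                global.extend(stolen);
                return;
            }
            self.local.push_back(job);
        }
    }
}

fn main() {
    let global = GlobalQueue { jobs: Mutex::new((0..8).collect()) };
    let mut thread =
        Thread { local: VecDeque::new(), local_capacity: 4, global: &global };

    thread.steal_from_global_buggy();

    // Prints "local: 4, global: 0": four of the eight jobs have vanished.
    println!(
        "local: {}, global: {}",
        thread.local.len(),
        global.jobs.lock().unwrap().len()
    );
}

Returning the leftovers to the global queue (or only stealing as many jobs as the local queue has room for in the first place) keeps every process reachable by some thread.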
