Spawning more processes than a channel can handle seems to result in them locking up #539

Closed
yorickpeterse opened this issue May 22, 2023 · 5 comments
Labels: accepting contributions (issues suitable to be worked on by anybody, not just maintainers), bug (defects, unintended behaviour, etc.), runtime (changes related to the Rust-based runtime library)
Milestone: 0.12.0

@yorickpeterse (Collaborator)

Please describe the bug

Reading through https://pkolaczk.github.io/memory-consumption-of-async/, I decided to implement something similar in Inko (but using a two-second sleep to get the results faster):

import std::env::(arguments)
import std::process::(sleep)
import std::time::Duration

class async Sleeper {
  fn async run(output: Channel[Nil]) {
    sleep(Duration.from_secs(2))
    output.send(nil)
  }
}

class async Main {
  fn async main {
    let chan = Channel.new(size: 512)
    let num =
      arguments.opt(0).then fn (v) { Int.from_base10(v) }.unwrap_or(1_000)

    num.times fn (_) { Sleeper {}.run(chan) }
    num.times fn (_) { chan.receive }
  }
}

When running this with 340 or more processes (inko run test.inko 340), the program seems to lock up in the chan.receive loop. Oddly enough, it works just fine when 339 is used as the argument.

My guess is that we're deadlocking on the channels somewhere, or that too many threads are fighting over the mutex.

Please list the exact steps necessary to reproduce the bug

Save the above script to test.inko and run it using inko run test.inko 340.

Operating system

Fedora Silverblue 38

Inko version

main

Rust version

1.68.2

yorickpeterse added the accepting contributions, bug, and runtime labels on May 22, 2023
yorickpeterse added this to the 0.12.0 milestone on May 22, 2023
@yorickpeterse (Collaborator, Author)

Digging deeper, it seems the channel size is a red herring: if we remove the sleep, the program finishes just fine, even with far more processes than the channel size.

My guess is that the timeout worker either deadlocks somehow, or loses processes to reschedule.

@yorickpeterse (Collaborator, Author) commented May 23, 2023

Some notes from debugging this:

  • How long we keep the lock around in inko_process_suspend is irrelevant.
  • Sleeping for up to 1 second when the timeout worker has no timeouts (instead of sleeping until woken up) makes no difference.
  • If I instrument the runtime to print "sleeping" when processes go to sleep, and "rescheduling" when they are rescheduled by the timeout worker, the numbers even out (800 in this case), suggesting all of them are rescheduled. However, we only seem to receive a value 311 times.
  • Out of the 800 rescheduled processes, only 455 seem to actually run afterwards, and all 455 of them write to the channel.
  • It's not an issue with the run lock either, as the number of attempts to acquire it equals the number of times it is acquired (see the sketch after this list).
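A minimal sketch of how such a run-lock check can be done, assuming a plain Mutex and two global atomic counters (with_run_lock, ATTEMPTS, and ACQUIRED are made-up names for illustration, not the actual runtime API):

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

static ATTEMPTS: AtomicUsize = AtomicUsize::new(0);
static ACQUIRED: AtomicUsize = AtomicUsize::new(0);

fn with_run_lock<T>(lock: &Mutex<()>, work: impl FnOnce() -> T) -> T {
    // Count the attempt before blocking on the lock...
    ATTEMPTS.fetch_add(1, Ordering::Relaxed);
    let _guard = lock.lock().unwrap();

    // ...and the success once the lock is held.
    ACQUIRED.fetch_add(1, Ordering::Relaxed);
    work()
}

fn main() {
    let lock = Mutex::new(());
    with_run_lock(&lock, || ());

    // Prints "run lock: 1 attempts, 1 acquired".
    println!(
        "run lock: {} attempts, {} acquired",
        ATTEMPTS.load(Ordering::Relaxed),
        ACQUIRED.load(Ordering::Relaxed)
    );
}

If the two counters diverge, some thread is stuck waiting on the lock; equal counts are what rules the run lock out here.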

The script I'm using:

import std::env::(arguments)
import std::process::(sleep)
import std::time::Duration
import std::stdio::STDOUT

class async Sleeper {
  fn async run(output: Channel[Nil]) {
    let out = STDOUT.new

    out.write_string("sleeping\n")
    sleep(Duration.from_secs(5))
    out.write_string("sending\n")
    output.send(nil)
    out.write_string("sent\n")
  }
}

class async Main {
  fn async main {
    let chan = Channel.new(size: 2048)
    let num =
      arguments.opt(0).then fn (v) { Int.from_base10(v) }.unwrap_or(1_000)

    let out = STDOUT.new

    num.times fn (_) { Sleeper {}.run(chan) }
    num.times fn (_) {
      chan.receive
      out.write_string("received\n")
    }
  }
}

And with this runtime patch:

diff --git a/rt/src/scheduler/timeout_worker.rs b/rt/src/scheduler/timeout_worker.rs
index 2bcc6fa1..dd9a49e4 100644
--- a/rt/src/scheduler/timeout_worker.rs
+++ b/rt/src/scheduler/timeout_worker.rs
@@ -149,6 +149,10 @@ impl TimeoutWorker {
         let (expired, time_until_expiration) =
             inner.timeouts.processes_to_reschedule();
 
+        for _ in 0..expired.len() {
+            println!("rescheduling");
+        }
+
         scheduler.schedule_multiple(expired);
         time_until_expiration
     }

@yorickpeterse (Collaborator, Author)

The problem doesn't appear to be OS threads going to sleep while work is available, as disabling sleeping entirely (so the threads just spin) makes no difference.

@yorickpeterse (Collaborator, Author)

I think I found the problem: in Thread.steal_from_global we break if the local queue is full, but we don't push the remaining stolen jobs back onto the global queue. As a result, those jobs are lost forever.
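To illustrate the pattern, here is a standalone Rust sketch of a work-stealing thread that loses jobs this way, next to the fix the comment above implies (pushing the leftovers back). The types and names (GlobalQueue, Thread, steal_from_global_*) are mock-ups for illustration, not the actual rt/src scheduler code:

use std::collections::VecDeque;
use std::sync::Mutex;

// Stand-in for the scheduler's shared queue of runnable processes.
struct GlobalQueue {
    jobs: Mutex<VecDeque<u64>>,
}

// Stand-in for an OS thread with a bounded local run queue.
struct Thread<'a> {
    local: VecDeque<u64>,
    local_capacity: usize,
    global: &'a GlobalQueue,
}

impl<'a> Thread<'a> {
    // Buggy pattern: once the local queue is full we simply stop, and the
    // jobs still left in `stolen` are dropped, so those processes never run.
    fn steal_from_global_buggy(&mut self) {
        let mut stolen: VecDeque<u64> =
            self.global.jobs.lock().unwrap().drain(..).collect();

        while let Some(job) = stolen.pop_front() {
            if self.local.len() == self.local_capacity {
                break; // the rest of `stolen` is lost forever
            }
            self.local.push_back(job);
        }
    }

    // Fixed pattern: anything that doesn't fit locally goes back onto the
    // global queue so another thread can pick it up later.
    fn steal_from_global_fixed(&mut self) {
        let mut stolen: VecDeque<u64> =
            self.global.jobs.lock().unwrap().drain(..).collect();

        while let Some(job) = stolen.pop_front() {
            if self.local.len() == self.local_capacity {
                let mut global = self.global.jobs.lock().unwrap();
                global.push_back(job);
                global.extend(stolen);
                return;
            }
            self.local.push_back(job);
        }
    }
}

fn main() {
    let global = GlobalQueue { jobs: Mutex::new((0..8).collect()) };
    let mut thread =
        Thread { local: VecDeque::new(), local_capacity: 4, global: &global };

    thread.steal_from_global_buggy();

    // Prints "local: 4, global: 0": four of the eight jobs have vanished.
    println!(
        "local: {}, global: {}",
        thread.local.len(),
        global.jobs.lock().unwrap().len()
    );
}

Returning the leftovers to the global queue (or only stealing as many jobs as the local queue has room for in the first place) keeps every process reachable by some thread.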
