Is this flag working properly? -Xgil #118874

Closed
jefer94 opened this issue May 9, 2024 · 10 comments
Labels: topic-free-threading, type-bug (An unexpected behavior, bug, or error)

jefer94 commented May 9, 2024

Bug report

Bug description:

Windows 11, WSL Archlinux
Hash 1b1db2f
Linux Jeferson 5.15.146.1-microsoft-standard-WSL2 #1 SMP Thu Jan 11 04:09:03 UTC 2024 x86_64 GNU/Linux
My test file

import threading
import time

# Shared resource
shared_resource = 0

# Lock for synchronization
# lock = threading.Lock()


# Define a function that will be executed by each thread
def worker():
    global shared_resource
    for n in range(1000000):  # Perform some computation
        # with lock:
        shared_resource += n


def run_threads(num_threads):
    start_time = time.time()

    # Create a list to store references to the threads
    threads = []

    # Create and start the specified number of threads
    for _ in range(num_threads):
        thread = threading.Thread(target=worker)
        thread.start()
        threads.append(thread)

    # Wait for all threads to complete
    for thread in threads:
        thread.join()

    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Ran {num_threads} threads cooperatively in {elapsed_time:.2f} seconds")


# Run the benchmark with different numbers of threads
for num_threads in [1, 2, 4, 8, 20]:
    run_threads(num_threads)
jefer@Jeferson ~/d/w/a/b/threads (development)> PYTHON_GIL=1 /opt/python-beta/bin/python3 -Xgil=1 -m main
Ran 1 threads cooperatively in 0.04 seconds
Ran 2 threads cooperatively in 0.09 seconds
Ran 4 threads cooperatively in 0.18 seconds
Ran 8 threads cooperatively in 0.35 seconds
Ran 20 threads cooperatively in 0.86 seconds
jefer@Jeferson ~/d/w/a/b/threads (development)> PYTHON_GIL=0 /opt/python-beta/bin/python3 -Xgil=0 -m main
Ran 1 threads cooperatively in 0.04 seconds
Ran 2 threads cooperatively in 0.10 seconds
Ran 4 threads cooperatively in 0.35 seconds
Ran 8 threads cooperatively in 1.50 seconds
Ran 20 threads cooperatively in 22.48 seconds

I understood -Xgil=0 to mean running without the GIL, so I expected it to be faster, but it's slower, and the gap grows with the number of threads.
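The benchmark leaves the lock commented out, so `shared_resource += n` is an unsynchronized read-modify-write. It is not guaranteed to be atomic even with the GIL, and with the GIL disabled threads can interleave between the read and the write and lose updates, so the timings above come from a program whose final total is not reliable. Below is a minimal sketch that re-enables the lock and checks the total against the closed-form sum; the `iterations` parameter, the reset of `shared_resource`, and the correctness check are illustrative additions, not part of the original script.

```python
import threading
import time

# Variant of the issue's benchmark with the commented-out lock re-enabled.
shared_resource = 0
lock = threading.Lock()

ITERATIONS = 1_000_000


def worker(iterations=ITERATIONS):
    global shared_resource
    for n in range(iterations):
        # The lock makes the read-modify-write on the global atomic,
        # so no thread's update can be overwritten by another.
        with lock:
            shared_resource += n


def run_threads(num_threads, iterations=ITERATIONS):
    """Run the workers and return (elapsed seconds, whether the total is exact)."""
    global shared_resource
    shared_resource = 0  # reset so each run can be checked independently
    threads = [threading.Thread(target=worker, args=(iterations,)) for _ in range(num_threads)]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    elapsed = time.perf_counter() - start
    # Each thread adds 0 + 1 + ... + (iterations - 1).
    expected = num_threads * iterations * (iterations - 1) // 2
    return elapsed, shared_resource == expected


if __name__ == "__main__":
    for num_threads in [1, 2, 4, 8]:
        elapsed, correct = run_threads(num_threads)
        print(f"{num_threads} threads: {elapsed:.2f} s, total correct: {correct}")
```

Note that the lock adds its own contention, so this variant is a correctness check rather than a faster benchmark.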

CPython versions tested on:

3.13, CPython main branch

Operating systems tested on:

Linux, Other

jefer94 added the type-bug label May 9, 2024

mpage (Contributor) commented May 10, 2024

The -Xgil={0,1} flag works on my system (I'm assuming you've configured CPython with --disable-gil):

> /usr/bin/time -f '%P' ./python -Xgil=1 ~/local/scratch/repro_118874.py
gil_enabled=True
Ran 1 threads cooperatively in 0.13 seconds
Ran 2 threads cooperatively in 0.25 seconds
Ran 4 threads cooperatively in 0.52 seconds
Ran 8 threads cooperatively in 1.03 seconds
101%
> /usr/bin/time -f '%P' ./python -Xgil=0 ~/local/scratch/repro_118874.py
gil_enabled=False
Ran 1 threads cooperatively in 0.13 seconds
Ran 2 threads cooperatively in 0.36 seconds
Ran 4 threads cooperatively in 1.05 seconds
Ran 8 threads cooperatively in 5.28 seconds
549%
>
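The output above starts with a `gil_enabled=` line, which suggests the repro script prints the GIL state at runtime; the script itself is not shown in the thread. A minimal sketch of such a check, assuming `sys._is_gil_enabled()`, which was added in Python 3.13 and is a private helper (note the leading underscore), so the sketch falls back to `True` on older interpreters that always run with the GIL:

```python
import sys


def gil_enabled() -> bool:
    # sys._is_gil_enabled() exists from Python 3.13 onward; older
    # interpreters always hold the GIL, so report True there.
    is_enabled = getattr(sys, "_is_gil_enabled", None)
    return True if is_enabled is None else is_enabled()


if __name__ == "__main__":
    print(f"gil_enabled={gil_enabled()}")
```

Printing this at startup makes it easy to confirm that `-Xgil=0` or `PYTHON_GIL=0` actually took effect before comparing timings.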

I think the problem you're running into is that there is a limited amount of exploitable parallelism in the program. The shared_resource variable is stored in the globals dictionary, which is protected by its per-object lock. Each iteration of the loop in worker() requires locking the globals dictionary twice (once to read the value and once to store the updated value). Contention on the globals dictionary's lock increases with the number of threads, which I believe explains the slowdown you're seeing.
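The two globals-dictionary accesses per iteration described above are visible in the bytecode: `shared_resource += n` compiles to a `LOAD_GLOBAL` followed by a `STORE_GLOBAL` of the same name. A small sketch using the standard `dis` module to list them; the helper name `global_accesses` is hypothetical, and the exact surrounding instructions differ between Python versions:

```python
import dis

shared_resource = 0


def worker():
    global shared_resource
    for n in range(1000000):
        shared_resource += n


def global_accesses(func, name):
    """Return the opnames of instructions in func that read or write global `name`."""
    return [
        ins.opname
        for ins in dis.get_instructions(func)
        if ins.opname in ("LOAD_GLOBAL", "STORE_GLOBAL") and ins.argval == name
    ]


if __name__ == "__main__":
    print(global_accesses(worker, "shared_resource"))
```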

You should get a result more in line with what you were expecting if you slightly modify the program to perform thread-local additions inside the loop and only update the global count with the result:

> ./python -Xgil=0 ~/local/scratch/repro_118874_with_parallelism.py
gil_enabled=False
Ran 1 threads cooperatively in 0.66 seconds
Ran 2 threads cooperatively in 0.67 seconds
Ran 4 threads cooperatively in 0.67 seconds
Ran 8 threads cooperatively in 0.68 seconds
> ./python -Xgil=1 ~/local/scratch/repro_118874_with_parallelism.py
gil_enabled=True
Ran 1 threads cooperatively in 0.67 seconds
Ran 2 threads cooperatively in 1.37 seconds
Ran 4 threads cooperatively in 2.75 seconds
Ran 8 threads cooperatively in 5.50 seconds
>
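The modified script (repro_118874_with_parallelism.py) is not included in the thread. Below is a minimal sketch of the change described above, assuming each worker sums into a local variable and adds its result to the global once at the end. The lock around that final update and the `iterations` parameter are assumptions made for illustration.

```python
import threading
import time

shared_resource = 0
lock = threading.Lock()


def worker(iterations=1_000_000):
    global shared_resource
    local_total = 0
    for n in range(iterations):
        # Local variable: the hot loop never touches the globals dictionary.
        local_total += n
    # One synchronized update of the global per thread.
    with lock:
        shared_resource += local_total


def run_threads(num_threads, iterations=1_000_000):
    """Run the workers and return (elapsed seconds, final global total)."""
    global shared_resource
    shared_resource = 0
    threads = [threading.Thread(target=worker, args=(iterations,)) for _ in range(num_threads)]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    elapsed = time.perf_counter() - start
    print(f"Ran {num_threads} threads cooperatively in {elapsed:.2f} seconds")
    return elapsed, shared_resource


if __name__ == "__main__":
    for num_threads in [1, 2, 4, 8]:
        run_threads(num_threads)
```

Because each thread does a fixed amount of work, a free-threaded run should keep wall time roughly flat as threads are added (up to the number of cores), while a run with the GIL should scale roughly linearly, matching the shape of the numbers above.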

jefer94 (Author) commented May 10, 2024

Mmmmm, I used:

./configure --disable-gil --enable-optimizations --prefix=/opt/python-beta
make
sudo make install
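To confirm that an installed interpreter such as /opt/python-beta/bin/python3 was actually built with `--disable-gil`, a minimal sketch using `sysconfig`, assuming the documented `Py_GIL_DISABLED` configuration variable (1 on free-threaded builds, otherwise 0 or `None`):

```python
import sysconfig


def is_free_threaded_build() -> bool:
    # Py_GIL_DISABLED is 1 for builds configured with --disable-gil;
    # it is 0 on default builds and None on versions predating the option.
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


if __name__ == "__main__":
    print("free-threaded build:", is_free_threaded_build())
```

A build-time check like this complements a runtime check of the GIL state, since a free-threaded build can still run with the GIL enabled (for example with `-Xgil=1`).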

With your accumulator version of the code, the performance changed:

jefer@Jeferson ~/d/w/a/b/threads (development)> PYTHON_GIL=1 /opt/python-beta/bin/python3 -Xgil=1 -m main
gil_enabled=True
Ran 1 threads cooperatively in 0.23 seconds
Ran 2 threads cooperatively in 0.46 seconds
Ran 4 threads cooperatively in 0.92 seconds
Ran 8 threads cooperatively in 1.83 seconds
Ran 20 threads cooperatively in 4.67 seconds
jefer@Jeferson ~/d/w/a/b/threads (development)> PYTHON_GIL=0 /opt/python-beta/bin/python3 -Xgil=0 -m main
gil_enabled=False
Ran 1 threads cooperatively in 0.23 seconds
Ran 2 threads cooperatively in 0.23 seconds
Ran 4 threads cooperatively in 0.23 seconds
Ran 8 threads cooperatively in 0.31 seconds
Ran 20 threads cooperatively in 0.56 seconds
jefer@Jeferson ~/d/w/a/b/threads (development)> 

I suppose some examples showing how to use this properly would be needed.

I tried to use this with Gunicorn and Granian. I couldn't install Granian. Gunicorn didn't gain any performance with threads, but running with workers roughly doubled performance; I suppose the workers require some tweaks.

Do you know if this is thread-safe?

base

https://github.com/breatheco-de/apiv2/tree/main/benchmarks/django-workers

benchmark

# trio is not supported by django yet and should break gevent

FILE="./threads-py313.md"
CONNECTIONS=2000
THREADS=20
PORT=8000
HOST="http://localhost:$PORT"
TIMEOUT=10
SLEEP_TIME=3


# supports both WSGI and ASGI
function bench {

    echo "" >> "$FILE"
    echo "### JSON performance" >> "$FILE"
    echo "#### Sync" >> "$FILE"
    echo "" >> "$FILE"
    echo "\`\`\`bash" >> "$FILE"
    wrk -t "$THREADS" -c "$CONNECTIONS" -d10s "$HOST/myapp/sync/json" >> "$FILE"
    echo "\`\`\`" >> "$FILE"
    echo "" >> "$FILE"
    echo "#### Async" >> "$FILE"
    echo "" >> "$FILE"
    echo "\`\`\`bash" >> "$FILE"
    wrk -t "$THREADS" -c "$CONNECTIONS" -d10s "$HOST/myapp/async/json" >> "$FILE"
    echo "\`\`\`" >> "$FILE"
    echo "" >> "$FILE"

    echo "### Queries returned as JSON" >> "$FILE"
    echo "#### Sync" >> "$FILE"
    echo "" >> "$FILE"
    echo "\`\`\`bash" >> "$FILE"
    wrk -t "$THREADS" -c "$CONNECTIONS" -d10s "$HOST/myapp/sync/json_query" >> "$FILE"
    echo "\`\`\`" >> "$FILE"
    echo "" >> "$FILE"
    echo "#### Async" >> "$FILE"
    echo "" >> "$FILE"
    echo "\`\`\`bash" >> "$FILE"
    wrk -t "$THREADS" -c "$CONNECTIONS" -d10s "$HOST/myapp/async/json_query" >> "$FILE"
    echo "\`\`\`" >> "$FILE"
    echo "" >> "$FILE"

    echo "### Queries returned as HTML" >> "$FILE"
    echo "#### Sync" >> "$FILE"
    echo "" >> "$FILE"
    echo "\`\`\`bash" >> "$FILE"
    wrk -t "$THREADS" -c "$CONNECTIONS" -d10s "$HOST/myapp/sync/template_query" >> "$FILE"
    echo "\`\`\`" >> "$FILE"
    echo "" >> "$FILE"
    echo "#### Async" >> "$FILE"
    echo "" >> "$FILE"
    echo "\`\`\`bash" >> "$FILE"
    wrk -t "$THREADS" -c "$CONNECTIONS" -d10s "$HOST/myapp/async/template_query" >> "$FILE"
    echo "\`\`\`" >> "$FILE"
    echo "" >> "$FILE"

    echo "### Simulate a request 1s inside the server, then return a JSON" >> "$FILE"
    echo "#### Sync" >> "$FILE"
    echo "" >> "$FILE"
    echo "\`\`\`bash" >> "$FILE"
    wrk -t "$THREADS" -c "$CONNECTIONS" -d10s "$HOST/myapp/sync/gateway_1s" >> "$FILE"
    echo "\`\`\`" >> "$FILE"
    echo "" >> "$FILE"
    echo "#### Async" >> "$FILE"
    echo "" >> "$FILE"
    echo "\`\`\`bash" >> "$FILE"
    wrk -t "$THREADS" -c "$CONNECTIONS" -d10s "$HOST/myapp/async/gateway_1s" >> "$FILE"
    echo "\`\`\`" >> "$FILE"
    echo "" >> "$FILE"

    sleep $SLEEP_TIME

    echo "### Simulate a request 3s inside the server, then return a JSON" >> "$FILE"
    echo "#### Sync" >> "$FILE"
    echo "" >> "$FILE"
    echo "\`\`\`bash" >> "$FILE"
    wrk -t "$THREADS" -c "$CONNECTIONS" -d10s "$HOST/myapp/sync/gateway_3s" >> "$FILE"
    echo "\`\`\`" >> "$FILE"
    echo "" >> "$FILE"
    echo "#### Async" >> "$FILE"
    echo "" >> "$FILE"
    echo "\`\`\`bash" >> "$FILE"
    wrk -t "$THREADS" -c "$CONNECTIONS" -d10s "$HOST/myapp/async/gateway_3s" >> "$FILE"
    echo "\`\`\`" >> "$FILE"
    echo "" >> "$FILE"

    sleep $SLEEP_TIME

    echo "### Simulate a request 10s inside the server, then return a JSON" >> "$FILE"
    echo "#### Sync" >> "$FILE"
    echo "" >> "$FILE"
    echo "\`\`\`bash" >> "$FILE"
    wrk -t "$THREADS" -c "$CONNECTIONS" -d10s "$HOST/myapp/sync/gateway_10s" >> "$FILE"
    echo "\`\`\`" >> "$FILE"
    echo "" >> "$FILE"
    echo "#### Async" >> "$FILE"
    echo "" >> "$FILE"
    echo "\`\`\`bash" >> "$FILE"
    wrk -t "$THREADS" -c "$CONNECTIONS" -d10s "$HOST/myapp/async/gateway_10s" >> "$FILE"
    echo "\`\`\`" >> "$FILE"
    echo "" >> "$FILE"

    sleep $SLEEP_TIME

    echo "### Brotli" >> "$FILE"
    echo "#### Sync" >> "$FILE"
    echo "" >> "$FILE"
    echo "\`\`\`bash" >> "$FILE"
    wrk -t "$THREADS" -c "$CONNECTIONS" -d10s "$HOST/myapp/sync/brotli" >> "$FILE"
    echo "\`\`\`" >> "$FILE"
    echo "" >> "$FILE"
    echo "#### Async" >> "$FILE"
    echo "" >> "$FILE"
    echo "\`\`\`bash" >> "$FILE"
    wrk -t "$THREADS" -c "$CONNECTIONS" -d10s "$HOST/myapp/async/brotli" >> "$FILE"
    echo "\`\`\`" >> "$FILE"
    echo "" >> "$FILE"
}


echo "# Django Workers" > $FILE

sudo fuser -k $PORT/tcp
PYTHON_GIL=0 python -Xgil=0 -m gunicorn mysite.asgi --timeout $TIMEOUT --threads $THREADS --worker-class uvicorn.workers.UvicornWorker & echo "starting server..."
sleep $SLEEP_TIME
echo "## ASGI Gunicorn Uvicorn, with threads -Xgil=0" >> $FILE
bench

sudo fuser -k $PORT/tcp
PYTHON_GIL=1 python -Xgil=1 -m gunicorn mysite.asgi --timeout $TIMEOUT --threads $THREADS --worker-class uvicorn.workers.UvicornWorker & echo "starting server..."
sleep $SLEEP_TIME
echo "## ASGI Gunicorn Uvicorn, with threads -Xgil=1" >> $FILE
bench

sudo fuser -k $PORT/tcp
PYTHON_GIL=0 python -Xgil=0 -m gunicorn mysite.asgi --timeout $TIMEOUT --workers $THREADS --worker-class uvicorn.workers.UvicornWorker & echo "starting server..."
sleep $SLEEP_TIME
echo "## ASGI Gunicorn Uvicorn, with workers -Xgil=0" >> $FILE
bench

sudo fuser -k $PORT/tcp
PYTHON_GIL=1 python -Xgil=1 -m gunicorn mysite.asgi --timeout $TIMEOUT --workers $THREADS --worker-class uvicorn.workers.UvicornWorker & echo "starting server..."
sleep $SLEEP_TIME
echo "## ASGI Gunicorn Uvicorn, with workers -Xgil=1" >> $FILE
bench

sudo fuser -k $PORT/tcp
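One possible factor in the "with threads" results: the script passes `--threads $THREADS` together with `--worker-class uvicorn.workers.UvicornWorker`, but the Gunicorn settings documentation states that `threads` only affects the gthread worker type. Those runs therefore likely used a single Uvicorn worker process regardless of `-Xgil`. Below is a sketch of the "with workers" run expressed as a `gunicorn.conf.py`; the values mirror the script's variables, and the bind address assumes Gunicorn's default of 127.0.0.1:8000 since the script does not set one.

```python
# gunicorn.conf.py: hypothetical config equivalent to the script's "--workers" runs.
# Gunicorn reads module-level names in this file as settings.

bind = "127.0.0.1:8000"  # Gunicorn's default; PORT=8000 in the script
worker_class = "uvicorn.workers.UvicornWorker"
workers = 20             # THREADS=20 in the script; separate processes
timeout = 10             # TIMEOUT=10 in the script

# `threads` is only honoured by the gthread worker class, so setting it
# alongside UvicornWorker has no effect.
# threads = 20
```

It would be started with, for example, `PYTHON_GIL=0 python -Xgil=0 -m gunicorn mysite.asgi -c gunicorn.conf.py`.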

Requests per second results

Django Workers

ASGI Gunicorn Uvicorn, with threads -Xgil=0

JSON performance

Sync

Running 10s test @ http://localhost:8000/myapp/sync/json
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec   136.35    242.30     0.99k    86.49%
  2000 requests in 10.09s, 550.78KB read
  Socket errors: connect 0, read 0, write 0, timeout 2000
Requests/sec:    198.31
Transfer/sec:     54.61KB

Async

Running 10s test @ http://localhost:8000/myapp/async/json
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec   179.31    276.02     0.91k    83.33%
  2000 requests in 10.05s, 550.78KB read
  Socket errors: connect 0, read 0, write 0, timeout 2000
Requests/sec:    199.02
Transfer/sec:     54.81KB

Queries returned as JSON

Sync

Running 10s test @ http://localhost:8000/myapp/sync/json_query
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 10.08s, 0.00B read
Requests/sec:      0.00
Transfer/sec:       0.00B

Async

Running 10s test @ http://localhost:8000/myapp/async/json_query
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 10.09s, 0.00B read
  Socket errors: connect 0, read 2000, write 0, timeout 0
Requests/sec:      0.00
Transfer/sec:       0.00B

Queries returned as HTML

Sync

Running 10s test @ http://localhost:8000/myapp/sync/template_query
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 10.09s, 0.00B read
  Socket errors: connect 0, read 2000, write 0, timeout 0
Requests/sec:      0.00
Transfer/sec:       0.00B

Async

Running 10s test @ http://localhost:8000/myapp/async/template_query
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 10.09s, 0.00B read
Requests/sec:      0.00
Transfer/sec:       0.00B

Simulate a request 1s inside the server, then return a JSON

Sync

Running 10s test @ http://localhost:8000/myapp/sync/gateway_1s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 10.09s, 0.00B read
  Socket errors: connect 0, read 2000, write 0, timeout 0
Requests/sec:      0.00
Transfer/sec:       0.00B

Async

Running 10s test @ http://localhost:8000/myapp/async/gateway_1s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec   153.47    254.75     0.87k    79.41%
  2000 requests in 10.07s, 523.44KB read
  Socket errors: connect 0, read 0, write 0, timeout 2000
Requests/sec:    198.62
Transfer/sec:     51.98KB

Simulate a request 3s inside the server, then return a JSON

Sync

Running 10s test @ http://localhost:8000/myapp/sync/gateway_3s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec   175.09    239.35     0.96k    84.06%
  1715 requests in 10.09s, 448.85KB read
  Socket errors: connect 0, read 0, write 0, timeout 1715
Requests/sec:    169.98
Transfer/sec:     44.49KB

Async

Running 10s test @ http://localhost:8000/myapp/async/gateway_3s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 10.08s, 0.00B read
  Socket errors: connect 0, read 2000, write 0, timeout 0
Requests/sec:      0.00
Transfer/sec:       0.00B

Simulate a request 10s inside the server, then return a JSON

Sync

Running 10s test @ http://localhost:8000/myapp/sync/gateway_10s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 10.10s, 0.00B read
Requests/sec:      0.00
Transfer/sec:       0.00B

Async

Running 10s test @ http://localhost:8000/myapp/async/gateway_10s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 10.10s, 0.00B read
Requests/sec:      0.00
Transfer/sec:       0.00B

Brotli

Sync

Running 10s test @ http://localhost:8000/myapp/sync/brotli
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.10      0.32     1.00     90.00%
  10 requests in 10.06s, 65.45KB read
  Socket errors: connect 0, read 0, write 0, timeout 10
Requests/sec:      0.99
Transfer/sec:      6.50KB

Async

Running 10s test @ http://localhost:8000/myapp/async/brotli
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 10.05s, 0.00B read
Requests/sec:      0.00
Transfer/sec:       0.00B

ASGI Gunicorn Uvicorn, with threads -Xgil=1

JSON performance

Sync

Running 10s test @ http://localhost:8000/myapp/sync/json
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec   186.82    273.64   830.00     78.75%
  2000 requests in 10.07s, 550.78KB read
  Socket errors: connect 0, read 0, write 0, timeout 2000
Requests/sec:    198.58
Transfer/sec:     54.69KB

Async

Running 10s test @ http://localhost:8000/myapp/async/json
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 10.10s, 0.00B read
Requests/sec:      0.00
Transfer/sec:       0.00B

Queries returned as JSON

Sync

Running 10s test @ http://localhost:8000/myapp/sync/json_query
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 10.07s, 0.00B read
  Socket errors: connect 0, read 2000, write 0, timeout 0
Requests/sec:      0.00
Transfer/sec:       0.00B

Async

Running 10s test @ http://localhost:8000/myapp/async/json_query
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 10.09s, 0.00B read
  Socket errors: connect 0, read 2000, write 0, timeout 0
Requests/sec:      0.00
Transfer/sec:       0.00B

Queries returned as HTML

Sync

Running 10s test @ http://localhost:8000/myapp/sync/template_query
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 10.09s, 0.00B read
  Socket errors: connect 0, read 2000, write 0, timeout 0
Requests/sec:      0.00
Transfer/sec:       0.00B

Async

Running 10s test @ http://localhost:8000/myapp/async/template_query
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 10.05s, 0.00B read
  Socket errors: connect 0, read 2000, write 0, timeout 0
Requests/sec:      0.00
Transfer/sec:       0.00B

Simulate a request 1s inside the server, then return a JSON

Sync

Running 10s test @ http://localhost:8000/myapp/sync/gateway_1s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 10.10s, 0.00B read
  Socket errors: connect 0, read 2000, write 0, timeout 0
Requests/sec:      0.00
Transfer/sec:       0.00B

Async

Running 10s test @ http://localhost:8000/myapp/async/gateway_1s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 10.05s, 0.00B read
Requests/sec:      0.00
Transfer/sec:       0.00B

Simulate a request 3s inside the server, then return a JSON

Sync

Running 10s test @ http://localhost:8000/myapp/sync/gateway_3s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 10.06s, 0.00B read
Requests/sec:      0.00
Transfer/sec:       0.00B

Async

Running 10s test @ http://localhost:8000/myapp/async/gateway_3s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 10.08s, 0.00B read
  Socket errors: connect 0, read 2000, write 0, timeout 0
Requests/sec:      0.00
Transfer/sec:       0.00B

Simulate a request 10s inside the server, then return a JSON

Sync

Running 10s test @ http://localhost:8000/myapp/sync/gateway_10s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 10.07s, 0.00B read
Requests/sec:      0.00
Transfer/sec:       0.00B

Async

Running 10s test @ http://localhost:8000/myapp/async/gateway_10s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 10.03s, 0.00B read
Requests/sec:      0.00
Transfer/sec:       0.00B

Brotli

Sync

Running 10s test @ http://localhost:8000/myapp/sync/brotli
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 10.11s, 0.00B read
Requests/sec:      0.00
Transfer/sec:       0.00B

Async

Running 10s test @ http://localhost:8000/myapp/async/brotli
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  0 requests in 10.09s, 0.00B read
  Socket errors: connect 0, read 2000, write 0, timeout 0
Requests/sec:      0.00
Transfer/sec:       0.00B

ASGI Gunicorn Uvicorn, with workers -Xgil=0

JSON performance

Sync

Running 10s test @ http://localhost:8000/myapp/sync/json
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   178.74ms  339.44ms   2.00s    93.94%
    Req/Sec   383.85    379.53     1.18k    70.97%
  28846 requests in 10.10s, 7.76MB read
  Socket errors: connect 0, read 0, write 0, timeout 2256
Requests/sec:   2856.05
Transfer/sec:    786.55KB

Async

Running 10s test @ http://localhost:8000/myapp/async/json
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   698.98ms  384.09ms   1.98s    69.18%
    Req/Sec   148.62    121.28     1.67k    73.84%
  27942 requests in 10.06s, 7.51MB read
  Socket errors: connect 0, read 0, write 0, timeout 112
Requests/sec:   2776.16
Transfer/sec:    764.55KB

Queries returned as JSON

Sync

Running 10s test @ http://localhost:8000/myapp/sync/json_query
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   879.42ms  533.47ms   2.00s    54.36%
    Req/Sec   107.06     93.71   600.00     78.11%
  20292 requests in 10.08s, 4.90MB read
  Socket errors: connect 0, read 0, write 0, timeout 948
Requests/sec:   2013.20
Transfer/sec:    497.43KB

Async

Running 10s test @ http://localhost:8000/myapp/async/json_query
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   957.97ms  309.21ms   1.85s    71.55%
    Req/Sec   113.84    124.82   660.00     86.43%
  16727 requests in 10.10s, 4.04MB read
  Socket errors: connect 0, read 0, write 0, timeout 1428
Requests/sec:   1656.08
Transfer/sec:    409.19KB

Queries returned as HTML

Sync

Running 10s test @ http://localhost:8000/myapp/sync/template_query
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.13s   314.86ms   1.96s    64.61%
    Req/Sec    87.98     88.27   730.00     86.57%
  14841 requests in 10.10s, 8.14MB read
  Socket errors: connect 0, read 0, write 0, timeout 1182
Requests/sec:   1470.13
Transfer/sec:    825.54KB

Async

Running 10s test @ http://localhost:8000/myapp/async/template_query
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   946.53ms  420.16ms   2.00s    78.18%
    Req/Sec   100.77    112.74   727.00     86.11%
  14171 requests in 10.08s, 7.77MB read
  Socket errors: connect 0, read 0, write 0, timeout 3069
Requests/sec:   1405.21
Transfer/sec:    789.06KB

Simulate a request 1s inside the server, then return a JSON

Sync

Running 10s test @ http://localhost:8000/myapp/sync/gateway_1s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.45s   262.79ms   2.00s    59.58%
    Req/Sec    97.63    113.74   800.00     86.95%
  11344 requests in 10.10s, 2.90MB read
  Socket errors: connect 0, read 0, write 0, timeout 2651
Requests/sec:   1123.38
Transfer/sec:    294.01KB

Async

Running 10s test @ http://localhost:8000/myapp/async/gateway_1s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.42s   244.45ms   2.00s    68.50%
    Req/Sec    92.13    104.20     0.91k    87.93%
  12180 requests in 10.10s, 3.11MB read
  Socket errors: connect 0, read 0, write 0, timeout 1337
Requests/sec:   1206.22
Transfer/sec:    315.69KB

Simulate a request 3s inside the server, then return a JSON

Sync

Running 10s test @ http://localhost:8000/myapp/sync/gateway_3s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec   106.23    141.99     0.86k    87.46%
  4721 requests in 10.06s, 1.21MB read
  Socket errors: connect 0, read 0, write 0, timeout 4721
Requests/sec:    469.39
Transfer/sec:    122.87KB

Async

Running 10s test @ http://localhost:8000/myapp/async/gateway_3s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.35s   212.12ms   1.94s    70.02%
    Req/Sec    85.03     86.76   737.00     91.33%
  13657 requests in 10.10s, 3.49MB read
  Socket errors: connect 0, read 0, write 0, timeout 176
Requests/sec:   1352.33
Transfer/sec:    353.93KB

Simulate a request 10s inside the server, then return a JSON

Sync

Running 10s test @ http://localhost:8000/myapp/sync/gateway_10s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec   145.50    282.27   717.00     83.33%
  157 requests in 10.05s, 41.09KB read
  Socket errors: connect 0, read 0, write 0, timeout 157
Requests/sec:     15.63
Transfer/sec:      4.09KB

Async

Running 10s test @ http://localhost:8000/myapp/async/gateway_10s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00    100.00%
  5 requests in 10.10s, 1.31KB read
  Socket errors: connect 0, read 0, write 0, timeout 5
Requests/sec:      0.50
Transfer/sec:     132.68B

Brotli

Sync

Running 10s test @ http://localhost:8000/myapp/sync/brotli
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.43s   409.09ms   2.00s    68.74%
    Req/Sec    24.95     27.19   210.00     89.87%
  3057 requests in 10.10s, 19.54MB read
  Socket errors: connect 0, read 172, write 0, timeout 2190
Requests/sec:    302.78
Transfer/sec:      1.94MB

Async

Running 10s test @ http://localhost:8000/myapp/async/brotli
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.36s   405.31ms   1.95s    61.58%
    Req/Sec    31.40    104.70     0.92k    97.06%
  1287 requests in 10.09s, 8.23MB read
  Socket errors: connect 0, read 1142, write 0, timeout 920
Requests/sec:    127.49
Transfer/sec:    834.41KB

ASGI Gunicorn Uvicorn, with workers -Xgil=1

JSON performance

Sync

Running 10s test @ http://localhost:8000/myapp/sync/json
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   552.16ms  366.20ms   2.00s    80.77%
    Req/Sec   113.62    102.55   740.00     76.80%
  18140 requests in 10.10s, 4.88MB read
  Socket errors: connect 0, read 0, write 0, timeout 4102
Requests/sec:   1796.89
Transfer/sec:    494.87KB

Async

Running 10s test @ http://localhost:8000/myapp/async/json
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   806.98ms  418.05ms   2.00s    62.49%
    Req/Sec   102.03     94.98   580.00     78.75%
  15685 requests in 10.09s, 4.22MB read
  Socket errors: connect 0, read 0, write 0, timeout 1028
Requests/sec:   1554.58
Transfer/sec:    428.19KB

Queries returned as JSON

Sync

Running 10s test @ http://localhost:8000/myapp/sync/json_query
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.09s   604.32ms   2.00s    56.31%
    Req/Sec    63.52     59.24   450.00     82.75%
  9382 requests in 10.06s, 2.26MB read
  Socket errors: connect 0, read 0, write 0, timeout 2097
Requests/sec:    932.46
Transfer/sec:    230.38KB

Async

Running 10s test @ http://localhost:8000/myapp/async/json_query
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.25s   471.66ms   2.00s    66.91%
    Req/Sec    65.14     62.76   464.00     83.21%
  10158 requests in 10.04s, 2.45MB read
  Socket errors: connect 0, read 0, write 0, timeout 3876
Requests/sec:   1011.50
Transfer/sec:    249.96KB

Queries returned as HTML

Sync

Running 10s test @ http://localhost:8000/myapp/sync/template_query
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.33s   483.35ms   2.00s    66.95%
    Req/Sec    41.79     37.81   525.00     86.85%
  6981 requests in 10.10s, 3.83MB read
  Socket errors: connect 0, read 0, write 0, timeout 2806
Requests/sec:    691.23
Transfer/sec:    388.17KB

Async

Running 10s test @ http://localhost:8000/myapp/async/template_query
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.36s   403.77ms   2.00s    65.89%
    Req/Sec    55.69     61.10   383.00     86.28%
  7921 requests in 10.06s, 4.34MB read
  Socket errors: connect 0, read 0, write 0, timeout 3978
Requests/sec:    787.03
Transfer/sec:    441.94KB

Simulate a request 1s inside the server, then return a JSON

Sync

Running 10s test @ http://localhost:8000/myapp/sync/gateway_1s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.49s   245.81ms   2.00s    66.50%
    Req/Sec    66.44     74.91   490.00     89.22%
  9192 requests in 10.07s, 2.35MB read
  Socket errors: connect 0, read 0, write 0, timeout 3380
Requests/sec:    912.86
Transfer/sec:    238.94KB

Async

Running 10s test @ http://localhost:8000/myapp/async/gateway_1s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.44s   269.68ms   2.00s    60.46%
    Req/Sec    67.22     70.76   590.00     86.94%
  10597 requests in 10.08s, 2.71MB read
  Socket errors: connect 0, read 0, write 0, timeout 2774
Requests/sec:   1051.56
Transfer/sec:    275.21KB

Simulate a request 3s inside the server, then return a JSON

Sync

Running 10s test @ http://localhost:8000/myapp/sync/gateway_3s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec    65.69     78.46   505.00     88.74%
  4381 requests in 10.05s, 1.12MB read
  Socket errors: connect 0, read 0, write 0, timeout 4381
Requests/sec:    436.10
Transfer/sec:    114.14KB

Async

Running 10s test @ http://localhost:8000/myapp/async/gateway_3s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.48s   236.21ms   2.00s    63.70%
    Req/Sec    62.13     70.03   530.00     88.32%
  9284 requests in 10.07s, 2.37MB read
  Socket errors: connect 0, read 0, write 0, timeout 3064
Requests/sec:    922.30
Transfer/sec:    241.41KB

Simulate a request 10s inside the server, then return a JSON

Sync

Running 10s test @ http://localhost:8000/myapp/sync/gateway_10s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     2.00      0.82     3.00     50.00%
  94 requests in 10.10s, 25.09KB read
  Socket errors: connect 0, read 0, write 0, timeout 94
Requests/sec:      9.30
Transfer/sec:      2.48KB

Async

Running 10s test @ http://localhost:8000/myapp/async/gateway_10s
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00    100.00%
  3 requests in 10.08s, 804.00B read
  Socket errors: connect 0, read 0, write 0, timeout 3
Requests/sec:      0.30
Transfer/sec:      79.74B

Brotli

Sync

Running 10s test @ http://localhost:8000/myapp/sync/brotli
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.17s   540.89ms   2.00s    57.58%
    Req/Sec    22.56     30.03   363.00     90.41%
  2575 requests in 10.10s, 16.46MB read
  Socket errors: connect 0, read 0, write 0, timeout 1948
Requests/sec:    254.92
Transfer/sec:      1.63MB

Async

Running 10s test @ http://localhost:8000/myapp/async/brotli
  20 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   975.61ms  236.78ms   1.56s    72.91%
    Req/Sec    45.34     69.98   434.00     91.47%
  1384 requests in 10.09s, 8.85MB read
  Socket errors: connect 0, read 1144, write 0, timeout 775
Requests/sec:    137.11
Transfer/sec:      0.88MB
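Many of these runs are dominated by wrk timeouts. For example, the sync gateway_10s run reports 94 requests and 94 timeouts. A small helper makes it easier to compare summaries across -Xgil settings. This is a minimal sketch: the regular expressions assume wrk's summary format exactly as printed above. wrk tallies timeouts separately from completed requests, so treat the ratio only as a rough indicator.

```python
import re


def summarize_wrk(output: str) -> dict:
    """Extract request count, timeout count and throughput from a wrk summary."""
    requests = int(re.search(r"(\d+) requests in", output).group(1))
    # Treat a missing "Socket errors" line as zero timeouts.
    timeout_match = re.search(r"timeout (\d+)", output)
    timeouts = int(timeout_match.group(1)) if timeout_match else 0
    rps = float(re.search(r"Requests/sec:\s+([\d.]+)", output).group(1))
    return {
        "requests": requests,
        "timeouts": timeouts,
        "requests_per_sec": rps,
        # Rough indicator only: timeouts are counted separately from completed requests.
        "timeouts_per_request": timeouts / requests if requests else 0.0,
    }


# Excerpt from the sync gateway_10s run above.
SAMPLE = """\
  94 requests in 10.10s, 25.09KB read
  Socket errors: connect 0, read 0, write 0, timeout 94
Requests/sec:      9.30
"""

if __name__ == "__main__":
    print(summarize_wrk(SAMPLE))
```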

@tonybaloney
Contributor

I have the same observation. Neither -Xgil=1 nor PYTHON_GIL=1 seems to force the GIL on. At least, with either one set, multithreaded code doesn't show the old behaviour of slow multithreading.
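One way to confirm what the interpreter did with the flag is to ask it at runtime. This is a minimal sketch that assumes CPython 3.13. There, the private, underscore-prefixed `sys._is_gil_enabled()` reports the current GIL state, and the `Py_GIL_DISABLED` config variable marks a free-threaded build. On older versions the helper falls back to assuming the GIL is on.

```python
import sys
import sysconfig


def build_is_free_threaded() -> bool:
    """True if this interpreter was built with --disable-gil."""
    # Py_GIL_DISABLED is 1 on free-threaded builds and 0 or None otherwise.
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


def gil_enabled_at_runtime() -> bool:
    """True if the GIL is currently active in this process."""
    # sys._is_gil_enabled() is a private helper added in 3.13;
    # older interpreters always run with the GIL.
    check = getattr(sys, "_is_gil_enabled", None)
    return True if check is None else check()


if __name__ == "__main__":
    print(f"free-threaded build: {build_is_free_threaded()}")
    print(f"GIL enabled now:     {gil_enabled_at_runtime()}")
```

On a free-threaded build, switching between PYTHON_GIL=1 and PYTHON_GIL=0 should flip the second line. Calling `gil_enabled_at_runtime()` from inside a benchmark shows whether the setting actually reached that particular process.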

@mpage
Contributor

mpage commented May 10, 2024

Are you using the benchmark in the issue?

@tonybaloney
Contributor

No, I have my own benchmark.

Benchmark Code

import pyperf
from multiprocessing import Process
from threading import Thread
try:
    import _xxsubinterpreters as interpreters
except ImportError:
    import _interpreters as interpreters

import itertools

DEFAULT_DIGITS = 2000
icount = itertools.count
islice = itertools.islice

def gen_x():
    return map(lambda k: (k, 4 * k + 2, 0, 2 * k + 1), icount(1))


def compose(a, b):
    aq, ar, as_, at = a
    bq, br, bs, bt = b
    return (aq * bq,
            aq * br + ar * bt,
            as_ * bq + at * bs,
            as_ * br + at * bt)


def extract(z, j):
    q, r, s, t = z
    return (q * j + r) // (s * j + t)


def gen_pi_digits():
    z = (1, 0, 0, 1)
    x = gen_x()
    while 1:
        y = extract(z, 3)
        while y != extract(z, 4):
            z = compose(z, next(x))
            y = extract(z, 3)
        z = compose((10, -10 * y, 0, 1), z)
        yield y


def calc_ndigits(n=DEFAULT_DIGITS):
    return list(islice(gen_pi_digits(), n))

test ="""
import itertools

DEFAULT_DIGITS = 2000
icount = itertools.count
islice = itertools.islice

def gen_x():
    return map(lambda k: (k, 4 * k + 2, 0, 2 * k + 1), icount(1))


def compose(a, b):
    aq, ar, as_, at = a
    bq, br, bs, bt = b
    return (aq * bq,
            aq * br + ar * bt,
            as_ * bq + at * bs,
            as_ * br + at * bt)


def extract(z, j):
    q, r, s, t = z
    return (q * j + r) // (s * j + t)


def gen_pi_digits():
    z = (1, 0, 0, 1)
    x = gen_x()
    while 1:
        y = extract(z, 3)
        while y != extract(z, 4):
            z = compose(z, next(x))
            y = extract(z, 3)
        z = compose((10, -10 * y, 0, 1), z)
        yield y


def calc_ndigits(n=DEFAULT_DIGITS):
    return list(islice(gen_pi_digits(), n))
calc_ndigits()
"""

def bench_threading(n):
    # Code to launch specific model
    threads = []
    for _ in range(n):
        t = Thread(target=calc_ndigits)
        t.start()
        threads.append(t)
    for thread in threads:
        thread.join()

def bench_subinterpreters(n, site=True):
    # Code to launch specific model
    def _spawn_sub():
        sid = interpreters.create()
        interpreters.run_string(sid, test)
        interpreters.destroy(sid)

    threads = []
    for _ in range(n):
        t = Thread(target=_spawn_sub)
        t.start()
        threads.append(t)
    for thread in threads:
        thread.join()

def bench_multiprocessing(n):
    # Code to launch specific model
    processes = []
    for _ in range(n):
        t = Process(target=calc_ndigits)
        t.start()
        processes.append(t)
    for process in processes:
        process.join()

if __name__ == "__main__":
    runner = pyperf.Runner()
    runner.metadata['description'] = "Benchmark execution models"
    n = 10
    runner.bench_func('threading', bench_threading, n)
    runner.bench_func('subinterpreters', bench_subinterpreters, n)
    runner.bench_func('multiprocessing', bench_multiprocessing, n)

Results when running a CPython without --disable-gil:

[Screenshot: output_gil1_compiled]

Running the benchmark on a no-GIL (--disable-gil) build with PYTHON_GIL=1 uses all 4 CPU cores and gives the fastest result (faster than PYTHON_GIL=0):
[Screenshot: output_gil1]

@jefer94
Author

jefer94 commented May 10, 2024

Yes, but it seems like a Gunicorn issue:

  • Gunicorn + Uvicorn (ASGI) with threads: no difference between -Xgil=0 and -Xgil=1.
  • Gunicorn + Uvicorn (ASGI) with workers: -Xgil=0 improves performance.
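One possible explanation for the threads case, offered as a hypothesis rather than a diagnosis: gunicorn's documentation says the `threads` setting only affects the gthread worker type. That suggests it may have no effect under UvicornWorker, whatever the GIL mode. The following sketch of a gunicorn.conf.py shows the two setups. All values are hypothetical, and it uses only standard gunicorn settings (`bind`, `worker_class`, `workers`, `threads`).

```python
# gunicorn.conf.py -- hypothetical settings for the two setups compared above.
import os

bind = "127.0.0.1:8000"
worker_class = "uvicorn.workers.UvicornWorker"

# "Workers" setup: several worker processes. Each is a separate process, so
# requests can be served in parallel even when the GIL is enabled.
workers = os.cpu_count() or 1

# "Threads" setup (commented out). Gunicorn's documentation says `threads`
# only affects the gthread worker type, so with UvicornWorker it is not
# expected to change how requests are handled.
# workers = 1
# threads = 20
```

It could be launched as `PYTHON_GIL=0 gunicorn -c gunicorn.conf.py myproject.asgi:application`, where `myproject.asgi:application` is a placeholder for the real ASGI entry point.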

@mpage
Contributor

mpage commented May 10, 2024

@tonybaloney - That's really strange. I modified your benchmark to remove the pyperf dependency and only run the threading benchmark. I get the following:

> time ./python -Xgil=1 ~/local/scratch/repro_118874_calcpi.py

real	0m3.309s
user	0m3.328s
sys	0m0.049s
> time env PYTHON_GIL=1 ./python ~/local/scratch/repro_118874_calcpi.py

real	0m3.293s
user	0m3.316s
sys	0m0.040s
> time ./python -Xgil=0 ~/local/scratch/repro_118874_calcpi.py

real	0m0.455s
user	0m3.474s
sys	0m0.022s
> time env PYTHON_GIL=0 ./python ~/local/scratch/repro_118874_calcpi.py

real	0m0.379s
user	0m3.227s
sys	0m0.019s
>
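The real/user split in these timings is what tells the story. With -Xgil=0, user time (about 3.2 to 3.5s) is several times the wall-clock time (about 0.4s), so the threads ran in parallel. With the GIL forced on, the two are roughly equal. A pyperf-free harness can report the same ratio from inside Python. This is a minimal sketch, and `busy_work` is a hypothetical CPU-bound stand-in, not the calc_ndigits workload used above.

```python
import threading
import time


def busy_work(iterations: int = 200_000) -> int:
    """Hypothetical CPU-bound stand-in for calc_ndigits()."""
    total = 0
    for i in range(iterations):
        total += i * i % 7
    return total


def time_threads(num_threads: int, iterations: int = 200_000) -> tuple[float, float]:
    """Run busy_work in num_threads threads; return (wall_seconds, cpu_seconds)."""
    threads = [
        threading.Thread(target=busy_work, args=(iterations,))
        for _ in range(num_threads)
    ]
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + system CPU time of the whole process
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


if __name__ == "__main__":
    wall, cpu = time_threads(10)
    # cpu / wall close to 1 suggests the threads took turns (GIL on);
    # a ratio well above 1 suggests they ran in parallel (GIL off).
    print(f"wall={wall:.3f}s cpu={cpu:.3f}s cpu/wall={cpu / wall:.1f}")
```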

I'm unfamiliar with the internals of pyperf. Does it use subprocesses to run the benchmarks? If so, is it possible that it's not passing along the -X option and PYTHON_GIL environment variable properly when executing the benchmarks?

Also what version of CPython are you using?
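The hypothesis above is that a parent process launching benchmark workers can drop the settings. A child started through `subprocess` inherits environment variables, so PYTHON_GIL survives if the parent passes its environment along. `-X` options, however, are per-process command-line flags and must be re-added explicitly. The following is a minimal sketch of a launcher that forwards both. `sys._xoptions` is CPython-specific, and the helper names are hypothetical; this is not how pyperf is implemented.

```python
import os
import subprocess
import sys


def forwarded_xflags(xoptions: dict) -> list[str]:
    """Rebuild -X command-line flags from a dict shaped like sys._xoptions."""
    flags = []
    for name, value in xoptions.items():
        # A bare "-X name" is stored as True; "-X name=value" stores the string.
        flags.append(f"-X{name}" if value is True else f"-X{name}={value}")
    return flags


def run_child(code: str) -> str:
    """Run `code` in a fresh interpreter, forwarding -X flags and the environment."""
    argv = [sys.executable, *forwarded_xflags(sys._xoptions), "-c", code]
    # Passing os.environ explicitly means PYTHON_GIL reaches the child if set.
    result = subprocess.run(
        argv, env=dict(os.environ), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(forwarded_xflags(sys._xoptions))
    print(run_child("import sys; print(sys._xoptions)"))
```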

@tonybaloney
Contributor

Also what version of CPython are you using?

3.13b1, built from source using the tag released yesterday.

I think you're right that pyperf isn't propagating the flag. I'll raise an issue there.

@corona10
Member

Since the issue is related to pyperf, I will follow up on it :)

@mpage
Copy link
Contributor

mpage commented May 10, 2024

@corona10 - Can this be closed since it's a pyperf, not CPython, issue?

@Eclips4 closed this as not planned on May 10, 2024.