-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Copying multiple larger files can hang the copy if autoreload is on #3986
Comments
Confirming workaround: suspending autoreload through This was discovered while programming Big Honking Buttons. The factory setup script copies a set of files to the device - one set is the following WAV files:
These in particular trigger the issue and would cause writes to the device to hang. It seems BHB's library code accessing the filesystem at startup may contribute to triggering the issue more often as well. As a workaround, we are forcing the device into the REPL before copying files:
With that workaround, we can program devices consistently without hanging:
|
I found the copying is much slower if the serial connection is not open. This test is with Not connected:
Connected, in REPL or not (doesn't seem to matter on timing), and repeating the commands above, I see much smaller delays, still a few seconds for copying larger files (sometimes two at a time). I do not see the pulsing green NeoPixel as above. I am not sure this completely jives with @theacodes' technique of forcing the REPL, and having trouble before that. |
Should I try with just opening a serial connection and not sending
anything?
…On Tue, Jan 12, 2021, 11:08 PM Dan Halbert ***@***.***> wrote:
I found the copying is much slower if the serial connection is not open.
This test is with code.py just being print("Hello, world").
Not connected:
***@***.***:~/Downloads/sounds$ /bin/rm /media/halbert/CIRCUITPY/*.wav;sync
***@***.***:~/Downloads/sounds$ cp -v *.wav /media/halbert/CIRCUITPY/
'hit.wav' -> '/media/halbert/CIRCUITPY/hit.wav'
'idle.wav' -> '/media/halbert/CIRCUITPY/idle.wav'
[big delay; NeoPixel pulses green]
'off.wav' -> '/media/halbert/CIRCUITPY/off.wav'
[big delay; NeoPixel pulses green]
'on.wav' -> '/media/halbert/CIRCUITPY/on.wav'
[big delay; NeoPixel pulses green]
'swing.wav' -> '/media/halbert/CIRCUITPY/swing.wav'
[big delay; NeoPixel pulses green]
Connected, in REPL or not (doesn't seem to matter on timing), and
repeating the commands above, I see much smaller delays, still a few
seconds for copying larger files (sometimes two at a time). I do *not*
see the pulsing green NeoPixel as above.
I am not sure this completely jives with @theacodes
<https://github.com/theacodes>' technique of forcing the REPL, and having
trouble before that.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3986 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAB5I445BKEIZVGTJMQE3DLSZUME3ANCNFSM4WADE4LQ>
.
|
I thought you were doing that initially, and you added the "getting into the REPL" part. But if were not opening the connection before, yes, you could try that. I think this may have to do with not calling the tinyusb "do tasks" call often enough. It probably gets called more often when a serial connection is open, because there's traffic that generates interrupts. The fix may be to enable tick interrupts all the time if there's any kind of USB connection, including MSC and CDC. I'm going to try that soon. |
I am able to reproduce the issue with tip of main by copying all sample files here https://github.com/theacodes/Winterbloom-Big-Honking-Button/tree/master/samples . Bus analyzer show that firmware didn't response to the SCSI status phase, either the stack failed to manage the state and/or could be another buffer overflow, or maybe the tud_task() isn't called e.g cpy doesn't think it needs to call the usb task. Eventually the host have to issue a bus reset -> re-enumerate -> host tries again with on-going files which is the huge pause we see. I will do more testing with different MCUs, and scenario, will update with more details later on. |
Thanks!! I am going to instrument the calls to |
@hathach In 5.3.1, we called That was changed in #2879 during 6.0 development. Now circuitpython/supervisor/shared/usb/usb.c Lines 97 to 100 in 409b5ff
(It is not called in the interrupt handler, but it is put on a list of background routines to call (once) when they are given a chance to run.) I think the idea was that only a USB interrupt would generate a task for TinyUSB to handle, and otherwise TinyUSB would have no work to do. But I don't think that's right, is it? It appears there are other times besides an interrupt when As an experiment, I put But I am still worried that we need to call #2868 may or may not be related to this (fixed an nrf problem), but I am not sure at all. You reviewed that PR extensively. @jepler Do you have any comment on the above (especially #2879 changes)? |
Re #2868 I don't mind at all modifying how we control when |
Maybe we need to queue up tud_task after we queue a response back into tinyusb. For example, maybe we need to trigger tud_task after a write completes. |
@dhalbert thanks for detail explanation, most of the usb task job is queued by ISR, however, there is a few exceptions especially with MSC. E.g when scsi read/write callbback return 0 as busy indicator. The whole transfer is already complete, but since the application is not ready, tinyusb will need to re-schedule the event by itself without ISR (re-submit event to task queue). https://github.com/hathach/tinyusb/blob/master/src/class/msc/msc_device.c#L366 It is probably our case here if we do However, I haven't narrowed it down to exactly specific cause just yet, since I will be switching to other works requested by @tannewt (from slack) which has a tight timing. I will come back to this one later on. |
Will |
I think I can test to see if that is true. |
Our |
@tannewt @hathach I tested this on a variety of boards with 6.2.0-beta.1. On the problem boards, the writes are severely delayed or never complete. The different behaviors may help in debugging. Metro M0 Express (SAMD21): problem: finishes, but Feather nRF52840: fine |
thanks @dhalbert , I will take a break from rp2040 and switch to troubleshoot this tomorrow. There are a few of RP2040 usb issues (usbnet, webusb etc..) though those will probably take time since I will need to go through datasheet reading. Will ping you if I could find out anything. |
update: tracing down the background callback, it does solve the race condition pretty well. However, the issued is caused by run_code.py decide to reload (AUTO_RELOAD) and run background_callback_reset() while background list is still queued up with usb event.
To sum up the issue is not related to usb stack, it seems to be caused by auto-reload while background task are still queued up. We can modify background_callback_run_all() to loop over while callback_head is not NULL. However, there is still race condition where the usb (or other) ISR is triggered during |
@hathach That makes sense to me! I think we need probably need to remove VM initiated task only (not USB.) An alternative would be run all tasks while interrupts are off and then we reset before turning them back on. |
This can be a bit tricky to do, since when
This sounds like a better solution. IMHO, there are 2 ways we could do
Update: I think (1) to remove all VM background is much better, it will prevent similar issue with other ISR. Let me know which task is VM, I could make an PR to remove those only |
I make a quick hack to add usb background after the background_reset() for quick test, it does solve this problems and quickly complete the file copying without issue hathach#1 . There is also a couple of minor code clean up as well. |
I think your 2 idea is the best way to go about it for now. There isn't a single VM task. We don't have a way to know which tasks to drop because we're stopping and those we can to keep doing. Perhaps we should add a critical flag or something. |
Ok, I will make an PR using this method. Update: #4122 is submitted. |
First reported by @theacodes. Considerable discord discussion starting about 7pm ET 2021-1-12 in #circuitpython
Seems to happen fairly often (though not always) on SAMD21. Not sure if it happens on SAMD51. Not tried on other boards.
Example, copying to a Metro M0 Express running near the tip of
main
. First erase the filesystem. Then, do not enter the REPL, but copy multiple larger files, e.g. like this set:For me, one out of two times this caused the copy to hang.
@theacodes reports that a workaround is to copy one file at a time or to enter the REPL, so that auto-reload is not happening, and then copy.
This problem was not evident to @theacodes on 5.3.1.
The text was updated successfully, but these errors were encountered: