⚡ Refactor thumbnail generation and scanner IO operations #426

Merged: 29 commits into experimental from al/perf/thumbnailer, Sep 8, 2024

Conversation

aaronleopold
Collaborator

@aaronleopold aaronleopold commented Aug 30, 2024

⚠️ There are potentially breaking changes in this PR from the restructuring of library options. The migration was hand-written, and while I tested it on my personal experimental instance without issue, I am still wary enough to put out this disclaimer.

There is a lot of good and important context in this PR and the referenced issue #427. I won't rewrite it all, but I will summarize the actual changes below:

  • Async-ify IO operations where possible
  • Refactor the thumbnail generation job and scanner job to remove rayon usage for heavy IO operations
    • These operations are now scoped to spawned blocking threads
  • Make generating file hashes and processing metadata optional during scan
  • Use a distroless final image (67% reduced image size)
  • Add two new configuration options:
    • STUMP_MAX_SCANNER_CONCURRENCY: The maximum number of concurrent files which may be processed by a scanner (default 200)
    • STUMP_MAX_THUMBNAIL_CONCURRENCY: The maximum number of concurrent files which may be processed by a thumbnail generator (default 50)
  • Greatly improved UX around job status reporting on the UI
    • Fixed bug with how tasks and subtasks were being reported
    • Fixed bug where a job in-flight would not be added to the jobs state upon page refresh
    • Add intermediate state while job is doing setup tasks

This PR aims to (hopefully) improve the performance of both the thumbnail generation and scanner jobs. I don't think the changes are quite ready, but I'm opening the PR to gather feedback and give myself time for proper review and testing.

The main thing to highlight is that I am trying to move away from rayon for anything IO-related, which, as I'm learning, is actually a misuse of that library, and towards dedicated blocking threads. I've also added two concurrency-related configuration options in the hope of providing better options for machines of varying power.
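
For context, the pattern is roughly the following: per-file IO runs on dedicated blocking threads via `tokio::task::spawn_blocking`, with a semaphore capping how many files are in flight at once. This is a minimal sketch of the idea, not the actual Stump code; `process_file` is a hypothetical stand-in for the real per-file work (hashing, metadata extraction, page counting).

```rust
use std::{path::PathBuf, sync::Arc};
use tokio::{sync::Semaphore, task::JoinSet};

// Hypothetical stand-in for the real per-file work (hashing, metadata, page counting)
fn process_file(path: &PathBuf) -> std::io::Result<()> {
    let _bytes = std::fs::read(path)?; // heavy, blocking IO happens here
    Ok(())
}

async fn scan_files(paths: Vec<PathBuf>) {
    // Mirrors the idea behind STUMP_MAX_SCANNER_CONCURRENCY (default 200)
    let max_concurrency = std::env::var("STUMP_MAX_SCANNER_CONCURRENCY")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(200);
    let semaphore = Arc::new(Semaphore::new(max_concurrency));
    let mut tasks = JoinSet::new();

    for path in paths {
        let permit = semaphore.clone().acquire_owned().await.expect("semaphore closed");
        tasks.spawn(async move {
            // The blocking IO is scoped to a spawned blocking thread instead of rayon
            let result = tokio::task::spawn_blocking(move || process_file(&path)).await;
            drop(permit); // release the slot as soon as this file is done
            result
        });
    }

    while let Some(joined) = tasks.join_next().await {
        if let Err(e) = joined {
            eprintln!("task panicked or was cancelled: {e}");
        }
    }
}
```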


codecov bot commented Aug 31, 2024

Codecov Report

Attention: Patch coverage is 8.51064% with 731 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
core/src/filesystem/scanner/utils.rs 0.00% 316 Missing ⚠️
apps/server/src/routers/api/v1/library.rs 0.00% 75 Missing ⚠️
core/src/filesystem/image/thumbnail/utils.rs 0.00% 56 Missing ⚠️
core/src/filesystem/scanner/library_scan_job.rs 0.00% 41 Missing ⚠️
...e/src/filesystem/image/thumbnail/generation_job.rs 0.00% 39 Missing ⚠️
core/src/filesystem/image/thumbnail/generate.rs 0.00% 32 Missing ⚠️
core/src/filesystem/media/process.rs 27.27% 32 Missing ⚠️
apps/server/src/routers/api/v1/media.rs 0.00% 30 Missing ⚠️
core/src/filesystem/common.rs 0.00% 28 Missing ⚠️
apps/server/src/routers/api/v1/series.rs 0.00% 26 Missing ⚠️
... and 16 more
Files with missing lines Coverage Δ
apps/server/src/routers/utoipa.rs 0.00% <ø> (ø)
core/src/config/stump_config.rs 83.59% <100.00%> (+0.26%) ⬆️
core/src/db/entity/media/entity.rs 0.00% <ø> (ø)
core/src/filesystem/image/mod.rs 100.00% <ø> (+25.00%) ⬆️
core/src/filesystem/media/builder.rs 69.71% <100.00%> (-1.62%) ⬇️
core/src/filesystem/media/format/epub.rs 37.31% <100.00%> (ø)
core/src/filesystem/media/format/pdf.rs 27.31% <100.00%> (ø)
core/src/filesystem/media/format/rar.rs 61.26% <100.00%> (ø)
core/src/filesystem/media/format/zip.rs 75.53% <100.00%> (ø)
core/src/filesystem/media/mod.rs 100.00% <ø> (ø)
... and 29 more

@aaronleopold
Collaborator Author

Note to self: the last commit broke thumbnail loading on the client

@aaronleopold
Collaborator Author

aaronleopold commented Sep 3, 2024

So I ran a little experiment against these changes:

SMB-mounted library (~1543 books total) hooked up to an external HDD 🐢

The initial scan:

5 hours 22 min to scan 1543 books (default concurrency of 200 files processed at a time)

Full details
  • For the first 4 hours memory didn't go above ~28-29MB during my periodic observations, but it then slowly rose to ~35MB by the 5th hour and fluctuated between 33-36MB. I believe this is just the nature of a long-running process that tracks a growing internal state, which on a non-bottlenecked setup would likely be negligible, though I can't say for sure that's the case here. I will say that no errors were thrown, so the array of logs to be dumped into the DB never grew, and while the task queue does "grow", I wouldn't expect the underlying data to account for anywhere near that delta.
    • Once the scan completed, memory usage did start dropping immediately, but then the thumbnail generation started and... well, it shot back up lol
    • When inspecting the files the process had open, the ~/.stump/Stump.log file consistently showed as open. I have to assume this is just a tailing operation and that tracing isn't buffering the entire log in memory. Worth noting though.
  • Palpably slow media generation, but that's somewhat to be expected from this combination IMO
    • Side note: I do think the progress tracking on the UI is better than before by a large margin. However, the initial load time of provisioning the blocking threads on a severely bottlenecked setup like this still makes it feel like it's hanging for a second here and there. I think the indeterminate state I added helps, but still worth noting.
  • The job setup was actually very quick, quick enough that I wasn't able to time it before I had the thought to do so. This further suggests that basic traversal of the files isn't really the issue here, rather one of (or a combination of) the processing steps during media generation (hashing, metadata extraction, page counting).
  • DB writes were almost instantaneous, not surprising but supports filesystem IO being the bottleneck for slow scans

Some thoughts re: potential areas for performance improvements:

  • File hashing is currently a trait fn which each processor independently implements, using crates specific to the format (e.g. zip for CBZ/ZIP) to take X pages as the sample to then hash. It might be better to have a single implementation that uses a smaller sample of the file, likely resulting in smaller reads (see the sketch after this list). However, this could make duplicate book detection less accurate, so it's a trade-off.
    • File hash is a nullable field in the DB, so it could be worth considering making this an opt-in feature for users who don't care about duplicate detection. That is practically the only use case for this field.
  • Metadata extraction is currently required, not opt-in. Besides the hashing, this should be the only read operation during the scan process. It might be worth considering a way to skip this step for users who don't care about metadata, or prefer metadata extraction to be done on-demand as a follow-up task.
    • Note that I am making assumptions about the crates used per format, e.g. I assume zip doesn't read the entire file to extract metadata used for iteration (e.g. file names), but I could be wrong.
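
To make that trade-off concrete, a single shared sample-based hash could look something like the sketch below. This is under my own assumptions rather than how the per-format processors do it today; the function name, the fixed-prefix sampling, and the use of `DefaultHasher` are all illustrative.

```rust
use std::{
    collections::hash_map::DefaultHasher,
    fs::File,
    hash::{Hash, Hasher},
    io::Read,
    path::Path,
};

// Hash only the first `sample_size` bytes of the file plus its length, instead of
// extracting and hashing X pages via format-specific crates. A real implementation
// would likely use a proper content hash rather than `DefaultHasher`.
fn sampled_file_hash(path: &Path, sample_size: usize) -> std::io::Result<u64> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; sample_size];
    let read = file.read(&mut buf)?; // may read fewer bytes than requested
    let mut hasher = DefaultHasher::new();
    buf[..read].hash(&mut hasher);
    // Folding in the file size helps disambiguate files that share a common prefix
    file.metadata()?.len().hash(&mut hasher);
    Ok(hasher.finish())
}
```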

TLDR; the scan was slow, but I think it's mostly due to the bottlenecked setup. Memory usage was relatively stable, and there are some potential areas for performance improvements around adding opt-in features for metadata extraction and file hashing.

EDIT: I started another run with the hashing and ComicInfo.xml parsing logic commented out. It now "processes" files (just page counting) in a few microseconds, so I think it's safe to say those two operations make a big difference.
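
If I want to attribute time per step more rigorously on a future run, a throwaway timing wrapper would do it. The step functions referenced in the comment are hypothetical names, not the real Stump functions.

```rust
use std::time::Instant;

// Wrap any processing step and print how long it took
fn time_step<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    println!("{label} took {:?}", start.elapsed());
    out
}

// Usage sketch (illustrative names only):
// let pages = time_step("page count", || count_pages(&path));
// let hash = time_step("file hash", || hash_file(&path));
// let meta = time_step("metadata", || extract_metadata(&path));
```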

The thumbnail generation:

Est ~7ish hours to generate 1543 thumbnails (200x350 px, 0.75 quality, webp format, default concurrency of 50 files processed at a time) 🐢
Note: I got impatient and stopped the process after 2 hours, so this is a rough estimate
Note also: cancel functionality "worked" but is both bugged and improved with these changes lol (it successfully cancelled faster, but was marked as completed)

Full details
  • Averaged 4.4 thumbnails per minute (I know, a fractional rate doesn't quite make sense), which means it would take ~5-6 hours to generate all thumbs (1543 / 4.4 ≈ 351 minutes, 351 / 60 ≈ 5.85 hours)
  • Memory spiked somewhat for this job, at least initially, jumping to ~76MB as soon as the main sub-task loop started. It varied though; I never saw it go above 90MB, and the lowest I saw was 44.2MB. It mostly hovered between 48-50MB, which I'll attribute to the varying sizes of images pulled from the files being held in memory, since it was relatively stable across batches of files (50 are allowed at once, but I didn't track it that closely to say for certain)

Some thoughts re: memory consumption:

  • Until the person who reported [BUG?] very slow in creating library #427 has a chance to retest using the image with these changes, I can't really say for sure if the memory usage is improved. I do think at the very least it is faster now than before, but I couldn't replicate the growing memory issue with this setup.
    • For argument's sake, I sampled the first 20 CBZ files in the library and found the first pages range from 2-4MB in size. Taking the average of 3MB and multiplying by 50 (the concurrency limit) gives 150MB we could hypothetically hold in memory while generating thumbnails. We resize the image in memory, so add a little for each resized buffer, then we convert it and finally dump it to disk. That doesn't add up to the GBs of memory reported, though it is also more than what I observed the process actually using
    • An additional note: on each stream loop iteration (when a future resolves and its result is available), we handle the result. By the end of an iteration, I'd expect the returned buffer to be dropped and the memory freed (I'm not even using it). Perhaps a manual drop is needed? These changes aren't present in the version in which the memory issue was reported (see the sketch below)
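
For reference, the loop I'm describing is shaped roughly like the following; `generate_thumbnail` is a hypothetical stand-in for the resize/encode/write step, and in this shape the returned buffer goes out of scope at the end of each match arm.

```rust
use futures::{stream, StreamExt};
use std::path::{Path, PathBuf};

// Hypothetical stand-in: resize the first page, encode to webp, write to disk,
// and return the encoded bytes
fn generate_thumbnail(_path: &Path) -> std::io::Result<Vec<u8>> {
    todo!()
}

async fn generate_all(paths: Vec<PathBuf>, max_concurrency: usize) {
    let mut completed = 0usize;
    let mut results = stream::iter(paths)
        .map(|path| tokio::task::spawn_blocking(move || generate_thumbnail(&path)))
        // At most `max_concurrency` files in flight, e.g. STUMP_MAX_THUMBNAIL_CONCURRENCY (50)
        .buffer_unordered(max_concurrency);

    while let Some(joined) = results.next().await {
        match joined {
            // `_buffer` (the encoded bytes) is dropped at the end of this arm
            Ok(Ok(_buffer)) => completed += 1,
            Ok(Err(e)) => eprintln!("thumbnail generation failed: {e}"),
            Err(e) => eprintln!("blocking task failed: {e}"),
        }
    }
    println!("generated {completed} thumbnails");
}
```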

TLDR; the generation was also slow, but I do think it's improved from before. I could not replicate the memory issue reported in #427; in fact, I saw relatively stable and low memory usage throughout the process

All this has definitely been interesting to investigate, though I wish I had a clearer understanding of the linked issue. I'll try to set up an NFS mount for a follow-up, which hypothetically should result in improved speeds

@aaronleopold
Collaborator Author

Follow-up experiment:

NFS-mounted library (~1543 books total) hooked up to an external HDD (still 🐢)

The initial scan:

3 hours 20 min to scan 1543 books (default concurrency of 200 files processed at a time)

Notes
  • Memory "spiked" earlier in the scan, getting to 44MB by 2 hours. I didn't really see it trend upwards, it would fluctuate down to 38-41MB and mostly hover in that range. I think this could just be correlated to the files being processed, which are unordered and differ each scan.
    • I did not notice memory drop after the scan completed, perhaps the OS kept the allocated memory as a cache intead of freeing it? I can experiment with a different allocator if needed but we are also talking about a relatively small MB so 🤷
  • Still overall slow media generation, but I'd say significantly faster than SMB (took ~2 hours less)

Some thoughts re: potential areas for performance improvements:

  • See the note about the allocator above; maybe something like mimalloc (a sketch of that swap is below)? It might be premature, since I'm observing a small number of MBs hanging around, not GBs, and not a steady increase as reported in [BUG?] very slow in creating library #427. Also worth noting that the thumbnail generation tasks start soon after the scan, so the scan might not have had time to free the memory
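
For reference, swapping the global allocator to something like mimalloc is a tiny change; the crate version here is an assumption.

```rust
// Cargo.toml (assumed): mimalloc = "0.1"
use mimalloc::MiMalloc;

// Replace the default system allocator for the whole binary, so we can check
// whether the retained MBs after a scan are an allocator/OS caching artifact
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // ...the server bootstraps as usual...
}
```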

TLDR; the scan was still slow and memory usage was about the same; any speed difference is likely just NFS vs SMB. I feel this doesn't really shed any light on the issues reported in #427, as I couldn't replicate the memory issues with either SMB or NFS mounts.

I didn't try analyzing the thumbnail generation, canceling it after a few minutes. I assume it would be similar to the scan re: overall faster but similar memory usage.

@aaronleopold
Collaborator Author

(Screenshot attached: 2024-09-04, 5:02 PM)

@aaronleopold
Collaborator Author

This comment #427 (comment) is really important context. TLDR; some internals/configuration in the zip crate changed, which caused a huge performance regression. There are still some lingering memory issues, but they're orders of magnitude smaller after downgrading. For reference, the NFS test took 8 minutes after downgrading (down from 3 hours 20 min lol)

@aaronleopold aaronleopold changed the title [DRAFT] ⚡ Refactor thumbnail generation and scanner IO operations ⚡ Refactor thumbnail generation and scanner IO operations Sep 6, 2024
@aaronleopold
Collaborator Author

I'm going to do a last pass review and aim to merge this in today. I want to call out the warning in the PR description:

⚠️ There are potentially breaking changes in this PR from the restructuring of library options. The migration was hand-written, and while I tested it on my personal experimental instance without issue, I am still wary enough to put out this disclaimer.

I will promote the current experimental to develop and then merge this shortly after (once I give it a last review). I'm somewhat eager to get these improvements out there, so I'll likely not let these changes sit in experimental for more than a week or two. If anyone is able to help with testing I'd appreciate it!

@aaronleopold
Collaborator Author

I couldn't get the plot to work in this bash script, if anyone wants to give it a go. I created it as a quick debugging tool to try to visualize a long-running scan

@aaronleopold aaronleopold marked this pull request as ready for review September 8, 2024 06:31
@aaronleopold aaronleopold merged commit 5f9871b into experimental Sep 8, 2024
8 checks passed
@aaronleopold aaronleopold deleted the al/perf/thumbnailer branch September 8, 2024 06:32
aaronleopold added a commit that referenced this pull request Sep 14, 2024
* ⚡ Refactor thumbnail generation

* ⚡ Refactor scanner IO operations

Push up last nights changes

* debug: push to new tag for testing

* async-ify IO operations outside scan context

* debug: add tracing

* wip: rename library options and add fields

requires migration still

* wip: rename library options, fix frontend

* handwrite migration 😬

* fix ui

* NON FUNCTIONAL: wip migration

kill me

* debug: stop auto pushing

* fix migration?

* wip: support processing options and restructure

* 🥪 lunch allocator experiment

* super wip: distroless image, zip downgrade

* cleanup docker stuff

* Revert "debug: stop auto pushing"

This reverts commit bd6da98.

* remove missed feature

* fix job upsert after refresh

* cleanup comments and wip review

* Reapply "debug: stop auto pushing"

This reverts commit f43c187.

* cleanup
@aaronleopold aaronleopold mentioned this pull request Oct 11, 2024