⚡ Refactor thumbnail generation and scanner IO operations #426

Merged: 29 commits into experimental from al/perf/thumbnailer, Sep 8, 2024

Conversation

aaronleopold
Collaborator

@aaronleopold aaronleopold commented Aug 30, 2024

⚠️ There are potentially breaking changes in this PR from the restructuring of library options. The migration was hand-written, and while I tested it on my personal experimental instance without issue, I am still wary enough to put out this disclaimer.

There is a lot of good and important context in this PR and the referenced issue #427. I won't rewrite it all, but I will summarize the actual changes below:

  • Async-ify IO operations where possible
  • Refactor the thumbnail generation job and scanner job to remove rayon usage for heavy IO operations
    • These operations are now scoped to spawned blocking threads
  • Make generating file hashes and processing metadata optional during scan
  • Use a distroless final image (67% reduced image size)
  • Add two new configuration options:
    • STUMP_MAX_SCANNER_CONCURRENCY: The maximum number of concurrent files which may be processed by a scanner (default 200)
    • STUMP_MAX_THUMBNAIL_CONCURRENCY: The maximum number of concurrent files which may be processed by a thumbnail generator (default 50)
  • Greatly improved UX around job status reporting on the UI
    • Fixed bug with how tasks and subtasks were being reported
    • Fixed bug where a job in-flight would not be added to the jobs state upon page refresh
    • Add intermediate state while job is doing setup tasks

This PR aims to (hopefully) improve the performance of both the thumbnail generation and scanner jobs. I don't think the changes are quite ready, but I'm opening the PR to gather feedback and give myself time for proper review and testing.

The main thing to highlight is that I am trying to move away from rayon for anything IO-related, which, as I'm learning, is actually a misuse of that library, and towards dedicated blocking threads. I've also added two concurrency-related configuration options in the hope of providing better options for machines of varying power.
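
For context, the pattern is roughly the following: per-file IO runs on dedicated blocking threads via `tokio::task::spawn_blocking`, with a semaphore capping how many files are in flight at once. This is a minimal sketch of the idea, not the actual Stump code; `process_file` is a hypothetical stand-in for the real per-file work (hashing, metadata extraction, page counting).

```rust
use std::{path::PathBuf, sync::Arc};
use tokio::{sync::Semaphore, task::JoinSet};

// Hypothetical stand-in for the real per-file work (hashing, metadata, page counting)
fn process_file(path: &PathBuf) -> std::io::Result<()> {
    let _bytes = std::fs::read(path)?; // heavy, blocking IO happens here
    Ok(())
}

async fn scan_files(paths: Vec<PathBuf>) {
    // Mirrors the idea behind STUMP_MAX_SCANNER_CONCURRENCY (default 200)
    let max_concurrency = std::env::var("STUMP_MAX_SCANNER_CONCURRENCY")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(200);
    let semaphore = Arc::new(Semaphore::new(max_concurrency));
    let mut tasks = JoinSet::new();

    for path in paths {
        let permit = semaphore.clone().acquire_owned().await.expect("semaphore closed");
        tasks.spawn(async move {
            // The blocking IO is scoped to a spawned blocking thread instead of rayon
            let result = tokio::task::spawn_blocking(move || process_file(&path)).await;
            drop(permit); // release the slot as soon as this file is done
            result
        });
    }

    while let Some(joined) = tasks.join_next().await {
        if let Err(e) = joined {
            eprintln!("task panicked or was cancelled: {e}");
        }
    }
}
```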


codecov bot commented Aug 31, 2024

Codecov Report

Attention: Patch coverage is 8.51064% with 731 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
core/src/filesystem/scanner/utils.rs 0.00% 316 Missing ⚠️
apps/server/src/routers/api/v1/library.rs 0.00% 75 Missing ⚠️
core/src/filesystem/image/thumbnail/utils.rs 0.00% 56 Missing ⚠️
core/src/filesystem/scanner/library_scan_job.rs 0.00% 41 Missing ⚠️
...e/src/filesystem/image/thumbnail/generation_job.rs 0.00% 39 Missing ⚠️
core/src/filesystem/image/thumbnail/generate.rs 0.00% 32 Missing ⚠️
core/src/filesystem/media/process.rs 27.27% 32 Missing ⚠️
apps/server/src/routers/api/v1/media.rs 0.00% 30 Missing ⚠️
core/src/filesystem/common.rs 0.00% 28 Missing ⚠️
apps/server/src/routers/api/v1/series.rs 0.00% 26 Missing ⚠️
... and 16 more
Files with missing lines Coverage Δ
apps/server/src/routers/utoipa.rs 0.00% <ø> (ø)
core/src/config/stump_config.rs 83.59% <100.00%> (+0.26%) ⬆️
core/src/db/entity/media/entity.rs 0.00% <ø> (ø)
core/src/filesystem/image/mod.rs 100.00% <ø> (+25.00%) ⬆️
core/src/filesystem/media/builder.rs 69.71% <100.00%> (-1.62%) ⬇️
core/src/filesystem/media/format/epub.rs 37.31% <100.00%> (ø)
core/src/filesystem/media/format/pdf.rs 27.31% <100.00%> (ø)
core/src/filesystem/media/format/rar.rs 61.26% <100.00%> (ø)
core/src/filesystem/media/format/zip.rs 75.53% <100.00%> (ø)
core/src/filesystem/media/mod.rs 100.00% <ø> (ø)
... and 29 more

@aaronleopold
Collaborator Author

Note to self: the last commit broke thumbnail loading on the client

@aaronleopold
Collaborator Author

aaronleopold commented Sep 3, 2024

So I ran a little experiment against these changes:

SMB-mounted library (~1543 books total) hooked up to an external HDD 🐢

The initial scan:

5 hours 22 min to scan 1543 books (default concurrency of 200 files processed at a time)

Full details
  • For the first 4 hours memory didn't go above ~28-29MB during my periodic observations, but it then slowly rose to ~35MB by the 5th hour and fluctuated between 33-36MB. I believe this is just the nature of a long-running process that tracks a growing internal state, which on a non-bottlenecked setup would likely be negligible, though I can't say for sure that's the case here. I will say that no errors were thrown, so the array of logs to be dumped into the DB never grew, and while the task queue does "grow", I wouldn't expect the underlying data to account for anywhere near that delta.
    • Once the scan completed, memory usage did start dropping immediately, but then the thumbnail generation started and... well, it shot back up lol
    • When inspecting the files the process had open, the ~/.stump/Stump.log file consistently showed as open. I have to assume this is just a tailing operation and that tracing isn't buffering the entire log in memory. Worth noting though.
  • Palpably slow media generation, but that's somewhat to be expected from this combination IMO
    • Side note: I do think the progress tracking on the UI is better than before by a large margin. However, the initial load time of provisioning the blocking threads on a severely bottlenecked setup like this still makes it feel like it's hanging for a second here and there. I think the indeterminate state I added helps, but still worth noting.
  • The job setup was actually very quick, quick enough that I wasn't able to time it before I had the thought to do so. This further suggests that basic traversal of the files isn't really the issue here, rather one of (or a combination of) the processing steps during media generation (hashing, metadata extraction, page counting).
  • DB writes were almost instantaneous, not surprising but supports filesystem IO being the bottleneck for slow scans

Some thoughts re: potential areas for performance improvements:

  • File hashing is currently a trait fn which each processor independently implements, using crates specific to the format (e.g. zip for CBZ/ZIP) to take X pages as the sample to then hash. It might be better to have a single implementation that uses a smaller sample of the file, likely resulting in smaller reads (see the sketch after this list). However, this could make duplicate book detection less accurate, so it's a trade-off.
    • File hash is a nullable field in the DB, so it could be worth considering making this an opt-in feature for users who don't care about duplicate detection. That is practically the only use case for this field.
  • Metadata extraction is currently required, not opt-in. Besides the hashing, this should be the only read operation during the scan process. It might be worth considering a way to skip this step for users who don't care about metadata, or prefer metadata extraction to be done on-demand as a follow-up task.
    • Note that I am making assumptions about the crates used per format, e.g. I assume zip doesn't read the entire file to extract metadata used for iteration (e.g. file names), but I could be wrong.
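
To make that trade-off concrete, a single shared sample-based hash could look something like the sketch below. This is under my own assumptions rather than how the per-format processors do it today; the function name, the fixed-prefix sampling, and the use of `DefaultHasher` are all illustrative.

```rust
use std::{
    collections::hash_map::DefaultHasher,
    fs::File,
    hash::{Hash, Hasher},
    io::Read,
    path::Path,
};

// Hash only the first `sample_size` bytes of the file plus its length, instead of
// extracting and hashing X pages via format-specific crates. A real implementation
// would likely use a proper content hash rather than `DefaultHasher`.
fn sampled_file_hash(path: &Path, sample_size: usize) -> std::io::Result<u64> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; sample_size];
    let read = file.read(&mut buf)?; // may read fewer bytes than requested
    let mut hasher = DefaultHasher::new();
    buf[..read].hash(&mut hasher);
    // Folding in the file size helps disambiguate files that share a common prefix
    file.metadata()?.len().hash(&mut hasher);
    Ok(hasher.finish())
}
```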

TLDR; the scan was slow, but I think it's mostly due to the bottlenecked setup. Memory usage was relatively stable, and there are some potential areas for performance improvements around adding opt-in features for metadata extraction and file hashing.

EDIT: I started another run with the hashing and ComicInfo.xml parsing logic commented out. It now "processes" files (just page counting) in a few microseconds, so I think it's safe to say those two operations make a big difference.
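
If I want to attribute time per step more rigorously on a future run, a throwaway timing wrapper would do it. The step functions referenced in the comment are hypothetical names, not the real Stump functions.

```rust
use std::time::Instant;

// Wrap any processing step and print how long it took
fn time_step<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    println!("{label} took {:?}", start.elapsed());
    out
}

// Usage sketch (illustrative names only):
// let pages = time_step("page count", || count_pages(&path));
// let hash = time_step("file hash", || hash_file(&path));
// let meta = time_step("metadata", || extract_metadata(&path));
```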

The thumbnail generation:

Est ~7ish hours to generate 1543 thumbnails (200x350 px, 0.75 quality, webp format, default concurrency of 50 files processed at a time) 🐢
Note: I got impatient and stopped the process after 2 hours, so this is a rough estimate
Note also: cancel functionality "worked" but is both bugged and improved with these changes lol (it successfully cancelled faster, but was marked as completed)

Full details
  • Averaged 4.4 thumbnails per minute (I know, a fractional rate doesn't quite make sense), which means it would take ~5-6 hours to generate all thumbs (1543 / 4.4 ≈ 351 minutes, 351 / 60 ≈ 5.85 hours)
  • Memory spiked somewhat for this job, at least initially, jumping to ~76MB as soon as the main sub-task loop started. It varied though; I never saw it go above 90MB, and the lowest I saw was 44.2MB. It mostly hovered between 48-50MB, which I'll attribute to the varying sizes of images pulled from the files being held in memory, since it was relatively stable across batches of files (50 are allowed at once, but I didn't track it that closely to say for certain)

Some thoughts re: memory consumption:

  • Until the person who reported [BUG?] very slow in creating library #427 has a chance to retest using the image with these changes, I can't really say for sure if the memory usage is improved. I do think at the very least it is faster now than before, but I couldn't replicate the growing memory issue with this setup.
    • For argument's sake, I sampled the first 20 CBZ files in the library and found the first pages range from 2-4MB in size. Taking the average of 3MB and multiplying by 50 (the concurrency limit) gives 150MB we could hypothetically hold in memory while generating thumbnails. We resize the image in memory, so add a little for each resized buffer, then we convert it and finally dump it to disk. That doesn't add up to the GBs of memory reported, though it is also more than what I observed the process actually using
    • An additional note: on each stream loop iteration (when a future resolves and its result is available), we handle the result. By the end of an iteration, I'd expect the returned buffer to be dropped and the memory freed (I'm not even using it). Perhaps a manual drop is needed? These changes aren't present in the version in which the memory issue was reported (see the sketch below)
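
For reference, the loop I'm describing is shaped roughly like the following; `generate_thumbnail` is a hypothetical stand-in for the resize/encode/write step, and in this shape the returned buffer goes out of scope at the end of each match arm.

```rust
use futures::{stream, StreamExt};
use std::path::{Path, PathBuf};

// Hypothetical stand-in: resize the first page, encode to webp, write to disk,
// and return the encoded bytes
fn generate_thumbnail(_path: &Path) -> std::io::Result<Vec<u8>> {
    todo!()
}

async fn generate_all(paths: Vec<PathBuf>, max_concurrency: usize) {
    let mut completed = 0usize;
    let mut results = stream::iter(paths)
        .map(|path| tokio::task::spawn_blocking(move || generate_thumbnail(&path)))
        // At most `max_concurrency` files in flight, e.g. STUMP_MAX_THUMBNAIL_CONCURRENCY (50)
        .buffer_unordered(max_concurrency);

    while let Some(joined) = results.next().await {
        match joined {
            // `_buffer` (the encoded bytes) is dropped at the end of this arm
            Ok(Ok(_buffer)) => completed += 1,
            Ok(Err(e)) => eprintln!("thumbnail generation failed: {e}"),
            Err(e) => eprintln!("blocking task failed: {e}"),
        }
    }
    println!("generated {completed} thumbnails");
}
```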

TLDR; the generation was also slow, but I do think it's improved from before. I could not replicate the memory issue reported in #427; in fact, I saw relatively stable and low memory usage throughout the process

All this has definitely been interesting to investigate, though I wish I had a clearer understanding of the linked issue. I'll try to set up an NFS mount for a follow-up, which hypothetically should result in improved speeds

@aaronleopold
Collaborator Author

Follow-up experiment:

NFS-mounted library (~1543 books total) hooked up to an external HDD (still 🐢)

The initial scan:

3 hours 20 min to scan 1543 books (default concurrency of 200 files processed at a time)

Notes
  • Memory "spiked" earlier in the scan, getting to 44MB by 2 hours. I didn't really see it trend upwards, it would fluctuate down to 38-41MB and mostly hover in that range. I think this could just be correlated to the files being processed, which are unordered and differ each scan.
    • I did not notice memory drop after the scan completed, perhaps the OS kept the allocated memory as a cache intead of freeing it? I can experiment with a different allocator if needed but we are also talking about a relatively small MB so 🤷
  • Still overall slow media generation, but I'd say significantly faster than SMB (took ~2 hours less)

Some thoughts re: potential areas for performance improvements:

  • See the note about the allocator above; maybe something like mimalloc (a sketch of that swap is below)? It might be premature, since I'm observing a small number of MBs hanging around, not GBs, and not a steady increase as reported in [BUG?] very slow in creating library #427. Also worth noting that the thumbnail generation tasks start soon after the scan, so the scan might not have had time to free the memory
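
For reference, swapping the global allocator to something like mimalloc is a tiny change; the crate version here is an assumption.

```rust
// Cargo.toml (assumed): mimalloc = "0.1"
use mimalloc::MiMalloc;

// Replace the default system allocator for the whole binary, so we can check
// whether the retained MBs after a scan are an allocator/OS caching artifact
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // ...the server bootstraps as usual...
}
```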

TLDR; the scan was still slow and memory usage was about the same; any speed difference is likely just NFS vs SMB. I feel this doesn't really shed any light on the issues reported in #427, as I couldn't replicate the memory issues with either SMB or NFS mounts.

I didn't try analyzing the thumbnail generation, canceling it after a few minutes. I assume it would be similar to the scan re: overall faster but similar memory usage.

@aaronleopold
Collaborator Author

(Screenshot attached: 2024-09-04, 5:02 PM)

@aaronleopold
Collaborator Author

This comment #427 (comment) is really important context. TLDR; some internals/configuration in the zip crate changed, which caused a huge performance regression. There are still some lingering memory issues, but they're orders of magnitude smaller after downgrading. For reference, the NFS test took 8 minutes after downgrading (down from 3 hours 20 min lol)

@aaronleopold aaronleopold changed the title [DRAFT] ⚡ Refactor thumbnail generation and scanner IO operations ⚡ Refactor thumbnail generation and scanner IO operations Sep 6, 2024
@aaronleopold
Collaborator Author

I'm going to do a last pass review and aim to merge this in today. I want to call out the warning in the PR description:

⚠️ There are potentially breaking changes in this PR from the restructuring of library options. The migration was hand-written, and while I tested it on my personal experimental instance without issue, I am still wary enough to put out this disclaimer.

I will promote the current experimental to develop and then merge this shortly after (once I give it a last review). I'm somewhat eager to get these improvements out there, so I'll likely not let these changes sit in experimental for more than a week or two. If anyone is able to help with testing I'd appreciate it!

@aaronleopold
Collaborator Author

I couldn't get the plot to work in this bash script, if anyone wants to give it a go. I created it as a quick debugging tool to try to visualize a long-running scan

@aaronleopold aaronleopold marked this pull request as ready for review September 8, 2024 06:31
@aaronleopold aaronleopold merged commit 5f9871b into experimental Sep 8, 2024
8 checks passed
@aaronleopold aaronleopold deleted the al/perf/thumbnailer branch September 8, 2024 06:32
aaronleopold added a commit that referenced this pull request Sep 14, 2024
* ⚡ Refactor thumbnail generation

* ⚡ Refactor scanner IO operations

Push up last nights changes

* debug: push to new tag for testing

* async-ify IO operations outside scan context

* debug: add tracing

* wip: rename library options and add fields

requires migration still

* wip: rename library options, fix frontend

* handwrite migration 😬

* fix ui

* NON FUNCTIONAL: wip migration

kill me

* debug: stop auto pushing

* fix migration?

* wip: support processing options and restructure

* 🥪 lunch allocator experiment

* super wip: distroless image, zip downgrade

* cleanup docker stuff

* Revert "debug: stop auto pushing"

This reverts commit bd6da98.

* remove missed feature

* fix job upsert after refresh

* cleanup comments and wip review

* Reapply "debug: stop auto pushing"

This reverts commit f43c187.

* cleanup
@aaronleopold aaronleopold mentioned this pull request Oct 11, 2024