Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zarr sink Improvements #1713

Merged
merged 8 commits into from
Jan 3, 2025
Merged

Zarr sink Improvements #1713

merged 8 commits into from
Jan 3, 2025

Conversation

annehaley
Copy link
Collaborator

@annehaley annehaley commented Nov 1, 2024

Resolves #1674. Resolves #1698.

@annehaley annehaley marked this pull request as ready for review November 5, 2024 18:29
@annehaley annehaley requested a review from manthey November 5, 2024 18:30
@manthey
Copy link
Member

manthey commented Nov 7, 2024

I have a multiprocess code path that behaves poorly.

In the main process, create the sink and add a tile.
Create a subprocess. It doesn't know that they are already tiles present, so the newaxes list is non-empty. Something uses a lot of memory as a tile is added.

I think the solution is that if new_axes (https://github.com/girder/large_image/pull/1713/files#diff-c79d7356628b5b310aaeb154520b413576727848e175b84f16af2b0eaadb98deR735) is not empty, we set the updateMetadata flag.

My test case was to use the examples/algorithm_progression.py example using --multiprocessing and to hack in an sink.addTile with a 1 pixel record at the maximum sweep locations. That is,

diff --git a/examples/algorithm_progression.py b/examples/algorithm_progression.py
index 52271cfb..2a70c0dc 100755
--- a/examples/algorithm_progression.py
+++ b/examples/algorithm_progression.py
@@ -42,7 +42,7 @@ class SweepAlgorithm:
 
         self.combos = list(itertools.product(*[p['range'] for p in input_params.values()]))
 
-    def getOverallSink(self):
+    def getOverallSink(self, maxValues=None):
         msg = 'Not implemented'
         raise Exception(msg)
 
@@ -118,7 +118,10 @@ class SweepAlgorithm:
 
     def run(self):
         starttime = time.time()
-        sink = self.getOverallSink()
+        source = large_image.open(self.input_filename)
+        maxValues = {'x': source.sizeX, 'y': source.sizeY, 's': source.metadata['bandCount']}
+        maxValues.update({p['axis']: len(p['range']) for p in self.param_order.values()})
+        sink = self.getOverallSink(maxValues)
 
         print(f'Beginning {len(self.combos)} runs on {self.max_workers} workers...')
         num_done = 0
@@ -149,7 +152,7 @@ class SweepAlgorithm:
 
 
 class SweepAlgorithmMulti(SweepAlgorithm):
-    def getOverallSink(self):
+    def getOverallSink(self, maxValues=None):
         os.makedirs(os.path.splitext(self.output_filename)[0], exist_ok=True)
         algorithm_name = self.algorithm.__name__.replace('_', ' ').title()
         self.yaml_dict = {
@@ -270,10 +273,14 @@ class SweepAlgorithmMultiZarr(SweepAlgorithmMulti):
 
 
 class SweepAlgorithmZarr(SweepAlgorithm):
-    def getOverallSink(self):
+    def getOverallSink(self, maxValues=None):
         import large_image_source_zarr
 
-        return large_image_source_zarr.new()
+        sink = large_image_source_zarr.new()
+        if maxValues:
+            sink.addTile(np.zeros((1, 1, maxValues.get('s', 1))),
+                         **{k: v - 1 for k, v in maxValues.items() if k != 's'})
+        return sink
 
     def writeOverallSink(self, sink):
         sink.write(self.output_filename, lossy=self.lossy)

and then python examples/algorithm_progression.py ppc --param=hue_value,hue,0,1,100,open --param=hue_width,width,0.10,0.25,4 -w 36 --multiprocessing build/tox/externaldata/sample_Easy1.png /tmp/sweep.zarr.zip

@manthey
Copy link
Member

manthey commented Nov 7, 2024

Hmm... If we did that, then maybe we'd also need to call self._validateZarr() in the _initNew method if we didn't create the file, but then things fail if we aren't setting data before doing the multiprocessing fork.

@annehaley
Copy link
Collaborator Author

annehaley commented Dec 2, 2024

I took some time to observe the behavior of your example use case in multiple python versions and plot the output from memory-profiler to compare them. The resulting plot is shown below, with timestep (sampled every 0.1 seconds) along X and memory usage along Y. The longest blue line is the main process and all other lines are subprocesses.

For all of these python versions, the only spike in memory I experienced was in the main process after the multiprocessing stage, during the conversion step of the write function. Interestingly, 3.11 and 3.12 have this extra subprocess that hangs around and occupies a negligible but non-zero amount of memory after the multiprocessing stage (shown as a flat orange line).

image

In my experience, 3.10 performed the best and 3.11 performed the worst. Am I remembering correctly that you said you tried this with 3.11? If my 3.11 graph does not reflect what you experienced, can you send me the full list describing your environment so I can do my best to replicate? Otherwise, I'm not sure that we can do much to further improve memory usage, and this may just be a matter of python and other library versions.

EDIT: Here's the command I ran in each environment:

mprof run --multiprocess examples/algorithm_progression.py ppc --param=hue_value,hue,0,1,100,open --param=hue_width,width,0.10,0.25,4 --multiprocessing  /home/anne/data/large_image/Easy1.png /home/anne/data/large_image/generated/sweep.zarr.zip

Also, write the metadata after fundamental changes so that anything that
reopens the file after that point will see those changes.
When reopening a zarr sink, validate it
@manthey
Copy link
Member

manthey commented Jan 3, 2025

@annehaley I think the only thing left is to document that if you use multi-processes and don't pre-allocate the maximal dimension, axes values aren't fully recorded.

@manthey
Copy link
Member

manthey commented Jan 3, 2025

We are getting some transient failures in CI. It doesn't have anything directly to do with this PR. I can erratically replicate it locally on master. We'll rerun CI until it passes for this PR and a different PR will address the flaky test.

@manthey
Copy link
Member

manthey commented Jan 3, 2025

We are getting some transient failures in CI. It doesn't have anything directly to do with this PR. I can erratically replicate it locally on master. We'll rerun CI until it passes for this PR and a different PR will address the flaky test.

If #1756 is merged to master before this passes, we can just rebase off of master and it should pass.

@annehaley annehaley merged commit 58d1230 into master Jan 3, 2025
18 checks passed
@annehaley annehaley deleted the zarr-sink-improvements-1 branch January 3, 2025 19:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Set frame values in multiple processes Modify Zarr sink addTile to allow creation of new axes
2 participants