Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZipArchive: Apply strategy pattern depending on ZipArchiveMode #61820

Closed
wants to merge 3 commits into from

Conversation

carlossanlop
Copy link
Member

@carlossanlop carlossanlop commented Nov 18, 2021

The ZipArchiveMode parameter that is passed to the ZipArchive constructor plays an important role on how we decide to manipulate the zip stream that we pass to the ZipArchive constructor.

Currently, the ZipArchive code heavily intermixes the logic for reading/creating/updating, making it difficult to debug, diagnose, fix bugs, or even to extend the class with new features.

This PR organizes the ZipArchive code using the strategy pattern, like we did recently with FileStream, to help reduce those problems considerably.

This change would make it easier to address some zip-related issues that we might consider for .NET 7.0:

If this proposed PR is accepted, then the next step would be to also separate concerns in ZipArchiveEntry, since it also intermixes create/read/update logic, although not as heavily as ZipArchive.

@ghost
Copy link

ghost commented Nov 18, 2021

Tagging subscribers to this area: @dotnet/area-system-io-compression
See info in area-owners.md if you want to be subscribed.

Issue Details

The ZipArchiveMode parameter that is passed to the ZipArchive constructor plays an important role on how we decide to manipulate the zip stream that we pass to the ZipArchive constructor.

Currently, the ZipArchive code heavily intermixes the logic for reading/creating/updating, making it difficult to debug, diagnose, fix bugs, or even to extend the class with new features.

This PR organizes the ZipArchive code using the strategy pattern, like we did recently with FileStream, to help reduce those problems considerably.

This change would make it easier to address some zip-related issues that we might consider for .NET 7.0:

If this proposed PR is accepted, then the next step would be to also move ZipArchiveEntry to the strategy pattern, since it also intermixes create/read/update logic, although not as heavily as ZipArchive.

Author: carlossanlop
Assignees: carlossanlop
Labels:

area-System.IO.Compression

Milestone: 7.0.0

@carlossanlop
Copy link
Member Author

@adamsitnik I'm going to run the benchmarks. Do you know if there's potential for regressions due to using an abstract class? I mainly added it to avoid duplicating code, because there were many cases where two modes shared logic (read and update do one thing, but create does another one, for example). But I wouldn't mind have some duplicate code if we have to avoid abstract overrides to prevent perf regressions.

Copy link
Member

@jozkee jozkee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your ZipArchiveStrategy is currently just abstracting the Mode, opposed to FileStreamStrategy which abstracts the OS, Sync/Async, buffering, if base or derived.

I would suggest to go with a more lightweight approach for abstracting Mode and just separate the logic for each in static methods.

If there are other factors that support having the abstraction, then cool, but right now, the Mode is not enough IMO.

@@ -262,19 +262,19 @@ public long Length
/// <exception cref="ObjectDisposedException">The ZipArchive that this entry belongs to has been disposed.</exception>
public void Delete()
{
if (_archive == null)
if (_strategy == null)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_strategy is not nullable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll check if there are other similar existing conditions I can simplify.


namespace System.IO.Compression
{
internal interface IZipArchiveStrategy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why an interface?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems I don't need it. With just the abstract class is enough. I'll remove it.


namespace System.IO.Compression
{
internal class ZipArchiveCreateStrategy : ZipArchiveStrategy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: seal all the leaf classes.

Suggested change
internal class ZipArchiveCreateStrategy : ZipArchiveStrategy
internal sealed class ZipArchiveCreateStrategy : ZipArchiveStrategy

private Stream? _backingStream;
private byte[]? _archiveComment;
private Encoding? _entryNameEncoding;
internal IZipArchiveStrategy Strategy { get; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new allocation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's similar to what we have in FileStream. If we find places where we can improve perf, it can compensate this extra allocation. Unfortunately, due to the inheritance of the abstract class, I cannot switch to using structs. Do you have any alternatives?

@stephentoub
Copy link
Member

I generally agree with @jozkee here; I'm not seeing what this abstraction is really buying. With FileStream, we were contending with Windows vs Unix, buffering vs unbuffered, pipe vs file, async vs sync, etc., and the matrix of all of those is a large part of what made the implementation untenable. In contrast, here there appears to be just be a single dimension, and in each of the concrete implementations, most of the members are empty/minimal.

I'm not against an internal abstraction here if it helps more than it hurts, but I'm not seeing it at the moment. You mention this will help better address future issues; can you elaborate?

@carlossanlop
Copy link
Member Author

carlossanlop commented Nov 19, 2021

I'm not against an internal abstraction here if it helps more than it hurts, but I'm not seeing it at the moment. You mention this will help better address future issues; can you elaborate?

3 things come to mind:

  1. There's a reqeuest from the OpenXML team to bring back the ZipArchive.Flush method that we didn't port from .NET Framework. The API would unblock some problems they found in System.IO.Packaging APIs. I experimented with adding a Flush method, and realized that the behavior would have to be special-cased depending on the mode. Flush would do exactly what we currently do inside Dispose, minus the disposal of the internal streams. But since the expectation is that the user should be able to continue manipulating the archive and entries after they call Flush, then we need to put the streams back into a good state, and this is where I was finding the complications due to the read/create/updaet logic being intermixed.
  2. ZipArchive also has different behaviors when loading the archive and the entries, which depend on the mode. It would be quite helpful to have a separation of the logic depending on the mode, if we want to fix the memory-related issues (linked in the description).
  3. Finally, if we want to add zip64 support, I imagine we would have to split the code even more. I haven't dug deep into this, but my first impression is that this could be another layer of abstraction if we use the strategy pattern (I could be wrong).

I generally agree with @jozkee here; I'm not seeing what this abstraction is really buying.

A good alternative @jozkee gave me is to add specialized methods instead, and split the methods into partial files depending on the method. I like the option since I don't have strong arguments for points 2 and 3 yet. So until I can confirm we have an additional abstraction layer besides ZipArchiveMode that would require a strategy pattern, we can stick to partial files.

How does that sound?

@stephentoub
Copy link
Member

Thanks for the additional details.

My suggestion is to wait to do such a refactoring an introduction of an internal abstraction until there's actually necessity, i.e. do it in the same PR where it's being used. Otherwise, we're speculating about what the abstraction should be and whether it's helpful without actually being able to take it for a spin. It's the same in this regard to introducing new public abstractions: you want to both consume them and implement them several times to have confidence in their shape. Obviously we have more flexibility with an internal implementation detail, but it's similar.

@carlossanlop
Copy link
Member Author

do it in the same PR where it's being used.

Ok thanks. I'll continue investigating the Flush API implementation, reusing the code I have here, and when it's ready, I'll submit a new PR.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 19, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants