Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for adding new entries to large zip archives #49149

Closed
Tracked by #62658
madelson opened this issue Mar 4, 2021 · 9 comments · Fixed by #102704
Closed
Tracked by #62658

Add support for adding new entries to large zip archives #49149

madelson opened this issue Mar 4, 2021 · 9 comments · Fixed by #102704
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.IO.Compression in-pr There is an active PR which will close this issue when it is merged needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration
Milestone

Comments

@madelson
Copy link
Contributor

madelson commented Mar 4, 2021

Background and Motivation

We use System.IO.Compression.ZipArchive to manage the creation of large .zip files in a streaming fashion. When creating new archives, this works quite well (with ZipArchiveMode.Create, it writes through to the underlying stream so long as you only write to one entry at a time).

However, when we want to append to an existing archive, we have to use ZipArchiveMode.Update. According to the doc comments, with this mode the contents of the entire archive must be held in memory! This caused our system to crash due to array length restrictions when working with a particularly large file.

The zip format is designed to support efficient appending of files, so I believe it should be possible for .NET's implementation to support this use-case.

Proposed API

This could be addressed using a new ZipArchiveMode enum value, perhaps named ZipArchiveMode.Append to match FileMode.Append. This would be similar to Create but would allow for an existing file to be used.

Usage Examples

using var fileStream = File.Open("existing.zip");
var zip = new ZipArchive(fileStream, ZipArchiveMode.Append);
var newEntry = zip.CreateEntry("new");
using var writer = new StreamWriter(newEntry.Open());
// write lots of content!

Alternative Designs

Another approach would be to change the behavior of Update such that it would only bring things into memory as needed (e. g. if you change the contents of an existing entry or keep multiple entries open for writing at the same time).

This second approach would have the benefit of improving the performance of all existing programs which use Update mode to append to existing zips, which seems likely to be a common use-case for Update.

Similarly, it seems that this could also enable Update to share the streaming benefits of Read mode in many cases.

The downside would be that code could silently go from performant to non-performant if the usage limitations were violated, although that is already the case with Read when the underlying stream is not seekable and Create when writing to multiple entries at once.

Another potential downside is that today presumably Update mode does all writes at the end of the operation, potentially allowing other readers to use the zip until then. This change would alter that behavior.

Risks

With the design approach of optimizing Update for specific scenarios, the design might entail update switching from a write-through approach to an in-memory approach partway through an operation. This might add overhead to someone who is actually leveraging the ability modify existing entries.

@madelson madelson added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Mar 4, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.IO.Compression untriaged New issue has not been triaged by the area owner labels Mar 4, 2021
@carlossanlop carlossanlop added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed untriaged New issue has not been triaged by the area owner labels Mar 25, 2021
@carlossanlop carlossanlop added this to the Future milestone Mar 25, 2021
@carlossanlop
Copy link
Member

@madelson I think what you're facing is what I described in this other issue, which is a problem often reported: #35815

The suggested workaround is to change ZipArchiveMode to Create instead of Update. This will avoid having to create the intermediate MemoryStream that ends up reaching Int32.MaxValue. The reason why this works is because you only need write permission, not read permission (you're adding new items).

Your suggestions make sense, even though there's a workaround. Another thing we should consider doing is improving the documentation.

@madelson
Copy link
Contributor Author

madelson commented Oct 1, 2021

@carlossanlop are you saying that Create can be used to update existing archives? This definitely wasn't how I interpreted the documentation (like you said it probably needs an update). I had thought we saw otherwise, but perhaps that was on an older version.

EDIT: this does not work. See my comment below.

@AraHaan
Copy link
Member

AraHaan commented Feb 9, 2022

I could have sworn that the Create mode is for creating or overwriting existing archives if they already exist.

@jeffhandley jeffhandley modified the milestones: 7.0.0, 8.0.0 Jul 9, 2022
@RokeJulianLockhart
Copy link

RokeJulianLockhart commented Jul 21, 2022

@carlossanlop, why is the MemoryStream limited to a 32-bit value? Is replacement of that with a 64-bit value not the most easy method of remediation?

@madelson
Copy link
Contributor Author

madelson commented Jul 21, 2022

Is replacement of that with a 64-bit value not the most easy method of remediation?

I don't think this would have solved it for us. We were dealing with multiple huge files on a server and aside from the array length issue, (iirc) that the server memory usage was spiking. This would just have allowed it to spike even higher.

madelson added a commit to madelson/dotnet-api-docs that referenced this issue Jul 21, 2022
I think this example should leverage `ZipArchiveMode.Create` because it is only appending/writing to entries. `Create` is more efficient in that scenario since it does not have to buffer the entire archive in memory. Furthermore, it is confusing that `ZipArchiveMode.Create` supports appending (unlike `File.Create` / `FileMode.Create`), so calling this nuance out in the first code example seems prudent.

See dotnet/runtime#49149
madelson added a commit to madelson/runtime that referenced this issue Jul 21, 2022
For context, see dotnet#49149. The aim is to reduce confusion around the fact that Create mode supports append (unlike `FileMode.Create`).
@madelson
Copy link
Contributor Author

@carlossanlop I went to update the docs based on your comments but I can't reproduce the behavior you describe:

var path = "...";

using (var archive = new ZipArchive(File.Create(path), ZipArchiveMode.Create))
{
	WriteEntry(archive, "a");
	WriteEntry(archive, "b");
}
PrintEntries(path);

using (var archive = new ZipArchive(File.Open(path, FileMode.Open), ZipArchiveMode.Create))
{
	WriteEntry(archive, "c");
	WriteEntry(archive, "d");
}
PrintEntries(path);

using (var archive = new ZipArchive(File.Open(path, FileMode.Open), ZipArchiveMode.Create))
{
	WriteEntry(archive, "e");
}
PrintEntries(path);


void WriteEntry(ZipArchive archive, string name)
{
	using (var writer = new StreamWriter(archive.CreateEntry(name).Open()))
	{
		writer.Write(name);
	}
}

void PrintEntries(string path)
{
	Console.WriteLine($"Entries for {path}:");
	using var archive = new ZipArchive(File.OpenRead(path), ZipArchiveMode.Read);
	foreach (var entry in archive.Entries)
	{
		using var reader = new StreamReader(entry.Open());
		Console.WriteLine($"* {entry.FullName}: {reader.ReadToEnd()}");
	}
}

Output:

Entries for c:\dev\test.zip:
* a: a
* b: b
Entries for c:\dev\test.zip:
* c: c
* d: d
Entries for c:\dev\test.zip:
<Fails with InvalidDataException: "Number of entries expected in End Of Central Directory does not correspond to number of entries in Central Directory." at System.IO.Compression.ZipArchive.ReadCentralDirectory()>

So I don't think Create mode works for updating existing archives. It also potentially seems like a bug that it can create an invalid archive when overwriting. This is on .NET 6; possibly it has been fixed for .NET 7. Thoughts?

@AraHaan
Copy link
Member

AraHaan commented Jul 21, 2022

I doubt it was fixed, Also I thought it would overwrite it as well and I was right. 😅 It could have been that the one who thought that create could be used was wrong, or you might have to use it only with manually creating ZipArchiveEntry but open the actual archive with Update.

But I think a better option would be to consider replacing existing archive code with 7zip's code in System.IO.Compression.Native and then just have the ZipArchive* types simply P/Invoke those apis. Also note: zlib 1.2.12 was just released as well.

@madelson
Copy link
Contributor Author

madelson commented Jul 22, 2022

you might have to use it only with manually creating ZipArchiveEntry but open the actual archive with Update.

@AraHaan do you have a code snippet? Not sure I follow. Do you think this will avoid the "buffer everything in memory" issue?

@AraHaan
Copy link
Member

AraHaan commented Jul 23, 2022

I do not think it will bypass the memory buffer, but I think 7zip's native library for it could.

@ViktorHofer ViktorHofer modified the milestones: 8.0.0, Future Jul 12, 2023
@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label May 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.IO.Compression in-pr There is an active PR which will close this issue when it is merged needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants