Feature request for file based copy on write #3020

Open
pavel-odintsov opened this issue Jan 16, 2015 · 14 comments
Labels
Type: Feature Feature request or new feature

Comments

@pavel-odintsov

Hello, folks!

I have been exploring the rich feature set of ZFS, and it fits my container workloads perfectly.

But I haven't found one very important feature: the ability to share one file between multiple places until it is changed somewhere.

I want to create a read-only volume (a "template" in OpenVZ terms) with a bunch of binary files representing the root hierarchy of Debian Wheezy. After that I want to create hundreds of volumes for customers with a file hierarchy completely identical to the "template". So /usr/bin/apache2 for every customer will have identical content and the same inode, and will be stored in the filesystem only once. If a customer wants to change or remove /usr/bin/apache2, it is "unlinked" from the "template" and behaves like a standard file. This approach would improve system performance, prevent double caching and free up much more space without very costly deduplication.

If that is still unclear: I want something like fork() behavior in Linux, where the child's memory is not actually allocated until it is touched.

Something like this was implemented in VServer (http://linux-vserver.org/util-vserver:Vhashify) as a patch for ext4. The same approach was used in Parallels Virtuozzo vzfs (http://download.swsoft.com/virtuozzo/virtuozzo4.0/docs/en/lin/VzLinuxUG/209.htm and http://www.montanalinux.org/openvz-kir-interview.html ) too, but it was very buggy and hard to maintain. They tried to do this on a non-CoW filesystem, which was not a good idea.

This feature can provide the following benefits compared with "copy this template 100 times":

  • Very large disk space savings
  • Lower I/O load, because only file metadata is copied and file contents are not touched
  • Lower memory usage, because these files are cached only once
  • Faster than calling this directly: zfs send debian-template@11022014 | zfs receive client-container-42
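
To make the space argument concrete, here is a minimal in-memory sketch of the requested behavior. It is only a toy model, not ZFS code: the SharedStore class and all of its method names are hypothetical, and "blobs" stand in for file contents stored once and referenced by many containers.

```python
class SharedStore:
    """Toy model of file-level copy-on-write: content is stored once per unique blob."""

    def __init__(self):
        self.blobs = {}   # blob id -> bytes
        self.refs = {}    # blob id -> number of (container, path) references
        self.files = {}   # (container, path) -> blob id
        self._next_id = 0

    def _new_blob(self, data):
        blob_id = self._next_id
        self._next_id += 1
        self.blobs[blob_id] = data
        self.refs[blob_id] = 0
        return blob_id

    def _attach(self, key, blob_id):
        self.files[key] = blob_id
        self.refs[blob_id] += 1

    def _detach(self, key):
        blob_id = self.files.pop(key)
        self.refs[blob_id] -= 1
        if self.refs[blob_id] == 0:
            del self.blobs[blob_id], self.refs[blob_id]

    def add_file(self, container, path, data):
        self._attach((container, path), self._new_blob(data))

    def clone_container(self, src, dst):
        # Copy only metadata: each path in dst points at the same blob as in src.
        for (container, path), blob_id in list(self.files.items()):
            if container == src:
                self._attach((dst, path), blob_id)

    def write(self, container, path, data):
        # Writing "unlinks" the file from the shared blob and gives it private content.
        key = (container, path)
        if key in self.files:
            self._detach(key)
        self._attach(key, self._new_blob(data))

    def read(self, container, path):
        return self.blobs[self.files[(container, path)]]

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())
```

In this model, cloning a template that holds one 1000-byte file into 100 containers leaves stored_bytes() at 1000. Only a write to one container's copy adds new data, and the template's content stays unchanged.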

As far as I understand ZFS internals, this feature could be implemented reliably on ZFS. Every user of VMs or containers (with piles of identical binary files) would be very pleased with it!

Thank you!

@lkateley

What you are looking for is very easily done with snapshot and clone.

I have a short video on how to do it at http://kateleyco.com/?page_id=783

@pavel-odintsov
Author

Linda, thank you so much! I found 8 videos on this page. Could you clarify which video is the one for me?

@lkateley

There is one just about snapshot and clone. If you snapshot a dataset, the snapshot is a read-only copy of the filesystem. Then you can clone it to make an identical read-write version. The clone's blocks will point back to the original blocks.

@lkateley

I know that you want this for a single file, but that can be done through links too. Snapshot and clone work well on a template of files in a directory.

@pavel-odintsov
Author

Thank you again!

It's exactly what I want! I really appreciate your help!

Clones
A clone is a writable volume or file system whose initial contents are the same as another dataset. As with snapshots, creating a clone is nearly instantaneous, and initially consumes no additional space.

Clones can only be created from a snapshot. When a snapshot is cloned, it creates an implicit dependency between the parent and child. Even though the clone is created somewhere else in the dataset hierarchy, the original snapshot cannot be destroyed as long as a clone exists. The origin property exposes this dependency, and the destroy command lists any such dependencies, if they exist.

The clone parent-child dependency relationship can be reversed by using the promote subcommand. This causes the "origin" file system to become a clone of the specified file system, which makes it possible to destroy the file system that the clone was created from.
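
As a rough illustration of the snapshot-then-clone workflow described above, the sketch below only builds the command lines and does not run them. The zfs snapshot and zfs clone subcommands are real. The pool name tank and the helper clone_commands are assumptions made for this example, and the dataset names reuse those from the original post.

```python
import shlex


def clone_commands(template, snapshot, clones):
    """Return the zfs commands that snapshot `template` once and clone it per container."""
    origin = f"{template}@{snapshot}"
    commands = [["zfs", "snapshot", origin]]
    for clone in clones:
        # Every clone references the same read-only snapshot as its origin.
        commands.append(["zfs", "clone", origin, clone])
    return commands


if __name__ == "__main__":
    # "tank" is a hypothetical pool name.
    targets = [f"tank/client-container-{n}" for n in (41, 42)]
    for argv in clone_commands("tank/debian-template", "11022014", targets):
        print(shlex.join(argv))
```

Each returned list could be passed to subprocess.run on a host with ZFS installed. Here they are only printed, so the sketch has no side effects.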

@pavel-odintsov
Author

Yes, zfs clone can solve the initial part of my issue: I can create containers/VMs without copying the same data multiple times.

But what about the ability to relink a dataset to another template, as implemented in VServer http://linux-vserver.org/util-vserver:Vhashify ?

I want it for the following case:

  • I installed a container with Debian 7.0.1
  • The customer manually upgraded Debian 7.0.1 to 7.0.2
  • The links to the original template snapshot are now broken, and I want to link the container to the Debian 7.0.2 template again.

Is it possible?

@pavel-odintsov
Author

I'm thinking hard about another approach, the one used in OpenVZ pfcache: http://wiki.openvz.org/Pfcache/API

It's a very interesting approach to deduplicating binary and library files in memory.

They process /usr/bin, /usr/sbin, /bin and /sbin on every container disk and generate a SHA-1 checksum for each file. They then store the SHA-1 in a special xattr field (trusted.pfcache).

As the next step they build a lookup table to identify unique files. Once all unique binary files are found, they are copied to the /vz/pfcache folder with names built from their SHA-1 checksums, and the original files are replaced with links to the files in this folder.

This approach provides the following features:

  • Memory deduplication
  • Very low operational cost (I never need to worry about any upgrade inside a container)
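
The hash-and-link idea described above can be sketched as follows. This is not the pfcache implementation: it skips the xattr step, uses plain hard links (an assumption about what "links" means here), ignores ownership and mode differences, and the function names are made up for illustration. It also assumes the cache directory is on the same filesystem as the container roots and lies outside them.

```python
import hashlib
import os


def sha1_of(path, chunk_size=1 << 20):
    """Return the hex SHA-1 of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedupe_into_cache(roots, cache_dir):
    """Hard-link identical regular files under `roots` to one copy in `cache_dir`.

    Returns the number of files that were replaced by links.
    """
    os.makedirs(cache_dir, exist_ok=True)
    replaced = 0
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                cached = os.path.join(cache_dir, sha1_of(path))
                if not os.path.exists(cached):
                    os.link(path, cached)      # first copy seeds the cache
                    continue
                if os.path.samefile(path, cached):
                    continue                   # already linked to the cache
                tmp = path + ".dedupe-tmp"     # hypothetical temporary name
                os.link(cached, tmp)
                os.replace(tmp, path)          # atomically swap in the link
                replaced += 1
    return replaced
```

Note that with plain hard links and no copy-on-write, a write through any one name changes the file for every container. That is why this kind of sharing is only safe for files that are never modified in place.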

Is it possible to do something like this with ZFS natively?

@lkateley

It sounds like you could use rollback to a snapshot for this.

@gordan-bobic
Contributor

Clones, snapshots and deduplication are NOT equivalent to this. Please stop suggesting they are close enough, because they really, really aren't. There is a major difference, and there are additional inconveniences with cloning snapshots.

  1. CoW hard-link breaking would mean the inodes are the same. This is extremely important because it means the mmap() of the binary will result in only one in-memory copy, no matter how many different chroots mmap() different hard-links of it. This means a massive saving in memory for hosts running VServer, LXC or OpenVZ (or jails on FreeBSD, or any other similar feature on an OS that supports ZFS).

  2. Clones are grandfathered. Once you have cloned a file system you cannot delete it. You have to keep it as long as any clones based on it exist - even if every last block has changed. Clones are also not rebasable.

Also, unlike memory deduplication done by a separate subsystem, CoW hard-link breaking would make deduplication free at the point of consumption. You hard-link the files periodically, and after that the deduplication of memory is implicit and completely free.

CoW hard-link breaking is a feature that simply cannot be meaningfully approximated using the existing features.
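
The inode-sharing point can be checked with ordinary hard links on any POSIX filesystem. The sketch below (file names are hypothetical) shows that two hard links report the same inode and a link count of 2. It also shows the problem CoW link breaking would solve: without it, a write through one name is visible through the other.

```python
import os
import tempfile


def hardlink_demo():
    """Return (same_inode, link_count, template_content_after_guest_write)."""
    with tempfile.TemporaryDirectory() as root:
        template = os.path.join(root, "template-apache2")
        guest = os.path.join(root, "guest-apache2")
        with open(template, "wb") as handle:
            handle.write(b"original")
        os.link(template, guest)

        same_inode = os.stat(template).st_ino == os.stat(guest).st_ino
        link_count = os.stat(template).st_nlink

        # Without CoW link breaking, a write through one name changes every name.
        with open(guest, "r+b") as handle:
            handle.write(b"MODIFIED")
        with open(template, "rb") as handle:
            template_after = handle.read()
        return same_inode, link_count, template_after
```

Calling hardlink_demo() returns (True, 2, b"MODIFIED"): the template's content was altered by the guest's write, which is exactly what a guest must not be able to do to a shared template.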

@behlendorf added the Type: Feature and Difficulty - Medium labels on Jan 16, 2015

@pavel-odintsov
Author

Thank you for the detailed explanation, Gordan Bobic!

"Clones are grandfathered. Once you have cloned a file system you cannot delete it. You have to keep it as long as any clones based on it exist - even if every last block has changed. Clones are also not rebasable."

We could work around this with the zfs promote command, which reverses the dependency so that the clone no longer depends on the original template. But once we promote, we no longer get any of the benefits...

@gordan-bobic
Contributor

Indeed, hence my remark that a clone cannot be rebased.

Ideally we want to set a ZFS option, e.g.:

zfs set cowhardlink=1 pool/fs

which would subsequently cause any file on that file system that is open()ed for writing (but not for reading) to be copied first if the refcount (hard-link count) of its inode is > 1.

In pseudocode, something along the lines of the following:

/* flags may also carry O_CREAT, O_TRUNC etc., so mask out the access mode */
if (refcount > 1 && (flags & O_ACCMODE) != O_RDONLY)
{
    /* copy the entire file and return a file handle to the new copy */
}
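
A userspace approximation of that check might look like the sketch below. It is not the proposed ZFS feature: open_breaking_links is a hypothetical wrapper around os.open, it uses st_nlink as the "refcount", it is not safe against concurrent openers, and a guest could simply bypass it, which is why a real implementation would have to live in the filesystem.

```python
import os
import shutil
import tempfile


def open_breaking_links(path, flags, mode=0o666):
    """os.open() wrapper: give `path` a private copy before opening it for writing."""
    # O_RDONLY is 0, so any write access sets O_WRONLY or O_RDWR.
    wants_write = bool(flags & (os.O_WRONLY | os.O_RDWR))
    if wants_write and os.path.isfile(path) and os.stat(path).st_nlink > 1:
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        os.close(fd)
        shutil.copy2(path, tmp)   # copy data and basic metadata
        os.replace(tmp, path)     # `path` now names a new inode with one link
    return os.open(path, flags, mode)
```

Opening a hard-linked file with os.O_RDONLY leaves the link intact. Opening it with os.O_WRONLY or os.O_RDWR first replaces that name with a private copy, so the other links keep the original content.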

@pavel-odintsov
Author

Yep, rebasing is absolutely impossible with zfs clone :(

@pavel-odintsov
Author

Issue #405 and the thread https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-discuss/mvGB7QEpt3w may be useful to anyone interested in this feature.

@gordan-bobic
Contributor

While there may be some functionality overlap, these are really different features, both functionally and semantically. What torn5 was asking for is actually much more similar to using zvols and clones of zvols - it's just that he wanted to use files rather than zvols for his own reasons, valid or otherwise.

The CoW hard-link breaking is much more like FL-COW that I mentioned on that thread, only it needs to be implemented at a level below what the guest chroot can control for security reasons.

4 participants