Files cache: use ctime instead of mtime #911

Closed
enkore opened this issue Apr 15, 2016 · 32 comments

@enkore
Contributor

enkore commented Apr 15, 2016

ctime can't be set via userspace, while mtime is easily manipulated.

For most files (e.g. those not extracted from tarballs or installed by a package manager) the two timestamps are the same, which limits the impact of the change on existing files caches.

  • Works with sshfs? -> (YES) (see below)

@verygreen noted that in the Windows world ctime usually means "creation" not "change" time

  • Works with windows+cygwin? -> YES
  • Works with native windows? -> NO
@ThomasWaldmann
Member

as we always backup file metadata, we are only interested in file contents changes (not also in file metadata changes). thus mtime fits better.

@verygreen
Contributor

Well, here lies the risk.
Imagine I have a file and run borg backup: it stores the file data and remembers the mtime, size and so on.

Now I modify the file (without changing the size), and then use utimes to return the mtime back to what it was - now next run of borg would miss the changes.
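
The scenario above can be reproduced with a short script. This is a minimal sketch, assuming a POSIX filesystem where ctime means "inode change time"; the helper names `tamper_keeping_mtime` and `demo` are made up for illustration and are not part of borg.

```python
import os
import tempfile
import time


def tamper_keeping_mtime(path: str, new_bytes: bytes) -> None:
    """Overwrite a file with same-length content, then restore atime/mtime."""
    before = os.stat(path)
    if len(new_bytes) != before.st_size:
        raise ValueError("replacement must keep the file size unchanged")
    with open(path, "r+b") as f:  # r+b rewrites in place, so the inode is kept
        f.write(new_bytes)
    # Same effect as utimes(): put atime and mtime back to their old values.
    os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))


def demo() -> dict:
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "victim.bin")
        with open(path, "wb") as f:
            f.write(b"original")
        before = os.stat(path)
        time.sleep(0.05)  # let the clock advance so a ctime change is visible
        tamper_keeping_mtime(path, b"tampered")
        after = os.stat(path)
        with open(path, "rb") as f:
            content = f.read()
    return {
        "content": content,
        "mtime_same": after.st_mtime_ns == before.st_mtime_ns,
        "size_same": after.st_size == before.st_size,
        "inode_same": after.st_ino == before.st_ino,
        "ctime_before_ns": before.st_ctime_ns,
        "ctime_after_ns": after.st_ctime_ns,
    }


if __name__ == "__main__":
    for key, value in demo().items():
        print(key, value)
```

On a typical Linux filesystem the content differs while mtime, size and inode all still match; only the ctime is expected to have moved forward.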

@ThomasWaldmann
Member

then you have shot yourself in the foot. :-P

@verygreen
Contributor

What if it was not me but:

  1. package manager updated a file (pretty common, though changed content is probably a bit less common).
    1.1 untar or cp -a or something similar updated a file
  2. evil hackers updated a file (in conjunction with people that use borg diff to detect intrusions potentially, there was a ticket about this).

@ThomasWaldmann
Member

ThomasWaldmann commented Apr 30, 2016

well, borg is not meant to be an intrusion detection system.

as far as a backup program is concerned, it would just not backup the hacked binary.

even if we used ctime, an attacker could still modify the binary, then modify the borg cache and update the ctime there, so there is no safety against this.

@ThomasWaldmann
Member

ThomasWaldmann commented Dec 2, 2016

Besides "hackability" (detecting which is not our goal), mtime seems a better fit as we need the files cache to detect content changes (to re-chunk, if needed, but only if needed).

So, isn't ctime the wrong timestamp anyway for this?


ctime

ctime is the inode or file change time. The ctime gets updated when the file attributes are changed, like changing the owner, changing the permission or moving the file to another filesystem, but it will also be updated when you modify a file.

mtime

mtime is the file modify time. The mtime gets updated when you modify a file. Whenever you update content of a file or save a file the mtime gets updated.

Most of the times ctime and mtime will be the same, unless only the file attributes are updated. In that case only the ctime gets updated.


@ThomasWaldmann
Member

we do not want to detect file attribute changes like owner - borg stores them always anyway.
we ONLY want to detect if file content changes (and then we need to re-chunk the file).

so, close?

@verygreen
Contributor

As long as you ignore deliberate abusers that update a few bytes in the file (leaving the size intact) and then reset the mtime to what it was (mtime is changeable by the user, ctime is not), it's ok to close.

@d3zd3z

d3zd3z commented Sep 10, 2017

I have had the debian package manager, on Ubuntu, replace files, setting the mtime back to what it was before. Since the file was the same size, rsync did not back it up (neither would have Borg). The other backup software I used that correctly used the ctime backed it up just fine.

I'm not sure I see why this is a debate. Using ctime will always back up a file when it changes. Using mtime will usually back up a file that has changed. Why would we want to do the correct thing only most of the time? The cost, as mentioned above, is possibly having to reread a file that had only its metadata changed.

Could we not somehow make this an option? For those that want the mostly-correct behavior, it could use mtime, and those of us that would actually like everything backed up, to use the ctime?

@ThomasWaldmann
Member

@d3zd3z so, that sounds like a bit stupid behaviour of the package manager in the first place. did it also not change the inode number? because borg considers (mtime, size, inode) by default.
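
For readers unfamiliar with the heuristic, here is a rough sketch of a (timestamp, size, inode) comparison. It is not borg's actual files cache code; `CacheEntry`, `make_entry` and `looks_unchanged` are hypothetical names, and the `use_ctime` switch only illustrates the change proposed in this issue.

```python
import os
from typing import NamedTuple, Optional


class CacheEntry(NamedTuple):
    """What a files cache might remember about one path (illustrative only)."""
    timestamp_ns: int
    size: int
    inode: int


def make_entry(st: os.stat_result, use_ctime: bool = False) -> CacheEntry:
    # Pick the timestamp the cache trusts: mtime (old default) or ctime.
    ts = st.st_ctime_ns if use_ctime else st.st_mtime_ns
    return CacheEntry(ts, st.st_size, st.st_ino)


def looks_unchanged(cached: Optional[CacheEntry], st: os.stat_result,
                    use_ctime: bool = False) -> bool:
    """True if the file can be skipped without reading its content."""
    return cached is not None and cached == make_entry(st, use_ctime)
```

A file is only re-read and re-chunked when `looks_unchanged` returns False, which is why a restored mtime on a same-size, same-inode file goes unnoticed in mtime mode.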

@d3zd3z

d3zd3z commented Sep 10, 2017

At least dpkg will overwrite files if already present, so the inode number doesn't change. The only thing that was different on the file was the ctime. There are also revision control systems that set the mtime on files. At least those would be expected to put a time on the file that matched its contents, so if it ended up with the same mtime, it would also have the same contents.

But there are lots of potential pieces of software out there, and it is hard to know what kinds of crazy things people could do. Even the 'touch' command can easily put an mtime back.

@d3zd3z

d3zd3z commented Sep 10, 2017

Someone asked for when ctime/mtime are updated:

  • mtime is updated when there is a change to the content of the file. It can also be set with utime().
  • ctime is updated when any change is made to the file, whether it be metadata, or the contents of the file.

@d3zd3z

d3zd3z commented Sep 10, 2017

Additional information:

  • Per POSIX, ctime will be updated when a change of any kind is made to a file. This will apply to most native filesystems on POSIX or POSIX-like systems.
  • The ctime cannot be set by the user. The only way it can be faked is to change the system clock before modifying a file.
  • The mtime is updated when the data of the file changes, it can also be set to an arbitrary value with the utime() or utimes() call.
  • The only time ctime wouldn't be updated is when a filesystem that doesn't support it is being backed up. For example, backing up a FAT filesystem.
  • Reading a file will modify the atime (but not the ctime).
  • Setting the atime with utime() or utimes() will cause the ctime to be set to the current time.

@d3zd3z

d3zd3z commented Sep 10, 2017

Just verified, OSX also returns mtime in the ctime field when mounting a filesystem that doesn't support a ctime. This means if we use 'ctime' on one of these filesystems, we would get the same behavior we have now (barring consistent inodes).

@ThomasWaldmann ThomasWaldmann added this to the 1.1.0 milestone Sep 10, 2017
@ThomasWaldmann
Member

ThomasWaldmann commented Sep 10, 2017

added this to 1.1.0 milestone to keep it on the radar. we shouldn't do such fundamental changes at patch releases. it's already rather late considering we are at rc3 already, so we will only be able to do that if we can do it rather safely and quickly. and it will delay 1.1.0 as we will need at least another rc just for that.

@ThomasWaldmann
Member

one downside of using ctime is that chown/chmod -R ... bigfiles/ would chunk/hash all the bigfiles again.
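
The downside is easy to observe: a permission change alone leaves mtime alone but moves ctime. A small sketch, assuming a POSIX filesystem; the function name `chmod_effect` is made up for this example.

```python
import os
import tempfile
import time


def chmod_effect() -> dict:
    """Report which timestamps move when only the file mode changes."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "bigfile")
        with open(path, "wb") as f:
            f.write(b"x" * 1024)
        os.chmod(path, 0o644)
        before = os.stat(path)
        time.sleep(0.05)  # let the clock advance past the timestamp granularity
        os.chmod(path, 0o600)  # metadata-only change, content untouched
        after = os.stat(path)
    return {
        "mtime_changed": after.st_mtime_ns != before.st_mtime_ns,
        "ctime_changed": after.st_ctime_ns != before.st_ctime_ns,
    }


if __name__ == "__main__":
    print(chmod_effect())
```

With a ctime-based cache key, every file touched by such a recursive chmod/chown would be treated as changed and read again on the next backup.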

@ThomasWaldmann
Member

ThomasWaldmann commented Sep 10, 2017

I did some experiment under win10+cygwin (on ntfs):

  • mtime updates if contents of file changes.
  • ctime updates then also. additionally ctime updates when mode/owner/group/nlinks is changed.

Also did experiment under win10 native (on ntfs):

  • mtime updates if contents of file changes.
  • ctime does not update if contents of file changes (so ctime is a "creation" timestamp here).
  • changing owner or setting "hidden" attribute neither changes ctime nor mtime.

@jdchristensen
Contributor

I prefer the current behaviour. Using mtime+size+inode is a fairly standard heuristic that works almost all of the time, whereas using ctime is going to cause a lot of unnecessary chunking in many cases. I'm also concerned because some programs that scan filesystems (including some backup programs) explicitly reset the atime after reading a file, and this causes the ctime to update, which would cause borg to rechunk all files.

Is it really the case that dpkg updates a file but preserves all of length, mtime and inode? I don't recall ever seeing that behaviour, and find it hard to believe. Which package (and version) and which file?

@ThomasWaldmann
Member

@jdchristensen are you sure these backup programs really reset the atime (and not choose the NOATIME open mode available for root, that does not touch the atime when reading the file)? borg uses the NOATIME open mode, if possible.

@ThomasWaldmann ThomasWaldmann self-assigned this Sep 10, 2017
@jdchristensen
Contributor

I'm not sure how common it is for atime to be reset, but I've heard of it. E.g. tar has as an option:

--atime-preserve
       preserve access times on dumped files, either by restoring the
       times after reading (METHOD='replace'; default) or by not setting
       the times in the first place (METHOD='system')

I also recall hearing about other indexing programs doing this, maybe before other approaches were available.

In any case, even ignoring such programs, ctime will get changed for lots of reasons that don't involve changing the content, while mtime is precisely intended to indicate the last time the content was changed, so I think it's the best thing to look at. But I could understand someone wanting ctime as an option.

@rugk
Contributor

rugk commented Sep 10, 2017

In any case, you should leave an option for using the opposite of what you might implement in the future, i.e. make it optional. Only I as a user know, whether I mess around with my mtime or atime, or – if I have a "usual" system – can rely on these times. So the cases in which I need one thing or the other, may be different. There is no "one-thing-fits-it-all" solution here.


Also the hacking argument (i.e. an attacker wants to prevent some files from getting backed up without anyone noticing) is, IMHO, still valid. First, the attacker may not have root, so cannot modify the binary. Secondly, you e.g. offer another feature (--exclude-caches), which describes the following in its "security consideration":

"Blind" use of cache directory tags in automatic system backups could potentially increase the damage that intruders or malware could cause to a system. A user or system administrator might be substantially less likely to notice the malicious insertion of a CACHEDIR.TAG into an important directory than the outright deletion of that directory, for example, causing the contents of that directory to be omitted from regular backups. To mitigate this risk, backup software should at least inform the user which directories are being omitted due to the presence of cache directory tags.

An attacker could use mtime/atime for the same. Just reset it, that is way subtler than deleting the file.

@rugk
Contributor

rugk commented Sep 10, 2017

ctime will get changed for lots of reasons that don't involve changing the content,

Borg could always just do another check after this is detected, using the file hash, so no false-positives happen.
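
One way to read this suggestion, as a sketch only: keep a content hash next to the metadata and consult it when the metadata no longer matches. Nothing here is borg code; `Entry`, `remember` and `content_changed` are hypothetical, and SHA-256 is an arbitrary choice.

```python
import hashlib
import os
import tempfile
from typing import NamedTuple


class Entry(NamedTuple):
    ctime_ns: int
    size: int
    inode: int
    sha256: str


def file_sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def remember(path: str) -> Entry:
    st = os.stat(path)
    return Entry(st.st_ctime_ns, st.st_size, st.st_ino, file_sha256(path))


def content_changed(path: str, cached: Entry) -> bool:
    st = os.stat(path)
    if (st.st_ctime_ns, st.st_size, st.st_ino) == cached[:3]:
        return False  # cheap path: metadata matches, trust the cache
    if st.st_size != cached.size:
        return True  # a different size always means different content
    # Metadata moved but the size did not: confirm by hashing the content.
    return file_sha256(path) != cached.sha256


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "f")
        with open(path, "wb") as f:
            f.write(b"data")
        entry = remember(path)
        unchanged = content_changed(path, entry)
        os.chmod(path, 0o600)  # metadata-only change
        after_chmod = content_changed(path, entry)
        with open(path, "wb") as f:
            f.write(b"DATA!")  # real content change
        after_write = content_changed(path, entry)
    return unchanged, after_chmod, after_write
```

Note that the confirmation step still has to read the whole file, so it avoids a false "changed" verdict but not the read I/O.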

ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Sep 11, 2017
…gbackup#911

using ctime is the more safe option for a backup tool (see borgbackup#911),
but --use-mtime can be given if using mtime is good enough or if
there are any issues with ctime on the platform / filesystem.
@ThomasWaldmann
Member

ThomasWaldmann commented Sep 11, 2017

Summary of twitter feedback:

  • wanting to keep mtime for performance reasons, seeing cases for ctime as "special"
  • suggestion to use max(mtime, ctime) as default to address the case when ctime is not there / is implemented as "creation time", offer mtime or ctime as option.
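
The max(mtime, ctime) suggestion amounts to something like the following sketch. The function name `change_timestamp_ns` is invented for illustration, and the two sample objects are fabricated stat-like values, not measurements.

```python
import os
from types import SimpleNamespace


def change_timestamp_ns(st: os.stat_result) -> int:
    """Use whichever of mtime/ctime is newer as the 'last change' marker."""
    return max(st.st_mtime_ns, st.st_ctime_ns)


# ctime acting as a creation time (older than the last content change):
creation_style = SimpleNamespace(st_mtime_ns=200, st_ctime_ns=100)
# POSIX-style ctime bumped by a later metadata or content change:
posix_style = SimpleNamespace(st_mtime_ns=100, st_ctime_ns=200)

print(change_timestamp_ns(creation_style))
print(change_timestamp_ns(posix_style))
```

Where ctime is really a creation time, the result falls back to mtime after any later content change; where ctime is a POSIX change time, it normally wins.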

@edgewood
Contributor

What do other backup programs do? Not to say that Borg can't do something different, but might speak to user expectations/what the default should be.

Research I was able to do on my phone:

Duplicity: mtime
https://bazaar.launchpad.net/~duplicity-team/duplicity/0.8-series/view/head:/duplicity/path.py

obnam: mtime
https://github.com/obnam-mirror/obnam/blob/b1967189ea3e054305977cb586825bf222a05863/obnamlib/plugins/backup_plugin.py

@ThomasWaldmann
Member

restic: mtime

@jdchristensen
Contributor

unison is a directory synchronization program that I've been using on various systems for over 15 years to synchronize all of my files. It uses mtime.

rsync uses mtime by default.

GNU tar uses mtime when updating an archive, but uses ctime when comparing against a snapshot file.

@ThomasWaldmann
Member

Note:

  • a lot of tools using mtime does not necessarily mean that this is the best(tm) way.
  • not having seen issues with using mtime does not necessarily mean there are none, maybe one just has not noticed them.

@ThomasWaldmann
Member

sshfs: has a clientside cache for misc stuff, including stat() results. So it might give incoherent results over short timespans.

But shouldn't be a problem because the default timeout is relatively short and (after timeout) mtime/ctime gets propagated from server to client.

@ThomasWaldmann
Member

From IRC:

21:21  Artefact2$ not sure, not a fs expert by any means. i know some programs like jpegoptim or pngcrush will change contents of a file while maintaining mtime
21:21  Artefact2$ but filesize usually changes

inode number changes also.

@ThomasWaldmann
Member

everybody please have a look at PR #3024 (and comment either there or here) - I'd like to fix that for borg 1.1.0.

@ThomasWaldmann
Member

vfat / linux:

  • atime: there seems to be an atime simulation that works as long as the fs is mounted, but atime gets lost on umount.
  • ctime: there seems to be also a ctime simulation: if one does a chmod, it has no effect, but ctime updates (mtime does not update), ctime is preserved over remounts.
  • mtime behaves as expected (granularity 2s)
  • ctime/mtime update in parallel if contents are updated (same value, as expected)

@ThomasWaldmann
Member

ntfs / linux (FUSE):

  • atime: behaves as expected, survives remount
  • ctime: updates on content-only changes, does not update on chmod/chown (chmod/chown has no effect), updates on creating hardlinks (increase link count)
  • mtime: behaves as expected (granularity 100ns)

ThomasWaldmann added a commit that referenced this issue Sep 30, 2017
implement files cache mode control, fixes #911
ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Sep 30, 2017
You can now control the files cache mode using this option:

--files-cache={ctime,mtime,size,inode,rechunk,disabled}*

(only some combinations are supported)

Previously, only these modes were supported:
- mtime,size,inode (default of borg < 1.1.0rc4)
- mtime,size (by using --ignore-inode)
- disabled (by using --no-files-cache)

Now, you additionally get:
- ctime alternatively to mtime (more safe), e.g.:
  ctime,size,inode (this is the new default of borg >= 1.1.0rc4)
- rechunk (consider all files as changed, rechunk them)

Deprecated:
- --ignore-inode (use modes without "inode")
- --no-files-cache (use "disabled" mode)

The tests needed some changes:
- previously, we used os.utime() to set a file's mtime (atime) to specific
  values, but that does not work for ctime.
- now use time.sleep() to create the "latest file" that usually does
  not end up in the files cache (see FAQ)

(cherry picked from commit 5e2de8b)
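
To make the mode string concrete, here is a sketch of how a value such as "ctime,size,inode" could be parsed and checked. This is not borg's parser: the function name `parse_files_cache_mode` is hypothetical, and the validity rules (ctime and mtime are mutually exclusive, rechunk and disabled must stand alone) are assumptions, since the commit message only says that some combinations are supported.

```python
# Tokens taken from the option description above.
ALLOWED = frozenset({"ctime", "mtime", "size", "inode", "rechunk", "disabled"})


def parse_files_cache_mode(value: str) -> frozenset:
    """Split a comma-separated mode string and reject odd combinations."""
    tokens = frozenset(t.strip() for t in value.split(",") if t.strip())
    if not tokens:
        raise ValueError("empty files cache mode")
    unknown = tokens - ALLOWED
    if unknown:
        raise ValueError(f"unknown mode element(s): {sorted(unknown)}")
    # Assumption: ctime is an alternative to mtime, so both at once is invalid.
    if {"ctime", "mtime"} <= tokens:
        raise ValueError("ctime and mtime cannot be combined")
    # Assumption: rechunk and disabled only make sense on their own.
    for solo in ("rechunk", "disabled"):
        if solo in tokens and len(tokens) > 1:
            raise ValueError(f"{solo} cannot be combined with other elements")
    return tokens


if __name__ == "__main__":
    print(sorted(parse_files_cache_mode("ctime,size,inode")))
```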
ThomasWaldmann added a commit that referenced this issue Oct 1, 2017
implement files cache mode control, fixes #911
@ghost ghost mentioned this issue Aug 26, 2021