this fixes the issue that on some filesystems, you cannot recursively
remove a directory when you hold a lock on a file inside (e.g. nfs/cifs)
it is not really backwards compatible (so during an upgrade, there
could be two daemons have the lock), but since the locking was
broken before (see previous patch) it should not really matter
(also it seems very unlikely that someone will trigger this)
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
'lock_manifest' returns a Result<File, Error> so we always got the result,
even when we did not get the lock, but we acted like we had.
bubble the locking error up
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
WalkDir does not follow symlinks by default anyway, and this behaviour
is not documented anywhere. e.g., if a sysadmin mounts 'extra storage'
for some backup group or type (not knowing that only metadata is stored
in those directories), GC will ignore all the indices contained within
and happily garbage collect their chunks..
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
for safety reason, GC finds and marks all index files below the
datastore base path. as a result of regular operations, only index files
within the expected scheme of <TYPE>/<ID>/<TIMESTAMP> should exist.
add a small check + warning if the index list contains index files out
side of this expected scheme, so that an admin with shell access can
investigate.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
we have messages starting the phases anyway, and limit the number of
progress updates so that context remains available at all times.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Simplify the phase 2 code by treating .bad files just like regular
chunks, with the exception of stat logging.
To facilitate, we need to touch .bad files in phase 1. We only do this
under the condition that 1) the original chunk is missing (as before),
and 2) the original chunk is still referenced somewhere (since the code
lives in the error handler for a failed chunk touch, it only gets called
for chunks we expect to be there, i.e. ones that are referenced).
Untouched they will then be cleaned up after 24 hours (or after the last
longer-running task finishes).
Reason 2) is also a fix for .bad files not being cleaned up at all if
the original is no longer referenced anywhere (e.g. a user deleting all
snapshots after seeing some corrupt chunks appear).
cond_touch_path is introduced to touch arbitrary paths in the chunk
store with the same logic as touching chunks.
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
try do reduce some unecessary lines, make match arms more precise so
one can faster see what's actually happening.
Also, avoid
> return Err(format_err!(...))
stuff, just use bail!()
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
in most generic places. this is accompanied by a change in
RpcEnvironment to purposefully break existing call sites.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Make it more clear that removed files are chunks (not indexes or
something like that, user cannot know that we do not touch them here)
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
fixes commit b4fb262335, which copied
over the "Removed bad files:" block, but only adapted the log text,
not the actual variable.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
and load it again when opening it
this way we can persist the status of the last garbage collect across
daemon reloads and reboots
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
To cater to the paranoid, a new datastore-wide setting "verify-new" is
introduced. When set, a verify job will be spawned right after a new
backup is added to the store (only verifying the added snapshot).
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
Force consumers to use the lookup_datastore method instead of
potentially opening a datastore twice, and pass the config we have
already loaded into open_with_path, removing the need for open(1).
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
Avoid races when updating manifest data by flocking a lock file.
update_manifest is used to ensure updates always happen with the lock
held.
Snapshot deletion also acquires the lock, so it cannot interfere with an
outstanding manifest write.
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
Removing a snapshot has some more safety checks which we don't want to
ignore when removing an entire group (i.e. locking the manifest and
notifying GC).
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
There's no point in having that as a seperate method, just parse the
thing into a struct and write it back out correctly.
Also makes further changes to the method simpler.
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
...to avoid it being forgotten or pruned while in use.
Update lock error message for deletions to be consistent.
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
To untangle the server code from the actual backup
implementation.
It would be ideal if the whole backup/ dir could become its
own crate with minimal dependencies, certainly without
depending on the actual api server. That would then also be
used more easily to create forensic tools for all the data
file types we have in the backup repositories.
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
- remove chrono dependency
- depend on proxmox 0.3.8
- remove epoch_now, epoch_now_u64 and epoch_now_f64
- remove tm_editor (moved to proxmox crate)
- use new helpers from proxmox 0.3.8
* epoch_i64 and epoch_f64
* parse_rfc3339
* epoch_to_rfc3339_utc
* strftime_local
- BackupDir changes:
* store epoch and rfc3339 string instead of DateTime
* backup_time_to_string now return a Result
* remove unnecessary TryFrom<(BackupGroup, i64)> for BackupDir
- DynamicIndexHeader: change ctime to i64
- FixedIndexHeader: change ctime to i64
The iterator of get_chunk_iterator is extended with a third parameter
indicating whether the current file is a chunk (false) or a .bad file
(true).
Count their sizes to the total of removed bytes, since it also frees
disk space.
.bad files are only deleted if the corresponding chunk exists, i.e. has
been rewritten. Otherwise we might delete data only marked bad because
of transient errors.
While at it, also clean up and use nix::unistd::unlinkat instead of
unsafe libc calls.
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
An flock on the snapshot dir itself is used in addition to the group dir
lock. The lock is used to avoid races with forget and prune, while
having more granularity than the group lock (i.e. the group lock is
necessary to prevent more than one backup per group, but the snapshot
lock still allows backups unrelated to the currently running to be
forgotten/pruned).
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
Attempt to lock the backup directory to be deleted, if it works keep the
lock until the deletion is complete. This way we ensure that no other
locking operation (e.g. using a snapshot as base for another backup) can
happen concurrently.
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
instead of bailing and stopping the entire GC process, warn about the
missing chunks and continue.
this results in "TASK WARNINGS: X" as the status.
Signed-off-by: Oguz Bektas <o.bektas@proxmox.com>
Used chunks are marked in phase1 of the garbage collection process by
using the atime property. Each used chunk gets touched so that the atime
gets updated (if older than 24h, see relatime).
Should there ever be a situation in which the phase1 in the GC run needs
a very long time to finish, it could happen that the grace period
calculated in phase2 is not long enough and thus the marking of the
chunks (atime) becomes invalid. This would result in the removal of
needed chunks.
Even though the likelyhood of this happening is very low, using the
timestamp from right before phase1 is started, to calculate the grace
period in phase2 should avoid this situation.
Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>