Discussion:
[linux-elitists] Fun with Git repository copying
Don Marti
2013-04-13 14:45:39 UTC
Permalink
What happens when you're doing a copy of a Git
repository that's in the process of being pushed to
or garbage collected?

http://joeyh.name/blog/entry/difficulties_in_backing_up_live_git_repositories/

http://marc.info/?l=git&m=136422341014631&w=2

Sometimes, bad things.

Here's a hypothetical game.

Let's say that programmer A has the job of
implementing POSIX cp(1), but has decided to do it
in a way that will pass the "cp" test suite but order
the file copying to maximize the chances of breaking
copies of Git repositories that are being changed
during the copy. (For example, "evil cp" might see
if there are any subdirectories named
"objects", copy their contents first, then pause,
then copy the rest.)

Programmer B has decided to extend Git to defend
against "evil cp" so that the copy is usable, even if
"evil cp" and a large push and repack happened at
the same time.

A has full access to the Git source code and mailing
list. B is aware of the existence of "evil cp"
but not the details of what it does.

Who wins?
--
Don Marti +1-510-332-1587 (mobile)
http://zgp.org/~dmarti/ Alameda, California, USA
***@zgp.org
Greg KH
2013-04-13 15:07:32 UTC
Permalink
Post by Don Marti
What happens when you're doing a copy of a Git
repository that's in the process of being pushed to
or garbage collected?
http://joeyh.name/blog/entry/difficulties_in_backing_up_live_git_repositories/
http://marc.info/?l=git&m=136422341014631&w=2
Sometimes, bad things.
Sometimes? It's more common than you might think, which is why the
kernel.org admin has created grokmirror to handle mirroring of git
repos, which have the same problem when backed up / copied on a
live system:
https://plus.google.com/u/0/114752601290767897172/posts/fVvEoMe1H6q
Post by Don Marti
Here's a hypothetical game.
Let's say that programmer A has the job of
implementing POSIX cp(1), but has decided to do it
in a way that will pass the "cp" test suite but order
the file copying to maximize the chances of breaking
copies of Git repositories that are being changed
during the copy. (For example, "evil cp" might see
if there are any subdirectories named
"objects", copy their contents first, then pause,
then copy the rest.)
Programmer B has decided to extend Git to defend
against "evil cp" so that the copy is usable, even if
"evil cp" and a large push and repack happened at
the same time.
A has full access to the Git source code and mailing
list. B is aware of the existence of "evil cp"
but not the details of what it does.
Who wins?
B, because git doesn't use 'cp' but rather the syscalls directly, so the
user of the git repo itself will be just fine. Who knows about the user
of the copied repo, though; an "evil" cp could just not copy all of the
files.

Again, don't just use rsync or cp on a live git repo, you wouldn't do
that on a database, would you?

greg k-h
Don Marti
2013-04-13 17:26:11 UTC
Permalink
Post by Greg KH
Again, don't just use rsync or cp on a live git repo, you wouldn't do
that on a database, would you?
No, for a database I'd shut down the server, then
copy, then start up the server again (unless it was
critical to minimize downtime, in which case I'd put
the database files on a separate filesystem and do
a snapshot, then copy that.)

For git, though, there's no server process to shut
down (unless I want to bring down sshd). What's the
best way to make git not modify a repository while
I'm copying it or backing it up?
--
Don Marti +1-510-332-1587 (mobile)
http://zgp.org/~dmarti/ Alameda, California, USA
***@zgp.org
Greg KH
2013-04-14 03:04:12 UTC
Permalink
Post by Don Marti
Post by Greg KH
Again, don't just use rsync or cp on a live git repo, you wouldn't do
that on a database, would you?
No, for a database I'd shut down the server, then
copy, then start up the server again (unless it was
critical to minimize downtime, in which case I'd put
the database files on a separate filesystem and do
a snapshot, then copy that.)
For git, though, there's no server process to shut
down (unless I want to bring down sshd). What's the
best way to make git not modify a repository while
I'm copying it or backing it up?
You already said it, stop all processes that could access the repo (i.e.
sshd), back it up / snapshot it, and start it up. Just like any other
database.

Or use a snapshot-like filesystem (trigger a btrfs snapshot on every
commit finishing), or have a git trigger that pushes the data somewhere
else at the same time (or start a pull from somewhere else), which is
what git.kernel.org is moving toward.
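The "git trigger that pushes the data somewhere else" part can be done
with a post-receive hook. A minimal sketch, using throwaway paths and a
remote name ("backup") invented for illustration -- not how
git.kernel.org actually does it:

```shell
# Sketch: a post-receive hook that mirrors the repo after every push
# lands.  All paths and the remote name "backup" are illustrative.
set -e
tmp=$(mktemp -d)

git init -q --bare "$tmp/src.git"      # the "live" repo people push to
git init -q --bare "$tmp/mirror.git"   # the backup it mirrors to
git -C "$tmp/src.git" remote add backup "$tmp/mirror.git"

# post-receive runs only after the push has fully landed, so the
# mirror never sees a half-written object store
cat > "$tmp/src.git/hooks/post-receive" <<'EOF'
#!/bin/sh
git push --quiet --mirror backup
EOF
chmod +x "$tmp/src.git/hooks/post-receive"

# simulate a user pushing a commit
git clone -q "$tmp/src.git" "$tmp/work"
git -C "$tmp/work" -c user.email=a@example.com -c user.name=a \
    commit -q --allow-empty -m 'some work'
git -C "$tmp/work" push -q origin HEAD
```

After the push, the mirror holds the same refs as the source, and the
copy was made by git itself rather than by a file-level tool racing
against writers.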

greg k-h
Øyvind A. Holm
2013-04-14 03:21:31 UTC
Permalink
Post by Greg KH
Post by Don Marti
Post by Greg KH
Again, don't just use rsync or cp on a live git repo, you wouldn't
do that on a database, would you?
No, for a database I'd shut down the server, then copy, then start
up the server again (unless it was critical to minimize downtime, in
which case I'd put the database files on a separate filesystem and
do a snapshot, then copy that.)
For git, though, there's no server process to shut down (unless I
want to bring down sshd). What's the best way to make git not
modify a repository while I'm copying it or backing it up?
You already said it, stop all processes that could access the repo
(i.e. sshd), back it up / snapshot it, and start it up. Just like any
other database.
Why use rsync at all? We already have git fetch. Create a bare
repository on another machine, set up a remote in that bare repo that
points to the source repo (the one that should be backed up) and run
"git fetch --all --prune". In addition to that, you could recreate all
the branches locally (in the backup repo) using something like this
script:

https://github.com/sunny256/utils/blob/master/git-allbr

Are there any advantages to using rsync instead of just fetching a
backup? By using this method, silent corruption of the main repo (like
the KDE event some weeks ago) will be caught if fetch.fsckObjects is set
to true. And there's no need to shut down anything. Also, if the source
repo is repacked with git-gc, no additional bandwidth is used to download
all the repackaged objects.

Regards,
Øyvind
