Going big in numbers
Doing regular backups is one thing, but doing them right is sometimes quite a different story, especially when you stumble upon an extreme situation - for example, a reasonably small (in terms of actual disk size) Subversion repository with a rather high number of committed revisions.
The problem
Some time ago our backup software started to return strange-looking information about one of our SVN repositories - it reported that the process was taking a very long time to complete. A full weekly backup that started on Monday night hadn't finished by Wednesday morning. That wasn't normal...
Although the original repo was quite small (500MiB), after being dumped, compressed and encrypted it weighed over 11GiB! Given the final size, the reason why it took so long for a full backup to complete (remember - dump, compress and finally encrypt) was quite obvious. But why was the dump so enormous? Was there anything wrong in our setup? Other repositories were backed up as normal, and all of them could be restored (we did a full diff, just to make sure).
A bit of a background
After some investigation it turned out that the repo was quite small in size (about 500MiB) but it had a reasonably high number of committed revisions (way over 25k).
That wasn't anything strange, as we keep our whole configuration in Subversion, which is then distributed by Puppet among all of the servers. As a safeguard, all servers report and commit back all the changes made to their configuration.
They do it at regular intervals, run asynchronously from cron, every 4 to 6 minutes. If anything has changed, the machine sends back the current state of its configuration. Those changes are typically small, but given the number of servers, revisions build up quite fast.
What can be done about it?
The "problem" was in the way we did our backups. Initially, all our Subversion repositories were dumped using svnadmin dump and then piped through a set of other tools like bzip2, gpg etc. Finally, they were distributed among different destinations.
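The chain described above can be sketched in a few lines of Python. This is only an illustration, not our actual backup script: the exact flags, the repository path, the gpg recipient and the output file name are all assumptions made for the example.

```python
import subprocess


def dump_commands(repo_path, recipient):
    """Commands for the dump -> compress -> encrypt chain (flags are assumed)."""
    return [
        ["svnadmin", "dump", "--quiet", repo_path],
        ["bzip2", "--stdout"],
        ["gpg", "--encrypt", "--recipient", recipient],
    ]


def run_pipeline(commands, output_path):
    """Run commands as cmd1 | cmd2 | ... > output_path; fail if any stage fails."""
    procs = []
    with open(output_path, "wb") as out:
        upstream = None
        for index, cmd in enumerate(commands):
            is_last = index == len(commands) - 1
            proc = subprocess.Popen(
                cmd,
                stdin=upstream,
                stdout=out if is_last else subprocess.PIPE,
            )
            if upstream is not None:
                # Drop our copy of the pipe so the stages talk to each other directly.
                upstream.close()
            upstream = proc.stdout
            procs.append(proc)
        codes = [proc.wait() for proc in procs]
    if any(codes):
        raise RuntimeError(f"pipeline failed, exit codes: {codes}")


# Hypothetical usage (path, recipient and file name are made up):
# run_pipeline(dump_commands("/srv/svn/config", "backup@example.org"),
#              "config.dump.bz2.gpg")
```

Every stage streams into the next one, so nothing is written to disk until the encrypted result. That is convenient, but it also means the whole dump has to be pushed through the compressor and gpg on every run.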
svnadmin dump stores backups in a binary-safe, portable text format. Unfortunately, if you have loads of revisions and the commits are quite small, the overhead becomes very high - high enough to create such abnormal situations.
Even if a commit changes only a single line in some file, it is still described by a number of attributes that get written out during the dump. On top of that, unless the --deltas option is used, the dump contains the full text of every changed file rather than just the difference. Multiply that by a factor of 25k and you get the point.
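To put rough numbers on that overhead, here is a back-of-the-envelope calculation using the approximate figures quoted in this post (500MiB on disk, 11GiB of backup, 25k revisions). The real revision count was somewhat higher, so the per-revision values are upper estimates, and the 11GiB is the size after compression, so the raw dump was larger still.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3

# Approximate figures taken from the post.
repo_size = 500 * MIB
backup_size = 11 * GIB
revisions = 25_000

per_rev_on_disk = repo_size / revisions
per_rev_in_backup = backup_size / revisions
blow_up = backup_size / repo_size

print(f"on disk:   {per_rev_on_disk / KIB:.1f} KiB per revision")    # 20.5
print(f"in backup: {per_rev_in_backup / KIB:.1f} KiB per revision")  # 461.4
print(f"backup is {blow_up:.1f}x the size of the repository")        # 22.5
```

In other words, an average revision took about 20KiB in the repository but well over 400KiB in the finished backup.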
Solution
The solution turned out to be rather easy.
Instead of doing svnadmin dump, we make a hotbackup copy of the repository, tar it, and then feed it through the original pipeline (bzip2, gpg and so on).
There is one catch, though. This requires extra space of roughly twice the size of the largest repository being backed up. But given the overall results, it was worth it in our case, since the size of the repository backup dropped dramatically - from 11GiB to something around 100MiB.
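A minimal sketch of that approach, including a check for the extra space, could look like the following. It rests on several assumptions: it uses svnadmin hotcopy to make the copy (the exact tool is not named above), it compresses with Python's tarfile module instead of piping tar into bzip2, and the function names and the factor of 2 are chosen just for this example.

```python
import os
import shutil
import subprocess
import tarfile


def dir_size(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def has_room(repo_path, scratch_dir, factor=2.0):
    """True if scratch_dir has at least factor * repository size free."""
    needed = factor * dir_size(repo_path)
    return shutil.disk_usage(scratch_dir).free >= needed


def hotcopy_and_tar(repo_path, scratch_dir, archive_path):
    """Hot-copy the repository, pack the copy into a .tar.bz2, then clean up."""
    if not has_room(repo_path, scratch_dir):
        raise RuntimeError("not enough scratch space for a hot copy")
    name = os.path.basename(os.path.normpath(repo_path))
    copy_path = os.path.join(scratch_dir, name)
    # Assumption: svnadmin hotcopy is the tool used to create the copy.
    subprocess.run(["svnadmin", "hotcopy", repo_path, copy_path], check=True)
    try:
        with tarfile.open(archive_path, "w:bz2") as tar:
            tar.add(copy_path, arcname=name)
    finally:
        shutil.rmtree(copy_path, ignore_errors=True)
```

The resulting archive would still have to be encrypted with gpg before being shipped anywhere. Checking the free space first is cheap and avoids filling up the disk halfway through the copy.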
Current state
As a side note, our repository currently holds slightly above 700MiB of data and about 32k revisions. The backup still weighs roughly 100MiB and takes only minutes to complete.
So what?
Always look through your toolbox before you start complaining or making panicked moves :)
