commit 0b00fb3536c5308b2e0de5000f16abc7a5c9bb06
parent e783e1f6547d2ea69e8ac0f857506f4dd0a4ab28
Author: pyratebeard <root@pyratebeard.net>
Date: Thu, 13 Oct 2022 13:15:04 +0100
smoke_me_a_kipper
Diffstat:
1 file changed, 24 insertions(+), 0 deletions(-)
diff --git a/entry/smoke_me_a_kipper.md b/entry/smoke_me_a_kipper.md
@@ -2,4 +2,28 @@
Earlier this year I wrote about my [backup setup](20220414-speak_of_the_dedup.html) and this last week I had to put it to the test.
+My PC is a tower that I have on a small stand next to my desk. In the past I had kept the case (an Antec 1200) on my desk but it is rather large and dominates the space a bit too much, as I don't have a very big desk. The other day my one-year-old toddled into the study and started pushing the power button on my PC. This power cycled the machine a few times in quick succession. At the time I wasn't aware of this. The next morning I booted up my PC but noticed it was very sluggish. It crashed trying to open my browser. After it happened again I started digging through the logs and noticed some filesystem corruption.
+
+As I described in my "speak_of_the_dedup" post I have a 3 disk RAID array as my $HOME. Because of the size I only back up important documents, etc. nightly. A full backup is done periodically to an external drive I keep in my bug out bag. Unfortunately I had not done a full backup in a while, but I knew my nightly backups were good so nothing too important was lost.
+
+I had used xfs on my $HOME, so I unmounted the device and started an `xfs_repair`. The repair tool very quickly got to Phase 3, showing the output:
+```
+Phase 3 - for each AG...
+ - scan and clear agi unlinked lists
+ - 09:50:01: scanning agi unlinked lists - 0 of 32 allocation groups done
+```
+
+The last line was repeated every 15 minutes, for over 36 hours, never changing from 0 allocation groups done. I don't think it was doing anything. Eventually I stopped it and ran the repair in check mode. This caused a segmentation fault at Phase 3. I tried again but got the same segfault.
+
+After a few days of digging around and trying different things I decided the effort wasn't worth it. Reluctantly I accepted my losses and started the recovery.
+
+Once the RAID array was reformatted I began the data copy from my external drive. This put me back to when it was last backed up. Then I could `rclone` my nightly backups from their last run (before the corruption) and bring that data up to date.
+
+This got me to a relatively good position. Okay, I had lost some random downloads and a little bit of code that hadn't been pushed to my git server, but nothing serious. It is a little disappointing though. My backup setup is not good enough.
+
+The reason I don't do a full nightly backup to the cloud is that `rclone` takes so long to copy the data. I decided to look into this, to see if it could be sped up. Reading the man page shows that `rclone` has an option to only transfer files younger than a specified age, `--max-age=`. Using `dedup` means I don't have to transfer everything each time `rclone` runs, only the most recent archive. Testing this brought my nightly backup time down to TK.
+
+I decided I needed more regular backups of my $HOME, so I needed some more storage. I purchased another external drive which now sits permanently plugged into my PC. I was going to use `dedup` again but decided it would be better to use an alternative tool so I am not relying on only one tool. I opted for `rsnapshot`. The first backup did take a long time, but now each evening I can run `rsnapshot` to back up my $HOME to the external drive and `rclone` that latest archive to the cloud storage.
+
+Another full backup will still be done to the drive in my bug out bag. I just have to be better at doing it more regularly. At least now if I need to restore I will be able to recover all of $HOME and not only the important things.