Requiem for data loss

I was about to go full epic-narrator mode, but I'm barely keeping it together. Subjectivity will creep in, can't help it at the moment. I _can_ still try to explain it in a succinct, bullet-point, matter-of-fact way, but I'm choosing against it. So, it'll be a bit of a long-winded story, unlike most of my posts.

What I lost can't be downloaded, borrowed or recreated. There was one of each of those things, and now there aren't any. I don't expect anyone else to 1:1 grasp how I feel about losing 4 years' worth of hobby projects, things I brought to life that are no longer there - so, if you're like "yeah, not like anyone died... chill out & quit bickering about it" I don't really fault you for that. I don't have a pet; I wouldn't quite understand how losing a pet may feel either - compared to someone who does - at least not to the same extent.

So, please excuse me while I write its obituary, cause I want this to stay as vivid of an example as possible, for me to revisit; cause on the surface it may seem easier to blame it on this-n-that or call it an accident, but at the end of the day, it wasn't anyone else's fault but mine.

The Pledge

I have one Windows PC. The primary purpose of it is running games & multimedia tools that aren't optimal for running on my Linux machines.

A couple of days back, I tried playing PUBG & it was lagging like crazy. Everything was happening in slow-mo. So, after a few tries, some googling around to see if any recent PUBG patch broke the game, and trying out some other games & seeing major performance drops, I was sure it was something to do with my system and not with the game(s).

Was trying to calibrate rank for the new season... GGWP
So naturally, I thought my PC was probably affected by some sort of malware, or rogue background services, or someone was mining on my RTX 3080 without giving me a cut. I ran some benchmarks, overclocked, underclocked, ran some differential diagnostics to narrow down the specific bottleneck in resource usage and all. Cleaned and reinstalled GPU drivers with DDU, used a previously known good stable version etc. These things isolated the problem gradually & they pointed towards bad GPU performance.

I suspected some sort of distributed mining operation running in the background.

In between all these things, I didn't do one very basic check, which we'll come back to later. Anyway, after all that, I decided it was time to install a fresh copy of Windows on the other drive, cause trying to hunt down malware on Windows is usually an exercise in futility.

This system had two M.2 NVMe SSDs: the older Evo 960 as the boot drive, and a newer Evo 970 as the media drive, which I upgraded to in 2019. This system wasn't supposed to hold any crucial data, except for the projects I'm currently working on. So, it was never meant to be a tough choice to format it, and that is so by design. No matter whether there's a thief, a murderer or a ghost hiding in my house, my go-to strategy is to set it on fire & let it sort itself out.

However, I underestimated human fallibility, especially my own. The projects were meant to be published as they finished - but being hobby projects with no pressing deadlines, they kept piling up in a half-done state. They ended up becoming valuable data that I can't just safely lose anymore.

Oh, I know... I should take regular differential backups! How about a DIY NAS as media cache where daily backups are taken with rsnapshot running on a Raspberry Pi? Yeah, that's a good project... let's do that... and 2 years later, here I am, and there it goes! It was supposed to avoid data loss, yet it became a victim of data loss. Great!

Anyway, I digress.

I removed everything unimportant on the old boot drive, leaving the Windows installation intact in case I wanted to boot from it again. Then I copied about 50GB of stuff onto it, which is practically all the projects combined.

I thought of just uploading them to a cloud bucket, then thought, "well, what if these files are somehow infected?" In hindsight, it doesn't fucking matter whether they're infected when they're being put on an object storage backend like S3/GCS. If they later turn out to be infected, I can just delete that storage bucket. What a severe lack of judgement & clarity!

So, after all the files were copied to the Evo 960, I plugged in a bootable USB and rebooted the system. I took the 960 out, leaving the 970 in to become the new boot drive (as it's much faster too).

The Turn

Things went pretty smoothly, considering it's a Windows installation that still takes ~2hrs of manual labor to make the system somewhat usable, and wipes the user's home directory and all customizations on each install.

It was only after making it just usable enough to run some benchmarks and try some games that I could validate whether anything had changed. And would you look at that! GPU-Z says the GPU is running on a single PCIe lane, 1x (instead of 16x, which is how it should be).

I checked whether that was the crux of the issue or a new, additional problem. Turned out, all the GPU benchmarks were the same as before (i.e. similarly bad results as before the fresh install) & I realized that it was indeed the main issue. I had completely overlooked that very basic piece of information, that it was running on 1 lane. It was pretty much in front of me the whole time, if I had just looked for it.
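For what it's worth, that basic check can be scripted. Below is a minimal sketch, assuming an NVIDIA card with `nvidia-smi` available on the PATH. It uses the `pcie.link.width.current` and `pcie.link.width.max` query fields. The helper names (`parse_link_widths`, `is_degraded`, `read_link_widths`) are made up for illustration and aren't part of any tool I actually used (I was looking at GPU-Z).

```python
import subprocess

# Fields understood by `nvidia-smi --query-gpu=...`
QUERY = "pcie.link.width.current,pcie.link.width.max"


def parse_link_widths(csv_line: str) -> tuple[int, int]:
    """Parse one 'current, max' line of nvidia-smi CSV output into ints."""
    current, maximum = (int(field.strip()) for field in csv_line.split(","))
    return current, maximum


def is_degraded(current: int, maximum: int) -> bool:
    """True when the negotiated lane count is below what the link supports."""
    return current < maximum


def read_link_widths() -> list[tuple[int, int]]:
    """Ask nvidia-smi for (current, max) lane counts, one tuple per GPU."""
    output = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_link_widths(line) for line in output.splitlines() if line.strip()]
```

Calling `read_link_widths()` and passing each pair to `is_degraded` would have flagged my 1-of-16 situation in a couple of seconds, before any formatting. Treat the result as a hint rather than a verdict: it only compares two numbers the driver reports and says nothing about why the lanes dropped.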

Looking up common causes for a sudden PCIe lane count change, I found a truckload of misdirection & misguidance in various threads and forum posts - in short, it can be a common cold, or cancer, or a broken capacitor, or torn-off gold fingers on the edge connector. As it always is with symptom-based diagnosis and anecdotal solutions.

After spending a few hours doing a little this and that from the software & BIOS side, I decided to open, clean & reseat the GPU (the RAM treatment), and that actually bumped it to 8x.

Weird, but at least we're heading the right way. If nothing... I can still cut the loss & be okay with this, as 8x Gen3 PCIe should be fairly acceptable for the time being, until I get time to find out the real issue. But hey, why not do this again... clean it a bit more and hope it works?

And it did... 16x BABY!

Sigh of relief, as THAT was the issue... 4AM, and about time I got it sorted out. And it's not even the first time I have been in a situation like this, but I guess I never learned to look for the easy solutions first!

Having enough experience simultaneously roleplaying both the user and the technician, one might assume, I'd learn something by now... huh!

Now, that made me realize the whole business of backing up & formatting had been - ironically & in fact - the actual exercise in futility. Just because I was beating around the wrong bushes & was too trigger-happy.

So, let's just put that good old SSD back in, and we should be able to test it from the old Windows environment itself, right? WRONG! The BIOS takes forever to load, and the drive doesn't show up as a boot device!

Heh, must be something with UEFI settings, cause the old one had legacy bootloader (MBR), so let's reset & reconfigure the BIOS & try again. Still nothing.

Ok, let's try one NVMe module at a time, in the same slot, each tried for both UEFI & Legacy mode & see what happens. 970 does fine, 960 does fuck all.

Alright, let's reinstall Windows (yes, all over again) in forced Legacy mode (reinitialize drive with MBR, removing GPT). Remove all the keys from BIOS, and check if it detects SSDs which are both MBR drives now. Nope!

Alright, let's see if there's some issue with the slot compatibility... how about we put it into some USB and PCIe adapters... and also try them on some other systems...


That's when I realized that from a consumer's perspective, it's gone. A lot of "science" can be done on it, but for all practical intents and purposes, I had been acting in denial for the last two hours.

I sat down. Cursed myself, cause I realized there's no one to blame but myself. I had multiple opportunities & all the time in the world to not be in that situation, yet I took a wrong turn at each one of those crossroads. Felt like crying & punching whatever was in front of me. For 5 painfully long seconds, I was livid.

Realizing I'm in grief, I short circuited the bargaining, depression & acceptance phases into a weird medley of "let's document this the best we can, list down what we lost, note down the takeaways, store the module safely in case it can be recovered by someone specializing in V-NAND recovery", and tweeted about it.

The Prestige

Nope... there's no happy ending. At least not yet, and probabilistically not ever. But I'll keep this section for future edits just in case. Miracles are only but Murphy's law with inverted value propositions.

Postmortem

My best guess is, the module went belly up as soon as I took it out of its slot after taking the backup. Cause the system was rebooting into BIOS, there's a good chance there was some handshake going on during the POST and/or some power pin made contact with a data pin & fried the interface IC.


The interface IC looks like a custom ARM subsystem specially designed for PCIe interfacing & for controlling data caching & transfer between the DRAM & NAND chips. If that's all that happened, then it's possible the data is safe in the 2 NAND modules. If I were in the US/EU, right now I'd be in transit to someone's den - someone with flash module reader hardware or precision soldering/reflowing equipment. Chances would still be thin, but there would be some merit in trying. But India is practically a desert when it comes to these parts of technology.

To add a 3rd party perspective on what goes into data recovery, in case you're reading this with that intent, here's a quick video of Linus explaining it, as fast as it can possibly be explained:



Reflection

Here's the list of stuff that I can recall being lost:

  • Multimedia Projects
    • Raw footage, audio, voiceover, along with DaVinci Resolve project files & final exports of quite a few work-in-progress & recently completed but not uploaded projects (some date as far back as 2017, e.g. the Core i7 7700k build video).
    • Many digital concept artworks in various stages of completion. "Metamorphosis" being the one I would miss the most. These are all unpublished.
    • Some photomanipulations of my own photography samples. These are also all unpublished, except for some of the bland ones.
  • 3D Printing (FDM)
    • Thankfully a lot of them are on my GitHub as a kitchen sink, or in their own repos. But many others were WIP, and those weren't yet pushed.
    • Quite a few modules/scripts/cutouts I made that can be easily reused in various projects, without needing to be redesigned each time.
    • Cura slicer configs for ABS & PETG with 0.4mm & 0.8mm nozzles (absolutely no reason why I hadn't pushed them). They can potentially be recreated, but validating the settings tweaks with multiple test prints takes a massive amount of time.
  • Electronics & PCB Designs
    • Have been holding onto these for quite a while cause ordering from China is problematic & getting them made in India would mean I will have to do SMT soldering myself. So, couldn't finalize those designs before getting the prototypes at hand. Bad excuse for not pushing to GitHub, but that's what it is.
    • Only one survived, but it has only the initial planning part, and not the final version of the project. But I can recreate it, so it's something!
    • A ton of datasheets, and a goldmine of reference designs & application notes on MCUs & power circuit designs. Especially for a GaNFET based pocket sized 1kW+ phase shifted full-bridge SMPS & for some carrier board designs like for Jetson Nano modules, ESP32 & Raspberry Pi Pico with functional breakouts (as hinted here).
  • Settings, Configs & Checkpoints
    • Save files for any single player games I've played in the last 4 years
    • Cura slicer would also fall into this category, but in general all the software specific customizations (KiCad custom libs & footprints, FreeCAD customizations etc.)
    • 5 functionally isolated Firefox profiles that I heavily relied on & pretty much all the other application configs in %APPDATA% that I thought I'd gradually migrate over time.
    • The entire NVIDIA Omniverse setup, along with all its tooling ecosystem & the tests I was running on them (basic stuff, nothing to show for yet, but that ground work is generally the unfun part that I wouldn't wanna redo).

As for avoiding such an incident in future, I need to take steps similar to those I took for the computing devices themselves. Once upon a time, if my primary workstation acted up, I'd need to wait until Monday to use someone else's system to even create a bootable USB to attempt the recovery process (circa 2011, Mysore); those days are behind me now. Now, it will take me 5min and a coffee to switch to another system and turn it into exactly the environment I'd want it to be, along with redundancy & adaptability of peripherals too.

I may need to rethink this when I have a better hold of my mental faculties & am capable of more objectivity, but for now, these are my key takeaways:

  1. Don't try to optimize backups, or at least don't use that as an excuse for not taking backups. It's ok if they're messy. It's better for something to be hard to find than it being lost.
  2. The storage migration in 2019, moving everything from spinning & optical media to SSDs was a good move. But that doesn't make the stored data invulnerable. SSDs are still pieces of technology that can & will eventually fail. Build resilient failovers - the more the merrier.
  3. Understand the frailty of holding anything dear as an inherent & self-imposed vulnerability. Instead of keeping things to myself and reaching for a self-defined obstacle named "perfection", only for the sake of their personal value to me alone - I should share them ASAP to build a collective value. Each one will either stand the test & come out as a worthy idea (so that I no longer have to protect it solo) or get thrown in the bin (as it rightfully deserves to be); but in either case, it won't pile up as a potential heartache.
  4. There _are_ better and worse ideas. Nothing against my LED lights, I love them very much, but they aren't more important, effective & meaningful than data protection projects.
  5. The target resiliency to attain would be that even I fail to destroy the data, given full opportunity & intent (e.g. losing control in a fit of rage). It should also ensure that I can't be controlled, influenced or extorted into doing it.
  6. Occam's razor.
  7. Wu Wei.
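
To make takeaway 1 concrete, here's a minimal sketch of the kind of messy, unoptimized backup I mean: copy a folder wholesale into a new timestamped folder somewhere else, then verify every file by checksum. It's an illustration under assumptions, not what I'm actually running; `messy_backup` and `sha256_of` are hypothetical helper names, and the destination is assumed to be a different physical drive or mounted share.

```python
import hashlib
import shutil
import time
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large media files don't fill RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def messy_backup(source: Path, backup_root: Path) -> Path:
    """Copy `source` into a new timestamped folder and verify each file.

    No deduplication, no pruning, no cleverness: every run is a full copy.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"{source.name}-{stamp}"
    shutil.copytree(source, target)  # raises if the target already exists
    for original in source.rglob("*"):
        if original.is_file():
            copy = target / original.relative_to(source)
            if sha256_of(original) != sha256_of(copy):
                raise IOError(f"checksum mismatch: {copy}")
    return target
```

Something like `messy_backup(Path("D:/projects"), Path("E:/backups"))` (both paths made up) wastes space with every run, and that's the point: wasteful and hard to find still beats lost.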
