2020-06-23

Et tu, Microsoft

It's a beautiful Saturday afternoon.

Everything is going as peachy as can be, with the satisfaction of having released a new version of your software just a couple of days ago, a satisfaction that, for once, wasn't cut short by the all too common realisation that you managed to introduce a massive "oops", such as including completely wrong drivers for a specific architecture (courtesy of Rufus 3.10) or having your ext formatting feature break when a partition is larger than 4 GB (courtesy of Rufus 3.8)... Sometimes I have to wonder if Rufus isn't suffering from the same curse as the original Star Trek movie releases (albeit inverted in our case).

Thus, basking in the contentment of a job well done, you fire up your trusty Windows 10, which you upgraded to the 2004 release just a couple weeks ago (along with that Storage Space array you use), and go on your merry way, doing inconsequential Windows stuff, such as deciding to rename one folder.

And that's when all hell breaks loose...

Suddenly, your file explorer freezes, every disk access becomes unresponsive, and you see your most important disk (the one that contains, among other things, all the ISOs you accumulated for testing with Rufus, and which you made sure to set up with redundancy using Storage Spaces, along with an ReFS file system where file integrity had been enabled) undergoing constant access, with no application in sight that appears to be performing it...

Oh and rebooting (provided you are patient enough to wait the 10 minutes it takes to actually reboot) doesn't help in the slightest. If anything, it appears to make the situation worse as Windows now takes forever to boot, with the constant disk access issue of your Storage Space drive still in full swing.

Yet the Storage Spaces control panel reports that all of the underlying HDDs are fine, a short SMART test on those also reports no issue, and even a desperate attempt to identify whether a specific drive might be the source of the trouble, by trying each combination of 3 out of the 4 HDDs, yields nothing. If nothing else, this confirms the idea that Microsoft did a relatively solid job with Storage Spaces, at least in terms of hardware gotchas, considering that every other parity solution I know of, such as the often decried Intel RAID, would scream bloody murder if you removed another drive before it got through the super time-consuming rebuild of the whole array (which is the precise reason I swore off Intel RAID and moved to Storage Spaces).
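
For the record, if you want to run the same kind of sanity check from an elevated PowerShell prompt, the gist of it is something like this ("Data Pool" is just a placeholder for whatever you named your pool):

  # List every physical disk that belongs to the pool, with its health status
  $disks = Get-StoragePool -FriendlyName "Data Pool" | Get-PhysicalDisk
  $disks | Select-Object FriendlyName, MediaType, HealthStatus, OperationalStatus

  # Pull the SMART-derived reliability counters for those same disks
  $disks | Get-StorageReliabilityCounter |
      Select-Object DeviceId, Temperature, ReadErrorsUncorrected, WriteErrorsUncorrected

  # And check how the parity virtual disk sitting on top of them is doing
  Get-VirtualDisk | Select-Object FriendlyName, ResiliencySettingName, HealthStatus

Which is just the scripted equivalent of what the control panel and SMART checks above already told me: everything reports as perfectly healthy, for whatever that's worth.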

An ReFS issue then? If that's the case, talk of a misnomer for something that's supposed to be resilient...

Indeed, the Event Viewer shows a flurry of ReFS errors, ultimately culminating in this ominous message, which gets repeated many times as the system attempts to access the drive, while you end up finding that your drive has been "remounted" as RAW:
Volume D: is formatted as ReFS but ReFS is unable to mount it;
ReFS encountered status The volume repair was not successful...

Someone at Microsoft may want to look up the definition of resiliency...
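
Should you want to fish the same kind of entries out of your own Event Viewer without clicking around, a PowerShell snippet along these lines should do the trick (I'm deliberately matching the provider name loosely rather than hardcoding the exact ReFS event source):

  # Grab the most recent System log entries emitted by the ReFS driver,
  # so that mount/repair failures like the ones above show up at the top
  Get-WinEvent -LogName System -MaxEvents 1000 |
      Where-Object { $_.ProviderName -match 'ReFS' } |
      Select-Object TimeCreated, Id, LevelDisplayName, Message |
      Format-List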


Ugh, that's the second ReFS drive I've lost in about a month (the earlier one was an SSD that hosted all my VMs, and which Windows mysteriously overwrote with a Microsoft Reserved Partition)! If that's indicative of a trend, I think Microsoft might want to weather-test their data-oriented solutions a little better. Things used to be rock-stable, but I can't say I've been impressed by Windows 10's prowess on that matter lately...

And yes, I do have some backups of course (well, I didn't for those VMs, but that was data I could afford to lose) but they are spread all over the place on account that I am not made of money, dammit!

See, the whole point of entrusting my data to a 10 TB parity array made of 4x4 TB HDDs was that I could reuse drives that I (more or less) had lying around, and you'd better believe those were the cheapest 4 TB drives I'd been able to lay my hands on. In other words, Seagate, since HDD manufacturers have long decided, or, should I say, colluded, that they should stop trying to compete on price, as further evidenced by the fact that I still paid less for an 8 TB HDD, two frigging years ago, than the cheapest price I could find for the exact same model today.

"Storage is getting cheaper", my ass!

Oh and since we're talking about Seagate and reliability, I will also state that, in about 20 years of using almost exclusively Seagate drives, on account that they are constantly on the cheaper side (though Seagate and other manufacturers may want to explain why on earth it is cheaper to buy a USB HDD enclosure, with cable, PSU and SATA ↔ USB converter, than the same bare model of HDD), I have yet to experience a single drive failure with any of the Seagates I use in my active RAID arrays.

So when people say Seagate is too unreliable, I beg to anecdotally differ since, for the price, Seagate's more than reliable enough. I mean, between paying exactly 0 € for 10 TB with parity vs. between 500 and 700 € (current price, at best) for a parity or mirrored NAS array, there's really no contest. I don't mind that a lot of people appear to have semi-bottomless pockets, and can't see themselves going with less than a mirroring solution built on brand-new NAS drives. But that's no reason to look down on people who do use parity along with cheap non-NAS drives, because price is far from being an inconsequential factor when it comes to the preservation of their data...

And it's even more true here, as the issue at hand has nothing to do with using cheap hardware, and everyone knows that a parity or mirroring solution is worth nothing if you don't also combine it with offline backups, which means even more disks, preferably of large capacity, and therefore even more budget to provision...

All this to say that there's a good reason why I don't have a single 8 or 10 TB HDD lying around with all my backups for the array that went offline, and why, as much as I wish otherwise, there are going to be gaps in the data I restore... So yeah, count me less than thrilled with a data loss that wasn't incurred by a hardware failure or my own carelessness (the only two causes that should ever be valid for losing data).

Alas, with the Windows 10 2004 feature update, it appears that the good folks at Microsoft decided that there just weren't enough ways in which people could kill their data. So they created a brand new one.

Enter KB4570719.

The worst part of it is that I've seen reports indicating that this, as well as other corollary issues, was pointed out to Microsoft by Windows Insiders as far back as September 2019. So why on earth was something that should instantly have been flagged as a super critical data-loss issue included in the May 2020 update?

Oh and of course, at the time of this post, i.e. about one month after the data-destructive Windows update was released, there's still no solution in sight... though, from what I have found, non-extensible parity Storage Spaces may be okay to use, as long as they were created using PowerShell commands that make them non dynamically extensible, rather than through the UI, which forces them to be extensible (see the sketch below).
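
For those who'd rather not wait on Microsoft and want to try that route, this is roughly what it looks like. Treat it as a sketch only: the pool and disk names are placeholders, and I obviously can't guarantee that fixed provisioning actually shields you from this particular bug.

  # Pool together every disk that is eligible (i.e. not already claimed by Windows)
  New-StoragePool -FriendlyName "Data Pool" -StorageSubSystemFriendlyName "Windows Storage*" `
      -PhysicalDisks (Get-PhysicalDisk -CanPool $True)

  # Create the parity space with FIXED provisioning, which you cannot do from the
  # control panel UI (that one always creates dynamically extensible spaces)
  New-VirtualDisk -StoragePoolFriendlyName "Data Pool" -FriendlyName "Data" `
      -ResiliencySettingName Parity -ProvisioningType Fixed -UseMaximumSize

  # Initialize, partition and format the resulting disk
  # (or use NTFS, if this whole adventure has put you off ReFS)
  Get-VirtualDisk -FriendlyName "Data" | Get-Disk |
      Initialize-Disk -PartitionStyle GPT -PassThru |
      New-Partition -UseMaximumSize -AssignDriveLetter |
      Format-Volume -FileSystem ReFS -NewFileSystemLabel "Data"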


If this post seems like a rant, it's because it mostly is, considering that I am less than thrilled at having had to waste one week trying to salvage what I could of my data. But since we need to conclude this little story, let me impart the following two truths upon you:

1. EVERYTHING, and I do mean EVERYTHING, is actively trying to murder your data.
Do not trust the hardware. Do not trust yourself. And especially, do not trust the Operating System not to lunge a sharp blade straight through your data's toga, during the Ides of June.

2. (Since this is something I am all too commonly facing with Rufus' user reports) It doesn't matter how large and well-established a software company is compared to an Independent Software Developer; the OS can still very much be the one and only reason why third-party software appears to be failing, and you should always be careful never to consider the OS above suspicion. There is no more truth to "surely a Microsoft (or an Apple or a Google for that matter) would not ship an OS that contains glaring bugs" today than there was in the past, or than there will be in the future.
The OS can and does fail spectacularly at times (and I have plenty more examples besides this one that I could provide). So don't fail to account for that possibility.

2 comments:

  1. Though NTFS doesn't have integrity checking, I end up using it over ReFS because I know it's a mess but I know from personal experience that it works! I'm sorry to hear about the data loss though, it's a real mess to have unexpected failure!

  2. That is why I use a self made home server with BtrFS Raid-1 on two drives running Debian stable...
