Software quality rant

Last week-end, I lost the boot disk for my Linux server. Mostly, that server just serves as a file server for MP3 files, photos and ripped DVDs these days, but not being able to listen to music or watch movies does cramp your style a little bit.

I actually run most everything in redundant mode (RAID1 mirroring), so it should have been a simple thing to just remove the bad disk, keep the good mirror, and restart the system. Unfortunately, an md "soft RAID" device isn't actually recognized by the BIOS, and trying to re-install GRUB to boot from the second (non-lost) drive met with all kinds of mysterious and poorly documented failures. In the end, I put a small partition onto the bad disk (before the bad blocks) and re-ran GRUB in manual mode, and got the system up and running again.

Here's the first rant about software quality: The Ubuntu install disk, and GRUB, could not give me any kind of useful diagnostics when they didn't work. They just said "hey, it failed!" That's not... very useful. After going into the syslog and troubleshooting for a long while, I started piecing together what the problem might be, and after much trial-and-error, I came up with my own solution. Someone who hasn't run Linux for the last 13 years probably wouldn't have been as lucky.

This week-end, I'm about to replace the six small-capacity disks (legacy leftovers) with a pair of 1.5 TB disks (Samsung eco disks for $99 each!). Samsung has a disk utility on their web site ("estools") which can be burned to a bootable CD-ROM and used to test the drives before putting them into production. That's generally a good thing! I wouldn't want to put in a drive, and then have to take it out again a day later.

I booted up the DOS-based utility on my desktop machine (NVIDIA i650 chipset based), and started scanning one of the disks. That takes about three hours, so I shut down the Linux box and put in the second disk. When I booted the DOS system there, estools quickly crashed with an access violation exception. Apparently, a module called int_ahci has trouble. (This board is based on Intel G33 AFAICR)

Searching on the web, I could only find one other user who has seen that problem, and that user didn't care; all he wanted to do was turn on SMART for the hard disk. Searching on Samsung's web site, there is no FAQ or other information about this problem. However, what's more troubling is that there's no contact information for support with their support tool. All they have is a form for RMA-ing a disk, which is not at all what I want to do.

So: There's a diagnostics tool that has not been tested with an Intel chipset that's two or three years old by now (and fairly common). There's no way to get support for what to do when this tool doesn't actually work. What's worse is that, by not having a conduit back into the company, Samsung will not learn about this bug in the tool, so it will remain there forever, even if they wanted to fix problems like these. Given that their hard drive partitioning tool ends support at Windows 2000, with only implicit support for Windows XP, and no support for Vista or Windows 7, it's unclear how much they'd care, anyway.

For the past few years, I've been of the opinion that it's impossible to actually get computers to do what you want them to do, even though, by all indications, it should be totally possible. The reason for this is that computer development, and software development, is a race to the bottom in quality. Shipping something that appears to work for 80% of customers is more important than actually doing what it says on the tin. And, once you've shipped, it's more important to play catch-up with the competition, which has already announced some new vapor-ware, than to actually make the remaining 20% work. I am getting really, really tired of this.

And, before someone says I should try Windows 7 / Fedora / Ubuntu / NetBSD / Mac OS X / FreeDOS / Solaris / Minix: they are all crap. Computer hardware is too diverse to be well supported, documentation is too scarce (and jealously guarded, in most cases!), testing is too expensive, and there'll be something new and shiny to work on in six months anyway. Mac OS has actually been the worst to me (even though Apple allegedly has the "benefit" of controlling their own hardware/software); Linux has been better, but not great (it's actually feeling like it's going downhill since I switched from build-your-own-distro to Ubuntu), and the OS with the fewest problems so far has been Windows 7. Say what you want about Microsoft problems from the past, but they do have a lot of people employed doing both user interface and quality testing, which is more than you can say for most other computer products.

Too bad they don't do the same thing for their hardware: my Xbox has been sent back three times (once for DVD problems, twice for 3RROD). At least they pay for replacement and shipping both ways...

But the web is the new platform for software, right? Web apps like Gmail will kill ancient monoliths like Office, right?
So, I went to my Gmail account through my browser bookmark. It opened the "loading jwatte@gmail.com" prompt. Then I heard the page reload click. And then I heard it again. And again. It kept trying to reload the page, never making any progress. In the end, I had to click the "load basic HTML" link at the bottom to get it to load my mail inbox. Because, you know, web pages are really hard to write, and harder to test for unimportant niche browsers like IE8, and email isn't important enough that customers should care when it doesn't work anyway. Right?

Btw: My netbook (MSI Wind 123) came with a Bluetooth logo on the box, and a Bluetooth LED built-in, and a Bluetooth utility on the hard disk. But no actual Bluetooth module installed. According to support, that's how it's supposed to be. Yay crappiness!