In a second SSD snafu in as many years, Dell and HPE have revealed that they shipped enterprise drives with a critical firmware bug, one that will eventually cause data loss. The bug, seemingly related to an internal runtime counter in the SSDs, causes them to fail once they reach 40,000 hours of runtime, losing all data in the process. As a result, both companies have had to issue firmware updates for their respective drives, as customers who have been running them 24/7 (or nearly so) are starting to trigger the bug.

Ultimately, both issues, while announced and documented separately, seem to stem from the same basic flaw. HPE and Dell both used the same upstream supplier (believed to be SanDisk) for SSD controllers and firmware for certain, now-legacy, SSDs that the two computer makers sold. And with the oldest of these drives having reached 40,000 hours of runtime (4 years, 206 days, and 16 hours), the firmware bug has now surfaced and needs to be patched quickly. To that end, both companies have begun rolling out firmware updates.

As reported by Blocks & Files, the actual firmware bug seems to be a relatively simple off-by-one error that nonetheless has significant repercussions.

The fault fixed by the Dell EMC firmware concerns an Assert function which had a bad check to validate the value of a circular buffer’s index value. Instead of checking the maximum value as N, it checked for N-1. The fix corrects the assert check to use the maximum value as N.
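To make the off-by-one concrete, here is a minimal sketch of the kind of check described above. It is not the vendor's firmware: the names `FirmwareAssert`, `validate_index`, and the buffer bound `N = 8` are hypothetical, and the only detail taken from the report is that the check used N-1 as the maximum where N was the correct one.

```python
class FirmwareAssert(Exception):
    """Hypothetical stand-in for a firmware assert that halts the drive."""


def validate_index(index: int, max_value: int) -> None:
    """Raise if the circular buffer index falls outside 0..max_value."""
    if not 0 <= index <= max_value:
        raise FirmwareAssert(f"index {index} exceeds maximum {max_value}")


N = 8  # hypothetical largest legal index value


def buggy_check(index: int) -> None:
    # Off-by-one: treats N-1 as the maximum, so the legal value N trips the assert.
    validate_index(index, N - 1)


def fixed_check(index: int) -> None:
    # Corrected check: the maximum legal value is N.
    validate_index(index, N)


def first_failure(check, limit: int):
    """Return the first index in 0..limit that trips the check, or None."""
    for index in range(limit + 1):
        try:
            check(index)
        except FirmwareAssert:
            return index
    return None


print(first_failure(buggy_check, N))  # the buggy check fails only at index N
print(first_failure(fixed_check, N))  # the fixed check never fails in range
```

In this toy model every index below N passes both checks, which is why such a bug can sit unnoticed until the index finally reaches its largest legal value.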

Overall, Dell EMC shipped a number of the faulty SAS-12Gbps enterprise drives over the years, ranging in capacity from 200 GB to 1.6 TB, all of which will require the new D417 firmware update to avoid an untimely death at 40,000 hours.

Meanwhile, HPE shipped 800 GB and 1.6 TB drives using the faulty firmware. These drives were, in turn, used in numerous server and storage products, including HPE ProLiant, Synergy, Apollo 4200, Synergy Storage Modules, D3000 Storage Enclosure, and StoreEasy 1000 Storage, and they require HPE's firmware update to remain stable.

As for the supplier of the faulty SSDs, while HPE declined to name its vendor, Dell EMC did reveal that the affected drives were made by SanDisk (now a part of Western Digital). Furthermore, based on an image of HPE’s MO1600JVYPR SSDs published by Blocks & Files, it would appear that HPE’s drives were also made by SanDisk. As such, it is highly likely that the affected Dell EMC and HPE SSDs are essentially the same drives from the same maker.

Overall, this is the second time in less than a year that a major SSD runtime bug has been revealed. Late last year HPE ran into a similar issue at 32,768 hours with a different series of drives. So as SSDs are now reliable enough to be put into service for several years, we're going to start seeing the long-term impact of such a long service life.
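The two thresholds are easier to compare with a little arithmetic. The sketch below converts each hour count into years, days, and hours (assuming 365-day years), and shows that 32,768 is exactly 2^15, one past the largest value a signed 16-bit integer can hold. Reading the earlier bug as a 16-bit counter wrap is an assumption here, not something the article states, and the helper names are hypothetical.

```python
INT16_MAX = 2**15 - 1  # 32,767: largest value of a signed 16-bit integer


def split_hours(total_hours: int):
    """Convert an hour count to (years, days, hours), assuming 365-day years."""
    days, hours = divmod(total_hours, 24)
    years, days = divmod(days, 365)
    return years, days, hours


def increment_int16(value: int) -> int:
    """Add 1 with signed 16-bit wraparound (assumed counter width, for illustration)."""
    return (value + 1 + 2**15) % 2**16 - 2**15


print(split_hours(40_000))         # (4, 206, 16), matching the figure quoted earlier
print(split_hours(32_768))         # (3, 270, 8)
print(increment_int16(INT16_MAX))  # wraps to -32768 instead of reaching 32768
```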

Sources: Blocks & Files, ZDNet

49 Comments

  • LMF5000 - Sunday, March 29, 2020

    In the semiconductor industry, some products have their time accelerated by elevated temperature and humidity. For hard disks and SSDs, no idea.
  • leexgx - Wednesday, July 8, 2020

    They can run value checks on the code in a simulation to test value boundaries to make sure the output is valid

    And Intel or whoever makes the SSD can make a firmware that allows changes to SMART numbers directly, so you can just set it to 50k hours, for example, and the SSD won't boot up with the N-1 bug (should have just been N in this case, so it was basically a coding error)
  • Gigaplex - Monday, March 30, 2020

    Because they can't wait for a 40,000 hour test to complete before shipping.
  • eastcoast_pete - Sunday, March 29, 2020

    All that brings up an interesting question: how is SSD firmware bug-tested? Obviously, this is a bug, but one that doesn't show up for quite a while, so the drives work just fine until they hit that age. Would like to know a bit more on how SSD controller software is tested. Maybe a little backgrounder is in order?
  • dwbogardus - Monday, March 30, 2020

    Usually in ASIC design or, in this case, SSD controller design, pre-silicon validation is done by running simulations that make a point of checking all the boundary conditions, like buffer overruns, FIFO underflows, and various limits, some of which are never expected to be reached in normal operation. Normal simulations would take way too long to run in order to hit those boundary conditions, so special test hooks often permit the validation engineers to preset values close to the limits, and then do a few increments to reach the condition. Then they can verify correct operation for the condition. It can be tedious to check every instance, and perhaps some were missed. Whether the validation simulations are being run to check the controller, or the validation test simulations are being run to check the firmware, the principles are the same: check all the boundary conditions by presetting registers close to the limit, then increment to and through the limit, and verify expected behavior. That way you don't have to wait for years of "wall clock" time to reach the limits you need to validate.
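The preset-and-increment approach described in this comment can be sketched in a few lines. This is a toy model under stated assumptions, not real validation tooling: `ToyCounter`, its `preset` test hook, and the `fail_at` parameter (a planted latent bug) are all hypothetical names invented for illustration.

```python
class ToyCounter:
    """Hypothetical device model with an hour counter and a test hook."""

    def __init__(self, fail_at=None):
        self.hours = 0
        self.fail_at = fail_at  # planted latent bug: device dies at this hour

    def preset(self, hours: int) -> None:
        """Test hook: jump the counter close to a limit without waiting."""
        self.hours = hours

    def tick(self) -> None:
        """Advance one hour; raise if the planted bug triggers."""
        self.hours += 1
        if self.fail_at is not None and self.hours >= self.fail_at:
            raise RuntimeError(f"device failed at {self.hours} hours")


def survives_boundary(device: ToyCounter, limit: int, margin: int = 3) -> bool:
    """Preset just below the limit, then step to and through it."""
    device.preset(limit - margin)
    try:
        for _ in range(2 * margin):
            device.tick()
    except RuntimeError:
        return False
    return True


print(survives_boundary(ToyCounter(fail_at=40_000), 40_000))  # False
print(survives_boundary(ToyCounter(), 40_000))                # True
```

Each check takes six ticks rather than 40,000 simulated hours, which is the point of the test hook: the boundary is reached in a handful of increments instead of years of wall-clock time.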
  • FunBunny2 - Monday, March 30, 2020

    in addition to the other, long, reply is the simple answer: the coders and analysts simply didn't confirm the design spec. kind of like those airplane crashes on "Air Disasters" where the crew skipped steps on a pre-flight checklist. or, of course, the analysts wrote the spec without checking with design requirements. in any case, this sort of error would be nearly impossible to find in production QA.
  • leexgx - Wednesday, July 8, 2020

    Well, they obviously did checks 3-4 years later and found these bugs before they became a problem in the real world (not as bad as the 0 MB bug on the old SandForce SSDs, which had a random chance at power-up to nuke the SSD and respond with 0 MB of space; some sort of bug with the unique way SandForce has 2 levels of virtual LBA NAND mapping, then second-level compression mapping, which would result in the whole drive becoming 0 MB in very rare but specific cases)
  • Sivar - Monday, March 30, 2020

    Linux network drivers, Oracle database, other SSD firmware -- how many times does this need to happen before developers stop making the same mistake?
    It isn't even a tricky fix. Use a larger integer! Count something larger (e.g. days instead of hours, packets instead of bytes)! Add a second integer that counts overruns of the first! Use a double or arbitrary precision value!
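One of the fixes suggested in this comment, a second integer that counts overruns of the first, might look like the following. It is a sketch under assumptions: the 16-bit word size and the `WideHourCounter` name are hypothetical, chosen only to show how the pair extends the counter's range.

```python
class WideHourCounter:
    """Hour counter built from a 16-bit word plus an overrun count (illustrative)."""

    WORD = 2**16  # assumed width of the low word

    def __init__(self):
        self.low = 0       # wraps back to 0 after WORD - 1
        self.overruns = 0  # number of times the low word has wrapped

    def tick(self) -> None:
        """Advance one hour, carrying into the overrun count on wrap."""
        self.low += 1
        if self.low == self.WORD:
            self.low = 0
            self.overruns += 1

    def total(self) -> int:
        """Reconstruct the full hour count from both words."""
        return self.overruns * self.WORD + self.low


counter = WideHourCounter()
for _ in range(70_000):
    counter.tick()
print(counter.overruns, counter.low, counter.total())  # 1 4464 70000
```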
