LJ article: Intel not to fix hardware flaws in future CPUs

There was an interesting line in the August 2018 Linux Journal article “What’s New in Kernel Development”, by Zack Brown:
“Linus Torvalds has interpreted some of Intel statements to mean that Intel does not intend to fix some of the hardware flaws in future generations of CPUs, which would force kernel developers, and developers of other operating systems, to work around those flaws for the foreseeable future.”

This likely would’ve come from Intel’s evaluation that fixing these flaws at the hardware level would cost too much money. So, they will update their EULAs to kick the issue down the supply chain (EULAs let manufacturers get away with too much, but that’s a different topic).

Is Intel’s decision a little like a pacemaker manufacturer saying “…yeah we know that sometimes it won’t put out a voltage pulse but we’re gonna leave that issue to be fixed in the software by the hospital before surgery”? Is this a business model we want to allow?

I think that statement paints an inaccurate picture regarding “future generations” and “foreseeable future”. It was clear that Intel would not be able to fix all of its designs immediately. Skylake was just out of the door, and everything we are seeing now is basically a refinement of Skylake or even older generations. It takes years to get a new CPU generation production-ready, and changing low-level internals like branch prediction, cache control etc. basically throws you back to the start because you have to redo all the validation and optimization. Also, some of the additional flaws we know about now were discovered after the initial Spectre/Meltdown publications.

As an employee of a large enterprise customer I can tell you that Intel is working hard to fix all known flaws. Cascade Lake, due to ship in Q4’18/Q1’19, will have the first wave of hardware mitigations against side channel attacks.

Obviously this isn’t ideal, but there have been hardware bugs worked around in software since time immemorial, right? The original Xboxes booted up at radically different speeds because Microsoft bought all the RAM they could get their hands on, some of which didn’t even work, and then tested it at bootup and disabled broken chips, just so they could get them built and released as fast as possible. Pentium chips had the F00F bug, and OSes worked around it. The next chip will not have the flaw, and software works around it in the current chip; it’s not a great way for things to operate, but the software has to do it anyway since existing released chips can’t be fixed and can’t be made to disappear.

Yes, you don’t want to know how much broken hardware is out there and how many quirks people have to build into their drivers. The string “quirk” appears about 11,000 times in the Linux kernel sources.
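If anyone wants to reproduce that kind of count on their own checkout, here is a minimal sketch. The function name `count_substring`, the restriction to `.c`/`.h` files, and the case-insensitive matching are my own assumptions; the exact number you get will depend on the kernel version and on which files you include.

```python
import os

def count_substring(root, needle, suffixes=(".c", ".h")):
    """Count case-insensitive occurrences of needle in files under root.

    Only files whose names end in one of `suffixes` are read (an assumption
    made here to keep the scan to C sources and headers).
    """
    needle_bytes = needle.lower().encode("utf-8")
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            try:
                # Read as bytes so odd encodings in old sources don't matter.
                with open(path, "rb") as handle:
                    total += handle.read().lower().count(needle_bytes)
            except OSError:
                continue  # unreadable file or dangling symlink: skip it
    return total
```

Calling `count_substring("linux", "quirk")`, where `linux` is a hypothetical path to a kernel source tree, returns the total for that tree.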

I think the point is that

  1. Usually you try to fix the next design, and Intel is (wrongfully) accused of just having said “Nope”, which would have been a major departure from how the world worked before

  2. Usually broken hardware requires a small fix somewhere. The F00F bug patch was ~140 lines, and a non-trivial part of that is formatting and printing strings. But Spectre/Meltdown led to things like the introduction of full-blown Kernel Page Table Isolation, which weighs in at several hundred or even thousands of lines of code.

Intel couldn’t keep these issues around even if they wanted to. Their public image is bad enough as it is, and their partners are having a hard time explaining to enterprise customers why they are supposed to buy CPUs which will not perform as expected due to necessary security mitigations. Especially while AMD is busy selling more EPYCs every day.

And the Power Cell CPUs that went into IBM POWER servers or PlayStation 3s depending on how many SPE cores were marked as defective through QA.

…that’s standard practice, it’s called “binning”.

I know, that was my point [used to work for a fabless semi inc]

It might be worth adding that in the meantime it might be better to just disable all the software patches for Spectre because they just waste resources for nothing:

This is especially relevant since the current patches really mess up performance. I recently installed Ubuntu 18.04 Server (Kernel 4.15) on a five-year-old Sandy Bridge Xeon dual-socket machine with a total of 16 cores. With the patches enabled, the kernel couldn’t even sustain ~300 MByte/s on the software RAID5 array. With the Spectre patches disabled this was no longer a problem.
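Before and after changing anything, it may help to check what the kernel itself reports. The post above doesn’t say how the patches were switched off, so this sketch only reads the status; it assumes a Linux kernel that exposes the `/sys/devices/system/cpu/vulnerabilities` directory (older kernels may not have it), and the helper names are made up for illustration.

```python
import os

# Standard sysfs location on Linux kernels that report mitigation status.
VULN_DIR = "/sys/devices/system/cpu/vulnerabilities"

def read_mitigation_status(vuln_dir=VULN_DIR):
    """Return {issue_name: status_line}; empty dict if the directory is absent."""
    status = {}
    if not os.path.isdir(vuln_dir):
        return status
    for name in sorted(os.listdir(vuln_dir)):
        path = os.path.join(vuln_dir, name)
        try:
            with open(path, "r") as handle:
                status[name] = handle.read().strip()
        except OSError:
            status[name] = "unreadable"
    return status

def unmitigated(status):
    """Names whose status line starts with 'Vulnerable'."""
    return [name for name, line in status.items()
            if line.startswith("Vulnerable")]
```

Printing `read_mitigation_status()` on the machine in question shows one line per known issue, and `unmitigated(...)` lists the ones the kernel reports as unmitigated, which makes it easier to tie a throughput change to a specific mitigation.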

Roger, on it!

This laptop’s “About” window currently says TM and © 1983-2017. I think I didn’t get those patches to begin with…

But does that actually make me vulnerable? My browser is up-to-date and still supports my OS version. My disk is unencrypted. Will I actually get hacked by surfing the web, or is an untargeted person like myself still safe?