Tuesday, May 22, 2007

Nuclear Plant DataStorm

In August of last year, Unit 3 of the Browns Ferry nuclear power plant in Alabama experienced a manual emergency shutdown (SCRAM). The reactor was shut down, no radiation was released, and all the shutdown systems worked as they were supposed to.
So, what happened?

Apparently an Ethernet-connected PLC generated what the NRC termed a "data storm":

The U.S. House of Representative's Committee on Homeland Security called this week for the Nuclear Regulatory Commission (NRC) to further investigate the cause of excessive network traffic that shut down an Alabama nuclear plant.

During the incident, which happened last August at Unit 3 of the Browns Ferry nuclear power plant, operators manually shut down the reactor after two water recirculation pumps failed. The recirculation pumps control the flow of water through the reactor, and thus the power output of boiling-water reactors (BWRs) like Browns Ferry Unit 3. An investigation into the failure found that the controllers for the pumps locked up following a spike in data traffic -- referred to as a "data storm" in the NRC notice -- on the power plant's internal control system network. The deluge of data was apparently caused by a separate malfunctioning control device, known as a programmable logic controller (PLC).
In other words, the PLC at fault had nothing directly to do with the pump controllers whose failure caused the shutdown. The errant controller was babbling, or (in government speak) causing a data storm, which in effect disabled the recirculation pump controllers. What we don't know is the real cause of the babbling PLC. Could it have been the PLC itself, or was it caused by an external denial-of-service attack?
"Conversations between the Homeland Security Committee staff and the NRC representatives suggest that it is possible that this incident could have come from outside the plant," Committee Chairman Bennie G. Thompson (D-Miss.) and Subcommittee Chairman James R. Langevin (D-RI) stated in the letter. "Unless and until the cause of the excessive network load can be explained, there is no way for either the licensee (power company) or the NRC to know that this was not an external distributed denial-of-service attack."
The article goes on to describe a couple of instances where viruses and worms have contributed to major power shortages.

There are lessons to be learned from this one incident.

Those of us on the front lines often perceive IT administration and security as nuisance issues, but network security is critical when control systems are involved.

I find it interesting that no one has been able to nail down whether the PLC that brought the network down has a real hardware issue or not. One-time events are tough! It does sound, though, like the problem was most likely the PLC itself or some internal communications within the plant.

No digital contagion has been fingered in the latest incident, said Terry Johnson, spokesman for the Tennessee Valley Authority, the public power company that runs the Browns Ferry power plant.

"The integrated control system (ICS) network is not connected to the network outside the plant, but it is connected to a very large number of controllers and devices in the plant," Johnson said. "You can end up with a lot of information, and it appears to be more than it could handle."

The device responsible for flooding the network with data appears to be a programmable logic controller (PLC) connected to the plant's Ethernet network, according to an NRC information notice on the incident (PDF). The PLC controlled Unit 3's condensate demineralizer -- essentially a water softener for nuclear plants. The flood of data spewed out by the malfunctioning controller caused the variable frequency drive (VFD) controllers for the recirculation pumps to hang.
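One common defense against this kind of flood is to cap how many frames a device will accept per second and drop the rest, rather than trying to process everything. Nothing in the NRC notice says the VFD controllers had or lacked such a limit; the following is only a minimal sketch of a token-bucket limiter, with a made-up rate (100 frames/s), burst size (20), and helper names chosen purely for illustration.

```python
class TokenBucket:
    """Admit at most `rate` frames per second, with bursts up to `burst`.

    The rate and burst values used below are illustrative assumptions,
    not figures from any real controller.
    """

    def __init__(self, rate, burst):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)   # start with a full bucket
        self.last = 0.0              # time of the previous frame, in seconds

    def allow(self, now):
        # Refill tokens for the time elapsed since the previous frame,
        # never exceeding the burst capacity.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True              # frame is processed
        return False                 # frame is dropped


def simulate(frames_per_second, seconds=1, rate=100, burst=20):
    """Feed evenly spaced frames to the limiter; return (accepted, dropped)."""
    bucket = TokenBucket(rate, burst)
    accepted = dropped = 0
    for i in range(frames_per_second * seconds):
        now = i / frames_per_second
        if bucket.allow(now):
            accepted += 1
        else:
            dropped += 1
    return accepted, dropped


if __name__ == "__main__":
    print("normal traffic:", simulate(50))
    print("data storm:    ", simulate(10000))
```

At 50 frames per second everything gets through. At 10,000 frames per second the limiter accepts roughly the burst plus one second's worth of the configured rate and drops the remainder, so the device keeps running instead of hanging.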

Such failures are common among PLC and supervisory control and data acquisition (SCADA) systems, because the manufacturers do not test the devices' handling of bad data, said Dale Peterson, CEO of industrial system security firm DigitalBond.

"What is happening in this marketplace is that vendors will build their own (network) stacks to make it cheaper," Peterson said. "And it works, but when (the device) gets anything that it didn't expect, it will gag."

In many cases, a simple vulnerability scan will even cause the devices to crash, Peterson said. During tests in an electrical substation, Nessus running in safe scan mode crashed devices, he said. In some cases, sending out broadcast data on the network will crash several of connected devices, he added.

"If you were to test any control systems that have any more than three or four (different) network-connected devices, they could be knocked over very easily," Peterson said.


Friday, May 18, 2007

ECC coming to a PC near you?

Error Correction Code is nothing new to those of us who have been maintaining classic DCS systems. The process, invisible to us in operation, is all about single-bit memory error correction. It was common in vintage processors from the '70s and '80s. Not only could errors get corrected, they also got logged so that repairs could be made before a hard error occurred. Events like this are what make a Preventive Maintenance program pay for itself.
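For anyone who has never looked under the hood, here is a minimal sketch of the idea using a Hamming(7,4) code: 4 data bits are stored with 3 check bits, and on read the check bits point directly at any single flipped bit so it can be corrected and logged. This is a toy for illustration only. Real ECC memory typically protects much wider words, and the function names and the simple list used as an error log are my own assumptions, not any vendor's design.

```python
def encode(nibble):
    """Encode 4 data bits as a 7-bit Hamming codeword (bit positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]   # d[0] is the least significant bit
    code = [0] * 8                              # index 0 is unused
    code[3], code[5], code[6], code[7] = d      # data bits
    code[1] = code[3] ^ code[5] ^ code[7]       # check bit over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]       # check bit over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]       # check bit over positions 4,5,6,7
    return code[1:]


def decode(bits, error_log):
    """Return the 4 data bits, correcting and logging a single-bit error."""
    code = [0] + list(bits)
    s1 = code[1] ^ code[3] ^ code[5] ^ code[7]
    s2 = code[2] ^ code[3] ^ code[6] ^ code[7]
    s4 = code[4] ^ code[5] ^ code[6] ^ code[7]
    syndrome = s1 + 2 * s2 + 4 * s4             # 0 means no error detected
    if syndrome:
        code[syndrome] ^= 1                     # flip the bad bit back
        error_log.append(syndrome)              # record it for maintenance
    d = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(d))


if __name__ == "__main__":
    log = []
    word = encode(0b1011)
    word[4] ^= 1                                # simulate a soft error at position 5
    print(bin(decode(word, log)), log)
```

The log is the part that matters for maintenance: a memory location that keeps showing up with corrected errors is a candidate for replacement before it fails hard.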

Some of the earlier PCs did have parity checking, but the problem is: what do you do once you discover an error? You have a choice of ignoring the error and living with the consequences, or simply halting the program. Real error correction has never been a popular alternative except in servers.
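The limitation of parity is easy to see in a few lines. In this sketch (even parity over one byte, with helper names of my own choosing), a single flipped bit is detected, but the check says nothing about which bit is wrong, and two flipped bits cancel out and go unnoticed.

```python
def parity_bit(byte):
    """Even parity: 1 if the byte has an odd number of 1 bits, else 0."""
    return bin(byte).count("1") % 2


def store(byte):
    """Return the byte together with its parity bit, as parity memory would keep it."""
    return byte, parity_bit(byte)


def check(byte, stored_parity):
    """True if the byte still matches the parity bit stored with it."""
    return parity_bit(byte) == stored_parity


if __name__ == "__main__":
    value, p = store(0x5A)
    print("clean read OK:        ", check(value, p))
    print("one bit flipped OK:   ", check(value ^ 0x08, p))   # detected, not located
    print("two bits flipped OK:  ", check(value ^ 0x18, p))   # slips through
```

With only a yes/no answer, the choices really are the two described above: carry on with suspect data or stop.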

We have all seen the window on our PCs that tells us the application is going to be shut down and asks whether we want to notify Microsoft. Microsoft is now claiming that single-bit memory errors are a major cause of these crashes.
Desktop and notebook computers may need to adopt error-correcting code (ECC) memory to combat rising system crashes from single-bit memory errors, according to a confidential white paper written by Microsoft Corp. The software giant raised the issue in a panel discussion on memory at the Windows Hardware Engineering Conference here although it admits its data on system failures is still inconclusive.

For about four years Microsoft has been collecting data through its Online Crash Analysis (OCA) tool that reports system crashes to a Microsoft Web site. About 18 months ago it began sharing OCA data and the white paper with systems and chip makers. According to one source, the report said single-bit error rates in DRAM are now among the top ten causes of systems failures.

Microsoft admits the data is still inconclusive because OCA does not provide enough detail about the types of systems that crash and the memory they use. As it tries to improve the tool, Microsoft is asking OEMs to help provide more data and to consider ECC memory in desktops and notebooks.