Monday, Jan. 29, 1990

Ghost in The Machine

By Philip Elmer-DeWitt

The first sign that something had gone haywire in AT&T's long-distance telephone network came at 2:25 p.m. last Monday, when the giant map of the U.S. in the company's operations center in New Jersey began to light up like a football scoreboard. For reasons still being investigated, a computer in New York City had come to believe it was overloaded with calls, and it started to reject them. Alerted to New York's troubles, dozens of backup computers across the U.S. automatically switched in to take up the slack -- only to exhibit the same bizarre symptoms. People trying to place long-distance calls all over the world suddenly began to hear busy signals and recorded messages blandly informing them that "all circuits" were busy.

Thus began the worst computer breakdown in the history of the U.S. telephone system. The incident was also a vivid reminder of how susceptible America, and the world, has become to computer failures -- natural and man-made. In 20 years of intensive automation, everything from supermarkets to stock exchanges has been computerized. Last week businesses and consumers were forced to face up to a downside of technology that becomes apparent only when the new systems fail. Said Steven Idelman, chairman of Omaha-based Idelman Telemarketing: "When things go wrong in a computer environment, they go wrong in a big way."

Things stayed wrong at AT&T for nine hours last week. Of the 148 million long-distance and 800-number calls placed with the company that day, only 50% got through. Hotels lost bookings. Cars went unrented. The number of calls to the American Airlines reservation system fell two-thirds. Idelman had to send 800 phone workers home for the day; he estimates he lost about $75,000 in sales. All told, the breakdown cost AT&T some $60 million to $75 million in lost revenues. Said AT&T Chairman Robert Allen: "It was the worst nightmare I've had in 32 years in the business."

Phone-company technicians traced the problem to a single "failure of logic" in the computer programs that route calls through the AT&T network. Like many programming bugs, it stemmed from an improvement on the original system. By carrying information about who is calling whom on a separate channel, or band, from the call signal itself, AT&T has been able to reduce the time between dialing and ringing from as much as 20 seconds to as little as four seconds. But the refinement inadvertently made the system more prone to breakdowns. Last week's glitch spread rapidly among the 114 computers in AT&T's network in part because they all contained the same programming error.

The collapse of its network came at a time of increased vulnerability for AT&T. Although Ma Bell still carries 70% of the U.S.'s long-distance traffic (down from 90% five years ago), it has been fighting a rearguard action to keep its customers from defecting to its feisty competitors, MCI and US Sprint. The glitch simultaneously deflated AT&T's multimillion-dollar "reliability" advertising campaign and handed its competitors a once-in-a-career sales pitch. "An important message to everyone whose telephone is the lifeline of their business," began a print ad rushed out by US Sprint after the breakdown. "Always have two lifelines."

AT&T operators made matters worse on Monday by refusing to give stranded customers instructions for calling via MCI or Sprint -- a standing order that was reversed 3 1/2 hours after the breakdown began, too late to do East Coast businesses any good. To help make amends, AT&T announced late last week that it had asked the Federal Communications Commission for permission to offer long-distance discounts to all callers on Valentine's Day. But the phone company's aura of infallibility will not be so easily repaired.

That an operation as heavily computerized as AT&T's could have maintained such a reputation is a near miracle. To experts who track technological mishaps, the past decade reads like an unending parade of computer disasters, ranging from the humiliating bugs that delayed one space-shuttle launch after another to the Belgian stock-exchange computers that collapsed under the rush of sell orders during last October's minicrash. Computerized elevator doors have shut unexpectedly. Factory robots have started without warning, killing workers. A misprogrammed medical X-ray machine delivered fatal doses of radiation to at least three cancer patients.

The vulnerability of all computer systems was underscored last week by separate court proceedings in California and New York. In San Jose three Silicon Valley workers were indicted for a range of computer crimes, including, for perhaps the first time, taking classified military information from Government computers. The next day a Cornell University graduate student made the first public explanation of how the rogue program he released into a research network in November 1988 managed to cripple some 6,000 university and military computers. "It was a mistake," Robert Morris said at his federal trial in Syracuse. "I'm sorry."

But a trespassing hacker is just one of the problems that can bring a computer system to its knees. Technicians were installing extra disk drives in an underground computer in Tulsa last May when they triggered a collapse of American Airlines' SABRE reservation system. Last September a Parisian computer creatively misread magnetic labels on 41,000 traffic-violation files and began charging delinquent motorists with crimes ranging from murder and drug trafficking to prostitution. A fire in a Tokyo utility tunnel several years ago wiped out circuits connecting Mitsubishi Bank's mainframe computers with branch offices, shutting down automated-teller machines across the country for five days.

Massive system failures dramatize the trade-off that occurs whenever a high-tech system replaces a low-tech one. Because most electronic systems are thoroughly interconnected, their failures tend to be all-or-nothing affairs. They do not, as computer scientists put it, degrade gracefully; they crash. Moreover, what is gained in speed and productivity is often lost in control, reliability and -- for lack of a better word -- transparency. When a system of gears and levers stops working, its operators can roll up their sleeves, raise the hood and go to work. When a microchip goes bad, its circuits are unlikely to respond to on-the-spot ministrations.

The risk for businesses is not so much that their systems will someday break down -- that is almost a given -- but that lingering computer anxiety in the buying public will make it harder for firms to recoup their investments in high-tech equipment and services. Banks and brokerage houses live in fear that one or two well-publicized computer failures will alienate their customer base, triggering mass defections to their competitors.

There are ways to make the technology more reliable. Fault-tolerant computers like those built by Stratus, Tandem and, for that matter, AT&T reduce runaway system errors by a kind of "paranoid democracy," where modules working in parallel constantly evaluate whether their electronic co-workers are "sane" or "crazy." Unfortunately, as last week's breakdown showed, it is possible for all the modules to go crazy at once. Software, always the skittish part of any system, can also be made more dependable by imposing the kind of discipline on programmers that engineering standards impose on, say, bridge designers. A program like AT&T's faulty switching system, however, which can contain a million lines of code, is more complex than any bridge. "Standards have not been developed," says Donn Parker, a senior management consultant at SRI International. "Software is not predictable."

But automation is certain to become ever more pervasive. If U.S. firms do not develop the most advanced systems, Japanese or South Korean or European companies are sure to do so. "American industry faces an extremely competitive situation," says Tandem President James Treybig. "AT&T is fighting to be in the forefront of technology, and there is some cost to staying in front." If Treybig is right, temporary setbacks are just the price of progress. But incidents like last week's are sure to influence the priorities of technology shoppers: reliability will be just as important as clever ads and fancy features.

With reporting by Thomas McCarroll/New York and Paul A. Witteman/San Francisco