IBM is building the largest data array in the world

IBM recently announced plans to build what will be, upon completion, the world’s largest data array: 200,000 conventional hard disk drives working together to provide 120 petabytes of available storage space. The array, 10 times bigger than any other data center in the world today, was ordered by an “unnamed client” whose intentions have yet to be disclosed. IBM says the huge storage space will be used for complex computations, like those used to model weather and climate.

To put things into perspective, 120 petabytes, or 120 million gigabytes, could hold 24 billion typical five-megabyte MP3 files or 60 copies of the entire internet, which currently spans 150 billion web pages. And while 120 petabytes might sound outrageous by any sane standard today, at the rate technology is advancing, data centers of this size might soon become fairly common.
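For readers who want to check the comparison themselves, here is a minimal sketch of the arithmetic. It assumes decimal (SI) units, which is what the round figures in the text imply; the constant and function names are made up for illustration.

```python
# Decimal (SI) units, an assumption that matches the article's round numbers.
MB = 10**6
GB = 10**9
PB = 10**15

ARRAY_BYTES = 120 * PB          # advertised capacity of the array
MP3_BYTES = 5 * MB              # "typical" MP3 size quoted in the text
WEB_COPIES = 60                 # number of web copies quoted in the text


def in_gigabytes(n_bytes: int) -> int:
    """Convert a byte count to whole gigabytes."""
    return n_bytes // GB


def how_many_fit(total_bytes: int, item_bytes: int) -> int:
    """Number of whole items of a given size that fit in the total."""
    return total_bytes // item_bytes


total_gb = in_gigabytes(ARRAY_BYTES)
mp3_files = how_many_fit(ARRAY_BYTES, MP3_BYTES)
# Size of one "copy of the web" implied by the 60-copies figure.
implied_web_copy_pb = ARRAY_BYTES / WEB_COPIES / PB

print(f"{total_gb:,} GB")
print(f"{mp3_files:,} MP3 files")
print(f"{implied_web_copy_pb:g} PB per web copy")
```

The last figure is only what the article's own numbers imply, not an independent measurement of the web.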

“This 120 petabyte system is on the lunatic fringe now, but in a few years it may be that all cloud computing systems are like it,” says Bruce Hillsberg, director of storage research at IBM. Just keeping track of the names, types, and other attributes of the files stored in the system will consume around two petabytes of its capacity.

I know some of you tech enthusiasts out there are already grinding your teeth a bit at these fairly dubious numbers. I know I have: 120 petabytes divided by 200,000 drives equals 600 GB per drive. Does this mean IBM is using only 600 GB hard drives? I’m willing to bet they’re not that cheap; it would be extremely counter-productive in the first place. Firstly, it’s worth pointing out that we’re not talking about your usual consumer hard drives. Most likely, the drives used will be 15K RPM Fibre Channel disks at the very least, which beat the heck out of the SATA drive currently sitting in your computer. These kinds of drives currently don’t offer as much capacity as SATA ones, so this might be one explanation. There’s also the issue of redundancy in data centers, which reduces the amount of usable storage and grows as a data center gets larger. So the hard drives used could actually be somewhere between 1.5 and 3 TB each, all running at cutting-edge data transfer speeds.
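The back-of-the-envelope division above is easy to reproduce. The sketch below also shows what fraction of raw capacity would remain usable if the drives really were 1.5 TB or 3 TB; that second part is only an illustration of the speculation in this paragraph, not a figure from IBM, and all names are hypothetical.

```python
# Decimal (SI) units are assumed throughout.
GB = 10**9
TB = 10**12
PB = 10**15

USABLE_BYTES = 120 * PB   # capacity quoted for the whole array
DRIVE_COUNT = 200_000     # number of drives quoted for the array

# Usable capacity each drive contributes on average.
usable_per_drive = USABLE_BYTES // DRIVE_COUNT


def usable_fraction(raw_drive_bytes: int) -> float:
    """Share of a drive's raw capacity that ends up as usable storage,
    assuming every drive has the given raw size (a guess, not a spec)."""
    return usable_per_drive / raw_drive_bytes


print(f"{usable_per_drive // GB} GB usable per drive")
for raw_tb in (1.5, 3.0):
    share = usable_fraction(int(raw_tb * TB))
    print(f"{raw_tb} TB drives -> {share:.0%} of raw capacity usable")
```

Under those guessed drive sizes, somewhere between a fifth and two fifths of the raw space would be usable, with the rest going to redundancy, spares and other overhead.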

Steve Conway, a vice president of research with the analyst firm IDC who specializes in high-performance computing (HPC), says IBM’s repository is significantly bigger than previous storage systems. “A 120-petabyte storage array would easily be the largest I’ve encountered,” he says.

To house this massive number of hard drives, IBM placed them horizontally in drawers, as in any other data center, but made the drawers even wider in order to fit more disks into a smaller space. Engineers also implemented a new data backup mechanism, whereby information from dying disks is slowly reproduced on a replacement drive, allowing the system to continue running without any slowdown. Meanwhile, a file system called GPFS spreads stored files over multiple disks, allowing the machine to read or write different parts of a given file at once, while indexing its entire collection at breakneck speeds.
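The idea of spreading one file over several disks can be illustrated with a toy example. The sketch below uses simple round-robin block striping over in-memory lists; it is an assumption-laden illustration of the general technique, not GPFS’s actual layout or API, and the function names are invented for this example.

```python
def stripe(data: bytes, n_disks: int, block_size: int) -> dict:
    """Split data into fixed-size blocks and deal them out to disks
    round-robin. Returns {disk_index: [blocks stored on that disk]}."""
    disks = {d: [] for d in range(n_disks)}
    for offset in range(0, len(data), block_size):
        block_index = offset // block_size
        disks[block_index % n_disks].append(data[offset:offset + block_size])
    return disks


def reassemble(disks: dict) -> bytes:
    """Rebuild the original data by reading blocks back in round-robin order."""
    n_disks = len(disks)
    n_blocks = sum(len(blocks) for blocks in disks.values())
    # Block k lives on disk k % n_disks, at position k // n_disks on that disk.
    return b"".join(disks[k % n_disks][k // n_disks] for k in range(n_blocks))


layout = stripe(b"abcdefghij", n_disks=3, block_size=2)
for disk, blocks in layout.items():
    print(disk, blocks)
restored = reassemble(layout)
print(restored)
```

Because consecutive blocks sit on different disks, a real system can fetch them in parallel, which is the benefit the paragraph describes.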

Last month a team from IBM used GPFS to index 10 billion files in 43 minutes, effortlessly breaking the previous record of one billion files scanned in three hours. Now, that’s something!
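To get a feel for how big that jump is, the two records can be converted to files per second. This is a rough sketch that treats both figures as exact and the rates as constant, which real indexing runs will not be; the function name is hypothetical.

```python
def files_per_second(files: int, minutes: float) -> float:
    """Average indexing rate, assuming a constant rate over the whole run."""
    return files / (minutes * 60)


new_rate = files_per_second(10_000_000_000, 43)       # 10 billion files in 43 minutes
old_rate = files_per_second(1_000_000_000, 3 * 60)    # 1 billion files in three hours
speedup = new_rate / old_rate

print(f"new record: {new_rate:,.0f} files/s")
print(f"old record: {old_rate:,.0f} files/s")
print(f"speedup:    {speedup:.1f}x")
```

By this measure the new record is roughly 42 times faster per file than the previous one.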

Fast access to huge storage is crucial for supercomputers, which need humongous amounts of data to compute the various complicated models they’re assigned, be it weather simulations or the decoding of the human genome. Of course, such arrays can also be used, and most likely already are, to store identities and human biometric data. I’ll take this opportunity to remind you of a frightful fact we published a while ago: every six hours the NSA collects data the size of the Library of Congress.

As quantum computing gains ground and the first quantum computer is eventually developed, these kinds of data centers will become far more common.

UPDATE: The facility did indeed open in 2012.

MIT Technology Review