by Robin Cheng
Ten years ago, my family had a computer with a 4GB hard drive. It was enough for us, because all we did was some document processing and image editing. Occasionally I played some games, but the largest game was no more than 50 MB in size.
The numbers we see today are many times larger. What used to be the capacity of a hard drive is now the typical capacity of a flash drive. However, the demand for data storage space has also increased. I’m currently using a 128GB solid state drive, and it stays mostly full. In need of more space, I purchased a 1TB[1] external hard drive, and I have already used over half of it. This is not a concern, though, because by the time I use up the space, it will be about time to upgrade my laptop with a larger drive.
Amount of Information vs. Available Storage
One of the fortunate aspects of our digital world is that there seems to be abundant storage space available to us. For local storage, a hard drive with 1TB capacity can be bought for as low as $60. For online storage, free email services that provide several gigabytes of inbox space are widely available. Gmail provides continuously increasing inbox capacity, which encourages users never to delete any emails. However, is our perception that data storage space is abundant actually correct? Even if it is now, will we still have sufficient storage capacity ten years from now?
A study by Gantz and Reinsel from the International Data Corporation (IDC), a firm dedicated to market research and analysis, showed that over the last three years, the total amount of digital information stored on Earth increased from 0.25 zettabyte to 1.2 zettabytes. In other words, it increased from about 250 billion gigabytes to 1.2 trillion gigabytes, a growth factor of 4.8 over three years. Gantz and Reinsel predicted that by 2020, the total amount of information in the “Digital Universe” will have increased to 35 zettabytes, almost 30 times what we have now.
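As a quick sanity check, the growth factor and the scale of the 2020 projection follow from simple arithmetic on the zettabyte figures (a rough sketch; decimal units are assumed, with one zettabyte taken as about a trillion gigabytes):

```python
# Back-of-the-envelope check of the IDC figures (decimal units assumed).
ZB_IN_GB = 1e12                          # one zettabyte is about a trillion gigabytes

size_2007_gb = 0.25 * ZB_IN_GB           # ~250 billion GB
size_2010_gb = 1.2 * ZB_IN_GB            # ~1.2 trillion GB

growth_factor = size_2010_gb / size_2007_gb   # 4.8x over three years
projection_factor = 35 / 1.2                  # 2020 vs. 2010: about 29x
print(f"{growth_factor:.1f}x over three years, {projection_factor:.0f}x by 2020")
```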
As Carl Howe reported, a 2007 prediction by IDC estimated that the total amount of stored information would reach about 1 zettabyte by 2010; the actual figure of 1.2 zettabytes shows that, if anything, IDC underestimated the growth. Thus it is likely that IDC’s 2010 prediction will prove accurate as well. If that is the case, a 2TB hard drive or an 8GB email inbox will no longer satisfy us, due to the inflation of information size, and we will need extra storage space. However, Gantz and Reinsel’s study also showed that storage availability will not grow as fast as the total amount of information, which means that ten years from now, some of the information we have will have to be discarded.
Gantz and Reinsel discovered that there will be an increasingly large gap between the amount of available storage and the amount of digital information. They predicted that by 2020, this gap will be even larger than the amount of available storage itself, meaning that less than half the information we generate could actually be stored. Certainly, we do not wish this to happen.
Though this prediction may sound incredible, it seems observable to me that we are generating a lot of information in our daily lives. As a freshman, I currently have about 1600 emails in my MIT inbox. Of these emails, more than half were delivered by my residence’s mailing list[2]. My residence, MacGregor, has 326 undergraduate residents; multiplied by about 800 emails per person, that is over 260,000 emails in total. The average size of these emails in my inbox is 50KB[2], and multiplying that by 260,000 gives about 13GB! It has been two months since the beginning of the semester; over four years, the total size would grow to 312GB, which is about 1GB per person. At this rate, our emails alone would fill our MIT account quotas by the end of our undergraduate years!
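The arithmetic behind these estimates can be laid out explicitly (a sketch using the figures quoted above):

```python
# Rough estimate of mailing-list email volume, using the essay's figures.
residents = 326
emails_per_person = 800              # accumulated over two months
avg_size_kb = 50

total_emails = residents * emails_per_person      # over 260,000 emails
total_gb = total_emails * avg_size_kb / 1e6       # KB -> GB, about 13GB

months_elapsed = 2
four_years_gb = total_gb * (48 / months_elapsed)  # roughly 312GB over 48 months
per_person_gb = four_years_gb / residents         # about 1GB per resident
print(total_emails, round(total_gb), round(four_years_gb), round(per_person_gb, 2))
```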
Causes of Information Growth
One of the major causes of information growth is easy access to storage. As listed on the Seagate website, a 2TB external hard drive sells for only $170 but can store an enormous amount of information, such as 1000 movies, 400 large video games, or 200,000 high-quality MP3 songs. If used for backup, it can hold about ten complete backups of a half-filled 320GB hard drive, which is typical in the current market. Online storage is also affordable. Most online services such as email and online backup provide a few gigabytes of free storage space, and the price for extra storage can be as low as 15 cents per gigabyte, as advertised by JungleDisk.
Easy access to storage encourages the creation and storage of new information and thus creates high demand for storage devices. This leads to more production of storage media and therefore lower prices per unit of storage, which in turn encourages even more data to be produced and stored. The study by Gantz and Reinsel supports this feedback loop. It reported that total global spending on information processing is about $4 billion and will not grow significantly over the next decade, implying that the rapid increase in the amount of information must be matched by a decrease in the cost of storage media.
Because storage space seems abundant, there is little pressure to store data efficiently. For example, when a user receives an advertisement email, he or she may simply leave it in the inbox, with no worries about the space it occupies, since the size of the email is tiny compared with the capacity of the inbox. When a user’s desktop is cluttered with assorted files, he or she may not sort through them but instead dump them all into a new folder called “Cleanup”, since their total size is insignificant compared with the free space on the hard drive. However, it is very unlikely that the user ever looks at the advertisement email again, and nine out of ten files in the Cleanup folder are probably useless. This wastes storage space. On a larger scale, we may say that the vast availability of storage space invites the storage of more unnecessary data.
The easy accessibility of storage space also encourages data redundancy. A user may choose to store redundant information for convenience. For example, an MIT student may store a copy of a problem set in his or her laptop, Athena account, and iPhone, so that he or she can gain easy access to the problem set anytime, anywhere.
In fact, duplication of existing data is another major cause of information growth. Gantz and Reinsel estimated that 75% of all data are duplicates of existing data, and only 25% are original. Data are duplicated in many ways. One primary source of duplication is backup. Enterprises and some individual users periodically back up their data so that they incur minimal loss from a natural disaster, an intrusion by hackers, a device failure, or a mistake during normal operation. Over time, many backups may accumulate, resulting in more and more duplication of existing data.
In addition to normal backups, data redundancy, the duplication of data in real time, is a commonly used method of protecting against storage device failure. One such set of techniques is the Redundant Array of Independent Disks (RAID), whose most basic configuration, RAID 1, mirrors identical data across two hard drives of the same capacity, making the two drives function as one. The advantage of RAID 1 is that if one drive fails, no data is lost, and the failed drive can be replaced with a new one to fully restore the configuration. This technique inherently stores every piece of data twice.
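The mirroring idea behind RAID 1 can be sketched in a few lines. This is only a toy model (the class and method names are my own, not a real RAID implementation): every write lands on both drives, so either drive alone can serve reads after a failure.

```python
class MirroredPair:
    """Toy model of RAID 1: every write is mirrored to two equal-size drives."""

    def __init__(self, capacity):
        self.drives = [bytearray(capacity), bytearray(capacity)]

    def write(self, offset, data):
        for drive in self.drives:                 # mirror the write to both drives
            drive[offset:offset + len(data)] = data

    def read(self, offset, length, failed=None):
        # If one drive has failed, the surviving mirror still has an intact copy.
        for i, drive in enumerate(self.drives):
            if i != failed:
                return bytes(drive[offset:offset + length])

array = MirroredPair(16)
array.write(0, b"payroll")
assert array.read(0, 7, failed=0) == b"payroll"   # data survives a drive-0 failure
```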
Email is another source of data duplication. Whenever an email is sent, one copy remains in the sender’s “Sent Mail” folder, and another copy is stored in the recipient’s inbox. Thus every email is typically stored at least twice, not to mention the many emails that are forwarded. Advertisement emails, moreover, are duplicated across all their recipients, which can mean hundreds or even millions of copies.
Problems We May Face
The rapid growth of information poses several problems. One is the speed at which data can be retrieved. Suppose we have a 500-page book. Finding a relevant page involves scanning the table of contents, turning to the corresponding section, and reading through it to find the page. This may take less than a minute. Now suppose the book has 15,000 pages. Its table of contents alone may run 100 pages, and each section may be 30 times longer, so finding a relevant page could take significantly more time. Storing more data is similar: as we store more and more, finding a relevant piece of data takes longer. Therefore, to maintain the same search speed, more efficient search algorithms must be developed, or else search engines such as Google will respond several times more slowly.
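The book analogy mirrors a basic fact about searching: scanning records one by one slows down as the collection grows, while a precomputed index, much like a book’s table of contents, keeps lookups fast. A minimal sketch, with hypothetical data:

```python
# A linear scan must examine up to every record, so it slows down as the
# collection grows; a prebuilt index answers the same question in one lookup.
records = [f"document-{i}" for i in range(15000)]

def linear_find(target):
    # Worst case: walks through all 15,000 entries.
    for i, doc in enumerate(records):
        if doc == target:
            return i

# Built once up front, like a book's table of contents or index.
index = {doc: i for i, doc in enumerate(records)}

assert linear_find("document-14999") == 14999   # slow path: scans everything
assert index["document-14999"] == 14999         # fast path: one hash lookup
```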
Information growth also makes data management more difficult. Suppose a teacher has a pile of 50 graded exam papers on her desk. It may take her only a few minutes to arrange them from the highest score to the lowest. With 1500 papers in several piles, however, the task might take hours because of the sheer number of papers and the limited space on the desk. Similarly, as we accumulate more data, it becomes increasingly difficult to organize, yet organization is essential for easy retrieval later on. More information also means more time spent identifying and cleaning up unnecessary information, so useless data accumulates ever more easily over time.
As enterprises acquire more information, the cost of storing and managing it becomes a burden. For this reason, many enterprises are shifting to cloud storage and services, which means storing their data in a third party’s storage space. Such a third party is devoted to providing large amounts of storage, managing the stored data, and protecting it at a professional level of security. However, relying on a third party introduces privacy concerns. Cloud services are not transparent: it is difficult for enterprises that use them to monitor exactly who has access to their stored data, which would be easy if the data were stored on their own servers.
Gantz and Reinsel predicted that the amount of data needing protection is increasing. For example, by 2020, the amount of “Lockdown data” (information that requires the highest level of security) will have increased by a factor of 100. The growth of sensitive information raises the importance of privacy. They noted that a social security number entered on a website can pass through the cloud a million times a year. A loophole or failure in cloud storage can therefore have a significant security impact, leaking a large amount of sensitive data.
Controlling Information Growth
To prevent data from growing out of control, it is essential to slow down the growth of information. While we cannot prevent the generation of new information, we can change the way it is stored and reduce the number of times it is duplicated. As mentioned above, easy access to storage devices encourages data storage. Thus, users should not be given more storage space than they really need. For example, institutions and enterprises should design storage quotas that are adequate but not excessive, so that users have no incentive to exploit an abundant quota by storing unnecessary data. Email providers such as Gmail should control the growth of inbox quotas, so that their customers are encouraged to clean up unnecessary emails.
Residential Internet Service Providers (ISPs) should limit their customers’ upload bandwidth to a level much lower than their download bandwidth, or charge additional fees for higher upload bandwidth, in order to control the amount of new data added to the Internet through residential networks. To encourage enterprises and ISPs to limit storage quotas and bandwidth, the government should set a minimum price on storage devices. This way, people will shift from buying more storage to storing their information more efficiently.
Efficient information storage would significantly reduce the amount of data stored. For example, compressing data that is not frequently used reduces its size. Enterprises should establish guidelines for conserving storage, such as periodically cleaning up unnecessary documents. We should also raise public awareness of information growth, for example by teaching students to delete unnecessary emails. People should develop the habit of discarding useless data, which benefits both themselves, by keeping their data organized, and the community, by reducing total storage demand.
Software developers should improve their software to store data more efficiently and avoid storing redundant information. For example, when storing a large database that is rarely accessed, compression should be used to reduce the database size.
There is also a lot we can do to reduce data duplication. We can start by changing our emailing habits. We should not excessively forward emails, especially “chain emails”. A chain email uses curses (claiming something bad will happen if the email is not forwarded) or blessings (claiming something good will happen if it is) to pressure the recipient into forwarding it to all of his or her contacts, which prompts those contacts to forward it to their contacts, and so on. This causes potentially exponential growth: if each person has 30 contacts, of whom 10 continue to forward, then at each stage the number of emails increases by a factor of 10. After five such stages, one initial email will have produced 100,000 copies. A chain email is often very long, typically around 1 megabyte in size, so 100,000 such emails would add about 100 gigabytes to the digital world. Certainly, this is completely unnecessary, so we should not forward these emails, nor fear that their curses might come true.
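The chain-email arithmetic can be verified directly (a sketch using the figures above):

```python
# Exponential growth of a chain email: each stage multiplies the number of
# circulating copies by the number of recipients who keep forwarding.
forwarders_per_person = 10       # of 30 contacts, 10 forward the email onward
stages = 5
email_size_mb = 1                # the essay's typical chain-email size

copies_at_final_stage = forwarders_per_person ** stages   # 100,000 copies
total_gb = copies_at_final_stage * email_size_mb / 1000   # about 100GB
print(copies_at_final_stage, total_gb)
```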
Emails sent for good purposes can also be unnecessarily duplicated; my residence’s mailing list is an example. Many of the emails sent through the MacGregor mailing list are discussions of a specific topic that only a small portion of residents care about. One solution to mailing-list duplication is to establish an online forum where people can hold discussions and news can be announced. An infrequent periodic digest can be sent by email, and urgent notices can still be emailed directly. This strategy preserves the original functionality of mailing lists while significantly reducing duplication.
Backup strategies can also be improved to conserve storage space. Although backing up inherently duplicates data to prevent data loss, strategic use of differential backups can reduce that duplication. A differential backup stores only the information that was added, removed, or changed since the last complete backup. Because it relies on a previous backup to reconstruct data, a differential backup is not as reliable as a complete one. However, an enterprise can, for example, perform a complete backup every month and a differential backup every day, balancing reliability against storage cost.
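A minimal sketch of such a scheme, with hypothetical function names: a full snapshot is taken periodically, and each differential backup records only the files that are new or changed since that snapshot (file deletions are ignored here for simplicity).

```python
# Sketch of full vs. differential backups (hypothetical names; deletions ignored).
def full_backup(files):
    return dict(files)                       # copy everything

def differential_backup(files, last_full):
    # Keep only files that are new or changed since the last full backup.
    return {name: data for name, data in files.items()
            if last_full.get(name) != data}

def restore(last_full, last_diff):
    restored = dict(last_full)
    restored.update(last_diff)               # re-apply the changes on top
    return restored

files = {"report.txt": "v1", "notes.txt": "v1"}
base = full_backup(files)                    # e.g. the monthly complete backup
files["report.txt"] = "v2"                   # a day's worth of edits
diff = differential_backup(files, base)      # the daily differential backup
assert diff == {"report.txt": "v2"}          # only the changed file is stored
assert restore(base, diff) == files
```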
Besides manually reducing duplication, we can develop innovative file systems that detect duplicate files and store only one copy. For example, such a file system could be used on an email server: if the server finds that several inboxes received the same email, the email is stored once, and each inbox points to the single stored copy. Used on an MIT student’s computer, such a file system would store a course syllabus only once even if the student downloaded it multiple times. By developing and deploying such file systems, we can effectively reduce duplication without changing much of users’ behavior.
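One common way to build such a system is content-addressed storage: identical content hashes to the same key, so it is stored once and merely referenced thereafter. A simplified sketch (my own illustration, not any particular file system’s design):

```python
import hashlib

# Content-addressed deduplication: content is keyed by its hash, so identical
# emails are stored once while each inbox keeps only a small pointer.
store = {}        # hash -> content, each unique content stored exactly once
inboxes = {}      # user -> list of hashes referencing stored content

def deliver(user, message):
    digest = hashlib.sha256(message).hexdigest()
    store.setdefault(digest, message)             # store content only if new
    inboxes.setdefault(user, []).append(digest)   # the inbox holds a pointer

announcement = b"Dorm meeting at 7pm tonight!"
for user in ["alice", "bob", "carol"]:
    deliver(user, announcement)

assert len(store) == 1                               # one stored copy...
assert sum(len(v) for v in inboxes.values()) == 3    # ...three inbox pointers
```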
We have seen the trend of information growth and the potential threats it implies, but we have also seen that we have the power to curb this rapid exponential growth. Oftentimes, we generate and store much more information than we need, and we consume information faster than we can absorb it. We do not want to overload our lives with massive amounts of information. Individuals and enterprises alike should be aware of this issue and take action before our digital world explodes.
Gantz, John and David Reinsel. “The Digital Universe Decade – Are You Ready?” IDC – IVIEW (2010): 16.
Howe, Carl. “Is the Information Explosion a Crisis?” 14 March 2007. Seeking Alpha. <http://seekingalpha.com/article/29486-is-the-information-explosion-a-crisis>.
Jungle Disk. “Jungle Disk Desktop Edition.” n.d. Jungle Disk. 2 November 2010 <https://www.jungledisk.com/personal/desktop/pricing/default.aspx>.
Seagate. “Seagate(R) Expansion(TM) External Drives.” n.d. Seagate. 2 November 2010 <http://www.seagate.com/www/en-us/products/external/expansion/expansion_desktop/>.
[1] 1TB is about 1000GB.
[2] To count the number of emails in each category, Mozilla Thunderbird was used. All emails in the inbox were first downloaded and then sorted into folders based on their recipients. To calculate the average size of the emails sent to the MacGregor mailing list, the size of the corresponding email database was divided by the total number of emails in that category.
Robin Cheng was born in China and later moved to Canada for four years of study, so he considers himself a mix of Chinese and Canadian. He is a member of the class of 2014, majoring in computer science. His primary interest lies in software engineering, though he is also passionate about music, psychology, and literature. Robin advocates a style of writing that uses simple English but expresses a clear idea.
He was inspired to write this piece when he had to clean up his new solid state drive frequently. He realized that much of today’s software uses disk space inefficiently, and that storage capacities, file sizes, and Internet speeds are all growing rapidly.