Petabytes of data spilling on the floor.
A million seconds is 12 days.
A billion seconds is 31 years.
A trillion seconds is 31,688 years.
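For anyone who wants to check those figures, here is a quick sketch of the arithmetic. It assumes an average year of 365.25 days, which is what the 31,688-year figure implies; the function names are just for illustration.

```python
# Convert large counts of seconds into days and years.
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds
DAYS_PER_YEAR = 365.25           # assumption: average year including leap days


def seconds_to_days(seconds):
    """Return the number of days in the given number of seconds."""
    return seconds / SECONDS_PER_DAY


def seconds_to_years(seconds):
    """Return the number of (average) years in the given number of seconds."""
    return seconds / (SECONDS_PER_DAY * DAYS_PER_YEAR)


million_in_days = seconds_to_days(10**6)
billion_in_years = seconds_to_years(10**9)
trillion_in_years = seconds_to_years(10**12)

print(f"A million seconds:  {million_in_days:.1f} days")
print(f"A billion seconds:  {billion_in_years:.1f} years")
print(f"A trillion seconds: {trillion_in_years:,.0f} years")
```

A million seconds works out to a little under 12 days, so the numbers above are rounded.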
Last year IDC released the results of a study that found the world generated 161 exabytes of digital data the year before. How much data is that? A lot. It's 161,000 petabytes. It's 161 million terabytes. It's 161 billion gigabytes. All still really big numbers.
For a more useful perspective, consider the US Library of Congress:
With millions of books on its shelves, it is Earth's largest library.
Yet the Library's printed works collection is estimated to store the equivalent of only about 10 terabytes. To even get into the petabytes of data we have to include the contents of all US research libraries, and then the total is only 2 petabytes.
While the sheer size of the numbers is entertaining, there are two serious points here. First, in 2006 we created more digital data than we could store. The same IDC study estimates that the world had only 181 exabytes of storage available in 2006. In that storage budget we had to store all the previously stored data, in addition to some fraction of the newly generated data. 2006 was the first time we exceeded our storage budget in the entire known history of humankind.
Second, this storage gap is expected to grow rapidly: IDC estimates we'll have a total of 601 exabytes of storage available worldwide by 2010, but in that year alone we will create 988 exabytes of new data.
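To put the gap in concrete terms, here is a small sketch using only the IDC figures quoted above. The shortfall is measured against total storage as if it were completely empty, which it is not, so the real gap is larger than what this shows. The variable and function names are made up for the example.

```python
# IDC figures quoted in this post, in exabytes.
idc_figures = {
    2006: {"created": 161, "storage": 181},
    2010: {"created": 988, "storage": 601},
}


def shortfall(created, storage):
    """Exabytes of new data that would not fit even in completely empty storage."""
    return max(0, created - storage)


def created_to_storage_ratio(created, storage):
    """New data created in the year as a fraction of all storage available."""
    return created / storage


for year, f in sorted(idc_figures.items()):
    gap = shortfall(f["created"], f["storage"])
    ratio = created_to_storage_ratio(f["created"], f["storage"])
    print(f"{year}: created {f['created']} EB, storage {f['storage']} EB, "
          f"shortfall {gap} EB, ratio {ratio:.2f}")
```

In 2006 the new data alone would have taken up most of the world's storage, leaving little room for everything stored in earlier years. By 2010 the new data exceeds total storage outright.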
It is no exaggeration to say that our society is in the midst of a profound change. There are important debates going on about what this means, and how much of what we create we'll be able to leave to future generations. These questions will take a long time to sort out.
In the meantime, businesses and other organizations that thrive on data are faced with new challenges they need to respond to today. Until now they've had the luxury of assembling data over time, analyzing it offline, conducting long-term studies, revisiting old data, and so on. The trick was getting all of the data into one place; once that happened, it could be analyzed and re-analyzed as many times as one had a reason to do so.
This is no longer the case. As streams of interactive data -- data about everything from online browsing behavior to health statistics to the financial markets -- multiply exponentially, data-centric organizations must change the way they think about data. Data streams now need to be processed during collection, either to support making tactical decisions in near real-time based on the stream, or to reduce the data to its vital characteristics for in-depth analysis offline.
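As a rough illustration of the idea, here is a minimal sketch of processing a stream during collection. It is not DataRush code and does not use any real product's API; the class, function, threshold, and sample readings are all hypothetical. Each value is looked at once, flagged if it needs immediate attention, folded into a small running summary, and then dropped.

```python
from dataclasses import dataclass


@dataclass
class RunningSummary:
    """A constant-size summary of every value seen so far."""
    count: int = 0
    mean: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def update(self, value: float) -> None:
        # Incremental mean: no need to keep the earlier values around.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)


def process_stream(readings, alert_threshold):
    """Consume readings one at a time, flagging outliers as they arrive."""
    summary = RunningSummary()
    alerts = []
    for value in readings:
        summary.update(value)
        if value > alert_threshold:
            # The near-real-time decision: act on the value right away.
            alerts.append(value)
    return summary, alerts


def sample_readings():
    # Hypothetical stand-in for a live feed of measurements.
    yield from [3.0, 5.0, 4.0, 12.0, 6.0]


summary, alerts = process_stream(sample_readings(), alert_threshold=10.0)
print(summary)
print("alerts:", alerts)
```

Only the summary and the flagged values are kept, so memory use stays small no matter how long the stream runs. The summary is what would be handed off for in-depth analysis offline.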
Some view this change in information management as a catalyst for the creation of a new branch of computer science and engineering. Stream processing (as it is called) has sparked a great deal of interest and research among academics and businesses alike, and the architectures they come up with will determine how society continues to evolve while it soaks in this data soup.
Pervasive's DataRush is right on the front line of this change. While not (currently) useful for the ultra-low latency required by algorithmic trading systems, DataRush offers very high levels of throughput, allowing for far greater use of actual data, rather than sampled or derived data.
As co-sponsors of the recent Gartner Event Processing Summit, we spoke with a number of the top software providers in the event processing marketplace. They all recognize the need for higher throughput in addition to the low latency they compete on.
What new insights and information could you get from the ever-growing wave of data you deal with?




