July 2008 Archives

2008-07-03T08_16_38

Taming Vast Amounts of Data and Vast Data Files

The files and documents on your hard disk drive are getting larger, which makes it harder to find information when you need it. Use Google Desktop Enterprise edition to index all the files on your machine and make that data easier to find.

Typical Personal Computers (PCs) today have hard disk drives with capacities measured in gigabytes. Not so long ago, as Bill Gates' vision of a PC on every desk began to be realized in the early nineteen nineties, hard disk drives were just a few tens of megabytes in size. There have been several orders of magnitude of increase in the amount of storage available on the typical computer in the last decade and a half.

Soon PCs will come with terabytes of hard disk drive capacity. And guess what, most users will still be constantly running out of disk space! Why do we need so much more storage than we did ten years ago? Now we store videos, photographs, and audio files on our computers. These files are large. Additionally, the current shift toward high-definition video is increasing file sizes again. So we will need those terabyte drives. In fact, most people can use all the storage that they can obtain.

However, writers of 'ordinary' programs like word processors and email clients seem to have forgotten how to be efficient with disk space. These programs do not have to deal with video or audio and so should not need to have particularly large data files. Generally speaking there is no need for an email program to store all its information in one large file. Packing everything into a single file is like putting all your eggs in one basket. Everything will be fine until the day that something (probably a buggy email client) corrupts the file and possibly loses all of your email data.

Take Lotus Notes, for example. I used Lotus Notes every day at work for a few years, and I was always amazed by the size of the files that it created on the laptop's hard disk drive. I would generally have around 10 gigabytes in one or two large data files, dedicated to letting Notes show email messages when I was not connected to the internet. What was most impressive was that, exported to a text file, the same information would have been at least ten times smaller. It was never clear to me how Lotus Notes achieved this impressive level of 'anti' compression. Lotus Notes is not alone; many programs these days take the opportunity to create large 'database' files and proceed to bloat them with strange and esoteric information. Possibly this reflects something computer scientists are being taught before they leave college, but from a user's point of view it seems to provide very little value.

Not only do large files provide seemingly little value, they also cause problems. Large files take a long time to search. Large files chew up large amounts of disk space and take longer to back up. Most importantly, large files are fragile. If something goes wrong in one large file, a small area of corruption for example, the whole file can become unreadable. So you may lose all your data in one fell swoop. From a user's point of view this is very undesirable.

The fragility problem can only be resolved by programmers writing their programs to break up the data storage into multiple files. Smart programmers do this instinctively, and hopefully, as applications like Lotus Notes continue to evolve, this will be programmed into future versions.
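To make the multiple-file idea concrete, here is a minimal Python sketch that stores each message in its own small file. It is only an illustration of the principle, not how Lotus Notes or any real email client works; the function names `save_message` and `load_all_messages` and the hash-based file naming are assumptions made up for this example.

```python
import hashlib
from pathlib import Path


def save_message(store_dir, message_text):
    """Write one message to its own file (hypothetical layout for illustration)."""
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    # Name the file after a hash of its content so each message gets its own file.
    digest = hashlib.sha256(message_text.encode("utf-8")).hexdigest()[:16]
    path = store / (digest + ".txt")
    path.write_text(message_text, encoding="utf-8")
    return path


def load_all_messages(store_dir):
    """Read every message that can still be read; list the files that cannot."""
    messages = []
    skipped = []
    for path in sorted(Path(store_dir).glob("*.txt")):
        try:
            messages.append(path.read_text(encoding="utf-8"))
        except (OSError, UnicodeDecodeError):
            # A damaged file costs us one message, not the whole mailbox.
            skipped.append(path)
    return messages, skipped
```

If one file is damaged, `load_all_messages` still returns every other message and reports the damaged file separately, which is exactly the property a single monolithic data file lacks.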

To handle the problem of finding information in large data files, particularly Lotus Notes email files, I eventually found that Google Desktop Enterprise provided a workable solution. It comes complete with the ability to index Lotus Notes email messages. Google Desktop Enterprise creates files which index keywords in all your files and then provides a fast, familiar, Google-like search capability for your own machine.
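The general idea of a keyword index can be sketched in a few lines of Python. This is a toy illustration of the technique, commonly called an inverted index, and makes no claim about how Google Desktop actually builds or stores its index files; the function names and the sample documents are invented for the example.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical sample data, standing in for files on disk.
documents = {
    "mail_001.txt": "Quarterly budget report attached",
    "mail_002.txt": "Lunch on Friday?",
    "notes.txt": "Draft of the budget for next year",
}
index = build_index(documents)
print(search(index, "budget report"))
```

Searching becomes a lookup in the index rather than a scan of every file, which is why it is fast. The cost is that the index itself has to be built and stored.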

This essentially solves the searching problem. Google's search technology is so efficient that you can rapidly find messages by searching on short keyword strings. If you need to look at the search results in Lotus Notes, Microsoft Word, or any other application, Google Desktop will fire up the appropriate application with the appropriate document loaded. It is pretty neat and saves a lot of time that would otherwise go into searching through email and so on manually.

Sadly, however, Google Desktop Enterprise performs this magic by creating huge index files on your hard disk drive, so there is a price to pay in storage. Again, the index files can be tens of gigabytes in size, so the storage price is substantial. Additionally, the indexing process takes many hours. You will probably want to leave it running overnight, particularly if you are a Lotus Notes user. Despite these reservations, you will find that the resulting Google convenience is worth the effort.

To download and install Google Desktop Enterprise, just search for Google Desktop Enterprise, using your favorite search engine. You will find that it is a free download, and that it is straightforward to get it up and running. Just be aware of the disk space cost that you will need to pay.

So, you can invest yet more disk space to fight back in the struggle with vast amounts of data and vast data files. The real respite in this struggle is going to come from programmers changing their habit of creating vast data files. This is not something that will happen overnight. However, the rise of Linux, where software and data storage are generally more efficient, may help push programmers on all platforms in this direction.


Posted by ZFS