Strategies for pruning data in the cloud

David Taber | Nov. 10, 2011
In terrestrial systems, you don't think about disk space. In the clouds, you have to; if you don't, it will cost you.

• Before you go any further, do a complete backup of all your cloud's data onto either disk or optical media. I cannot say it any more clearly: this is NOT optional.

• For tables that can be freely pruned, look for the "signal-to-noise ratio." Is there some time horizon beyond which the information doesn't matter at all? For example, in a marketing automation or web monitoring cloud, do we really care about anonymous visitors who haven't returned in 6 months? Is it OK to remove all Leads with a score of less than zero? Make sure you get buy-in from all the affected user groups first, but signal-to-noise based pruning can get rid of millions of records in a hurry.

• Some tables have decent signal-to-noise ratios, but the amount of detail stored just isn't worth it over time. For example, many marketing automation and e-mail blasting systems use the activity table to record important e-mail and Web interactions. These activity tables can represent half of the system's storage. But how much will it matter a year from now whether a person watched video A today versus video B yesterday? Use this litmus test: if a particular detail will not actually change anyone's decision or behavior, it's not "information" any longer. For these situations, we recommend a compression approach: keep the information, but remove most of the details after 6 months or so. The histories are typically stored as custom tables, represented by tallies, token strings, or even bitmaps with tiny storage requirements. This strategy will require some careful thinking, user input, and custom code development, but can provide continuous pruning based on information value.

• Some tables (particularly leads and contacts) can collect duplicates in a hurry, particularly if your firm has process problems in lead collection and handling. If your cloud system has deduping tools (from the main vendor or third parties), buy a good one and really learn it. The best tools have fuzzy-logic algorithms that let you find and merge duplicates without moving the data out of the cloud. The merging process preserves as much of the data as possible, but if you have a lot of data collisions (e.g., two different mobile phone numbers for the same person), you may need to create shadow fields and pre-populate them with divergent data prior to the merge. For a number of reasons, data merging must be done in phases: it takes a lot of CPU time, as well as your think-time, to get rid of 100,000 dupes. Do not rush it, as there is no undo for a merge.
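
To make the duplicate-handling advice above more concrete, here is a minimal sketch in Python of fuzzy matching and a merge that parks colliding values in shadow fields. It is an illustration only, not how any particular deduping tool works: the field names (name, email, mobile, title), the "_shadow" suffix, and the 0.85 similarity threshold are all assumptions made for the example, and the matching uses the standard library's difflib rather than a vendor's fuzzy-logic algorithm.

```python
from difflib import SequenceMatcher


def normalize(value) -> str:
    """Lower-case and trim a value so trivial formatting differences are ignored."""
    return str(value).strip().lower()


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score for two strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_probable_duplicate(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Flag two contacts as likely duplicates (hypothetical rule for this sketch).

    Records match if their e-mail addresses are identical after normalization,
    or if their names are at least `threshold` similar.
    """
    email_a, email_b = normalize(rec_a.get("email", "")), normalize(rec_b.get("email", ""))
    if email_a and email_a == email_b:
        return True
    return similarity(rec_a.get("name", ""), rec_b.get("name", "")) >= threshold


def merge_with_shadow_fields(survivor: dict, duplicate: dict) -> dict:
    """Merge `duplicate` into a copy of `survivor` without discarding divergent data.

    Blank fields on the survivor are filled from the duplicate. Where both records
    hold different values, the duplicate's value is kept in "<field>_shadow".
    """
    merged = dict(survivor)
    for field, dup_value in duplicate.items():
        if not dup_value:
            continue  # nothing to preserve
        current = merged.get(field)
        if not current:
            merged[field] = dup_value
        elif normalize(current) != normalize(dup_value):
            merged[f"{field}_shadow"] = dup_value
    return merged


if __name__ == "__main__":
    a = {"name": "Jon Smith", "email": "jon@example.com", "mobile": "555-0100", "title": ""}
    b = {"name": "John Smith", "email": "JON@example.com", "mobile": "555-0199", "title": "CTO"}
    if is_probable_duplicate(a, b):
        print(merge_with_shadow_fields(a, b))
```

Because there is no undo for a merge, a script like this is best run first as a dry run that only reports candidate pairs, with the actual merging done in small reviewed batches.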

Most of the above is a one-time fix, rather than a process change. If you aren't willing to invest in enhancing your data management processes, you may need to revisit these issues on a quarterly basis. Pretty much forever.
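
If the pruning does turn into a recurring chore, the time-horizon and compression ideas above can be scripted. The following Python sketch shows one possible shape for such a job, under stated assumptions: the record layouts (contact_id, kind, day, anonymous, last_seen) are hypothetical, the 180-day horizon stands in for "6 months or so," and the summary format (a tally per activity type plus a one-bit-per-month bitmap) is just one way to keep the history while dropping the detail. Run anything like it only after the backup described earlier.

```python
from collections import Counter
from datetime import date, timedelta


def prune_stale_visitors(visitors, today, horizon_days=180):
    """Drop anonymous visitors not seen within the horizon; keep everyone else."""
    cutoff = today - timedelta(days=horizon_days)
    return [v for v in visitors if not v["anonymous"] or v["last_seen"] >= cutoff]


def compact_activities(activities, today, horizon_days=180):
    """Keep recent activity rows verbatim and summarize older ones per contact.

    Returns (recent_rows, summaries), where summaries maps contact_id to:
      "tallies"       - Counter of how many times each activity kind occurred
      "active_months" - int bitmap; bit N is set if the contact was active
                        N calendar months before the cutoff month
    """
    cutoff = today - timedelta(days=horizon_days)
    recent = []
    summaries = {}
    for act in activities:
        if act["day"] >= cutoff:
            recent.append(act)  # still inside the horizon: keep full detail
            continue
        summary = summaries.setdefault(
            act["contact_id"], {"tallies": Counter(), "active_months": 0}
        )
        summary["tallies"][act["kind"]] += 1
        months_back = (cutoff.year - act["day"].year) * 12 + (cutoff.month - act["day"].month)
        summary["active_months"] |= 1 << months_back
    return recent, summaries


if __name__ == "__main__":
    today = date(2011, 11, 10)
    rows = [
        {"contact_id": 1, "kind": "video_view", "day": date(2011, 10, 1)},
        {"contact_id": 1, "kind": "email_open", "day": date(2011, 3, 2)},
        {"contact_id": 1, "kind": "email_open", "day": date(2011, 3, 20)},
    ]
    recent, summaries = compact_activities(rows, today)
    print(len(recent), dict(summaries[1]["tallies"]), bin(summaries[1]["active_months"]))
```

The design choice here is that a summary answers "how often, and in which months" but no longer "exactly which video on which day," which matches the litmus test of dropping details that will not change anyone's decision.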

