Over the past few years, organizations have awakened to the fact that there is knowledge hidden in Big Data, and vendors are feverishly working to develop technologies such as Hadoop Map/Reduce, Dryad, Spark and HBase to efficiently turn this data into information capital. That push will benefit from the emergence of another technology: Software Defined Networking (SDN).
Much of what constitutes Big Data is actually unstructured data. While structured data fits neatly into traditional database schemas, unstructured data is much harder to wrangle. Take, for example, video storage. While the video file type, file size, and the source IP address are all structured data, the video content itself, which doesn't fit in fixed length fields, is all unstructured. Much of the value obtained from Big Data analytics now comes from the ability to search and query unstructured data — for example, the ability to pick out an individual from a video clip with thousands of faces using facial recognition algorithms.
The technologies aimed at this problem achieve the required speed and efficiency by parallelizing the analytic computations across clusters of hundreds of thousands of servers connected via high-speed Ethernet networks. Hence, the process of mining intelligence from Big Data fundamentally involves three steps: 1) split the data across multiple server nodes; 2) analyze each data block in parallel; 3) merge the results.
These operations are repeated through successive stages until the entire dataset has been analyzed.
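The three steps above can be sketched in a few lines of Python. This is only an illustrative toy, not how Hadoop, Spark or any other framework is implemented: the function names are hypothetical, word counting stands in for the real analysis, and a local thread pool stands in for the cluster. On real infrastructure, the split and merge steps are where data crosses the network.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(records, num_blocks):
    """Step 1: split the dataset into roughly equal blocks, one per worker."""
    return [records[i::num_blocks] for i in range(num_blocks)]


def analyze(block):
    """Step 2: analyze one block independently (here: count words)."""
    counts = Counter()
    for line in block:
        counts.update(line.lower().split())
    return counts


def merge(partials):
    """Step 3: merge the per-block results into a single result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def split_analyze_merge(records, num_blocks=4):
    """Run one split/analyze/merge stage; threads stand in for server nodes."""
    blocks = split(records, num_blocks)
    with ThreadPoolExecutor(max_workers=num_blocks) as pool:
        partials = list(pool.map(analyze, blocks))
    return merge(partials)


if __name__ == "__main__":
    lines = ["big data needs big networks", "networks move data"]
    print(split_analyze_merge(lines, num_blocks=2))
```

A multi-stage job would feed the merged output of one stage into the split of the next, which is why the data transfer between stages matters so much.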
Owing to the Split-Merge nature of these parallel computations, Big Data analytics can place a significant burden on the underlying network. Even with the fastest servers in the world, data processing speeds, the biggest bottleneck for Big Data, can only be as fast as the network's capability to transfer data between servers in both the Split and Merge phases. For example, a study of Facebook traces shows that this data transfer between successive stages accounted for 33% of the total running time, and for many jobs the communication phase took up over 50% of the running time.
By addressing this network bottleneck we can significantly speed up Big Data analytics, which has two-fold implications: 1) better cluster utilization reduces TCO for the cloud provider that manages the infrastructure; and 2) faster job completion times, resulting in real-time analytics for the customer that rents the infrastructure.
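To get a rough feel for how much a faster network could help, the short calculation below applies an Amdahl's-law-style estimate to the transfer fractions quoted above. It rests on simplifying assumptions that are not in the study itself: data transfer and computation do not overlap, and only the transfer portion of the job gets faster. The function name and parameters are hypothetical.

```python
def job_speedup(transfer_fraction, network_speedup):
    """Estimate overall job speed-up when only the data transfer gets faster.

    transfer_fraction: share of running time spent moving data (0.0 to 1.0).
    network_speedup:   factor by which the transfer phase is accelerated.
    Assumes transfer and computation do not overlap (a simplification).
    """
    if not 0.0 <= transfer_fraction <= 1.0:
        raise ValueError("transfer_fraction must be between 0 and 1")
    if network_speedup <= 0:
        raise ValueError("network_speedup must be positive")
    # Compute time is unchanged; transfer time shrinks by the given factor.
    new_time = (1.0 - transfer_fraction) + transfer_fraction / network_speedup
    return 1.0 / new_time


if __name__ == "__main__":
    for fraction in (0.33, 0.50):
        print(fraction, round(job_speedup(fraction, 2.0), 2))
```

Under these assumptions, doubling the effective transfer speed makes a job that spends 33% of its time on transfer run roughly 1.2 times faster, and a job that spends 50% on transfer about 1.33 times faster.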
What we need is an intelligent network that, through each stage of the computation, adaptively scales to suit the bandwidth requirements of the data transfer in the Split and Merge phases, thereby improving not only speed-up but also utilization.
The role of SDN
SDN has huge potential to build the intelligent adaptive network for Big Data analytics. Because it separates the control and data planes, SDN provides a well-defined programmatic interface through which software intelligence can program networks that are highly customizable, scalable and agile, meeting the requirements of Big Data on demand.