Historically, the main method for managing these ‘fire hose’-like streams of information has been to discard most of it. Significant effort has always gone into sampling strategies which ensure the “baby is not thrown out with the bath water”, increasing the level of detail captured only when required. With the cost per gigabyte now so low, however, this data is kept and mined for business advantage.
The three Vs (volume, velocity and variety) are used to characterize Big Data problems, and this blog is about how they affect me daily.
Years ago I could tell my boss almost instantly that we had 16 storage arrays, 1,600 spindles, 650 volumes and 800 TB of data in our global storage estate. If he wanted to know what the I/O performance trends were, I could generate a report in 30 minutes which sampled a pre-selected but highly limited range of just 12 critical volumes, about 2% of the total. Breaking down the storage consumption by business application and reporting against the capacity requirements forecast for the year would take two days. However, the constant changes throughout the estate between the quarterly reviews resulted in reporting errors due to missed volumes.
In those days my storage metric database was over 150 GB in size, hosted on a 4-way server with 8 GB of RAM, and it could just about manage to keep up with the data streams. Today I have over 7,000 storage arrays in the UK which I could be asked questions about, and many more in the rest of Europe. Unless I was able to generate a point-in-time snapshot of the information in all the different management systems used, I could not tell you accurately how many spindles we supported. This is because the number would change between when I started calculating the total and when I finished many hours later, due to changes in the estate. If the boss needed to know, the problem could be solved by getting the wider team involved. We could all agree which part of the estate each of us would report on, and then send our answers to a central point for totalling and reporting back.
This is what happens in Big Data when the Hadoop platform, with MapReduce, is employed for parallel processing. The data is captured into the distributed file system, HDFS, making it available to the multiple nodes in the Hadoop cluster. MapReduce enables these servers to work together in parallel, each taking a chunk of the problem to solve and processing it in the “Map stage”, then returning the partial results for totalling in the “Reduce stage”. This increased power is used to run highly complex queries on all the data available, eliminating the reporting errors. Clearly this is more like the batch processing of old, and it brings us on to the next problem.
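The split, count and total pattern described above can be sketched in a few lines of Python. This is only an illustration of the map and reduce stages, not Hadoop itself: the inventory data, the function names and the use of a local thread pool in place of cluster nodes are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical inventory: each tuple is (array_name, spindle_count).
ESTATE = [
    ("array-01", 120),
    ("array-02", 96),
    ("array-03", 240),
    ("array-04", 144),
]

def split(records, parts):
    """Divide the records into one chunk per worker."""
    return [records[i::parts] for i in range(parts)]

def map_chunk(chunk):
    """Map stage: a worker totals the spindles in its own chunk."""
    return sum(spindles for _name, spindles in chunk)

def reduce_totals(partials):
    """Reduce stage: combine the partial totals into one answer."""
    return reduce(lambda a, b: a + b, partials, 0)

def count_spindles(records, workers=2):
    """Run the map stage in parallel, then reduce the partial results."""
    chunks = split(records, workers)
    # A local thread pool stands in for the nodes of a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_totals(partials)

if __name__ == "__main__":
    print(count_spindles(ESTATE))
```

Each worker sees only its own chunk, just as each member of the team reports on their own part of the estate, and only the small partial totals travel back to the central point.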
Black Hawk Down presents us with a classic example of the ‘velocity’ issue. In the movie, a convoy of soldiers is trying to navigate to the helicopter crash site through the city of Mogadishu. The organisation’s intelligence team back in HQ have access to maps of the city and live information streams, which they must combine in order to generate a safe route through the city. Simple instructions are relayed through the chain of command to the lead vehicle driver, who has to turn left or right. The problem, however, is that the situation on the ground is constantly changing due to the opposition gathering in armed mobs and creating road blocks. The time required by the back office processes to generate the instructions is greater than the time for which the opportunity exists and the instructions are valid. Lt. Col. Danny McKnight: “You have to tell me to turn before we have passed the turn!” The proposed solution from the back office team is to drive slower, but this is ruled out as ‘unacceptable’ by the people in the firing line, who are literally getting shot at!
Back in my data centre I must identify potential issues before they become a problem. Waiting for an alert from the monitoring system stating that a threshold has been passed is how we operated yesterday. The alert must be configured at a point which leaves enough time for the administration team to respond, and the surplus capacity this leaves on top of the actual requirement is a buffer of extra system resources: insurance against the unforeseen. The bigger the buffer, the more time is available to act, but this also increases the infrastructure costs, directly impacting the bottom line. Intelligence data on what is actually happening within the environment, coupled with delivery of simple instructions during the window of opportunity, enables profits to be maximised and increases business agility.
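The trade-off between buffer size and response time can be put into numbers with a small calculation. The sketch below assumes a constant growth rate, which real workloads rarely have, and the function name and figures are made up for illustration.

```python
def hours_to_act(capacity_tb, alert_at_tb, growth_tb_per_hour):
    """Hours between the alert firing and the capacity running out,
    assuming usage grows at a constant rate (a simplification)."""
    if growth_tb_per_hour <= 0:
        raise ValueError("growth_tb_per_hour must be positive")
    # The buffer is the capacity held back above the alert threshold.
    buffer_tb = capacity_tb - alert_at_tb
    return buffer_tb / growth_tb_per_hour

if __name__ == "__main__":
    # Alerting at 80 TB of a 100 TB pool leaves a 20 TB buffer.
    print(hours_to_act(100, 80, 0.5))
    # Alerting at 90 TB halves the buffer and the time to respond.
    print(hours_to_act(100, 90, 0.5))
```

Halving the buffer halves the time to respond, so the only way to run with a smaller and cheaper buffer is to receive the warning, and act on it, faster.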
By “streaming data” or, to use a different term, “complex event processing”, the input to the system can be reduced to a manageable amount. Analysing the input before it reaches the back end systems enables data which warrants an immediate response from the application to be identified. IBM’s InfoSphere Streams is an established proprietary product which competes with the open source frameworks Storm, from Twitter, and S4, from Yahoo. Through establishing a feedback loop, the outputs can be utilised to further drive business advantage.
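The idea of inspecting each event as it arrives, and passing on only those which need attention, can be sketched with a plain Python generator. This does not use the API of Streams, Storm or S4; the `Sample` record, the `urgent` function and the 20 ms limit are assumptions chosen purely for illustration.

```python
from typing import Iterable, Iterator, NamedTuple

class Sample(NamedTuple):
    """One hypothetical latency reading for a storage volume."""
    volume: str
    latency_ms: float

def urgent(samples: Iterable[Sample], limit_ms: float = 20.0) -> Iterator[Sample]:
    """Yield only the samples that warrant an immediate response.

    Samples are examined one at a time as they arrive, so nothing
    has to be stored before a decision is made.
    """
    for sample in samples:
        if sample.latency_ms > limit_ms:
            yield sample

if __name__ == "__main__":
    feed = [
        Sample("vol-a", 4.2),
        Sample("vol-b", 35.0),
        Sample("vol-c", 7.9),
        Sample("vol-d", 61.3),
    ]
    for hit in urgent(feed):
        print(hit.volume, hit.latency_ms)
```

Because the generator is lazy, the feed could just as easily be an unbounded stream; only the readings above the limit ever reach the code which responds to them.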
Lastly, the variety of log files and the diversity of detail in them is a constant nightmare to the systems administrator. It is not uncommon when troubleshooting a problem to compare the logs from a storage array, network switch and host to discover what is happening. A skilled engineer has no difficulty comparing apples with pears, discovering that they are both fruits and processing accordingly. Data extrapolation is at the heart of Big Data problem solving, where computers are employed to perform repeatable tasks quickly, such as reading every log file in a distributed, n-tier platform operated concurrently by thousands, then analysing the unstructured data and discovering patterns to extract the ordered meaning. However, it is not that simple, because the real issue with Big Data variety is the quality of the data. Before it can be processed it must be cleaned up and entity resolution performed to establish exactly what a name refers to. Is this Exchange2010 volume for customer X on filer Y or customer A on filer B?
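A very simple form of the clean-up and entity resolution step might look like the sketch below. The alias table, the naming patterns and the function names are all hypothetical; a real estate would need far richer matching rules than stripping punctuation and case.

```python
import re

# Hypothetical lookup table: normalised volume name -> (customer, filer).
KNOWN_VOLUMES = {
    "exchange2010custx": ("customer X", "filer Y"),
    "exchange2010custa": ("customer A", "filer B"),
}

def normalise(raw):
    """Lower-case the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", raw.lower())

def resolve(raw):
    """Return (customer, filer) for a raw name from a log, or None if unknown."""
    return KNOWN_VOLUMES.get(normalise(raw))

if __name__ == "__main__":
    for name in ("Exchange2010_CustX", "EXCHANGE2010-cust-a", "Exchange2010"):
        print(name, "->", resolve(name))
```

The last name in the example stays unresolved, which is the point: a bare “Exchange2010” does not carry enough information to say which customer or filer it belongs to, so it must be flagged rather than guessed.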
What I must deliver is a solution which can take all the different inputs and tell me who the individual is, running CrapApp1 on their PC, generating the problem IO load on storage system Y, which is causing the core business application to operate slowly for the whole business. There is nothing more powerful than catching “nobody did that” red-handed, to stop it happening again.