Filtering Big Data
As we mentioned in a previous article, 70 to 80 percent of a data scientist’s time is spent gathering data, and only 20 to 30 percent on actual in-depth analysis. The biggest challenge is therefore not the analysis itself. It lies in extracting, transforming, and loading the data, a process known as ETL, which precedes the analysis.
· E: data is extracted from its source.
· T: data is transformed using a variety of aggregations, functions, and combinations to make it usable.
· L: data is loaded into an environment in which analysis will take place.
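The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a production pipeline: the CSV sample, the column names, the choice to drop rows with a missing amount, and the in-memory SQLite database standing in for the analysis environment are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw source: a tiny CSV export standing in for a real file or feed.
RAW_CSV = """customer_id,amount,currency
1,10.50,USD
2,,USD
1,4.50,USD
3,7.00,USD
"""


def extract(text):
    """E: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """T: drop rows with a missing amount and total the rest per customer."""
    totals = {}
    for row in rows:
        if not row["amount"]:
            continue  # assumed rule: an empty amount makes the row unusable
        customer_id = int(row["customer_id"])
        totals[customer_id] = totals.get(customer_id, 0.0) + float(row["amount"])
    return sorted(totals.items())


def load(records, conn):
    """L: write the transformed records into the analysis environment."""
    conn.execute("CREATE TABLE customer_totals (customer_id INTEGER, total REAL)")
    conn.executemany("INSERT INTO customer_totals VALUES (?, ?)", records)
    conn.commit()


def run_etl():
    """Run extract, transform, and load in order and return the loaded database."""
    conn = sqlite3.connect(":memory:")  # in-memory database as a stand-in target
    load(transform(extract(RAW_CSV)), conn)
    return conn
```

Calling run_etl() returns a connection whose customer_totals table can then be queried for the analysis itself.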
Before filtering the data stream down to the parts we are most interested in, we need to investigate the stream as a whole, exploring the different pieces of information it carries and dismissing the irrelevant ones.
Filtering the data is a crucial step before applying analytics. Filters can be applied at the front end of the data stream to remove unwanted portions when the data first arrives. Other filters can be applied along the way as the data is processed. How much data is filtered out depends on the data source and the business problem at hand.
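As a rough sketch of this idea, the Python example below chains one filter at the front end of a stream with a second filter applied after a processing step. The event fields (user_id, type) and both filtering rules are hypothetical and chosen only for illustration. A real stream would define its own rules based on the data source and the business problem.

```python
def front_end_filter(events):
    """Front-end filter: drop unusable records as soon as they arrive."""
    for event in events:
        if event.get("user_id") is not None:
            yield event


def enrich(events):
    """A processing step that derives a field the later filter relies on."""
    for event in events:
        yield {**event, "is_purchase": event.get("type") == "purchase"}


def business_filter(events):
    """A later filter tied to the business question (here: purchases only)."""
    for event in events:
        if event["is_purchase"]:
            yield event


# Hypothetical sample of incoming events.
stream = [
    {"user_id": 1, "type": "purchase"},
    {"user_id": None, "type": "purchase"},  # removed at the front end
    {"user_id": 2, "type": "page_view"},    # removed by the later filter
    {"user_id": 3, "type": "purchase"},
]

# Generators are chained, so each event flows through the stages one at a time.
kept = list(business_filter(enrich(front_end_filter(stream))))
```

Because each stage is a generator, records rejected at the front end never reach the later, more expensive steps.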
Mixing Data
A simple way to improve the quality of insight the data provides is to mix big data with the traditional data already available in the company. This makes it possible to address problems that span different data sources and depend on one another. For example, if a company has a traditional relational database of its employees and a big data stream of customer data, combining the two sources in an enterprise data warehouse (EDW) allows us to analyze customer and employee data together, since they are no longer separate, and to find hidden connections between the two. An independent big data strategy is therefore not desirable. It is preferable to develop a single cohesive strategy that includes all types of data.
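The employee and customer example can be sketched as follows. An in-memory SQLite database stands in for the EDW, and everything about the data is an assumption made for illustration: the table layouts, the sample rows, and in particular the handled_by field that links a customer event to an employee.

```python
import sqlite3

# In-memory SQLite database as a stand-in for an enterprise data warehouse.
conn = sqlite3.connect(":memory:")

# Traditional relational data: a hypothetical employees table.
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, name TEXT, region TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Ana", "North"), (2, "Ben", "South")],
)

# Records from the customer stream; handled_by is an assumed link to an employee.
customer_events = [
    {"customer_id": 101, "handled_by": 1, "spend": 20.0},
    {"customer_id": 102, "handled_by": 2, "spend": 5.0},
    {"customer_id": 103, "handled_by": 1, "spend": 15.0},
]
conn.execute(
    "CREATE TABLE customer_events (customer_id INTEGER, handled_by INTEGER, spend REAL)"
)
conn.executemany(
    "INSERT INTO customer_events VALUES (:customer_id, :handled_by, :spend)",
    customer_events,
)

# With both sources in one place, a single query can relate them.
spend_by_employee = conn.execute(
    """
    SELECT e.name, SUM(c.spend)
    FROM employees AS e
    JOIN customer_events AS c ON c.handled_by = e.employee_id
    GROUP BY e.name
    ORDER BY e.name
    """
).fetchall()
```

The join is only possible because both sources sit in the same environment. Kept apart, neither one could answer a question about customer spend per employee.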
The Evolving Nature of Big Data
In What is Big Data we gave some definitions of the topic. However, today’s big is not tomorrow’s big, because “big” is relative to the available resources and technology. Moreover, big data is not the same across industries: what counts as big data in a large e-commerce company is much larger than what counts as big data in a school or a small organization. This doesn’t mean that big data is going away. It simply means that its definition is evolving, and it’s here to stay!