Hyper Dormant
5 min read · Oct 11, 2019

What is Different about Big Data?

To distinguish big data from traditional data sources, we will introduce some of its features. A big data source doesn’t necessarily satisfy all of the following features, but it typically satisfies most of them. These features are:

1) Big data is often generated in an automated manner:

Traditional data sources involve a person in their generation: think of bank transactions, invoice payments, telephone records, and so on.

On the other hand, automated generation means that data is produced by machines without human intervention: consider the data gathered by IoT devices, smart watches, weblogs tracking user behavior online, and other real-time data generated by sensors monitoring machinery. For example, the Large Hadron Collider at CERN generates 40 terabytes of data every second during experiments.

2) Big data typically stems from a new source of data:


If we look at bank transactions done online, for instance, they are not fundamentally different from transactions executed traditionally. However, the transaction now takes place in a new channel.

3) Many big data sources are messy, ugly and unfriendly:


Think of tweets: we can’t force users to follow strict rules on grammar, character usage, or the structure of what they write. Most traditional data sources, on the other hand, are deliberately designed to be clean and easy to manage, so that they can be stored and manipulated with ease.

Traditionally, data sources were constrained and tightly defined, so that only data of value was included. With the diminishing cost of storage and the undefined nature of big data sources, we now usually store all of the data we receive and worry about what’s valuable later. This lets us capture all the subtleties of the data and ensures nothing is missed; however, it makes the process of data analysis more painful.

Technologies don’t come risk free, here are some of the risks triggered by big data:

1) Storing data costs money. As data accumulates, money is spent on its storage and maintenance, so a smart strategy is needed when dealing with big data. It’s not necessary to capture every bit of data being streamed; we can instead sample some of the data and apply exploratory analysis to determine which sources are relevant to the organization and how each source can be used. One is then ready to tackle the data sources of interest at full scale.

2) The biggest risk that accompanies big data is probably privacy. Individuals and corporations may exploit users’ data without their consent. Refer to the Cambridge Analytica case for a sobering insight into the privacy risks of big data.
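The “sample rather than store everything” strategy described in the first risk above can be sketched with reservoir sampling, a standard technique for drawing a uniform sample from a stream of unknown length. This is an illustrative sketch, not a prescribed method; the event feed here is simulated:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream without storing it all."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = random.randint(0, i)     # replace with decreasing probability
            if j < k:
                sample[j] = item
    return sample

# Sample 5 events from a simulated feed of 10,000 without keeping them all in memory.
events = (f"event-{i}" for i in range(10_000))
subset = reservoir_sample(events, 5)
print(len(subset))  # 5
```

The key property is that every item in the stream ends up in the sample with equal probability, so exploratory analysis on the sample is representative of the full feed.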


The structure of big data

Generally, there are three types of data: structured, unstructured, and semi-structured. It is often said that traditional data is structured and big data is unstructured; to investigate the validity of this statement, let us explore each type of data structure.

Structured Data: most traditional data sources are structured, meaning the data comes in a predefined format that conforms to a particular schema; its features are known and expected ahead of time.
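A fixed schema means records that don’t fit are rejected rather than stored. A minimal sketch using SQLite (the table and values are invented for the example):

```python
import sqlite3

# A structured source: every record must fit a schema known ahead of time.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL, currency TEXT)"
)
conn.execute("INSERT INTO transactions VALUES (1, 42.50, 'USD')")

# A row with an unexpected shape is rejected rather than stored.
try:
    conn.execute("INSERT INTO transactions VALUES (2, 10.0)")  # missing a column
except sqlite3.OperationalError as e:
    print("rejected:", e)
```

This is exactly the property big data sources often lack: there is no schema standing guard at write time.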


Unstructured Data: data that we have almost no control over, such as text, audio, and video. An image, for example, is composed of individual pixels, and there are no constraints on the content of those pixels, the dimensions of the image, or its size. Images come in all flavors.


Semi-Structured Data: most data comes in this format. It follows a logical format that can be understood, but it is not user friendly: it is usually intermixed with lots of noise, and reading it involves extra time and effort to specify a set of rules that determine how each piece of information is parsed. This type of data is often referred to as multi-structured data. Weblogs are a common example.
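Weblogs illustrate this well: each line follows a logical pattern, but extracting the fields requires writing an explicit parsing recipe. A sketch using a Common Log Format line (the log line itself is a made-up example):

```python
import re

# A hypothetical access-log line in Common Log Format.
line = '127.0.0.1 - - [11/Oct/2019:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

# The format has a logic to it, but we must spell out the rules to read each field.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<size>\d+)'
)

match = LOG_PATTERN.match(line)
if match:
    record = match.groupdict()
    print(record["path"], record["status"])  # /index.html 200
```

This is the defining trait of semi-structured data: the structure exists, but it lives in the parsing rules we write, not in the data store itself.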


Exploring Big Data

To explore big data, it is collected and introduced to the analytics team. A rule of thumb is that 70 to 80 percent of the time is spent on gathering the data, and the remaining 20 to 30 percent on the actual analysis. This is acceptable, since identifying the pieces of big data that contain most of the value requires time and effort. It should be noted that most of the data has no value, some has long-term strategic value, and some will be useful for immediate or tactical use. A key part of data exploration is identifying these individual pieces.

In most cases, raw data is kept for a period of time, enabling the extraction of additional information that was missed when it was first processed. Examples of this are the methods websites use to track user behavior, such as tag-based and log-based methodologies. In these methodologies, data isn’t thrown away up front; it is exploited and kept for as long as it is cost-effective to do so, depending on both the size of the data feed and how much storage is available.

Next, we will discuss how filtering big data takes place.
