
Data is growing at an exponential pace.
The 15 GB of free storage that you get from Google Drive (which never seems enough) is more than 400,000 times the total data storage capacity of the Apollo Guidance Computer that landed the first people on the Moon in 1969 [2]. The average smartphone today can hold several times more data than that.
In the 5 seconds that it took to load this page, about 40,000 new tweets were posted. In the 9 seconds you took to read the previous paragraph, around 10,000 new pictures were uploaded to Instagram, and in the past 20 or so seconds, since you started this adventure, approximately 1,437,298 gigabytes of new data were added to the internet [1].
The corporations of today generate and collect copious amounts of diverse data, from conventional systems such as Enterprise Resource Planning (ERP), Customer Relationship Management (CRM) and Accounting Information Systems (AIS) to newer sources such as social media, the Internet of Things (IoT) and Knowledge Management Systems (KMS). All of this data can be used to gain real-time insight into a corporation’s reputation, health and business trajectory.
As you can imagine, this kind of data, known as Big Data, can be staggeringly massive. The 4 Vs of Big Data, as they are often called, illustrate the scale and scope of the term: Volume, Velocity, Variety and Veracity [3].
With all this data, Data Analytics has emerged as a science in its own right. It requires practitioners who have advanced skills in both computer science and statistics. They can handle, organise and work with gargantuan quantities of data that would overwhelm an untrained person, and they can develop algorithms, with Artificial Intelligence and Machine Learning, to automate and speed up its flow, processing, storage, retrieval and visualisation.
These specialists are called Data Scientists. It quite naturally follows that data scientist roles have grown over 650% since 2012, and that hundreds of companies are hiring for those roles [4].
One of the most exciting (for us) emerging applications for Data Science is in the field of enterprise skill development. With the kind of Big Data that enterprises have today, they can benchmark their operations, processes and employee potential against those of the markets they compete in, and also streamline these with their plans for growth and/or diversification.
In this article we will explore ten common data science terms. For several of them, a short illustrative Python sketch follows the list:
- Warehousing: Data warehouses are specially designed relational databases, custom built for purposes such as data analytics and visualisation. A simple way to see the distinction between data warehouses and conventional databases is the pair of terms OLAP and OLTP. OLAP stands for On-Line Analytical Processing: reporting systems that sit above the layer of OLTP, which stands for On-Line Transaction Processing, systems designed for efficiently recording and organising transaction data. For example, an organisation’s LMS database may hold detailed information about the current year’s workforce learning resources, learner information and progress, whereas its data warehouse may be much larger and hold the data from the past 5 years of its operations. In this example, the streamlined database will likely be operational 24 x 7 x 365, while the data warehouse will be used only occasionally, for strategic planning purposes (see the warehousing sketch after this list).
- Aggregations: As the term implies, data aggregation is the process of collecting, sifting, summarising and structuring data for analysis. Enterprise Big Data can come from disparate sources and be stored in a myriad of structures and formats. Statistical analysis is one of the key tools in any effort to make sense of and draw insights from such large and diverse data, and aggregation makes Big Data amenable to statistical analysis. An example purpose of aggregation is to derive empirical data on historical customer service interactions for a Training Needs Analysis (see the aggregation sketch after this list).
- Clustering: Also called cluster analysis, this is a family of methods for grouping data objects such that objects in the same cluster or group are more similar to each other than they are to objects in other groups. There are many types of clustering algorithms that cluster the data in different ways: clustering can be based on density, sub-space, distribution or even connectivity, amongst other similarities in the data. This is often one of the first transformations applied to aggregated data and sets the stage for what can be inferred from it (see the clustering sketch after this list).
- Multidimensional Scaling: Multidimensional scaling is a visual representation of distances or dissimilarities between sets of data objects, and it can be used to reduce the complexity of highly dimensional data. The term scaling here comes from psychometrics, where abstract concepts (“objects”) are assigned numbers according to a rule. An example of this is the quantification of customer satisfaction on a scale from 1 to 5, where “1” is “Highly Satisfied” and “5” is “Very Dissatisfied” (see the scaling sketch after this list).
- Principal Components Analysis: Principal component analysis is a method of extracting the important variables (in the form of components) from a large set of variables in a data set. Simply put, this is another technique for taking highly complex data and extracting a simpler, easier to understand and visualise dimensionality from it. For example, a single ERP transaction could have tens of dimensions such as date, time, location, item id, quantity, price, seller and buyer. It would be very difficult (almost impossible) to visualise, predict or infer from all of these dimensions at the same time; PCA reduces them to a few components that retain most of the variation (see the PCA sketch after this list).
- Regression: Regression analysis employs statistical techniques to estimate the relationships between variables. Specifically, regression can be used to work out how a dependent variable (criterion) changes when one independent variable (predictor) is varied and the other independent variables are held fixed. An example of regression analysis would be measuring how the incidence of customer dissatisfaction changes as the frequency of customer service refresher training for front-line hospitality staff is varied (see the regression sketch after this list).
- Correlations: Correlation is a technique for investigating the relationship between two quantitative, continuous variables (variables that are measured numerically and can take an infinite number of possible values), for example, customer waiting time and sales conversion ratio. The analysis seeks to measure the strength of the association between the two variables (see the correlation sketch after this list).
- Probability: In any introductory article or discussion about Data Science, probability is a must mention. Concepts from probability theory form the backbone of much inferential and predictive analytics. Quite simply put, probability is the likelihood of getting a specific outcome in a series of similar experiments (see the simulation sketch after this list).
- Predictive Analytics: Predictive analytics is a term that covers a range of statistical methods for organising, structuring and understanding data, with a view to examining current and historical events and information in order to make predictions about future (or otherwise unknown) events.
- Pattern Recognition: When dealing with Big Data, it is often not possible (or practical) to sift, organise or structure the data manually, so Machine Learning is used to automate, standardise and speed up the process. Pattern recognition trains AI algorithms to recognise patterns and regularities in data using training or sample data sets. An example of pattern recognition is classification, which attempts to assign each input value to one of a given set of classes (for example, determining whether a given task performance was “accurate” or “inaccurate”; see the classification sketch after this list).
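To make these terms a little more concrete, the short Python sketches below illustrate several of them on tiny, made-up data sets; all table names, column choices and figures are hypothetical and chosen purely for illustration. First, warehousing: a minimal sketch of the OLTP/OLAP contrast using an in-memory SQLite database, where a handful of small transactional writes are followed by an aggregate, analytical query of the kind a warehouse exists to serve.

```python
import sqlite3

# In-memory database standing in for both systems; the table and columns
# are hypothetical, chosen only to illustrate the OLTP/OLAP contrast.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE course_completions (
    learner_id INTEGER, course_id TEXT, year INTEGER, score REAL)""")

# OLTP-style work: many small writes, one transaction at a time.
rows = [(1, "SAFETY-101", 2023, 82.0), (2, "SAFETY-101", 2023, 74.5),
        (1, "SERVICE-201", 2024, 91.0), (3, "SERVICE-201", 2024, 67.5)]
conn.executemany("INSERT INTO course_completions VALUES (?, ?, ?, ?)", rows)

# OLAP-style work: an aggregate, historical query for planning purposes.
query = """SELECT course_id, year, COUNT(*) AS learners, AVG(score) AS avg_score
           FROM course_completions
           GROUP BY course_id, year
           ORDER BY course_id, year"""
for row in conn.execute(query):
    print(row)
```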
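Aggregation: a minimal sketch, assuming a hypothetical customer-service interaction log, that collapses raw records into per-agent summaries of the kind a Training Needs Analysis might start from (using pandas).

```python
import pandas as pd

# Hypothetical customer-service interaction log pulled together from several sources.
interactions = pd.DataFrame({
    "agent":   ["Ana", "Ana", "Ben", "Ben", "Ben", "Chloe"],
    "channel": ["phone", "email", "phone", "chat", "phone", "chat"],
    "handling_minutes": [12.5, 30.0, 9.0, 4.5, 15.0, 6.0],
    "resolved": [True, False, True, True, False, True],
})

# Aggregate the raw records into per-agent summaries suitable for analysis.
summary = interactions.groupby("agent").agg(
    cases=("resolved", "size"),
    resolution_rate=("resolved", "mean"),
    avg_handling_minutes=("handling_minutes", "mean"),
)
print(summary)
```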
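Clustering: a sketch using k-means (one of many possible clustering algorithms) to group hypothetical learners by assessment score and study time.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: each row is a learner, columns are
# (average assessment score, hours of self-directed study per month).
learners = np.array([
    [55, 2], [60, 3], [58, 2],      # lower scores, little study time
    [85, 10], [90, 12], [88, 11],   # higher scores, more study time
    [70, 6], [72, 5],               # somewhere in between
])

# Group the learners into three clusters by similarity.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(learners)

print(labels)                   # cluster id assigned to each learner
print(kmeans.cluster_centers_)  # the "typical" learner in each cluster
```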
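Multidimensional scaling: a sketch that takes an assumed dissimilarity matrix for four training courses and places each course as a point in two dimensions, so that distances on a plot approximate the original dissimilarities.

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical symmetric dissimilarity matrix for four training courses,
# e.g. derived from how differently learners rate them (0 = identical).
dissimilarities = np.array([
    [0.0, 0.2, 0.7, 0.9],
    [0.2, 0.0, 0.6, 0.8],
    [0.7, 0.6, 0.0, 0.3],
    [0.9, 0.8, 0.3, 0.0],
])

# Place each course as a point in 2-D so that plotted distances
# approximate the original dissimilarities.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissimilarities)
print(coords)  # one (x, y) position per course, ready to scatter-plot
```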
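Principal components analysis: a sketch that projects hypothetical five-dimensional ERP-style transactions onto two principal components and reports how much of the variation those two components keep.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical ERP-style transactions with several numeric dimensions:
# (quantity, unit price, discount fraction, days to delivery, items per order).
transactions = np.array([
    [10, 4.5, 0.00, 2, 3],
    [12, 4.4, 0.10, 3, 4],
    [200, 1.2, 0.20, 14, 40],
    [180, 1.3, 0.20, 12, 35],
    [50, 2.5, 0.05, 7, 10],
    [55, 2.4, 0.05, 6, 12],
])

# Project the five original dimensions onto two principal components.
pca = PCA(n_components=2)
reduced = pca.fit_transform(transactions)

print(reduced)                        # each transaction as just two numbers
print(pca.explained_variance_ratio_)  # share of variation each component keeps
```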
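Regression: a sketch fitting a simple linear regression to made-up figures for refresher-training frequency (the predictor) versus customer complaints (the criterion), then using the fitted line to make a prediction.

```python
from scipy import stats

# Hypothetical data: refresher-training sessions per year for a team (x)
# versus customer complaints per 1,000 interactions (y).
training_sessions = [0, 1, 2, 3, 4, 5, 6]
complaints = [38, 35, 29, 27, 22, 20, 16]

# Fit a straight line: how do complaints change as training frequency varies?
result = stats.linregress(training_sessions, complaints)
print(f"slope: {result.slope:.2f} complaints per extra session")
print(f"intercept: {result.intercept:.2f}")
print(f"r-squared: {result.rvalue ** 2:.3f}")

# Predicted complaint rate if the team received 8 sessions a year.
print(result.intercept + result.slope * 8)
```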
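Correlation: a sketch computing Pearson's r for made-up waiting-time and conversion-rate figures, as a measure of the strength of the association between the two variables.

```python
from scipy import stats

# Hypothetical data: average customer waiting time in minutes (x)
# versus sales conversion ratio for the same branches (y).
waiting_minutes = [2.0, 3.5, 4.0, 5.5, 7.0, 8.5, 10.0]
conversion_rate = [0.32, 0.30, 0.27, 0.25, 0.21, 0.18, 0.15]

# Pearson's r measures the strength and direction of the linear association.
r, p_value = stats.pearsonr(waiting_minutes, conversion_rate)
print(f"correlation r = {r:.3f} (p = {p_value:.4f})")
# A value near -1 here would suggest longer waits go with fewer conversions.
```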
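Probability: a sketch estimating, by repeated simulated experiments, the probability of a specific outcome; the 30% escalation rate and the batch size are assumed figures used only for illustration.

```python
import random

# Suppose (hypothetically) that 30% of support tickets need escalation.
# Estimate, by simulation, the probability that a batch of 10 tickets
# contains at least 4 escalations.
random.seed(42)
p_escalation = 0.30
trials = 100_000

hits = 0
for _ in range(trials):
    escalations = sum(random.random() < p_escalation for _ in range(10))
    if escalations >= 4:
        hits += 1

print(f"Estimated probability: {hits / trials:.3f}")
```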
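Pattern recognition: a sketch of classification, training a small decision tree on hypothetical labelled task performances and then using it to label new, unseen ones.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labelled examples: (time taken in minutes, error count)
# for a task, with a label of 1 = "accurate", 0 = "inaccurate".
X = [[5, 0], [6, 1], [7, 0], [8, 1], [15, 4], [18, 5], [20, 6], [16, 4],
     [9, 1], [14, 3], [6, 0], [19, 5]]
y = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0]

# Train on part of the data, then check how well the learned patterns generalise.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print(model.predict([[7, 1], [17, 5]]))   # classify two new task performances
print(model.score(X_test, y_test))        # accuracy on the held-out examples
```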
About the Author:
Siddharth is the Creative Director for Playware Studios, a Singapore serious games developer. He develops games for military, healthcare, airline, corporate and government training and mainstream education. He has taught game design in various college programs at NTU, SIM, NUS and IAL in Singapore and is the author and proponent of the Case Method 2.0 GBL pedagogy.
Links
[1] Internet data live. Link.
[2] How the Apollo Guidance Computer compares against a modern smartphone. Link.
[3] The 4 Vs of Big Data as an info-graphic. Link.
[4] Fastest growing jobs today are in Data Science and Machine Learning. Link.