Data Mining, Machine Learning, and the Role of Data Scientists

Very Engineering Team

5 min read

Big data by itself is meaningless. Unmined, unprocessed and lacking context, it just sits there. Even today, many organizations are data rich but information poor. Like any raw material, it needs to be processed to be at all useful.

For data – especially big data – to be valuable, it must be actionable. As former HP CEO Carly Fiorina put it, “The goal is to turn data into information, and information into insight.”

Data mining is how you do that, and almost any type of data can be mined if you have the right tools.

Data Mining 101

Data mining focuses on extracting meaningful information – or, if you prefer, knowledge – from vast sets of data. One of the foundational data-mining books, Data Mining: Concepts and Techniques, calls data mining a “knowledge discovery process” that involves finding patterns in massive amounts of data – “big data.”

But the name can be misleading, according to the book’s authors:

Even the term data mining does not really present all the major components in the picture. To refer to the mining of gold from rocks or sand, we say gold mining instead of rock or sand mining. Analogously, data mining should have been more appropriately named “knowledge mining from data,” which is unfortunately somewhat long. However, the shorter term, knowledge mining, may not reflect the emphasis on mining from large amounts of data. Nevertheless, mining is a vivid term characterizing the process that finds a small set of precious nuggets from a great deal of raw material. Thus, such a misnomer carrying both “data” and “mining” became a popular choice.
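To make the “nuggets from raw material” idea concrete, here is a minimal sketch of one classic pattern-finding task: counting which items tend to show up together. The shopping baskets, the `frequent_pairs` helper, and the `min_support` threshold are all made up for illustration; real data-mining tools use far more scalable algorithms.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; the items are invented for this example.
baskets = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
]


def frequent_pairs(transactions, min_support):
    """Return item pairs that appear together in at least min_support transactions."""
    counts = Counter()
    for items in transactions:
        # Sort the items so each pair is always counted under the same key.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    # Keep only the pairs that clear the support threshold.
    return {pair: n for pair, n in counts.items() if n >= min_support}


pairs = frequent_pairs(baskets, min_support=2)
print(pairs)
```

Out of ten possible item pairs, only three clear the threshold here. That filtering step is the “mining”: most of the raw material is discarded, and the small set of recurring patterns is what gets kept.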

Data mining has been called both a field and a technique; in either case, it is truly interdisciplinary. It draws on various techniques, tools, and disciplines, including statistics, data cleaning, pattern recognition, database theory, database technology, artificial intelligence – and, importantly, machine learning.

Before we dig any deeper, it may be helpful to explain some of the terms we’re using.

Data mining vs. machine learning: Machine learning is one technique that can be used for data mining, but it’s not the only one. As we’ve discussed before, machine learning is one branch of artificial intelligence. It involves giving computers access to a trove of data and letting them learn for themselves. Machine learning applications automatically learn and improve without being explicitly programmed. In other words, they adjust their behavior based on the data they see rather than on rules someone wrote by hand.
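As a rough illustration of learning “without being explicitly programmed,” the sketch below fits a straight line to a handful of example points using ordinary least squares. The data points and the `fit_line` helper are hypothetical. Note that the rule behind the data (y = 2x + 1) is never written into the function; it is recovered from the examples.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how much x varies on its own.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations that happen to follow y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)

# Use the learned relationship on an input the model has never seen.
prediction = slope * 10 + intercept
print(slope, intercept, prediction)
```

Feed the same function different examples and it will arrive at a different line. That is the sense in which behavior comes from the data rather than from the code.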

Data mining vs. machine learning vs. deep learning: Just as machine learning is one approach to data mining, deep learning is one approach to machine learning. Deep-learning models are a kind of neural network with many layers, and they analyze data with a logic structure loosely similar to the way you and I think. What distinguishes neural networks from other types of machine learning is that they make use of an architecture inspired by the neurons in the human brain. “Deep learning” is pretty much just a short way of saying “machine learning using deep neural networks.”
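To show what an architecture built from “neurons” looks like, here is a toy sketch of a network with one hidden layer. It is only a forward pass: the weights are hand-picked for illustration rather than learned, and a real deep network would have many more layers and would tune its weights from data. The `step`, `neuron`, and `tiny_network` names are our own.

```python
def step(value):
    """Activation: fire (1) if the combined input is positive, otherwise stay silent (0)."""
    return 1 if value > 0 else 0


def neuron(inputs, weights, bias):
    """One artificial neuron: a weighted sum of the inputs plus a bias, passed through step()."""
    return step(sum(i * w for i, w in zip(inputs, weights)) + bias)


def tiny_network(x1, x2):
    """Two hidden neurons feeding one output neuron; together they compute XOR."""
    # Hidden layer (hand-picked weights): one neuron acts like OR, the other like AND.
    h_or = neuron([x1, x2], [1, 1], -0.5)
    h_and = neuron([x1, x2], [1, 1], -1.5)
    # Output layer: fires when OR is on but AND is off.
    return neuron([h_or, h_and], [1, -1], -0.5)


outputs = {(a, b): tiny_network(a, b) for a in (0, 1) for b in (0, 1)}
print(outputs)
```

No single neuron of this kind can compute XOR on its own; it takes the extra layer. Stacking layers so that later neurons build on what earlier ones detect is the basic idea that deep networks scale up.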

So to bring it full circle, data mining can use deep learning algorithms – along with other approaches – to extract meaningful information from data sets.

If you still aren’t completely clear, that’s OK. We’ve discovered that even people doing this work call it by different names. As long as you grasp the relationship between data mining and machine learning, you’re on the right path.

All of this fits under the umbrella of data science, and that may be the fuzziest term of all to define.

Data Science vs. Data Mining vs. Machine Learning

“A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.” — Josh Wills of Slack

Data science is a field of study that encompasses everything we’ve been talking about so far, including data mining, machine learning, deep learning, statistics, and much more. Data science focuses on the science of data, while data mining focuses on the process of discovering new patterns in big data sets. And as we’ve already established, deep learning is a type of machine learning.

But, perhaps more than any of the other terms we’ve discussed, “data science” has proven difficult to define, which may be why so many definitions are floating around. We especially like the one Cassie Kozyrkov, chief decision intelligence engineer at Google, came up with: “Data science is the discipline of making data useful,” she wrote in a Hacker Noon blog post. (Of course, she goes into much more detail, but that tweetable phrase captures the essence of her post.)

Over at CIO, Thor Olavsrud came up with a somewhat similar, albeit longer, definition:

Data science is a method for gleaning insights from structured and unstructured data using approaches ranging from statistical analysis to machine learning. For most organizations, data science is employed to transform data into a value that might come in the form of improved revenue, reduced costs, business agility, improved customer experience, the development of new products, and the like.

One thing is for sure: It’s hot. A 2012 Harvard Business Review article called data scientist “the sexiest job of the 21st century.” Then, in 2018, Glassdoor named it the best job in America — just as it did in 2016 and 2017. And, based on the work of Microsoft’s Jim Gray, data science has been referred to as the fourth paradigm of science. The other three are empirical observation, theoretical approaches, and computational science.

A Team Sport

We should add one more thing about data science: It’s a team sport. If you’ve worked with us before or follow our blog, you know we fully embrace a DevOps approach in everything we do, so that view comes naturally to us. But this isn’t just our opinion:

“The biggest value a data science team can have is when they are embedded with business teams. Almost by definition, a novelty-seeking person, someone who really innovates, is going to find value or leakage of value that is not what people otherwise expected,” Ted Dunning, chief application architect at MapR Technologies, told CIO. He recommends embedding data scientists in DevOps teams. So do we.

Much More to Learn

We’ve barely touched on the basics of these issues. If you want to learn more, the web is full of resources. And if you are looking for a deeper dive, consider a machine learning and data-mining book. Unsurprisingly, there are plenty to choose from. For instance, TopTalkedBooks provides a list based on recommendations from Hacker News, Reddit, and Stack Overflow. If money isn’t an object, Springer’s Encyclopedia of Machine Learning and Data Mining is available for $749.

And, of course, our data science team is always available to help. If you’re ready to mine your data sets for insights that can transform your company, consider working with us. We’re applying data science to software product development. It’s a new frontier in an industry where new frontiers are rare.