It’s all around you. Everywhere. Powering your camera. Tracking your smartphone. Helping you navigate your PC. But have you stopped to consider what it actually is? What is data science?
For quite some time now, we’ve all been deluged by data. Everything we do, from posting on social media to texting to saving a document, generates huge amounts of data. Even something as seemingly simple as a Google search!
Although data immersion is nothing new, you may have noticed that the phenomenon is accelerating. What used to be tiny streams of data has turned into barrages of structured, semi-structured, and unstructured data streaming from almost every activity in both the digital and physical worlds. Welcome to the world of big data!
You might question the purpose of this data collection. How is all of this data useful?
Just a decade ago, no one was in a position to make much use of the data generated. However, that has changed. We now have two sets of power users who are transforming data: data engineers, who constantly find innovative ways to capture, collate, and condense massive volumes of data, and data scientists, who analyze this data and derive valuable insights from it to suggest actions that could make a huge difference to an organization.
Data science produces insights that are valuable to people working in every industry. It helps you understand and improve your business, investments, planning, and even your personal health, lifestyle, and social life. Now data scientists are plying their trade in the sexiest job of the 21st century.
What Is Data Science?
So, back to that question: what is data science? The term gets thrown around a lot, but it’s rarely decoded. Basically, data science is the process of extracting value from data—and it usually requires an understanding of scientific methods and processes. As with other forms of experiments, data science requires you to make observations, ask questions, form hypotheses, create tests, analyze results, and come up with practical recommendations.
DJ Patil, former chief data scientist of the United States, first defined data science as “the ability to extract knowledge and insights from large and complex data sets.”
Hal Varian, chief economist at Google and UC Berkeley professor of information sciences, business, and economics, talked about data science in terms of being able “to understand [data], to process it, to extract value from it, to visualize it, to communicate it.”
Investopedia breaks down data science as: “a field of big data geared toward providing meaningful information based on large amounts of complex data. Data science, or data-driven science, combines different fields of work in statistics and computation in order to interpret data for the purpose of decision making.”
There are many subfields of data science, like data mining, statistics, machine learning, analytics, and programming, which appeal to different types of audiences, but all work together to allow you and your organization to make more informed decisions through fact-based evidence. We can all hypothesize that a certain cohort of users is likely to renew their subscription to a video streaming service or that hamburgers will outperform hot dogs on the menu, but data science allows us to get far more granular and accurate in our analyses.
Data Science’s Origins
Data has always existed, but for a long time humans struggled to capture it through writing and oral history. The first computer was invented in 1936, and with the creation of the modern internet in 1990, it became possible to collect data on a massive scale and use statistical and mathematical models to interpret it.
Soon, companies and organizations began realizing that they could use data to solve important problems. Many people played a role in popularizing the term data scientist, but it is most commonly credited to Patil and Jeff Hammerbacher, the Cloudera co-founder who led Facebook’s data team. They were among the first to call themselves “data scientists.”
With new and faster computer and mobile technology came the production of huge data sets (big data), which were challenging to manage, but provided many useful sources of information on users, customers, and transactions. To process big data, companies started inventing cloud storage capabilities and analytical tools.
In 2010, Mike Loukides wrote, “What is data science? The future belongs to the companies and people that turn data into products.” Since then, more companies and universities have started corporate departments and academic programs around the study of data.
Data science jobs grew 15,000 percent between 2011 and 2012, as companies saw how data science could increase revenue, cut costs, increase marketing effectiveness, create impact metrics, and expedite go-to-market strategies. Today, 2.5 quintillion bytes of data are created each day. IBM noted in 2013 that 90 percent of all data in the world was produced in the previous two years.
Data science is leading to new fields of prediction and data visualization through artificial intelligence, machine learning, and deep learning. Data science is about more than buzzwords: it really is changing the way we work and opening up new frontiers.
What Is a Data Scientist?
Within data science, there are many possible roles depending on your strengths, interests, and level of experience. For example, data analysts, as the name suggests, extract and analyze data sets, find actionable insights to research questions, and turn data into reports, goals, and dashboards.
A data scientist is usually someone with more programming experience than a data analyst, who can not only pull data but also develop models and algorithms to solve problems, test products, and lead the company in new directions through advanced data processing. Many people move from data analyst to data scientist positions. (Check out this resource to explore a day in the life of a data scientist.)
A business analyst is usually a business student who has experience with software like SAP, SQL, and Tableau, and can use data and quantitative analyses to make more informed, data-driven business decisions. Business analysts can identify process improvements and behavioral trends that change business and profitability outcomes. Business intelligence is the larger field that, according to Gartner, “includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimize decisions and performance.”
A data engineer comes from an engineering background with programming experience in languages like Java, Python, and Scala. Their job is to design and prepare the data infrastructure to collect and analyze data within an organization.
Machine learning is a branch of artificial intelligence where algorithms use data inputs to autonomously predict future outcomes. A machine learning engineer uses machine learning to create robust and scalable models for data science. These engineers can also program computers and robots that can execute commands by “learning” from patterns in data.
Data analyst, data scientist, business analyst, and data engineer are just a few of the job descriptions available within data science, but who knows what new options may emerge in the next 10 to 15 years? People starting a career in data science can easily migrate from one type of data science role to another as they continue to build new skills.
Today, you don’t need to get into a full-time course and channel years toward getting a degree in statistics, computer science, or data science, but you do need to update your skill set. The internet has made learning far easier than it was before—there are resources and online courses everywhere. By gaining data science skills, you’ll understand how to ask the right questions of your data such that you can make smart business decisions and hit business goals.
Who Can Use Data Science?
The terms data science and data engineering are often used interchangeably. The two fields, though interdependent, are distinct domains of expertise.
Data science involves deriving actionable and invaluable insights from raw data sets using computational methods. On the other hand, data engineering revolves around handling data-processing hold-ups and data-handling problems for applications that use huge volumes, varieties, and velocities of data.
Both disciplines do have some points in common—they involve working with three varieties of data:
Structured data: Data that’s stored, processed, and manipulated in a traditional relational database management system and that is explicitly labeled in some way that is human-readable.
Unstructured data: Data that’s commonly generated from human activities and that doesn’t fit into a structured database format.
Semi-structured data: Data that doesn’t fit into a structured database system, but is nonetheless structured by tags that are useful for creating a form of order and hierarchy in the data.
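To make the three varieties concrete, here is a minimal Python sketch using only the standard library. The table name, JSON keys, and sample sentence are made up for illustration; they are not taken from any real system.

```python
import json
import sqlite3

# Structured: rows with a fixed, labeled schema in a relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, plan TEXT)")
conn.execute("INSERT INTO customers (name, plan) VALUES (?, ?)", ("Ada", "premium"))
structured_row = conn.execute("SELECT id, name, plan FROM customers").fetchone()

# Semi-structured: no fixed table schema, but keys (tags) give order and hierarchy.
semi_structured = json.loads(
    '{"user": {"name": "Ada", "devices": ["phone", "laptop"]}}'
)

# Unstructured: free text with no labels; any structure has to be inferred.
unstructured = "Ada called support on Monday because her phone kept dropping calls."

print(structured_row)                          # a row addressed by column
print(semi_structured["user"]["devices"])      # a value addressed by nested tags
print(len(unstructured.split()), "words")      # only crude measures without parsing
```

The same customer shows up in all three forms; what changes is how much labeling comes with the data and how much work is needed before it can be analyzed.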
It is a common misconception that only big organizations that have massive funding implement data science methodologies to improve and optimize their business. That is far from true.
The following are just a few titles under which experts of every discipline are using data science: ad-tech data scientist, banking digital analyst, clinical data scientist, geo-engineer data scientist, geospatial analytics data scientist, retail personalization data scientist, and clinical informatics analyst in pharmacometrics.
Businesses can employ tools like Google Analytics (web analytics), MixPanel (heat maps), UserTesting (user feedback), KissMetrics (customer segmentation), Flurry (mobile app management), and Optimizely (product development), depending on their needs and customers.
What Skills Do Data Scientists Need?
Data science is very interdisciplinary and you must understand how the different puzzle pieces fit together. To practice data science, you need mathematics, statistics, and programming knowledge, and ideally an area of subject-matter expertise. Without domain expertise, you might be able to call yourself a mathematician, a statistician, or a programmer, but not a data scientist.
Collecting, querying, and consuming data
Data engineers capture and collate huge volumes of structured, unstructured, and semi-structured data that exceeds the processing capacity of conventional database systems. Again, data engineering tasks are separate from the work that’s performed in data science, which focuses more on analysis, prediction, and visualization. Despite this distinction, when a data scientist collects, queries, and consumes data during the analysis process, he or she performs work that’s very similar to that of a data engineer.
Although valuable insights can be generated from a single data source, oftentimes the combination of several relevant sources delivers better contextual information. A data scientist can work off of several data sets that are stored in one database, or even in several different data warehouses. Other times, source data is stored and processed on a cloud-based platform built by software and data engineers.
No matter how the data is combined or where it’s stored, if you’re doing data science, you almost always have to query the data and work with it by a process known as data mining. Most of the time, you use SQL to query data.
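As a rough illustration of querying with SQL, the sketch below uses Python's built-in sqlite3 module and an in-memory database. The customers and orders tables, their columns, and the top_customers helper are hypothetical, chosen only to show a query that combines two data sets.

```python
import sqlite3

def top_customers(conn, min_total):
    """Return (name, total_spent) for customers whose orders sum to at least min_total."""
    query = """
        SELECT c.name, SUM(o.amount) AS total_spent
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.id, c.name
        HAVING SUM(o.amount) >= ?
        ORDER BY total_spent DESC
    """
    # The threshold is passed as a bound parameter rather than formatted into the SQL.
    return conn.execute(query, (min_total,)).fetchall()

# Hypothetical sample data, created in memory so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers (id, name) VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Alan');
    INSERT INTO orders (customer_id, amount)
        VALUES (1, 40.0), (1, 25.0), (2, 10.0), (3, 70.0);
""")

print(top_customers(conn, 50))  # [('Alan', 70.0), ('Ada', 65.0)]
```

A production database would differ in scale and dialect, but the pattern of joining, aggregating, and filtering is typical of the queries a data scientist writes.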
The data that you access from various sources doesn’t come in an easily packaged form, ready for analysis—quite the contrary. The raw data not only may vary substantially in format, but you may also need to transform it to make all the data sources cohesive and amenable to analysis. Transformation may require changing data types and the order in which data appears, and even creating new data entries based on the information provided by existing entries.
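The kinds of transformation described above (changing data types, changing field order, and deriving new entries from existing ones) might look like the following sketch. The field names and the transform function are assumptions made for this example.

```python
from datetime import date

def transform(records):
    """Normalize raw records: fix types, reorder fields, and derive a new field."""
    cleaned = []
    for rec in records:
        signup = date.fromisoformat(rec["signup"])   # string -> date
        spend = float(rec["spend"])                  # string -> float
        cleaned.append({
            # Fields are emitted in one consistent order regardless of input order.
            "customer": rec["name"].strip().title(),
            "signup": signup,
            "spend": spend,
            "signup_year": signup.year,              # new entry derived from an existing one
        })
    return cleaned

# Hypothetical raw input: inconsistent casing, stray whitespace, numbers stored as text.
raw = [
    {"spend": "19.99", "name": "  ada lovelace ", "signup": "2021-03-04"},
    {"spend": "5", "name": "GRACE HOPPER", "signup": "2019-11-30"},
]

for row in transform(raw):
    print(row)
```

Real pipelines usually also need to handle missing or malformed values, which this sketch leaves out for brevity.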
You may need to use big data solutions like Hadoop to deal with massive amounts of data that a single computer can’t handle. Thankfully, many of these tools are free and open source, and they are specifically designed so you can do advanced analytics with big data.
Data science relies heavily on a practitioner’s math and statistics skills, which are needed to carry out predictive forecasting, decision modeling, and hypothesis testing.
Though most of the concepts and formulas used in statistics are derived from the vast knowledge base of mathematics, statistics is treated as a separate and independent branch of math that has many applications.
Mathematics uses deterministic numerical methods and deductive reasoning to form a quantitative description of the world, while statistics is a form of science that’s derived from mathematics, but that focuses on using a stochastic approach—an approach based on probabilities—and inductive reasoning.
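One way to see the difference between the two approaches is to answer the same question both ways. The sketch below computes the average roll of a fair six-sided die exactly, then estimates it by simulation. The die example and both function names are illustrative assumptions, not part of the discussion above.

```python
import random
from fractions import Fraction

def exact_mean_die():
    """Deterministic: derive the expected value of a fair die by direct calculation."""
    return Fraction(sum(range(1, 7)), 6)

def simulated_mean_die(rolls, seed=0):
    """Stochastic: estimate the same quantity from randomly generated rolls."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    total = sum(rng.randint(1, 6) for _ in range(rolls))
    return total / rolls

print(exact_mean_die())            # 7/2
print(simulated_mean_die(10_000))  # close to 3.5, but not exactly 3.5
```

The deductive answer is exact and always the same; the simulated answer carries sampling error that shrinks as the number of rolls grows.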
Data scientists use mathematical methods to build decision models, to generate approximations, and to make predictions about the future.
In data science, statistical methods are useful for getting a better understanding of your data’s significance, for validating hypotheses, for simulating scenarios, and for making predictive forecasts of future events. Advanced statistical skills are somewhat rare, even among quantitative analysts, engineers, and scientists. If you want to go places in data science, though, take some time to get up to speed in a few basic statistical methods, like linear regression, Bayes Theorem and probability, inferential statistics, ordinary least squares regression, Monte Carlo simulations, and time-series analysis.
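As a small taste of one of these methods, here is a sketch of ordinary least squares regression with a single predictor, written in plain Python so that the arithmetic is visible. The fit_line function and the sample numbers are made up for illustration; in practice most people would reach for a statistics library.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data that lies exactly on the line y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0
```

With real, noisy data the fitted line will not pass through every point, and judging how much to trust it is where the inferential statistics mentioned above come in.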
The good news is that you don’t have to know everything—it’s not like you need to go out and get a master’s degree in statistics to do data science. You need to know just a few fundamental concepts and approaches from statistics to solve problems.
A data scientist may need to know several programming languages in order to achieve specific goals. For example, you may need SQL knowledge to extract data from relational databases. Programming languages such as Python and R are important for writing scripts for data manipulation, analysis, and visualization.
The immense data sets that data scientists rely on often require multiple levels of redundant processing to transform into useful processed data. Manually performing these tasks is time-consuming and error prone, so programming presents the best method for achieving the goal of a coherent, usable data source.
Given the range of tasks that most data scientists handle, it may not be possible to rely on just one programming language.
You may have to choose other languages to fill out your toolkit. The languages you choose depend on a number of criteria. Here are the things you should consider:
How you intend to use data science in your code (you have a number of tasks to consider, such as data analysis, classification, and regression)
Although coding is a requirement for data science, it really doesn’t have to be this big scary thing people make it out to be. Your coding can be as fancy and complex as you want it to be, but you can also take a rather simple approach. Most people can learn enough coding to practice high-level data science with tutorials from sites such as Codecademy.
Data scientists are required to have strong subject-matter expertise in the area in which they’re working. Data scientists generate deep insights and then use their domain-specific expertise to understand exactly what those insights mean.
Assume you just landed a data science job with MegaTelCo, a large U.S. telecommunications firm. They are having a major problem with customer retention in their wireless business. In the mid-Atlantic region, 20 percent of cell phone customers leave when their contracts expire, and it is getting increasingly difficult to acquire new customers. Since the cell phone market is now saturated, previously huge growth in the wireless market has tapered off.
Communications companies are now engaged in battles to attract each other’s customers while retaining their own. Customers switching from one company to another is called churn, and it is expensive all around; one company must spend on incentives to attract a customer while another company loses revenue when the customer departs.
You have been called in to help understand the problem and to devise a solution. Attracting new customers is much more expensive than retaining existing ones, so a good deal of marketing budget is allocated to prevent churn. Marketing has already designed a special retention offer. Your task is to devise a precise, step-by-step plan for how the data science team should use MegaTelCo’s vast data resources to decide which customers should be offered the special retention deal prior to the expiration of their contracts.
Think carefully about what data you might use and how they would be used. Specifically, how should MegaTelCo choose a set of customers to receive their offer in order to best reduce churn for a particular incentive budget? Answering this question is much more complicated than it may seem initially.
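To give a feel for why the question is harder than it looks, here is a deliberately simplified Python sketch of one possible approach: rank customers by how much the offer is expected to reduce their chance of leaving, then spend the budget from the top of the list. Everything in it is an assumption made for illustration, including the Customer fields, the fixed cost per offer, and the idea that a model has already estimated both numbers for each customer. It is not MegaTelCo's method or a complete solution.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Customer:
    customer_id: str
    churn_probability: float  # hypothetical model output, between 0 and 1
    offer_effect: float       # assumed drop in churn probability if offered the deal

def choose_offer_recipients(customers, budget, cost_per_offer):
    """Pick the customers whose expected churn reduction is largest, within budget."""
    if cost_per_offer <= 0:
        raise ValueError("cost_per_offer must be positive")
    max_offers = int(budget // cost_per_offer)

    def expected_saves(c):
        # The offer cannot reduce churn by more than the churn risk itself.
        return min(c.offer_effect, c.churn_probability)

    ranked = sorted(customers, key=expected_saves, reverse=True)
    return [c.customer_id for c in ranked[:max_offers] if expected_saves(c) > 0]

# Hypothetical customers: "A" is the most likely to leave but barely responds to the offer.
customers = [
    Customer("A", churn_probability=0.90, offer_effect=0.05),
    Customer("B", churn_probability=0.40, offer_effect=0.30),
    Customer("C", churn_probability=0.10, offer_effect=0.02),
    Customer("D", churn_probability=0.60, offer_effect=0.25),
]

print(choose_offer_recipients(customers, budget=200, cost_per_offer=100))  # ['B', 'D']
```

Note that the customer most likely to churn is not selected, because the offer is assumed to do little for them. Estimating those two numbers reliably, and accounting for differences in customer value, is where most of the real work lies.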
This means that when it comes to solving a data science problem, in many ways it is far more important to have industry knowledge than a Ph.D. in statistics. In this case, you need to have a good understanding of the telecom industry to solve this data science problem.
Communicating data insights
Communication is one of the most underrated skills a data scientist can have. While some of your colleagues can get away with being siloed in their technical bubbles, data scientists must be able to communicate with other teams and effectively translate their work for maximum impact.
Most of the time, people need to see something for themselves in order to understand. Data scientists must be creative and pragmatic in their means and methods of communication. The field of data visualization turns raw data into graphs, infographics, dashboards, and art, so that it is easier to notice trends and highlight correlations between variables.
The Value of Data Science
Data science is hugely valuable across many career fields, so even if you are not performing data science directly, you should have a grasp of potential use cases. Here’s a preview of how data science is revolutionizing industries in different ways:
Travel: Route optimization, dynamic pricing, personalized travel recommendations, bots, customer route segmentation
Telecom: Customer churn optimization, network management, analytics on subscriber traffic, lifetime value prediction
Retail: Market basket analysis, inventory management, store and product distribution, fraud detection
Financial Services: Robot advisory services, risk prediction, credit card evaluation, compliance monitoring
Healthcare: Wearable devices, disease detection, drug discovery, operating room efficiency, precision medicine
Food and Beverages: Quality and timeliness of delivery, customer service and satisfaction monitoring, product placement, advertising
Here are a few examples of what this looks like in practice. Dr. Pepper Snapple Group used a new data-driven platform called MyDPS to show sales staff which stores to visit and when to offer specials. Sales effectiveness increased by 50 percent as a result of these sales funnel recommendations. Cargill’s animal nutrition unit created an app for tracking shrimp farm data. This app uses environmental data to predict biomass in the ponds and provides recommendations on how to improve the health and feeding of harvested shrimp.
Data science can not only improve business performance, but it can also save lives. For example, the University of Chicago’s Data Science for Social Good and the North Shore University Health System modeled electronic medical records (EMRs), finding that childhood obesity by age five is a predictor of health issues later in life. Based on this information, it is possible to identify at-risk population segments and change diet and exercise before complications become life-threatening.
The United Nations Refugee Agency is also starting to collect socioeconomic and health data on the 22.5 million refugees who have escaped persecution from politically unstable countries, making this data publicly available to aid agencies. Although many people are critical of data science for privacy concerns related to identity theft, user profiling, information leaks, and cybersecurity threats, when used wisely, it can be a huge force for good.
The Future of Data Science
The future belongs to the companies that figure out how to collect and decipher data successfully. Google, Amazon, Facebook, Netflix, and LinkedIn have all tapped their data networks and made that the core of their success. They were the vanguard, but now even small businesses are following their path. Whether it’s mining social media data, recommending products based on a user’s purchase history, or studying the URLs that people pass to others, the next generation of successful businesses will be built around data.
Future trends in data science include self-service analytics, crowdsourced data (think of Waze, which uses drivers to identify traffic problems), open-source tools, data security regulations, and more organizational structure around data science career growth (approximately 90 percent of large organizations are predicted to have a chief data officer by the end of 2019).
Although data science will be increasingly automated in the future, the need for human judgment to interpret the data will not go away any time soon, and the 150,000+ job vacancies in data science indicate that these skills will pay huge dividends in our increasingly data-driven world.