Data science isn’t about special people in special places. It’s about teams.
We’ve all witnessed the wave of innovations that has washed over business models of late. These innovations didn’t surface as the ideas of lone individuals. The architecture of businesses, business interactions, data collections, and the use of information is so complex that no single person in a mid- or large-size company could understand all the elements required to turn an idea into a practical reality.
Research has also long shown that heterogeneity enhances group brainstorming: more diverse groups produce better ideas. This is especially important when we’re designing data-science teams.
A part of the whole
You’ve probably been told you need to hire one of two individuals. The first is an astute data developer with a grounded understanding of Python, SQL, and data storage (PostgreSQL); Unix and Linux command-line knowledge (mainly to run and schedule cron jobs); Python data libraries (pandas, Scrapy, Keras, Matplotlib, TensorFlow, Bokeh, scikit-learn, etc.); Flask, Bottle, or Django to host the analysis of the database as a RESTful API on an AWS or Azure hosting framework; and, of course, AngularJS to present results and D3.js to create data visualizations.
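The core of that developer profile is exposing analysis results over HTTP. A minimal sketch of the idea, using Python's standard-library `http.server` in place of Flask so it stays dependency-free (the `DAILY_SALES` data and `/api/summary` route are hypothetical stand-ins for a real database query):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory "analysis result"; in practice this would come
# from a PostgreSQL query or a pandas pipeline.
DAILY_SALES = [120, 135, 128, 150, 142]

def summarize(values):
    """The kind of aggregate a data developer might expose to a front end."""
    return {
        "count": len(values),
        "total": sum(values),
        "mean": sum(values) / len(values),
    }

class AnalysisHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/api/summary":
            body = json.dumps(summarize(DAILY_SALES)).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To serve (blocks forever):
# HTTPServer(("127.0.0.1", 8000), AnalysisHandler).serve_forever()
```

A front end such as AngularJS would then fetch `/api/summary` and hand the JSON to a charting library for display.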
If, for some reason, you botch the hiring of the astute data developer, you only have one other alternative—to hire a data academic. This is a theorist who pontificates about changing the world with data but whose experience rarely ventures outside the educational setting and has few practical applications. The data academic understands core statistics, categorical data analysis, applying statistics with R (multiple linear regressions, qualitative predictors, linear discriminant analysis, resampling methods like k-fold cross-validation, hyperplanes, hierarchical clustering), sequential data models (Markov models, hidden Markov models, linear dynamical systems), Bayesian model averaging, and machine-learning probabilistic theory. You hope some of this learning is connected to causality.
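One concrete technique from the data academic's toolkit, k-fold cross-validation wrapped around a least-squares fit, can be sketched in pure Python (the fold count and deterministic seed are illustrative choices, and the regression is the simple one-variable case):

```python
import random

def fit_simple_ols(xs, ys):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - b * mx, b  # (intercept a, slope b)

def k_fold_mse(xs, ys, k=5, seed=0):
    """Average held-out mean-squared error across k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)       # deterministic shuffle for the demo
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        a, b = fit_simple_ols([xs[i] for i in train], [ys[i] for i in train])
        errors.append(sum((ys[i] - (a + b * xs[i])) ** 2 for i in fold) / len(fold))
    return sum(errors) / k
```

Because each fold is scored on data the model never saw, the averaged error estimates how the fit will generalize, which is exactly the bridge from classroom statistics to practical application.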
Are these two roles important for a data-science team? Of course. If you, by chance, hire both these roles, do you have a data-science team? No, you do not.
Let’s begin with the origins of data science and, from there, we’ll lead into the critical capabilities required to build a world-class data-science team.
From there to here
The foundation of data science originated with five key areas:
- Computer science: the study of computation and information
- Data technology: data generated by humans and machines
- Visualizations: graphical representation of information and data
- Statistics: methodologies to gather, review, analyze, and draw conclusions from data
- Mathematics: the science of the logic of shape, quantity, and arrangement
Computer science evolved from Turing machines to cybernetics and information theory by the mid-1900s. Tree-based methods and graph algorithms surfaced in the 1960s. By the 1970s, computer programming and text- and string-search algorithms popped up. Data mining, data classification, and similar methods pushed us into the early 2000s.
Data technology began with binary logic in the 1700s, followed by Boolean algebra and punch cards in the 1800s. IBM introduced its first computers in the 1940s. Removable disks arrived in the 1960s, and relational DBMS followed in the 1970s. By the mid-1970s, desktops, SQL, and object-oriented programming were the norm. Around 2001, algorithmic modeling began to emerge alongside the classical stochastic data model, treating the data-generating mechanism as unknown.
Visualizations arose prior to the 1800s with cartography and astronomical charts. Line and bar charts came out in the 1800s, and statistical graphics followed by the mid-1800s. The box plot was created in the 1970s, and word or tag clouds started to form in 1992.
Statistics entered the 1800s with theories of correlation, probability, and Bayes’ theorem. The early 1900s brought regression, time series, and least squares, along with the foundations of modern statistics: hypothesis testing and the design of experiments. By the mid-1960s, we had Bayesian methods, stochastic methods, and more complex time-series techniques such as survival analysis and the grouping of time-series data. Through the 1980s, developments in Markov chain simulation and computational statistics deepened the interface between statistics and computer science. By the late 1990s, decision science, pattern recognition, and machine learning were taking shape.
Mathematics entered the 1800s with calculus and logarithms, and optimization methods such as Newton–Raphson followed. By the 1930s, the military had started to adopt operations-research theories for manufacturing and communications. The 1960s were booming with networks, automation, scheduling, and assignment problems, which have only matured in recent years.
Understanding the origins of data science helps demystify it and allows you to develop a concrete capability in your company.
Data-science capabilities
Finding success with data science comes down to four factors: people, data, tools, and security.
The most important elements of your data-science team are the people and the capabilities they enable. Next, to get insights—even with the best people—we ultimately need data and access to data. Usually, data is siloed across teams, departments, and systems, making gaining access difficult. Assuming we have the people and access to the data, next, we need tools. Performing analytics necessitates computational and data-storage resources. Fortunately, today we have many open-source options that are more than adequate. Lastly, data security and privacy protection are crucial as data becomes more centralized. With this convenience comes access—which, in the wrong hands, creates risk.
With this understanding of the origins of data science, it’s fascinating to see the mix of conventional capabilities aligned with the less traditional data-science skills that are required for success. Let’s cover examples of data-science capabilities and complementary data-science team skills that are found within world-class data-science teams.
Data-science capabilities
- Statistics: R, SPSS, Excel, Minitab
- Mathematics: MATLAB, Theano, Octave, NumPy
- Pattern recognition: Scala, Google Cloud, Twilio, Narrative Science, Cogito, PolyAI, Smartling
- Data mining: SAS, Talend, Teradata, Board, Dundas, Orange, KNIME, Alteryx
- Machine learning: BigML, NET, XGBoost, DL4J, H2O
- Artificial intelligence: CNTK, Numenta, OSARO, Affectiva, Cognitive Scale
- Neural networks: Keras, Caffe2, Chainer, PyTorch
- Data visualizations: D3.js, Tableau, ggplot2, Matplotlib, Bokeh, Spotfire
- Hypothesis testing: Datameer, Incorta, Switchboard, Starburst, Pentaho
- Data modeling: QlikView, RapidMiner, Vertabelo, Lucidchart, Erwin, HeidiSQL
- Big-data solution engineering: Jupyter, Apache Spark, R Shiny, Databricks
- Exploratory data analytics: pandas, Hadoop, Hive, Pig, Flume
- Modeling and prescriptive analytics: SigOpt, TensorFlow
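Many of these capabilities rest on simple computations. Exploratory data analysis, for instance, often starts with a five-number summary (the numbers behind a box plot). A minimal, dependency-free sketch using Python's standard library (the eight-point sample in the usage is illustrative):

```python
import statistics

def five_number_summary(values):
    """Min, lower quartile, median, upper quartile, and max for a sample."""
    ordered = sorted(values)
    # statistics.quantiles with n=4 returns the three quartile cut points
    # (the default "exclusive" method; it needs at least two data points).
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
    }
```

For the sample `[1, 2, 3, 4, 5, 6, 7, 8]` this yields a median of 4.5 with quartiles at 2.25 and 6.75, the same quantities a tool like Tableau or Matplotlib would draw as a box plot.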
Data-science team skills
- Stakeholder management: business-relationships management, project management
- Storytelling ability: executive presence, presentation skills
- Business communications: clear and timely communication, governance
- Consulting: need analysis, solutions aligned to goals
- Problem-solving: Lean Six Sigma, Agile
- Topical analytics techniques: statistics, root-cause analysis, statistical-process control, value-stream mapping, flows
- Domain expertise: knowledge of the data, who’s using it and for what purpose
- Business analysis: experience evaluating and modeling business cases
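Of the topical techniques listed above, statistical-process control is straightforward to make concrete: a Shewhart-style individuals chart flags observations outside the mean plus or minus three standard deviations. A minimal sketch (the data and the three-sigma threshold are illustrative assumptions):

```python
import statistics

def control_limits(values, sigmas=3):
    """Lower limit, center line, and upper limit for an individuals chart."""
    center = statistics.mean(values)
    spread = statistics.pstdev(values)  # population std dev of the baseline
    return center - sigmas * spread, center, center + sigmas * spread

def out_of_control(values, sigmas=3):
    """Indices of observations falling outside the control limits."""
    lo, _, hi = control_limits(values, sigmas)
    return [i for i, v in enumerate(values) if v < lo or v > hi]
```

In practice, individuals charts usually estimate dispersion from the average moving range rather than the raw standard deviation; the `pstdev` version simply keeps the sketch short.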
The ultimate success of a data-science team depends on how well expectations are managed. When expectations are met, the team is viewed as impactful. Conversely, a weak perception of delivery is a significant reason data-science teams eventually get disbanded: they focus on what’s cool rather than what’s most impactful for the business.
The hidden art of storytelling
It’s idealistic to believe data-science teams can find value in data from day one, but, eventually, they’ll connect data to new insights. Often, however, that data is layered across hundreds or thousands of sources, and the team might be months or years away from collecting it all. Most data-science teams therefore begin with a small set of questions that are challenging yet tractable. This approach also limits the data that must be integrated for an initial proof of concept. Sample questions might include:
- Which applications in our portfolio have the most significant security risk?
- Why is the Durham, NC location the most profitable?
- What type of patient visit will be the costliest next quarter?
- Is antibody A or antibody B more likely to achieve FDA approval?
- Which drone should we bring in first for preventive maintenance?
Building a world-class data-science capability isn’t about individuals; it’s about assembling your team. It’s crucial to ensure that essential data-science capabilities and data-science skills are part of your team design. To tap into the power of data science, we require teams to not only extract insights from data but also tell a compelling story. Quite often, we’re left with a lot of data, confusing insights, and no story. Make sure that the team you build can tell a story.