[Opinion] Our Exciting Journey as Data Scientists Onwards to Higher Levels of Abstraction






Over the last few years, I’ve been consistently excited by the ever-increasing pace of new developments in the cloud computing space. By leveraging these tools, engineers and data scientists can now tackle bigger problems by outsourcing large parts of the work to cloud vendors. As a data scientist, I’ve been particularly focused on the advancements in Machine Learning as a Service (MLaaS), which aspires to abstract away many of the common challenges of robustly applying statistical modeling. I believe that increasing levels of success in this objective will herald the coming of a truly transformative period in data science.

In some ways it’s worrisome that several of the unique skills that I’ve developed over the last 15 years are being automated away and even commodified. On the other hand, it’s exhilarating to know that as we data scientists master these tools, we’ll be able to solve even larger problems by outsourcing much of our work to cloud vendors. In this post, I’d like to share some of my thoughts on how we can all embrace this opportunity and become next-generation, elite data scientists.

A history lesson on compilers

First, a brief history lesson to demonstrate that there is ample precedent of such paradigm shifts in tech. I imagine that few readers here have any experience writing assembly code, which consists of low-level instructions for the CPU. I myself have very limited experience here too because I’ve always been able to solve such problems using a higher-level language and a compiler to generate optimized machine code.

But it wasn’t always this way. A generation of software engineers had to be trained on assembly code and then make a judgment call about learning new languages such as C as the early compilers came on the scene. The early “higher-level” languages and compilers were quite simple and limited in their capabilities. It was totally reasonable for engineers to forgo using an early language and compiler unless they had a special problem that was well suited to the early tools. Instead, they could still leverage their expertise in assembly coding to get better results than the nascent compilers.

Following extensive research and advancements in compiler technology, the tools became harder and harder to ignore. Engineers who mastered these new tools and applied them well could simply solve problems more quickly. Further, they were able to tackle bigger problems than before. For example, the rewrite of the Unix kernel in C, which later made the operating system portable across hardware, depended on a robust C compiler.

I’ve heard tales about a small number of engineers that eschewed these tools. They took pride in the unique skills they had developed over the years and couldn’t believe that a compiler — a simple computer program — could develop better assembly code than they could. Some even looked down on the programmers that adopted these new “easier” tools. After all, how could one call themselves a proficient programmer if they couldn’t manage their own stack and registers? To these assembly-zealots, a high-level language like C looked like a crutch for weak engineers.

But over time, strong engineers found themselves embracing higher-level languages and compilers because the tools empowered them to better develop software by operating at a higher level of abstraction. Engineers who truly loved the art of assembly code joined compiler teams and, rather than writing assembly code manually, improved how compilers generated it. A plethora of optimizers were developed which allowed compilers to generate machine code with efficiency on par with hand-tuned assembly code. As compilers advanced, they soon could best assembly programmers on all but an exceedingly small number of problems.

The final nail in the assembly programmer’s coffin was the evolution of CPU instruction sets to better accommodate compilers at the expense of humans. While there are still some engineers who primarily write assembly code, they are certainly a rarity in the modern world of software engineering. As I understand it, a non-trivial number of engineers found themselves unemployable as they failed to learn and adopt the new technologies quickly enough.

The journey to higher levels of abstraction never really stops, even when the compilers can’t keep up. For example, architects engineer through technical design docs written in plain English. Seeing as computers cannot yet work at this level of abstraction, architects must rely on organic compilers, that is, human engineers, to transcode their design into a programming language on which a computational compiler can operate.

While many of these organic compilers greatly enjoy their work, one can see the gap between English and programming languages decreasing as more powerful tools come on the scene. Those engineers that leverage cloud computing components extensively may feel that we’re getting quite close to making organic compilers unnecessary. Such engineers can already tie together powerful cloud services using minimal actual code to create substantial engineering systems.

Disruption in data science

Those of us who have been working with data and statistics since the early 2000s have already seen some smaller scale disruption in our field. I can still remember using C++ and Perl as an undergrad researcher to develop Monte Carlo simulations, apply statistical modeling, and perform data analysis. Thankfully, I was introduced to Python 2.3 just months into my ordeal. I recall a sense of joy and excitement as I learned and applied this more powerful tool and thereby got things done more quickly and easily.

Python itself has also undergone substantial evolution over the years that I’ve used it. I can still remember my indecision over which numerical array library to use until the community standardized on NumPy. I have memories of analyzing data without the aid of Jupyter and pandas, instead having to write small scripts that used low-level NumPy operations and an early version of matplotlib. And of course, statistical modeling was much more tedious without such libraries as scikit-learn and statsmodels. I even see Python itself possibly being disrupted and replaced as it faces increasing competition from the more elegant language Julia.

But all of the advancements I’ve seen over my time as an academic researcher and data scientist pale in comparison to what is happening in contemporary times within cloud computing. Now, we data scientists have the opportunity to leverage a broad array of cloud technologies for everything from data lakes, to ETL, to machine learning. The last of these is the most exciting in my opinion, because we can now leverage sophisticated, specialized systems to handle assorted machine learning tasks from general predictive modeling to computer vision. For an overview of MLaaS technologies, see this insightful article.

Further, I think we should remember that we’re just at the beginning of the MLaaS age. What we’re seeing now is comparable to the early compilers; new tools that are initially only of use for a subset of problems. But with all of the excitement around the potential of MLaaS, we can expect substantial R&D investments from such powerhouse companies as Amazon, Microsoft, and Google. Hence, we should expect more and more machine learning problems to be better addressed with MLaaS solutions as the technology advances. I imagine such companies won’t stop until they’ve provided superior solutions to almost every machine learning problem. There’s simply too much money at stake for them to ignore these opportunities.

Our evolution as data scientists

So what does this all mean for us data scientists? Will we become obsolete as computers replace us? Possibly. But in my opinion, we’re just like assembly programmers at the dawn of the compiler age. Some of us will be early adopters because we have problems specifically well suited to MLaaS. Others may wait until it becomes obvious that you can’t be an effective data scientist without leveraging these tools. And I fear a small number of unfortunate folks may truly become unemployable as they fail to learn and adopt these exciting new technologies in sufficient time.

What will a data science role look like as we outsource our machine learning work to MLaaS? In many ways our roles will be similar. We’ll still need strong proficiency in the fundamentals of statistics and strong systematic thinking so that we can reason through the complex structure of novel data problems. We’ll still be figuring out how to quantitatively understand data sets through data analysis, although we’ll be using increasingly powerful cloud tools to this end, such as BigQuery and Dataflow. Lastly, we’ll still need our creative technical thinking to map business needs onto problems that can be solved by machine learning and other data science methods.

With the technical tasks getting easier, it’ll become all the more important to master what I believe to be the chief data science skill: exceptionally strong technical communication. We’ll need to hone our empathetic listening capabilities so that we can fully understand the needs of our customers and stakeholders. We’ll need to practice strong systematic thinking to organize the complex information associated with a data science project so that we can communicate this information in amazing documents, presentations, and ad hoc conversations. I believe systematic thinking and technical communication will increasingly become the X-factor in differentiating elite data scientists from the rest of the crowd.

And lastly, we’ll all be looking for bigger problems to solve using our newly learned, more powerful tools. To that end, I believe we’ll each need to develop deep expertise in at least one specific problem domain such as marketing or healthcare. Not only will we need to understand the data, but we’ll also need to understand how this data drives our business. We’ll need to pick up additional business knowledge and context so we can find even more problems to work on since each problem will be easier to solve. I recommend “The Ten-Day MBA 4th Ed.: A Step-by-Step Guide to Mastering the Skills Taught In America’s Top Business Schools”.

Concluding Remarks

I hope you’ve found this opinion piece thought-provoking, and I’d love to hear what other people think. Do you believe MLaaS is a truly transformative technology? Are there other technologies that we data scientists should have on our radar? In what ways do you believe the data science role will evolve with the dawn of these exciting new technologies? Feel free to comment in the section below.