The 5 ways in which automation will change the big data landscape

Posted on 21-Mar-2019 15:42:27

Many companies are moving their data and applications to the cloud. Hardly surprising news, right? But we see 5 other trends emerging that you'd better keep an eye on as a data scientist. Have you already integrated them into your daily business activities?

Introduction: Automation on all levels

The prediction is that by 2020, the EU will face a data skills gap of 769,000 unfilled positions. This means that the few big data experts at the companies lucky enough to have them are carrying a heavy workload. As a data scientist you have to stay up to date with the best-suited technologies, deliver POCs at a high pace and have the business sense to create the right analyses and solutions. That's a lot of skills for one person.


We’ve already seen the number of big data technologies grow astronomically in 2016. In the next years, budgets for big data technologies will rapidly increase as they become more widely accepted amongst businesses. Most companies have already identified that they need to improve this area of business, which will, in turn, lead to more data scientists being needed to handle the masses of extra data they have access to.

A traditional way to cope with shortages is to automate the tasks where these people are losing time. We’ve seen the rise of micro-services where repetitive IT tasks are being automated, with or without the help of machine learning. For the business, this is a great evolution: in-demand data scientists are looking for ways to automate certain parts of their own job.

So, on which levels will data scientists search for automation solutions?

1. Infrastructure

Big data deployments are pretty intensive these days. Automating deployment tasks will finally make it possible for data science teams to shed operational tasks and get to the business intelligence sooner.

In open-source communities, this kind of automation is already the standard. Open-source automation platforms automate the deployment, setup and configuration processes by creating digital work environments for data scientists, for both rapid experimentation and production systems. Instead of a team repeatedly spending several days or weeks on one setup, these automation platforms complete the installation, configuration, integration, monitoring and fault-recovery autonomously in minutes.
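To make the idea concrete, here is a minimal sketch of what such a platform does under the hood: it walks through a list of setup steps, skips the ones that are already done, and retries the ones that fail. The `Step` class, `run_steps` function and the two demo steps are hypothetical names made up for this illustration; they are not the API of any real automation platform.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One setup task (hypothetical structure, for illustration only)."""
    name: str
    is_done: Callable[[], bool]   # check, so re-running is harmless
    apply: Callable[[], None]     # the actual work


def run_steps(steps: List[Step], max_attempts: int = 3) -> List[Tuple[str, str]]:
    """Run steps in order, skipping finished ones and retrying failures."""
    log = []
    for step in steps:
        if step.is_done():
            log.append((step.name, "skipped"))
            continue
        for attempt in range(1, max_attempts + 1):
            try:
                step.apply()
            except RuntimeError:
                if attempt == max_attempts:
                    raise  # give up after the last attempt
                continue   # simple fault recovery: try again
            log.append((step.name, "applied"))
            break
    return log


# Demo state standing in for a real machine or cluster (assumption).
state = {"installed": False, "configured": False, "failures_left": 1}


def install() -> None:
    state["installed"] = True


def configure() -> None:
    # Fails once to simulate a transient error, then succeeds.
    if state["failures_left"] > 0:
        state["failures_left"] -= 1
        raise RuntimeError("transient failure")
    state["configured"] = True


steps = [
    Step("install", lambda: state["installed"], install),
    Step("configure", lambda: state["configured"], configure),
]

first_run = run_steps(steps)    # does the work, retrying "configure" once
second_run = run_steps(steps)   # nothing left to do, every step is skipped
```

Because every step checks its own state first, running the whole list again is safe, which is what lets a platform repeat a setup in minutes without a human in the loop.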

2. Coding

In 2015, researchers at MIT created a program that automatically fixed software bugs by replacing faulty lines of code with working lines from other programs. Marc Brockschmidt, one of the creators of DeepCoder (a code-writing system described below), says that future versions could make it very easy to build routine programs that scrape information from websites, or automatically categorise Facebook photos, for example, without human coders having to lift a finger.

Coding is by nature a pretty creative task, so simply automating it via micro-services is completely out of the question. But that's where artificial intelligence comes in. According to Wikipedia, machine learning is the subfield of computer science that gives “computers the ability to learn without being explicitly programmed”. So a machine learning system could gain the ability to write its own code. A system called DeepCoder, created by researchers at Microsoft and the University of Cambridge, solved basic challenges of the kind set by programming competitions. This kind of approach could make it much easier for people to build simple programs without knowing how to write code.

“All of a sudden people could be so much more productive,” says Armando Solar-Lezama at the Massachusetts Institute of Technology. “They could build systems that it [would be] impossible to build before.”

Ultimately, the approach could allow non-coders to simply describe an idea for a program and let the system build it, says Marc Brockschmidt, one of DeepCoder’s creators at Microsoft Research in Cambridge, UK.
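The idea of building a program from a description of what it should do can be sketched in a few lines. The example below is not DeepCoder's method: it is a plain brute-force search over a tiny, made-up set of list operations (`PRIMITIVES`), looking for the shortest chain of operations that reproduces a handful of input/output examples. All names in it are assumptions chosen for illustration.

```python
from itertools import product

# A tiny, made-up vocabulary of list operations (assumption, not DeepCoder's).
PRIMITIVES = {
    "sort": lambda xs: sorted(xs),
    "reverse": lambda xs: list(reversed(xs)),
    "double": lambda xs: [2 * x for x in xs],
    "drop_negatives": lambda xs: [x for x in xs if x >= 0],
}


def run_program(program, xs):
    """Apply the named operations one after another."""
    for name in program:
        xs = PRIMITIVES[name](xs)
    return list(xs)


def synthesize(examples, max_length=4):
    """Return the first shortest program that matches every example, or None."""
    for length in range(1, max_length + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run_program(program, inp) == out for inp, out in examples):
                return program
    return None


# "Describe" the desired program only through examples:
# keep the non-negative numbers, doubled, largest first.
examples = [
    ([3, -1, 2], [6, 4]),
    ([0, 5, -7, 1], [10, 2, 0]),
]

program = synthesize(examples)
print(program)
print(run_program(program, [4, -2, 9]))
```

Brute force only works here because the vocabulary is tiny. The number of candidate programs grows exponentially with program length, which is why a learned model that guides the search matters in practice.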

While expert data scientists have historically been forced to write custom code, automating this task empowers them to shift focus from routine coding and data wrangling to tasks that add real value: understanding the business problem and the deployment context of their models, and explaining their results.

Interesting source: https://www.wired.com/2016/05/the-end-of-code/

3. Data conversion

Before data scientists can use the data for business intelligence, they spend up to 70% of their time cleaning and structuring it. Since this is not the most fascinating part of the job, it's no surprise that the world is looking for ways to make the process quicker and more effective. Here too, machine learning and automation will be used to make the process faster and foolproof.

This will also be a big step towards enabling us to represent prediction problems in an abstract, standard way, which makes them sharable with other analysts.
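To give a feel for the cleaning and structuring work described above, here is a minimal sketch using only the Python standard library. The field names (`name`, `amount`), the sample rows and the rule that a comma is a decimal separator are all assumptions made for this illustration.

```python
from typing import Dict, List, Optional


def clean_record(raw: Dict[str, str]) -> Optional[Dict[str, object]]:
    """Normalise one raw row; return None when it cannot be repaired."""
    # Collapse repeated whitespace and use consistent capitalisation.
    name = " ".join(raw.get("name", "").split()).title()
    # Assumption: a comma in the amount is a decimal separator ("12,50").
    amount_text = raw.get("amount", "").strip().replace(",", ".")
    try:
        amount = float(amount_text)
    except ValueError:
        return None  # unparseable amount such as "n/a"
    if not name:
        return None  # a row without a name is unusable
    return {"name": name, "amount": amount}


def clean(rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Clean every row and drop duplicates that only differed in formatting."""
    seen = set()
    cleaned = []
    for raw in rows:
        record = clean_record(raw)
        if record is None:
            continue
        key = (record["name"], record["amount"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


raw_rows = [
    {"name": "  alice   smith ", "amount": "12,50"},
    {"name": "Alice Smith", "amount": "12.50"},   # same record, other format
    {"name": "bob jones", "amount": "n/a"},       # amount cannot be parsed
    {"name": "", "amount": "3"},                  # name is missing
    {"name": "carol  de vries", "amount": " 7 "},
]

cleaned_rows = clean(raw_rows)
print(cleaned_rows)
```

Rules like these are exactly what gets rewritten by hand for every new data source, which is why this step eats so much of a data scientist's time.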

4. Integration

With the advent of integrated Big Data as a Service and the cloud, new solutions are on the rise that deliver complete packages to big data teams. We now see companies such as Cazena, Bluedata and Tengu offering solutions where the full stack is available to the end user. Deploying full stacks in a world with hundreds of technologies is only possible when a huge level of automation is put in place.

With IaaS, PaaS and SaaS as a stack, users still have to write their own application or business solution and provide it as a service. These solutions (Healthcare, Insurance, Finance, Banking, IoT, IIoT, etc.) take time and resources to build. Well-integrated stack solutions should make it possible to deploy clusters automatically on the service of choice. With automated stacks, the customer plugs in the data and the service performs all the scalable activities, including cleaning and preparing the data for the algorithm. Whether it is an anomaly detection algorithm, a recommendation engine or something similar, the solution is ready for the customer in that particular domain or vertical.

5. Data Analysis

With new algorithms, data scientists could accomplish in days what has traditionally taken months. In 2015, MIT researchers presented a system that automated a crucial step in big-data analysis: the selection of a “feature set,” or aspects of the data that are useful for making predictions. The researchers entered the system in several data science contests, where it outperformed most of the human competitors and took only hours instead of months to perform its analyses.
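As a rough illustration of what automatically choosing a feature set can mean, the sketch below ranks candidate columns by how strongly they correlate with the value to predict and keeps the best ones. This is a deliberately simple stand-in, not the MIT system's method, and the column names and numbers are invented for the example.

```python
from math import sqrt
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; 0.0 when either series has no variation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_features(columns: Dict[str, List[float]],
                  target: List[float], top_k: int) -> List[str]:
    """Return the top_k column names by absolute correlation with target."""
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], target)),
        reverse=True,
    )
    return ranked[:top_k]


# Invented example data: which columns help predict purchases?
columns = {
    "visits": [1, 2, 3, 4, 5, 6],
    "constant": [7, 7, 7, 7, 7, 7],    # carries no information
    "noise": [4, 1, 5, 2, 6, 3],       # barely related to the target
    "returns": [9, 8, 8, 5, 4, 2],     # strongly, but negatively, related
}
purchases = [3, 5, 6, 9, 11, 12]

selected = rank_features(columns, purchases, top_k=2)
print(selected)
```

Using the absolute value of the correlation keeps strongly negative signals such as `returns`, which are just as useful for prediction as positive ones.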

With the rise of GPUs (graphics processing units), computers are becoming much better at pattern recognition. This is one of the main use cases for machine learning, as computer vision systems and embedded systems are able to process large datasets in parallel while learning the nuances. Given the serious shortage of data scientists, machine learning will help companies catch up. In the future, many tasks of data scientists, like predictive analytics, will be made easier, to the point where, for instance, marketing managers will be able to make predictions without even needing a data scientist.


This automation trend on 5 levels is leveraged by the open-source community.

Companies with limited knowledge of the big data landscape (we don't blame you for that, it's a pretty complex field) are tempted to buy large integrated solutions. In the medium or long term, this leaves SMBs and educational institutions locked in with a vendor, unable to use a technology they might need or to switch away without costs. That's why they start looking into creating solutions (stacks) consisting mainly of open-source components. We at Tengu firmly believe that open-source full-stack solutions are able to compete with the expensive full-stack solutions companies usually buy from the larger vendors. Since we practice what we preach, we kindly invite you to experience those benefits yourself by starting your free Tengu trial.

Is your company already future-proof?

Topics: Big Data

Written by Sarah Facq

Sarah has been our Growth Marketer during 2018.
