Building a modern analytics stack

This post was originally published by Abizer Jafferjee at Towards Data Science

Many modern analytics architectures prefer the ELT approach because it increases flexibility in the pipeline. This is becoming even easier with storage solutions like the Snowflake data warehouse, which allow semi-structured data to be stored and queried. But it's not simply a matter of choosing ETL vs. ELT: a combination of both approaches may be right depending on a company's needs.
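The ELT pattern can be sketched in a few lines. This is a hypothetical, minimal illustration: SQLite stands in for a cloud warehouse like Snowflake, and it assumes an SQLite build with the JSON1 functions (enabled by default in recent versions). Raw, semi-structured events are landed as-is, and the transform happens inside the warehouse with SQL, per use case.

```python
import json
import sqlite3

# Raw, semi-structured events from a source system (illustrative data).
raw_events = [
    {"user": "a", "amount": 30, "country": "CA"},
    {"user": "b", "amount": 70, "country": "US"},
    {"user": "a", "amount": 20, "country": "CA"},
]

conn = sqlite3.connect(":memory:")

# Extract + Load: land the payloads untouched, no upfront schema.
conn.execute("CREATE TABLE raw_events (payload TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?)",
    [(json.dumps(e),) for e in raw_events],
)

# Transform: shape the raw data on demand, inside the warehouse.
revenue_by_user = conn.execute(
    """
    SELECT json_extract(payload, '$.user')         AS user,
           SUM(json_extract(payload, '$.amount'))  AS revenue
    FROM raw_events
    GROUP BY user
    ORDER BY user
    """
).fetchall()

print(revenue_by_user)  # [('a', 50), ('b', 70)]
```

Because the raw payloads stay in the warehouse, a new use case only needs a new transform query, not a new ingestion pipeline.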

Data Warehousing

The next piece of a data analytics stack is the data storage platform. The most popular strategy is to place data from all your sources into a common repository where the data can be transformed and combined for various use cases. The most popular solution for data storage is the data warehouse — warehouses house raw and transformed data in databases that are easy for different teams within a company to access. Traditionally, data marts have been a popular solution for curating data for specific domains like human resources and finance into their own databases and servers, but at the expense of being siloed. The revolution in cloud data warehousing, with platforms like Snowflake, AWS Redshift and Google BigQuery, is completely shattering this pattern. Snowflake, for example, has risen in popularity because of its architecture that separates storage and compute resources. As the cost of storage has dropped dramatically, Snowflake's split architecture has made it possible for companies to cheaply store massive amounts of raw data from all their sources and spend compute resources only on transforming data for analytics use cases. Read this article on how you can set up your Snowflake architecture.

Another approach has been to store raw data, without any specific purpose in mind, in data lakes. Data lakes are not relational SQL-based platforms like data warehouses, and they are the complete conceptual opposite of data marts. Data lakes are broad stores of general data that allow any kind of data, whether structured or unstructured, to be stored without any organisation. While they are difficult to navigate, they have the benefit of making it easy to start new analytics use cases and data science explorations. Some examples of data lake platforms are AWS S3 and Azure Blob Storage. However, platforms like Snowflake also combine the advantages of data lakes by using cloud storage like S3 as their storage layer, making Snowflake just as cheap for storage. In addition, with their ability to store semi-structured data and automatically optimise data for storage and querying, among other features, solutions like Snowflake can serve as a replacement for data lakes in many analytics stacks.
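The data lake idea can be made concrete with a toy sketch. In this hypothetical illustration, a temporary directory stands in for an object store like an S3 bucket: it accepts any format, structured or not, with no upfront schema, and a new exploration starts simply by discovering what is there.

```python
import csv
import json
import tempfile
from pathlib import Path

# A flat, schemaless store standing in for an S3 bucket or Azure container.
lake = Path(tempfile.mkdtemp())

# Land heterogeneous data as-is: JSON events, a CSV export, raw logs.
(lake / "clicks.json").write_text(json.dumps([{"page": "/home", "ms": 120}]))
with open(lake / "orders.csv", "w", newline="") as f:
    csv.writer(f).writerows([["order_id", "total"], ["1001", "42.50"]])
(lake / "app.log").write_text("2021-01-01 INFO service started\n")  # unstructured

# Exploration begins by taking inventory -- nothing is curated in advance.
inventory = sorted(p.name for p in lake.iterdir())
print(inventory)  # ['app.log', 'clicks.json', 'orders.csv']
```

The flexibility and the difficulty are the same property: any data can land here, but a consumer must work out each file's structure before it can be analysed.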

Data analytics and machine learning

Analytics is at the top of the hierarchy in the analytics stack. For every analytics use case, teams will want to map out the target metrics and KPIs that are relevant. Then they can choose to either model and store data within the data warehouse to serve the use case or model data once it’s in the analytics tool they have chosen. The choice of which analytics tool to use comes down to what kind of activity is being performed and who the user is. These users could be business teams, product and engineering teams, or data teams.

Business intelligence is the most common analytics use case for most companies. At a fundamental level, BI provides users with an easy way to analyse historical, current and predictive views of business operations. To choose a BI tool we must first narrow down our use case. Companies now understand that providing BI dashboards to line-of-business users and executives has great benefits. The dashboards give end-users self-service access to insights that can impact the bottom line. They also support ad hoc analysis with features like data filters and the ability to group or isolate data to find interesting trends. BI platforms like Chartio and Microsoft Power BI are also easy to deploy for business teams without continuous IT involvement. Once set up, business users can easily connect BI platforms to a selection of modelled data within the data warehouse that serves their needs. In addition, many companies are also looking for ways to integrate analytics tools into their existing applications and overall business processes. Embedded analytics tools provide these capabilities by allowing developers to embed visualisations into applications. Sisense is an example of a platform that helps developers build custom analytics into any kind of app using APIs.

Advanced analytics is still deployed less frequently, but it can be among the highest-value activities that set companies apart. Data science is one such activity, where more complex statistical techniques and modelling are applied to structured and unstructured data in huge volumes to generate predictive or prescriptive insights. Data science involves a lot of exploratory work, so data scientists usually use querying tools for initial data exploration, then build custom programs that connect to data warehouses to extract data, or integrate with platforms like RapidMiner that provide integrated environments for mining and predictive analytics work. Machine learning is an extension of this work, where modelled data is fed into services like AWS SageMaker or DataRobot to train, evaluate and deploy models. These models are then integrated within a company's existing products for customer-facing features like a recommendation engine, used with existing analytics tools for augmented analytics like churn prediction, or made part of intelligent automation applications like predictive maintenance of server loads. Due to the huge scope of use cases that fall under advanced analytics and machine learning, it is difficult to narrow down a few tools. Unlike BI, advanced analytics can have very complex architectures of its own, but the data processing and data warehousing components of the analytics stack remain the same.
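The train / evaluate / deploy loop that services like AWS SageMaker automate can be shown in miniature. This is a deliberately toy sketch, not a real ML algorithm: the "model" is just a login-count threshold for churn, fitted on a tiny made-up dataset, but the three stages mirror the workflow described above.

```python
# Labelled history, invented for illustration: (logins_in_last_30_days, churned)
train = [(1, True), (3, True), (4, True), (6, False), (9, False), (15, False)]
test = [(2, True), (8, False)]

# "Train": pick the login threshold that best separates churners
# on the training split.
best = max(
    range(16),
    key=lambda t: sum((logins < t) == churned for logins, churned in train),
)

# "Evaluate": measure accuracy on held-out data before deploying.
accuracy = sum((logins < best) == churned for logins, churned in test) / len(test)

# "Deploy": serve the fitted model behind a prediction function that a
# product feature (e.g. a retention-offer trigger) could call.
def predict_churn(logins: int) -> bool:
    return logins < best

print(best, accuracy)  # 5 1.0
```

Real pipelines swap the threshold search for proper model training and the function for a hosted endpoint, but the separation of fitting, held-out evaluation, and serving is the same.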

Moving from depending on siloed applications for basic analytics to building your own stack can be a major task. We've laid out a guide for how you should think about the components in your stack. If your company is just starting on this journey, it's important to know that there is no one-size-fits-all tool. And tools that work for your use cases today may need to change as your data grows, so your analytics stack will continuously evolve. Regardless of the stage you're at, think carefully about the tools that fit well with your needs today but are scalable or interchangeable in the future.
