Mastering Big Data Analysis with Tableau: Advanced Techniques and Best Practices
Unlocking the Power of Tableau for Big Data Analytics: Part 2 – Advanced Strategies
Taking Your Big Data Analytics to the Next Level: Advanced Tableau Techniques and Tips
I. Understanding the Hadoop Ecosystem
The core technology for data storage is the Hadoop Distributed File System (HDFS). As discussed in the first part of this series Hadoop redundantly stores blocks (pieces of data) across servers. HDFS stores and manages data, while MapReduce processes and computes data in HDFS. However, there are abstractions built on top of HDFS that allow technology and analytics professionals to query data using SQL. These SQL layers introduce simplicity to the Hadoop ecosystem by removing the need to use programmatic methods of accessing and analyzing data and using query interfaces instead.
II. Hive
Hive provides a convenient SQL abstraction from the underlying data stored in HDFS. This allows users to query and analyze information in files. Furthermore, data split into multiple files can be accessed on one table instead of having to read one file at a time through HDFS. Hive, like HDFS, utilizes the MapReduce processing layer to access the underlying data. The issue with this structure is that this process creates latency issues whenever Tableau attempts to load data when dashboards are refreshed and updated. This makes it difficult to leverage the seamless real time interaction Tableau offers to users through dashboard drill downs.
III. Impala
Impala is the best performing solution because it circumvents MapReduce processing. Fast execution of SQL queries in Impala facilitates quick reads and refreshes in Tableau. There are also several ways to improve workbook performance for dashboards pointing to Impala data. On the Tableau side, some of the best optimizations include filtering and hiding unused fields, limit string related calculations in Tableau unless it’s necessary and avoiding joins of various data sources. Joins should generally be used as a process for refining the underlying data source. Performance optimizations that can be implemented with Impala include storing data in Parquet files as discussed earlier with Hive. Furthermore, partitioning data based on columns that have an ideal level of granularity allows tables to be split into similarly sized files. In addition, partitioning by small integer fields such as TINYINT instead of dates or strings make data reads much faster for Impala and consequently improves latency downstream with Tableau.
IV. Hue
The Hadoop User Experience (Hue) is an open-source web application that facilitates querying of data in Hive and Impala. It also provides users with the ability to examine table schemas and create new datasets. It also allows users to navigate HDFS interactively without relying on a command line interface. Hue is available in the Cloudera Data Platform (CDP), AWS, Google Cloud Platform, and Microsoft Azure.
V. Using Hive & Impala with Tableau
When using dashboards that are pre-defined and cannot be customized by viewers, Hive can be a reliable data source for Tableau developers. However, there are important limitations to using Hive on Tableau. With the batch-oriented nature of Hive, it does not process small queries efficiently and will inevitably lead to slow processing of Tableau sheets and dashboards. This problem compounds when presenting visualizations that rely on calculated fields.
Optimizing Data Connections: Best Practices for Connecting Tableau to Big Data Sources
Implementing Advanced Calculations: Techniques for Analyzing Big Data with Tableau
Creating Advanced Visualizations: Tips and Tricks for Visualizing Big Data in Tableau
Performance Optimization: Strategies for Improving Tableau Performance with Big Data
Leveraging Extracts: Maximizing Performance with Tableau Data Extracts in Big Data Analysis
Advanced Calculation Techniques: Implementing Complex Calculations for Big Data Analysis in Tableau
Visual Best Practices: Designing Effective Visualizations for Big Data Analysis in Tableau
Performance Monitoring: Tools and Techniques for Monitoring Tableau Performance in Big Data Analysis
On the other hand, Impala by-passes the slow MapReduce architecture of Hive and supports rapid processing of interactive and ad-hoc queries. As a result, Impala is the ideal data source for creating dynamic, low-latency Tableau visualizations.