Big Data Analytics and Visualization for Apache Impala
What is Apache Impala?

Impala, the SQL analytic engine shipped with Cloudera Enterprise, is a fully integrated analytic database architected specifically to leverage the flexibility and scalability of Apache Hadoop. A Hadoop cluster may hold many types of data, including clickstream records, web and call center logs, and ID scans. Although most closely associated with Cloudera, Impala also ships with other Hadoop distributions, including those from MapR, Oracle, and Amazon.

Why Apache Impala for Big Data Analytics?

The Impala platform brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries against big data stored in HDFS and Apache HBase without first moving or transforming it.
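As a hedged illustration of what this looks like in practice, the snippet below composes the kind of low-latency aggregate query Impala can run directly over HDFS-backed tables. The table and column names (`clickstream`, `page`, `event_date`) are hypothetical, not from the article; any HiveServer2-compatible client (such as the impyla package for Python, or the JDBC/ODBC drivers) could submit this text to an Impala daemon.

```python
# Illustrative only: a typical low-latency aggregate that Impala can run
# directly over files in HDFS. Table and column names are hypothetical.
query = """
SELECT page, COUNT(*) AS hits
FROM clickstream            -- an HDFS-backed table, e.g. Parquet files
WHERE event_date = '2020-01-15'
GROUP BY page
ORDER BY hits DESC
LIMIT 10
"""

# In a real client this string would be passed to cursor.execute(query);
# here we just show the statement itself.
print(query.strip())
```

The point is that this is plain SQL over data already sitting in HDFS: no export step, no ETL into a separate warehouse.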

With Impala came Parquet, a columnar storage format that stores data in HDFS more efficiently than row-based formats. Writing Parquet files does require defining the schema (tables, columns) in advance and writing the data in a specific layout, but the payoff is much faster analysis.
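To see why a columnar layout helps, here is a minimal sketch in plain Python. The records and schema are illustrative assumptions, not Parquet itself, but the contrast is the same one Parquet exploits: an aggregate over one column only has to scan that column, while a row layout forces every full record to be read.

```python
# Minimal sketch of why columnar storage (like Parquet) speeds up analytics.
# Data and schema are illustrative, not from the article.

rows = [  # row-oriented: each record is stored together
    {"user": "a", "page": "/home", "ms": 120},
    {"user": "b", "page": "/cart", "ms": 340},
    {"user": "c", "page": "/home", "ms": 95},
]

columns = {  # column-oriented: each column is stored contiguously
    "user": ["a", "b", "c"],
    "page": ["/home", "/cart", "/home"],
    "ms": [120, 340, 95],
}

def avg_latency_row(rows):
    # Row layout: every full record is touched just to read one field.
    return sum(r["ms"] for r in rows) / len(rows)

def avg_latency_col(columns):
    # Column layout: only the "ms" column is scanned.
    ms = columns["ms"]
    return sum(ms) / len(ms)

print(avg_latency_row(rows))   # -> 185.0
print(avg_latency_col(columns))  # -> 185.0, same answer, less data touched
```

Parquet adds per-column compression and encoding on top of this layout, which is why the same data typically takes less space as well as less I/O per query.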

Impala enables analysts and data scientists to perform real-time, interactive analytics on data stored in Hadoop via SQL or business intelligence tools.

Logi Composer and Apache Impala

Logi Composer was one of the first certified Impala big data analytics and visualization software tools, and the results of this collaboration have been dramatic. While legacy BI tools use JDBC or ODBC to query Impala as if it were a relational database, Logi Composer connects to Impala via native APIs and understands the Parquet partitioning scheme.

It uses this information to break a single logical query into multiple micro-queries. Micro-queries submitted to Impala return at different points in time, so Logi Composer displays a preliminary visualization as soon as the first micro-query returns and then sharpens it as additional micro-queries complete. The result: much faster response times and earlier insights.
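The micro-query idea described above can be sketched as follows. This is not Logi Composer code, and the partition names and values are invented for illustration: one logical average is split into per-partition queries, and the running estimate is refined each time a partial result arrives, which is what lets a preliminary chart appear before the full scan finishes.

```python
# Hedged sketch: splitting one logical aggregate into per-partition
# micro-queries and refining the answer as partial results arrive.
# Partition names and values are illustrative, not from the article.

partitions = {
    "day=2020-01-01": [3, 5, 2],
    "day=2020-01-02": [7, 1],
    "day=2020-01-03": [4, 4, 4],
}

def micro_query(values):
    # Stand-in for one Impala query scanning a single Parquet partition;
    # it returns a partial (sum, count) that can be merged with others.
    return sum(values), len(values)

def progressive_average(partitions):
    total, count = 0, 0
    estimates = []
    for name, values in partitions.items():
        part_sum, part_count = micro_query(values)
        total, count = total + part_sum, count + part_count
        estimates.append(total / count)  # estimate refined per micro-query
    return estimates

# Each element is the best available answer after one more micro-query.
print(progressive_average(partitions))  # -> [3.333..., 3.6, 3.75]
```

Note that this trick only works for aggregates that decompose into mergeable partials (sums, counts, min/max); the partition-aware splitting is what the Parquet layout makes cheap.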

Originally published January 15, 2020; updated March 19, 2021