Apache Spark Analytics & Visualization

What is Apache Spark?
Apache Spark is an open source, in-memory application framework for distributed data processing and iterative analysis of massive data volumes. Spark is used by many organizations to process and analyze big data sets and runs virtually anywhere, making it ideal for big data analytics.

Apache Spark offers programmers an API centered on the resilient distributed dataset (RDD), a read-only collection of data items partitioned across a cluster of machines and maintained in a fault-tolerant way, so that lost partitions can be rebuilt if a node fails. Spark DataFrames are a higher-level, tabular abstraction built on top of RDDs.

RDDs facilitate the implementation of iterative algorithms, such as those used in machine learning, which visit their data set multiple times in a loop. They also support exploratory data analysis, with repeated database-style queries of data. Because the working set can stay in memory between passes, applications built this way see lower latency than those that reread data from disk on every pass.
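As a rough illustration of the iterative pattern described above, the following plain-Python sketch keeps a small data set in memory, split into two lists that stand in for partitions, and revisits it on every pass of a gradient-descent loop. It does not use Spark itself; the data, the model `y = w * x`, and the names `partitions` and `partial_gradient` are all assumptions made for the example.

```python
# Two in-memory "partitions" of (x, y) pairs, standing in for data
# spread over a cluster. The values are made up for illustration.
partitions = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0)],
]
n = sum(len(p) for p in partitions)


def partial_gradient(partition, w):
    """Gradient contribution of one partition for the model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in partition)


w = 0.0
learning_rate = 0.01
for _ in range(200):
    # Each pass revisits the same in-memory data; nothing is reloaded.
    gradient = sum(partial_gradient(p, w) for p in partitions) / n
    w -= learning_rate * gradient

print(w)  # converges toward 2.0, the slope that fits the sample data
```

In a real Spark job the per-partition work would run on different machines, but the shape of the loop, many passes over data that is loaded once, is the same.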

You can run Spark in standalone mode, on Hadoop, on Mesos, or in the cloud, while accessing diverse data sources including the Hadoop Distributed File System (HDFS), Cassandra, HBase, and S3.

How Does Spark Streaming Work?

Logi Composer leverages Apache Spark data stream processing as a complementary processing layer within the Logi Composer server. Wherever possible, Logi Composer pushes query processing to the original sources so that aggregation, filtering, and calculations are performed close to where the data is stored. As the aggregated, filtered result sets are retrieved from those sources, Logi Composer caches them as Spark DataFrames.

When users submit new requests for data, Logi Composer retrieves the data from the Spark Streaming result set cache whenever possible. Logi Composer also uses cached result sets if the user sorts or crosstabs the data, or performs some kind of interaction that can be achieved without going back to the original source.
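The caching behavior described above can be sketched in a few lines of plain Python. This is only a model of the idea, not Logi Composer's implementation or API: the class `ResultSetCache`, the function `fake_source`, and the sample query are hypothetical names and data invented for the example.

```python
class ResultSetCache:
    """Hypothetical stand-in for a result-set cache keyed by query text."""

    def __init__(self, fetch_from_source):
        self._fetch = fetch_from_source
        self._cache = {}
        self.source_hits = 0  # how many times we went back to the source

    def query(self, sql):
        # Serve from the cache when this result set was already retrieved.
        if sql not in self._cache:
            self.source_hits += 1
            self._cache[sql] = self._fetch(sql)
        return self._cache[sql]

    def sorted_by(self, sql, column, descending=False):
        # Sorting needs no new source query: it reorders cached rows.
        return sorted(self.query(sql), key=lambda r: r[column], reverse=descending)


def fake_source(sql):
    # Pretend the source has already aggregated revenue per region.
    return [
        {"region": "EMEA", "revenue": 120},
        {"region": "APAC", "revenue": 95},
        {"region": "NA", "revenue": 310},
    ]


cache = ResultSetCache(fake_source)
q = "SELECT region, SUM(revenue) AS revenue FROM sales GROUP BY region"
first = cache.query(q)
top = cache.sorted_by(q, "revenue", descending=True)
print(cache.source_hits, [r["region"] for r in top])  # 1 ['NA', 'EMEA', 'APAC']
```

The source is contacted once; the later sort is answered entirely from the cached rows.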

Spark Core

Spark Core is the foundation of Apache Spark. It handles distributed task dispatching, scheduling, and basic I/O, all of which are exposed through an API built around the RDD abstraction.

Spark SQL

Spark SQL is a component on top of Spark Core that introduces a data abstraction layer called DataFrames. DataFrames provide support for structured and semi-structured data.

For example, when a user defines a new calculation based on metrics that already exist in a visualization, Logi Composer executes the calculation using Spark and the DataFrames that contain the cached data.

Spark also powers Logi Composer Fusion. As with single-source queries, Logi Composer pushes as much query processing as possible to the original sources. But when Logi Composer needs to fuse multiple sources, it retrieves an aggregated result set from each source and Spark performs the join between them.

Finally, Logi Composer leverages Spark to accelerate slow data sources via the SparkIt feature. Some sources, like flat files and S3 buckets, do not provide query capabilities. SparkIt provides a way to load big data sets from these sources into Spark, where they become fully interactive, queryable data sets.
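To make the fusion step described above more concrete, here is a minimal plain-Python sketch of joining two pre-aggregated result sets on a shared key. It models the idea only; the function `fuse` and the sample result sets are hypothetical and are not part of Logi Composer or Spark.

```python
def fuse(left, right, key):
    """Inner-join two small, pre-aggregated result sets on `key` (hash join)."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    fused = []
    for lrow in left:
        for rrow in index.get(lrow[key], []):
            fused.append({**lrow, **rrow})
    return fused


# Hypothetical result sets, each already aggregated by its own source.
sales_by_region = [
    {"region": "NA", "revenue": 310},
    {"region": "EMEA", "revenue": 120},
]
tickets_by_region = [
    {"region": "NA", "open_tickets": 14},
    {"region": "APAC", "open_tickets": 3},
]

fused = fuse(sales_by_region, tickets_by_region, "region")
print(fused)  # [{'region': 'NA', 'revenue': 310, 'open_tickets': 14}]
```

Because each source returns only an aggregated result set, the join works on a handful of rows rather than on the raw data. With Spark DataFrames, the comparable operation would typically be a `DataFrame.join` call on the shared column.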

By implementing these capabilities with Apache Spark, Logi Composer taps into the broad open-source community that supports and enhances its scale-out, in-memory technology. In addition, for maximum scalability, Logi Composer provides the option to deploy the Spark caching, big data analytics, and fusion layer in an external Spark cluster.

Why Spark for Big Data Analytics?

Many organizations, including Comcast, Bloomberg, and Capital One, have adopted Spark for big data processing and big data analytics because it handles large-scale analytics and data science use cases well. For example, Comcast has used Spark, Spark MLlib, and machine learning to detect the issues behind anomalies in its 30 million cable boxes, which generate more than one billion data points every day. Comcast runs Apache Spark on a 400-node cluster with nearly a TB of RAM and eight PB of storage.

Bloomberg uses Spark for its low-latency, cloud-based analytics platform, which delivers financial information to its clients. The company uses the Spark DataFrame concept for its Spark applications. As more and more enterprises look for data analysis capabilities, Spark has become something close to a single toolbox for data scientists.

Spark is gaining immense popularity because it:

  • Features an advanced directed acyclic graph (DAG) execution engine that supports acyclic data flow and in-memory computing
  • Holds data in memory, making it up to 100 times faster than disk-based Hadoop MapReduce for certain applications
  • Supports multi-stage primitives, which makes it faster than Hadoop with MapReduce
  • Offers a convenient, unified programming model for developers, supporting SQL, stream processing, machine learning, and graph analytics
  • Allows user programs to load big data into a Spark cluster’s memory and query it repeatedly, making it well suited to machine learning algorithms
  • Offers a scalable machine learning library via Spark MLlib
  • Provides support for graphs and graph-parallel computation with its GraphX API

In addition, unlike Hadoop, Apache Spark is compatible with several resource managers, such as YARN and Mesos. It is also easy to use, offering APIs in programming languages like Scala, Java, and Python, in addition to Spark SQL.

Spark does not include its own file system, which is one reason many big data projects run Spark on top of a Hadoop cluster using a distribution like Hortonworks or Cloudera. Most companies find that both Apache Spark and Hadoop are necessary for a robust analytics ecosystem.

Logi Composer and Apache Spark

Logi Composer integrates with and leverages Apache Spark for big data analytics in multiple ways. If you use Spark to manage data directly, Logi Composer can access and visualize Spark data via Spark SQL. Logi Composer connects to Spark and makes DataFrames available for fast visual analytics on big data, leveraging your existing Spark cluster by pushing Spark SQL queries to the source.

Because Spark is a powerful unified environment for structured queries, machine learning, and graph analytics, you can combine these frameworks and visualize the results in Logi Composer. For example, you can build a machine learning model in Spark, such as a customer value score, and add its output as a new field in your Spark DataFrame. Logi Composer can then use that field like any other attribute.

Logi Composer can also integrate with Spark Streaming so that users can interact with live streams of real-time data.

In addition, Logi Composer makes special use of Spark as an embedded technology. Working with any data source, Logi Composer leverages Spark for result set caching, data blending, and additional calculations on top of what is available from the source.

SparkIt: Visualize Data from Flat Files, S3, JSON and More

Logi Composer is designed to push query processing to the source as much as possible.

Common analytic processing includes:

  • Selection or filtering — displaying only a subset of data that meets a condition, e.g. customers from North America
  • Aggregation — calculating a value across many data elements, e.g. counting users or summing revenue across many customers
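The two operations listed above can be illustrated with a small sketch in which the "source" is an in-memory SQLite database from the Python standard library. The table `customers`, its columns, and the sample rows are assumptions made for the example; the point is only that both the filter and the aggregation are expressed as SQL and executed by the source, so that just the small result comes back.

```python
import sqlite3

# An in-memory SQLite database stands in for a source that supports queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, continent TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("Acme", "North America", 1200.0),
        ("Globex", "Europe", 800.0),
        ("Initech", "North America", 450.0),
        ("Umbrella", "Asia", 300.0),
    ],
)

# Selection or filtering: only customers from North America.
na_customers = conn.execute(
    "SELECT name FROM customers WHERE continent = ? ORDER BY name",
    ("North America",),
).fetchall()

# Aggregation: count customers and sum revenue across all of them.
customer_count, total_revenue = conn.execute(
    "SELECT COUNT(*), SUM(revenue) FROM customers"
).fetchone()

print(na_customers)                   # [('Acme',), ('Initech',)]
print(customer_count, total_revenue)  # 4 2750.0
conn.close()
```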

But some sources do not support even these basic types of data analytics. If you are working with raw data in a file system or in Amazon Web Services S3, the file system will not support any analytic processing.

Logi Composer includes the ability to read these raw files into Spark, where they become fast, interactive, and queryable. When establishing a connection, users can choose to “SparkIt” and preload data into a Spark DataFrame for interactive use through Logi Composer. This capability is available for common file formats such as CSV, tab-delimited, JSON, and XML files.
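The idea behind preloading can be sketched with the Python standard library alone: a raw CSV has no query capability of its own, but once its rows are parsed into memory they can be queried repeatedly without rereading the file. This is a conceptual model, not the SparkIt feature itself; `RAW_CSV`, `load_into_memory`, and `revenue_by` are hypothetical names, and the inline text stands in for a file on disk or in S3.

```python
import csv
import io
from collections import defaultdict

# Inline text standing in for a raw CSV file with no query engine behind it.
RAW_CSV = """region,product,revenue
NA,widget,100
NA,gadget,250
EMEA,widget,75
"""


def load_into_memory(text):
    """Parse the raw CSV once and keep typed rows in memory."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        record["revenue"] = float(record["revenue"])
        rows.append(record)
    return rows


def revenue_by(rows, column):
    """Group the in-memory rows by `column` and sum revenue per group."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[column]] += row["revenue"]
    return dict(totals)


dataset = load_into_memory(RAW_CSV)       # one load
by_region = revenue_by(dataset, "region")    # first interactive query
by_product = revenue_by(dataset, "product")  # second query, no reload

print(by_region)   # {'NA': 350.0, 'EMEA': 75.0}
print(by_product)  # {'widget': 175.0, 'gadget': 250.0}
```

With Spark itself, the load step would usually be something like reading the file through `spark.read.csv` and caching the resulting DataFrame, after which queries run against the in-memory copy.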

SparkIt can also be used for sources other than raw files. Even relational data from Oracle, SQL Server, MySQL, or data from any “slow” source can be loaded into Logi Composer’s Spark layer to convert it to a fast, queryable, interactive source.

Originally published January 14, 2020; updated on March 19th, 2021