With each passing year, vertical industries discover more areas where Graphics Processing Unit (GPU) powered applications deliver outstanding business value. These applications span the gamut – from image processing in Retail to text analysis in Insurance to self-driving cars in Transportation to portfolio backtesting in Financial Services. The Spark ecosystem began adding support for GPU-based chipsets almost two years ago. Over the last 15 months, this integration has matured, and the day is not far off when an increasing number of Spark-based applications will deliver immense business value on GPU-based architectures. This blogpost discusses the convergence of these complementary technologies.
Why the GPU will revolutionize Industry…
Over the last three years, I have discussed a host of Big Data and Machine Learning use cases and architectures across Banking, Manufacturing, and Insurance. There is no doubt that every Fortune 500 company now has a data strategy in place to guide business insights.
However, the vast majority of conventional data lakes and data grids are built on traditional multicore CPU technology. Even with significant enhancements to storage, memory and networking subsystems, nearly every enterprise user complains about the performance of analytics run against data in an enterprise data lake or warehouse.
GPU technology, pioneered by years of research & development at Nvidia, AMD and others, has seen a decade of industrial growth. GPUs are inherently able to process data in parallel because of the thousands of cores they contain. GPU cards typically plug into server-class machines, where they augment the main CPU. They also offer fast I/O, increasing the volume of data that can move between the GPU and the main CPU and back.
As the graphic below shows, the performance differential between GPUs and multicore CPU-based systems keeps widening every year, driven by a roughly 1.5x year-over-year increase in GPU throughput.
GPUs have mostly been used in graphics-heavy applications over the past few decades – rendering images on gaming consoles using their parallel processing capabilities – though newer use cases such as bitcoin mining have become popular over the last year. Using thousands of arithmetic units and associated memory, a GPU is loaded with source data from main memory, computes results in parallel, and stores them in its own memory. The tradeoff is that applications and algorithms need to be modified to work on the GPU. For a suitable data-parallel workload and hardware configuration, however, it is safe to say that a GPU will easily outperform a CPU.
This ability gives GPU accelerators a great advantage in performing compute-heavy tasks for industrial applications. What is more interesting is the vision of a mixed or hybrid datacenter that leverages both GPUs and CPUs depending on the use case. Workloads that particularly suit GPU technology include big data processing, data mining, AI and image/video analytics. Cloud computing provides a further fillip to this deployment model.
Nvidia’s GPU programming platform is known as CUDA. A separate open standard, OpenCL (Open Computing Language), is supported by AMD and Intel as well. Programs written using the OpenCL framework are portable across a range of CPUs, GPUs, FPGAs and other kinds of hardware accelerators. Both APIs provide a standard interface for parallel computation, covering both task and data-based parallelism.
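To make the programming model concrete, here is a minimal sketch of a data-parallel kernel. It uses Python with the numba library – my choice purely for illustration; the post itself does not prescribe a toolkit – to add two vectors on the GPU:

from numba import cuda
import numpy as np

@cuda.jit
def vector_add(a, b, out):
    # Each GPU thread computes exactly one element of the result.
    i = cuda.grid(1)
    if i < out.size:
        out[i] = a[i] + b[i]

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

# Launch enough thread blocks to cover all n elements.
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](a, b, out)

A CPU version of the same algorithm would loop over the elements one (or a few) at a time; on the GPU, thousands of threads execute the kernel body simultaneously – which is exactly the rewrite effort referred to above.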
Apache Spark…
Apache Spark is the most transformative open source project in the Big Data landscape. Spark radically altered the applicability of batch-oriented Hadoop by providing a range of capabilities for processing streaming data, real-time data and complex analytics. At its core, Spark is a framework for processing massive amounts of data. When contrasted with MapReduce, its in-memory processing capabilities enable massive gains in speed while leveraging a cluster of servers. Its sophisticated yet simple-to-use APIs enable developers to control CPU, memory and storage resources at a granular level.
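A minimal sketch of this in-memory model in PySpark (the input path and column names here are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inmemory-demo").getOrCreate()

# Read once, cache in cluster memory, then run several analyses
# without re-reading from disk – the key gain over MapReduce.
events = spark.read.json("events.json")  # hypothetical input file
events.cache()

events.groupBy("event_type").count().show()
events.filter(events.amount > 1000).count()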
Virtually every Hadoop cluster uses Spark for high-speed analytics that leverage distributed computation. It is also decoupled from the underlying storage – HDFS, Amazon S3, Azure Blob Storage etc. – while providing support for multiple programming languages – Java, Python, and Scala. Spark has also spawned an ecosystem of projects that provide libraries for a range of computation types – SQL on Spark, Machine Learning, Graph Analytics and Streaming Analytics. These types of workloads – especially applications that use mathematical functions and employ graph-based traversals – benefit greatly from a hardware acceleration approach.
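For instance, a tiny sketch of the Machine Learning library (spark.ml) in action, training a logistic regression model on a made-up toy dataset:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Toy training data: (label, feature vector) rows.
train = spark.createDataFrame(
    [(1.0, Vectors.dense([0.0, 1.1, 0.1])),
     (0.0, Vectors.dense([2.0, 1.0, -1.0])),
     (1.0, Vectors.dense([0.0, 1.2, -0.5]))],
    ["label", "features"])

model = LogisticRegression(maxIter=10, regParam=0.01).fit(train)
print(model.coefficients)

Training like this is dominated by linear algebra over distributed partitions – precisely the kind of mathematical workload that stands to gain from hardware acceleration.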
Spark with GPUs…
While the Apache Spark project does not support GPU deployments out of the box, Databricks [1] has done substantial work integrating GPU support with Spark clusters.
These include the following capabilities [1] –
Cloud-Based Cluster setup: GPU hardware libraries such as CUDA and cuDNN come preconfigured for cloud-based deployments such as Amazon AWS and Azure.
Preconfigured Spark for GPU: Spark clusters are preconfigured to prevent contention for the GPU; reducing Spark parallelism per machine cuts down on context switching and ensures higher throughput.
Cluster management & Security: All of the above capabilities can be launched as isolated and containerized entities.
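As a sanity check on such a cluster, one hedged sketch – assuming the Nvidia driver (and hence nvidia-smi) is installed on the workers – is to ask each executor which GPUs it can see:

import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gpu-check").getOrCreate()
sc = spark.sparkContext

def gpus_on_node(_):
    # Runs nvidia-smi on whichever worker executes this partition.
    yield subprocess.check_output(["nvidia-smi", "-L"]).decode().strip()

# One partition per default slot so the check lands across the executors.
seen = (sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
          .mapPartitions(gpus_on_node)
          .distinct()
          .collect())
print(seen)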
A GPU-based hybrid architecture can replace dozens of CPU servers while serving up a spectrum of Spark-based data use cases, ranging from the simple to the complex (a sketch of the deep learning case follows the list) –
- Business Reporting
- OLAP Reporting
- Machine Learning
- Deep Learning
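To illustrate the deep learning case, here is a hedged sketch of distributing GPU-backed inference with Spark. It assumes a GPU build of TensorFlow is installed on each worker and that a trained Keras model exists at the (hypothetical) path below:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gpu-inference").getOrCreate()
sc = spark.sparkContext

def score_partition(rows):
    # Load the model once per partition; if the worker has a GPU and a
    # GPU build of TensorFlow, the matrix math runs on the GPU.
    import numpy as np
    import tensorflow as tf
    rows = list(rows)
    if not rows:
        return
    model = tf.keras.models.load_model("/models/classifier.h5")  # hypothetical path
    batch = np.array(rows, dtype="float32")
    for score in model.predict(batch):
        yield float(score[0])

# Toy feature vectors standing in for real input data.
features = sc.parallelize([[0.1, 0.5, 0.3], [0.9, 0.2, 0.7]] * 100, 4)
print(features.mapPartitions(score_partition).take(5))

Spark handles the data distribution and scheduling; the deep learning framework on each node handles the GPU – a division of labor that captures the hybrid architecture described above.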
In 2018, we will delve into industry applications that can benefit from bringing these paradigms together, i.e. run natively on GPUs while supporting advanced analytics on a Spark-based architecture. These use cases can see an order-of-magnitude speedup thanks to automatically configured GPU machines that seamlessly run Spark clusters.
3 comments
Vamsi,
Good kickoff post to what seems to be a nice potential series of posts going forward.
Something that you might want to cover for your readers a bit more are the kinds of jobs that are NOT a good fit for Spark on GPUs. A lot of the focus in the market has been on the kinds of things that you can do, but there also seems to be confusion that everything is better on GPUs, which I believe is not the case. Something to consider….
Great point Rob. I tried to make this distinction (somewhat weakly) at the end when I list out the various kinds of enterprise data workloads. However, I will be sure to clarify this as the series picks up steam. Thank you for your comment and your readership.
Will you please provide a proper implementation of deep learning on Spark using TensorFlow with GPU support?