Introduction
Even before using Kubernetes for orchestration, the advantages of packaging Big Data applications and their dependent libraries as containers are manifold. A huge bugbear in advanced Big Data & Data Science development has been the packaging of ensemble models along with their dependencies & other libraries. I have covered these issues at length here [1]. With the increased adoption of containers and the immutable image model, interactions between Data Scientists, Data developers, and Infrastructure teams become much smoother, resulting in four benefits. First, Kubernetes constructs can be used to build a Big Data CI/CD pipeline that is enterprise hardened both from within and from outside the pipeline. Second, the container image format supports a seamless transition of code, so changes make their way from dev to prod, and dev environments with all necessary security settings can be set up in an automated manner. Third, once the image is built, security scans and plugins for monitoring environments can be set up. Finally, deployers have a certified deployment template based on the underlying IaaS (Heat, CloudFormation, etc.), and images are ready for downstream testing, validation & deployment. Tools such as Ansible and Terraform provide a lot of value here from both a dev and an ops standpoint.
Architecture
Spark provides a utility (docker-image-tool.sh) to quickly build and deploy Kubernetes-bound workloads as Docker images [2]. These images can then be made available in any Docker registry and pushed/pulled into runtime environments. Key properties such as the Java version, memory/CPU settings and dependent JARs can be injected into the container using Kubernetes ConfigMaps, while credentials and other confidential information can be injected using Secrets. ConfigMaps and Secrets are recommended for centralized configuration, as plain environment variables can be limiting & can also be overridden by the Deployment object. If the application's dependencies are hosted in remote locations such as HDFS or HTTP servers, they may be referred to by their remote URIs. Alternatively, application dependencies can be pre-mounted into custom-built Docker images; those dependencies can be added to the classpath by referencing them with local:// URIs and/or setting the SPARK_EXTRA_CLASSPATH environment variable in your Dockerfiles. The local:// scheme is also required when referring to dependencies in custom-built Docker images in spark-submit. Note that using application dependencies from the submission client's local file system is currently not supported.
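As a minimal sketch of this workflow (the registry, image tag, API-server address and example JAR below are placeholders), the images can be built and pushed with docker-image-tool.sh from an unpacked Spark distribution, and a job can then be submitted with spark-submit, referencing a JAR baked into the image via a local:// URI:

    # Build and push the Spark images to a private registry
    ./bin/docker-image-tool.sh -r registry.example.com/bigdata -t v2.3.0 build
    ./bin/docker-image-tool.sh -r registry.example.com/bigdata -t v2.3.0 push

    # Submit against the Kubernetes API server in cluster mode; the application
    # JAR lives inside the image, hence the local:// scheme
    ./bin/spark-submit \
      --master k8s://https://<k8s-apiserver-host>:<port> \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.instances=3 \
      --conf spark.kubernetes.container.image=registry.example.com/bigdata/spark:v2.3.0 \
      local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar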
Helm, the Kubernetes package manager, can be used to publish these images individually or as part of a composite application in a catalog. Helm also enables these catalog items to be versioned, permitting easy rollback and roll-forward and hence some degree of model governance. Using Helm integration, Platform9 Managed Kubernetes (PMK) can provision applications and all underlying infrastructure (compute, storage and network) on any IaaS or bare-metal provider. At the time of writing, vendors such as Platform9 provide a rich catalog of runtimes – Spark, Kafka, Tomcat, NGINX, et al – that enable the end-to-end buildout and management of Spark workloads.
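For illustration only (the release name, chart repository and versions below are placeholders, and the commands assume Helm 3 syntax), the versioned install/rollback workflow looks roughly like this:

    # Install a packaged Spark runtime from a configured chart repository
    helm install spark-runtime myrepo/spark --version 1.2.0

    # Upgrade to a newer chart/image version, then inspect the release history
    helm upgrade spark-runtime myrepo/spark --version 1.3.0
    helm history spark-runtime

    # Roll back to revision 1 if the new version misbehaves
    helm rollback spark-runtime 1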
With Spark 2.3, several important features have been introduced to run Spark workloads natively on a Kubernetes 1.7 (or above) cluster [2]. Kubernetes can now schedule Spark workloads onto any set of servers – private cloud, public cloud, virtualized or bare metal. Spark-based applications can be run in a true multi-tenant fashion, leveraging namespaces & quotas across the above landing zones, and can also take advantage of the Kubernetes RBAC model to keep tenants isolated.
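A minimal sketch of such a tenant setup (the namespace, quota values and service-account names are illustrative) could look like this:

    # Carve out a namespace and a resource quota for one Spark tenant
    kubectl create namespace spark-team-a
    kubectl create quota spark-team-a-quota --namespace=spark-team-a \
      --hard=cpu=20,memory=64Gi,pods=50

    # Give the Spark driver a service account whose rights are scoped to the namespace
    kubectl create serviceaccount spark --namespace=spark-team-a
    kubectl create rolebinding spark-edit --clusterrole=edit \
      --serviceaccount=spark-team-a:spark --namespace=spark-team-a

The service account is then handed to Spark via the spark.kubernetes.authenticate.driver.serviceAccountName configuration property.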
Using a managed service such as Platform9 Managed Kubernetes, upgrades to Kubernetes, Spark and their associated libraries, as well as integration with logging & monitoring solutions, become a breeze. Kubernetes also presents several enterprise storage and networking options which fit in well with Spark's local storage model.
Spark on Kubernetes also uses the Service primitive for service discovery and load balancing, which allows Spark-based applications to scale easily and seamlessly without operator involvement. Kubernetes detects unhealthy pods and forwards requests only to healthy ones – especially helpful since memory leaks are all too common in data-intensive applications. Further, the Platform9 managed service not only provisions the Kubernetes-based application onto any target cloud but also provides monitoring using Prometheus and can replicate or migrate the application as dictated by business needs. This enables granular management of Spark services. Platform9 also enables scale-up at the IaaS layer, especially when additional capacity is needed to run more Spark executors. Spark applications can thus be managed as Kubernetes Jobs, which manage their lifecycle. Jobs are similar to ReplicationControllers in that they create multiple pods and ensure they run; unlike a ReplicationController, however, a Job considers the workload complete once its pods terminate successfully.
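To make the Job construct concrete, here is a hedged sketch of a batch Job that wraps the spark-submit launcher (the namespace, service account, image and example class mirror the placeholders used earlier; the in-cluster API address is the Kubernetes default):

    cat <<'EOF' | kubectl apply -f -
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: spark-batch-job
      namespace: spark-team-a
    spec:
      backoffLimit: 2              # retry a failed launch up to twice
      template:
        spec:
          restartPolicy: Never     # let the Job controller handle retries
          serviceAccountName: spark
          containers:
          - name: submit
            image: registry.example.com/bigdata/spark:v2.3.0
            command:
            - /opt/spark/bin/spark-submit
            - --master
            - k8s://https://kubernetes.default.svc
            - --deploy-mode
            - cluster
            - --conf
            - spark.kubernetes.namespace=spark-team-a
            - --conf
            - spark.kubernetes.authenticate.driver.serviceAccountName=spark
            - --conf
            - spark.kubernetes.container.image=registry.example.com/bigdata/spark:v2.3.0
            - --class
            - org.apache.spark.examples.SparkPi
            - local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
    EOF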
Many enterprises that deploy Hadoop installations complain about the high cost of infrastructure when running Hadoop in the public cloud as compared to the private cloud. The gist of it is that Hadoop spins up VMs in the public cloud, and even with spot instances this becomes extremely expensive for many enterprises. One approach to saving on costs is to reduce the memory/resource allocation to the worker nodes. However, this means certain kinds of data-processing workloads end up failing, resulting in errors that cause a lot of break/fix effort on the part of engineers. Transitioning and unifying these workloads under Kubernetes can help ensure a higher level of stability.
A high-level schematic of the Spark/Kubernetes integration is shown below. As mentioned above, from Spark 2.3 onwards, Kubernetes can schedule Spark jobs natively.
To make this concrete, let us consider an example from the banking industry.
The typical flow of data in a bank follows a familiar path –
- Data is captured in large quantities as a result of business processes (customer onboarding, retail bank transactions, payment transactions, et al). These feeds are captured using a combination of techniques – mostly an ESB (Enterprise Service Bus) and message brokers.
- The basic integration workflow is that a spark-submit job is triggered by any of the mechanisms shown on the left – a Kafka-hosted message queue, an API call, or an administrator triggering it manually. The trigger event is data arriving in the bank's book-of-record transaction (BORT) systems.
- The spark-submit invocation also carries the arguments needed to run the Spark Streaming jobs on the Kubernetes cluster (see the spark-submit sketch earlier in this post). The Spark application that performs the T or T+1 processing is directed to the API server, which uses the scheduler to place it. The scheduler creates both the driver and executor pods, which then process the job. Data locality is ideally handled by the Spark framework itself: Spark knows how to find the HDFS NameNode, and the NameNode then maps the Spark executor containers to the HDFS DataNodes.
- Once all of the relevant data has been normalized, transformed and processed, it is copied over into business reporting systems, where it is used to perform a range of functions – typically reporting for use cases such as customer analytics, risk reporting, business reporting and operational improvements.
- The overall flow of data is BORT systems -> Spark Core -> Scheduler -> Spark cluster running on Kubernetes-managed pods. Once the job is finished, the executor pods are reclaimed for cleanup and all resources are freed up by the scheduler, as sketched below.
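As a small illustration of that last step (the namespace and pod name are placeholders; the spark-role labels are the ones Spark's Kubernetes scheduler applies to the pods it creates), the lifecycle can be observed with kubectl:

    # Watch the driver and executor pods while the job runs
    kubectl get pods -n spark-team-a -l spark-role=driver -w
    kubectl get pods -n spark-team-a -l spark-role=executor

    # After completion the executors are reclaimed; the driver pod remains in a
    # Completed state so its logs and exit status can still be retrieved
    kubectl logs -n spark-team-a <driver-pod-name>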
How to design for Rack Locality
Another key optimization is ensuring rack locality – both HDFS and Spark have a 'locality' preference built in for their executors. There are several locality levels; the two main ones we can exploit are NODE and RACK.
This can be tackled in the following way:
– model the relevant IaaS regions as Hadoop 'racks' by configuring rack awareness on the Hadoop nodes (see the topology-script sketch after this list).
– enable the Spark Kubernetes resource-manager/scheduler to advertise 'rack' information for each Kubernetes node.
– carry out configuration changes depending on how Spark & HDFS are distributed.
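On the Hadoop side, rack awareness is typically wired up by pointing net.topology.script.file.name (in core-site.xml) at a topology script. A minimal sketch of such a script, assuming a hand-maintained host-to-region mapping file (both the file and the region names are placeholders; a real deployment might query the cloud provider's metadata API instead), could look like this:

    #!/usr/bin/env bash
    # topology.sh – Hadoop passes one or more hostnames/IPs as arguments and
    # expects one rack path per input, printed in order.
    ZONE_MAP=/etc/hadoop/conf/host-region.map   # lines of the form "<ip> <region>"
    DEFAULT_RACK="/default-rack"

    for host in "$@"; do
      region=$(awk -v h="$host" '$1 == h {print $2}' "$ZONE_MAP")
      if [ -n "$region" ]; then
        echo "/${region}"
      else
        echo "${DEFAULT_RACK}"
      fi
    done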
Looking ahead, a Spark operator is being planned for Kubernetes, which will provide a higher level of integration [3]. As opposed to just deploying a standalone Spark cluster on a set of Kubernetes-managed pods, the operator will provide significant benefits: the lifecycle of Spark-based applications will be managed using Kubernetes constructs such as Deployments, StatefulSets and DaemonSets, and Spark-based workloads can be monitored using Kubernetes-native projects such as Prometheus.
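For a flavour of what that looks like, here is a hedged sketch of a SparkApplication custom resource as described in the operator's documentation [3] (the API version and field names may differ between operator releases; names, image and resource figures are placeholders):

    cat <<'EOF' | kubectl apply -f -
    apiVersion: sparkoperator.k8s.io/v1beta2
    kind: SparkApplication
    metadata:
      name: spark-pi
      namespace: spark-team-a
    spec:
      type: Scala
      mode: cluster
      image: registry.example.com/bigdata/spark:v2.3.0
      mainClass: org.apache.spark.examples.SparkPi
      mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
      sparkVersion: "2.3.0"
      driver:
        cores: 1
        memory: 1g
        serviceAccount: spark
      executor:
        instances: 3
        cores: 1
        memory: 2g
    EOF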
References
[1] Data Science in the Cloud, a.k.a. Models as a Service (MaaS)
[2] Running Spark on Kubernetes (Apache Spark documentation) – https://spark.apache.org/docs/latest/running-on-kubernetes.html
[3] Spark Operator for Kubernetes (design document) – https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/design.md