Running Spark Jobs on a Remote Databricks Cluster using Databricks Connect
Databricks Connect is a client library for Databricks Runtime. It allows you to write jobs using Spark APIs and run them remotely on a Databricks cluster instead of in the local Spark session.
This lets developers work locally in their preferred IDE while running workloads remotely on a Databricks cluster, which has far more processing power than a local Spark session.
Supported Databricks Runtimes
-------------------------------------------------------------------
8.1 ML, 8.1
7.3 LTS ML, 7.3 LTS
6.4 ML, 6.4
5.5 LTS ML, 5.5 LTS
I will walk you through, step by step, how to achieve this with the Python and Scala APIs for Spark. In this example, I will be using the environments and tools below.
Azure Databricks Service
Visual Studio Code, IntelliJ IDEA
Python 3.8, JDK 1.8, Scala 2.12.13
Miniconda installed on a PC
Hadoop set up on Windows with the winutils fix
PySpark Edition
First, create a virtual environment using the Conda prompt:
conda create --name envdbconnect python=3.8
conda activate envdbconnect
The compatible Python versions are as below.
Databricks Runtime Version      Python Version
------------------------------  --------------
8.1 ML, 8.1                     3.8
7.3 LTS ML, 7.3 LTS             3.7
6.4 ML, 6.4                     3.7
5.5 LTS ML                      3.6
5.5 LTS                         3.5
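The databricks-connect client requires the local Python minor version to match the cluster's, so it is worth confirming the interpreter inside the new environment before going further:
python --version
This should report Python 3.8.x for the Databricks Runtime 8.1 cluster used in this walkthrough.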
Setting up the Client
Uninstall any locally installed PySpark, since it conflicts with the databricks-connect package installation.
pip uninstall pyspark
Install the Databricks Connect client that matches your cluster's Databricks Runtime version.
pip install -U "databricks-connect==8.1.*"
Once the installation is completed, we need to configure the client.
Note down the following items before starting the configuration.
Workspace URL
Personal Access Token
Navigate to the User Settings pane, generate a new token, and note it down.
Cluster ID
Navigate to the Clusters pane, open the configuration page of the cluster you will be using for the workloads, and note the cluster ID from the page URL.
Eg: 0331-151109-issue277
Organization Id
This is an Azure-specific parameter; it is the value of the ?o= query parameter in the workspace URL.
Port
Use the default port 15001 unless your setup uses a different one.
Once you have collected this information, execute the command below.
databricks-connect configure
It will prompt you to enter the values collected above.
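As a rough illustration, the prompted values look like the following; everything here is a placeholder apart from the cluster ID format shown earlier:
Databricks Host: https://adb-1234567890123456.7.azuredatabricks.net
Databricks Token: dapi0123456789abcdef
Cluster ID: 0331-151109-issue277
Org ID: 1234567890123456
Port: 15001
Once configured, you can verify the setup and connectivity with:
databricks-connect test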
Next, open VS Code and install the Python extension from the Extensions pane. Then press Ctrl+Shift+P, run "Python: Select Interpreter", and pick the Conda environment you set up earlier.
Create a new Python file for the sample code shown further below, which reads a parquet file from ADLS Gen2 that is mounted to the Databricks cluster.
Databricks Connect does not allow you to mount or unmount storage on the cluster (through Databricks Connect, dbutils exposes only the fs and secrets utilities). So mount the storage from a Databricks notebook first, as sketched below, and continue the development in the IDE.
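A minimal sketch of such a mounting notebook cell is shown here, assuming an ADLS Gen2 account accessed through a service principal; the storage account, container, tenant, secret scope, and key names are all placeholders for illustration.
# Run this in a Databricks notebook, not through Databricks Connect.
# <storage-account>, <container>, <tenant-id>, and the secret scope/key names are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get(scope="kv-scope", key="sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="kv-scope", key="sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Mount the container so it is reachable at /mnt/ny from both notebooks and Databricks Connect
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/ny",
    extra_configs=configs,
)
Once the mount exists on the cluster, the following code can be run from the IDE through Databricks Connect.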
from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils

# Connects to the remote Databricks cluster using the databricks-connect configuration
spark = SparkSession.builder.getOrCreate()

# DBUtils gives access to the cluster's mounts from the local IDE
dbutils = DBUtils(spark)
dbutils.fs.ls("/mnt/ny")

# Read the mounted parquet file and display a sample of the data
df = spark.read.parquet('/mnt/ny/NYC_taxi_2009-2016.parquet')
df.show()
The DataFrame output is printed in the local console, while the job itself runs on the remote cluster.
Note
If you are using Databricks Connect on Windows, you may encounter the error below due to a missing or misconfigured Hadoop/winutils setup.
Failed to locate the winutils binary in the Hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
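A common workaround is to download a winutils.exe build, place it under a bin folder, and point HADOOP_HOME at that folder's parent before the first SparkSession is created. The sketch below assumes the file sits at C:\hadoop\bin\winutils.exe; adjust the path to your setup.
import os

# Assumed layout: C:\hadoop\bin\winutils.exe - adjust to wherever you placed the binary
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] = os.environ["HADOOP_HOME"] + r"\bin;" + os.environ["PATH"]

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
Setting HADOOP_HOME as a system environment variable (and restarting the IDE) achieves the same result without any code changes.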
The same exercise can be done in an IDE like IntelliJ IDEA. Next, I will show you how it is done with Scala.
Scala Edition
First, open the Conda prompt and execute the commands below.
conda activate envdbconnect
databricks-connect get-jar-dir
This returns the path of the JAR files shipped with the client.
c:\users\ceky\.conda\envs\envdbconnect\lib\site-packages\pyspark\jars
Then create a Scala project using sbt in IntelliJ IDEA. Add the lines below to the build.sbt file and resolve the dependencies.
unmanagedBase := new java.io.File("c:\\users\\ceky\\.conda\\envs\\envdbconnect\\lib\\site-packages\\pyspark\\jars")
mainClass := Some("com.example.dbf")
Create a Scala object as below in the project, then build and execute it.
// Package must match the mainClass set in build.sbt (com.example.dbf)
package com.example

import org.apache.spark.sql.SparkSession

object dbf {
  def main(args: Array[String]): Unit = {
    // The databricks-connect configuration routes this session to the remote cluster
    val spark = SparkSession.builder()
      .master("local")
      .getOrCreate()

    // Read the parquet file from the cluster mount and display a sample
    val cu1 = spark.read.parquet("/mnt/ny/NYC_taxi_2009-2016.parquet")
    cu1.show()
  }
}
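With the dependencies resolved, sbt run (or the IDE's Run action) compiles the project and launches the main class. As with the PySpark example, the read and show are executed on the remote Databricks cluster, and the resulting rows are printed in the local run console.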
I hope this article serves as a quick-start guide to using databricks-connect with your favorite IDE.