Running Spark Jobs on a Remote Databricks Cluster using Databricks Connect

Charith Ekanayake
4 min read · Apr 23, 2021

Databricks Connect is a client library for Databricks Runtime. It allows you to write jobs using Spark APIs and run them remotely on a Databricks cluster instead of in the local Spark session.

This lets developers work locally in the IDE they prefer and run the workload remotely on a Databricks cluster, which has far more processing power than a local Spark session.
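
As a minimal sketch of the idea (assuming the client configuration covered later in this article is already in place), a script like the following is edited and launched locally, but does its work on the remote cluster:

from pyspark.sql import SparkSession

# With databricks-connect configured (see below), this session is wired
# to the remote Databricks cluster rather than a local Spark instance.
spark = SparkSession.builder.getOrCreate()

# The count is computed on the cluster; only the result comes back locally.
print(spark.range(0, 1000000).count())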

Supported Databricks Runtimes
-------------------------------------------------------------------
8.1 ML, 8.1
7.3 LTS ML, 7.3 LTS
6.4 ML, 6.4
5.5 LTS ML, 5.5 LTS

I will walk you through, step by step, how to achieve this with the Python and Scala APIs for Spark. In this example, I will be using the environments and tools below.

Azure Databricks Service

Visual Studio Code, IntelliJ IDEA

Python 3.8, JDK 1.8, Scala 2.12.13

Miniconda installed on the PC

Hadoop setup on Windows with winutils fix

PySpark Edition

First, I will create a virtual environment from the Conda prompt:

conda create --name envdbconnect python=3.8
conda activate envdbconnect

The compatible Python versions are as below.

Databricks Runtime Version      Python Version
------------------------------  --------------
8.1 ML, 8.1                     3.8
7.3 LTS ML, 7.3 LTS             3.7
6.4 ML, 6.4                     3.7
5.5 LTS ML                      3.6
5.5 LTS                         3.5

Setting up the Client

Uninstall any locally installed PySpark if it is already there, since it will conflict with the databricks-connect package:

pip uninstall pyspark

Install the Databricks Connect client whose major and minor versions match your cluster's Databricks Runtime version (available client versions):

pip install -U "databricks-connect==8.1.*"

Once the installation is completed, we need to configure the client.

The following items need to be noted down before the configuration.

Workspace URL

Eg : https://adb-1234503031867986.6.azuredatabricks.net

Personal Access Token

Navigate to the User Settings pane, generate a new token, and note it down.

Cluster ID

Navigate to the Clusters pane and open the configuration page of the cluster you will be using for the workloads, then note the cluster ID segment of the URL (see the example URL after this list).

E.g.: 0331-151109-issue277

Organization ID

This is an Azure-specific parameter; it is the number that appears after ?o= in the workspace URL.

Port

The default port is 15001.
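
Putting the pieces together, the cluster configuration URL has roughly the following shape (the exact fragment may differ slightly between workspace versions); the organization ID is the number after ?o= and the cluster ID is the segment before /configuration:

https://adb-1234503031867986.6.azuredatabricks.net/?o=1234503031867986#setting/clusters/0331-151109-issue277/configuration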

Once you have collected this information, execute the command below.

databricks-connect configure

It will prompt you for the values noted down above.
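
The exact prompt wording varies a little between client versions, but the values requested are the ones collected above, roughly:

Databricks Host: https://adb-1234503031867986.6.azuredatabricks.net
Databricks Token: <personal-access-token>
Cluster ID: 0331-151109-issue277
Org ID: 1234503031867986
Port: 15001

After the configuration is written, you can run databricks-connect test to verify that the client can start a session against the cluster.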

As the next step, open VS Code and install the Python extension from the Extensions pane. Then press Ctrl+Shift+P, run "Python: Select Interpreter", and pick the Conda environment you set up earlier.

Create a new Python file and run the code below. I have added a sample where I read a Parquet file from ADLS Gen2 that is mounted to the Databricks workspace.

Databricks Connect does not allow you to mount or unmount storage on the cluster, so you need to create the mount from a notebook first, as sketched below, and then continue the development in the IDE.
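
For reference, here is a minimal sketch of such a mount notebook cell, assuming ADLS Gen2 access via a service principal; the storage account, container, secret scope, and client values are placeholders, not the ones used in this article:

# Run this in a Databricks notebook, not through Databricks Connect.
# All names below are placeholders for your own ADLS Gen2 setup.
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "<application-id>",
  "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<secret-name>"),
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

dbutils.fs.mount(
  source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
  mount_point="/mnt/ny",
  extra_configs=configs
)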

from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils

# Both the session and the dbutils handle point at the remote Databricks cluster
spark = SparkSession.builder.getOrCreate()
dbutils = DBUtils(spark)

# List the mounted folder, then read the Parquet file and show a sample
dbutils.fs.ls("/mnt/ny")
df = spark.read.parquet('/mnt/ny/NYC_taxi_2009-2016.parquet')
df.show()

The output of df.show() is printed in the local terminal, even though the read itself ran on the cluster.

Note

If you are using Databricks Connect on Windows, you may experience the below error due to incorrect Hadoop configuration.

Failed to locate the winutils binary in the Hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

How to resolve the winutils issue
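
In short, the usual fix is to download a winutils.exe matching your Hadoop version, place it under a folder such as C:\hadoop\bin (the path here is only an illustration), point HADOOP_HOME at that folder, and add %HADOOP_HOME%\bin to the PATH, for example from a Command Prompt:

setx HADOOP_HOME "C:\hadoop"

Restart the terminal and the IDE afterwards so the new environment variable is picked up.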

The same exercise can be done in an IDE like IntelliJ IDEA. I will show you how it is done with Scala.

Scala Edition

First, open the Conda prompt and execute the commands below:

conda activate envdbconnect
databricks-connect get-jar-dir

You will get the path of the JAR files that ship with the client.

c:\users\ceky\.conda\envs\envdbconnect\lib\site-packages\pyspark\jars

Then create a Scala project using sbt in IntelliJ IDEA. Add the lines below to the build.sbt file, then resolve the dependencies.

unmanagedBase := new java.io.File("c:\\users\\ceky\\.conda\\envs\\envdbconnect\\lib\\site-packages\\pyspark\\jars")
mainClass := Some("com.example.dbf")

Create a Scala object as below in the project, then build and execute it.

// Package must match the mainClass set in build.sbt (com.example.dbf)
package com.example

import org.apache.spark.sql.SparkSession

object dbf {
  def main(args: Array[String]): Unit = {
    // The session is created against the remote Databricks cluster
    // configured through databricks-connect
    val spark = SparkSession.builder()
      .master("local")
      .getOrCreate()

    // Read the mounted Parquet file and show a sample of the rows
    val cu1 = spark.read.parquet("/mnt/ny/NYC_taxi_2009-2016.parquet")
    cu1.show()
  }
}
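
With mainClass set in build.sbt, running the project (sbt run, or the Run action in IntelliJ IDEA) submits the job to the remote cluster just like the PySpark example; remember to replace the unmanagedBase path with the directory returned by databricks-connect get-jar-dir on your own machine.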

I hope this article helps you as a quick-start guide to using databricks-connect with your favorite IDE.

References

Microsoft Documentation

Winutils dependencies
