```mermaid
flowchart LR
  lp[User's machine]
  sp[Spark]
  lp <-. Network .-> sp
  style lp fill:#fff,stroke:#666
  style sp fill:#f4c430,stroke:#666
```
Edgar Ruiz @ Posit
‘Thin client, with the full power of Apache Spark’
- Introduced a decoupled client-server architecture
- Enables remote connectivity to Spark clusters
- Allows R users to interact with a cluster from their preferred environment, laptop or otherwise
Databricks Connect is based on the Spark Connect architecture, and is available in DBR version 13+.
```mermaid
flowchart LR
  lp[User's machine]
  sp[Spark]
  lp <-. Network .-> sp
  style lp fill:#fff,stroke:#666
  style sp fill:#f4c430,stroke:#666
```
- Uses a remote procedure call framework named gRPC
- Uses Torch for the ML capabilities (Spark 3.5+)
PySpark offers the best integration with Spark Connect.
```mermaid
flowchart LR
  subgraph lp[User's machine]
    ps[PySpark]
    g1[gRPC]
  end
  sp[Spark]
  g1 <-. Network .-> sp
  ps --> g1
  style ps fill:#eff,stroke:#666
  style lp fill:#fff,stroke:#666
  style sp fill:#f4c430,stroke:#666
  style g1 fill:#447099,stroke:#666,color:#fff
```
sparklyr integrates with PySpark via reticulate, and extends the functionality and user experience:

- dplyr back-end
- DBI back-end

```mermaid
flowchart LR
  subgraph lp[User's machine]
    sr[sparklyr]
    rt[reticulate]
    ps[PySpark]
    g1[gRPC]
  end
  sp[Spark]
  sr --> rt
  rt --> ps
  g1 <-. Network .-> sp
  ps --> g1
  style sr fill:#d0efb1,stroke:#666
  style rt fill:#d0efb1,stroke:#666
  style ps fill:#eff,stroke:#666
  style lp fill:#fff,stroke:#666
  style sp fill:#f4c430,stroke:#666
  style g1 fill:#447099,stroke:#666,color:#fff
```
No need to install Java on my machine! 🎉
The new functionality is available in a sparklyr extension called pysparklyr. It contains a function that creates and prepares your Python environment.
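That function is install_pyspark(); the console output below comes from running it:

pysparklyr::install_pyspark()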
#> Using Python: /Users/edgar/.pyenv/versions/3.10.13/bin/python3.10
#> Creating virtual environment 'r-sparklyr' ...
#> + /Users/edgar/.pyenv/versions/3.10.13/bin/python3.10 -m venv /Users/edgar/.virtualenvs/r-sparklyr
#> Done!
#> Installing packages: pip, wheel, setuptools
#> + /Users/edgar/.virtualenvs/r-sparklyr/bin/python -m pip install --upgrade --no-user pip wheel setuptools
#> Collecting pip
#> Using cached pip-23.2.1-py3-none-any.whl (2.1 MB)
...
#> Successfully installed MarkupSafe-2.1.3 PyArrow-13.0.0 cachetools-5.3.1 certifi-2023.7.22 charset-normalizer-3.2.0 databricks-connect-13.3.0 databricks-sdk-0.8.0 delta-spark-2.4.0 filelock-3.12.4 google-api-core-2.11.1 google-api-python-client-2.99.0 google-auth-2.23.0 google-auth-httplib2-0.1.1 googleapis-common-protos-1.60.0 grpcio-1.58.0 grpcio_status-1.58.0 httplib2-0.22.0 idna-3.4 importlib-metadata-6.8.0 jinja2-3.1.2 mpmath-1.3.0 networkx-3.1 numpy-1.25.2 pandas-2.1.0 protobuf-4.24.3 py4j-0.10.9.7 pyasn1-0.5.0 pyasn1-modules-0.3.0 pyparsing-3.1.1 pyspark-3.4.1 python-dateutil-2.8.2 pytz-2023.3.post1 requests-2.31.0 rsa-4.9 six-1.16.0 sympy-1.12 torch-2.0.1 torcheval-0.0.7 typing-extensions-4.7.1 tzdata-2023.3 uritemplate-4.1.1 urllib3-1.26.16 zipp-3.16.2
Installs the following Python libraries:
- pyspark
- pandas
- PyArrow
- grpcio
- grpcio_status
- delta-spark
- google-api-python-client
- databricks-connect (for use with Databricks)
- torch (Spark 3.5+)
- torcheval (Spark 3.5+)

Needs Python 3.9 or higher; install_pyspark() will check for you.
pysparklyr::spark_connect_service_start()
#> Starting Spark Connect locally ...
#> starting org.apache.spark.sql.connect.service.SparkConnectServer, logging to
#> /Users/edgar/spark/spark-3.4.0-bin-hadoop3/logs/spark-edgar-org.apache.spark.sql.connect.service.SparkConnectServer-1-Edgars-work-laptop.local.out
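Once the service is running, connect to it from R. A minimal sketch, assuming the local service started above and the "spark_connect" method that pysparklyr registers (the version should match your local Spark, 3.4 here):

```r
library(sparklyr)

# Connect to the local Spark Connect service started above
sc <- spark_connect(
  master  = "sc://localhost",
  method  = "spark_connect",
  version = "3.4"
)
```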
‘Use from any application, running anywhere’
- Based on Spark Connect
- Available on DBR 13 onward
- Databricks Connect becomes a thin client that is simple and easy to use
sparklyr 1.8.3, along with pysparklyr, supports Databricks Connect:
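A minimal connection sketch, with a hypothetical placeholder for the cluster ID; the "databricks_connect" method comes from pysparklyr:

```r
library(sparklyr)

sc <- spark_connect(
  cluster_id = "[your cluster's ID]",  # hypothetical placeholder
  method     = "databricks_connect"
)
```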
sparklyr uses these environment variables, if provided:

- DATABRICKS_HOST - Your org's address
- DATABRICKS_TOKEN - Your personal token
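For example, a sketch with hypothetical values, plus the dplyr reference behind the trips table printed below (in_catalog() comes from dbplyr; the catalog, schema, and table names match the source shown in the output):

```r
# Hypothetical credentials; replace with your workspace values
Sys.setenv(
  DATABRICKS_HOST  = "https://example.cloud.databricks.com",
  DATABRICKS_TOKEN = "dapi-XXXXXXXXXXXX"
)

library(dplyr)
library(dbplyr)

# Point to the sample NYC taxi table shown in the output below
trips <- tbl(sc, in_catalog("samples", "nyctaxi", "trips"))
```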
trips
#> # Source: spark<`samples`.`nyctaxi`.`trips`> [?? x 6]
#> tpep_pickup_datetime tpep_dropoff_datetime trip_distance fare_amount
#> <dttm> <dttm> <dbl> <dbl>
#> 1 2016-02-14 10:52:13 2016-02-14 11:16:04 4.94 19
#> 2 2016-02-04 12:44:19 2016-02-04 12:46:00 0.28 3.5
#> 3 2016-02-17 11:13:57 2016-02-17 11:17:55 0.7 5
#> 4 2016-02-18 04:36:07 2016-02-18 04:41:45 0.8 6
#> 5 2016-02-22 08:14:41 2016-02-22 08:31:52 4.51 17
#> 6 2016-02-05 00:45:02 2016-02-05 00:50:26 1.8 7
#> 7 2016-02-15 09:03:28 2016-02-15 09:18:45 2.58 12
#> 8 2016-02-25 13:09:26 2016-02-25 13:24:50 1.4 11
#> 9 2016-02-13 10:28:18 2016-02-13 10:36:36 1.21 7.5
#> 10 2016-02-13 18:03:48 2016-02-13 18:10:24 0.6 6
#> # ℹ more rows
#> # ℹ 2 more variables: pickup_zip <int>, dropoff_zip <int>
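A plausible aggregation for the output below; a sketch rather than the exact code from the talk, with column names taken from the printed result:

```r
# Average trip distance per pickup ZIP code; evaluated lazily on the cluster
zip_summary <- trips |>
  group_by(pickup_zip) |>
  summarise(
    count        = n(),
    avg_distance = mean(trip_distance, na.rm = TRUE)
  )

zip_summary
```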
#> # Source: spark<?> [?? x 3]
#> pickup_zip count avg_distance
#> <int> <dbl> <dbl>
#> 1 10032 15 4.49
#> 2 10013 273 2.98
#> 3 10022 519 2.00
#> 4 10162 414 2.19
#> 5 10018 1012 2.60
#> 6 11106 39 2.03
#> 7 10011 1129 2.29
#> 8 11103 16 2.75
#> 9 11237 15 3.31
#> 10 11422 429 15.5
#> # ℹ more rows
The dplyr pipeline is translated and sent to the cluster as SQL. To view the query, use show_query().
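For instance, assuming the zip_summary sketch above:

```r
# Inspect the SQL that dplyr will send to the cluster
zip_summary |> show_query()
```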
What is supported, and some advanced functionality
- Most of the dplyr and DBI APIs
- invoke() command (see the sketch after this list)
- Connections Pane navigation
- PAT for Databricks Connect
- Most read and write commands
- ML functions
- SDF functions
- tidyr
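A small sketch of invoke(), which calls methods on the underlying Spark objects (assuming the trips table from earlier):

```r
# Call the Spark DataFrame's count() method directly
trips |>
  spark_dataframe() |>
  invoke("count")
```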
Easy to interact with the underlying Python components
Spark Connect 3.5+ supports GLM. It can be accessed via reticulate today:
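A rough sketch of what that can look like. pyspark.ml.connect is PySpark 3.5's Spark Connect-compatible ML module; the python_sdf() helper used to reach the underlying PySpark DataFrame is a hypothetical stand-in, and the column choices simply mirror the output below:

```r
library(sparklyr)
library(reticulate)

# Copy mtcars to the cluster
tbl_mtcars <- copy_to(sc, mtcars, overwrite = TRUE)

# PySpark 3.5's Spark Connect-compatible ML module
lr_mod <- import("pyspark.ml.connect.classification")

# ASSUMPTION: python_sdf() stands in for however the underlying PySpark
# DataFrame is obtained from a sparklyr tbl; treat it as hypothetical
df <- python_sdf(tbl_mtcars)

# Build an array "features" column from mpg and wt, matching the output below
df <- df$selectExpr("*", "array(mpg, wt) as features")

# Fit a logistic regression predicting `am`
model <- lr_mod$LogisticRegression(
  featuresCol = "features",
  labelCol    = "am"
)$fit(df)

# Score and bring the predictions back to R
preds   <- model$transform(df)
r_preds <- preds$toPandas()
```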
head(r_preds, 3)
#> mpg cyl disp hp drat wt qsec vs am gear carb features prediction probability
#> 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 21.00, 2.62 1 0.4157636, 0.5842364
#> 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 21.000, 2.875 1 0.4890891, 0.5109109
#> 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 22.80, 2.32 1 0.2807752, 0.7192248
A quick review and link sharing
- sparklyr supports Spark & Databricks Connect via the pysparklyr extension
- pysparklyr reaches the PySpark API through reticulate
- Spark Connect overview - https://spark.apache.org/docs/latest/spark-connect-overview.html
- Databricks Connect v2 intro post - https://www.databricks.com/blog/2023/04/18/use-databricks-anywhere-databricks-connect-v2.html
- Using sparklyr with Connect article - https://spark.rstudio.com/deployment/databricks-spark-connect.html
- Talk GitHub repo - https://github.com/edgararuiz/talks/tree/main/databricks