Spark Node

Load data from Apache Spark into Visual Notebooks.

Prerequisites

Follow the steps below to add credentials for Spark using Databricks.

Sign in to Databricks
Hover over the left-hand navigation menu, select the Data Science & Engineering workspace, then select Compute
Select a running cluster
Scroll to the bottom of the cluster configuration, expand Advanced Options, and click the JDBC/ODBC tab
Copy the values for Server Hostname, Port, and HTTP Path into a notepad
Click your username in the top right-hand corner and select User Settings
Select the Access Tokens tab and click Generate new Token
Copy the token into your notepad
In Visual Notebooks, drag a Spark node onto the canvas
Select the gear icon beside the Credential field
Select the plus sign in the upper right corner
Paste the Server Hostname and Port, joined as Server Hostname:Port, into the ServiceEndpoint field, paste the HTTP Path value into the HTTP path field, and paste your token into the Password field
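
The values copied in the steps above map onto the credential form roughly as sketched below. This is only an illustration: the SparkCredential class, its field names, and the build_credential helper are hypothetical and not part of Visual Notebooks, and the hostname, HTTP path, and token are placeholders.

```python
from dataclasses import dataclass


@dataclass
class SparkCredential:
    """Hypothetical container mirroring the credential form fields."""
    service_endpoint: str  # "Server Hostname:Port"
    http_path: str         # HTTP Path copied from the JDBC/ODBC tab
    password: str          # the Databricks access token


def build_credential(hostname: str, port: int, http_path: str, token: str) -> SparkCredential:
    """Combine the values copied from Databricks into the form's fields."""
    hostname = hostname.strip()
    if not hostname or "://" in hostname:
        # The endpoint is assumed to be a bare hostname, without a URL scheme.
        raise ValueError("hostname must be a bare hostname such as example.cloud.databricks.com")
    if not 1 <= port <= 65535:
        raise ValueError("port must be between 1 and 65535")
    return SparkCredential(
        service_endpoint=f"{hostname}:{port}",
        http_path=http_path.strip(),
        password=token.strip(),
    )


# Placeholder values only; substitute the ones copied into your notepad.
cred = build_credential(
    " example.cloud.databricks.com ",
    443,
    "placeholder/http/path",
    "placeholder-token",
)
print(cred.service_endpoint)
```

Trimming stray whitespace before pasting avoids a common source of failed connections when values are copied from a notepad.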

Configuration

Field	Description
Name (Optional)	A user-specified node name displayed in the canvas.
Credential (Required)	The information needed to access Spark data. Select a saved credential from the dropdown menu. Select the gear icon to add a new credential or delete existing credentials.
Schema (Required)	The name of the desired Spark schema. Select the schema from the auto-populated dropdown menu.
Select Table or Define Query (Required)	The data to upload. Select the table you want to upload from the auto-populated dropdown menu, or enter a SQL query that returns the desired data.
Filter by Value (Optional)	Filters to apply to the data. Use the dropdown fields to filter results. Filter options include is null, is not null, is equal, is not equal, begins with, ends with, in between, is less than, is less than or equal to, is greater than, and is greater than or equal to. Filters can be applied to columns of any data type. When you add more than one filter, the filters are combined with "And" logic, so a row is returned only if it satisfies every filter.
Select column to partition with (Optional)	The column to use when partitioning the data. Enter the name of a column in the table. If a column is specified, Visual Notebooks partitions the data using that column and creates multiple "parts" to improve performance.
Number of partitions (Default: 100)	The number of partitions to create when uploading the data. Enter an integer. The data is partitioned into the specified number of parts.
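
To make the partitioning fields more concrete, the sketch below shows one common way to split a numeric column into a fixed number of parts: dividing the column's value range into contiguous sub-ranges of nearly equal size. This page does not describe how Visual Notebooks partitions data internally, so the strategy and the partition_ranges helper are assumptions for illustration only.

```python
from typing import List, Tuple


def partition_ranges(lower: int, upper: int, num_partitions: int) -> List[Tuple[int, int]]:
    """Split the inclusive range [lower, upper] into contiguous inclusive sub-ranges."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if upper < lower:
        raise ValueError("upper must be greater than or equal to lower")
    total = upper - lower + 1
    # Never create more parts than there are distinct values.
    parts = min(num_partitions, total)
    base, extra = divmod(total, parts)
    ranges = []
    start = lower
    for i in range(parts):
        # The first `extra` parts each take one additional value.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Hypothetical example: an integer column with values 1 through 150, split into 4 parts.
print(partition_ranges(1, 150, 4))
# [(1, 38), (39, 76), (77, 113), (114, 150)]
```

Under this assumed scheme, each part can be read independently, which is why a column with evenly spread values tends to make a better partition column than one dominated by a few repeated values.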

Node Inputs/Outputs

Input	None
Output	Visual Notebooks returns a table, called a dataframe, that contains all uploaded data. Columns are labeled and include a symbol that specifies the data type of that column.

Figure 1: Example dataframe output

Examples

First, upload data from Spark with the default settings.

Select the Spark schema and table that contain the desired data.
Select Run to create a dataframe in Visual Notebooks.

Figure 2: Example dataframe created from a Spark table

Now add a filter to limit the amount of data returned.

Select a column to filter by.
Select a condition and define a value.

Figure 3 shows the rows of the "iris" table whose "Id" value is less than or equal to 10.

Figure 3: Example dataframe filtered by value
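
The filter behavior can be sketched in plain Python as follows. This is an illustration of the "And" semantics only, not the Visual Notebooks implementation: the apply_filters helper is hypothetical, only a subset of the filter conditions is modeled, and the rows are synthetic placeholders rather than the real "iris" data.

```python
import operator

# A subset of the filter conditions, mapped to Python comparison functions.
CONDITIONS = {
    "is equal": operator.eq,
    "is not equal": operator.ne,
    "is less than": operator.lt,
    "is less than or equal to": operator.le,
    "is greater than": operator.gt,
    "is greater than or equal to": operator.ge,
}


def apply_filters(rows, filters):
    """Keep rows that satisfy every (column, condition, value) filter ("And" logic)."""
    return [
        row
        for row in rows
        if all(CONDITIONS[cond](row[col], value) for col, cond, value in filters)
    ]


# Synthetic placeholder rows with "Id" values 1 through 100.
rows = [
    {"Id": i, "Species": "species_a" if i <= 50 else "species_b"}
    for i in range(1, 101)
]

# One filter: "Id" is less than or equal to 10.
filtered = apply_filters(rows, [("Id", "is less than or equal to", 10)])
print(len(filtered))  # 10

# Two filters: a row must satisfy both to be kept.
both = apply_filters(
    rows,
    [("Id", "is less than or equal to", 60), ("Species", "is equal", "species_b")],
)
print([row["Id"] for row in both])  # [51, 52, 53, 54, 55, 56, 57, 58, 59, 60]
```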

Upload a subset of the data using a SQL query.

Select a Spark schema.
Write a query that returns the desired data.
Select Run to create a dataframe in Visual Notebooks.

Figure 4 shows a SQL query that returns the "Id" and "Species" columns for the first ten rows of the "iris" table.

Figure 4: Example dataframe created from a SQL query
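
The exact query in Figure 4 is not reproduced in the text, so the sketch below shows one query of that shape. To keep the example runnable without a Spark cluster, it uses Python's built-in sqlite3 module as a stand-in for Spark SQL, with a small placeholder table named iris. The table contents and the SepalLengthCm column are made up for illustration, and the ORDER BY clause is an added assumption to make "first ten rows" deterministic.

```python
import sqlite3

# In-memory database standing in for the Spark schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE iris (Id INTEGER PRIMARY KEY, SepalLengthCm REAL, Species TEXT)"
)
# Placeholder rows, not the real iris measurements.
conn.executemany(
    "INSERT INTO iris VALUES (?, ?, ?)",
    [(i, 5.0, "placeholder") for i in range(1, 31)],
)

# Return only two columns, limited to the first ten rows by Id.
query = "SELECT Id, Species FROM iris ORDER BY Id LIMIT 10"
result = conn.execute(query).fetchall()
conn.close()

print(len(result))  # 10
print(result[0])    # (1, 'placeholder')
```

Selecting only the needed columns and limiting the row count in the query reduces the amount of data transferred before the dataframe is created.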
