
Spark Node

Load data from Apache Spark into Visual Notebooks.

Prerequisites

Follow the steps below to add credentials for Spark using Databricks.

  1. Sign in to Databricks.
  2. Hover over the left-hand navigation menu, select the Data Science and Engineering workspace, then select Compute.
  3. Select a running cluster.
  4. Scroll to the bottom of the cluster configuration, expand Advanced Options, and click the JDBC/ODBC tab.
  5. Copy the values for Server Hostname, Port, and HTTP Path into a notepad.
  6. Click your username in the top right-hand corner and select User Settings.
  7. Select the Access Tokens tab and click Generate New Token.
  8. Copy the token into your notepad.
  9. In Visual Notebooks, drag a Spark node onto the canvas.
  10. Select the gear icon beside the Credential field.
  11. Select the plus sign in the upper right corner.
  12. Paste the Server Hostname and Port (as Server Hostname:Port) into the Service Endpoint input, the HTTP Path into the HTTP Path input, and your token into the Password input.
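The values collected above are the standard ingredients of a Databricks JDBC connection. As a rough illustration of how they fit together (a sketch only — the exact URL parameters depend on your JDBC driver version, and every value below is a placeholder, not a real credential):

```python
def build_databricks_jdbc_url(server_hostname: str, port: int,
                              http_path: str, token: str) -> str:
    """Assemble a Databricks-style JDBC URL from the values copied above.

    The parameter names (transportMode, httpPath, AuthMech, UID, PWD)
    follow common Databricks JDBC driver conventions; check your
    driver's documentation for the exact set it expects.
    """
    return (
        f"jdbc:databricks://{server_hostname}:{port}/default;"
        f"transportMode=http;ssl=1;"
        f"httpPath={http_path};"
        f"AuthMech=3;UID=token;PWD={token}"
    )

# Placeholder values standing in for steps 5 and 8 above.
url = build_databricks_jdbc_url(
    "adb-1234567890123456.7.azuredatabricks.net", 443,
    "sql/protocolv1/o/1234567890123456/0123-456789-abcdefgh",
    "dapiXXXXXXXXXXXXXXXX",  # personal access token from step 8
)
```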

Configuration

| Field | Description |
| --- | --- |
| Name *(Optional)* | A user-specified node name displayed in the canvas. |
| Credential *(Required)* | The information needed to access Spark data. Select a saved credential from the dropdown menu, or select the gear icon to add a new credential or delete existing credentials. |
| Schema *(Required)* | The name of the desired Spark schema. Select the schema from the auto-populated dropdown menu. |
| Select Table or Define Query *(Required)* | The data to upload. Select the table you want to upload from the auto-populated dropdown menu, or enter a SQL query that returns the desired data. |
| Filter by Value *(Optional)* | Filters to apply to the data. Use the dropdown fields to filter results. Filter options include is null, is not null, is equal, is not equal, begins with, ends with, in between, is less than, is less than or equal to, is greater than, and is greater than or equal to. Filters can be applied to any column data type. Add additional filters to create "And" conditional logic. |
| Select column to partition with *(Optional)* | Column to use when partitioning the data. Enter the name of a column in the table. If a column is specified, Visual Notebooks partitions the file using the given column, creating multiple "parts" to speed up performance. |
| Number of partitions *(Default: 100)* | Number of partitions to make when uploading the file. Enter an integer; the data is partitioned into this number of parts. |
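The two partition settings behave like conventional JDBC range partitioning: the value range of the chosen column is split into the requested number of contiguous parts so they can be read in parallel. A plain-Python sketch of how such boundaries could be computed (an illustration of the idea, not Visual Notebooks' actual algorithm):

```python
def partition_bounds(lower: int, upper: int,
                     num_partitions: int) -> list[tuple[int, int]]:
    """Split the inclusive range [lower, upper] into num_partitions
    contiguous (start, end) chunks, the way range partitioning
    divides a numeric partition column."""
    total = upper - lower + 1
    base, extra = divmod(total, num_partitions)
    bounds = []
    start = lower
    for i in range(num_partitions):
        size = base + (1 if i < extra else 0)  # spread the remainder
        end = start + size - 1
        bounds.append((start, end))
        start = end + 1
    return bounds

# e.g. a partition column holding ids 1..100, split into 4 parts
print(partition_bounds(1, 100, 4))  # [(1, 25), (26, 50), (51, 75), (76, 100)]
```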

Node Inputs/Outputs

Input: None
Output: Visual Notebooks returns a table, called a dataframe, that contains all uploaded data. Columns are labeled and include a symbol that specifies the data type of that column.

Figure 1: Example dataframe output
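Conceptually, the output pairs each column label with a data-type symbol. A plain-Python illustration of that idea (the rows and the type-inference rule here are hypothetical, not the product's internal representation):

```python
# A dataframe's schema conceptually pairs column labels with types.
rows = [
    {"Id": 1, "Species": "Iris-setosa", "PetalLength": 1.4},
    {"Id": 2, "Species": "Iris-setosa", "PetalLength": 1.3},
]

# Infer a simple type label per column from the first row.
schema = {col: type(val).__name__ for col, val in rows[0].items()}
print(schema)  # {'Id': 'int', 'Species': 'str', 'PetalLength': 'float'}
```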

Examples

First, upload data from Spark with the default settings.

  1. Select the Spark schema and table that contains the desired data.
  2. Select Run to create a dataframe in Visual Notebooks.

Figure 2: Example dataframe created from a Spark table

Now add a filter to limit the amount of data returned.

  1. Select a column to filter by.
  2. Select a condition and define a value.

Figure 3 shows data from the "iris" table that has an "Id" value less than or equal to 10.

Figure 3: Example dataframe filtered by value
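The filter semantics can be checked outside the product with a few lines of plain Python. The predicate table below is only an illustration of what some dropdown conditions mean, and the data is synthetic, not the actual "iris" table:

```python
# A few of the dropdown's filter conditions, expressed as predicates.
# (An illustration of their semantics, not Visual Notebooks internals.)
CONDITIONS = {
    "is null": lambda v, _: v is None,
    "is not null": lambda v, _: v is not None,
    "is equal": lambda v, x: v == x,
    "is not equal": lambda v, x: v != x,
    "is less than or equal to": lambda v, x: v <= x,
}

# Synthetic stand-in for the "iris" table used in the example.
iris = [{"Id": i, "Species": "Iris-setosa"} for i in range(1, 21)]

# The "is less than or equal to" condition with value 10, as in Figure 3.
pred = CONDITIONS["is less than or equal to"]
filtered = [row for row in iris if pred(row["Id"], 10)]
print(len(filtered))  # 10 rows remain
```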

Finally, upload a subset of the data using a SQL query.

  1. Select a Spark schema.
  2. Write a query that returns the desired data.
  3. Select Run to create a dataframe in Visual Notebooks.

Figure 4 shows a SQL query that returns the "Id" and "Species" columns for the first ten rows of the "iris" table.

Figure 4: Example dataframe created from a SQL query
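A query of that shape can be tried locally against an in-memory SQLite stand-in for the Spark table (the schema and rows below are made up for illustration, and Spark's SQL dialect may differ in details from SQLite's):

```python
import sqlite3

# Build a tiny in-memory stand-in for the "iris" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE iris (Id INTEGER, Species TEXT)")
conn.executemany(
    "INSERT INTO iris VALUES (?, ?)",
    [(i, "Iris-setosa") for i in range(1, 51)],
)

# Return the Id and Species columns for the first ten rows, as in Figure 4.
rows = conn.execute(
    "SELECT Id, Species FROM iris ORDER BY Id LIMIT 10"
).fetchall()
print(rows[0], len(rows))  # (1, 'Iris-setosa') 10
```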
