Jupyter notebook management
After you create a Notebook instance and start a Jupyter Notebook, the Jupyter dashboard is displayed.
Note
If you work in your Jupyter notebook for a long time, you might see a warning message indicating that the connection to the server is lost. Wherobots uses cookie sessions to authenticate users, and to keep accounts secure those sessions expire after an hour.
Refresh the page and the browser will redirect you to the login page; after you log in again, the browser will redirect you back to the previous page.
Executing Jupyter Notebook¶
There are two types of kernels available for your Jupyter Notebook: the Python kernel (ipykernel) and the Scala kernel (Scala). These kernels can be created using the Launcher, which can be accessed through File -> New Launcher.
Spark Web UI¶
A Spark web UI enables real-time monitoring, performance analysis, and resource optimization for efficient data processing, which helps you identify bottlenecks and improve overall application efficiency. To access it, click Sedona Spark in the toolbar and select the correct port number.
To obtain the port number, execute the provided code snippet:
spark_ui_port = sedona.sparkContext.uiWebUrl.split(":")[-1]
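To illustrate what the snippet extracts, the sketch below applies the same split to a sample URL string. The address `http://10.0.0.12:4040` is a made-up value standing in for whatever `sedona.sparkContext.uiWebUrl` returns in your session, and `extract_port` is a hypothetical helper name.

```python
# Hypothetical example value; in a notebook this would come from
# sedona.sparkContext.uiWebUrl.
sample_ui_web_url = "http://10.0.0.12:4040"


def extract_port(ui_web_url: str) -> str:
    """Return the text after the last colon, i.e. the port as a string."""
    return ui_web_url.split(":")[-1]


spark_ui_port = extract_port(sample_ui_web_url)
print(spark_ui_port)  # prints 4040
```

Note that the result is a string, not an integer; convert it with `int(...)` if you need a number.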
For more information, see the Spark Web UI documentation.
Executing All Cells¶
Open the Jupyter notebook that you want to run on the Notebook instance that you created, then follow the steps below to execute all of its code cells.
- Locate the toolbar on the top-left of the notebook and select Run, as shown in the screenshot.
- Click Run All Cells to execute each cell in the notebook.
Note
When you first execute a WherobotsDB code cell, you will see a warning:
<TIMESTAMP> WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
This behavior is normal: starting the Executors takes between 1 and 5 minutes, depending on the number of Executors provided.
Configuring Spark and adding Maven libraries differ slightly between Scala notebooks and Python notebooks.
Python Notebook Sedona Spark Configuration¶
To access data, you need to create a SedonaContext with the appropriate configuration. The first configuration to set up is access to a public S3 bucket, which looks like this:
.config("spark.hadoop.fs.s3a.bucket.<YOUR_BUCKET_NAME>.aws.credentials.provider","org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
If your bucket is not public then you may add your access key and secret key to access the private S3 bucket using the configuration shown below:
.config("spark.hadoop.fs.s3a.bucket.<YOUR_BUCKET_NAME>.access.key", "ACCESS-KEY")
.config("spark.hadoop.fs.s3a.bucket.<YOUR_BUCKET_NAME>.secret.key", "SECRET-KEY")
.config("spark.hadoop.fs.s3a.bucket.<YOUR_BUCKET_NAME>.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
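The per-bucket keys above follow one pattern, so they can be generated rather than typed by hand. The following is a minimal sketch, not part of any Wherobots or Sedona API: the helper name `s3a_bucket_conf` and the bucket name `my-bucket` are assumptions made for illustration. It only builds the key/value pairs shown above.

```python
from typing import Dict, Optional

ANONYMOUS_PROVIDER = "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
SIMPLE_PROVIDER = "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"


def s3a_bucket_conf(bucket: str,
                    access_key: Optional[str] = None,
                    secret_key: Optional[str] = None) -> Dict[str, str]:
    """Build per-bucket S3A settings (hypothetical helper).

    Without keys, the bucket is treated as public (anonymous access).
    With both keys, the private-bucket settings are produced.
    """
    prefix = f"spark.hadoop.fs.s3a.bucket.{bucket}."
    if access_key is None and secret_key is None:
        return {prefix + "aws.credentials.provider": ANONYMOUS_PROVIDER}
    if access_key is None or secret_key is None:
        raise ValueError("access_key and secret_key must be given together")
    return {
        prefix + "access.key": access_key,
        prefix + "secret.key": secret_key,
        prefix + "aws.credentials.provider": SIMPLE_PROVIDER,
    }


# "my-bucket" is a placeholder bucket name.
for key, value in s3a_bucket_conf("my-bucket").items():
    print(key, "=", value)
```

In a notebook, each pair could then be applied to the builder with `.config(key, value)` before `getOrCreate()` is called.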
WherobotsDB can also dynamically fetch any number of additional dependencies directly from Maven when you define the Maven coordinates for each package. Maven coordinates have three required fields: groupId:artifactId:version.
.config("spark.jars.packages", "<MAVEN_COORDINATE>,<MAVEN_COORDINATE>")
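Because `spark.jars.packages` takes a single comma-separated string, a small helper can check each coordinate and join them. This is a sketch under assumptions: the function name `join_maven_coordinates` and the coordinates `org.example:demo-lib:1.0.0` and `org.example:other-lib:2.3.1` are hypothetical, and the check only verifies that there are three non-empty fields.

```python
from typing import Iterable


def join_maven_coordinates(coordinates: Iterable[str]) -> str:
    """Validate groupId:artifactId:version coordinates and join them with commas."""
    checked = []
    for coordinate in coordinates:
        parts = coordinate.strip().split(":")
        if len(parts) != 3 or not all(parts):
            raise ValueError(
                f"expected groupId:artifactId:version, got {coordinate!r}"
            )
        checked.append(":".join(parts))
    return ",".join(checked)


# Hypothetical coordinates, used only to show the resulting string.
packages = join_maven_coordinates([
    "org.example:demo-lib:1.0.0",
    "org.example:other-lib:2.3.1",
])
print(packages)
```

The resulting string is what would be passed as the value of `spark.jars.packages`.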
Putting together the configuration for accessing a public S3 bucket with a dependency looks like this:
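from sedona.spark import *
# Build the Spark configuration: anonymous access to a public S3 bucket
# plus one additional Maven dependency.
config = SedonaContext.builder() \
    .config("spark.hadoop.fs.s3a.bucket.<YOUR_BUCKET_NAME>.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider") \
    .config("spark.jars.packages", "<MAVEN_COORDINATE>") \
    .getOrCreate()
# Create the SedonaContext from the configured Spark session.
sedona = SedonaContext.create(config)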
Scala Notebook Sedona Spark Configuration¶
Defining configurations for a Scala Jupyter notebook is a bit different from what you are used to in a Python Jupyter notebook: you have to define the configurations before you create a SedonaContext. To configure a Scala Jupyter notebook to access a public S3 bucket, use the following setting:
launcher.conf.set("spark.hadoop.fs.s3a.bucket.<YOUR_BUCKET_NAME>.aws.credentials.provider","org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
To access an S3 bucket that is not public, you have to define the access key and secret key for that bucket.
launcher.conf.set("spark.hadoop.fs.s3a.bucket.<YOUR_BUCKET_NAME>.access.key", "ACCESS-KEY")
launcher.conf.set("spark.hadoop.fs.s3a.bucket.<YOUR_BUCKET_NAME>.secret.key", "SECRET-KEY")
launcher.conf.set("spark.hadoop.fs.s3a.bucket.<YOUR_BUCKET_NAME>.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
You can dynamically add any number of Maven dependencies by defining the Maven coordinates.
launcher.packages = ["<MAVEN_COORDINATE>", "<MAVEN_COORDINATE>"]
The following cell combines the Scala Jupyter configurations for accessing a public S3 bucket and adding a dependency. Add these configurations at the very beginning of the Scala notebook.
%%init_spark
launcher.conf.set("spark.hadoop.fs.s3a.bucket.<YOUR_BUCKET_NAME>.aws.credentials.provider","org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
launcher.packages = ["<MAVEN_COORDINATE>"]
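If you keep the same settings for both notebook types, the text of the `%%init_spark` cell can be generated from them. The following is only a sketch: `render_init_spark_cell` is a hypothetical helper, `my-bucket` and `org.example:demo-lib:1.0.0` are placeholder values, and the function does nothing more than build the cell text shown above for you to paste into a Scala notebook.

```python
from typing import Dict, List


def render_init_spark_cell(conf: Dict[str, str], packages: List[str]) -> str:
    """Return the text of an %%init_spark cell for the given settings."""
    lines = ["%%init_spark"]
    for key, value in conf.items():
        lines.append(f'launcher.conf.set("{key}", "{value}")')
    if packages:
        quoted = ", ".join(f'"{package}"' for package in packages)
        lines.append(f"launcher.packages = [{quoted}]")
    return "\n".join(lines)


# Placeholder bucket name and Maven coordinate.
cell_text = render_init_spark_cell(
    {
        "spark.hadoop.fs.s3a.bucket.my-bucket.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
    },
    ["org.example:demo-lib:1.0.0"],
)
print(cell_text)
```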
Once the configuration for the Scala Jupyter notebook is defined, creating the Sedona object is straightforward.
import org.apache.sedona.spark.SedonaContext
val config = SedonaContext.builder().getOrCreate()
val sedona = SedonaContext.create(config)
Note
You have the option to bypass configuring Spark settings in the Jupyter Notebook by specifying them directly in the advanced configuration's Spark configuration field.
Python Zip Library File Support¶
Jupyter Notebook supports importing your own custom Python modules.
- On your local computer or in any other environment, create a directory called zipmoduletest.
- In the zipmoduletest directory, create a file named hellosedona.py with the following contents:
def hello(input):
    return 'hello ' + str(input)
- In the same directory, add an empty file with the name __init__.py. If you list the directory with ls zipmoduletest, it should now look like the following:
__init__.py hellosedona.py
- Use the zip command to place the two module files into a file called zipmoduletest.zip, or use any compression tool to zip these two files.
zip -r9 ../zipmoduletest.zip *
- Upload the zip file to Wherobots Managed Storage; refer to the file structure docs.
- Copy the S3 path of the zip file, and use the code below to import it:
sedona.sparkContext.addPyFile('s3://<Your-Bucket>/path-to-file/zipmoduletest.zip')
from zipmoduletest.hellosedona import hello
hello_str = hello("Sedona")
print(hello_str)
The output will be:
hello Sedona
- To import a custom Python module in a job submission, include the code above in your Python code.
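If the zip command is not available, Python's standard `zipfile` module can build the archive instead. The following is a sketch under assumptions: the helper name `build_module_zip` and the temporary working directory are illustrative only. Note one deliberate difference: it stores the files under a `zipmoduletest/` prefix inside the archive, which is the layout that Python's standard zip import needs for `from zipmoduletest.hellosedona import hello`. The zip command shown in the steps stores the files at the archive root, so adjust `arcname` if your environment expects that layout.

```python
import sys
import tempfile
import zipfile
from pathlib import Path


def build_module_zip(package_dir: Path, zip_path: Path) -> Path:
    """Zip every .py file in package_dir under a '<package name>/' prefix."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for source in sorted(package_dir.glob("*.py")):
            archive.write(source, arcname=f"{package_dir.name}/{source.name}")
    return zip_path


# Recreate the example package in a temporary directory.
work_dir = Path(tempfile.mkdtemp())
package_dir = work_dir / "zipmoduletest"
package_dir.mkdir()
(package_dir / "__init__.py").write_text("")
(package_dir / "hellosedona.py").write_text(
    "def hello(input):\n    return 'hello ' + str(input)\n"
)

zip_path = build_module_zip(package_dir, work_dir / "zipmoduletest.zip")

# Local check: put the archive on sys.path and import from it.
sys.path.insert(0, str(zip_path))
from zipmoduletest.hellosedona import hello

print(hello("Sedona"))  # prints: hello Sedona
```

The local import at the end only verifies the archive layout; in a notebook you would upload the zip file and load it with `addPyFile` as described in the steps.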
Python Notebook As a Job¶
Once the Jupyter notebook is ready, you can export it as an executable Python file. Exporting your Python Jupyter notebook allows you to create jobs. Follow the steps below to create a Python executable.
- Locate the toolbar on the top-left of the notebook and select File, as shown in the screenshot.
- Hover over Save and Export Notebook As...
- Select Executable Script; a file with the name of the Jupyter notebook will then be saved to your downloads folder.
Once you have the Python executable file, refer to job submission docs to create a job.
Scala Notebook As a Job¶
After you have made changes to the Jupyter Scala notebook, you can export it as an executable Scala file. The steps are the same as for creating a Python executable script from a Python notebook.
- Locate the toolbar on the top-left of the notebook and select File, as shown in the screenshot.
- Hover over Save and Export Notebook As...
- Select Executable Script; a file with the name of the Jupyter Scala notebook will then be saved to your downloads folder.
Once you have the Scala file, you can import it into the sedona-maven-example/src/main/scala/com/wherobots/sedona/ file path. This ensures that the Scala file is included in the jar package for job submission.
Note
The Scala executable file that you create won't have a main class. To add one, wrap your code after the imports with object <class-name-you-want> extends App { all-of-code }; this ensures that the job executes as expected.
Executing a .scala file is not possible within the Jupyter environment, as it functions solely as a text editor. To execute code, use the Jupyter Scala notebook.
Alternatively, you can copy the code from the Jupyter Scala notebook and paste it into the existing file in the path mentioned above.
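The wrapping step described in the note can be sketched as a small text transformation. This is an illustration under assumptions, not a Wherobots tool: `wrap_in_app_object` and the object name `ExampleJob` are hypothetical, the sample Scala text is made up, and the function assumes the exported file starts with its import lines and has no package declaration.

```python
def wrap_in_app_object(scala_source: str, object_name: str) -> str:
    """Keep the leading import block and wrap the remaining code in an App object."""
    lines = scala_source.splitlines()
    split_at = 0
    for index, line in enumerate(lines):
        stripped = line.strip()
        # Leading blank lines, comments, and imports stay outside the object.
        if stripped == "" or stripped.startswith("import ") or stripped.startswith("//"):
            split_at = index + 1
        else:
            break
    header = lines[:split_at]
    body = lines[split_at:]
    wrapped = [f"object {object_name} extends App {{"]
    wrapped += ["  " + line if line else line for line in body]
    wrapped.append("}")
    return "\n".join(header + wrapped) + "\n"


# Made-up sample of an exported Scala script.
exported = """import org.apache.sedona.spark.SedonaContext

val config = SedonaContext.builder().getOrCreate()
val sedona = SedonaContext.create(config)
"""

print(wrap_in_app_object(exported, "ExampleJob"))
```

Import statements that appear later in the script are left inside the object body, which Scala permits.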
Creating Jar File¶
- Navigate to File on the toolbar.
- Click on New Launcher.
- Open terminal and make sure you are in the sedona-maven-example directory.
- Execute mvn clean package in the terminal at the specified location.
- Locate the target folder.
- Right-click on sedonadb-example-0.0.1.jar.
- Select Download to download the jar to your downloads folder.
Note
You may add any dependency to the pom.xml located at notebook-example/scala/sedona-maven-example.
Once you have the jar file, refer to job submission docs to create a job.