Make PySpark work inside JupyterHub


On my server, Jupyter kernels are located at:

/usr/local/share/jupyter/kernels/

You can create a new kernel by making a new directory:

mkdir /usr/local/share/jupyter/kernels/pyspark

Then create the kernel.json file - I paste mine as a reference:

{
   "display_name": "pySpark (Spark 1.6.0)",
   "language": "python",
   "argv": [
      "/usr/local/bin/python2.7",
      "-m",
      "ipykernel",
      "-f",
      "{connection_file}"
   ],
   "env": {
      "PYSPARK_PYTHON": "/usr/local/bin/python2.7",
      "SPARK_HOME": "/usr/lib/spark",
      "PYTHONPATH": "/usr/lib/spark/python/lib/py4j-0.9-src.zip:/usr/lib/spark/python/",
      "PYTHONSTARTUP": "/usr/lib/spark/python/pyspark/shell.py",
      "PYSPARK_SUBMIT_ARGS": "--master yarn-client pyspark-shell"
   }
}
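
Once the directory and kernel.json are in place, you can check that Jupyter picks up the new kernel. This is just a quick sanity check, assuming the jupyter command-line tools are on your PATH:

jupyter kernelspec list
# The output should include a "pyspark" entry pointing at
# /usr/local/share/jupyter/kernels/pyspark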

Alternatively, you can start Jupyter as usual and add the following to the top of your code:

import sys
sys.path.insert(0, '<path>/spark/python/')
sys.path.insert(0, '<path>/spark/python/lib/py4j-0.8.2.1-src.zip')

import pyspark

conf = pyspark.SparkConf().set<conf settings>
sc = pyspark.SparkContext(conf=conf)
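
Whichever approach you use, a quick smoke test confirms the SparkContext is actually working. This is only a sketch; the master URL and resources depend on your cluster configuration:

# Minimal smoke test for the SparkContext created above
print(sc.master)                          # e.g. yarn-client, depending on your setup
print(sc.parallelize(range(100)).sum())   # sums 0..99, should print 4950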

Suggestion : 2

PySpark isn’t installed like a normal Python library; rather, it’s packaged separately and needs to be added to PYTHONPATH to be importable. This can be done by configuring jupyterhub_config.py to find the required libraries and set PYTHONPATH in the user’s notebook environment. You’ll also want to set PYSPARK_PYTHON to the same Python path that the notebook environment is running in:

import os
import glob
# Find pyspark modules to add to PYTHONPATH, so they can be used as regular
# libraries
pyspark = '/usr/lib/spark/python/'
py4j = glob.glob(os.path.join(pyspark, 'lib', 'py4j-*.zip'))[0]
pythonpath = ':'.join([pyspark, py4j])

# Set PYTHONPATH and PYSPARK_PYTHON in the user's notebook environment
c.YarnSpawner.environment = {
   'PYTHONPATH': pythonpath,
   'PYSPARK_PYTHON': '/opt/jupyterhub/miniconda/bin/python',
}

If you’re using an archived notebook environment, you may instead want to bundle a Spark config directory in the archive and set SPARK_CONF_DIR to the extracted path. This allows you to specify the path to the same archive in the config, so your users don’t have to do it themselves. This might look like:

# A custom spark-defaults.conf
# Stored at `<ENV>/etc/spark/spark-defaults.conf`, where `<ENV>` is the top
# directory of the unarchived Conda/virtual environment.

# Common configuration
spark.master yarn
spark.submit.deployMode client
spark.yarn.queue myqueue

# If the spark jars are already on every node, avoid serializing them
spark.yarn.jars local:/usr/lib/spark/jars/*

# Path to the archived Python environment
spark.yarn.dist.archives hdfs:///jupyterhub/example.tar.gz#environment

# Pyspark configuration
spark.pyspark.python ./environment/bin/python
spark.pyspark.driver.python ./environment/bin/python

# Add PySpark to PYTHONPATH, same as above
#...

# Set PYTHONPATH and SPARK_CONF_DIR in the user's notebook environment
c.YarnSpawner.environment = {
   'PYTHONPATH': pythonpath,
   'SPARK_CONF_DIR': './environment/etc/spark'
}

Given configuration like the above, users may not need to enter any parameters when creating a SparkContext - the default values may already be sufficiently set:

import pyspark

# Create a spark context from the defaults set in configuration
sc = pyspark.SparkContext()

If a user needs to override some of these defaults, they can still do so explicitly:

import pyspark

conf = pyspark.SparkConf()

# Override a few default parameters
conf.set('spark.executor.memory', '512m')
conf.set('spark.executor.instances', 1)

# Create a spark context with the overrides
sc = pyspark.SparkContext(conf=conf)

def some_function(x):
    # Libraries are imported and available from the same environment as the
    # notebook
    import sklearn
    import pandas as pd
    import numpy as np

    # Use the libraries to do work
    return ...

rdd = sc.parallelize(range(1000)).map(some_function).take(10)

There are additional Jupyter and Spark integrations that may be useful for your installation. Please refer to their documentation for more information.

Suggestion : 3


Java is used by many other software packages, so it is quite possible that a suitable version (in our case, version 7 or later) is already available on your computer. To check whether Java is available and find its version, open a Command Prompt and type the following command.

java -version

If Java is installed and configured to work from a Command Prompt, running the above command should print the information about the Java version to the console. For example, I got the following output on my laptop.

java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)

If instead you get a message like the one below, Java is either not installed or not available from the Command Prompt, and you will need to install it first.

'java' is not recognized as an internal or external command,
operable program or batch file.

Similarly, check that Python is available by running python --version from the Command Prompt. If Python is installed and configured correctly, this should print the Python version to the console. For example, I got the following output on my laptop.

Python 3.6.5 :: Anaconda, Inc.

1. Click on Windows and search for “Anaconda Prompt”. Open the Anaconda Prompt and type “python -m pip install findspark”. This package is necessary to run Spark from a Jupyter notebook.
2. Now, from the same Anaconda Prompt, type “jupyter notebook” and hit Enter. This opens Jupyter Notebook in your browser. From the Jupyter notebook, go to New → Python 3.
 

3. Upon selecting Python 3, a new notebook opens which we can use to run Spark via PySpark. In the notebook, run the code below to verify that Spark is installed successfully; once this is done, you can use Jupyter notebook to run Spark using PySpark.
4. Now let us test whether the installation was successful using Test 1 and Test 2 below.

Test 1

import findspark
findspark.init()
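
As a follow-up, a minimal sketch like the one below (assuming pyspark is importable once findspark.init() has run) verifies that a local SparkSession can actually be created:

import findspark
findspark.init()

from pyspark.sql import SparkSession

# Start a local SparkSession and run a trivial query as a smoke test
spark = SparkSession.builder.master("local[1]").appName("InstallTest").getOrCreate()
print(spark.version)
print(spark.range(5).count())  # should print 5
spark.stop()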

Suggestion : 4

Step 1. Download & install the Anaconda Distribution, then follow the steps below to install PySpark in Anaconda and Jupyter Notebook.

# Install OpenJDK 11
conda install openjdk

To install PySpark on Anaconda, I will use the conda command. conda is the package manager that the Anaconda distribution is built upon; it is both cross-platform and language agnostic.

# Install PySpark using Conda
conda install pyspark

In order to run PySpark in a Jupyter notebook, you first need to find the PySpark installation; I will be using the findspark package to do so. Since this is a third-party package, we need to install it before using it.

conda install -c conda-forge findspark

If you get a PySpark import error in Jupyter, run the following commands in a notebook cell to locate the PySpark installation.

import findspark
findspark.init()
findspark.find()

After installation, write the program below and run it by pressing F5 or by selecting the Run button from the menu.

# Import PySpark
from pyspark.sql import SparkSession

#Create SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Data
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]

# Columns
columns = ["language", "users_count"]

# Create DataFrame
df = spark.createDataFrame(data).toDF(*columns)

# Print DataFrame
df.show()
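
If everything is set up correctly, df.show() should print the three rows defined above, roughly like this (exact formatting may vary):

+--------+-----------+
|language|users_count|
+--------+-----------+
|    Java|      20000|
|  Python|     100000|
|   Scala|       3000|
+--------+-----------+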