add jar to pyspark when using notebook

I start a Python 3 notebook in JupyterHub and override the PYSPARK_SUBMIT_ARGS flag as shown below. The Kafka consumer library was downloaded from the Maven repository and put in my home directory, /home/jovyan:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--jars /home/jovyan/spark-streaming-kafka-assembly_2.10-1.6.1.jar pyspark-shell'
)

import pyspark
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext

sc = pyspark.SparkContext()
ssc = StreamingContext(sc,1)

broker = "<my_broker_ip>"
directKafkaStream = KafkaUtils.createDirectStream(ssc, ["test1"],
                        {"metadata.broker.list": broker})
directKafkaStream.pprint()
ssc.start()
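
In a notebook the stream keeps running in the background after ssc.start(). If you want to bound the run and shut it down cleanly, a minimal sketch is the following (the 60-second timeout is just an illustrative value):

ssc.awaitTermination(timeout=60)                         # wait up to 60 seconds
ssc.stop(stopSparkContext=False, stopGraceFully=True)    # keep the SparkContext alive for further work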

Indeed, there is a way to link it dynamically via the SparkConf object when you create the SparkSession, as explained in this answer:

from pyspark.sql import SparkSession

spark = SparkSession\
   .builder\
   .appName("My App")\
   .config("spark.jars", "/path/to/jar.jar,/path/to/another/jar.jar")\
   .getOrCreate()
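
If you want to confirm that the JARs were actually registered with the driver, a quick sanity check (just a sketch; getConf() returns a copy of the SparkConf backing the session) is:

# Should print the comma-separated list passed to spark.jars above.
print(spark.sparkContext.getConf().get("spark.jars"))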

You can also run your Jupyter notebook directly with the pyspark command by setting the relevant environment variables:

export PYSPARK_DRIVER_PYTHON=jupyter
export IPYTHON=1
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --port=XXX --ip=YYY"

Suggestion : 2

I have the same problem. I am running a VirtualBox Hadoop+Spark cluster. While spark-notebook connects fine to the master on the cluster, I can't add custom JARs to the classpath. I set "spark.jars" in customSparkConf in the notebook metadata, but JARs specified in spark.jars don't appear to be on the driver's classpath for me (spark-notebook 0.6.2, Spark 1.6.0). An "Added JAR" line does show up in the driver log, but such a line also appears when I run spark-shell below, so it seems not to be sufficient to ensure that the driver gets <my jar> added to its classpath.

Here is my notebook configuration, reduced to a minimal test case:

{
  "name": "test",
  …
  "customDeps": [],
  "customImports": null,
  "customArgs": null,
  "customSparkConf": {
    "spark.yarn.jar": "…",
    "spark.jars": <my jar>,
    "spark.master": "yarn-client"
  },
  "kernelspec": {
    "name": "spark",
    "display_name": "Scala [2.10.4] Spark [1.6.0] Hadoop [2.6.0]   {Parquet ✓}"
  }
}

There is a line in the JS console (driver log) implying the JAR was added to the /jars folder on the driver:

Server log> [Thu May 05 2016 16:22:01 GMT-0400 (EDT)] [org.apache.spark.SparkContext] Added JAR <my jar> at http://172.29.46.14:44693/jars/<my jar> with timestamp 1462479721924

When I run an equivalent thing in my shell, everything works fine:

$ spark-shell --master yarn --conf spark.jars=<my jar>
   …
   scala> import org.hammerlab.guacamole
   import org.hammerlab.guacamole

Suggestion : 3

This package can be added to Spark using the --packages command-line option, for example to include it when starting the spark-shell. Using --packages ensures that this library and its dependencies will be added to the classpath (make sure you use the latest version); in Python, you would do the same with pyspark. Alternatively, to get the latest development version, you can download this repo and build the JAR yourself, then add it when launching the spark-shell with --jars (but its dependencies won't be added to the classpath that way). To build the JAR, just run sbt ++{SBT_VERSION} package from the root of the package (see the run_*.sh scripts). See here for more pyspark options. Here are examples for the spark-shell and pyspark:

# Scala 2.11
$SPARK_HOME/bin/spark-shell --packages com.github.astrolabsoftware:spark-fits_2.11:1.0.0

# Scala 2.12
$SPARK_HOME/bin/spark-shell --packages com.github.astrolabsoftware:spark-fits_2.12:1.0.0

# Scala 2.11
$SPARK_HOME/bin/pyspark --packages com.github.astrolabsoftware:spark-fits_2.11:1.0.0

# Scala 2.12
$SPARK_HOME/bin/pyspark --packages com.github.astrolabsoftware:spark-fits_2.12:1.0.0

# Alternatively, build the JAR and pass it directly with --jars
$SPARK_HOME/bin/spark-shell --jars /path/to/jar/<spark-fits.jar>
$SPARK_HOME/bin/pyspark --jars /path/to/jar/<spark-fits.jar>
# Use ipython as the pyspark driver
export PYSPARK_DRIVER_PYTHON="path/to/ipython"
$SPARK_HOME/bin/pyspark --jars /path/to/jar/<spark-fits.jar>

# Or launch a Jupyter notebook as the pyspark driver
cd /path/to/notebooks
export PYSPARK_DRIVER_PYTHON="path/to/jupyter-notebook"
$SPARK_HOME/bin/pyspark --jars /path/to/jar/<spark-fits.jar>
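
The --packages approach can also be expressed programmatically when building the SparkSession, via the spark.jars.packages setting. This is a minimal sketch using the Scala 2.11 coordinates from above; adjust the artifact suffix and version to match your build:

from pyspark.sql import SparkSession

# spark.jars.packages resolves the Maven coordinates and puts the JARs
# (plus their transitive dependencies) on the driver and executor classpaths.
spark = SparkSession.builder \
    .appName("spark-fits example") \
    .config("spark.jars.packages",
            "com.github.astrolabsoftware:spark-fits_2.11:1.0.0") \
    .getOrCreate()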

Suggestion : 4

This example shows how to discover the location of JAR files installed with Spark 2, and add them to the Spark 2 configuration.

# # Using Avro data
#
# This example shows how to use a JAR file on the local filesystem on
# Spark on YARN.

from __future__ import print_function
import os, sys
import os.path
from functools import reduce
from pyspark.sql import SparkSession
from pyspark.files import SparkFiles

# Add the data file to HDFS for consumption by the Spark executors.
!hdfs dfs -put resources/users.avro /tmp

# Find the example JARs provided by the Spark parcel. This parcel
# is available on both the driver, which runs in Cloudera Machine Learning,
# and the executors, which run on YARN.
exampleDir = os.path.join(os.environ["SPARK_HOME"], "examples/jars")
exampleJars = [os.path.join(exampleDir, x) for x in os.listdir(exampleDir)]

# Add the Spark JARs to the Spark configuration to make them available for use.
spark = SparkSession\
   .builder\
   .config("spark.jars", ",".join(exampleJars))\
   .appName("AvroKeyInputFormat")\
   .getOrCreate()
sc = spark.sparkContext

# Read the schema.
schema = open("resources/user.avsc").read()
conf = {
   "avro.schema.input.key": schema
}

avro_rdd = sc.newAPIHadoopFile(
   "/tmp/users.avro", # This is an HDFS path!
   "org.apache.avro.mapreduce.AvroKeyInputFormat",
   "org.apache.avro.mapred.AvroKey",
   "org.apache.hadoop.io.NullWritable",
   keyConverter = "org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter",
   conf = conf)
output = avro_rdd.map(lambda x: x[0]).collect()
for k in output:
   print(k)
spark.stop()