Friday, April 15, 2016

How to resolve spark-cassandra-connector Guava version conflicts in Yarn cluster mode

When you use spark-cassandra-connector and submit your job in YARN cluster mode, you will run into Guava version conflicts. spark-cassandra-connector depends on a recent Guava release, 16.0.1, which has new methods that do not exist in the older Guava versions (e.g., 11.0.2) shipped with Hadoop. It is a big headache to resolve.
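The conflict usually surfaces as a NoSuchMethodError at runtime, because the connector calls a Guava 16 method that the older Guava found first on the classpath does not have. The exact trace varies, but it typically looks something like:
  java.lang.NoSuchMethodError: com.google.common.util.concurrent.Futures.withFallback(...)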
Here is how you can resolve it without building anything special (no shading, no custom assembly).
Everyone probably has the same idea already: put guava-16.0.1.jar before guava-11.0.2.jar on the classpath. But how can we achieve this when running in YARN cluster mode?
Your Hadoop cluster might already have the Guava jar. If you use CDH, try this:
  find -L /opt/cloudera/parcels/CDH -name "guava*.jar"
If you want to use that jar, you can resolve the problem by adding:
spark-submit \
  --master yarn-cluster \
  --conf spark.driver.extraClassPath=<path of guava-16.0.1.jar> \
  --conf spark.executor.extraClassPath=<path of guava-16.0.1.jar> \
  ...
extraClassPath lets you prepend jars to the classpath.
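For example, a complete invocation might look like the sketch below. The paths, the main class, and the application jar name are all hypothetical; note that with this approach the Guava jar must already exist at that path on every node, because in cluster mode the driver also runs inside the cluster.
  # All paths and names here are placeholders; adjust for your cluster
  spark-submit \
    --master yarn-cluster \
    --class com.example.MyApp \
    --conf spark.driver.extraClassPath=/opt/jars/guava-16.0.1.jar \
    --conf spark.executor.extraClassPath=/opt/jars/guava-16.0.1.jar \
    my-app.jar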
If you cannot find that version of Guava on your cluster, you can ship the jar yourself:
spark-submit \
  --master yarn-cluster \
  --conf spark.driver.extraClassPath=./guava-16.0.1.jar \
  --conf spark.executor.extraClassPath=./guava-16.0.1.jar \
  --jars <path of guava-16.0.1.jar> \
  ...
With --jars, you tell Spark where to find the jar, so you need to provide its full path on the machine where you run spark-submit. When Spark starts in YARN cluster mode, the jar is shipped to the container on the NodeManager, and everything lands in the current working directory where the executor starts. That is why extraClassPath only needs to point at the current working directory: ./guava-16.0.1.jar.
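Before shipping a Guava jar, you can sanity-check that it actually contains the method the connector needs (withFallback is the usual offender); javap ships with the JDK:
  # Prints the method signature if present; no output means the jar is too old
  javap -classpath guava-16.0.1.jar com.google.common.util.concurrent.Futures | grep withFallback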
If you use CDH, all Hadoop jars are automatically added when you run a YARN application. Take a look at launch_container.sh while your job is running, and you will see something like the line below.
export CLASSPATH="$PWD:$PWD/__spark__.jar:$HADOOP_CLIENT_CONF_DIR:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_HDFS_HOME/*:$HADOOP_HDFS_HOME/lib/*:$HADOOP_YARN_HOME/*:$HADOOP_YARN_HOME/lib/*:$HADOOP_MAPRED_HOME/*:$HADOOP_MAPRED_HOME/lib/*:$MR2_CLASSPATH"
Here is how you can find launch_container.sh (a one-step shortcut follows the list):
  • Find the host where one of the executors is running
  • Run this command: find -L /yarn -path "*<app_id>*" -name "launch*"
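For example, with a made-up application id, you can find the script and check the classpath order in one step:
  # The application id is hypothetical; substitute your own
  grep CLASSPATH $(find -L /yarn -path "*application_1460000000000_0001*" -name "launch_container.sh")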
There is a YARN configuration, yarn.application.classpath, that controls this list. If you like, you can prepend an entry for Guava:
  <property>
    <name>yarn.application.classpath</name>
    <value><path of guava-16.0.1.jar>,$HADOOP_CLIENT_CONF_DIR,$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*</value>
  </property>
Keep in mind this changes the classpath for every YARN application on the cluster, not just your Spark job.
Note that spark-submit has some messy conventions:
  • --jars is separated by comma ","
  • extraClassPath is separated by colon ":"
  • --driver-class-path is separated by colon ":" as well (it maps to spark.driver.extraClassPath)
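Putting the separators together, a submission that ships two extra jars (the second jar is hypothetical) mixes both conventions:
spark-submit \
  --master yarn-cluster \
  --jars /opt/jars/guava-16.0.1.jar,/opt/jars/other-dep.jar \
  --conf spark.driver.extraClassPath=./guava-16.0.1.jar:./other-dep.jar \
  --conf spark.executor.extraClassPath=./guava-16.0.1.jar:./other-dep.jar \
  ...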
I was in a hurry when I wrote this post, so I might have missed something or assumed you know a lot. If something is not clear, let me know and I will fix it.