This is the documentation for Cloudera Enterprise 5.8.x. Documentation for other versions is available at Cloudera Documentation.

Building and Running a Crunch Application with Spark

Developing and Running a Spark WordCount Application provides a tutorial on writing, compiling, and running a Spark application. Using the tutorial as a starting point, do the following to build and run a Crunch application with Spark:
  1. Along with the other dependencies shown in the tutorial, add the appropriate version of the crunch-core and crunch-spark dependencies to the Maven project.
    <dependency>
        <groupId>org.apache.crunch</groupId>
        <artifactId>crunch-core</artifactId>
        <version>${crunch.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.crunch</groupId>
        <artifactId>crunch-spark</artifactId>
        <version>${crunch.version}</version>
    </dependency>
  2. Use SparkPipeline where you would have used MRPipeline in the declaration of your Crunch pipeline. SparkPipeline takes either a String that contains the connection string for the Spark master (local for local mode, yarn for YARN) or a JavaSparkContext instance.
  3. As you would for a Spark application, use spark-submit start the pipeline with your Crunch application app-jar-with-dependencies.jar file.
For an example, see Crunch demo. After building the example, run with the following command:
spark-submit --class com.example.WordCount crunch-demo-1.0-SNAPSHOT-jar-with-dependencies.jar \
hdfs://namenode_host:8020/user/hdfs/input hdfs://namenode_host:8020/user/hdfs/output
Page generated July 8, 2016.