This is the documentation for Cloudera Enterprise 5.8.x. Documentation for other versions is available at Cloudera Documentation.

Snappy Compression

Snappy is a compression/decompression library. It is optimized for very fast compression and decompression at a moderate compression ratio, rather than for maximum compression or compatibility with other compression libraries.

    Snappy is supported for all CDH components. How you specify compression depends on the component.

    Using Snappy with HBase

    If you install Hadoop and HBase from RPM or Debian packages, Snappy requires no HBase configuration.

    Using Snappy with Hive

    To enable Snappy compression for Hive output when creating SequenceFile outputs, use the following settings:

    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
    SET mapred.output.compression.type=BLOCK;

    Using Snappy with MapReduce

    Enabling MapReduce intermediate compression can make jobs run faster without requiring application changes. Only the temporary intermediate files that Hadoop creates for the shuffle phase are compressed; compression of the final job output is configured separately. Snappy is well suited to this case because it compresses and decompresses much faster than other compression algorithms, such as Gzip. For information about choosing a compression format, see Choosing and Configuring Data Compression.

    To enable Snappy for MapReduce intermediate compression for the whole cluster, set the following properties in mapred-site.xml:

    • MRv1
      <property>
        <name>mapred.compress.map.output</name>
        <value>true</value>
      </property>
      <property>
        <name>mapred.map.output.compression.codec</name>
        <value>org.apache.hadoop.io.compress.SnappyCodec</value>
      </property>
    • YARN
      <property>
        <name>mapreduce.map.output.compress</name>
        <value>true</value>
      </property>
      <property>
        <name>mapreduce.map.output.compress.codec</name>
        <value>org.apache.hadoop.io.compress.SnappyCodec</value>
      </property>
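
As a quick sanity check after editing mapred-site.xml, the following is a minimal sketch (not a Cloudera tool) that parses a Hadoop-style configuration and reports whether Snappy is enabled for map output. It assumes the usual <configuration> root element, accepts either the MRv1 or the YARN property pair, and uses mapreduce.map.output.compress.codec as the YARN codec property name. The function names and the sample text are illustrative only.

```python
import xml.etree.ElementTree as ET

SNAPPY = "org.apache.hadoop.io.compress.SnappyCodec"

# (enable flag, codec) property pairs: MRv1 names first, then YARN names.
INTERMEDIATE_PROPS = [
    ("mapred.compress.map.output", "mapred.map.output.compression.codec"),
    ("mapreduce.map.output.compress", "mapreduce.map.output.compress.codec"),
]

def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props

def snappy_intermediate_enabled(xml_text):
    """True if either property pair turns on Snappy for map output."""
    props = read_properties(xml_text)
    return any(
        props.get(flag, "").lower() == "true" and props.get(codec) == SNAPPY
        for flag, codec in INTERMEDIATE_PROPS
    )

# Illustrative sample; in practice, read the text of mapred-site.xml instead.
sample = """<configuration>
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
</configuration>"""

print(snappy_intermediate_enabled(sample))
```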

    You can also set these properties on a per-job basis.
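
One common way to set properties for a single job is to pass them as -D generic options on the hadoop jar command line. The sketch below only builds such a command line; it assumes the job's driver is run through ToolRunner (so that generic options are parsed), and the jar name, driver class, and paths are hypothetical placeholders.

```python
SNAPPY = "org.apache.hadoop.io.compress.SnappyCodec"

def intermediate_compression_options(use_yarn=True):
    """Return -D generic options that enable Snappy for map output."""
    if use_yarn:
        props = {
            "mapreduce.map.output.compress": "true",
            "mapreduce.map.output.compress.codec": SNAPPY,
        }
    else:  # MRv1 property names
        props = {
            "mapred.compress.map.output": "true",
            "mapred.map.output.compression.codec": SNAPPY,
        }
    args = []
    for name, value in props.items():
        args += ["-D", f"{name}={value}"]
    return args

# Hypothetical jar, driver class, and paths, for illustration only.
# Generic options go before the application's own arguments.
command = (
    ["hadoop", "jar", "my-job.jar", "com.example.MyDriver"]
    + intermediate_compression_options()
    + ["/input/path", "/output/path"]
)
print(" ".join(command))
```

Because the options affect only the submitted job, the cluster-wide values in mapred-site.xml stay unchanged.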

    Use the properties in the following table to compress the final output of a MapReduce job. These are usually set on a per-job basis.

    MRv1 Property                     YARN Property                                       Description
    mapred.output.compress            mapreduce.output.fileoutputformat.compress          Whether to compress the final job outputs (true or false).
    mapred.output.compression.codec   mapreduce.output.fileoutputformat.compress.codec    If the final job outputs are to be compressed, the codec to use. Set to org.apache.hadoop.io.compress.SnappyCodec for Snappy compression.
    mapred.output.compression.type    mapreduce.output.fileoutputformat.compress.type     For SequenceFile outputs, the type of compression to use (NONE, RECORD, or BLOCK). Cloudera recommends BLOCK.
      Note: The MRv1 property names are also supported (but deprecated) in YARN. You do not need to update them in this release.
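
If you do want to move existing job settings to the YARN names, the mapping in the table can be applied mechanically. The following is a minimal sketch; the helper name to_yarn_names and the sample settings are illustrative, not part of any Hadoop API.

```python
# MRv1 name -> YARN name, taken from the table above.
OUTPUT_PROPERTY_NAMES = {
    "mapred.output.compress":
        "mapreduce.output.fileoutputformat.compress",
    "mapred.output.compression.codec":
        "mapreduce.output.fileoutputformat.compress.codec",
    "mapred.output.compression.type":
        "mapreduce.output.fileoutputformat.compress.type",
}

def to_yarn_names(settings):
    """Rename deprecated MRv1 output-compression keys; keep other keys as-is."""
    return {OUTPUT_PROPERTY_NAMES.get(k, k): v for k, v in settings.items()}

# Example: Snappy block compression for SequenceFile output.
mrv1_settings = {
    "mapred.output.compress": "true",
    "mapred.output.compression.codec": "org.apache.hadoop.io.compress.SnappyCodec",
    "mapred.output.compression.type": "BLOCK",
}

for name, value in to_yarn_names(mrv1_settings).items():
    print(f"{name}={value}")
```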

    Using Snappy with Pig

    Set the same properties for Pig as for MapReduce.

    Using Snappy with Spark SQL

    To enable Snappy compression for Spark SQL when writing tables, specify the snappy codec in the spark.sql.parquet.compression.codec configuration:
    sqlContext.setConf("spark.sql.parquet.compression.codec","snappy") 

    Using Snappy Compression with Sqoop 1 and Sqoop 2 Imports

    • Sqoop 1 - On the command line, use the following option to enable Snappy compression:
      --compression-codec org.apache.hadoop.io.compress.SnappyCodec

      Cloudera recommends using the --as-sequencefile option with this compression option.

    • Sqoop 2 - When you create a job (sqoop:000> create job), choose 7 (SNAPPY) as the compression format.
    Page generated July 8, 2016.