
Accessing Parquet Files From Spark SQL Applications

Spark SQL supports loading and saving DataFrames from and to a variety of data sources and has native support for Parquet. For information about Parquet, see Parquet Files.

To read Parquet files in Spark SQL, use the SQLContext.read.parquet("path") method.
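For example, the following reads a Parquet file into a DataFrame from the Scala shell. This is a minimal sketch: the path /user/examples/people.parquet is a placeholder, and sqlContext is the SQLContext instance that spark-shell creates for you.
// Load a Parquet file into a DataFrame; the schema is read from the file itself
val df = sqlContext.read.parquet("/user/examples/people.parquet")
df.show()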

To write Parquet files in Spark SQL, use the DataFrame.write.parquet("path") method.
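For example, assuming the DataFrame df from the read sketch above, the following writes it back out as Parquet to another placeholder path:
// Write the DataFrame as Parquet; the target directory must not already exist
df.write.parquet("/user/examples/people_copy.parquet")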

To set the compression type, configure the spark.sql.parquet.compression.codec property:
sqlContext.setConf("spark.sql.parquet.compression.codec", "codec")
The supported codec values are: uncompressed, gzip, lzo, and snappy. The default is gzip.
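For example, to write Snappy-compressed Parquet (again using the placeholder DataFrame and path from the sketches above):
// Switch the session's Parquet codec from the default (gzip) to snappy
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
df.write.parquet("/user/examples/people_snappy.parquet")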

For an example of writing Parquet files to Amazon S3, see Reading and Writing Data Sources From and To Amazon S3.
