Using MapReduce Batch Indexing with Cloudera Search
The following sections include examples that illustrate using MapReduce to index tweets. These examples require that you:
- Complete the process of Preparing to Index Data with Cloudera Search.
- Install the MapReduce tools for Cloudera Search as described in Installing MapReduce Tools for use with Cloudera Search.
Batch Indexing into Online Solr Servers Using GoLive
MapReduceIndexerTool is a MapReduce batch job driver that creates a set of Solr index shards from a set of input files and writes the indexes into HDFS in a flexible, scalable, and fault-tolerant manner. Using GoLive, MapReduceIndexerTool also supports merging the output shards into a set of live customer-facing Solr servers, typically a SolrCloud. The following sample steps demonstrate these capabilities.
- Delete all existing documents in Solr.
$ solrctl collection --deletedocs collection1
- Run the MapReduce job using GoLive. Replace $NNHOST and $ZKHOST in the command with your NameNode and ZooKeeper hostnames and port numbers, as required. You do not need to specify --solr-home-dir because the job accesses it from ZooKeeper.
- Parcel-based Installation:
$ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 jar \
/opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool -D \
'mapred.child.java.opts=-Xmx500m' --log4j \
/opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
/opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
--output-dir hdfs://$NNHOST:8020/user/$USER/outdir --verbose --go-live \
--zk-host $ZKHOST:2181/solr --collection collection1 \
hdfs://$NNHOST:8020/user/$USER/indir
- Package-based Installation:
$ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 jar \
/usr/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool -D \
'mapred.child.java.opts=-Xmx500m' --log4j \
/usr/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
/usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
--output-dir hdfs://$NNHOST:8020/user/$USER/outdir --verbose --go-live \
--zk-host $ZKHOST:2181/solr --collection collection1 \
hdfs://$NNHOST:8020/user/$USER/indir
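Both forms of the command assume that $NNHOST and $ZKHOST are already set in the shell. A minimal sketch of setting them, using hypothetical hostnames:
$ export NNHOST=nn01.example.com
$ export ZKHOST=zk01.example.com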
A morphline configuration file such as the one passed with --morphline-file takes the following form:
SOLR_LOCATOR : {
  # Name of solr collection
  collection : collection

  # ZooKeeper ensemble
  zkHost : "$ZK_HOST"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [
      { generateUUID { field : id } }

      {
        # Remove record fields that are unknown to Solr schema.xml.
        # Recall that Solr throws an exception on any attempt to load a document that
        # contains a field that isn't specified in schema.xml.
        sanitizeUnknownSolrFields {
          solrLocator : ${SOLR_LOCATOR} # Location from which to fetch Solr schema
        }
      }

      { logDebug { format : "output record: {}", args : ["@{}"] } }

      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]
- Check the job tracker status at http://localhost:50030/jobtracker.jsp.
- When the job is complete, run some Solr queries. For example, for myserver.example.com, use: http://myserver.example.com:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true
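For example, the same query can be issued with curl (using the hypothetical host above):
$ curl 'http://myserver.example.com:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true'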
For help on how to run a Hadoop MapReduce job, use the following command:
- Parcel-based Installation:
$ hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool --help
- Package-based Installation:
$ hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool --help
- For development purposes, use the MapReduceIndexerTool --dry-run option to run in local mode and print documents to stdout, instead of loading them to Solr. Using this option causes the morphline to run in the client process without submitting a job to MapReduce. Running in the client process provides faster turnaround during early trial and debug sessions.
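For example, the parcel-based GoLive command above can be adapted into a dry run. A sketch, reusing the same hypothetical hosts and paths, with --go-live removed and --dry-run added (--zk-host and --collection are kept here so the morphline can still locate the Solr schema):
$ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 jar \
/opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool --dry-run --morphline-file \
/opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
--output-dir hdfs://$NNHOST:8020/user/$USER/outdir --verbose \
--zk-host $ZKHOST:2181/solr --collection collection1 \
hdfs://$NNHOST:8020/user/$USER/indir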
- To print diagnostic information, such as the content of records as they pass through the morphline commands, enable TRACE log level diagnostics by adding the following entry to your log4j.properties file:
log4j.logger.org.kitesdk.morphline=TRACE
The log4j.properties file can be passed using the MapReduceIndexerTool --log4j command-line option.
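For instance, a minimal log4j.properties might look like the following sketch (standard log4j 1.x settings; only the last line is morphline-specific):
# Log INFO and above to the console
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p %c: %m%n
# Trace records as they pass through morphline commands
log4j.logger.org.kitesdk.morphline=TRACE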
Batch Indexing into Offline Solr Shards
Running the MapReduce job without GoLive causes the job to create a set of Solr index shards from a set of input files and write the indexes to HDFS. You can then explicitly point each Solr server to one of the HDFS output shard directories.
Batch indexing into offline Solr shards is mainly intended for offline use-cases by experts. Cases requiring read-only indexes for searching can be handled using batch indexing without the --go-live option. By not using GoLive, you can avoid copying datasets between segments, thereby reducing resource demands.
- Delete all existing documents in Solr.
$ solrctl collection --deletedocs collection1
$ sudo -u hdfs hadoop fs -rm -r -skipTrash /user/$USER/outdir
- Run the Hadoop MapReduce job, replacing $NNHOST in the command with your NameNode hostname and port number, as required.
- Parcel-based Installation:
$ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 jar \
/opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool -D \
'mapred.child.java.opts=-Xmx500m' --log4j \
/opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
/opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
--output-dir hdfs://$NNHOST:8020/user/$USER/outdir --verbose --solr-home-dir \
$HOME/collection1 --shards 2 hdfs://$NNHOST:8020/user/$USER/indir
- Package-based Installation:
$ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 jar \
/usr/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool -D \
'mapred.child.java.opts=-Xmx500m' --log4j \
/usr/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
/usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
--output-dir hdfs://$NNHOST:8020/user/$USER/outdir --verbose --solr-home-dir \
$HOME/collection1 --shards 2 hdfs://$NNHOST:8020/user/$USER/indir
- Check the job tracker status. For example, on the local host, use http://localhost:50030/jobtracker.jsp.
- After the job completes, check the generated index files. Individual shards are written to the results directory with names of the form part-00000, part-00001, and so on. Because this example uses two shards, the directories are part-00000 and part-00001.
$ hadoop fs -ls /user/$USER/outdir/results
$ hadoop fs -ls /user/$USER/outdir/results/part-00000/data/index
- Stop Solr on each host of the cluster.
$ sudo service solr-server stop
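If the hosts are reachable over SSH, a loop can save repeating the command on each one. A sketch, assuming two hypothetical hostnames and that your user may run sudo on each host (the same pattern applies when starting Solr again below):
$ for host in solr01.example.com solr02.example.com; do ssh -t $host sudo service solr-server stop; done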
- List the hostname folders used as part of the path to each index in the SolrCloud cluster.
$ hadoop fs -ls /solr/collection1
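The steps below refer to these hostname folders as $HOSTNAME1 and $HOSTNAME2. A sketch of setting them, assuming the listing showed two hypothetical hosts:
$ export HOSTNAME1=solr01.example.com
$ export HOSTNAME2=solr02.example.com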
- Move index shards into place.
- Remove outdated files:
$ sudo -u solr hadoop fs -rm -r -skipTrash /solr/collection1/$HOSTNAME1/data/index
$ sudo -u solr hadoop fs -rm -r -skipTrash /solr/collection1/$HOSTNAME1/data/tlog
$ sudo -u solr hadoop fs -rm -r -skipTrash /solr/collection1/$HOSTNAME2/data/index
$ sudo -u solr hadoop fs -rm -r -skipTrash /solr/collection1/$HOSTNAME2/data/tlog
- Ensure correct ownership of required directories:
$ sudo -u hdfs hadoop fs -chown -R solr /user/$USER/outdir/results
- Move the two index shards into place (the two servers you set up in Preparing to Index Data with Cloudera Search):
$ sudo -u solr hadoop fs -mv /user/$USER/outdir/results/part-00000/data/index \
/solr/collection1/$HOSTNAME1/data/
$ sudo -u solr hadoop fs -mv /user/$USER/outdir/results/part-00001/data/index \
/solr/collection1/$HOSTNAME2/data/
- Start Solr on each host of the cluster:
$ sudo service solr-server start
- Run some Solr queries. For example, for myserver.example.com, use: http://myserver.example.com:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true