
Using MapReduce Batch Indexing with Cloudera Search

The following sections include examples that illustrate using MapReduce to index tweets. These examples assume that you have completed the preparatory steps, such as those described in Preparing to Index Data with Cloudera Search.

Batch Indexing into Online Solr Servers Using GoLive

  Warning: Batch indexing into offline Solr shards is not supported in environments in which batch indexing into online Solr servers using GoLive occurs.

MapReduceIndexerTool is a MapReduce batch job driver that creates a set of Solr index shards from a set of input files and writes the indexes into HDFS in a flexible, scalable, and fault-tolerant manner. Using GoLive, MapReduceIndexerTool also supports merging the output shards into a set of live customer-facing Solr servers, typically a SolrCloud. The following sample steps demonstrate these capabilities.
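
Before you start, you can confirm that the Avro input files to be indexed are present in HDFS. The following is a minimal sanity check, assuming the same $NNHOST placeholder and indir path used in the steps below; adjust the path for your environment.

  # List the input directory that the MapReduce job reads from (path is an example).
  $ hadoop fs -ls hdfs://$NNHOST:8020/user/$USER/indir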

  1. Delete all existing documents in Solr.
    $ solrctl collection --deletedocs collection1
  2. Run the MapReduce job using GoLive. Replace $NNHOST and $ZKHOST in the command with your NameNode and ZooKeeper hostnames and port numbers, as required. You do not need to specify --solr-home-dir because the job accesses the Solr configuration from ZooKeeper.
    • Parcel-based Installation:
      $ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 jar \
      /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
      org.apache.solr.hadoop.MapReduceIndexerTool -D \
      'mapred.child.java.opts=-Xmx500m' --log4j \
      /opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
      /opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
      --output-dir hdfs://$NNHOST:8020/user/$USER/outdir --verbose --go-live \
      --zk-host $ZKHOST:2181/solr --collection collection1 \
      hdfs://$NNHOST:8020/user/$USER/indir
    • Package-based Installation:
      $ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 jar \
      /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
      org.apache.solr.hadoop.MapReduceIndexerTool -D \
      'mapred.child.java.opts=-Xmx500m' --log4j \
      /usr/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
      /usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
      --output-dir hdfs://$NNHOST:8020/user/$USER/outdir --verbose --go-live \
      --zk-host $ZKHOST:2181/solr --collection collection1 \
      hdfs://$NNHOST:8020/user/$USER/indir
    This command requires a morphline file, which must include a SOLR_LOCATOR. Any --zk-host and --collection arguments passed on the command line override the corresponding settings in the SOLR_LOCATOR. The snippet that includes the SOLR_LOCATOR might appear as follows:
    SOLR_LOCATOR : {
      # Name of solr collection
      collection : collection
    
      # ZooKeeper ensemble
      zkHost : "$ZK_HOST"
    }
    
    morphlines : [
      {
        id : morphline1
        importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
        commands : [
          { generateUUID { field : id } }
    
          { # Remove record fields that are unknown to Solr schema.xml.
            # Recall that Solr throws an exception on any attempt to load a document that
            # contains a field that isn't specified in schema.xml.
            sanitizeUnknownSolrFields {
              solrLocator : ${SOLR_LOCATOR} # Location from which to fetch Solr schema
            }
          }
    
          { logDebug { format : "output record: {}", args : ["@{}"] } }
    
          {
            loadSolr {
              solrLocator : ${SOLR_LOCATOR}
            }
          }
        ]
      }
    ]
  3. Check the job tracker status at http://localhost:50030/jobtracker.jsp.
  4. When the job is complete, run some Solr queries. For example, for myserver.example.com, use: http://myserver.example.com:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true
    For help on how to run a Hadoop MapReduce job, use the following command:
    • Parcel-based Installation:
      $ hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
      org.apache.solr.hadoop.MapReduceIndexerTool --help
    • Package-based Installation:
      $ hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
      org.apache.solr.hadoop.MapReduceIndexerTool --help
    • For development purposes, use the MapReduceIndexerTool --dry-run option to run in local mode and print documents to stdout instead of loading them into Solr. With this option, the morphline runs in the client process without submitting a job to MapReduce, which provides faster turnaround during early trial and debug sessions. A sketch of such a dry run follows this list.
    • To print diagnostic information, such as the content of records as they pass through the morphline commands, enable TRACE log level diagnostics by adding the following entry to your log4j.properties file:
      log4j.logger.org.kitesdk.morphline=TRACE
      The log4j.properties file can be passed using the MapReduceIndexerTool --log4j command-line option.
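
For example, a development-time dry run might resemble the following. This is only a sketch that reuses the parcel-based paths and the $NNHOST and $ZKHOST placeholders from the GoLive example above; the exact arguments depend on your environment.

  # Run the morphline locally and print the resulting documents to stdout
  # instead of loading them into Solr (paths and hosts are example values).
  $ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 jar \
  /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool --dry-run --verbose --log4j \
  /opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
  /opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
  --output-dir hdfs://$NNHOST:8020/user/$USER/outdir \
  --zk-host $ZKHOST:2181/solr --collection collection1 \
  hdfs://$NNHOST:8020/user/$USER/indir

Because the morphline runs in the client process, no MapReduce job is submitted and nothing is loaded into Solr.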

Batch Indexing into Offline Solr Shards

Running the MapReduce job without GoLive creates a set of Solr index shards from a set of input files and writes the indexes to HDFS. You can then explicitly point each Solr server to one of the HDFS output shard directories.

Batch indexing into offline Solr shards is mainly intended for expert users with offline use cases. Use cases that require read-only indexes for searching can be handled with batch indexing and without the --go-live option. By not using GoLive, you avoid copying datasets between segments, which reduces resource demands.

  1. Delete all existing documents in Solr, and remove any output directory left over from a previous run.
    $ solrctl collection --deletedocs collection1
    $ sudo -u hdfs hadoop fs -rm -r -skipTrash /user/$USER/outdir
  2. Run the Hadoop MapReduce job, replacing $NNHOST in the command with your NameNode hostname and port number, as required.
    • Parcel-based Installation:
      $ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 jar \
      /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
      org.apache.solr.hadoop.MapReduceIndexerTool -D \
      'mapred.child.java.opts=-Xmx500m' --log4j \
      /opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
      /opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
      --output-dir hdfs://$NNHOST:8020/user/$USER/outdir --verbose --solr-home-dir \
      $HOME/collection1 --shards 2 hdfs://$NNHOST:8020/user/$USER/indir
    • Package-based Installation:
      $ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 jar \
      /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
      org.apache.solr.hadoop.MapReduceIndexerTool -D \
      'mapred.child.java.opts=-Xmx500m' --log4j \
      /usr/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
      /usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
      --output-dir hdfs://$NNHOST:8020/user/$USER/outdir --verbose --solr-home-dir \
      $HOME/collection1 --shards 2 hdfs://$NNHOST:8020/user/$USER/indir
  3. Check the job tracker status. For example, if the job tracker runs on the local host, use http://localhost:50030/jobtracker.jsp.
  4. After the job completes, check the generated index files. Individual shards are written to the results directory, with names of the form part-00000, part-00001, and so on. This example creates two shards.
    $ hadoop fs -ls /user/$USER/outdir/results
    $ hadoop fs -ls /user/$USER/outdir/results/part-00000/data/index
  5. Stop Solr on each host of the cluster.
    $ sudo service solr-server stop
  6. List the hostname folders used as part of the path to each index in the SolrCloud cluster.
    $ hadoop fs -ls /solr/collection1
  7. Move index shards into place.
    1. Remove the outdated index and transaction log (tlog) directories from each host:
      $ sudo -u solr hadoop fs -rm -r -skipTrash \
      /solr/collection1/$HOSTNAME1/data/index
      $ sudo -u solr hadoop fs -rm -r -skipTrash \
      /solr/collection1/$HOSTNAME1/data/tlog
      $ sudo -u solr hadoop fs -rm -r -skipTrash \
      /solr/collection1/$HOSTNAME2/data/index
      $ sudo -u solr hadoop fs -rm -r -skipTrash \
      /solr/collection1/$HOSTNAME2/data/tlog
    2. Ensure correct ownership of required directories:
      $ sudo -u hdfs hadoop fs -chown -R solr /user/$USER/outdir/results
    3. Move the two index shards into place (the two servers you set up in Preparing to Index Data with Cloudera Search):
      $ sudo -u solr hadoop fs -mv /user/$USER/outdir/results/part-00000/data/index \
      /solr/collection1/$HOSTNAME1/data/
      $ sudo -u solr hadoop fs -mv /user/$USER/outdir/results/part-00001/data/index \
      /solr/collection1/$HOSTNAME2/data/
  8. Start Solr on each host of the cluster:
    $ sudo service solr-server start
  9. Run some Solr queries. For example, for myserver.example.com, use: http://myserver.example.com:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true
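
The same query can also be issued from the command line. The following is a minimal sketch that assumes curl is installed and the collection is named collection1; rows=0 limits the response to the matching document count (numFound), which is a quick way to confirm that the moved shards are being served:

  # Return only the document count from the rebuilt collection (host name is an example).
  $ curl 'http://myserver.example.com:8983/solr/collection1/select?q=*:*&rows=0&wt=json&indent=true'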