Using MapReduce Batch Indexing with Cloudera Search
The following sections include examples that illustrate using MapReduce to index tweets. These examples require that you:
- Complete the process of Preparing to Index Data with Cloudera Search.
- Install the MapReduce tools for Cloudera Search as described in Installing MapReduce Tools for use with Cloudera Search.
Batch Indexing into Online Solr Servers Using GoLive
MapReduceIndexerTool is a MapReduce batch job driver that creates a set of Solr index shards from a set of input files and writes the indexes into HDFS in a flexible, scalable, and fault-tolerant manner. Using GoLive, MapReduceIndexerTool also supports merging the output shards into a set of live customer-facing Solr servers, typically a SolrCloud. The following sample steps demonstrate these capabilities.
- Delete all existing documents in Solr.
$ solrctl collection --deletedocs collection1
- Run the MapReduce job using GoLive. Replace $NNHOST and $ZKHOST in the command with your NameNode and ZooKeeper hostnames and port numbers, as required. You do not need to specify --solr-home-dir because the job accesses it from ZooKeeper.
- Parcel-based Installation:
$ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 jar \
/opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool -D \
'mapred.child.java.opts=-Xmx500m' --log4j \
/opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
/opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
--output-dir hdfs://$NNHOST:8020/user/$USER/outdir --verbose --go-live \
--zk-host $ZKHOST:2181/solr --collection collection1 \
hdfs://$NNHOST:8020/user/$USER/indir
- Package-based Installation:
$ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 jar \
/usr/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool -D \
'mapred.child.java.opts=-Xmx500m' --log4j \
/usr/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
/usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
--output-dir hdfs://$NNHOST:8020/user/$USER/outdir --verbose --go-live \
--zk-host $ZKHOST:2181/solr --collection collection1 \
hdfs://$NNHOST:8020/user/$USER/indir
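Both forms of the command assume that $NNHOST and $ZKHOST are already set in the shell. A minimal sketch of setting them, using hypothetical hostnames:
$ export NNHOST=nn01.example.com
$ export ZKHOST=zk01.example.com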
A morphline configuration file such as the one passed with --morphline-file takes the following form:
SOLR_LOCATOR : {
  # Name of solr collection
  collection : collection

  # ZooKeeper ensemble
  zkHost : "$ZK_HOST"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [
      { generateUUID { field : id } }

      {
        # Remove record fields that are unknown to Solr schema.xml.
        # Recall that Solr throws an exception on any attempt to load a document that
        # contains a field that isn't specified in schema.xml.
        sanitizeUnknownSolrFields {
          solrLocator : ${SOLR_LOCATOR} # Location from which to fetch Solr schema
        }
      }

      { logDebug { format : "output record: {}", args : ["@{}"] } }

      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]
- Check the job tracker status at http://localhost:50030/jobtracker.jsp.
- When the job is complete, run some Solr queries. For example, for myserver.example.com, use: http://myserver.example.com:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true
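For example, the same query can be issued with curl (using the hypothetical host above):
$ curl 'http://myserver.example.com:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true'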
For help on how to run a Hadoop MapReduce job, use the following command:
- Parcel-based Installation:
$ hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool --help
- Package-based Installation:
$ hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool --help
- For development purposes, use the MapReduceIndexerTool --dry-run option to run in local mode and print documents to stdout, instead of loading them to Solr. Using this option causes the morphline to run in the client process without submitting a job to MapReduce. Running in the client process provides faster turnaround during early trial and debug sessions.
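For example, the parcel-based GoLive command above can be adapted into a dry run. A sketch, reusing the same hypothetical hosts and paths, with --go-live removed and --dry-run added (--zk-host and --collection are kept here so the morphline can still locate the Solr schema):
$ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 jar \
/opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool --dry-run --morphline-file \
/opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
--output-dir hdfs://$NNHOST:8020/user/$USER/outdir --verbose \
--zk-host $ZKHOST:2181/solr --collection collection1 \
hdfs://$NNHOST:8020/user/$USER/indir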
- To print diagnostic information, such as the content of records as they pass through the morphline commands, enable TRACE log level diagnostics by adding the following entry to your log4j.properties file:
log4j.logger.org.kitesdk.morphline=TRACE
The log4j.properties file can be passed using the MapReduceIndexerTool --log4j command-line option.
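For instance, a minimal log4j.properties might look like the following sketch (standard log4j 1.x settings; only the last line is morphline-specific):
# Log INFO and above to the console
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p %c: %m%n
# Trace records as they pass through morphline commands
log4j.logger.org.kitesdk.morphline=TRACE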
Batch Indexing into Offline Solr Shards
Running the MapReduce job without GoLive causes the job to create a set of Solr index shards from a set of input files and write the indexes to HDFS. You can then explicitly point each Solr server to one of the HDFS output shard directories.
Batch indexing into offline Solr shards is mainly intended for offline use-cases by experts. Cases requiring read-only indexes for searching can be handled using batch indexing without the --go-live option. By not using GoLive, you can avoid copying datasets between segments, thereby reducing resource demands.
- Delete all existing documents in Solr.
$ solrctl collection --deletedocs collection1
$ sudo -u hdfs hadoop fs -rm -r -skipTrash /user/$USER/outdir
- Run the Hadoop MapReduce job, replacing $NNHOST in the command with your NameNode hostname and port number, as required.
- Parcel-based Installation:
$ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 jar \
/opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool -D \
'mapred.child.java.opts=-Xmx500m' --log4j \
/opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
/opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
--output-dir hdfs://$NNHOST:8020/user/$USER/outdir --verbose --solr-home-dir \
$HOME/collection1 --shards 2 hdfs://$NNHOST:8020/user/$USER/indir
- Package-based Installation:
$ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 jar \
/usr/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool -D \
'mapred.child.java.opts=-Xmx500m' --log4j \
/usr/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
/usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
--output-dir hdfs://$NNHOST:8020/user/$USER/outdir --verbose --solr-home-dir \
$HOME/collection1 --shards 2 hdfs://$NNHOST:8020/user/$USER/indir
- Check the job tracker status. For example, on the local host, use http://localhost:50030/jobtracker.jsp.
- After the job completes, check the generated index files. Individual shards are written to the results directory with names of the form part-00000, part-00001, and so on. Because this example uses two shards, the directories are part-00000 and part-00001.
$ hadoop fs -ls /user/$USER/outdir/results
$ hadoop fs -ls /user/$USER/outdir/results/part-00000/data/index
- Stop Solr on each host of the cluster.
$ sudo service solr-server stop
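If the hosts are reachable over SSH, a loop can save repeating the command on each one. A sketch, assuming two hypothetical hostnames and that your user may run sudo on each host (the same pattern applies when starting Solr again below):
$ for host in solr01.example.com solr02.example.com; do ssh -t $host sudo service solr-server stop; done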
- List the hostname folders used as part of the path to each index in the SolrCloud cluster.
$ hadoop fs -ls /solr/collection1
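The steps below refer to these hostname folders as $HOSTNAME1 and $HOSTNAME2. A sketch of setting them, assuming the listing showed two hypothetical hosts:
$ export HOSTNAME1=solr01.example.com
$ export HOSTNAME2=solr02.example.com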
- Move index shards into place.
- Remove outdated files:
$ sudo -u solr hadoop fs -rm -r -skipTrash /solr/collection1/$HOSTNAME1/data/index
$ sudo -u solr hadoop fs -rm -r -skipTrash /solr/collection1/$HOSTNAME1/data/tlog
$ sudo -u solr hadoop fs -rm -r -skipTrash /solr/collection1/$HOSTNAME2/data/index
$ sudo -u solr hadoop fs -rm -r -skipTrash /solr/collection1/$HOSTNAME2/data/tlog
- Ensure correct ownership of required directories:
$ sudo -u hdfs hadoop fs -chown -R solr /user/$USER/outdir/results
- Move the two index shards into place (the two servers you set up in Preparing to Index Data with Cloudera Search):
$ sudo -u solr hadoop fs -mv /user/$USER/outdir/results/part-00000/data/index \
/solr/collection1/$HOSTNAME1/data/
$ sudo -u solr hadoop fs -mv /user/$USER/outdir/results/part-00001/data/index \
/solr/collection1/$HOSTNAME2/data/
- Start Solr on each host of the cluster:
$ sudo service solr-server start
- Run some Solr queries. For example, for myserver.example.com, use: http://myserver.example.com:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true