Load and Index Data in Search
- Packages: if Search for CDH 5.8.0 is installed to the default location using packages, the Quick Start script is found in /usr/share/doc/search-*/quickstart.
- Parcels: if Search for CDH 5.8.0 is installed to the default location using parcels, the Quick Start script is found in /opt/cloudera/parcels/CDH/share/doc/search-*/quickstart.
The script uses several defaults that you might want to modify:
Parameter | Default | Notes |
---|---|---|
NAMENODE_CONNECT | `hostname`:8020 | For use on an HDFS HA cluster. If you use NAMENODE_CONNECT, do not use NAMENODE_HOST or NAMENODE_PORT. |
NAMENODE_HOST | `hostname` | If you use NAMENODE_HOST and NAMENODE_PORT, do not use NAMENODE_CONNECT. |
NAMENODE_PORT | 8020 | If you use NAMENODE_HOST and NAMENODE_PORT, do not use NAMENODE_CONNECT. |
ZOOKEEPER_ENSEMBLE | `hostname`:2181/solr | ZooKeeper ensemble to point to, for example: zk1,zk2,zk3:2181/solr. If you use ZOOKEEPER_ENSEMBLE, do not use ZOOKEEPER_HOST, ZOOKEEPER_PORT, or ZOOKEEPER_ROOT. |
ZOOKEEPER_HOST | `hostname` | |
ZOOKEEPER_PORT | 2181 | |
ZOOKEEPER_ROOT | /solr | |
HDFS_USER | ${HDFS_USER:="${USER}"} | |
SOLR_HOME | /opt/cloudera/parcels/SOLR/lib/solr | |
By default, the script is configured to run on the NameNode host, which is also running ZooKeeper. Override these defaults with custom values when you start quickstart.sh. For example, to use an alternate NameNode and HDFS user ID, you could start the script as follows:
$ NAMENODE_HOST=nnhost HDFS_USER=jsmith ./quickstart.sh
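On an HDFS HA cluster, the combined connection-string variables from the table replace the individual host/port settings. A minimal sketch, in which `nameservice1` and the `zk1`/`zk2`/`zk3` hostnames are placeholder names, not values from the script:

```shell
# HA override: NAMENODE_CONNECT replaces NAMENODE_HOST/NAMENODE_PORT, and
# ZOOKEEPER_ENSEMBLE replaces ZOOKEEPER_HOST, ZOOKEEPER_PORT, and ZOOKEEPER_ROOT.
# nameservice1 and zk1/zk2/zk3 are placeholder names for this example.
export NAMENODE_CONNECT=nameservice1:8020
export ZOOKEEPER_ENSEMBLE=zk1:2181,zk2:2181,zk3:2181/solr
# Then launch the script with the overrides in the environment:
#   ./quickstart.sh
```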
The first time the script runs, it downloads required files such as the Enron data and configuration files. On subsequent runs, the script uses the already-downloaded Enron data rather than downloading it again, and re-creates the enron-email-collection SolrCloud collection from that existing data.
The script also generates a Solr configuration and creates a collection in SolrCloud. You can complete these steps manually instead, if desired. The script completes the following tasks:
- Set variables such as hostnames and directories.
- Create a directory to which to copy the Enron data and then copy that data to this location. This data is about 422 MB and in some tests took about five minutes to download and two minutes to untar.
- Create directories for the current user in HDFS, change ownership of that directory to the current user, create a directory for the Enron data, and load the Enron data to that directory. In some tests, it took about a minute to copy approximately 3 GB of untarred data.
- Use solrctl to create a template of the instance directory.
- Use solrctl to create a new Solr collection for the Enron mail collection.
- Create a directory to which the MapReduceIndexerTool can write results, and ensure that the directory is empty.
- Use the MapReduceIndexerTool to index the Enron data and push the result live to enron-email-collection. In some tests, it took about seven minutes to complete this task.
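The manual equivalents of these steps can be sketched as follows. This is an outline, not the script itself: the collection name comes from the text above, but the HDFS paths, shard count, morphline file, job JAR name, and ZooKeeper address are placeholder assumptions. The `run` helper echoes each command while `DRY_RUN=1` (the default) so the sequence can be reviewed before executing on a live cluster:

```shell
#!/usr/bin/env bash
# Sketch of the steps quickstart.sh automates. Paths, the shard count, the
# morphline file, and the job JAR name are placeholders; adjust for your cluster.
COLLECTION=enron-email-collection
ZK=zk1:2181/solr                 # ZooKeeper ensemble (placeholder)
INDIR=/user/$USER/enron/indir    # HDFS location for the untarred Enron data
OUTDIR=/user/$USER/enron/outdir  # must be empty before indexing

# Echo commands instead of executing them while DRY_RUN=1 (the default).
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

run hadoop fs -mkdir -p "$INDIR"                        # HDFS dir for current user
run hadoop fs -put enron_mail/ "$INDIR"                 # load the Enron data
run solrctl instancedir --generate "$HOME/solr_configs" # template instance dir
run solrctl instancedir --create "$COLLECTION" "$HOME/solr_configs"
run solrctl collection --create "$COLLECTION" -s 2      # shard count is an assumption
run hadoop fs -rm -r -skipTrash "$OUTDIR"               # ensure output dir is empty
run hadoop jar search-mr-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
    --morphline-file morphlines.conf --output-dir "$OUTDIR" \
    --go-live --zk-host "$ZK" --collection "$COLLECTION" "$INDIR"
```

Set `DRY_RUN=0` to execute the commands rather than print them.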
©2016 Cloudera, Inc. All rights reserved.