Tuning the Solr Server
Solr performance tuning is a complex task. The following sections provide more details.
Tuning to Complete During Setup
Some tuning is best completed during the setup of your system, because changing it later may require re-indexing.
Configuring Lucene Version Requirements
<luceneMatchVersion>4.4</luceneMatchVersion>
Designing the Schema
- Use the tdate type for dates. Do this instead of representing dates as strings.
- Consider using the text type that applies to your language, instead of using String. For example, you might use text_en. Text types support returning results for subsets of an entry. For example, querying on "john" would find "John Smith", whereas with the string type, only exact matches are returned.
- For IDs, use the string type.
General Tuning
The following tuning categories can be completed at any time. It is less important to implement these changes before beginning to use your system.
General Tips
- Enabling multi-threaded faceting can provide better performance for field faceting. When multi-threaded faceting is enabled, field faceting tasks are completed in parallel, with a separate thread working on each field faceting task. Performance improvements do not occur in all cases, but improvements are likely when all of the following are true:
- The system uses highly concurrent hardware.
- Faceting operations apply to large data sets over multiple fields.
- There is not an unusually high number of queries occurring simultaneously on the system. Systems that are lightly loaded or that are mainly engaged with ingestion and indexing may be helped by multi-threaded faceting; for example, a system ingesting articles and being queried by a researcher. Systems heavily loaded by user queries are less likely to be helped by multi-threaded faceting; for example, an e-commerce site with heavy user-traffic.
Note: Multi-threaded faceting applies only to field faceting, not to query faceting.
- Field faceting identifies the number of unique entries for a field. For example, multi-threaded faceting could be used to simultaneously facet for the number of unique entries for the fields "color" and "size". In such a case, there would be two threads, and each thread would work on faceting one of the two fields.
- Query faceting identifies the number of unique entries that match a query for a field. For example, query faceting could be used to find the number of unique entries in the "size" field that are between 1 and 5. Multi-threaded faceting does not apply to these operations.
To enable multi-threaded faceting, add the facet.threads parameter to queries. For example, to use up to 1000 threads, you might use a query as follows:
http://localhost:8983/solr/collection1/select?q=*:*&facet=true&fl=id&facet.field=f0_ws&facet.threads=1000
If facet.threads is omitted or set to 0, faceting is single-threaded. If facet.threads is set to a negative value, such as -1, multi-threaded faceting uses as many threads as there are fields to facet, up to the maximum number of threads possible on the system.
- If your environment does not require Near Real Time (NRT), turn off soft auto-commit in solrconfig.xml.
- In most cases, do not change the default batch size setting of 1000. If you are working with especially large documents, you may consider decreasing the batch size.
- To help identify any garbage collector (GC) issues, enable GC logging in production. The overhead is low, and the JVM supports GC log rolling as of Java 1.6.0_34.
- The minimum recommended GC logging flags are: -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails.
- To rotate the GC logs: -Xloggc: -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles= -XX:GCLogFileSize=.
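As a hedged illustration of the facet.threads parameter described in the tips above, the following Python sketch builds a field-faceting request URL. The host, the collection name (collection1), and the field names (color, size) are example values, and facet_query_url is a hypothetical helper, not part of Solr.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, fields, threads):
    """Build a select URL that facets on `fields` with the given facet.threads value.

    Hypothetical helper for illustration; only the query parameters are Solr's.
    """
    params = [("q", "*:*"), ("facet", "true"), ("fl", "id")]
    # One facet.field parameter per field to facet on.
    params += [("facet.field", field) for field in fields]
    # 0 or omitted means single-threaded; a negative value lets Solr use
    # up to one thread per faceted field.
    params.append(("facet.threads", str(threads)))
    # Keep "*" and ":" readable in the match-all query.
    return base_url + "?" + urlencode(params, safe="*:")


url = facet_query_url(
    "http://localhost:8983/solr/collection1/select",  # example host and collection
    ["color", "size"],                                # example field names
    -1,
)
print(url)
```

With two faceted fields and facet.threads=-1, Solr would use at most two facet threads for this request, one per field.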
Solr and HDFS - the Block Cache
Cloudera Search enables Solr to store indexes in an HDFS filesystem. To maintain performance, an HDFS block cache has been implemented using Least Recently Used (LRU) semantics. This enables Solr to cache HDFS index files on read and write, storing the portions of the file in JVM "direct memory" (meaning off heap) by default or optionally in the JVM heap. Direct memory is preferred as it is not affected by garbage collection.
Batch jobs typically do not use the cache, while Solr servers (when serving queries or indexing documents) should. When running indexing using MapReduce, the MR jobs themselves do not use the block cache. Block write caching is turned off by default and should be left disabled.
Configuration
The following parameters control caching. They can be configured at the Solr process level by setting the respective system property or by editing the solrconfig.xml directly.
Parameter | Default | Description |
---|---|---|
solr.hdfs.blockcache.enabled | true | Enable the block cache. |
solr.hdfs.blockcache.read.enabled | true | Enable the read cache. |
solr.hdfs.blockcache.write.enabled | false | Enable the write cache. |
solr.hdfs.blockcache.direct.memory.allocation | true | Enable direct memory allocation. If this is false, heap is used. |
solr.hdfs.blockcache.slab.count | 1 | Number of memory slabs to allocate. Each slab is 128 MB in size. |
solr.hdfs.blockcache.global | true | If enabled, a single HDFS block cache is used for all SolrCores on a host. If blockcache.global is disabled, each SolrCore on a host creates its own private HDFS block cache. Enabling this parameter simplifies managing HDFS block cache memory. |
Increasing the direct memory cache size may make it necessary to increase the maximum direct memory size allowed by the JVM. Each Solr slab allocates the slab's memory, which is 128 MB by default, as well as allocating some additional direct memory overhead. Therefore, ensure that the MaxDirectMemorySize is set comfortably above the value expected for slabs alone. The amount of additional memory required varies according to multiple factors, but for most cases, setting MaxDirectMemorySize to at least 20-30% more than the total memory configured for slabs is sufficient. Setting the MaxDirectMemorySize to the number of slabs multiplied by the slab size does not provide enough memory.
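The sizing guidance above can be worked through as a small calculation. This is a rough sketch, assuming the 128 MB slab size from the parameter table and the 20-30% headroom suggested in the text; the function name and the slab count of 100 are illustrative only.

```python
SLAB_SIZE_MB = 128  # slab size stated in the parameter table


def min_direct_memory_mb(slab_count, headroom_pct):
    """Return slab memory plus a percentage of headroom, rounded up to whole MB.

    Hypothetical helper: the headroom percentage is a rule of thumb, not a limit
    enforced by Solr or the JVM.
    """
    slab_total_mb = slab_count * SLAB_SIZE_MB
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-slab_total_mb * (100 + headroom_pct) // 100)


slabs = 100  # example value for solr.hdfs.blockcache.slab.count
print("slabs alone:", slabs * SLAB_SIZE_MB, "MB")
print("with 20% headroom:", min_direct_memory_mb(slabs, 20), "MB")
print("with 30% headroom:", min_direct_memory_mb(slabs, 30), "MB")
```

For 100 slabs, the slabs alone take 12800 MB, so a MaxDirectMemorySize between roughly 15360 MB and 16640 MB or higher would follow from this rule of thumb.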
To set MaxDirectMemorySize using Cloudera Manager
- Go to the Solr service.
- Click the Configuration tab.
- In the Search box, type Java Direct Memory Size of Solr Server in Bytes.
- Set the new direct memory value.
- Restart Solr servers after editing the parameter.
Solr HDFS optimizes caching when performing NRT indexing using Lucene's NRTCachingDirectory.
Lucene caches a newly created segment if both of the following conditions are true:
- The segment is the result of a flush or a merge and the estimated size of the merged segment is <= solr.hdfs.nrtcachingdirectory.maxmergesizemb.
- The total cached bytes is <= solr.hdfs.nrtcachingdirectory.maxcachedmb.
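The two conditions above can be expressed as a simple predicate. This is a simplified model of the rule as stated in this section, not Lucene's actual implementation; the function name is hypothetical, and the default limits are the ones listed for the solr.hdfs.nrtcachingdirectory parameters.

```python
MB = 1024 * 1024


def would_cache(segment_bytes, cached_bytes, max_merge_size_mb=16, max_cached_mb=192):
    """Model of the documented rule for caching a newly created segment.

    segment_bytes: estimated size of the flushed or merged segment.
    cached_bytes:  total bytes currently held in the NRT cache.
    """
    fits_merge_limit = segment_bytes <= max_merge_size_mb * MB
    cache_within_limit = cached_bytes <= max_cached_mb * MB
    return fits_merge_limit and cache_within_limit


print(would_cache(8 * MB, 100 * MB))   # small segment, cache within limit
print(would_cache(32 * MB, 0))         # segment larger than maxmergesizemb
print(would_cache(8 * MB, 200 * MB))   # cache already above maxcachedmb
```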
The following parameters control NRT caching behavior:
Parameter | Default | Description |
---|---|---|
solr.hdfs.nrtcachingdirectory.enable | true | Whether to enable the NRTCachingDirectory. |
solr.hdfs.nrtcachingdirectory.maxcachedmb | 192 | Size of the cache in megabytes. |
solr.hdfs.nrtcachingdirectory.maxmergesizemb | 16 | Maximum segment size to cache. |
Here is an example of solrconfig.xml with defaults:
<directoryFactory name="DirectoryFactory">
  <bool name="solr.hdfs.blockcache.enabled">${solr.hdfs.blockcache.enabled:true}</bool>
  <int name="solr.hdfs.blockcache.slab.count">${solr.hdfs.blockcache.slab.count:1}</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">${solr.hdfs.blockcache.direct.memory.allocation:true}</bool>
  <int name="solr.hdfs.blockcache.blocksperbank">${solr.hdfs.blockcache.blocksperbank:16384}</int>
  <bool name="solr.hdfs.blockcache.read.enabled">${solr.hdfs.blockcache.read.enabled:true}</bool>
  <bool name="solr.hdfs.blockcache.write.enabled">${solr.hdfs.blockcache.write.enabled:false}</bool>
  <bool name="solr.hdfs.nrtcachingdirectory.enable">${solr.hdfs.nrtcachingdirectory.enable:true}</bool>
  <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">${solr.hdfs.nrtcachingdirectory.maxmergesizemb:16}</int>
  <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">${solr.hdfs.nrtcachingdirectory.maxcachedmb:192}</int>
</directoryFactory>
The following example illustrates passing Java options by editing the /etc/default/solr or /opt/cloudera/parcels/CDH-*/etc/default/solr configuration file:
CATALINA_OPTS="-Xmx10g -XX:MaxDirectMemorySize=20g -XX:+UseLargePages -Dsolr.hdfs.blockcache.slab.count=100"
For better performance, Cloudera recommends setting the Linux swap space on all Solr server hosts as shown below:
# minimize swappiness
sudo sysctl vm.swappiness=1
sudo bash -c 'echo "vm.swappiness=1" >> /etc/sysctl.conf'
# disable swap space until next reboot:
sudo /sbin/swapoff -a
Threads
Configure the Tomcat server to have more threads per Solr instance. Note that this is only effective if your hardware is sufficiently powerful to accommodate the increased threads. 10,000 threads is a reasonable number to try in many cases.
For example, this Connector definition:
<Connector port="${solr.port}" protocol="HTTP/1.1" connectionTimeout="20000" redirectPort="8443" />
Becomes this:
<Connector port="${solr.port}" protocol="HTTP/1.1" maxThreads="10000" connectionTimeout="20000" redirectPort="8443" />
Garbage Collection
Choose different garbage collection options for best performance in different environments. Some garbage collection options typically chosen include:
- Concurrent low pause collector: Use this collector in most cases. This collector attempts to minimize "Stop the World" events. Avoiding these events can reduce connection timeouts, such as with ZooKeeper, and may improve user experience. This collector is enabled using -XX:+UseConcMarkSweepGC.
- Throughput collector: Consider this collector if raw throughput is more important than user experience. This collector typically uses more "Stop the World" events so this may negatively affect user experience and connection timeouts such as ZooKeeper heartbeats. This collector is enabled using -XX:+UseParallelGC. If UseParallelGC "Stop the World" events create problems, such as ZooKeeper timeouts, consider using the UseParNewGC collector as an alternative collector with similar throughput benefits.
You can also affect garbage collection behavior by increasing the Eden space to accommodate new objects. With additional Eden space, garbage collection does not need to run as frequently on new objects.
Replication
You can adjust the degree to which different data is replicated.
Replication Settings
To adjust the Solr replication factor for index files stored in HDFS
- Configure the solr.hdfs.confdir system property to refer to the Solr HDFS configuration files. Typically the value is: -Dsolr.hdfs.confdir=/etc/solrhdfs/
- In a Cloudera Manager deployment, set this value using the Solr advanced configuration snippet.
- In a deployment not managed by Cloudera Manager, set the solr.hdfs.confdir system property by adding the following to the command you use to invoke Solr: -Dsolr.hdfs.confdir=/etc/solrhdfs
- Set the DFS replication value in the HDFS configuration file at the location you specified in the previous step. For example, to set the replication value to 2, you would change the dfs.replication setting as follows:
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
- Restart the Solr service.
Optionally, you can also configure the transaction log replication factor. Cloudera recommends leaving the value unchanged at 3 or, barring that, leaving it greater than 1. For more information on changing this setting, see Transaction Log Replication.
Replicas
If you have sufficient additional hardware, add more replicas for a linear boost of query throughput. Note that adding replicas may slow write performance on the first replica, but otherwise this should have minimal negative consequences.
Transaction Log Replication
Beginning with CDH 5.4.1, Search for CDH supports configurable replication levels for transaction logs stored in HDFS. The level is set with tlogDfsReplication in the updateLog section of solrconfig.xml, as in the following excerpt:
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Enables a transaction log, used for real-time get, durability, and
       solr cloud replica recovery. The log can grow as big as
       uncommitted changes to the index, so use of a hard autoCommit
       is recommended (see below).
       "dir" - the target directory for transaction logs, defaults to the
       solr data directory. -->
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <int name="tlogDfsReplication">3</int>
  </updateLog>
Maintaining multiple transaction log replicas can:
- Reduce the chance of data loss, especially when the system is otherwise configured to have single replicas of shards. For example, having single replicas of shards is reasonable when autoAddReplicas is enabled, but without additional transaction log replicas, the risk of data loss during a host failure would increase.
- Facilitate rolling upgrade of HDFS while Search is running. If you have multiple copies of the log, when a host with the transaction log becomes unavailable during the rolling upgrade process, another copy of the log can continue to collect transactions.
- Facilitate HDFS write lease recovery.
Initial testing shows no significant performance regression for common use cases.
Shards
In some cases, oversharding can help improve performance, including intake speed. If your environment includes massively parallel hardware and you want to use these available resources, consider oversharding. You might increase the number of replicas per host from 1 to 2 or 3. Making such changes creates complex interactions, so you should continue to monitor your system's performance to ensure that the costs of oversharding do not outweigh the benefits.
Commits
Changing commit values may improve performance in some situations. These changes result in tradeoffs and may not be beneficial in all cases.
- For hard commit values, the default value of 60000 (60 seconds) is typically effective, though changing this value to 120 seconds may improve performance in some cases. Note that setting this value much higher, such as to 600 seconds, may result in undesirable performance tradeoffs.
- Consider increasing the auto-commit value from 15000 (15 seconds) to 120000 (120 seconds).
- Enable soft commits and set the value to the largest value that meets your requirements. The default value of 1000 (1 second) is too aggressive for some environments.
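To make the commit settings above concrete, the following Python sketch renders the corresponding solrconfig.xml elements (autoCommit and autoSoftCommit, each with a maxTime in milliseconds). The helper name is hypothetical, the 120000 ms hard commit comes from the suggestion above, and the 5000 ms soft commit is an arbitrary example value, not a recommendation.

```python
import xml.etree.ElementTree as ET


def render_commit_settings(hard_commit_ms, soft_commit_ms):
    """Return an updateHandler fragment with hard and soft commit intervals.

    Hypothetical helper for illustration; a real solrconfig.xml usually
    contains additional settings inside updateHandler.
    """
    handler = ET.Element("updateHandler", {"class": "solr.DirectUpdateHandler2"})
    hard = ET.SubElement(handler, "autoCommit")
    ET.SubElement(hard, "maxTime").text = str(hard_commit_ms)
    soft = ET.SubElement(handler, "autoSoftCommit")
    ET.SubElement(soft, "maxTime").text = str(soft_commit_ms)
    return ET.tostring(handler, encoding="unicode")


fragment = render_commit_settings(hard_commit_ms=120000, soft_commit_ms=5000)
print(fragment)
```

Keeping both values in milliseconds, as solrconfig.xml does, makes it easier to compare them against the defaults quoted in the list above.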
Other Resources
- General information on Solr caching is available on the SolrCaching page on the Solr Wiki.
- Information on issues that influence performance is available on the SolrPerformanceFactors page on the Solr Wiki.
- Resource Management describes how to use Cloudera Manager to manage resources, for example with Linux cgroups.
- For information on improving querying performance, see How to make searching faster.
- For information on improving indexing performance, see How to make indexing faster.
©2016 Cloudera, Inc. All rights reserved.