Reindexing in AEM6

Use Case 3 - Reindexing

In some scenarios, reindexing needs to be performed. Currently, reindexing is done by setting the reindex flag to true on the index definition node, either via CRXDE or via the Index Manager user interface. After the flag is set, reindexing is done asynchronously.
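Besides CRXDE and the Index Manager, the flag can in principle also be set over HTTP. The sketch below assumes the Sling POST servlet is reachable on the instance; the host, user, and index path are placeholders, and the helper name reindex_flag_cmd is made up for illustration. It only prints the curl command so that it can be reviewed before being run.

```shell
#!/bin/sh
# Print (do not run) a curl command that sets reindex=true on an index
# definition node through the Sling POST servlet.
# Assumptions: the base URL, user and index path are placeholders.
reindex_flag_cmd() {
  base_url=$1      # e.g. http://localhost:4502 (hypothetical)
  index_path=$2    # e.g. /oak:index/lucene
  user=$3          # curl prompts for the password when only the user is given
  printf 'curl -u %s -F reindex=true -F reindex@TypeHint=Boolean %s%s\n' \
    "$user" "$base_url" "$index_path"
}

reindex_flag_cmd "http://localhost:4502" "/oak:index/lucene" "admin"
```

The reindex@TypeHint=Boolean field asks the POST servlet to store the value as a boolean rather than a string.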
Some points to note around reindexing:
  • Reindexing is a lot slower on DocumentNodeStore setups than on SegmentNodeStore setups, where all content is local;
  • With the current design, while reindexing is in progress the async indexer is blocked, and all other async indexes become stale and are not updated for the duration of the reindexing. Because of this, if the system is in use, users may not see up-to-date results;
  • Reindexing involves traversing the whole repository, which can put a high load on the AEM setup and thus impact the end-user experience;
  • For a DocumentNodeStore installation, where reindexing might take a considerable amount of time, if the connection to the Mongo database fails in the middle of the operation, indexing has to be restarted from scratch;
  • In some cases reindexing can take a long time because of text extraction. This mainly affects setups with lots of PDF files, where the time spent on text extraction can impact indexing time.
To address these issues, the oak-run index tooling supports different reindexing modes which can be used as required. The oak-run index command provides the following benefits:
  • out-of-band reindexing - oak-run reindexing can be done separately from a running AEM setup, which minimizes the impact on the AEM instance that is in use;
  • out-of-lane reindexing - The reindexing takes place without impacting other indexing operations. This means that the async indexer can continue to update other indexes;
  • Simplified reindex for DocumentNodeStore installations - For DocumentNodeStore installations, reindexing can be done with a single command which ensures that reindexing is done in the most optimal way;
  • Supports updating index definitions and introducing new index definitions.

Reindex - DocumentNodeStore

For DocumentNodeStore installations reindexing can be done via a single oak-run command:
java -jar oak-run*.jar index --reindex --index-paths=/oak:index/lucene --read-write --fds-path=/path/to/datastore mongodb://server:port/aem
This provides the following benefits:
  • Minimal impact on running AEM instances. Most of the reads can be done from secondary servers, and the caches of the running AEM instances are not adversely impacted by all the traversal required for reindexing;
  • Users can also provide a JSON file with new or updated index definitions via the --index-definitions-file option.
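As an illustration of the --index-definitions-file option, the sketch below writes a small definitions file and prints a matching command. The index name exampleAssetIndex and its single property rule are hypothetical, and the JSON layout (index path as the top-level key) follows what the Oak documentation describes as far as I know; check it against the oak-run version in use.

```shell
#!/bin/sh
# Write an example index-definitions file and print the matching oak-run command.
# The index "exampleAssetIndex" and its rule are hypothetical.
defs_file="${TMPDIR:-/tmp}/index-definitions.json"

cat > "$defs_file" <<'EOF'
{
  "/oak:index/exampleAssetIndex": {
    "jcr:primaryType": "oak:QueryIndexDefinition",
    "type": "lucene",
    "async": "async",
    "compatVersion": 2,
    "indexRules": {
      "jcr:primaryType": "nt:unstructured",
      "dam:Asset": {
        "jcr:primaryType": "nt:unstructured",
        "properties": {
          "jcr:primaryType": "nt:unstructured",
          "mimetype": {
            "jcr:primaryType": "nt:unstructured",
            "name": "mimetype",
            "propertyIndex": true
          }
        }
      }
    }
  }
}
EOF

# Printed for review only; the datastore path and MongoDB URI are the same
# placeholders as in the command above.
reindex_cmd="java -jar oak-run*.jar index --reindex \
--index-paths=/oak:index/exampleAssetIndex \
--index-definitions-file=$defs_file \
--read-write --fds-path=/path/to/datastore mongodb://server:port/aem"
echo "$reindex_cmd"
```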

Reindex - SegmentNodeStore

For SegmentNodeStore installations reindexing can be done in one of the following ways:

Online Reindex - SegmentNodeStore

Follow the established approach, where reindexing is done by setting the reindex flag.

Online Reindex - SegmentNodeStore - The AEM Instance is Running

For SegmentNodeStore installations, only one process can access the segment files in read-write mode. Because of this, some oak-run indexing operations require additional manual steps.
These steps are as follows:
  1. Connect oak-run to the same repository used by AEM in read-only mode and perform the indexing against an existing checkpoint, passed via --checkpoint. For example:
    java -jar oak-run-1.7.6.jar index --fds-path=/Users/dhasler/dev/cq/quickstart/target/crx-quickstart/repository/datastore/ --checkpoint 26b7da38-a699-45b2-82fb-73aa2f9af0e2 --reindex --index-paths=/oak:index/lucene /Users/dhasler/dev/cq/quickstart/target/crx-quickstart/repository/segmentstore/
  2. Finally, import the created index files via the IndexerMBean#importIndex operation, using the path where oak-run saved the index files after running the above command.
In this scenario you do not have to stop the AEM server or provision a new instance. However, because indexing involves traversing the whole repository, it increases the I/O load on the installation and negatively impacts runtime performance.
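To make the hand-off between the two steps explicit, the sketch below builds the read-only command with an explicit output directory and derives the directory to pass to importIndex. It assumes that oak-run accepts --out-dir and writes the index files to an indexes subfolder of that directory; the helper names and all paths are hypothetical, so verify the layout on your oak-run version.

```shell
#!/bin/sh
# Build the read-only indexing command with an explicit output directory and
# print the directory that would then be handed to IndexerMBean#importIndex.
# Assumption: oak-run writes the index files to <out-dir>/indexes.
readonly_index_cmd() {
  checkpoint=$1; segmentstore=$2; datastore=$3; out_dir=$4
  printf 'java -jar oak-run*.jar index --reindex --index-paths=/oak:index/lucene --checkpoint %s --fds-path=%s --out-dir=%s %s\n' \
    "$checkpoint" "$datastore" "$out_dir" "$segmentstore"
}

# Directory expected to hold the generated index files (assumed layout).
import_dir() {
  printf '%s/indexes\n' "$1"
}

readonly_index_cmd "<checkpoint-id>" "/path/to/segmentstore" "/path/to/datastore" "/tmp/reindex-out"
import_dir "/tmp/reindex-out"
```

Neither function runs oak-run; both only print, so the command can be checked before it is executed against the repository.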

Online Reindex - SegmentNodeStore - The AEM Instance is Shut Down

For SegmentNodeStore installations reindexing can be done via a single oak-run command. However, the AEM instance needs to be shut down.
You can trigger reindexing with the following command:
java -jar oak-run*.jar index --reindex --index-paths=/oak:index/lucene --read-write --fds-path=/path/to/datastore /path/to/segmentstore/
The difference between this approach and the one explained above is that checkpoint creation and index import are done automatically. The downside is that AEM needs to be down during the process.

Out of Band Reindex - SegmentNodeStore

In this use case, you can perform reindexing on a cloned setup to minimize impact on the running AEM instance:
  1. Create a checkpoint via a JMX operation. You can do this by going to the JMX Console and searching for CheckpointManager. Then, click on the createCheckpoint(long p1) operation, using a high value for expiration in seconds (for example, 2592000).
  2. Copy the crx-quickstart folder to a new machine.
  3. Perform the reindex via the oak-run index command.
  4. Copy the generated index files to the AEM server.
  5. Import the index files via JMX.
Under this use case, it is assumed that the Data Store is accessible from another instance, which may not be possible if the FileDataStore is placed on a cloud-based storage solution like EBS. This excludes the scenario where the FileDataStore is also cloned. If the index definition does not perform fulltext indexing, then access to the DataStore is not required.
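The shell side of steps 2 to 4 can be sketched as a printed plan. The host name, directories, and the out_of_band_plan helper are hypothetical, rsync is just one possible way to copy the files, and the indexing-result/indexes location is an assumption about where oak-run leaves its output. Steps 1 and 5 remain JMX operations and are not covered.

```shell
#!/bin/sh
# Print the shell side of the out-of-band workflow (steps 2 to 4) as a plan.
# Hosts and directories are hypothetical; nothing is executed.
out_of_band_plan() {
  checkpoint=$1; aem_host=$2; clone_dir=$3
  # Step 2: copy crx-quickstart from the AEM server to the clone machine.
  echo "rsync -a ${aem_host}:/path/to/crx-quickstart/ ${clone_dir}/"
  # Step 3: reindex on the clone against the checkpoint created in step 1.
  echo "java -jar oak-run*.jar index --reindex --index-paths=/oak:index/lucene --checkpoint ${checkpoint} --fds-path=${clone_dir}/repository/datastore ${clone_dir}/repository/segmentstore"
  # Step 4: copy the generated index files back to the AEM server.
  echo "rsync -a indexing-result/indexes/ ${aem_host}:/path/to/index-import/"
}

out_of_band_plan "<checkpoint-id>" "aem-author.example.com" "/data/clone/crx-quickstart"
```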
