AEM Indexing Issues & How to Fix


Symptoms of indexing issues
The following are signs of an issue with AEM/Oak indexing:
  • Search results are outdated by more than 10 minutes
  • There are missing search results
  • Errors are returned either in the UI or logs during search via site UI, Query Builder search, or JCR query execution
Diagnosing an indexing issue
  • To see if asynchronous indexing is slow or failing, do the following:
1. Open these URLs on your AEM instance to view stats about the Async indexer
http://aemhost:port/system/console/jmx/org.apache.jackrabbit.oak%3Aname%3Dasync%2Ctype%3DIndexStats
http://aemhost:port/system/console/jmx/org.apache.jackrabbit.oak%3Aname%3Dfulltext-async%2Ctype%3DIndexStats  - This URL only applies to AEM 6.2 and later
2. On each of those pages, check these fields:
FailingSince - This indicates when indexing first started failing.
LastError - This is the stack trace showing what is causing indexing to fail.  If this is empty then indexing isn't failing.
LastErrorTime - This indicates the last time indexing threw the error.
LastIndexedTime - If the date and time in this field is more than 5 minutes old, indexing is running too slowly.
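The checks above can be sketched in Python. This is a minimal illustration, not an official AEM tool: the function names are hypothetical, the base URL in the usage example is a placeholder, and the field values are assumed to have been read from the JMX pages by hand and passed in (LastIndexedTime already converted to a datetime).

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import quote

# Threshold taken from the text above: older than 5 minutes means "too slow".
SLOW_AFTER = timedelta(minutes=5)


def index_stats_url(base_url, lane):
    """Build the JMX console URL for one async indexing lane.

    base_url (for example "http://aemhost:port") and lane ("async" or
    "fulltext-async") are supplied by the caller.
    """
    object_name = f"org.apache.jackrabbit.oak:name={lane},type=IndexStats"
    # safe="" percent-encodes ':', '=' and ',' as in the URLs listed above.
    return f"{base_url}/system/console/jmx/{quote(object_name, safe='')}"


def classify_index_stats(last_error, last_indexed_time, now):
    """Return 'failing', 'slow' or 'ok' for one IndexStats page.

    last_error: text of the LastError field ('' or None when empty).
    last_indexed_time: LastIndexedTime as a datetime.
    now: current time as a datetime comparable with last_indexed_time.
    """
    # A non-empty LastError means indexing is failing.
    if last_error and last_error.strip():
        return "failing"
    # Otherwise compare the age of LastIndexedTime with the threshold.
    if now - last_indexed_time > SLOW_AFTER:
        return "slow"
    return "ok"


if __name__ == "__main__":
    # Placeholder host and port, used only for illustration.
    print(index_stats_url("http://localhost:4502", "fulltext-async"))
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(classify_index_stats("", now - timedelta(minutes=12), now))
```

Run the classification once per indexing lane, since the async and fulltext-async lanes can fail independently.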
What causes issues with indexing
  • Improper maintenance or failure to perform maintenance such as Revision Garbage Collection, Workflow Purge, Audit Purge, Version Purge, etc.
  • Corrupt or missing segments in Tar storage
  • Revision Corruption in a clustered environment (DocumentNodeStore - Mongo or Database)
  • An issue with the cluster topology in a clustered environment
How to analyze what is causing indexing issues
  • See this article for analyzing and fixing indexing issues

AEM Replication Issues

Symptoms of Replication Issues
  • Publish requests are queueing up in the replication agent queue
  • Published content does not show up on the publish server
  • System performance is degraded
What causes Replication issues:
  • Replication agent is misconfigured and cannot connect to the publish agent
  • There is an error at the time of replication causing the replication queue to get stuck
  • The system is slow and replications are getting processed slowly
  • The replication is happening as part of a custom workflow and the problem is with workflow processing.
How to analyze Replication issues:
1. Check the replication queue status:
        Active: when items are being processed.
        Idle: when the queue is empty.
        Blocked: when items are in the queue, but cannot be processed; for example, when the agent points to a host that is down or non-existent.
2. Review the replication configuration if the server was cloned or the agent was configured recently. For details, see here
3. Review the replication agent log at http://host:port/etc/replication/agents.author/AgentName.log.html#end. If you cannot identify the cause, collect this log and provide it to AEM support.
4. Review the server error.log in AEMinstall/crx-quickstart/logs. If you cannot identify the cause, collect this log and provide it to AEM support.
5. If the replication queue is in the "idle" state and none of the above applies, the problem is most likely caused by workflows. If workflows are not being processed, the replication item never reaches the replication queue. To monitor your workflows, check the number of running workflow instances in the workflow dashboard. You can read about administering workflows here.
6. Replication slows down when the system is under high load or experiences other performance issues.
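The analysis steps above amount to a small decision procedure, sketched here in Python. The function names and the returned hint strings are hypothetical and only paraphrase the steps. The queue status and whether content is missing on publish are assumed to have been observed manually and passed in.

```python
def agent_log_url(host, port, agent_name):
    """Build the replication agent log URL described in step 3."""
    return (
        f"http://{host}:{port}/etc/replication/agents.author/"
        f"{agent_name}.log.html#end"
    )


def next_check(queue_status, content_missing_on_publish):
    """Suggest the next analysis step for an observed queue status.

    queue_status: 'active', 'idle' or 'blocked' (case-insensitive).
    content_missing_on_publish: True if published content is not visible.
    """
    status = queue_status.strip().lower()
    if status == "blocked":
        # Items are queued but cannot be processed (steps 2 to 4).
        return "review agent configuration, agent log and error.log"
    if status == "idle":
        if content_missing_on_publish:
            # Nothing reached the queue, so workflows are suspect (step 5).
            return "check running workflow instances"
        return "no replication issue indicated"
    if status == "active":
        # Items are moving; slowness points at overall load (step 6).
        return "check system load and performance"
    raise ValueError(f"unknown queue status: {queue_status!r}")


if __name__ == "__main__":
    # Placeholder values, used only for illustration.
    print(agent_log_url("localhost", 4502, "publish"))
    print(next_check("Idle", content_missing_on_publish=True))
```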
Solution to Common Replication issues:
1. Review the replication queue issues.
2. If the problem is caused by workflows not running efficiently, review the concurrent workflow processing tips.
3. For issues related to overall slow AEM performance and replication, review AEM Performance Issues.

AEM TarMK Corruption Issues


Symptoms of TarMK Corruption
  • Instance is inoperable after offline compaction.
  • Instance is stuck in the "Startup in progress" state.
  • Log files or compaction command output report SegmentNotFoundException.
What causes corruption issues
  • The segment is removed by manual intervention (e.g. rm -rf).
  • The segment is removed by revision garbage collection.
  • The segment cannot be found due to a bug in the code.
  • Various maintenance tasks are not performed on time leading to repository growth and low disk space.
  • AEM is stopped forcefully by killing the Java process.
Diagnosing repository corruption issues:
  • Review the error.log file and check whether it contains SegmentNotFoundException or IllegalArgumentException.
  • To determine whether a segment has been removed by revision garbage collection, enable debug logging for the org.apache.jackrabbit.oak.plugins.segment.file.TarReader-GC logger and check its output. That logger records the segment ids of all segments removed by the cleanup phase. Revision garbage collection is the cause of the exception only if the offending segment id appears in that output.
  • In case of corruption in an external datastore, search the log file for all occurrences of the error "Error occurred while obtaining InputStream for blobId". This error means that files are missing from your AEM datastore directory.
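The log checks above can be sketched in Python. This is a rough illustration under stated assumptions: the function names are hypothetical, segment ids are assumed to appear in log lines as standard 36-character UUIDs, and the debug output of the TarReader-GC logger is assumed to have been collected into a separate list of lines.

```python
import re

BLOB_ERROR = "Error occurred while obtaining InputStream for blobId"
EXCEPTIONS = ("SegmentNotFoundException", "IllegalArgumentException")

# Assumption: segment ids are written to the logs as standard UUIDs.
UUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
    re.IGNORECASE,
)


def scan_error_log(lines):
    """Count the corruption markers and collect ids of missing segments.

    Returns (counts, missing_segments) where counts maps each marker
    string to its number of matching lines and missing_segments is the
    set of UUIDs found on SegmentNotFoundException lines.
    """
    counts = {name: 0 for name in EXCEPTIONS}
    counts[BLOB_ERROR] = 0
    missing_segments = set()
    for line in lines:
        for marker in counts:
            if marker in line:
                counts[marker] += 1
        if "SegmentNotFoundException" in line:
            missing_segments.update(m.lower() for m in UUID_RE.findall(line))
    return counts, missing_segments


def removed_by_gc(missing_segments, gc_logger_lines):
    """Return the missing segment ids that the GC logger reported removing."""
    removed = set()
    for line in gc_logger_lines:
        removed.update(m.lower() for m in UUID_RE.findall(line))
    return missing_segments & removed


if __name__ == "__main__":
    # Assumed path; adjust to the instance's crx-quickstart/logs directory.
    with open("crx-quickstart/logs/error.log", encoding="utf-8", errors="replace") as fh:
        counts, missing = scan_error_log(fh)
    print(counts)
    print(sorted(missing))
```

An empty result from removed_by_gc suggests that revision garbage collection did not remove the offending segment, so one of the other causes listed earlier is more likely.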
Solution to repair corruption issues:
  • Determine the last known good revision of the segment store by using the check run-mode of oak-run, then manually revert the corrupt segment store to that revision. This operation reverts the Oak repository to an earlier point in time, so back up the repository completely before performing it.
    • To perform check and restore, follow steps mentioned in this article.
    • If the check fails with ConsistencyChecker - No good revisions found then implement the steps in part B of this article.
  • If you are already using a datastore and you encounter the error "Error occurred while obtaining InputStream for blobId", then there are likely files missing from the datastore. Follow this article to resolve the issue.
  • If you are not using a datastore, use an external file, S3, or Azure datastore instead of the default segment store.
    • Using a datastore provides better performance.
    • Migrate the instance to one with a datastore using crx2oak.
  • Apply the latest Service Pack and Cumulative Fix Pack and Oak Cumulative Fix Pack.