A few years ago, we started working with a Splunk customer with a 10TB license. The customer was on the verge of eliminating Splunk from their environment because they said Splunk didn’t work. Users weren’t receiving their scheduled reports and ad hoc searches took too long to execute. We were asked to help resolve the problem.
We discovered two main issues. They didn’t have enough indexers to support a 10TB license and they had over 100,000 searches all scheduled to run at midnight. Splunk simply cannot execute 100,000 searches at the same time, which meant that users never received their reports in the morning.
Solving the problem for this customer took several months of painstaking work. We categorized each search, found its owner and determined if the search was still necessary. We discovered that most of the scheduled searches were no longer required and reduced the search load down to a manageable number. For the searches that remained, we were able to spread out the execution times to balance the system load. Today this customer is very happy with Splunk and has expanded to a 30TB license.
Chances are good that your environment doesn’t have 100,000 scheduled searches, but it’s likely that most of your scheduled searches run at midnight because that’s the default time to run a search. As your system grows this can become a real problem.
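One way to picture spreading the load is to give each daily search its own start time inside an off-peak window instead of leaving everything at midnight. The Python sketch below is only an illustration: the function name `staggered_cron`, the six-hour window and the search names are assumptions for this example, not part of Splunk. It generates standard five-field cron expressions that you could then enter as each search's cron schedule.

```python
def staggered_cron(index: int, total: int, start_hour: int = 0, window_hours: int = 6) -> str:
    """Return a daily cron expression for search number `index` (0-based) of `total`,
    spread evenly across `window_hours` beginning at `start_hour`."""
    window_minutes = window_hours * 60
    # Evenly spaced offset, in minutes, from the start of the window.
    offset = (index * window_minutes) // total
    hour = (start_hour + offset // 60) % 24
    minute = offset % 60
    # Cron field order: minute hour day-of-month month day-of-week.
    return f"{minute} {hour} * * *"


# Hypothetical search names, used only to show the spread.
searches = ["daily_errors", "license_usage", "login_report", "disk_capacity"]
schedules = {name: staggered_cron(i, len(searches)) for i, name in enumerate(searches)}

for name, cron in schedules.items():
    print(f"{name}: {cron}")
```

With four searches and a six-hour window, the start times land 90 minutes apart rather than all at 0:00.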
Unfortunately, managing your scheduled searches is still overwhelmingly a manual process, but here are a few ideas that can help. Download our list of reference searches for more tips on managing your Splunk scheduled searches.
The most obvious feature Splunk introduced in the past few years to help with scheduled searches is the search schedule window. When scheduling a search, you’ll find an option called ‘Schedule Window.’ Next to the label is a dropdown that lists a variety of time ranges from 5 minutes to 8 hours, plus custom and none.
The schedule window defaults to none, but many searches, especially reports, can execute over a wide range of times. The schedule window tells Splunk that a search can run at any time during the specified range, starting at the scheduled search time. For example, a search scheduled to run at 1am with a 1-hour window will run sometime between 1am and 2am, depending on the Splunk workload.
We suggest you encourage your users to always set as large a schedule window as possible when scheduling a search.
If the time range of a scheduled search isn’t aligned properly with the time the search is scheduled to execute, the search may miss some of your data. This is due to propagation delay.
Propagation delay, or the time required for an event to be indexed in Splunk, is inherently greater than zero. The event must be written to a log, and a forwarder must read the event and send it to an indexer. The indexer processes and writes the event to a bucket. All of this takes time. This means that an event that occurs at 1:00:00 am may not be indexed and available for searching until 1:00:02 am.
In well-performing systems the propagation delay is often milliseconds, but in poorly performing environments it may be minutes or longer. Causes range from overworked indexers to network latency to forwarder problems.
This means a search with an end time of “now” may only pick up events with timestamps up to a few seconds or, in the worst cases, minutes before “now.” Anything more recent may not be indexed yet.
If you schedule a search to execute every 10 minutes with a time range of -10m@m to now, you may not see all events that are close in time to “now.” When the search runs again in 10 minutes, you will only see events that have a timestamp of -10m@m to now. Since some of your delayed data has a timestamp before -10m@m, your searches will never return all your events.
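To make the gap concrete, here is a small Python model of the situation, not Splunk itself. The `Event` type, the `run_search` helper, the sample timestamps and the 60-second lag are all assumptions chosen for illustration. An event is returned only if its timestamp falls inside the search's time range and it has already been indexed when the search runs.

```python
from typing import NamedTuple


class Event(NamedTuple):
    timestamp: int   # seconds: when the event occurred
    indexed_at: int  # seconds: when it became searchable


def run_search(events, run_time, earliest_ago, latest_ago=0):
    """Return the events a search executed at `run_time` would see."""
    earliest = run_time - earliest_ago
    latest = run_time - latest_ago
    return [
        e for e in events
        if earliest <= e.timestamp < latest and e.indexed_at <= run_time
    ]


# The middle event occurs at 598 s but is not indexed until 603 s.
events = [Event(100, 101), Event(598, 603), Event(700, 701)]

# Every 10 minutes with a range equivalent to "-10m@m to now".
seen_now = []
for run_time in (600, 1200):
    seen_now += run_search(events, run_time, earliest_ago=600)

# Same schedule, but the range lags by 60 s (roughly "-11m@m to -1m@m").
seen_lagged = []
for run_time in (600, 1200, 1800):
    seen_lagged += run_search(events, run_time, earliest_ago=660, latest_ago=60)

print("range ending at now:", seen_now)
print("range lagging 60 s: ", seen_lagged)
```

In this model the run at 600 s misses the delayed event because it is not indexed yet, and the run at 1200 s misses it because its timestamp is now outside the range. Ending the range a little before “now” gives delayed events time to arrive, provided the lag is longer than your worst propagation delay.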
The second problem occurs when the time range is relative, such as using -24h@h to now with a search set to run at midnight. Presumably the user intended to run the search for yesterday. Depending on when this search executes, the search will most likely not return results from exactly midnight to midnight. Even with no search schedule window, the search may queue or be delayed, causing the search time to be skewed by seconds or minutes. Additionally, if the search was originally scheduled to run at midnight, but then changed to run at 2am, the relative time range will shift by two hours.
To resolve both problems, we recommend setting your search to a time range that snaps to fixed boundaries, such as -1d@d to @d. As long as the search runs before the next period begins, and late enough after the period ends to allow for propagation delay, the search range will be accurate and correct.
Here are a few examples:
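The sketch below models the two time ranges in Python rather than in Splunk, so you can see how each one resolves for different execution times. The helper names (`snap_to_day`, `snap_to_hour`, `yesterday_range`, `last_24h_range`) and the sample dates are assumptions for illustration, and the model uses naive datetimes, so it ignores time zones and daylight saving changes.

```python
from datetime import datetime, timedelta


def snap_to_day(t: datetime) -> datetime:
    """Model of the @d snap: round down to midnight."""
    return t.replace(hour=0, minute=0, second=0, microsecond=0)


def snap_to_hour(t: datetime) -> datetime:
    """Model of the @h snap: round down to the top of the hour."""
    return t.replace(minute=0, second=0, microsecond=0)


def yesterday_range(run_time: datetime):
    """Model of '-1d@d to @d': always the previous full calendar day."""
    return snap_to_day(run_time - timedelta(days=1)), snap_to_day(run_time)


def last_24h_range(run_time: datetime):
    """Model of '-24h@h to now': depends on exactly when the search runs."""
    return snap_to_hour(run_time - timedelta(hours=24)), run_time


# Three possible execution times: on time, slightly delayed, rescheduled to 2am.
run_times = [
    datetime(2024, 3, 2, 0, 0, 0),
    datetime(2024, 3, 2, 0, 0, 7),
    datetime(2024, 3, 2, 2, 0, 0),
]

for t in run_times:
    print("run at", t)
    print("  -1d@d to @d    ->", yesterday_range(t))
    print("  -24h@h to now  ->", last_24h_range(t))
```

For all three execution times the snapped range covers exactly the previous day, midnight to midnight, while the relative range drifts with the execution time.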
If your users are creating too many scheduled searches, we recommend revoking the capability to schedule searches. Remove the ‘schedule_search’ and ‘schedule_rtsearch’ capabilities from the appropriate roles.
If you’ve read our maintaining Splunk over the long-haul article on Users and Roles, you’ll know that we recommend creating a base role for each user segment (we called it a Logical Group of Users). We also recommend overlaying a role with additional privileges reserved for group leaders or managers. If you follow this model, one suggestion is to give power users the capability to schedule searches while removing the capability from the base roles.
Another is to force users to submit a scheduled search for review and only allow Splunk admins to schedule the searches on the user’s behalf. While this creates a burden on the Splunk administrators, it will ensure scheduled searches don’t overload your system.
In our experience, Splunk scheduled searches are one of the most difficult aspects of Splunk to manage. We hope that the concepts we outlined in this article provide a basis to educate your user base, as well as a foundation to ensure scheduled searches won’t overwhelm your Splunk environment.