Troubleshooting SAP HANA High CPU Utilisation
High CPU Utilisation
While using HANA (running reports, executing queries, and so on) you may see an alert in HANA Studio reporting that the system has consumed its CPU resources and has reached full utilisation or hangs.
Before performing any traces, check whether Transparent HugePages (THP) is enabled on your system. THP should be disabled across your landscape until SAP recommends activating it again. Please see the relevant SAP Notes on Transparent Huge Pages:
SAP Note 1944799 – SAP HANA Guidelines for SLES Operating System Installation
SAP Note 1824819 – SAP HANA DB: Recommended OS settings for SLES 11 / SLES for SAP Applications 11 SP2
SAP Note 2131662 – Transparent Huge Pages (THP) on SAP HANA Servers
SAP Note 1954788 – SAP HANA DB: Recommended OS settings for SLES 11 / SLES for SAP Applications 11 SP3
THP activity can also be checked in the runtime dumps by searching for “AnonHugePages”. While checking THP, it is also recommended to check:
SwapTotal = ??
SwapFree = ??
This will tell you whether there is a reasonable amount of memory in the system.
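The three values mentioned above can be pulled out of a dump in one pass. The following is a minimal sketch, assuming the text uses the Linux /proc/meminfo layout (`Name:   value kB`); the function names and the sample figures are illustrative, not taken from a real system.

```python
def parse_meminfo(text):
    """Parse '/proc/meminfo'-style lines ('Name:   12345 kB') into a dict of integers."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only lines of the form 'Name: <number> ...'; skip everything else.
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def memory_summary(text):
    """Summarise the THP and swap figures discussed above (all sizes in kB)."""
    info = parse_meminfo(text)
    swap_total = info.get("SwapTotal", 0)
    swap_free = info.get("SwapFree", 0)
    return {
        "thp_in_use": info.get("AnonHugePages", 0) > 0,
        "swap_total_kb": swap_total,
        "swap_used_kb": swap_total - swap_free,
    }


# Illustrative values only (assumption), not from a real HANA host.
SAMPLE = """\
MemTotal:       528000000 kB
AnonHugePages:    2048000 kB
SwapTotal:       33554428 kB
SwapFree:        33000000 kB
"""

print(memory_summary(SAMPLE))
```

A non-zero AnonHugePages figure suggests THP is actively being used, and a large gap between SwapTotal and SwapFree suggests the host has been swapping.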
Next, check the global allocation limit (GAL): search for “IPM”, note the limit, and make sure it is not lower than what the process or thread in question is trying to allocate.
Usually it is evident what caused the high CPU usage. In many cases it is caused by the execution of large queries or by running reports from HANA Studio on models.
To capture the high CPU usage we can use a Kernel Profiler Trace. To be able to use the kernel profiler, you must have the SAP_INTERNAL_HANA_SUPPORT role. This role is intended only for SAP HANA development support.
The kernel profiler collects, for example, information about frequent and/or expensive execution paths during query processing. It is recommended that you start kernel profiler tracing immediately before you execute the statements you want to analyze and stop it immediately after they have finished. This avoids recording irrelevant statements, and it limits the performance impact that this kind of tracing can have.
When you stop tracing, the results are saved to trace files that you can access on the Diagnosis Files tab of the Administration editor.
You cannot analyze these files meaningfully in the SAP HANA studio, but instead must use a tool capable of reading the configured output format, that is KCacheGrind or DOT (default format).
You activate and configure the kernel profiler in the Administration editor on the Trace Configuration tab.
Please be aware that you will also need to execute 2-3 runtime dumps. The Kernel Profiler Trace results are read in conjunction with the runtime dumps to pick out the relevant stacks and thread numbers.
**Please execute the runtime dumps first; once the RTE dumps have finished, activate the kernel profiler trace. We do this because we do not want the RTE dumps to record the kernel tracing and confuse the reading.**
For full information on Kernel Profiler Traces, please see SAP Note 1804811 or follow the steps below:
Connect to your HANA database server as user <sid>adm (for example via PuTTY) and start HDBCONS by typing the command “hdbcons”.
To do a Kernel Profiler Trace of your query, please follow these steps:
1. “profiler clear” – Resets all information to a clear state
2. “profiler start” – Starts collecting information.
3. Execute the affected query.
4. “profiler stop” – Stops collecting information.
5. “profiler print -o /path/on/disk/cpu.dot;/path/on/disk/wait.dot” – writes the collected information into two dot files which can be sent to SAP.
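The five steps above can be written down as a repeatable plan. This is only a sketch: the helper names are hypothetical, and it assumes that hdbcons accepts a command as a single argument (`hdbcons "<command>"`) instead of being used interactively, which you should confirm on your revision before relying on it.

```python
def profiler_command_plan(cpu_dot, wait_dot):
    """Return the hdbcons profiler commands in the order described above.

    None marks the point where the affected query is executed by hand.
    """
    return [
        "profiler clear",
        "profiler start",
        None,  # execute the affected query here
        "profiler stop",
        f"profiler print -o {cpu_dot};{wait_dot}",
    ]


def as_argv(plan):
    """Turn each command into an argv list (assumes 'hdbcons "<command>"' is accepted)."""
    return [["hdbcons", cmd] for cmd in plan if cmd is not None]


plan = profiler_command_plan("/path/on/disk/cpu.dot", "/path/on/disk/wait.dot")
for argv in as_argv(plan):
    print(argv)
```

Keeping each command as one list element means the `;` between the two output paths is passed to hdbcons unchanged and is never interpreted by a shell, for example if the lists are later handed to `subprocess.run`.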
Once this is done you will have two dot files, cpu.dot and wait.dot.
To read these .dot files you will need GVEdit, the graph viewer that ships with Graphviz.
The wait.dot file can be used to analyse a situation where a process is running very slowly for no apparent reason. In such cases, a wait graph can help to identify whether the process is waiting for an IndexHandle, I/O, a savepoint lock, etc. (If you are using this for a hang situation and want a proper timeline, I suggest you look at the performance load graph for a flat line, i.e. a period where nothing was recorded.)
Once the Graphviz tool is open, load the cpu.dot file: File > Open > select the dot file > Open.
The graph might already be open without being visible because of the zoom level. Use the horizontal and vertical scroll bars to find it.
From here on, what you look for depends on the issue you are investigating.
Normally you will be looking for the process or step with the highest value for “E”, where “E” means Exclusive.
There is also “I”, where “I” means Inclusive.
The Exclusive value is of more interest, because it belongs to that particular process or step alone and so indicates whether more memory/CPU is used in that step. In this example we can see that __memcmp_sse4_1 has I = 16.399% and E = 16.399%. By following the RED colouring we can see where most of the utilisation is happening and trace the activity, which leads to the stack in the runtime dump. That stack also carries the thread number we are looking for.
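On a large graph it can help to rank the nodes by their Exclusive value before following the red path by eye. The sketch below assumes each node label has the form `<function>\nI=<pct>% E=<pct>%`; that layout and the two "hypothetical_" function names are assumptions for illustration, so adjust the pattern to what your cpu.dot actually contains.

```python
import re

# Assumed label layout (hypothetical): label="<function>\nI=<pct>% E=<pct>%"
NODE = re.compile(
    r'label="(?P<name>[^"\\]+)\\n\s*'
    r'I\s*=\s*(?P<inc>[\d.]+)%\s*E\s*=\s*(?P<exc>[\d.]+)%'
)


def rank_by_exclusive(dot_text):
    """Return (function, inclusive %, exclusive %) tuples, highest exclusive first."""
    rows = [
        (m.group("name"), float(m.group("inc")), float(m.group("exc")))
        for m in NODE.finditer(dot_text)
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Illustrative graph; only the __memcmp_sse4_1 figures come from the example above.
SAMPLE_DOT = r'''
digraph cpu {
  n1 [label="hypothetical_caller\nI=42.100% E=0.500%"];
  n2 [label="__memcmp_sse4_1\nI=16.399% E=16.399%"];
  n3 [label="hypothetical_scan\nI=30.000% E=9.250%"];
}
'''

for name, inclusive, exclusive in rank_by_exclusive(SAMPLE_DOT):
    print(f"{exclusive:7.3f}%  {inclusive:7.3f}%  {name}")
```

A node whose Inclusive and Exclusive values are equal, as with __memcmp_sse4_1 here, spends all of its time in its own code and none in anything it calls.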
By viewing cpu.dot you have now traced the RED trail to the step with the highest Exclusive value. This is the point at which you open the RTE (runtime dump). Working from the bottom of the graph upwards, we get an idea of what the stack will look like in the runtime dump.
By comparing the RED path, you can see that the path matches exactly with this Stack from the Runtime dump. This stack also has the Thread number at the top of the stack.
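The comparison between the red path and a runtime dump stack can also be done mechanically. This is a sketch under two assumptions: the red path has been written down from the top of the graph to the hottest node at the bottom, and the runtime dump lists the innermost frame first. All function names below are hypothetical.

```python
def to_dump_order(graph_path):
    """Reverse a graph path (top node first) into runtime dump order (innermost frame first)."""
    return list(reversed(graph_path))


def find_in_stack(dump_stack, frames):
    """Return the index where 'frames' occurs as a contiguous run in 'dump_stack', or -1."""
    n = len(frames)
    if n == 0:
        return -1
    for start in range(len(dump_stack) - n + 1):
        if dump_stack[start:start + n] == frames:
            return start
    return -1


# Hypothetical red path, read from the top of the graph down to the hot node.
graph_path = ["hypothetical_execute", "hypothetical_search", "hot_leaf_function"]

# Hypothetical stack as it might appear in a runtime dump.
dump_stack = [
    "hot_leaf_function",
    "hypothetical_search",
    "hypothetical_execute",
    "hypothetical_thread_main",
]

print(find_in_stack(dump_stack, to_dump_order(graph_path)))
```

A result of -1 means the path does not occur in that stack, so the next stack in the dump should be checked.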
You have now found the number of the thread in which this query was executed. By searching for this thread number in the runtime dump, we can find the parent of this thread and the children related to that parent. The thread number can then be linked back to the query within the runtime dumps, which gives you the exact query and also the USER that executed it.
A second example of using the Kernel Profiler Trace
Let's say, for example, that you have upgraded to EHP8 and after the upgrade you are seeing high CPU usage on the HANA server, which is resulting in poor performance. What should you do?
Run the following once again:
1: Kernel Profiler Trace (2-3 mins).
Once both have been run (separately), open the cpu.dot file and follow the RED lineage of high Inclusive size.
Now that we have reached the end of the Kernel Profiler trace we can see the end of the stack that we need to search for.
The next step is to open xSearch to search SAP Notes and KBAs.
What we will search for is the end line of the stack. Please remember that a Kernel Profiler trace reads bottom to top. If you search for this stack in the runtime dumps, it will read top to bottom.
Open xSearch and search for the bottom line of the Kernel Profiler graph, which here is sse_icc_lib::mgetSearchi_SSE4impl.
Searching for this clearly brings up SAP Note 2321573 (High CPU Utilization Caused by Multiple Threads in sse_icc_lib::mgetSearchi_SSE4impl).
This note specifies that a secondary index needs to be created for the table on which the query is trying to INSERT, UPDATE, etc. To find the table, go to the runtime dump and find the query by searching for sse_icc_lib::mgetSearchi_SSE4impl. This brings you to the stack. The stack contains a thread number, and searching for that number brings you to the query and the table.
What if I do not have a runtime dump or a Kernel Profiler Trace? Is it possible to find an RCA then?
The answer is yes, but you must have the retention time of your statistics server tables set correctly. For more information on the statistics server tables, please see SAP Note 2147247.
I have dealt with cases in the past where a HANA system was hanging or experienced high CPU usage, but no runtime dumps or kernel profiler trace were taken. What should you do?
First, you need to do and check the following:
1: Check that you have the right historical data in the statistics server so that you can check the threads. If the retention time is not set correctly, there will be no historical data for you to query, and it will not be possible to get an RCA. A quick check you can do is to run the Mini Checks script from SAP Note 1969700.
2: You need to know the exact date and time of the hang. Without this it is very hard to narrow down the correct timeframe.
3: Go to SAP Note 1969700 and download the latest version of the SQL Scripts.
The SQL scripts include many useful scripts to help you analyze issues. The one we will focus on is HANA_Threads_ThreadSamples_FilterAndAggregation_1.00.100+ (please use the script suited to your HANA revision).
With this script we can filter on many levels: time and date, host, thread ID, thread state, etc. By viewing the thread states at the time of the issue, you get a good picture of what the threads were doing and which statement hash was at the top. The script reads its data from the thread sample history kept by the statistics server.
You can now see why it is very important to set the retention time to the recommended settings. If you would like to see these recommended settings, you can also run the statement HANA_Configuration_Parameters_Values_1.00.90+
If a hang is in progress, you can also query the current threads with HANA_Threads_CurrentThreads_1.00.120+
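To make the idea of "filter, then aggregate" concrete, here is a small sketch of the kind of grouping the thread sample script performs, run on in-memory rows. It does not reproduce the SAP script. The field names (`timestamp`, `thread_state`, `statement_hash`) and the sample rows are assumptions chosen for illustration.

```python
from collections import Counter
from datetime import datetime


def top_states_and_hashes(samples, start, end):
    """Count thread states and statement hashes for samples taken within [start, end]."""
    window = [s for s in samples if start <= s["timestamp"] <= end]
    states = Counter(s["thread_state"] for s in window)
    # Samples without a statement hash (empty string) are left out of the hash ranking.
    hashes = Counter(s["statement_hash"] for s in window if s["statement_hash"])
    return states.most_common(), hashes.most_common()


def row(time_text, state, statement_hash):
    """Build one illustrative sample row (hypothetical layout)."""
    return {
        "timestamp": datetime.fromisoformat(f"2016-01-01T{time_text}"),
        "thread_state": state,
        "statement_hash": statement_hash,
    }


SAMPLES = [
    row("10:00:00", "Running", "a1"),
    row("10:00:01", "Running", "a1"),
    row("10:00:02", "Semaphore Wait", "b2"),
    row("10:00:03", "Running", "a1"),
    row("10:00:04", "Job Exec Waiting", ""),
    row("10:30:00", "Running", "c3"),  # outside the window used below
]

states, hashes = top_states_and_hashes(
    SAMPLES,
    datetime.fromisoformat("2016-01-01T10:00:00"),
    datetime.fromisoformat("2016-01-01T10:05:00"),
)
print(states[0], hashes[0])
```

Narrowing the window to the exact time of the hang is what makes the top state and top statement hash meaningful, which is why knowing the date and time of the incident matters so much.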
For more information or queries on HANA CPU please visit Note 2100040 – FAQ: SAP HANA CPU
I hope you find this instructive,