You didn’t write it, you can’t modify it, but you’ve been asked to support it. How do you deal with this type of challenge? I love a good performance challenge, and the constraints on this one make it especially fun to solve.
I’ve been in many consulting scenarios where a client implemented a software package but experienced performance issues because their data volumes were much higher than anything the software vendor had experience with. This normally leads to support tickets being opened with the vendor; however, the vendor’s support team is flying blind due to a lack of visibility into the customer’s environment, which makes troubleshooting from the vendor’s perspective very difficult. It is difficult from the customer’s perspective as well, since the customer did not write the software.
Most of the time the vendor has no visibility into their customers’ databases, so they can’t see what is happening when performance issues occur. What’s worse, many customers don’t use performance monitoring tools, so the customer doesn’t have visibility into what’s happening in the database either.
As a DBA what do you do?
The first course of action is to do your best to isolate the problem. If you are able to isolate the poorly performing query, you’ve gone a long way toward getting help from the vendor. How do you isolate the poorly performing query?
If you are getting timeout errors in the application, check the application logs. This is probably beyond the realm of what you do as a DBA, so you will likely need the team that supports the application to look into any logs and messages associated with the error. If you are lucky, the software vendor will have made it easy for you by logging the failing query in the application’s logs. If not, the stack trace in the log (if it exists) should let the vendor know what query or stored procedure generated the timeout.
The next valuable piece of information to provide the vendor is the query’s execution plan. There is a catch-22 here. The best way to retrieve the execution plan is from the plan cache, since this is an accurate representation of the plan the query used when it executed from the application. Unfortunately, if a query never completes because it times out due to a bad execution plan, the plan never gets stored in the plan cache. As a result, you can’t get the “real” problematic plan unless a long execution of the query is currently running. In this case you might need to get an estimated or actual plan via SSMS.
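If you do end up needing a plan outside of the cache, one option is to ask the engine for an estimated plan with SHOWPLAN_XML, which compiles the statement but does not execute it. A minimal sketch, assuming you run it from SSMS; the procedure and parameter names here are placeholders, not real vendor objects:

```sql
-- Return the estimated execution plan as XML without executing the statement.
SET SHOWPLAN_XML ON;
GO
-- Hypothetical vendor procedure and parameter, for illustration only.
EXEC dbo.SomeVendorProcedure @SomeParam = 42;
GO
SET SHOWPLAN_XML OFF;
GO
```

Because the statement never runs, this is safe to use even when the real execution would take hours, but keep in mind an estimated plan can differ from the actual plan the application got.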
Here are a couple of techniques I find useful for digging out the poorly performing query and its plan.
Find it via sys.dm_exec_procedure_stats if it is in a stored procedure
This is a two-step process (we do it in two steps so as not to impact the system by generating a long-running query while searching the plan cache). If the procedure has not aged out of the plan cache, we should be able to get the plan with this technique.
First we must get a sql_handle for the long-executing procedure. To find it, run the following query, replacing the LIKE clause with the name of the stored procedure you are looking for:
select st.text,
ps.total_elapsed_time/ps.execution_count as avg_elapsed_time,
ps.sql_handle,
ps.database_id,
db_name(ps.database_id) as database_name
from sys.dm_exec_procedure_stats ps
cross apply sys.dm_exec_sql_text(ps.sql_handle) st
where st.text like '%CreateExecQueryWaitStatsDuration%'
order by 2 desc
The output of this is as follows. Note that in this scenario multiple rows are returned, and you’ll need to judge which row is the one you are looking for. In this case we had multiple rows because this was a SQL instance where the same procedure was created in different databases; you can tell by looking at the database_name column.
Something else to note: avg_elapsed_time is the average time this procedure takes to execute, in microseconds, and the query is ordered by longest avg_elapsed_time. To get the execution plan, copy the sql_handle of the row you are interested in and run the following query (substituting that sql_handle):
select convert(xml, qp.query_plan),
qs.total_elapsed_time/qs.execution_count as avg_elapsed_time,
SUBSTRING(st.text, (qs.statement_start_offset/2)+1,
((CASE qs.statement_end_offset
WHEN -1 THEN DATALENGTH(st.text)
ELSE qs.statement_end_offset
END - qs.statement_start_offset)/2) + 1) AS statement_text
from
sys.dm_exec_query_stats qs
cross apply sys.dm_exec_sql_text(qs.sql_handle) st
cross apply sys.dm_exec_text_query_plan(qs.plan_handle,
qs.statement_start_offset,
qs.statement_end_offset) qp
where qs.sql_handle = 0x03000800399F0055C5C45100F2A7000001000000000000000000000000000000000000000000000000000000
order by 2 desc
The output of this result is as follows:
You’ll notice this one sql_handle returned multiple rows. The reason is that the procedure contains multiple individual statements. You’ll see each individual query in the statement_text column, and each statement has its own execution plan broken out into the plan column. The output is ordered by average elapsed time (avg_elapsed_time), in microseconds. You can download an execution plan by clicking on the plan’s link.
Find it via sys.dm_exec_query_stats if it is a standalone statement or batch
This will also find queries that are part of a stored procedure; however, searching the plan cache is faster if you look for stored procedures in sys.dm_exec_procedure_stats. This approach is similar to the previous one, but there will be many more queries to sort through when searching the plan cache. You will also need to craft the LIKE clause with something fairly unique to the query you are searching for.
Once again, we do this as a two-part process to avoid the overhead of pulling a large number of plans from the cache. First we run a query to get the plan_handle, statement_start_offset, and statement_end_offset (replace the LIKE clause with the query text you’re searching for):
select
pa.value as database_id,
db_name(convert(smallint, pa.value)) as database_name,
qs.total_elapsed_time,
qs.total_elapsed_time/qs.execution_count as avg_elapsed_time,
SUBSTRING(st.text, (qs.statement_start_offset/2)+1,
((CASE qs.statement_end_offset
WHEN -1 THEN DATALENGTH(st.text)
ELSE qs.statement_end_offset
END - qs.statement_start_offset)/2) + 1) AS statement_text,
qs.plan_handle,
qs.statement_start_offset,
qs.statement_end_offset
from sys.dm_exec_query_stats qs
cross apply sys.dm_exec_sql_text(qs.sql_handle) st
cross apply sys.dm_exec_plan_attributes(qs.plan_handle) pa
where st.text like '%select top 100000 a.QueryHash%' and pa.attribute = 'dbid'
order by qs.total_elapsed_time desc
This query is sorted by the longest combined elapsed time across all executions of the query; this time is in microseconds.
Select the query you want, then run the following with the selected plan_handle, statement_start_offset, and statement_end_offset to retrieve the plan from the plan cache:
select convert(xml, query_plan) from
sys.dm_exec_text_query_plan(0x06001700820F5A1850DE425A4400000001000000000000000000000000000000000000000000000000000000,
0,2650)
Find it in sys.dm_exec_requests if it is executing at the moment
If the query is never completing, but you can see it as it is executing, you can try running this query:
select SUBSTRING(st.text, (er.statement_start_offset/2)+1,
((CASE er.statement_end_offset
WHEN -1 THEN DATALENGTH(st.text)
ELSE er.statement_end_offset
END - er.statement_start_offset)/2) + 1) AS statement_text,
convert(xml, qp.query_plan) as execution_plan,
er.start_time,
datediff(ms, er.start_time, getdate()) as observed_elapsed_time,
er.total_elapsed_time,
er.cpu_time,
er.wait_type,
er.wait_time,
er.blocking_session_id,
er.sql_handle, er.plan_handle,
er.statement_start_offset,
er.statement_end_offset
from sys.dm_exec_requests er
inner join sys.dm_exec_sessions s on er.session_id = s.session_id
cross apply sys.dm_exec_sql_text(er.sql_handle) st
cross apply sys.dm_exec_text_query_plan(er.plan_handle,
er.statement_start_offset,
er.statement_end_offset) qp
where er.session_id != @@spid
and s.is_user_process = 1
order by 3
If you know the query that is running long, you should be able to recognize the name of it when it appears in the statement_text column. Clicking the link to the execution_plan should download the plan. This query retrieves some additional columns that might be of interest.
start_time: This is the time the request started executing.
observed_elapsed_time: This is the difference between now (i.e. a call to getdate()) and the start_time. This time is in milliseconds.
cpu_time: This is how much cpu_time has been consumed by this query. If there is a large discrepancy between total_elapsed_time and cpu_time it might be worth making note of the values that appear in the wait_type column. This time is in milliseconds.
wait_type: If the query is in a wait (i.e. suspended status), this will show what the query is waiting on.
In our example above, you’ll notice nearly every session is in a PAGEIOLATCH_* or WRITELOG wait. Both of these wait types indicate the sessions are waiting on physical IO, which in some cases can point to a slow IO subsystem.
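When the per-request view isn’t enough, the same wait picture can be sampled instance-wide from sys.dm_os_wait_stats. A sketch; the benign waits excluded in the WHERE clause are a partial, illustrative list, not an exhaustive one:

```sql
-- Top cumulative waits since the last instance restart (or wait-stats clear).
select top 10
    wait_type,
    wait_time_ms,
    waiting_tasks_count,
    -- Average wait per waiting task; nullif avoids divide-by-zero.
    wait_time_ms / nullif(waiting_tasks_count, 0) as avg_wait_ms
from sys.dm_os_wait_stats
where wait_type not in ('SLEEP_TASK', 'LAZYWRITER_SLEEP',
    'SQLTRACE_BUFFER_FLUSH', 'BROKER_TASK_STOP', 'WAITFOR')  -- partial benign-wait list
order by wait_time_ms desc;
```

Because these counters are cumulative, comparing two samples taken a few minutes apart tells you far more about the current problem than a single snapshot.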
Okay, I know the problem query, but the vendor won’t help, now what?
You have certain levers you can pull to tune poorly performing queries. A query generally performs poorly for one of two reasons: a bad execution plan, or wait events.
A query can get stuck in certain wait events as a result of a bad execution plan; however, more often than not a bad execution plan will surface as high CPU usage on a query.
What are some of the levers we have to resolve bad execution plans?
- Indexing
- Statistics
- Fragmentation
- Plan Guides
Indexing: Will the vendor let you create indexes? I’ve found that the more established a software product is, the less likely the vendor is to allow you to create your own indexes. I actually agree with this (to some extent), given that missing indexes in a mature software package have usually been ferreted out. On occasion I’ve seen adding indexes cause other areas of the system to perform poorly, and letting customers add their own indexes can prove to be a headache since it opens a Pandora’s box for future upgrades. That being said, I’ve also run into scenarios where an index was clearly missing; most vendors will deliver a “patch” index for this, but for the most part software vendors tend not to allow customer-created indexes.
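If the vendor does agree to consider an index, the missing-index DMVs can supply candidates to discuss. A sketch; treat the output as a starting point to validate against the actual workload, not a prescription:

```sql
-- Candidate missing indexes, ranked by a rough estimated-impact score.
select top 10
    db_name(mid.database_id) as database_name,
    mid.statement as table_name,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns,
    migs.user_seeks,
    migs.avg_user_impact,
    -- Common heuristic: seeks * cost * expected % improvement.
    migs.user_seeks * migs.avg_total_user_cost
        * (migs.avg_user_impact / 100.0) as impact_score
from sys.dm_db_missing_index_details mid
join sys.dm_db_missing_index_groups mig
    on mid.index_handle = mig.index_handle
join sys.dm_db_missing_index_group_stats migs
    on mig.index_group_handle = migs.group_handle
order by impact_score desc;
```

These DMVs reset on instance restart and never suggest clustered or filtered indexes, so they are evidence for a conversation with the vendor rather than a definitive answer.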
Statistics: Missing statistics will likely cause bad execution plans, and so will outdated statistics. Be sure a maintenance plan is in place, and if not, find out why. Most likely it is an oversight, but it’s always good to ask before adding your own. Sampled statistics can also lead to bad execution plans; on occasion I’ve seen queries pick up bad plans unless statistics were sampled at 100%.
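One way to check for stale statistics on a suspect table is sys.dm_db_stats_properties (available from SQL 2008 R2 SP2 onward). A sketch; the table name below is a placeholder for whichever object appears in the bad plan:

```sql
-- Last update time and modifications since update, per statistic on a table.
select s.name as stats_name,
    sp.last_updated,
    sp.rows,
    sp.rows_sampled,           -- compare to rows to see the sampling rate
    sp.modification_counter    -- rows modified since the stats were built
from sys.stats s
cross apply sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
where s.object_id = object_id('dbo.SomeVendorTable')  -- hypothetical table
order by sp.modification_counter desc;
```

If the statistics turn out to be stale or under-sampled, `UPDATE STATISTICS dbo.SomeVendorTable WITH FULLSCAN;` refreshes them at a 100% sample, which is usually an uncontroversial change to clear with the vendor.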
Fragmentation: Fragmentation can cause bad execution plans. Retrieve the execution plan and look at fragmentation on all objects within it. Do any indexes need to be reorganized or rebuilt? Once again, check whether a maintenance plan is in place and whether it runs often enough to keep up with the churn in the database.
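Fragmentation can be measured with sys.dm_db_index_physical_stats. A sketch for the current database, using LIMITED mode to keep the scan itself cheap; the 1,000-page floor is an arbitrary threshold to filter out indexes too small to matter:

```sql
-- Fragmentation for all indexes in the current database above a size threshold.
select object_name(ips.object_id) as table_name,
    i.name as index_name,
    ips.avg_fragmentation_in_percent,
    ips.page_count
from sys.dm_db_index_physical_stats(db_id(), null, null, null, 'LIMITED') ips
join sys.indexes i
    on ips.object_id = i.object_id
    and ips.index_id = i.index_id
where ips.page_count > 1000  -- ignore tiny indexes; threshold is illustrative
order by ips.avg_fragmentation_in_percent desc;
```

The commonly cited rule of thumb is to reorganize indexes between roughly 5% and 30% fragmentation and rebuild those above 30%, but the right thresholds depend on the workload.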
Plan guides: Plan guides allow you to add query hints, parameterize non-parameterized SQL, or force specific execution plans for queries. There are a decent number of gotchas and requirements involved in implementing them, but they are a powerful tool for stabilizing a system you didn’t write and can’t make code or query changes to. I have a presentation on these that I need to convert into a future blog post outlining the gotchas and details of creating plan guides.
Conclusion
I didn’t go into deep tuning details here; however, I provided some techniques I find to be a good starting point when fixing performance problems in software you didn’t write and can’t change. I also provided a few high-level options to research further if you find a vendor unwilling or unable to fix the problems you are experiencing. The techniques for pulling data all relied on DMVs. Some of the same information can be collected with Extended Events and SQL Trace, but I use SQL Trace only as a last resort, and I stick to DMVs to reach a broader (i.e. SQL 2008/R2) audience. All that being said, there is a lot of value in a good performance monitoring tool that collects query waits and execution plans when you need to diagnose these types of problems.