How to troubleshoot DISK/IO performance issues in SQL Server

It is always my favorite subject troubleshooting performance issues in SQL Server. I really enjoy learning new methods and approaches to troubleshoot the performance issues in SQL server. There is no hard rule or a single window approach for all issues, may be the best part in troubleshooting that I would love at most. However, I learned it is always a very tough job when it come at DISK or IO level issues.

Of late, I faced a curious case of IO performance issue in one of our performance test environment, showing high response time resulted to a delay in processing. As always, the analysis started with LIVE Monitoring queries to identify the requests that are running and for any blocks on the server.While analyzing, it is observed that the requests are in suspended mode with a wait type “PAGEIOLATCH_**”.

PAGEIOLATCH_** wait typically represents an issue with DISK/IO. However, there could be many other reason for which cause PAGEIOLATCH_** eg. non-optimal index, bad coding etc. So the first thing we wanted to make sure that there are no issues with code or design (index or procedure code) and no lock/blocks occurring due to the same.

As we could not find blocking/locking in our case, we further moved to analyze the issue at resource level. As a very first step, the performance counter has been enabled on the server to enable the analysis. By enabling the counters, we would be able to analyze the resource utilization at server level.

Please find sample templates(XML) to collect major performance counters for SQL Server & .NET.You may download the below templates and change the extension to XML and configure in perfmon tool(easy to configure with template, you may be able to add more counters to the list as per your requirement).
Perfmon_SQLServer_Template and Perfmon_dotNet_Template

The attached template will collect the information on various counter for Memory/processor/Disk etc. As we are finding the issues are associated with DISK, the scope of our analysis on the counter evaluation restricts at DISK level for this blog post, you may need to carry out the analysis for other resource levels as well.

What we need to look out?

DMV Analysis on Wait types (sys.dm_os_wait_stats)

— The wait type analysis for PAGEIOLATCH_**.

SELECT
    *
   ,wait_time_ms/waiting_tasks_count AS 'Avg Wait in ms'
FROM sys.dm_os_wait_stats 
WHERE waiting_tasks_count > 0 and wait_type in ('PAGEIOLATCH_SH','PAGEIOLATCH_EX')
ORDER BY wait_time_ms DESC

In our case, we identified there are lots of queries in suspended with PAGEIOLATCH_** wait type in our server and the [Avg Wait in ms] was above 40 ms. I personally would recommend to investigate the [Avg Wait in ms] with value more than 20 ms and above for any issues(it may not always indicate a DISK issue).

Perfmon counter analysis

— There are many counters related with DISK, out of which the below counters are my favorites to start the analysis with DISK.

Performance Threshold

Note:The above thresholds are not a general thumb rule, but it can be used as a reference.Read the comments in the picture above carefully to understand more about.

DMV Analysis on IO Stall (sys.dm_io_virtual_file_stats)

— IO stall read/write information gives us for every IO read/write latency for a period of time and it is a cumulative average.

select database_id, file_id
    ,io_stall_read_ms
    ,num_of_reads
    ,cast(io_stall_read_ms/(1.0+num_of_reads) as numeric(10,1)) as 'avg_read_stall_ms'
    ,io_stall_write_ms
    ,num_of_writes
    ,cast(io_stall_write_ms/(1.0+num_of_writes) as numeric(10,1)) as 'avg_write_stall_ms'
    ,io_stall_read_ms + io_stall_write_ms as io_stalls
    ,num_of_reads + num_of_writes as total_io
    ,cast((io_stall_read_ms+io_stall_write_ms)/(1.0+num_of_reads + num_of_writes) as numeric(10,1)) as 'avg_io_stall_ms'
from sys.dm_io_virtual_file_stats(null,null)
order by avg_io_stall_ms desc

It is observed that [avg_io_stall_ms] value is considerably higher with one of our drives(D).

To confirm, we moved all the data files from D drive to E drive and observed a huge differences in performance. You may find the data points as below.

PAGEIOLATCH_** wait type Information(Before Vs After)

PAGEIOLATCH_Comparison

IO Stall Information(Before Vs After)
DISK_IO_STALL_Compare

Hope, the above methods would help you to troubleshoot issues at disk level.However, these observations needs to be shared and verified with a storage engineer to confirm the issue with DISK/IO.

I would also like to request you to have your comments to share more approaches you may dealt in the past as part of our learning and sharing.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s