Here’s a little story. I got a call from our Production DBA and Project System Owner complaining that our new database server is acting weird the whole morning. They both say that the web application pages takes forever to load and when they checked the third party performance tools running in the server, its giving out high CPU and memory utilization.
I asked the Production DBA (she was newly hired then) “What other areas have you checked?”
“Nothing.”, she says, “System Owner asked me to restart services so that’s what i did.”
“OK”, i’m coming.” i said. When i came into the Operations Team room, there was my boss, the system owner/project manager and the production DBA all waiting for me,
I checked the server, its a super server! 4 CPU Physical Processors; 64G Memory and SAN storage. Log files and data files are on a separate LUN. A single TempDB file is on another LUN. The data files sizes aren’t that huge. There’s no way it could have used all that available memory. Its this hard to troubleshoot SQL Servers when somebody just restart the SQL services without checking parameters, but that’s another story.
Anyways, i noticed that the tempDB is just one single file, so i ran the following query to check IO wait stats;
select * from sys.dm_os_wait_stats where wait_type like 'PAGEIOLATCH%' order by wait_type asc
High IO wait stats! OK multiple CPU, high memory utilization, high IO wait stats, single tempDB file, hmm..
I’m thinking IO bottleneck. So I added more tempDB file for each physical CPU. I added 3 more tempDB files, all with initial size of 8MB. All of them not on auto grow. Then i explained to the production DBA why i’m doing what i’m doing. Spreading the tempDB files across 4 files as there were 4 physical CPU’s, the allocation will be evenly distributed among the 4 files. Since the allocation in SQL Server is done on proportional fill, it will be distributed evenly among the tempDB files that you have. This is also the reason why they all need to have the same initial size.
If you are going to apply this solution to your production servers, do it this way: Create a tempDB file for each of the core CPU that you have and set with the same initial size of say, 8MB. It is also better if you can create this files on another logical unit drive. This will help ease out your IO bottleneck issues. A good reference for this can be found here.
That incident happened 5 months ago when we were in the thick of migrating our database servers to SQL 2008 R2 from SQL 2000. So far so good.
At the end of that day, CPU and memory utilization were manageable, application system running like a well oiled machine, Production DBA learned something new and I am a rock star!