Thursday, July 19, 2012

Microsoft SQL 2008 - SQL agent log bugging with ODBC error - Could not open connection to SQL server.


I have received one of my customer problem and trouble shoot it till closure hence good to share here with step by step I performed.


Issue:
Microsoft SQL 2008 being used in Clustered with Mixed mode authentication and did modifications like, 
Order of connection, TCP id first than Named pipes
Tried by putting Alias name as a Cluster name and then restart agent service but didn't help.


Troubleshoot Step by step:


http://msdn.microsoft.com/en-us/library/ms175176.aspx   - By default, the SQL Server Agent service connects to an instance of SQL Server over named pipes by using dynamic server names that require no additional client configuration.

Please check if it connects the database with SQL Server ODBC drivers. You can start ODBC connection wizard from control panel, select new & then SQL server. At the end, it will show option of ‘Validate Test’ where you can check if it connects to SQL with ODBC drivers by looking test get success or failed.

However in my case, validation test was successful


So proceed further with next step.

There are a couple of things which we can check, that might be going on this case… (All of the following configurations are made on the computer running your SQL Server 2008 instance)
Allow remote connections to this server
The first thing you want to check is if Remote Connections are enabled on your SQL Server database.
To do this, open SQL Server 2008 Management Studio, connect to the server in question, right click the server…and open the Server Properties.





Navigate to Connections and ensure that Allow remote connections to this server is checked. Check if this solves the problem. If it does, here you go, continue with whatever you were doing and have a nice day.
Protocols for MSSQLServer
If you’re still running in issues let’s dig a bit deeper. The next good thing to check is the SQL Server Network Configuration. Open the SQL Server Configuration Manager, unfold the node SQL Server Network Configuration and select Protocols for MSSQLServer (or whatever the name of your SQL Server instance is).




Make sure that TCP/IP is enabled and try again. Even though I hope that this resolved your problems there might still be an issue with…
The Firewall
If there is still no communication happening between your computer and the remote SQL Server you most likely need to configure your firewall settings. A good first step is to figure out which port is being used by TCP/IP (and which you need to open in your firewall). You can do this by right clicking TCP/IP and selecting Properties.




Click on the tab IP Addresses – Port 1433 it is :-) That was easy enough and all there is left to do is to allow inbound TCP/IP traffic on Port 1433 in your firewall. In Windows 7 this works something like this. Open the Control Panel and navigate to Windows Firewall.




Click on Advanced Settings on the left hand side and you should see the Windows Firewall with Advanced Security. Select the Inbound Rules on the left hand side and click on New Rule… on the right hand side.




This opens the New Inbound Rule Wizard which you can use to allow inbound traffic on Port 1433 for TCP/IP (and which is exactly how you configured your SQL Server in the steps above). Just follow the steps outlined below.)












By performing above steps, it lead to resolved my issue.


Microsoft System Centre Operation Manager (SCOM) 2007 R2 - Heartbeat failing randomly from any Management Servers to Any Clients.


Environment:
Operating System - Microsoft Windows 2008 R2 Enterprise
Database - Microsoft SOL 2008 
Application - Microsoft SCOM 2007 R2 with Management server in Clusters.


Issue is, Heartbeat failing randomly from any Management servers to any of clients with event 20022 which caused by many of the reasons like network issue, server performance issue etc... In this case, I faced this issue and troubleshoot it step by step as specified below.



To analyse if the issue happens at network side, I have provided steps to collect data below

Disable TCP Chimney
======================
Please help disable TCP chimney on MS,RMS and the SQL server as a best practice. Some more information of TCP chimney is shared below:
http://support.microsoft.com/default.aspx?scid=KB;en-us;q945977



Run SQL Queries Below
Please launch SQL management studio and run below queries then save the result to .csv file and send them to me.
Use operationsmanager
SELECT so.name,
8 * Sum(CASE WHEN si.indid IN (0, 1) THEN si.reserved END) AS data_kb,
Coalesce(8 * Sum(CASE WHEN si.indid NOT IN (0, 1, 255) THEN si.reserved END), 0) AS index_kb,
Coalesce(8 * Sum(CASE WHEN si.indid IN (255) THEN si.reserved END), 0) AS blob_kb
FROM dbo.sysobjects AS so JOIN dbo.sysindexes AS si ON (si.id = so.id)
WHERE 'U' = so.type GROUP BY so.name  ORDER BY data_kb DESC

Use operationsmanagerDW
SELECT so.name,
8 * Sum(CASE WHEN si.indid IN (0, 1) THEN si.reserved END) AS data_kb,
Coalesce(8 * Sum(CASE WHEN si.indid NOT IN (0, 1, 255) THEN si.reserved END), 0) AS index_kb,
Coalesce(8 * Sum(CASE WHEN si.indid IN (255) THEN si.reserved END), 0) AS blob_kb
FROM dbo.sysobjects AS so JOIN dbo.sysindexes AS si ON (si.id = so.id)
WHERE 'U' = so.type GROUP BY so.name  ORDER BY data_kb DESC

Network Logs
=============
Capture Network trace or run Netmon tool.......

1. On the selected machine and SCOM MS, download and install Network monitor 3.4
2. Run below command on both servers
nmcap /network * /capture /file <drive letter>:\nmcap.chn:200M
NOTE: Above command may generate a great number of files with 200M size. Please select a drive with sufficient free disk space and monitor the disk usage regularly.
3. Once MS reported event 20022 and the machine name is listed in the event 20022 , then please press Ctrl+C on both servers  to stop nmcap, save it as a file.

MS Tracing Log
======================
1. On MS server, replace C:\Program Files\System Center Operations Manager 2007\Tools\TracingGuidsNative.txt by attached one.
2. Open a command window MS server, change directory to C:\Program Files\System Center Operations Manager 2007\Tools
3. Run "Starttracing.cmd VER".
4. Wait until issue happens.
5. Run "stoptracing.cmd"
6. Run "formattracing.cmd"
7. Compress all files under C:\Windows\Temp\OpsMgrTracing 
8. Please also export Operations Manager event logs from MS and RMS (Root Management Server) 

Analysis
To work around this issue, I also suggested to set "Number of missed heartbeats allowed" to 10 from SCOM server.
By doing that, we can at least fix some false alert and still can get the correct information if MS or agents are done.

The issue still persists even after implementing above changes, so collected performance monitor log to trace the performance.
From that performance log, most the system resource, like CPU and Memory, are running at healthy level. but, the disk performance is not very good, especially on C and D. (as have 2 drives)
1.       On driver C, the Data transfer rate is not very high (average less than 1MB) but the disk queue length is pretty high. Also, every IO on that disk on takes about 0.2sec to be completed which is not a very good performance.
2.       On driver D, the IO load is much higher. About 6MB data transferred every second and most of these IO load comes from HealthService. The time to complete every IO request is similar with C driver (about 0.2-0.3sec, even reach to 1sec at the peak time)

By comparing the registry setting, I have noticed the RMS cluster node B Health Service Store is located on D: drive, which is a local drive. On A node, it’s on J: drive, which is on SAN, shared between both nodes. I noticed below reg key is configured differently:

On Node A:



On Node B:


This probably contributed to the disk IO issue as health service state should be on J: drive instead.

Solution
Changing above "State Directory" to SAN disk, issue is resolved now.





The Service Host process (Svchost.exe) that hosts the Computer Browser service and the Server service stops unexpectedly in Microsoft Windows Server 2003

Symptoms:

In Microsoft Windows 2003 X64 R2 SP2, unable to start Browser and Server service, error – “Access Denied”

While you restart the box, both will start automatically but not starting manually, it throws an error – Access Denied.

Cause:

In some of 2003 box, browser DLL file not upgraded during patching for some issues and then running with older version which conflict with latest OS file sub system and cause this issue. It may also cause due to other issue example if some third-party server becomes the master browser. When a Windows-based server is the master browser, it does not let any host that is advertising its share to use a server comment that is larger than 48 bytes. When a third-party server message block (SMB) server becomes the master browser, it lets a host use a server comment that is larger than 48 bytes. In this case, when a Windows Server receives and tries to process the browser list, the Server service crashes. This behavior occurs because there is an overflow that eventually leads to a heap corruption

Resolution:

Check Browser.dll file with version, it will look like,

Module Path: C:\WINDOWS\SYSTEM32\BROWSER.DLL
Symbol Status: SYMBOLS_PDB
Time/Date String: "February 17, 2007 15:03:41"
Product Version: (5.2:3790.3959)
File Version: (5.2:3790.3959)
Company Name: Microsoft Corporation
File Description: Computer Browser Service DLL

Whereas newer version is - 5.2.3790.4040

So, action plan is to upgrade file to its latest version for specific OS running on the box.

Update for Windows Server 2003
Update for Windows Server 2003 for Itanium-based Systems
Update for Windows Server 2003 x64 Edition
Update for Windows XP x64 Edition



Symptom:
In SCOM 2012, we cannot discover windows computers as network devices, so the SNMP trap send from Windows computers can be received by SCOM management server.

Analysis:
After discussing with the owner of network monitoring in Operations Manager product group, I personally confirmed it’s not possible to monitor SNMP traps from a windows computer via SNMP monitors.

I also learnt that the reason to filter the windows computers out of network discovery is because after we introduced network monitoring in OM 2012, a device is either a network device or a computer, if we support SNMP on a Windows computer, the same device will be discovered twice, once as a Windows computer, once as a network device and this will cause problem.

So in this scenario, we will need to use other mechanism to receive the trap and generate alert.

Alternate Way:
1.       Customize the application to send alert via other method like eventlog.
2.       Use other solution to receive the trap and call SCOM API to generate a property bag, and SCOM can receive this and generate alert.