In this post, I will talk about configuration best practices for disks. It is a vast topic, so I will limit it to one incident I have seen. Recently I was paged about a lost heartbeat on a server. I was able to connect to the server, but no disks were showing except the two local ones: one for the OS and another for ad-hoc storage. I contacted the client and spent an hour on a bridge call, and we were finally good once all the missing disks came back online after a server reboot.
What was that error?
The error log showed the snippet below. It means SQL Server had started but could not reach the user databases because the storage device was not found. Reading the log, you can see that Operating System error 1167 (The device is not connected.) was encountered, reported under Error: 17053, Severity: 16, State: 1.
[ERROR] 2016-12-08 01:49:49.38 spid4s Error: 17053, Severity: 16, State: 1. --> SQLServerLogMgr::LogWriter: Operating system error 1167(The device is not connected.) encountered.
[WARNING] 2016-12-08 01:49:49.38 spid4s Write error during log flush.
[ERROR] 2016-12-08 01:49:49.38 spid4s Error: 17053, Severity: 16, State: 1. --> SQLServerLogMgr::LogWriter: Operating system error 1167(The device is not connected.) encountered.
[ERROR] 2016-12-08 01:49:49.38 spid764s Error: 9001, Severity: 21, State: 5. --> The log for database 'MyDB' is not available. Check the event log for related error messages. Resolve any errors and restart the database.
[WARNING] 2016-12-08 01:49:49.38 spid4s Write error during log flush.
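When an error log is long, it can help to pull out just the distinct error numbers before reading it line by line. The following is a minimal Python sketch of that idea, not a standard tool: the function name `summarize` and the regular expressions are my own assumptions, based only on the message format seen in the snippet above.

```python
import re

# Assumed patterns, derived only from the log lines shown in this post.
# Matches the "Error: N, Severity: N, State: N" header.
ERROR_RE = re.compile(r"Error: (\d+), Severity: (\d+), State: (\d+)")
# Matches "Operating system error N(message)".
OS_ERROR_RE = re.compile(r"Operating system error (\d+)\(([^)]*)\)")


def summarize(log_text):
    """Return the distinct SQL Server and OS error numbers found in log_text."""
    sql_errors = sorted({int(m.group(1)) for m in ERROR_RE.finditer(log_text)})
    os_errors = sorted(
        {(int(m.group(1)), m.group(2)) for m in OS_ERROR_RE.finditer(log_text)}
    )
    return {"sql_errors": sql_errors, "os_errors": os_errors}


# Two of the lines from the incident, used as sample input.
sample = (
    "[ERROR] 2016-12-08 01:49:49.38 spid4s Error: 17053, Severity: 16, State: 1. "
    "--> SQLServerLogMgr::LogWriter: Operating system error 1167"
    "(The device is not connected.) encountered.\n"
    "[ERROR] 2016-12-08 01:49:49.38 spid764s Error: 9001, Severity: 21, State: 5. "
    "--> The log for database 'MyDB' is not available.\n"
)

if __name__ == "__main__":
    print(summarize(sample))
```

Seeing 17053 and 9001 together with OS error 1167 in one summary makes it easier to spot that the problem is below SQL Server, at the storage layer.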
What was the cause?
There was a scheduled maintenance that night: a firmware update for the SAN, which completed successfully. Following the maintenance, two production servers lost their connection to the SAN, which the storage admin had not anticipated, and I started getting paged. The notable thing was that both servers were using the same SAN storage.
How was the error resolved?
Error 9001, Error 17053 and OS error 1167 all point to lost connectivity between the SAN disks and the system. While we were on the bridge, the storage admin brought in the vendor's tech support, who suggested rebooting the Windows Server boxes one by one. As soon as we rebooted each box, luckily the drives came back and all the user databases were up again.
Lessons learned:
- Perform maintenance activities one at a time
- Do not use the same storage for two or more servers that you intend to use as Primary and Secondary (HA and/or DR)
- Always use a separate SAN for backups so that they remain available in cases like this
- If you are using virtualization, be cautious and keep your Primary and Secondary (HA and/or DR) servers on different virtual machines
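The shared-storage lesson above can be checked mechanically if you keep even a simple inventory of which server sits on which storage array. Here is a minimal Python sketch under that assumption. The server names, array names, and the `shared_storage` function are all hypothetical and only illustrate the idea; they do not come from the incident.

```python
from collections import defaultdict


def shared_storage(server_to_array, ha_groups):
    """Return (array, servers) for every HA/DR group whose members share an array."""
    problems = []
    for group in ha_groups:
        by_array = defaultdict(list)
        for server in group:
            by_array[server_to_array[server]].append(server)
        for array, servers in by_array.items():
            if len(servers) > 1:
                problems.append((array, sorted(servers)))
    return problems


# Hypothetical inventory: server name -> storage array it depends on.
inventory = {
    "sql-prod-01": "san-a",
    "sql-prod-02": "san-a",
    "sql-dr-01": "san-b",
}

# Hypothetical HA/DR pairings to validate.
pairs = [
    ("sql-prod-01", "sql-prod-02"),  # both on san-a: flagged
    ("sql-prod-01", "sql-dr-01"),    # different arrays: fine
]

if __name__ == "__main__":
    for array, servers in shared_storage(inventory, pairs):
        print(f"{servers} share storage array {array}")
```

A check like this would have flagged the two production servers in this incident before the firmware update, since both depended on the same SAN.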
I hope you liked what I wrote about configuration best practices for disks. Let me know if I missed anything or if you want me to write about something specific. You can always email me, or share your thoughts in the comments section.