A few weeks ago, a customer contacted me to tell me that voicemail wasn’t working correctly for a few users. What started out as an intermittent malfunction quickly spread to many users, and then to complete meltdown. This customer is running Lync 2013 Enterprise with 3x Front Ends. They have 2x Edge Servers. They have a KEMP Load Balancer which also serves as their Reverse Proxy. For all intents and purposes, this is our typical/standard Enterprise level High Availability Lync 2013 rollout. Here’s a little debrief of the issue and how it was resolved.
Calls to voicemail weren’t working
I misunderstood this at first as a problem with the web service behind the “Attendant” process; however, upon further investigation, it turned out to be Mediation between Lync003 and the Exchange servers. This makes a LOT of sense, because some users were working and others were not: Mediation for a given call is handled by the Lync Front End Server on which that user is primarily placed. User placement in Lync 2013 is a bit different, with a primary and a secondary placement based on the Lync Fabric.
More about the Fabric Pool Manager.
For users placed on Lync001, voicemail was fine. If the call went via Lync003, it failed because Mediation there was broken.
When did Mediation break? Well, Lync003 Mediation broke on Tuesday August 27. How do I know that? Because…
Cen-Lync003 started throwing Certificate Store errors at 9:42am that day.
Notice the cert serial number: it matches, in reverse byte order, the serial number of the GoDaddy (Starfield) certificate.
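If you want to verify that kind of match yourself rather than eyeball it, the comparison is just a byte-order reversal of the hex string. A minimal sketch (the serial values below are made up for illustration, not the customer's real cert):

```python
def reversed_serial(serial_hex: str) -> str:
    """Reverse the byte order of a hex-encoded certificate serial number."""
    clean = serial_hex.replace(" ", "").lower()
    pairs = [clean[i:i + 2] for i in range(0, len(clean), 2)]
    return "".join(reversed(pairs))

# Hypothetical values for illustration only.
event_log_serial = "0d3c2b1a"
cert_mmc_serial = "1a 2b 3c 0d"
print(reversed_serial(cert_mmc_serial) == event_log_serial)  # True
```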
At the same time, the event logs on Lync003 filled up with a plethora of really bad errors/warnings. So, basically, Lync003 had been unusable since August 27 at 9:42am because of corrupt certificate stores.
Next, during the client’s troubleshooting, servers started to get rebooted…and that’s when things really got bad.
At approximately 8:31pm, during a Lync002 reboot, the certificate store also got corrupted.
Same error, and if you look at the details, it’s the same certificate. So, by this time, Lync003 had been unusable since August 27 and now Lync002 was unusable too. Lync001’s logs reflected the loss of both of its peers.
This is when the fabric was first lost. You can “filter current log” on event ID 32169 and see that this is the beginning. This is when someone was attempting to fix things by bouncing all the Front Ends. I believe most Microsoft engineers are the same way; the mantra of “when all else fails, reboot” usually holds true. But in this case, it caused the Fabric to fail. Quorum was lost. Insert drama here.
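If you'd rather grep than click, the same “filter on 32169” idea works against a CSV export of the event log. A quick sketch using a made-up export (the column names mimic Event Viewer's CSV save format, which may differ slightly by version):

```python
import csv
from io import StringIO

# Hypothetical CSV export of the "Lync Server" event log, for illustration.
exported = StringIO(
    "Level,Date and Time,Source,Event ID,Task Category\n"
    "Warning,8/27 8:31:05 PM,LS Server,32169,(1000)\n"
    "Error,8/27 8:31:40 PM,LS UserServices,30988,(1006)\n"
    "Warning,8/27 8:32:10 PM,LS Server,32169,(1000)\n"
)

# Keep only the quorum-loss warnings.
quorum_warnings = [row for row in csv.DictReader(exported)
                   if row["Event ID"] == "32169"]
print(len(quorum_warnings))  # 2
```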
In a three-server pool, quorum is impossible until at least two Front Ends can see each other. So, this is when the total meltdown started.
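The underlying arithmetic is simple majority. The real Windows Fabric quorum logic has more moving parts (voter placement, witness behavior, and so on), but as a rough sketch:

```python
def quorum_needed(pool_size: int) -> int:
    """Minimum number of Front Ends that must see each other (simple majority)."""
    return pool_size // 2 + 1

# A three-server pool needs two healthy, mutually visible Front Ends.
print(quorum_needed(3))  # 2

# With Lync002 and Lync003 corrupted, the one surviving server can't form quorum.
print(1 >= quorum_needed(3))  # False
```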
What I did here, via the MMC, was remove the certificate and reimport a known-good PFX (including the private key) from Lync001. When I did that, Mediation was able to start on both Lync002 AND Lync003. That was good: it cleared the 14397 errors from above. But it didn’t quite fix everything.
At that time, I started to rebuild the fabric (again) using the Reset-CsPoolFabricState cmdlet.
It started and was running… and after several discussions with the Microsoft pre-support engineer and a review of his resources, we just had to sit and wait it out.
During this waiting period, I spent a LOT of time in the event logs. Although the above “certificate fix” cleaned up that problem, it introduced more problems with the same certificate. An example:
So, what I did that time, on Lync002 and Lync003, was open the Lync Deployment Wizard, go to Step 3 (the certificates step), and remove all of the certificates from Lync and the associated Lync stores.
Then, from the MMC, I exported the certs (including the private key), checking the option to delete the private key after a successful export. After that, I removed the cert from the MMC.
Then I went back to step three of Lync Wizard and I re-imported both the cert and the private key from the known-good PFX export from when the servers were built. I did this on both Lync002 and Lync003.
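Before an import like this, it's worth sanity-checking that the PFX you're about to load really contains the same certificate. The “thumbprint” the Windows MMC displays is the SHA-1 hash of the certificate's DER encoding, which is easy to reproduce. A sketch (the byte strings below are placeholders; in practice you'd read the bytes from the exported .cer file):

```python
import hashlib

def thumbprint(der_bytes: bytes) -> str:
    """Compute the SHA-1 thumbprint Windows displays for a certificate."""
    return hashlib.sha1(der_bytes).hexdigest().upper()

# Placeholder bytes standing in for real DER-encoded certificates.
known_good = b"-- DER bytes of the known-good export --"
candidate = b"-- DER bytes of the known-good export --"
print(thumbprint(candidate) == thumbprint(known_good))  # True
```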
I then restarted all Lync services and they all came back online, except the Front End service, which was still not working because of the fabric rebuild.
At this time, I was waiting for the fabric to rebuild. Lync002 and Lync003 were consistently logging 32169 warnings about quorum (the same message as above). Lync001 was as well, along with matching 32170 errors.
So, at this time, I waited. And waited. And waited. At approximately 3:41pm, the Fabric successfully initialized.
And now, the Fabric Pool Manager started placing users in the routing groups
All servers (Lync001, Lync002, Lync003) had matching entries related to this starting approximately 3:41pm.
I waited more.
At approximately 6:29pm, things appeared really good:
At about 6:33pm, I verified services had started and emailed the client. It’s also at this time that all of the system-level schannel (TLS/certificate) errors cleared up on all machines.
I’d seen a lot of these errors over the previous few weeks, on all servers, but after all of my cert work during this issue and the fabric rebuild, none had appeared since the fabric reinitialized and services started successfully.
Anyway, over the course of the next day, I ran a bunch of tests successfully. Another engineer went onsite and we tested voice. We called it a night about 8:00, and Jonathan went back in Friday to verify functionality.
At the end of the day, it was a number of problems.
- Voicemail calls failed because of Mediation failures, which were caused by the corrupt certificate store on Lync003.
- The effort to fix it corrupted the certificate store on Lync002, which caused fabric quorum to be lost.
- At that point, Lync001 was the only “usable” machine, but a single server couldn’t form quorum, so everything went down.
Anecdotally, a colleague mentioned blue screens on his Hyper-V hosts with Broadcom NICs. My current theory is that the blue screens crashed the virtual machines, and after several such crashes the servers began throwing schannel (TLS/certificate) errors. Eventually, this corrupted the certificate stores on Lync003 and Lync002.
Why wasn’t Lync001 affected? Dumb luck.