Recently, one of our customers ran into a strange issue with Lync Online. They were unable to see presence information for contacts in the HP.com domain or to initiate conversations with them. On the other side of the line, everything worked OK. Here’s the interesting bit though: the issue only affected *some* users!
Just to be on the safe side, we verified that all federation-related settings were properly configured. Everything looked OK on the customer’s side, and HP uses open federation, so pretty much every Lync organization is allowed. The per-user settings were also configured correctly, and we confirmed that communicating with other external parties was not a problem.
Unable to think of any reasonable explanation for such behavior, we raised a call with Microsoft for further investigation. Iulian Bucatariu, the ‘dedicated’ Lync engineer in the Romanian Premier office, was as helpful as ever, and we quickly ran some tests and collected troubleshooting information. The UCCAPI logs showed that, for the affected users, the “INVITE” request sent to the HP.com contact was never acknowledged and simply timed out. Unfortunately, the logs do not include routing information for failed requests, so we had no direct evidence of where exactly the issue lay, but everything hinted at something fishy going on with the edge servers. We agreed to look further into this by requesting more detailed traces from both the Microsoft and the HP servers.
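To illustrate the kind of check we did by eye in the logs, here is a minimal Python sketch that looks for INVITE requests that never received any response. It assumes the SIP messages have already been extracted from the log as separate strings (the extraction step is not shown and depends on the log layout), that headers use the long form `Call-ID` and `CSeq`, and that matching on Call-ID is good enough. The function name, the sample addresses and the Call-ID values are made up for the example.

```python
import re

# Start line of an INVITE request, e.g. "INVITE sip:user@example.com SIP/2.0"
INVITE_RE = re.compile(r"INVITE\s+(\S+)\s+SIP/2\.0")
# Start line of any SIP response, e.g. "SIP/2.0 200 OK"
STATUS_RE = re.compile(r"SIP/2\.0\s+\d{3}")
CALL_ID_RE = re.compile(r"^Call-ID:\s*(\S+)", re.MULTILINE | re.IGNORECASE)
CSEQ_RE = re.compile(r"^CSeq:\s*\d+\s+(\w+)", re.MULTILINE | re.IGNORECASE)


def unanswered_invites(messages):
    """Return {call_id: target_uri} for INVITEs with no response at all.

    `messages` is a list of raw SIP messages in chronological order
    (an assumption about how the log was pre-processed).
    """
    pending = {}
    for raw in messages:
        msg = raw.lstrip()
        call_id = CALL_ID_RE.search(msg)
        if call_id is None:
            continue  # not a complete SIP message, skip it
        key = call_id.group(1)

        invite = INVITE_RE.match(msg)
        if invite:
            # Remember the first INVITE seen for this dialog
            pending.setdefault(key, invite.group(1))
            continue

        cseq = CSEQ_RE.search(msg)
        if STATUS_RE.match(msg) and cseq and cseq.group(1).upper() == "INVITE":
            # Any response to the INVITE counts as an acknowledgement here
            pending.pop(key, None)
    return pending


# Hypothetical sample: one INVITE is answered, the other one is not
SAMPLE = [
    "INVITE sip:alice@hp.com SIP/2.0\r\nCall-ID: aaa111\r\nCSeq: 1 INVITE\r\n\r\n",
    "INVITE sip:bob@example.org SIP/2.0\r\nCall-ID: bbb222\r\nCSeq: 1 INVITE\r\n\r\n",
    "SIP/2.0 200 OK\r\nCall-ID: bbb222\r\nCSeq: 1 INVITE\r\n\r\n",
]

print(unanswered_invites(SAMPLE))
```

With the sample above, only the INVITE sent to the hp.com address is left pending, which mirrors what we saw for the affected users.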
As a result of the enquiry Iulian made to the product group (PG), the next day we were informed that this was a known issue, and the following description appeared on the tenant’s Service Health Dashboard (SHD):
Current Status: Engineers have identified an intermittent issue affecting Lync Online Instant Messaging (IM) and Presence between federated partners. The investigation determined that recent capacity additions to the service infrastructure caused a specific failure scenario. Engineers are deploying a fix to the environment to address the configuration issue. The deployment is expected to complete by mid-March 2015.
User Experience: Affected users are unable to IM or see Presence information with federated partners.
Customer Impact: This issue affects tenants who are on-premise and in a hybrid state. Although the issue is intermittent, the impact may be prolonged up to 24 – 48 hours.
Incident Start Time: Friday, December 12, 2014, at 12:00 PM UTC
Estimated Restoration Time: By mid-March 2015
Preliminary Root Cause: As part of our efforts to improve service performance, additional capacity units were deployed to the service infrastructure; however, the update caused a failure scenario in which the new topology is seen as active prior to the capacity completion date.
Next Update by: Friday, February 20, 2015, at 10:30 PM UTC
A bit vague, as usual. Needless to say, the customer was not happy, especially given the dates in the message. This triggered some additional email exchanges (sorry about that, Iulian), and as a result a more detailed description of the issue was provided:
This issue was located recently and only affects the specific scenario that is occurring for these users.
This issue can also appear intermittently and may not affect all users in a tenant.
Here is the Scenario that will cause this:
• LyO Tenant is homed in another forest (the customer tenant – Europe) from the federated partner (HP tenant – US)
• The partner is setup as a Hybrid Tenant (A DNS record pointing to an on-premises deployment and also have a LyO tenant with the same SIP domain added is enough). Like how HP.com is configured currently.
• This causes the edge servers in the tenants local forest to send it across to the partners local forest.
• A Code bug was identified that is related to required capacity expansion that is under-way.
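To make the "hybrid" condition from the list above a bit more concrete, here is a rough Python sketch. It assumes that the DNS record in question is the federation SRV record (`_sipfederationtls._tcp.<domain>`) and that a Lync Online-only domain points it at `sipfed.online.lync.com`; the message does not say which record is meant, so treat this as my interpretation. The function names and the `edge.example.com` host are hypothetical, and the actual DNS lookup is left out.

```python
# Federation SRV target used by Lync Online tenants
LYNC_ONLINE_FED_TARGET = "sipfed.online.lync.com"


def federation_srv_name(sip_domain):
    """Name of the federation SRV record for a SIP domain."""
    return f"_sipfederationtls._tcp.{sip_domain}"


def classify_target(srv_target):
    """Classify an SRV target host as 'online' or 'on-premises'."""
    host = srv_target.rstrip(".").lower()  # tolerate a trailing root dot
    return "online" if host == LYNC_ONLINE_FED_TARGET else "on-premises"


def looks_hybrid(srv_target, has_online_tenant):
    """True when the scenario from the list applies to the partner domain:
    the SRV record points on-premises AND the same SIP domain is also
    added to a Lync Online tenant (which must be known from elsewhere)."""
    return has_online_tenant and classify_target(srv_target) == "on-premises"


# Hypothetical values, not the real HP records
print(federation_srv_name("hp.com"))
print(looks_hybrid("edge.example.com.", has_online_tenant=True))
print(looks_hybrid("sipfed.online.lync.com", has_online_tenant=True))
```

Whether a domain has also been added to a Lync Online tenant cannot be read from DNS alone, which is why it is passed in as a plain flag here.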
This bug has been identified and has been addressed in the next release, the current ETA for that deployment could start as early as Mid-February.
This issue will also be resolved once the current service expansion work is completed which should be within the next 2-3 weeks.
Unfortunately for us at HP, this almost certainly means that more of our customers are affected. Hopefully, the resolution will come faster than expected. And hopefully, Microsoft will do a better job of informing affected parties, as in our case the information only became available thanks to the patience and dedication Iulian shows in his work.