Office 365 outages and the support experience

Office 365 is six years old now, has blown away the competition and is confidently on its way to reaching the $20bn goal set by Satya Nadella. What’s even more impressive is the fact that only a handful of major incidents have occurred over the course of the last six years, a truly remarkable achievement that says a lot about the level of operational maturity Microsoft has developed. One area where things still fall short, however, is the support experience, which is impacted by several long-standing issues.

To date, the way outages in the service are handled remains an annoyance, communication-wise. The infamous Service Health Dashboard (SHD) has gone through several iterations over the past few years, but remains as unreliable as ever. More often than not, issues that multiple users in your organization are complaining about fail to be reflected on the SHD, even after you have contacted support and the issue has been acknowledged by the support engineer or even escalated. Similarly, situations in which an issue affecting multiple tenants or entire regions is not reflected in the SHD for your tenant are also common. In all fairness, given the scale at which Office 365 services operate, it’s not a trivial task to identify the exact number of affected users across all the different layers. Thus, Microsoft is taking a somewhat cautious approach and only flagging tenants for which they are certain the issue is ongoing.

Having worked on both sides, as an Office 365 support engineer and as top-tier support for large enterprises, I wouldn’t necessarily agree that this is the best approach. In most cases, customers are reasonable and understanding, and I’ve rarely come across the “I don’t care it’s an issue on Microsoft side, I want it fixed right now” behavior. Simply knowing the issue is being worked on is enough for most situations, which is why I believe it would benefit both Microsoft and the customer if issues were reflected on the SHD in a more timely manner, and the scope relaxed a bit to cover even tenants which are “suspected” to be impacted by the issue at hand. The information provided in the actual Service Interruption Event (SIE) message is another area where Microsoft needs to improve, but more about this later on.

Just to clarify, there is actually some information included about the scope of an incident nowadays. First of all, the latest SHD version adds the “User count” column, which can give you an indication of how many users in your tenant are affected by the issue. Similarly, some information about the number of tenants affected is also available, but in order to get that you have to use the Office 365 Service Communications API. I’ve blogged about this here:

The text of the actual SHD items is another common source of complaints. The “neutral” language used can be mildly irritating at times, with all the “may have been affected” and similar phrases. More importantly, the lack of detail and timeframes often leaves customers unsatisfied and might result in them opening yet another support request, in hopes of getting more information about the issue. Personally, I would prefer to have a clearer statement where possible, instead of the generic wording.

Of course, at times there are issues that affect the SHD itself, the admin section of Office 365 or even the Office 365 login entirely. For example, last week a service interruption event (MO107347) prevented customers from using the Office 365 admin center to open service requests. As another example, earlier this morning (July 3rd) parts of EMEA were affected by an issue with the authentication infrastructure, which prevented users from accessing the Office 365 portal. A logon loop was experienced for new logins, while users that were already logged in to the service were greeted with the following error when trying to access the Admin portal:

[Screenshot: error message displayed when accessing the Office 365 Admin portal]

As both the SHD and the service requests page are part of the Office 365 Admin portal, you can guess what happened when you tried to click the “Service Request” link in the above screenshot. In turn, this meant that users were left with no way to get additional information about the issue. Now, the more experienced Office 365 administrators out there will probably want to correct me here: there is indeed a place you could go to get information about any ongoing issues that might prevent you from logging in to the Office 365 portal. Namely, the Office 365 Service Health Status page, accessible via the following URL:

There’s a reason most Office 365 administrators have never even heard about this page: it’s pretty much useless. It hasn’t been updated in years and I cannot recall ever visiting the URL above and seeing anything other than the customary “There are currently no known issues preventing you from signing in to your Office 365 service health dashboard” message displayed. Never ever. No surprises this time around either:

[Screenshot: the Office 365 Service Health Status page reporting no known issues]

Despite numerous requests over the years, Microsoft has yet to provide us with a reliable, independently hosted dashboard or status page which we can refer to when the occasional outage hits. Even six years after GA, Twitter and other social media are far more reliable sources of information about any ongoing issues than Microsoft’s own tools. The different Office 365 communities out there, such as the Microsoft Tech Community, or 3rd party products can also be very helpful in situations like this one.

In this particular instance, the symptoms closely matched a recent SIE, reported by multiple participants in this thread on the Tech Community last Friday. Even though the Friday issue was never flagged in my tenant, thanks to the information others had already shared last week I was able to quickly perform some checks to better understand the scope and identify possible workarounds. Granted, this will not always be the case, but it’s an example that shows that knowing someone out there has a similar issue to what you are running into can be helpful for the customer, and not at all hurtful to Microsoft. I can understand the arguments behind making the information tenant-specific, and I believe this is indeed the right approach for most customers. Still, having an “advanced mode” if you will, one that surfaces a broader view of the current state of the service, can be helpful in some cases, instead of having to rely on channels such as Twitter.

Apart from the service status issues, Office 365 support in general has been a common source of complaints from customers. The experience can differ drastically depending on the region, vendor or agent you work with, and of course “Premier” tiers offer a lot more. That said, almost everyone I’ve ever talked to about support has expressed a few concerns, to put it mildly. And the latest attempts to streamline the support experience seem to have caused some additional disturbances, most of which could have been avoided if Microsoft had simply communicated the changes in advance.

For example, some organizations were surprised to find out that they could only have a single ticket open at a time, after a change was introduced a few weeks back. Moreover, even the ticket history was inaccessible, which further confused people. What Microsoft aimed to achieve with the changes was not clear at that time and, what’s worse, no one bothered to explain. In reality, Microsoft’s intentions were good: being aware that most customers prefer to talk to a person instead of filling in forms, Microsoft updated the support process so that the focus shifted to gathering the minimum required set of data and having a call placed by the agent once you press the relevant button in the new service request form.

[Screenshot: the new service request form]

Intentions aside, the execution wasn’t particularly good. Without a proper introduction to the changes and with the lack of any in-context explanations, the new service request experience (as shown in the screenshot above) can be a bit misleading. Only a single box is presented, which surfaces a very limited number of suggestions based on the input. Any complex query fails to trigger the built-in “solutions” and instead returns several suggested support articles, very similar to what a (poorly constructed) online search would yield. Only then does the “call me” option become available, and using it triggers the creation of a new ticket behind the scenes.

Then, after a short wait, a support agent will contact you in order to gather additional information about the issue and assist you as needed. In the UI, no option to provide additional information, upload log files or screenshots or even categorize the issue is presented at any point of the process, in the hope that all this and more will be handled by the support agent. And, in what is probably the biggest change in the support model, the “one issue per ticket” rule is no more. Thus, there is no need to open multiple tickets, which explains the removal of the relevant controls from the UI. However, no one bothered to explain this to customers, and the removal of the “ticket history” page added to the confusion.

Once you realize that the support experience is phone-driven, most of the concerns will go away. Whether this is the correct approach for all customers, however, is arguable. I can certainly think of some downsides with almost all the different parts of the process. When I worked as a support engineer, it was not that uncommon to have a customer who only wanted to be contacted via email. The language barrier can be another issue, especially in regions such as the EU. And of course, there is the lack of options to provide additional details, be it in the form of a more detailed description, screenshots or log files. Forcing all interactions with support over the same path will cause the occasional hiccup, but overall, I can understand why Microsoft is making the changes, and in time they should result in an improved support experience, especially for small tenants which might not be used to tackling technical issues.

I do hope, however, that some minor improvements are made to the new support UI, as currently it’s a bit unintuitive. If the intention is to drive the process through a phone interaction, why not simply surface the hotline number, or at least put a short description on the UI elements to let the user know what the end goal is? If anything, the new UI makes it look like Microsoft is making it harder to get in touch with support, while their intentions are quite the opposite. It’s just another example of how important proper communication is. I’m sure that if the reasoning behind the new support experience is explained, the majority of customers will approve of it.

I guess the key takeaway from this article is that while Microsoft is certainly trying to improve things, it’s a process that takes time. And there will be situations where the best approach is to contact support directly, be it to get information about an outage, to report one, or simply to open a new support ticket. As even Microsoft wants you to call in instead of using an email-based approach, make sure to bookmark the Office 365 support number for your country and use it:
