Ignite 2019 is now over and as usual I’ve been spending some time going over sessions I wasn’t able to attend in person. One of these was the “Service Health Dashboard” session, Incident communications at cloud-scale: How Microsoft 365 is improving when things go wrong, presented by Mike Ziock, Sanjoy Ghosh and Sirisha Pingali. This year, there are actually a few very nice new features/improvements being released, so I’d say the session is worth watching, but there are a few buts…
Let’s start with the good news. After years and years of leaving the same type of feedback with regards to the ineffectiveness of the SHD, Microsoft has finally released features to address some of the gaps. The biggest one is the “Report an issue” feature, in other words the ability to let Microsoft know you are experiencing some issue with the service by filling out a form. Sounds a lot like opening a service request? Yeah, we’ll talk about this in a minute. First, here’s what Report an issue looks like:
So, the idea behind this functionality is that you can send a signal to Microsoft that something is wrong. A signal that contains information relevant to a (potential) outage gets correlated with similar signals from other customers and (hopefully) ends up routed to the engineering teams much faster. Because apparently even the SHD folks now agree that the service request experience in the portal sucks (no, they didn’t say that in the session).
If you go over the controls of the Report an issue form, you will see that the selection is very limited and very targeted to outage scenarios. Is it causing impact to your business? You probably wouldn’t be on that page if it wasn’t, but OK. Which service is it affecting? The choices are Exchange Online, Teams, SharePoint Online, OneDrive for Business and Skype for Business. You can also select Other, which then gives you the option to designate some other service on the next question, including Intune, PowerApps, Flow, Power BI, the admin center or the SCC.
For each of the individual services/products, you will need to further categorize the issue by selecting from a handful of service-specific scenarios. For example, selecting Exchange Online will result in the following choices: connectivity issues, issues with mail flow, calendaring issues, and other. Selecting Teams will give you a choice of connectivity issues, issues with calls or meetings, issues with chats or presence, or again “other”. Lastly, you are given a free text field where you should provide additional information about the issue, in 1000 characters or fewer. Pressing the Submit button will then inform you that you will see the issue appear on the SHD within 30 minutes, should Microsoft confirm its validity and that it’s affecting your tenant. You will also receive an email confirmation. In the future, you will also be able to receive email notifications with regards to the submitted issue’s status updates.
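For the scripting-minded, the form boils down to a small data structure. Here is a rough Python sketch of it, handy if you want to pre-draft a report before opening the portal. To be clear, this is purely illustrative: the names `IssueReport`, `SCENARIOS` and `MAX_DETAILS_LENGTH` are made up for this example, nothing here is a Microsoft API, and only the two scenario lists mentioned above are included.

```python
from dataclasses import dataclass

# Scenario choices as described in this post. Other services are left out
# because their scenario lists are not covered here.
SCENARIOS = {
    "Exchange Online": ["connectivity", "mail flow", "calendaring", "other"],
    "Teams": ["connectivity", "calls or meetings", "chats or presence", "other"],
}

MAX_DETAILS_LENGTH = 1000  # free-text limit shown on the form


@dataclass
class IssueReport:  # hypothetical name, not a Microsoft type
    service: str
    scenario: str
    business_impact: bool
    details: str

    def validate(self) -> list:
        """Return a list of problems; an empty list means the draft looks complete."""
        problems = []
        if self.service not in SCENARIOS:
            problems.append(f"unknown service: {self.service}")
        elif self.scenario not in SCENARIOS[self.service]:
            problems.append(f"unknown scenario for {self.service}: {self.scenario}")
        if not self.details.strip():
            problems.append("details are empty")
        elif len(self.details) > MAX_DETAILS_LENGTH:
            problems.append("details exceed 1000 characters")
        return problems
```

Calling `validate()` on a draft returns an empty list when the service, scenario and description all fit the constraints of the form as described above.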
That’s the new Report an issue feature in a nutshell. It’s a good addition, and customer feedback during the preview has been overwhelmingly positive. What’s more important, Microsoft has already validated the usefulness of the feature for the purposes of better scope evaluation and cause determination for emerging issues. And in some cases the feature resulted in the engineering team being alerted even before the internal signals.
To continue with the new stuff, we finally have the option to configure email notifications for incidents. You can configure those by going to the SHD page and clicking the Preferences button on top of the list. Only a single setting is exposed currently, namely Send me email notifications for service health incidents. Some other small additions deserve mention, such as the ability to “confirm” an incident. More specifically, you can click on a given Advisory and press the Are you experiencing this issue? link to signal to Microsoft that their scoping and/or classification of the advisory is incorrect and that it should instead be considered an Incident. Another addition should bring “notifications” about current outages as part of the service request submission process.
If all you care about is the new stuff, stop reading here and have a nice day. A small rant will follow, don’t say I didn’t warn you 🙂
So why am I trying to bitch about the Report an issue feature? Because it shows the deeper issue: the disconnect between Microsoft support, as in the vendors they hire to handle support requests, and their engineering teams, the people actually running the service. The above process sounds a lot like opening a service request, no? That’s what it is after all, a support request, submitted by a customer experiencing an issue with the service Microsoft is providing, by using the exact same portal you use to submit a “regular” service request. You simply bypass the “helpful” chat bot, “suggestions” and other “AI” gimmicks that have plagued the service request submission process for the past few years, and probably even the first line of support (that part is not confirmed by Microsoft).
A large part of the session was devoted to describing how Microsoft currently monitors support tickets, with emphasis on the use of ML (of course) and telemetry to identify emerging issues. Signals from other sources are also used, such as internal monitoring probes, the downdetector website, Twitter and more. All good I suppose, but we’ve been hearing this for years and years. Yet, no one I know was happy with the state of the SHD and the accuracy of the information presented there (and the language used, but let’s leave that for another rant). Now suddenly a feature that takes into account what customers are reporting as an issue is released and both sides are very happy with it. Why is that?
If support agents actually did a decent job of triaging the tickets, validating issues reported by the customers, communicating back to the submitter and escalating in a timely manner where needed, you wouldn’t need to duplicate the existing “report an issue” functionality. The whole outsourcing model is the problem, and in turn it translates into a people problem. When you outsource your support to organizations that hire the cheapest possible labor, and only care about keeping that contract by any means possible, there’s only so much you can expect. I don’t imagine Microsoft will bring support for their services in-house (they do have the cash for this, mind you), but it’s obvious that the current approach of trying to solve this problem via technology while ignoring the people element ain’t gonna cut it.
And yes, I do understand (most of) the reasoning behind their approach. I actually worked as a support engineer for Office 365 back when it first GA’d, so I’ve seen the issues from various angles. Supporting such a diverse userbase, ranging from individuals to huge enterprises across all industries, and such a huge number of services bundled together is not an easy thing to do. You inevitably get a lot of “how to” tickets, tickets with little to no relevant information, tickets about known issues or ongoing incidents, angry customers who just want to vent, and so on. Many of the support features we’ve seen launched over the past few years have been designed to help reduce the number of tickets, with questionable effect (that’s of course my own opinion). Every little bit helps I suppose, but sometimes something more radical is needed. Hopefully this new feature is an indicator of things slowly moving in the right direction.
Anyway, back to the SHD session. One of the other interesting parts was listening to the questions at the end. Unsurprisingly, many of the same topics that were discussed in sessions from previous years came up again. Which is a very strong “signal” for Microsoft, I’d say. Do you know the difference between an incident and an advisory? 🙂
1 thought on “Service Health Dashboard improvements and another rant on support”
Yeeaaah. It’s like the engineering team is trying to come up with a workaround to skip the L1 support 🙂 That’s why news about Office 365 adding another xxx million users is not so exciting, because if it works, it is ok. But if it doesn’t, then you will be facing worse and worse support, and at some point it can bite them back. That’s why having partners with expertise is good. Saved us during a huge outage out of nowhere with Password hash sync a year or so ago.