The year started with some compliance-related news, namely the deprecation of eDiscovery/In-Place Holds in the EAC, as well as the beloved Search-Mailbox cmdlet. Some questionable decisions were made there IMO, as the suggested replacements require you either to spend additional $$$ on premium SKUs or to do things manually, but we’ve already discussed those in a previous article. Now, we continue the compliance-focused discussion with a recently surfaced issue that might have some severe consequences.
The issue can be summarized as follows: if you use the Outlook client in cached mode to remove an attachment from a received message, the copy-on-write (COW) page protection mechanism will fail and the original item will not be saved. To give you some additional context, the COW page protection ensures that for mailboxes that are on hold (Litigation hold, In-place hold or retention policy created in the SCC), a copy of the original message is preserved when certain properties are modified. As the documentation explains, this includes messing with any of the attachments of a message or post. The copy of the original, unmodified item will then be saved in the /Versions subfolder of the Recoverable Items subtree and will still be available for eDiscovery investigations.
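To make the expected behavior concrete, here is a toy Python model of what copy-on-write is supposed to do for a mailbox on hold. It is only a sketch under my own assumptions: the `Message` and `Mailbox` classes, the `versions` list standing in for the /Versions subfolder, and the `remove_attachment` method are all hypothetical names for illustration and say nothing about how Exchange actually implements this.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Message:
    subject: str
    attachments: List[str] = field(default_factory=list)


@dataclass
class Mailbox:
    # Hypothetical model: 'versions' stands in for Recoverable Items/Versions.
    on_hold: bool = False
    inbox: Dict[str, Message] = field(default_factory=dict)
    versions: List[Message] = field(default_factory=list)

    def remove_attachment(self, message_id: str, name: str) -> None:
        message = self.inbox[message_id]
        if self.on_hold:
            # Copy-on-write: preserve the unmodified item BEFORE changing it.
            self.versions.append(copy.deepcopy(message))
        message.attachments.remove(name)


mailbox = Mailbox(on_hold=True)
mailbox.inbox["msg-1"] = Message("Q4 report", ["report.xlsx"])
mailbox.remove_attachment("msg-1", "report.xlsx")

# The live item has lost the attachment, the preserved copy has not.
print(mailbox.inbox["msg-1"].attachments)
print(mailbox.versions[0].attachments)
```

In this model the preserved copy is made by the same code path that performs the change, so no client can skip it. The bug described here behaves as if that preservation step never ran when the attachment was removed.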
To my surprise though, a user posting on the Spiceworks forums reported that he was unable to find the original item after removing an attachment. Instead, only the current version of the message, with the attachment missing, was returned in a content search. Realizing the implications of such an issue, I was quick to perform a few tests of my own, and after confirming I was seeing the same behavior, asked for clarification on the Exchange MVP DL. Fellow MVPs Ingo Gegenwarth and Tony Redmond were able to independently confirm the issue, at which point it was clear that it’s not just a random fluke.
It took several days and a SevA case to get Microsoft’s attention on this, and almost a week after it was reported we still have no clear statement on the scope of the issue, its full impact and possible remediation. There is an SHD post, but afaik it is limited to just the tenants that have reported the issue, and the details presented therein leave a lot to be desired. In any case, as the issue is still under investigation, things will certainly change. I will make sure to update this article as we get additional information, but in the meantime, I feel obliged to issue another general rant. In case you only care about the technicalities, you can stop reading here and head over to Tony’s article at petri.com.
OK, rant time. First of all, this is a MAJOR issue, the type of which should result in updating a few resumes. We are not only talking about potential data loss here, we are talking about being able to circumvent the hold functionality, which is the cornerstone of Microsoft’s compliance solutions across most of Office 365. Being able to immutably protect any and all data stored inside user mailboxes is a requirement for a number of important regulations, such as SEC Rule 17a-4 (other workloads also use the mailbox store to make sure compliance requirements are met). Now, it might turn out that Exchange Online has failed to comply with such regulations for an extended period of time. A question as to how exactly such compliance is validated by third-party auditors can also be raised here, and you might want to start independently verifying any and all claims in case your organization is in a highly regulated industry.
Now, before I go full bitch mode, a few mitigating factors should be mentioned. First of all, the COW page protection seems to only fail on processing attachment changes. Editing other properties will result in a copy of the original item being correctly saved under the /Versions subfolder. In addition, the issue only seems to affect the Outlook client, more specifically Outlook running in cached mode. We’ve confirmed that several different builds of Outlook are affected, so the code issue has been around for at least a year. Luckily, most other email clients either do not have the functionality to remove attachments or are unaffected. Even considering all these factors, this is still a huge issue, with potentially severe implications.
Which brings us to another rant point: why on Earth would Microsoft choose to code things in such a way that COW is enforced client-side? If I am to make a stupid analogy here, take any online game: if no server checks are performed, one can easily modify practically every parameter/variable and enable “god mode”. But we’re not talking about games here, we’re talking about a business that brings in billions in revenue for Microsoft, based on the promises they’ve made with regards to protecting customers’ data and complying with regulations. And what’s even worse, this isn’t the first client-side bug that bypasses holds that we’ve uncovered. Back in 2014 I stumbled upon an issue with OWA’s move folders functionality that also resulted in data loss, even when the mailbox was put on hold. That was another client-side issue that resulted in purging data from the mailbox store. One would think that Microsoft would learn their lesson and start using a server-side solution, but alas…
The next point, and one I’ve made repeatedly over the past few years, is the total neglect of QA in the service. Add to that the “standard” practice of releasing minimum viable products instead of complete, feature-proof solutions, and we get all the ingredients we need for a service plagued with issues. We’ve basically come to accept not only minor annoyances such as typos and UI glitches (which, albeit minor for the most part, show an alarming lack of attention to detail), but even compromises with features that are required for the services to operate normally. Reporting such issues to Microsoft has mixed results, as some teams seem to be only interested in releasing the next roadmap feature and moving on to something else, instead of ensuring flaws are fixed in a timely manner. Even compliance issues are being ignored for months, such as the fact that any user can single-handedly prevent you from performing eDiscovery searches against his own ODFB site collection.
Since we know that this issue isn’t affecting on-premises Exchange installs (up to at least Exchange 2019 RTM), it is mind-boggling how a change to functionality as important as the COW page protection mechanism got released and pushed to the entire Exchange Online active base without conducting proper testing. One would imagine that testing every aspect of compliance features is at the top of the list for any code change. The issue itself is quite easy to reproduce: it takes a minute to test and it’s very hard to miss if you have even a basic understanding of how things are supposed to work. The obvious conclusion here is that this is simply another code change that got pushed directly into production, with minimal or no verification. Long live the DevOps model and its proper applications in practice…
The scope of the issue is another thing that should haunt Microsoft’s execs. Not only does it seem to affect everyone in the service, it has been around for months, if not years. While we still have no official statement, nor a post-incident report giving us the gritty details, we know that a broad range of server and client versions are affected. In turn, this means that the impact on customers is immeasurable. What’s even worse, because of the lack of backups in the service, there is no way for Microsoft to actually restore the missing data, even if they knew all the individual tenants and items that have been affected by this. Which they don’t anyway. Even addressing the client-side aspect of this can be problematic, considering the Office update model and the support windows for the different channels.
Last but not least, Microsoft is yet again failing to communicate things to its customer base. Yes, there is an SHD posting on the issue, but you are probably not able to see it (you can find a screenshot of the latest update below). As is customary, the SHD item description understates the scope of the issue and its impact. The language used in these posts continues to be vague at best, even after years and years of feedback from MVPs and customers alike. I’m sure there are a lot of reasons, and countless hours have been spent polishing the tone and language of these messages in the company of top lawyers, but the simple fact is that they fail to fulfil their primary purpose.
There is no doubt that Office 365 is highly successful, but it might be becoming a victim of its own success. From the position of a leader, it’s easy to cite statistics and telemetry to justify yet another questionable decision (*cough* self-service purchases *cough*) or claim that support is top notch or that forcing a shitty chat bot experience actually helps. The lack of competition allows Microsoft to get away with many frivolities, and not just ones committed by their marketing machine. We get examples of the aforementioned on a monthly, if not weekly basis. But when functionalities such as holds/retention start failing, alarms should be sounding left and right. Lawyers will probably have the last say on this particular topic, as I imagine quite a few organizations will have a lot to ask and demand about this issue.
Hopefully this time around Microsoft will actually learn a lesson or two, and something positive will come out of this fiasco.
UPDATE 28 Jan: Well, the issue has been fixed (server side!), and that’s that. Looks like Microsoft does not plan to release a post-incident report detailing the root cause, timespan or the number of users affected by the issue, nor to cover it in any other way. So much for transparency. Until next time (which I’m sure won’t take long).