Reporting on SharePoint Online and OneDrive for Business item size with version history included using the Graph API

In my last article, I went on a bit of a rant “inspired” by my recent dealing with the Graph API endpoints related to working with SharePoint Online and OneDrive for Business items versions. In this article, we’ll see what my suffering resulted in. In other words let’s talk about my latest script – reporting on storage usage for SharePoint Online and OneDrive for Business, down to individual item versions (i.e. taking version history into account).

How the script works

The logic of the script is quite similar to that of the shared files report one. The script uses application permissions to list all SharePoint Online and OneDrive for Business site collections (sites), then iterates over each of them to cover any subsites and enumerate any lists therein, with the focus on list used for storing user-accessible items. At this point, the script will get all items, and if you provide the corresponding parameter, all the versions created for a given item.

That’s the short version, now for some details. As with all things Graph, the script must obtain a valid access token in order to do its job, and for that you need to run it with a service principal that has been granted the Sites.Read.All permission. If you run the script with the default set of parameters, it will then retrieve a list of all site collections and start processing them. In most organizations, you will want to run the script against a specific site, or set of sites instead, and for that you can leverage the -Sites parameter. We will cover parameters in more details in the next section.

For each site (collection), the script will then check for the presence of any subsites and give you a quick overview of the total and remaining quota. Next, it will enumerate any lists with the documentLibrary or mySiteDocumentLibrary template, and process items stored within them. For each item, metadata such as size and creation/modification details are gathered. This in turn allows you to hone in on any large or inactive items for a potential cleanup.

The most interesting, and troublesome part of the script is the logic to cover item versions. As items can potentially have hundreds or even thousand of versions, looking at the item size on its own does not give you the whole picture. By gathering metadata for items versions as well, the script enables you to build accurate storage reports, similar in functionality to what the Storage Metrics page offers. Unfortunately, this comes with some drawbacks.

Most notably, working with item versions in SharePoint Online and OneDrive for business is quite the unreliable experience, as described in my recent rant post. Apart from the many, many, many reliability issues, performance degrades significantly once versions are included, which in turn affects the ability to run the script against large sites, or many sites simultaneously. For that, and many other reasons, the script will only process item versions if you explicitly specify the –IncludeVersions switch.

On the positive side, processing item versions not only allows you to provide more accurate storage report, it also enables the script to give folder size values that correctly reflect storage consumed by item versions. While we can fetch the size property out of the driveItem resource for folder items, said value is often underreported. In contrast, by summing the total size value of any items stored in the folder, with their version history included, we get a number that matches the Storage Metrics data.

How to run the script

Now that you understand what the script does and the permissions required, it’s time to head over to my GitHub repo and download it. Once the file is downloaded, make sure to set the authentication variables (lines 266-268), or if you so choose, replace the whole connectivity block/function with your preferred solution. Do not forget to grant the Sites.Read.All scope and consent to it!

To run the script, you can leverage the following set of parameters:

  • Sites – use this parameter to provide a list of sites to run the script against. Valid values include the full site URI, the ID as reported by Graph or a path (see examples below). The parameter is optional, and if you do not provide a value for it, the script will enumerate all sites within the tenant.
  • IncludeODFBsites – switch parameter, indicates whether to enumerate and process OneDrive for Business sites. When using the Sites parameter, I’m assuming that you want all the provided sites processed, so this parameter is ignored.
  • NoItemLevelStats – switch parameter, indicates whether to include item-level details in the output. If you run the script with this parameter specified, only site-level statistics will be returned. Overrides the IncludeVersions switch.
  • IncludeVersions – switch parameter, indicates whether the script should process any item versions found and adjust the output values accordingly. Be warned that processing item versions greatly impacts the performance and reliability of the script.
  • ExportToExcel – switch parameter, use it to generate a “prettier” output to Excel file, with clickable links and some conditional formatting. Requires the ImportExcel module. If you do not specify this parameter, CSV output is generated.
  • Verbose – use this parameter to show additional details as the script progresses.

Below are some examples on how to run the script. I’ve said it already but I’ll reiterate once more – if you plan to run the script with the –IncludeVersions switch, limit its scope to specific sites only by leveraging the –Sites parameter as well. Or, update the output bits to write to disk after each site is processed, if you prefer. Chances are you are going to run into an issue otherwise. Don’t say I didn’t warn you!

#Run the script without any parameter to process all SPO sites
.\Graph_SPO_storage_report.ps1

#If you want to also process OneDrive for Business sites, use the IncludeODFBsites switch
.\Graph_SPO_storage_report.ps1 -IncludeODFBsites

#To generate a report on the site-level only (no item level details), use the NoItemLevelStats switch
.\Graph_SPO_storage_report.ps1 -NoItemLevelStats


#Use the Sites parameter to process specific sites only
.\Graph_SPO_storage_report.ps1 -Sites "tenant.sharepoint.com,12345678-1234-1234-1234-cf16a0a8a888,12345678-1234-1234-1234-c3a1f49ff1d4","https://tenant.sharepoint.com/sites/newwwwww","tenant.sharepoint.com/sites/newwwwww"

#You can also feed the list of sites via CSV/object
.\Graph_SPO_storage_report.ps1 -Sites (Import-CSV blabla.csv).Site

#Run the script with the IncludeVersions switch to process item versions
#Whenever you include the IncludeVersions switch, I'd strongly recommend to run the script against specific sites only
.\Graph_SPO_storage_report.ps1 -Sites michev.sharepoint.com -IncludeVersions -ExportToExcel

SPOStorageReport

Examining the script output

By default, the script will generate a CSV file as output and store it in the working directory. In addition, output will also be stored in the global variable $varSPOSharedItems, in order to allow you further processing before exporting. If you have the ImportExcel module installed, you can use the -ExportToExcel switch in order to get a “prettier” version of the output. It will include clickable links to the items, in order to make it easier to review them if needed, as well as some basic Insights as to the top 10 largest items. Of course, much more can be done with the ImportExcel module, so feel free to adjust the code to your needs.

The screenshots below illustrate how the output will look like. On the first screenshot, you see the “most basic” type of report that the script can generate, simply giving you a list of sites (when you run the script with the –NoItemLevelStats switch). Note that some properties will be null in this case – this is the expected behavior as the Graph API does not return values for lastModifiedBy for site objects. And in some cases it also does not return their size.

.\Graph_SPO_storage_report.ps1 -NoItemLevelStats -IncludeODFBsites -ExportToExcel

SPOStorageReport1

In contrast, here’s how the “fullest” output will look like, covering item-level details, including version count and sizes.

.\Graph_SPO_storage_report.ps1 -IncludeVersions -ExportToExcel

SPOStorageReport2

In most cases, you’d probably want to run the script against a single site, in which case you will get output closely resembling that of the Storage Metrics page. Here’s a comparison between the two for my personal ODFB site:

SPOStorageReport3

The following columns can be found in the generated Excel file:

  • Site – clickable URL pointing to the SPO/ODFB site collection URL.
  • SiteURL – (hidden) same as above, simply used as source for the pretty links.
  • Name – the (file) name of the item.
  • ItemType – whether this is a Site, File or a Folder item.
  • Shared – whether the item is Shared (i.e. someone other than the Owner has access to it, internal users included).
  • Size – the size of the item in bytes. For folders, the value might reflect the total size of all versions of any child items. Or it might not – this is the reason why we recalculate folder size as described above.
  • VersionCount – the number of versions discovered for the given item. Only included when the script is run with the –IncludeVersions switch.
  • VersionSize – the sum of individual version sizes. Only included when the script is run with the –IncludeVersions switch.
  • % of Site quota – the ratio of the items’ Size (or VersionSize if the –IncludeVersions switch is used) value to the quota of the parent site, expressed in percent.
  • CreatedDateTime – item’s creation timestamp.
  • LastModifiedDateTime – item’s last modification timestamp.
  • LastModifiedBy – the user that last modified the item.
  • ItemPath – the “folder” path to the item, represented by its webURL value.
  • ItemID – the Graph API identifier of the item. You can use it to run a quick query in order to retrieve additional details or perform actions such as removing permissions or deleting the item.

The script is intentionally not exposing the site quota, as you will have the same value repeated for each item within a site. If you want to add it, it’s a simple task, but the % of Site quota column should serve the same purpose. Well almost. And has the added benefit of bringing the output of the script closer to that of the Storage Metrics tool.

Let’s also discuss what the script does not cover. Sites that do not have a drive facet are filtered out. Within a given site, the same applies to any list objects without a drive facet, lists that are hidden or do not use document library templates. This in turn excludes some important bits, such as the Preservation hold library. Should you want to cover such lists, make sure to update the script’s logic at line 379:

$SiteLists = $SiteLists.value | ? {$_.list.hidden -eq $false -and ($_.list.template -eq "documentLibrary" -or $_.list.template -eq "mySiteDocumentLibrary")}

The script also does not cover any file storage containers. Let me know if you are interested in reporting on such.

You also might observe some discrepancies in property values, such as the folder size we discussed above. Another example is lastModifiedBy, which sometimes is not returned by the /lists/{id}/items call. We use the $expand=driveItem operator to get it instead, as otherwise we’d need to make additional calls. Speaking of additional calls, the LIST method for sites does NOT return the lastModifiedDateTime property, so we need make additional calls per-site. Not to be left behind, the GET method does not return the isPersonalSite property… the joys of working with the Graph API.

Another trick the script uses is to fetch versions via the $expand=versions operator on the /lists/{id}/items call. While technically ListItemVersion is a different thing from the DriveItemVersion object, we can use this as a quick check whether any versions exist for a given list item and fetch their size from a subsequent call to /drives/{id}/items/{id}/versions. Sadly, the $top operator cannot be used together with $expand, so we cannot use $top=2 or similar for our check 🙁

As to why reporting on versions is important, take a look at the following screenshot. We’re looking only at items with two or more versions, sorted by version size. Who would have guessed that a file of 6k size is the second largest storage consumer in my personal OneDrive for Business? Thanks to a whooping 6164 versions, its total storage amounts to over 2.4GB!

SPOStorageReport4

Additional notes and summary

In this article, I presented you with my latest creation – a PowerShell script that you can use to generate storage usage report for SharePoint Online and OneDrive for Business sites, down to individual item level. The script allows you to also report on item versions, which in turn gives you a more comprehensive overview. As with all my other scripts, you can get a copy of the script from my GitHub repository. As always, treat this script as a simple “proof of concept” and not a complete production-ready solution.

While the overall script logic closely resembles that of my sharing report script, the biggest difference between the two is that fetching item versions can cause HUGE slowdowns as well as some reliability issues. The script was designed with the goal of minimizing the number of Graph API calls required, but as a side effect of that some calls can result in a GatewayTimeout or ServiceUnavailable errors. The former usually indicates that the response is too large in size, thus limiting the number of requested objects is the only thing that helps. To get the script to run reliably with versions in my tenant, I had to reduce the page size for the /lists/{id}/items call to 100 (line 387). Your mileage will vary.

On the other hand, all the ServiceUnavailable and BadRequest errors I’ve run into when testing the script are retriable, and likely related to throttling. Even though I’ve never encountered a response with the Retry-After header. Go figure. While the script has some logic in place to handle throttling and failed requests, it’s obviously not completely error-prone, so my advice is to run it against specific sites only. For a small tenant, you might be able to enumerate all items, but do not expect miracles in any organization of size. As a baseline – it took it 49 minutes to process 76 sites in my tenant, with a total of 58k+ items, of which 1419 have more than one version.

As always, comments and feedback are welcome.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.