Microsoft AI researchers accidentally expose 38TB of data to GitHub

A Microsoft employee has accidentally exposed 38 terabytes of private data while revealing a large portion of open-source AI training data on GitHub.

VIEW GALLERY - 2

Jak Connor

Tech and Science Editor

Published Sep 20, 2023 11:18 AM CDT
Updated Oct 9, 2023 12:02 AM CDT

45-second read time

Voice: Jak ConnorSpeed

0:00 / --:--

A staggering 38 terabytes of data was accidentally leaked by Microsoft AI researchers on the website called GitHub, according to a cloud security company report.

Microsoft AI researchers accidentally expose 38TB of data to GitHub 5448

VIEW GALLERY - 2 IMAGES

The new report released by Wiz, a cloud security company, among the leaked files, were two entire backups of workstation computers that contained confidential Microsoft information such as company "secrets, private keys, passwords, and over 30,000 internal Microsoft Teams messages". The incredibly large data exposure may result in Microsoft's AI systems being vulnerable to attack or any other Microsoft-related systems. So, how did this happen?

Unfortunately, it was a simple yet critical mistake that occurred when Microsoft AI researchers were trying to publish a "bucket of open-source training material" and "AI models for image recognition" to GitHub. The files' SAS token was misspelled, resulting in the public's storage permissions switching to the entire storage account rather than the AI material that developers were attempting to publish. Unfortunately, the bad news doesn't stop there.

The permission mishap didn't just grant the public viewing access to the storage account, it also enabled "full control" of the account, meaning files could be downloaded, deleted, copied, altered, and more. Microsoft has responded.

An "attacker could have injected malicious code into all the AI models in this storage account, and every user who trusts Microsoft's GitHub repository would've been infected by it," Wiz's researchers write.