Zoom received some flak recently for planning to use customer data to train its machine learning models. The reality, however, is that the video conferencing company is not the first, nor will it be the last, to have similar plans.
Enterprises, especially those busy integrating AI tools for internal use, should view such plans as emerging challenges that need to be proactively addressed with new processes, oversight, and technology controls where possible.
Abandoned AI Plans
Earlier this year, Zoom changed its terms of service to give itself the right to use at least some customer content to train its AI and machine learning models. In early August, the company abandoned that change after pushback from customers concerned about their audio, video, chat, and other communications being used in this way.
The incident—despite the happy ending for now—is a reminder that companies need to pay closer attention to how technology vendors and other third parties might use their data in the rapidly emerging AI era.
One big mistake is to assume that the data a technology company might collect for AI training is not very different from the data the company might collect about service use, says Claude Mandy, chief evangelist, data security at Symmetry Systems. “Technology companies have been using data about their customers’ use of services for a long time,” Mandy says. “However, this has generally been limited to metadata about the usage, rather than the content or data being generated by or stored in the services.” In essence, while both involve customer data, there’s a big difference between data about the customer and data of the customer, he says.
Clear Distinction
It’s a distinction that is already the focus of attention in a handful of lawsuits involving major technology companies and consumers. One of them pits Google against a class of millions of consumers. The lawsuit, filed in July in San Francisco, accuses Google of scraping publicly available data on the Internet, including personal and professional information, creative and copyrighted works, photos, and even emails, and using it to train its Bard generative AI technology. “In the words of the FTC, the entire tech industry is ‘sprinting to do the same’ — that is, to vacuum up as much data as they can find,” the lawsuit alleged.
Another class action lawsuit accuses Microsoft of doing precisely the same thing to train ChatGPT and other AI tools such as DALL-E and VALL-E. In July, comedian Sarah Silverman and two authors accused Meta and Microsoft of using their copyrighted material without consent for AI training purposes.
While the lawsuits involve consumers, the takeaway for organizations is that they need to ensure, where possible, that technology companies don’t do the same thing with their data.
“There is no equivalence between using customer data to improve user experience and [for] training AI. This is apples and oranges,” cautions Denis Mandich, co-founder of Qrypt and a former member of the US intelligence community. “AI has the additional risk of being individually predictive, putting people and companies in jeopardy,” he notes.
As an example, he points to a startup using video and file transfer services on a third-party communications platform. A generative AI tool like ChatGPT trained on this data could potentially be a good source of information for a competitor to that startup, Mandich says. “The issue here is about the content, not the user’s experience for video/audio quality, GUI, etc.”
Oversight and Due Diligence
The big question of course is what exactly organizations can do to mitigate the risk of their sensitive data ending up as part of AI models.
A starting point would be to opt out of all AI training and generative AI features that are not under private deployment, says Omri Weinberg, co-founder and chief risk officer at DoControl. “This precautionary step is important to prevent the external exposure of data [when] we do not have a comprehensive understanding of its intended use and potential risks.”
Make sure, too, that there are no ambiguities in a technology vendor’s terms of service pertaining to company data and how it is used, says Heather Shoemaker, CEO and founder of Language I/O. “Ethical data usage hinges on policy transparency and informed consent,” she notes.
Further, AI tools can store customer information beyond just the training usage, meaning the data could potentially be vulnerable in the event of a cyberattack or data breach.
Mandich advocates that companies insist on technology providers using end-to-end encryption wherever possible. “There is no reason to risk access by third parties unless they need it for data mining and your company has knowingly agreed to allow it,” he says. “This should be explicitly detailed in the EULA and demanded by the client.” The ideal is to have all encryption keys issued and managed by the company and not the provider, he says.
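The model Mandich describes, where encryption keys are generated and held by the customer rather than the provider, can be illustrated with a minimal sketch. This is an illustrative example only, not production cryptography: it uses a one-time-pad XOR (correct only when the key is truly random, the same length as the message, and never reused) to show that when the client keeps the key, the provider stores only ciphertext it cannot mine or train on. A real deployment would use a vetted encryption library and a key management system.

```python
import secrets

def xor_cipher(data: bytes, key: bytes) -> bytes:
    """One-time-pad XOR. Encryption and decryption are the same operation.

    Illustrative only: safe solely when the key is random, key length
    equals data length, and the key is used exactly once.
    """
    if len(key) != len(data):
        raise ValueError("key must be the same length as the data")
    return bytes(d ^ k for d, k in zip(data, key))

# The client generates the key locally; it is never sent to the provider.
message = b"quarterly roadmap draft"
key = secrets.token_bytes(len(message))

# Only the ciphertext leaves the client; this is all the provider stores,
# so it has nothing usable for data mining or AI training.
ciphertext = xor_cipher(message, key)

# Decryption requires the client-held key.
assert xor_cipher(ciphertext, key) == message
assert ciphertext != message
```

The point of the sketch is the trust boundary, not the cipher: because the key never leaves the client, the provider's copy of the data is opaque to it regardless of what its terms of service permit.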