Large language models (LLMs) such as ChatGPT have shaken up the data security market as companies search for ways to prevent employees from leaking sensitive and proprietary data to external systems.
Companies have already started taking dramatic steps to head off the possibility of data leaks, including banning employees from using the systems, adopting the rudimentary controls offered by generative AI providers, and using a variety of data security services, such as content scanning and LLM firewalls. The efforts come as research reveals that leaks are possible, bolstered by three high-profile incidents at consumer device maker Samsung and studies that find as much as 4% of employees have entered sensitive data into the tools.
In the short term, the data security problem will only get worse — especially because, given the right prompts, LLMs are very good at extracting nuggets of valuable data from their training data — making technical solutions important, says Ron Reiter, co-founder and CTO at Sentra, a data life cycle security firm.
“Data loss prevention became much more of an issue because there’s suddenly … these large language models with the capability to index data in a very, very efficient manner,” he says. “People who were just sending documents around … now, the chances of that data landing into a large language model are much higher, which means it’s going to be much easier to find the sensitive data.”
Until now, companies have struggled to find ways to combat the risk of data leaks through LLMs. Samsung banned the use of ChatGPT in April, after engineers passed sensitive data to the large language model, including source code from a semiconductor database and minutes from an internal meeting. Apple restricted its employees from using ChatGPT in May to prevent workers from disclosing proprietary information, although no incidents were reported at the time. And financial firms, such as JPMorgan, have put limits on employee use of the service as far back as February, citing regulatory concerns.
The risks of generative AI are made more significant because the large, complex, and unstructured data that is typically incorporated into LLMs can defy many data security solutions, which tend to focus on specific types of sensitive data contained in files. Companies have voiced concerns that adopting generative AI models will lead to data leakage, says Ravisha Chugh, a principal analyst at Gartner.
The AI system providers have come up with some solutions, but they have not necessarily assuaged fears, she says.
“OpenAI disclosed a number of data controls available in the ChatGPT service through which organizations can turn off the chat history and choose to block access by ChatGPT to train their models,” Chugh says. “Still, many organizations are not comfortable with their employees sending sensitive data to ChatGPT.”
In-House Control of LLMs
The companies behind the largest LLMs are searching for ways to answer those doubts and offer ways to prevent data leaks, such as giving companies the ability to have private instances that keep their data internal to the firm. Yet even that option could lead to sensitive data leaking, because not all employees should have the same access to corporate data and LLMs make it easy to find the most sensitive information, says Sentra’s Reiter.
“The users don’t even need to summarize the billions of documents into a conclusion that will effectively hurt the company,” he says. “You can ask the system a question like, ‘Tell me if there’s a wage gap’ [at my company]; it will just tell you, ‘Yes, according to all the data I’ve ingested, there is a wage gap.'”
Managing an internal LLM is also a major effort, requiring deep in-house machine learning (ML) expertise to allow companies to implement and maintain their own versions of the massive AI models, says Gartner’s Chugh.
“Organizations should train their own domain-specific LLM using proprietary data that will provide maximum control over the sensitive data protection,” she says. “This is the best option from a data security perspective, [but] is only viable for organizations with the right ML and deep learning skills, compute resources, and budget.”
New LLM Data Security Methods
Data security technologies, however, can adapt to head off many scenarios of potential data leakage. Cloud-data security firm Sentra, for example, uses LLMs to determine which complex documents may constitute a leak of sensitive data if they are submitted to AI services. Threat detection firm Trellix monitors clipboard snippets and Web traffic for potential sensitive data, while also blocking access to specific sites.
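To illustrate the endpoint-monitoring approach in the abstract — this is a minimal sketch, not any vendor's actual implementation — the snippet below polls the system clipboard and flags snippets that match a few illustrative sensitive-data patterns. The patterns and the third-party pyperclip dependency are assumptions for the example.

```python
# Hypothetical sketch of clipboard monitoring for sensitive data.
# Not any vendor's product; the patterns are illustrative only.
import re
import time

import pyperclip  # third-party clipboard library (assumed available)

# Illustrative patterns for data that should not leave the company.
SENSITIVE_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "api_key": re.compile(r"\b(?:sk|key)-[A-Za-z0-9]{20,}\b"),
    "internal_marker": re.compile(r"CONFIDENTIAL|INTERNAL ONLY", re.IGNORECASE),
}


def scan_clipboard(poll_seconds: float = 1.0) -> None:
    """Poll the clipboard and log any snippet matching a sensitive pattern."""
    last_seen = ""
    while True:
        snippet = pyperclip.paste()
        if snippet and snippet != last_seen:
            last_seen = snippet
            hits = [name for name, rx in SENSITIVE_PATTERNS.items() if rx.search(snippet)]
            if hits:
                print(f"ALERT: clipboard contains possible sensitive data: {hits}")
        time.sleep(poll_seconds)


if __name__ == "__main__":
    scan_clipboard()
```

A production tool would also inspect Web traffic and enforce site-level blocking, but the core idea — pattern-matching content before it leaves a managed device — is the same.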
A new category of security filters — LLM firewalls — can be used to both prevent an LLM from ingesting risky data and stop the generative AI model from returning improper responses. Machine learning firm Arthur, for example, announced its LLM firewall in May, an approach that blocks sensitive data from being submitted to an LLM and prevents the service from returning potentially sensitive — or offensive — responses.
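Conceptually, an LLM firewall sits between the user and the model and inspects traffic in both directions. The sketch below is a generic, hypothetical illustration of that pattern — not Arthur's product — and the detection rules and placeholder model call are assumptions made for the example.

```python
# Minimal, hypothetical sketch of an "LLM firewall" that screens
# prompts on the way in and responses on the way out.
import re
from typing import Callable

# Example inbound rule: does the prompt contain data that looks proprietary?
SECRET_RX = re.compile(r"(password|api[_ ]key|BEGIN RSA PRIVATE KEY)", re.IGNORECASE)

# Example outbound rule: does the response echo something it should not?
PII_RX = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # e.g., US Social Security numbers


def guarded_completion(prompt: str, llm_call: Callable[[str], str]) -> str:
    """Send a prompt to the model only if it clears the inbound filter,
    and screen the response before returning it to the user."""
    if SECRET_RX.search(prompt):
        return "[blocked: prompt appears to contain sensitive data]"

    response = llm_call(prompt)

    if PII_RX.search(response):
        return "[blocked: response withheld by policy]"
    return response


if __name__ == "__main__":
    # Stand-in model call for the example; replace with a real client.
    def fake_llm(p: str) -> str:
        return f"Echo: {p}"

    print(guarded_completion("Summarize our quarterly roadmap", fake_llm))
    print(guarded_completion("Here is our api_key: sk-123...", fake_llm))
```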
Companies are not without recourse, however. Instead of completely blocking the use of LLM chatbots, a company's legal and compliance teams could educate users with warnings and feedback not to submit sensitive information, or even limit access to a specific set of users, says Chugh. At a more granular level, if teams can create rules for specific sensitive data types, those rules can be used to define data loss prevention policies.
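A DLP policy of that kind can be expressed as a mapping from sensitive data types to detection rules and actions. The sketch below is a generic illustration under assumed data types and actions, not any particular product's policy format.

```python
# Hypothetical DLP policy: sensitive data types mapped to detection
# rules and the action to take when a match is found.
import re
from dataclasses import dataclass


@dataclass
class DlpRule:
    name: str
    pattern: re.Pattern
    action: str  # "warn", "block", or "allow" in this illustration


POLICY = [
    DlpRule("credit_card", re.compile(r"\b(?:\d[ -]?){13,16}\b"), "block"),
    DlpRule("source_code", re.compile(r"\b(def |class |#include\s*<)"), "warn"),
    DlpRule("meeting_notes", re.compile(r"\bminutes\b.*\bmeeting\b", re.IGNORECASE), "warn"),
]


def evaluate(text: str) -> str:
    """Return the most restrictive action triggered by the policy."""
    severity = {"allow": 0, "warn": 1, "block": 2}
    decision = "allow"
    for rule in POLICY:
        if rule.pattern.search(text) and severity[rule.action] > severity[decision]:
            decision = rule.action
    return decision


print(evaluate("Attaching the minutes from today's board meeting"))  # warn
print(evaluate("Card number 4111 1111 1111 1111"))                   # block
```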
Finally, companies that have deployed a comprehensive security stack by adopting zero trust network access (ZTNA), along with cloud security controls and firewall-as-a-service — a combination Gartner refers to as the security service edge (SSE) — can treat generative AI as a new Web category and block sensitive data uploads, says Gartner's Chugh.
“The SSE forward proxy module can mask, redact, or block sensitive data in-line as it’s being entered into ChatGPT as a prompt,” she says. “Organizations should use the block option to prevent sensitive data from entering ChatGPT from Web or API interfaces.”
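As a rough illustration of that in-line treatment — a generic sketch, not Gartner's or any SSE vendor's implementation — the function below rewrites a prompt before it leaves the network, masking matched values or refusing the request outright depending on the configured mode. The patterns and mode names are assumptions for the example.

```python
# Hypothetical in-line treatment of a prompt at a forward proxy:
# mask matched values, or block the request entirely.
import re

EMAIL_RX = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN_RX = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def treat_prompt(prompt: str, mode: str = "mask") -> str:
    """Return a masked prompt, or raise if policy says to block."""
    if mode == "block" and (EMAIL_RX.search(prompt) or SSN_RX.search(prompt)):
        raise PermissionError("prompt blocked: sensitive data detected")
    masked = EMAIL_RX.sub("[EMAIL]", prompt)
    masked = SSN_RX.sub("[SSN]", masked)
    return masked


print(treat_prompt("Draft a reply to jane.doe@example.com about SSN 123-45-6789"))
# -> "Draft a reply to [EMAIL] about SSN [SSN]"
```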