The large language models, or LLMs, that power many popular text-based AI applications are vulnerable to jailbreaking attacks, in which a user enters a malicious prompt to bypass an application’s guardrails and trick it into generating inappropriate or harmful content.
Johns Hopkins computer scientists have found that these attacks are far more successful in low-resource languages such as Armenian and Maori, for which limited text data is available to train AI models, than in widely used languages such as English and Spanish. Their study will be published in the proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, to be held in Bangkok, Thailand, from August 11 to 16.
The researchers prompted LLMs with malicious queries written in a range of high- and low-resource languages and evaluated the generated responses for harmful, unethical, or biased content. Even after applying various alignment techniques, which map similar meanings across languages, they found that low-resource languages remained at significantly higher risk of succumbing to jailbreaking attacks.
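In broad strokes, an evaluation of this kind amounts to posing the same malicious prompts in each language and measuring how often the model's replies are judged harmful. The sketch below is a hypothetical illustration only, not the authors' evaluation code; query_model and is_harmful are placeholder names standing in for a real LLM API call and a safety classifier.

```python
# Minimal sketch (assumed, not the study's code): estimating jailbreak
# success rates per language. query_model() and is_harmful() are
# hypothetical placeholders for an LLM API call and a safety classifier.

from collections import defaultdict

def query_model(prompt: str) -> str:
    """Placeholder for a chat-model API call; swap in a real client."""
    return "..."  # model response

def is_harmful(response: str) -> bool:
    """Placeholder for a safety classifier or human judgment."""
    return False

# The same malicious prompts, translated into high- and low-resource languages.
prompts_by_language = {
    "English": ["..."],
    "Spanish": ["..."],
    "Armenian": ["..."],
    "Maori": ["..."],
}

success_counts = defaultdict(int)
for language, prompts in prompts_by_language.items():
    for prompt in prompts:
        if is_harmful(query_model(prompt)):
            success_counts[language] += 1
    rate = success_counts[language] / len(prompts)
    print(f"{language}: attack success rate = {rate:.0%}")
```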
“The discrepancies stem from the initial training stage, when the LLMs are exposed to only a small amount of data in these languages with limited resources,” says lead author Lingfeng Shen, Engr ’24 (MS), now a research scientist at ByteDance. “This means the root issue is that there simply isn’t enough data available for less widely used languages during the model’s first training process.”
The study highlights a significant vulnerability in current LLM technology with serious implications for multilingual applications, the researchers say.
“Ensuring that LLMs can safely interact with users in various languages, including those with fewer resources, is critical for inclusivity and global applicability,” says Daniel Khashabi, an assistant professor of computer science at the Whiting School of Engineering, a member of the Center for Language and Speech Processing, and one of the researchers on the team. “If these systems are not safe and reliable across all languages, it could lead to misinformation, harmful content dissemination, and overall decreased trust in AI technologies such as chatbots, virtual assistants, and translation tools.”
The computer scientists encourage those training the next iterations of popular LLMs to include more data from low-resource languages such as Mongolian, Urdu, and Hausa.
“We hope that enhancing the pre-training process with more diverse linguistic data could mitigate the issues we observed,” says Shen.
They also recommend that researchers work to improve existing alignment techniques and develop new approaches designed specifically for languages that lack sufficient data for training AI models.
“Our research advocates for more equitable AI development that considers the linguistic diversity of all users,” says Khashabi.
Additional authors of this work include Philipp Koehn, a professor of computer science; Johns Hopkins PhD students Weiting “Steven” Tan, Yunmo Chen, Jingyu “Jack” Zhang, and Haoran Xu; and Sihao Chen and Boyuan Zheng, PhD students at the University of Pennsylvania and the Ohio State University, respectively.