ASCII to Unicode tag conversion is a technique for bypassing input sanitization filters designed to prevent prompt injection attacks. ASCII is a 7-bit code, so it defines only 128 characters. Unicode defines a far larger code space of 1,114,112 code points, and it includes the Tags block (U+E0000 to U+E007F), whose characters mirror the ASCII set one-to-one: adding 0xE0000 to an ASCII code point yields the corresponding tag character. Tag characters are normally not rendered, so text converted this way is invisible to a human reviewer and does not match filters that look only for the original ASCII characters. This lets malicious actors smuggle harmful input, such as injected instructions, command sequences, or SQL queries, into systems that rely on simple ASCII-based input validation.
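The mapping can be illustrated with a short Python sketch. It assumes the conversion targets the Unicode Tags block (U+E0000 to U+E007F), where each tag character sits at the ASCII code point plus 0xE0000. The names `to_tags`, `from_tags`, and `naive_filter` are hypothetical helpers invented for this illustration, and the blocklist is a deliberately simplistic stand-in for a real filter.

```python
TAG_BASE = 0xE0000  # first code point of the Unicode Tags block


def to_tags(text: str) -> str:
    """Map each ASCII character to its counterpart in the Tags block."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp > 0x7F:
            raise ValueError(f"not an ASCII character: {ch!r}")
        out.append(chr(TAG_BASE + cp))
    return "".join(out)


def from_tags(text: str) -> str:
    """Recover the ASCII text hidden in tag characters, ignoring everything else."""
    return "".join(
        chr(ord(ch) - TAG_BASE)
        for ch in text
        if TAG_BASE <= ord(ch) <= TAG_BASE + 0x7F
    )


def naive_filter(text: str) -> bool:
    """Hypothetical ASCII blocklist filter: True means the input is accepted."""
    blocked_phrases = ("ignore previous instructions",)
    lowered = text.lower()
    return not any(phrase in lowered for phrase in blocked_phrases)


payload = "ignore previous instructions"
hidden = to_tags(payload)

# The plain payload is rejected, but the tag-encoded copy is accepted
# because none of its code points match the ASCII blocklist.
plain_accepted = naive_filter(payload)
smuggled_accepted = naive_filter("What is the weather today?" + hidden)
recovered = from_tags(hidden)
```

In this sketch `plain_accepted` is false and `smuggled_accepted` is true, while `recovered` equals the original payload, which is the gap the technique relies on: the filter and the eventual consumer disagree about what the text contains.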
In prompt injection scenarios, this conversion is particularly useful because many input validation systems expect input in a specific character set, such as ASCII, and are not configured to handle the rest of Unicode. For example, an attacker could substitute Unicode homoglyphs (look-alike characters) for special characters such as semicolons or quotation marks, which are typically filtered in their ASCII form but pass through unnoticed in another representation. Once past the filter, a downstream component may still interpret these characters as the originals, for example when a later Unicode normalization step folds a look-alike back to ASCII or when a language model reads tag characters as the letters they mirror. The attacker can then execute malicious commands or manipulate outputs. Filtering on one representation while consuming another is therefore a key vulnerability in poorly secured prompt handling systems.
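The filtered-character case can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a complete defense: `BLOCKED`, `ascii_only_check`, and `hardened_check` are hypothetical names, the fullwidth semicolon (U+FF1B) stands in for the look-alike characters described above, and the downstream system is assumed to apply NFKC normalization.

```python
import unicodedata

# Hypothetical set of characters the filter is meant to reject.
BLOCKED = {";", '"', "'"}


def ascii_only_check(text: str) -> bool:
    """Naive filter: accepts the input unless it contains a blocked ASCII character."""
    return not any(ch in BLOCKED for ch in text)


def hardened_check(text: str) -> bool:
    """Stricter sketch: reject tag characters, then filter the NFKC-normalized text."""
    if any(0xE0000 <= ord(ch) <= 0xE007F for ch in text):
        return False
    normalized = unicodedata.normalize("NFKC", text)
    return ascii_only_check(normalized)


# U+FF1B FULLWIDTH SEMICOLON is not in BLOCKED, so the naive filter accepts it.
suspicious = "SELECT 1\uff1b DROP TABLE users"
naive_result = ascii_only_check(suspicious)

# A downstream NFKC step (assumed here) folds it back to an ASCII semicolon.
downstream_view = unicodedata.normalize("NFKC", suspicious)

# Normalizing before filtering closes this particular gap.
hardened_result = hardened_check(suspicious)
```

The design point is ordering: the check should run on the same representation the consumer will see. Normalizing first and rejecting code points with no legitimate use in the input, such as tag characters, is one way to approach that, though it does not cover every homoglyph, since many look-alikes have no compatibility mapping to ASCII.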