Rice CS Assistant Professor Konstantinos Mamouras and his group published findings on a ‘safer’ way to code at this year's PLDI Conference in Copenhagen.
The paper, Static Analysis for Checking the Disambiguation Robustness of Regular Expressions, articulates and solves a key problem that can occur when using regular expressions (regex). The team's research won a Distinguished Paper Award at the conference.
Regex and the Ambiguity Problem
Regular expressions (regex for short) are, according to Mamouras, “a language used to describe patterns that arise in text.” They’re used to find patterns in various sequences, from numbers to DNA base pairs. That utility means regex are used in many different contexts. But they do have a weakness.
Because a regex can match the same string of characters in more than one way, the result of a search can be ambiguous. Think of it as using the “command + F” shortcut to search for a certain word: typing “our” will return that specific word, but also words like “tour” and “pour” that contain the same letters.
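That analogy can be sketched in a few lines of Python (an illustrative example, not taken from the paper): a bare pattern matches inside longer words, while word-boundary anchors narrow the search to the standalone word.

```python
import re

text = "tour pour our"

# A bare pattern finds "our" wherever it appears, including inside other words.
loose = re.findall(r"our", text)
print(loose)   # matches inside "tour", inside "pour", and the standalone word

# Word-boundary anchors (\b) restrict the match to the standalone word only.
strict = re.findall(r"\bour\b", text)
print(strict)  # only the standalone "our"
```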
Regex can be fine-tuned to reduce that noise, but matches can still be ambiguous. Two techniques, called disambiguation policies, were developed to solve that problem: the POSIX policy and the greedy (PCRE) policy.
These two policies each pin down a single match, but they do not always agree: a regex written with the POSIX policy in mind can behave differently under the PCRE policy, and vice versa. Mamouras explained that this mismatch can lead to erroneous results.
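A classic pattern where the two policies disagree (an illustrative sketch, not an example drawn from the paper) is the alternation `a|ab`. Python's `re` module uses greedy, PCRE-style leftmost-first semantics, so the first alternative wins; a POSIX leftmost-longest engine such as `grep` would instead pick the longer alternative `ab`.

```python
import re

# Greedy (PCRE-style) semantics: alternatives are tried left to right,
# so "a" wins even though "ab" also matches starting at the same position.
m = re.match(r"a|ab", "ab")
print(m.group())  # 'a' under the greedy policy; a POSIX engine would match 'ab'

# Reordering the alternation makes the two policies agree on this input:
# greedy picks "ab" first, and POSIX picks "ab" as the longest match.
m2 = re.match(r"ab|a", "ab")
print(m2.group())  # 'ab' under both policies
```

The first regex gives different answers depending on the engine, which is exactly the kind of unsafe reuse the paper addresses; the reordered version behaves the same either way.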
“If you try to reuse something and it’s not safe for reuse you’ll get wrong results and those results could compromise your application,” Mamouras said. “It could make [the results] incorrect and it could also have security implications.”
Disambiguation Robustness and Safe Reuse
Mamouras and his team had two goals with their research: establish the computational difficulty of testing regex for safe reuse and devise a practical algorithm for performing that test. They accomplished both, and their conclusions are clearly outlined in the paper.
In the paper’s introduction, Mamouras describes an application using regex that falsely identifies an IP address as suspicious because it used the wrong disambiguation policy. If people had a way to prove regex were safe to use with either policy, that kind of error wouldn’t happen.
“We call this property of safe reuse ‘disambiguation robustness,’” said Mamouras. Regex that are robust can be reused because, he explains, “Whether I use it in a POSIX-based tool or a greedy-based tool it doesn’t matter because the behavior will be the same.
“We looked at real data sets and showed that there is a significant part of regular expressions that cannot be reused,” he added. “It’s not just a theoretical curiosity, it can really be a problem in practice.”
His group solved the problem by creating an algorithm that tests the disambiguation robustness of a regex. Regex the algorithm verifies as robust can be safely reused under either major disambiguation policy.
What’s Next
While the algorithm isn’t publicly available yet, it will be soon. It currently works as a command-line tool and could eventually be made into a website where anyone can plug regex in and see if they’re safe to reuse.
One example Mamouras gave is people who use collections of regex they find online. They have no way of knowing which disambiguation policy those regex were written for, so they cannot be certain the regex will behave as expected. The algorithm described in the paper would remove that uncertainty.
Regex are used in applications from web scraping to email validation, in conjunction with multiple coding languages, so safely recycling them could make those many little everyday functions more secure and efficient.
Mamouras hopes this algorithm will be integrated into everyday computing tools, improving any application that uses regex and helping people who don’t necessarily have a computer science background to work with them.
Rice CS students Alexis Le Glaunec, Wu Angela Li, and Agnishom Chattopadhyay worked with Mamouras on this research paper.