Dusting for Cyber Fingerprints: Computer Scientists Use Coding Style to Identify Anonymous Programmers
February 26 2015
A team of computer scientists, led by researchers from Drexel University’s College of Computing & Informatics, has devised a way to lift the veil of anonymity protecting cybercriminals by turning their malicious code against them. Their method uses a parsing program to break down lines of code, like an English teacher diagramming a sentence, and then another program captures distinctive patterns that can be used to identify the code’s author.
“Just like writers and artists, every coder has a unique style all their own,” said Aylin Caliskan-Islam, a doctoral student at Drexel who developed the system and is the lead author of a recently posted technical report about it. “Our process distills the most important characteristics of a programmer’s style, which is the first step toward identifying anonymous authors, tracking cybercriminals and settling intellectual property questions.”
Caliskan-Islam drew on contributions from Princeton University, the University of Maryland, the University of Göttingen in Germany, and the Army Research Laboratory to produce a digital analytics system that could become a kit for electronically “fingerprinting” cybercriminals.
After a concerted effort to identify the traits of program code that are most useful for author identification, the team was able to “fingerprint” authors and match them to their work with a very high degree of accuracy.
The program is an extension of dual research thrusts at Drexel’s Privacy, Security and Automation Lab: to develop stylometry software that can unmask authors (the kind that thwarted J.K. Rowling’s attempt at a pseudonymous novel) and to make an adversarial program that will hide all traces of an author’s style, rendering them truly anonymous.
PSAL’s JStylo and Anonymouth software use writing patterns, such as the placement of pronouns and the frequency of certain modifiers or sentence structures, to create an author’s style profile. Caliskan-Islam’s program takes the same linguistic categories (layout, lexicon and syntax) and applies them to their equivalents in coding language.
“We already have a very good tool that uses natural language processing to identify anonymous authors from their writing, so I thought ‘what if we can make a similar program that will identify the authors of code?’” Caliskan-Islam said.
The key, according to Caliskan-Islam, is analyzing multiple facets of the code so that the places where these features intersect form a unique pattern that is only found in code written by a particular author.
To create these digital fingerprints, Caliskan-Islam wrote a program that looks at the overall layout of the code: its length, blank space, use of tabs vs. spaces and placement of comments. It also considers the lexicon that the programmer chose to use: the names of variables and a preference for certain functions over others, essentially the programmer’s equivalent of word choice.
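As a rough illustration of what layout and lexical measurements might look like, here is a minimal Python sketch. It is not the researchers’ program: the function name `layout_lexical_features`, the specific ratios, and the assumption that comments start with `//` are all hypothetical choices made for this example.

```python
import re
from collections import Counter


def layout_lexical_features(source: str) -> dict:
    """Crude layout and word-choice statistics for one source file (illustrative only)."""
    lines = source.splitlines()
    n = max(len(lines), 1)  # avoid dividing by zero on empty input

    # Layout: blank lines, indentation character, and comment placement.
    blank = sum(1 for ln in lines if not ln.strip())
    tab_indented = sum(1 for ln in lines if ln.startswith("\t"))
    space_indented = sum(1 for ln in lines if ln.startswith(" "))
    commented = sum(1 for ln in lines if "//" in ln)

    # Lexicon: drop "//" comments, then count identifier-like words.
    code_only = "\n".join(ln.split("//", 1)[0] for ln in lines)
    words = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code_only)
    word_counts = Counter(words)

    return {
        "num_lines": len(lines),
        "blank_ratio": blank / n,
        "tab_indent_ratio": tab_indented / n,
        "space_indent_ratio": space_indented / n,
        "comment_ratio": commented / n,
        "avg_word_length": sum(map(len, words)) / max(len(words), 1),
        "top_words": word_counts.most_common(3),
    }
```

Each number on its own says little about who wrote a file. The idea described in the article is that many such measurements, taken together, form a pattern that is hard for two programmers to share by accident.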
These two lines of analysis are bolstered by a third evaluation, called a syntax tree analysis, which resembles a rather complicated, multi-layer sentence diagram. Using a program called Joern, which is known in computer science circles as a fuzzy abstract syntax tree parser, sample code is distilled into these tree-like diagrams that represent every structural decision the author made in producing a string of code.
[Image: Drexel researchers are able to discern features found in volumes of computer code that can be used to identify its author.]
This syntactic breakdown calls attention to details such as the order in which commands are placed and the depth at which functions are nested in the code. These features are often very telling when it comes to identifying an author’s style, and they cannot be arbitrarily altered in an attempt to disguise authorship without changing the function of the program itself.
Caliskan-Islam’s program plucks the features from the tree, creating “sets” of the most relevant ones. These sets are the basis from which the digital prints are made. Think of it as a fingerprint analyst deciding to look at the shape and size of friction ridges but not their continuity, because new scars on the finger might rule out the correct match.
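To make the idea of syntax tree features concrete, the sketch below uses Python’s built-in `ast` module as a stand-in for Joern, and Python source as a stand-in for the code the team analyzed. That substitution is an assumption made purely for illustration, as is the function name `syntax_features`. It counts how often each kind of tree node appears and how deep the tree goes.

```python
import ast
from collections import Counter


def syntax_features(source: str) -> dict:
    """Node-type counts and maximum tree depth for a piece of Python source."""
    tree = ast.parse(source)

    # How often each structural element (loop, branch, call, ...) appears.
    node_counts = Counter(type(node).__name__ for node in ast.walk(tree))

    def depth(node: ast.AST) -> int:
        # Depth of the subtree rooted at this node, counting the node itself.
        return 1 + max((depth(child) for child in ast.iter_child_nodes(node)), default=0)

    return {"max_depth": depth(tree), "node_counts": node_counts}


# Two ways of summing the positive numbers in a list called `data`.
loop_style = "total = 0\nfor x in data:\n    if x > 0:\n        total += x\n"
expr_style = "total = sum(x for x in data if x > 0)\n"

print(syntax_features(loop_style)["node_counts"])
print(syntax_features(expr_style)["node_counts"])
```

The two snippets compute the same value, yet one tree contains a `For` node and an `If` node while the other contains a `GeneratorExp` node. Differences of this kind are what a syntax-based fingerprint can pick up.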
To put their theory to the test, Caliskan-Islam’s team acquired volumes of code—the collective work of 250 contestants who solved coding challenges as part of “Google Code Jam” competitions from 2008-2014. This sample yielded 20,000 distinct coding features and Caliskan-Islam’s program narrowed that list down to the most relevant 137, which were used as the data points for generating digital fingerprints for the authors.
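The article does not say how the program decided which 137 features were most relevant. One common way to rank features is by information gain, meaning how much knowing a feature’s value reduces uncertainty about the author. The Python sketch below shows that approach under that assumption; the function names and the tiny feature table in the usage note are hypothetical.

```python
import math
from collections import Counter


def entropy(labels: list) -> float:
    """Shannon entropy, in bits, of a list of author labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values: list, labels: list) -> float:
    """Drop in author entropy after splitting the samples by one feature's value."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def select_top_k(feature_table: dict, labels: list, k: int) -> list:
    """Names of the k features with the highest information gain."""
    scores = {name: information_gain(vals, labels) for name, vals in feature_table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

With four samples labeled `["a", "a", "b", "b"]`, a feature whose values are `[1, 1, 0, 0]` separates the two authors perfectly and scores highest, while a feature that is `[1, 1, 1, 1]` for everyone carries no information and scores zero.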
Then, like good detectives, the team put together a lineup of anonymous author “suspects” to see if the program could successfully match them to some of their code. Using the code of 62 authors who correctly solved multiple “Code Jam” problems, Caliskan-Islam programmed her set of 137 features into a standard machine-learning classifier tool. Then she had the tool review nine problems’ worth of code (about 70 lines per problem) from each author to get a well-defined digital fingerprint of each author.
With the prints recorded, the classifier was then presented with a new set of code to analyze and match with its author. The classifier correctly paired the code and its author with 95 percent accuracy, marks nearly as high as modern fingerprint analysis techniques.
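The train-then-test procedure can be sketched in a few lines of Python. The article does not name the classifier, so this example substitutes a very simple nearest-centroid rule chosen only because it is short. The function names are hypothetical, and any feature vectors passed in are toy values rather than real measurements.

```python
import math
from collections import defaultdict


def fit_centroids(samples: list) -> dict:
    """Average the feature vectors of each author's training samples.

    `samples` is a list of (author, feature_vector) pairs.
    """
    grouped = defaultdict(list)
    for author, vec in samples:
        grouped[author].append(vec)
    return {
        author: [sum(col) / len(vecs) for col in zip(*vecs)]
        for author, vecs in grouped.items()
    }


def predict(centroids: dict, vec: list) -> str:
    """Return the author whose average vector is closest to `vec`."""
    return min(centroids, key=lambda author: math.dist(centroids[author], vec))


def accuracy(centroids: dict, held_out: list) -> float:
    """Fraction of held-out (author, feature_vector) pairs matched to the right author."""
    correct = sum(1 for author, vec in held_out if predict(centroids, vec) == author)
    return correct / len(held_out)
```

The important design point is that the code used for scoring is kept separate from the code used to build each author’s profile. Accuracy measured on samples the classifier has already seen would overstate how well it identifies authors of new code.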
Additional tests alternately increased the number of author “suspects” and decreased the amount of sample code to simulate common challenges in the pursuit of cybercriminals. Both came up with results in the same high range of accuracy. The most recent tests, using a set of 250 authors, produced matches with 97 percent accuracy.
“We also noticed that more sophisticated code (correct answers to the most challenging Code Jam problems) gave us the most distinctive feature sets and the highest degree of author-matching accuracy,” Caliskan-Islam said. “This indicates that even a small amount of code from a master cybercriminal could still prove useful in tracking them down.”
Thinking like these adversaries, the team also ran a set of tests after using a commercial authorship obfuscation program—the kind used by coders trying to hide their identity. But the system was still able to correctly match the authors to their code with virtually no change in accuracy.
“The effectiveness of this program opens a lot of doors for those who protect electronic data and intellectual property,” Caliskan-Islam said. “This can be a powerful tool for legal consultants, cybersecurity professionals and digital forensics experts. I could definitely see it being used to settle questions about original authors of software in cases of copyright disputes.”
The team is continuing to improve the program’s accuracy and utility by expanding the features that it can identify and enabling it to analyze code written in any programming language.