Opening the Pandora (Papers) Box

Image credit: iStockphoto/Oleksandr Siedov

More than 11.9 million leaked records, 14 different offshore service providers, and 2.9 terabytes of data in various formats, including written notes: that was what the International Consortium of Investigative Journalists (ICIJ) faced when it set out to unravel the Pandora Papers.

ICIJ knew it had its work cut out from the outset. After all, it was about to expose the assets of more than 330 politicians and 130 Forbes billionaires, along with celebrities, drug cartel leaders, religious leaders, and royal family members. Many had spent considerable resources to keep the information secret.

But sifting through the massive data dump was going to be even more complex. ICIJ knew that a traditional approach to data and analytics built on relational databases was not going to work. So, it ended up working with Neo4j.

Understanding the data dump

The Pandora Papers are unlike the Panama Papers of 2016 and the Paradise Papers of 2017 in many ways. The data size was similar: the Panama Papers involved 2.6 terabytes holding 11.5 million leaked records, and the Paradise Papers had 13.4 million records. But the difference was the number of sources: each of the earlier leaks had only one. The Panama Papers leak came from Mossack Fonseca and the Paradise Papers from Appleby.

The Pandora Papers, on the other hand, involved 14 financial services companies. And because they were leaked, these documents were not cataloged and prepared for analysis. What ICIJ had was a haphazard collection of documents that were primarily unstructured.

According to ICIJ, there were 6.4 million text documents (of which 4 million were PDFs), with some running to more than 10,000 pages each. There were passport photos, copies of bank statements and tax declarations, and various real estate contracts and company incorporation documents. Within the data dump, there were 4.1 million images and emails.

So the first thing ICIJ did was to make the large amount of unstructured data searchable. Thankfully, data and analytics technologies were ready.

“Depending on the data type, they used different mechanisms. If it were PDF or document files, they could use Python to automate the data extraction process and structure the data. For the other data types, they used a combination of machine learning with other tools to identify and separate specific documents and forms,” says Maya Natarajan, Neo4j’s senior director for product marketing. 
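As a rough illustration of the structuring step Natarajan describes, the sketch below turns the raw text of one form into a flat record. It assumes the text has already been pulled out of the PDF, and the field labels, the `structure_document` helper, and the sample form are all hypothetical. ICIJ's actual scripts and form layouts are not described in this article.

```python
import re

# Hypothetical field labels; real leaked forms vary widely by provider.
FIELD_PATTERNS = {
    "company": re.compile(r"Company Name:\s*(.+)"),
    "jurisdiction": re.compile(r"Jurisdiction:\s*(.+)"),
    "beneficial_owner": re.compile(r"Beneficial Owner:\s*(.+)"),
}


def structure_document(text):
    """Turn the raw text of one form into a flat record (dict)."""
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        # Keep missing fields as None so gaps stay visible downstream.
        record[field] = match.group(1).strip() if match else None
    return record


# Invented sample text standing in for the output of a PDF text extractor.
sample_text = """Company Name: Example Holdings Ltd
Jurisdiction: British Virgin Islands
Beneficial Owner: Jane Doe
"""
record = structure_document(sample_text)
```

Records shaped like this are what make the later step possible: each field can become an entity, and each pairing within a form a relationship.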

Once converted, the data was loaded into Neo4j to discover the relationships between the different entities. They also used Linkurious, a partner of Neo4j, to visualize these relationships.

“Together, they allowed reporters to investigate all the complex direct and indirect connections between companies and people across the 14 different offshore firms,” says Natarajan.

According to ICIJ, Neo4j allowed more than 600 journalists from 150 media outlets in 117 countries involved in the project to have a complete picture of any byte of data. It also unearthed second- and third-degree relationships and the different ways in which two pieces of data could be linked.
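To make "second- and third-degree relationships" concrete, here is a minimal sketch of the idea in plain Python rather than Neo4j itself. It groups every entity reachable from a starting point by how many hops away it is. The entity names and the `connections_by_degree` function are invented for illustration.

```python
from collections import deque


def connections_by_degree(graph, start, max_degree):
    """Group entities reachable from `start` by their shortest hop distance."""
    distance = {start: 0}
    queue = deque([start])
    by_degree = {}
    while queue:
        node = queue.popleft()
        if distance[node] == max_degree:
            continue  # Do not expand beyond the requested degree.
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                by_degree.setdefault(distance[neighbor], set()).add(neighbor)
                queue.append(neighbor)
    return by_degree


# Invented example: a person linked to another person via a company and a trust.
graph = {
    "Person A": ["Shell Co 1"],
    "Shell Co 1": ["Person A", "Trust X"],
    "Trust X": ["Shell Co 1", "Person B"],
    "Person B": ["Trust X"],
}
result = connections_by_degree(graph, "Person A", 3)
```

In this toy graph, Person B only surfaces at the third degree, which is the kind of indirect link a reporter would not see by looking at Person A's own records.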

To add context, ICIJ also created a knowledge graph. “And they did this by adding external data sets like public records, sanction lists, and previous leak information. This allowed them to create those bigger stories, like finding out if criminals were involved and how that affected the entire picture,” adds Natarajan.
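One simple way to picture this enrichment is as a cross-reference between names in the leak and names on an external list. The sketch below is an assumption about how such matching could work in its crudest form, not a description of ICIJ's pipeline. The names, the `normalize_name` and `flag_matches` helpers, and the watchlist are hypothetical.

```python
import unicodedata


def normalize_name(name):
    """Lowercase, strip accents, and collapse whitespace for crude matching."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.lower().split())


def flag_matches(leak_names, watchlist):
    """Return the leaked names whose normalized form appears on the watchlist."""
    watch = {normalize_name(entry) for entry in watchlist}
    return [name for name in leak_names if normalize_name(name) in watch]


# Invented names; a real watchlist would come from a public sanctions source.
flagged = flag_matches(["José  Pérez", "Jane Doe"], ["jose perez"])
```

Exact matching after normalization misses transliterations and aliases, so a real investigation would likely need fuzzier matching plus human review of each hit.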

The graph database approach

Neo4j was no stranger to ICIJ. The latter has seen the versatility of graph databases for teasing out hidden data relationships or data lineages since the Panama Papers.

“I think the first interesting lesson that they learned then was that they couldn't do it with anything else apart from a graph database. It was just not going to work with relational databases because no data relationships were stored. And so, if you want to query relationships using a relational database, it is a very difficult task as you will have to do a number of joins,” says Natarajan.
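The "number of joins" point can be seen in a small sketch using Python's built-in `sqlite3` module. It assumes a hypothetical `link` table of directed connections and shows that following a chain of relationships in SQL takes one self-join per hop. The table, the data, and the `reachable_in_hops` helper are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE link (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO link VALUES (?, ?)",
    [
        ("Person A", "Shell Co 1"),
        ("Shell Co 1", "Trust X"),
        ("Trust X", "Person B"),
    ],
)


def reachable_in_hops(conn, start, hops):
    """Follow exactly `hops` links from `start`, adding one self-join per hop."""
    joins = "".join(
        f" JOIN link AS l{i} ON l{i}.src = l{i - 1}.dst"
        for i in range(2, hops + 1)
    )
    sql = (
        f"SELECT DISTINCT l{hops}.dst FROM link AS l1{joins} "
        "WHERE l1.src = ? ORDER BY 1"
    )
    return [row[0] for row in conn.execute(sql, (start,))]


three_hops = reachable_in_hops(conn, "Person A", 3)
```

Three hops here already means joining the table to itself twice, and every further hop adds another join over the whole table. A graph query language such as Cypher expresses the same question as a single variable-length path pattern instead.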

Besides the difficulty, the relational database route posed two immediate problems: cost and resource usage. 

“It’s really costly both in terms of memory as well as computing power. Apart from that, if you're going to go beyond three hops, [the query] may just hang, and you won't even get results,” Natarajan explains further. 

Graph databases allow you to do real-time detection faster and with less computing power while maintaining the native data relationships. “You have a chance to strike first before the criminals get away,” she adds. 

Graph data platforms, coupled with a visualization tool like Linkurious, can also throw up recognizable patterns that suggest criminal activity.

“For instance, [fraud] may look like islands, especially when it involves money laundering. So, [graph databases] allow you to find these islands right away,” Natarajan details, adding that human trafficking has a similar pattern. 
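Natarajan does not define "islands" precisely. One plausible reading, assumed here, is a small cluster of entities connected to each other but cut off from the rest of the graph. Under that assumption, the sketch below finds connected components and keeps only the small ones. The `find_islands` function, the size threshold, and the sample edges are hypothetical.

```python
def find_islands(edges, max_size):
    """Return connected components with at most `max_size` members."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    seen = set()
    islands = []
    for node in neighbors:
        if node in seen:
            continue
        # Depth-first walk to collect everything connected to this node.
        component = set()
        stack = [node]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbors[current] - component)
        seen |= component
        if len(component) <= max_size:
            islands.append(component)
    return islands


# Invented data: a bank with many accounts, plus three shells that only
# transact with each other.
edges = [
    ("Bank", "Acct 1"), ("Bank", "Acct 2"), ("Bank", "Acct 3"), ("Bank", "Acct 4"),
    ("Shell X", "Shell Y"), ("Shell Y", "Shell Z"), ("Shell Z", "Shell X"),
]
islands = find_islands(edges, max_size=3)
```

Only the three shells come back as an island. The threshold is a judgment call, and a small isolated cluster is a lead to investigate rather than proof of wrongdoing.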

“The beauty of graph databases is that it allows you to find not just the obvious relationships, but the hidden ones, and that's where you find all the interesting information. I mean, you know these nefarious characters and connections that may not be easily traceable. That's what graphs allow you to do,” she continues.

Parallel financial universe exposed

Natarajan sees graph data platforms becoming a major tool for use cases beyond investigative journalism and fraud discovery. The latter is where Neo4j cut its teeth.

But even here, Pandora Papers offered some rude awakenings for Natarajan. One shock was how fraud and deception in wealth transfer involved more than 14 service providers, some of them well known.

It also showed how fraud and financial crime adapt. For example, ICIJ found that more than 500 British Virgin Islands (BVI) companies had been clients of Mossack Fonseca (the firm at the center of the Panama Papers scandal) and moved their business to other BVI providers in the aftermath.

Another data nugget was that one provider out of the 14, Alcogal, represented nearly half of the politicians and public figures identified in the leak. Alcogal is headquartered in Panama, and one of the public figures identified served as Panama’s ambassador to the U.S. Meanwhile, several Russians were identified after being linked to data from Demetrios A. Demetriades LLC, or DadLaw, a provider headquartered in Cyprus, and from Seychelles-based Alpha Consulting Group.

For Natarajan, what is more shocking is that these activities were occurring adjacent to the current financial system. “Essentially, what the Pandora Papers unraveled is that there is this interesting parallel financial universe of a complex global system of tax evasion and secrecy that actually spans multiple geographies and generations covering five decades.”

The Pandora Papers investigation will not stop tax evasion and financial crime. Neither will it prevent various parties from co-opting the current financial system for their nefarious ends — not with so much money at stake.

But ICIJ, armed with tools like Neo4j, can improve oversight and make it even harder to hide. You can also be sure that they will be at the forefront of deciphering the next leak, which may or may not start with the letter “P”.

Winston Thomas is the editor-in-chief of CDOTrends and HR&DigitalTrends. He is always curious about all things digital, including new digital business models, the widening impact of AI/ML, unproven singularity theories, proven data science success stories, lurking cybersecurity dangers, and reimagining the digital experience. You can reach him at [email protected].
