Trustworthy Data Integration: Machine Learning to Expose Financial Corruption

The world was taken by storm when the International Consortium of Investigative Journalists (ICIJ), along with partner media organizations, released millions of documents exposing financial chicanery and political corruption. The leaks detailed how prominent people, such as then Icelandic Prime Minister Sigmundur Davíð Gunnlaugsson, held undisclosed offshore entities. The most famous of these leaks was the Panama Papers in 2016, which followed the Offshore Leaks of 2013 and was augmented by the Bahamas Leaks later in 2016. Then came the Paradise Papers in 2017, which exposed the financial dealings of many multinational corporations, organizations, and still more prominent people. These leaks triggered follow-up investigations, many of which are still ongoing. However, the sheer number of leaked documents and the complexity of the fraud chains have hampered investigative thoroughness, allowing a large fraction of potential wrongdoers to escape scrutiny.

Galois’s goal in analyzing these leaks was to build machine learning models to direct and triage investigative energy for future leaks. To achieve this, we had to first analyze the graphs of connected entities involved in financial irregularity, then integrate the data with actual investigative outcomes.

Software tools used in this effort include Apache Spark, GraphFrames, and the Python pandas library.

Graph Analysis

The complete data from the leaks is available on the ICIJ website in the form of graphs. Nodes (vertices) in the graph can be one of several types – ‘officer’ (e.g., people involved), ‘entity’ (e.g., offshore tax company), ‘intermediary’ (e.g., people in the middle who build connections), ‘address’, and ‘other’. Nodes carry attributes such as country of origin and the start date of the company. Edges (connections) represent relationships between nodes, for example an entity connected to its registered address. Here is a sample subgraph, which can be obtained by searching the ICIJ database.

We collected the data into two databases:

  • Panama Papers (augmented with the Offshore and Bahamas Leaks) – 1,040,535 nodes and 1,481,904 edges
  • Paradise Papers – 867,931 nodes and 1,650,601 edges
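Concretely, the graphs can be thought of as a node table and an edge table. Here is a minimal pure-Python stand-in for that structure (node ids and relationship labels are made up for illustration; the real pipeline loaded the ICIJ CSVs into Spark/GraphFrames):

```python
from collections import defaultdict

# Hypothetical node and edge records in the ICIJ style.
nodes = [
    {"id": "n1", "type": "officer", "country": "AAA"},
    {"id": "n2", "type": "entity", "country": "BBB"},
    {"id": "n3", "type": "address", "country": "BBB"},
]
edges = [
    {"src": "n1", "dst": "n2", "rel": "officer_of"},
    {"src": "n2", "dst": "n3", "rel": "registered_address"},
]

# Undirected adjacency view of the edge table, the form most
# graph algorithms in this post consume.
adj = defaultdict(set)
for e in edges:
    adj[e["src"]].add(e["dst"])
    adj[e["dst"]].add(e["src"])
```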

Degree centrality

Degree centrality is a commonly used graph measure that ranks nodes by how many connections each has. Not surprisingly, we found that one of the highest-degree nodes in the Panama Papers is a former Panamanian firm, Mossack Fonseca. The firm provided offshore services, was prominently mentioned in news coverage following the Panama Papers, and was later raided by Salvadoran police in connection with its illicit activities. Other high-degree nodes were mostly intermediaries, i.e., quite literally the “men in the middle.”
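The computation itself is simple. We used GraphFrames' degree computation at scale; a pure-Python sketch on a toy edge list (with hypothetical node ids) looks like this:

```python
from collections import Counter

# Toy edge list in the ICIJ style: (source_id, target_id) pairs.
edges = [
    ("officer_1", "entity_A"),
    ("officer_2", "entity_A"),
    ("intermediary_1", "entity_A"),
    ("intermediary_1", "entity_B"),
    ("entity_B", "address_1"),
]

# Degree centrality: count how many edges touch each node,
# treating the graph as undirected.
degree = Counter()
for src, dst in edges:
    degree[src] += 1
    degree[dst] += 1

# Rank nodes by degree, highest first.
top_nodes = degree.most_common()
print(top_nodes[0])  # entity_A has the most connections here
```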

Clustering via Connected Components

Connected components is a commonly used graph algorithm that groups nodes so that all nodes connected to one another (directly, or through other nodes) lie in the same cluster. Running it uncovered an interesting trend: almost a million nodes (~90%) in the Panama database belong to a single cluster, as shown by the rightmost solitary dot in the left plot. The Paradise Papers, by contrast, form a more fragmented dataset with several big clusters, each containing upwards of 10,000 nodes (right plot). In both datasets, the most influential, i.e., highest-degree, nodes were present in the largest clusters.

We also uncovered that for both datasets there are thousands of clusters containing just 2-3 nodes, as shown by the top-left dots in both plots. This implies that the graphs contain many small, isolated chains, for example a single person connected to a single entity via a single intermediary.
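We ran GraphFrames' connected components; the idea can be sketched in pure Python with a breadth-first search over a toy graph (edges are hypothetical):

```python
from collections import Counter, defaultdict, deque

# Toy undirected edge list: one triangle plus an isolated pair.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def connected_components(adj):
    """Group nodes into components via breadth-first search."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        queue, comp = deque([start]), set()
        seen.add(start)
        while queue:
            node = queue.popleft()
            comp.add(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        components.append(comp)
    return components

comps = connected_components(adj)
# Distribution of cluster sizes, as in the size-vs-count plots.
size_counts = Counter(len(c) for c in comps)
```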


PageRank

PageRank, first used in the Google search engine, is an algorithm that assigns importance scores to nodes. We ran PageRank on the graphs and found that it was very highly correlated (>90%) with node degree, as shown below. This was an important discovery, and we will return to it in the machine learning discussion.
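We used GraphFrames' built-in PageRank at scale; the algorithm itself is a power iteration, sketched here in pure Python on a toy directed graph (dangling nodes are omitted for brevity):

```python
# Toy directed edge list (hypothetical ids).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
nodes = sorted({n for e in edges for n in e})

out_links = {n: [] for n in nodes}
for src, dst in edges:
    out_links[src].append(dst)

damping, n = 0.85, len(nodes)
rank = {node: 1.0 / n for node in nodes}  # start uniform

for _ in range(50):
    # Each node keeps a baseline (1-d)/n and receives an equal
    # share of every in-neighbour's damped rank.
    new_rank = {node: (1.0 - damping) / n for node in nodes}
    for src in nodes:
        share = damping * rank[src] / len(out_links[src])
        for dst in out_links[src]:
            new_rank[dst] += share
    rank = new_rank
```

Since "c" receives rank from both "a" and "b", it ends up with the highest score, and the ranks always sum to 1.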

Triangle Count

Triangle count is also a measure of a node's importance, specifically the number of triangles (three nodes all connected to each other) it is a part of. We found that the triangle count of nodes was well correlated (~83%) with their degree for the Panama dataset, but more weakly (~35%) for the Paradise dataset. An explanation for this could be that the Paradise dataset is more fragmented.
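GraphFrames provides triangle counting directly; the underlying idea is neighbour-set intersection, sketched here on a toy graph (hypothetical ids):

```python
from collections import defaultdict

# Toy undirected graph: one triangle (a, b, c) plus a tail (c, d).
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def triangle_count(node):
    """Triangles through a node: pairs of its neighbours that are
    themselves connected. Each triangle is seen twice, hence // 2."""
    nbrs = adj[node]
    return sum(len(adj[n] & nbrs) for n in nbrs) // 2

counts = {node: triangle_count(node) for node in adj}
```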

Data Integration and Machine Learning (ML)

The ultimate goal of our efforts was to build models that can point investigators in the right direction for future financial leaks and minimize the time they would have to spend sifting through gigantic data dumps. To this end, we had to tie the leaked data to actual investigative outcomes. External knowledge bases such as Wikipedia and Wikidata contain the names of prominent people and organizations investigated after the leaks. Cleaning this data and integrating it into the Panama and Paradise papers uncovered a trend – nodes found in these knowledge bases occur at the lower end of the degree–PageRank spectrum. However, this is partly because the lower end has a higher concentration of nodes overall. The trend is illustrated in the plot below, where the red nodes are those found in the external knowledge bases.
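The integration step boils down to matching cleaned node names against names from the knowledge bases. A minimal sketch of that matching, with made-up names standing in for the real leak and Wikidata entries:

```python
def normalize(name):
    """Crude cleaning: lowercase, drop punctuation, collapse whitespace."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())

# Hypothetical node names from a leak and an external name list.
leak_nodes = ["ACME Holdings, Ltd.", "John  Q. Doe", "Shell Corp 42"]
knowledge_base = ["acme holdings ltd", "jane roe"]

kb = {normalize(n) for n in knowledge_base}
# Nodes that match become the "red" positively labeled nodes.
labeled_positive = [n for n in leak_nodes if normalize(n) in kb]
```

Real-world matching needs more care (transliteration, aliases, fuzzy matching), but the principle is the same.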

Dataset construction

This is where the machine learning begins! The goal was to build a binary classifier that can separate the positives from the negatives, where positive samples are the red nodes found in the external knowledge bases along with their first- and second-degree connections. These can be thought of as friends, and friends of friends, respectively. Negative samples are all the other nodes. Negatives outnumber positives by far, so we sampled the negatives to cap them at 10 times the number of positives. This leads to 37.5k total samples (positives and negatives) in the Panama dataset, and 59k in Paradise.
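The capping step can be sketched as follows (synthetic node ids, cap factor of 10 as described):

```python
import random

# Synthetic stand-ins for labeled nodes.
positives = [f"pos_{i}" for i in range(50)]
negatives = [f"neg_{i}" for i in range(5000)]

random.seed(0)  # reproducible sampling
cap = 10 * len(positives)
sampled_negatives = random.sample(negatives, min(cap, len(negatives)))

# Final labeled dataset: (node_id, label) pairs.
samples = [(n, 1) for n in positives] + [(n, 0) for n in sampled_negatives]
```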

There are two different problems to be considered and they illustrate the classic bias-variance tradeoff in machine learning:

  1. Combine both datasets, then choose 60% of the samples for training and the remaining 40% for testing. The combined dataset is hard to fit in the first place, so a good ML model should have considerable complexity, i.e., favor low bias over low variance.
  2. Train on one dataset and test on the other. In this case, it is easy to overfit to one dataset and fail to generalize well. Thus, a good ML model should be of low complexity, i.e., favor low variance over low bias. The two cases for this problem are:
    1. Train on Panama and test on Paradise. This is a somewhat impractical problem because the number of training samples (37.5k) is far lower than the number of test samples (59k).
    2. Train on Paradise and test on Panama, which is more practical than the reverse.

Thus, the model requirements for the two problems are conflicting and accordingly the results are of two kinds.
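The two evaluation protocols can be sketched as follows (synthetic samples; the real splits were over the 37.5k and 59k node sets):

```python
import random

def split_60_40(samples, seed=0):
    """Shuffle and split into 60% train / 40% test (problem 1)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * 6 // 10  # 60%, via integer arithmetic
    return shuffled[:cut], shuffled[cut:]

# Synthetic (node_id, label) stand-ins for the two leak datasets.
panama = [(f"panama_{i}", i % 2) for i in range(30)]
paradise = [(f"paradise_{i}", i % 2) for i in range(40)]

# Problem 1: pool both leaks, then split randomly.
train, test = split_60_40(panama + paradise)

# Problem 2b: train on Paradise, test on Panama - no mixing across leaks.
train_2b, test_2b = paradise, panama
```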


Features

Degree was an important feature to look at since the red nodes have fairly low degrees. PageRank and triangle count were also used because of their correlation with degree. A fourth, interesting feature is the corruption perception index (CPI) of the node's country, since common sense dictates that financial irregularity is correlated with corruption.
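Assembling the per-node feature vectors is then a straightforward lookup and concatenation. In this sketch, all values (including the CPI scores and country codes) are illustrative, not real data:

```python
# Illustrative CPI scores keyed by made-up country codes.
cpi_by_country = {"AAA": 76, "BBB": 38, "CCC": 29}

# Illustrative node records with the graph-derived attributes.
graph_nodes = [
    {"id": "n1", "degree": 3, "pagerank": 0.021, "triangles": 1, "country": "AAA"},
    {"id": "n2", "degree": 1, "pagerank": 0.004, "triangles": 0, "country": "CCC"},
]

def features(node, default_cpi=50):
    """Return [degree, PageRank, triangle count, CPI] for one node,
    falling back to a neutral CPI for unknown countries."""
    cpi = cpi_by_country.get(node["country"], default_cpi)
    return [node["degree"], node["pagerank"], node["triangles"], cpi]

X = [features(n) for n in graph_nodes]
```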


Evaluation metric

Since the classification problem is imbalanced, accuracy is not a very informative metric. Instead, the precision-recall curve, and the average precision (AP) in particular, is a better indicator of the ML algorithm's success.
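Average precision is simply the mean of the precision values at the ranks where positives are retrieved, when samples are sorted by classifier score. A small self-contained implementation of that definition:

```python
def average_precision(labels, scores):
    """AP: mean of precision@k over the ranks k where a positive appears,
    with samples sorted by descending score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

# One positive ranked first (precision 1) and one ranked third
# (precision 2/3) gives AP = (1 + 2/3) / 2.
ap = average_precision([1, 0, 1], [0.9, 0.8, 0.3])
```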

Algorithm – Random Forest

Random forests construct an ensemble of decision trees to classify a dataset and have the added benefit of exposing the importance of each feature in making decisions. Important hyperparameters are the number of estimators and the depth of individual trees, both directly correlated with model complexity, and the minimum samples per leaf, which is inversely correlated. Optimizing hyperparameters according to the two problems gave a healthy AP score of 0.6 on problem 1 (shown on the left) – a high-complexity model trained and tested on the combined dataset – and 0.38 on problem 2b (shown on the right) – a low-complexity model trained on Paradise and tested on Panama.

The performance can be slightly improved by increasing the number of estimators (the figure on the left uses 1000), but the added training time makes this a potentially unattractive option from a cost-benefit standpoint.
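A scikit-learn sketch of such a model, showing the three hyperparameters mentioned above (the data here is a tiny synthetic stand-in, and the exact hyperparameter values are illustrative rather than the ones we tuned):

```python
from sklearn.ensemble import RandomForestClassifier

# Synthetic [degree, pagerank, triangles, cpi] rows; as in our data,
# the positives (label 1) sit at the low end of the degree spectrum.
X = [[1, 0.001, 0, 30], [2, 0.002, 0, 35], [3, 0.003, 1, 40],
     [50, 0.2, 10, 80], [60, 0.25, 12, 85], [70, 0.3, 15, 90]]
y = [1, 1, 1, 0, 0, 0]

# n_estimators and max_depth raise model complexity;
# min_samples_leaf lowers it.
clf = RandomForestClassifier(
    n_estimators=100, max_depth=4, min_samples_leaf=1, random_state=0
)
clf.fit(X, y)

# Per-feature importances, normalized to sum to 1.
importances = clf.feature_importances_
```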

Algorithm – Gradient Boosting

Gradient boosting is another ensemble algorithm which, instead of aggregating independently trained trees as a random forest does, trains each successive tree to focus on previously misclassified samples. In machine learning terminology, random forests are a bagging algorithm while gradient boosting is, as the name suggests, a boosting algorithm. The hyperparameters are similar to those of random forests, with the important addition of the learning rate, and the results are slightly better.
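In scikit-learn terms the change is a one-line swap, with learning_rate as the extra knob (same synthetic stand-in data and illustrative hyperparameter values as before):

```python
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic [degree, pagerank, triangles, cpi] rows, positives first.
X = [[1, 0.001, 0, 30], [2, 0.002, 0, 35], [3, 0.003, 1, 40],
     [50, 0.2, 10, 80], [60, 0.25, 12, 85], [70, 0.3, 15, 90]]
y = [1, 1, 1, 0, 0, 0]

# learning_rate scales each tree's contribution; together with
# n_estimators it controls how aggressively the ensemble fits.
clf = GradientBoostingClassifier(
    n_estimators=100, max_depth=3, learning_rate=0.1, random_state=0
)
clf.fit(X, y)
```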

Feature Importances

We found that PageRank was the most important feature when training on the combined dataset (problem 1). In fact, it accounted for 70% of the importance, while the other three features accounted for roughly 10% each. Two possible reasons for this are: a) the red dots in the degree-PageRank curves occur at lower values of PageRank than of degree, implying that PageRank dominates the decisions taken by the trees; and b) since the ML models for problem 1 are of high complexity, they have some (beneficial) tendency to overfit, leading them to prioritize a particular feature heavily. The opposite occurs in problem 2, where the ML models show a more equitable distribution of feature importances, with each feature contributing 10-30%.

Other Algorithms

It is worth mentioning that we also experimented with support vector machines (SVMs). However, the performance was distinctly inferior to the ensemble methods of random forest and gradient boosting. This is likely because many negative samples (black nodes) lie close to the positive samples (red nodes), so constructing decision boundaries for SVMs becomes extremely complicated.


Conclusion

This project was a research effort to analyze graph databases of financial leaks, integrate them with investigative outcomes, and build machine learning models to assist in future investigations. These are topics of interest to various agencies concerned with large-scale finance, anti-corruption, and transparency. We hope to continue building on these efforts. For questions, comments, and suggestions, please reach out. Thanks for reading!