Hopular applied to Fraud Detection

October 1, 2022

Project overview

Automated fraud detection methods can take two distinct forms:

  • a set of discrete rules (e.g. IF Amount > x AND Currency == $: Fraud) relying on expert knowledge that is expensive to acquire;
  • machine learning (ML) and deep learning (DL) approaches, which learn from a training dataset whether a payment is fraudulent. While ML methods can outperform rule-based methods, applying them effectively is not straightforward. Fraud detection has two major characteristics that set it apart from standard applications: (i) imbalanced data: non-fraudulent payments make up the large majority of the datasets; (ii) concept drift: the distribution from which fraud data are drawn may vary over time.
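The imbalance issue above can be made concrete with a small sketch (synthetic labels, not LUSIS data): when frauds are rare, a trivial "always legitimate" classifier already scores high accuracy, and a common correction is to up-weight the positive class by the class ratio (the heuristic behind, e.g., XGBoost's scale_pos_weight).

```python
# Synthetic illustration of class imbalance in fraud data:
# 1 = fraud, 0 = legitimate payment.
labels = [0] * 990 + [1] * 10

n_fraud = sum(labels)
n_legit = len(labels) - n_fraud

# Fraud rate and the accuracy of the trivial majority-class predictor.
fraud_rate = n_fraud / len(labels)
naive_accuracy = n_legit / len(labels)

# Common rebalancing heuristic: weight positives by the class ratio.
pos_weight = n_legit / n_fraud

print(fraud_rate)      # 0.01
print(naive_accuracy)  # 0.99 -- high, yet the model catches no fraud
print(pos_weight)      # 99.0
```

This is why plain accuracy is a misleading objective here and why the evaluation metrics need to be chosen carefully (see the bibliographic-research task below).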

For classification tasks on tabular data, the best-performing approaches are generally methods based on gradient boosting (e.g. LightGBM, XGBoost) rather than DL models. This also holds for imbalanced data, where Gradient Boosted Decision Trees (GBDT) appear to be the strongest methods. Nevertheless, a number of deep methods for tabular data have recently been proposed in the literature and show encouraging results on balanced data; in particular Hopular, proposed in [1] and based on Modern Hopfield Networks. The objective of this project is to apply this model to real fraud data provided by LUSIS in order to compare its performance against baseline GBDT models. The complexity of Modern Hopfield Networks makes them impractical on very large datasets, so this approach will be used in two types of experiments: first, a comparison with GBDT on a well-chosen subset of the data; second, using Hopular to classify, among the payments predicted as fraud by a model (e.g. GBDT), those that are actually frauds, in order to improve accuracy.
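The second experiment is a two-stage cascade, which can be sketched as follows. This is only an illustration of the control flow, not an implementation: the scores and thresholds are synthetic, `stage_one` stands in for a high-recall GBDT and `stage_two` for the Hopular re-scorer.

```python
# Sketch of the proposed cascade: a cheap, high-recall first stage
# (standing in for a GBDT) flags candidate frauds, then a second
# stage (standing in for Hopular) re-scores only the flagged
# payments. All scores and thresholds here are synthetic.

def stage_one(score: float) -> bool:
    """High-recall filter: flag anything remotely suspicious."""
    return score > 0.2

def stage_two(score: float) -> bool:
    """Stricter second pass, run only on flagged payments."""
    return score > 0.6

payments = [0.05, 0.3, 0.7, 0.9, 0.1]  # synthetic fraud scores

flagged = [s for s in payments if stage_one(s)]    # GBDT candidates
confirmed = [s for s in flagged if stage_two(s)]   # final fraud calls

print(len(flagged))  # 3
print(confirmed)     # [0.7, 0.9]
```

The design point is that the expensive model only ever sees the small flagged subset, which is what makes Hopular's cost manageable despite its poor scaling to large datasets.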

During this project, you will:

  • get acquainted with the issues related to the use of ML methods for fraud detection,
  • do bibliographic research to determine the metrics to use to evaluate the models,
  • read and understand the model proposed in [1],
  • adapt the code made available by [1] (https://github.com/ml-jku/hopular) to the fraud dataset made available by LUSIS,
  • do performance comparison with XGBoost and LightGBM on a subset of data,
  • use Hopular to try to improve the performance of a Gradient Boosting model,
  • if time permits, modify the Hopular model to try to improve its performance.
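For the metrics task, precision and recall are the natural starting point on imbalanced fraud data ([2] discusses why precision-recall curves are more informative than ROC curves in this regime). A minimal hand-computed example on synthetic labels:

```python
# Precision and recall computed by hand on tiny synthetic labels
# (1 = fraud, 0 = legitimate), the metrics most relevant when the
# positive class is rare.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)  # fraction of predicted frauds that are real
recall = tp / (tp + fn)     # fraction of real frauds that are caught

print(precision)  # 0.75
print(recall)     # 0.75
```

Sweeping the decision threshold and tracing these two quantities yields the precision-recall curve studied in [2].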

References

[1] Schäfl, Bernhard and Gruber, Lukas and Bitto-Nemling, Angela and Hochreiter, Sepp. “Hopular: Modern Hopfield Networks for Tabular Data”. https://arxiv.org/abs/2206.00664 (2022).

[2] Davis, Jesse and Goadrich, Mark. “The Relationship Between Precision-Recall and ROC Curves”. In: Proceedings of the 23rd International Conference on Machine Learning (ICML). June 2006. doi: 10.1145/1143844.1143874.