October 1, 2022
Project overview
Automated fraud detection methods can take two distinct forms:
- a set of discrete rules (e.g. IF (Amount > x) AND (Currency == $) : Fraud) relying on expert knowledge that is expensive to acquire,
- Machine Learning (ML) and Deep Learning (DL) approaches, which learn from a training dataset whether a payment is fraudulent.
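For illustration, a hand-written rule of the first kind might look as follows in Python (the threshold and currency are hypothetical, not taken from the project):

```python
# Hypothetical expert rule: flag large dollar transfers as fraud.
def is_fraud(amount: float, currency: str, threshold: float = 10_000.0) -> bool:
    return amount > threshold and currency == "USD"

print(is_fraud(25_000.0, "USD"))  # True
print(is_fraud(25_000.0, "EUR"))  # False
```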
While ML methods can outperform rule-based methods, applying them effectively is not straightforward. Indeed, fraud detection has two major characteristics that set it apart from standard applications: (i) class imbalance: non-fraudulent payments form the overwhelming majority of the datasets; (ii) concept drift: the distribution from which fraudulent payments are drawn may change over time.
Generally, for classification tasks on tabular data, the best-performing approaches are based on gradient boosting (e.g. LightGBM, XGBoost) rather than on deep models. This also holds for imbalanced data, where Gradient Boosted Decision Trees (GBDT) appear to be the best-performing methods. Nevertheless, in parallel to these approaches, anomaly detection (AD) methods, at the border between supervised and unsupervised learning, have also shown promising performance on imbalanced data. In [1], the authors propose to combine these two families to try to improve the performance of GBDT methods; in particular, they propose to use existing AD methods to create new features.
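As a rough illustration of this feature-augmentation idea (not the full XGBOD pipeline, which combines many detectors and feeds them to XGBoost), the sketch below appends the scores of a single Isolation Forest to synthetic imbalanced data and compares a sklearn GBDT with and without them:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest, HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score

# Synthetic, highly imbalanced data standing in for payment records.
X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.99], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Unsupervised detector fitted on the training split; its anomaly
# scores become one extra feature (XGBOD would use many detectors).
iso = IsolationForest(random_state=0).fit(X_train)
X_train_aug = np.column_stack([X_train, iso.score_samples(X_train)])
X_test_aug = np.column_stack([X_test, iso.score_samples(X_test)])

# Same GBDT on original vs augmented features.
for name, (Xtr, Xte) in {"original ": (X_train, X_test),
                         "augmented": (X_train_aug, X_test_aug)}.items():
    gbdt = HistGradientBoostingClassifier(random_state=0).fit(Xtr, y_train)
    ap = average_precision_score(y_test, gbdt.predict_proba(Xte)[:, 1])
    print(f"{name}: average precision = {ap:.3f}")
```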
During this project, you will:
- get acquainted with the issues related to the use of ML methods for fraud detection,
- search the literature to determine which metrics to use to evaluate the models (cf. [7]; see the metrics sketch after this list),
- implement AD methods to create new features (minimal sketches follow this list):
  - standard One-Class Classification methods: One-Class SVM, Isolation Forest (sklearn),
  - deep methods: Deep SVDD [2], GOAD [3], NeuTraLAD [4] (code available on GitHub),
  - methods based on Deep Autoencoders [5], [6],
- use the AD methods implemented in the previous step to augment the existing features and try to improve GBDT performance, as suggested in [1],
- if time permits, adapt/modify the method proposed in [1] to try to improve its performance.
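Regarding metrics: with roughly 1% fraud, ROC AUC tends to look flattering, while the area under the precision-recall curve exposes the false-alarm rate; this is the argument developed in [7]. A minimal sketch on synthetic scores (the prevalence and score distributions are assumptions, not project data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n = 100_000
y = rng.random(n) < 0.01                   # ~1% of payments are fraud
scores = rng.normal(0.0, 1.0, n)           # scores of legitimate payments
scores[y] = rng.normal(2.0, 1.0, y.sum())  # fraud scores only slightly higher

# Under heavy imbalance the ROC AUC looks strong, while average
# precision (area under the PR curve) reveals the many false alarms.
print("ROC AUC          :", round(roc_auc_score(y, scores), 3))
print("Average precision:", round(average_precision_score(y, scores), 3))
```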
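For the standard one-class methods, a minimal sklearn sketch on synthetic data: the models are fitted on (mostly) legitimate payments only, and their scores are what would later be appended as GBDT features.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(5_000, 10))  # legitimate payments
fraud = rng.normal(3.0, 1.0, size=(50, 10))      # rare anomalous payments

# Fit on normal data only; a higher score_samples value = more "normal".
for model in (OneClassSVM(nu=0.01), IsolationForest(random_state=0)):
    model.fit(normal)
    print(type(model).__name__,
          "mean score (normal vs fraud):",
          model.score_samples(normal).mean().round(2),
          model.score_samples(fraud).mean().round(2))
```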
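The approaches in [5], [6] score anomalies by how badly an autoencoder trained on normal data reconstructs a sample ([6] additionally exploits hidden activations along the projection pathway). As a crude stand-in for a deep autoencoder, the sketch below abuses sklearn's MLPRegressor with a bottleneck layer; real experiments would use a proper deep model:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(5_000, 10))  # legitimate payments
fraud = rng.normal(3.0, 1.0, size=(50, 10))      # rare anomalous payments

scaler = StandardScaler().fit(normal)
Xn, Xf = scaler.transform(normal), scaler.transform(fraud)

# Under-complete "autoencoder": the 2-unit bottleneck forces the model
# to fit the normal data, so anomalies reconstruct poorly.
ae = MLPRegressor(hidden_layer_sizes=(16, 2, 16), max_iter=1000,
                  random_state=0).fit(Xn, Xn)

def reconstruction_error(model, X):
    return ((model.predict(X) - X) ** 2).mean(axis=1)

print("normal:", reconstruction_error(ae, Xn).mean().round(2))
print("fraud :", reconstruction_error(ae, Xf).mean().round(2))
```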
References
[1] Zhao, Yue and Maciej K. Hryniewicki. “XGBOD: Improving Supervised Outlier Detection with Unsupervised Representation Learning.” 2018 International Joint Conference on Neural Networks (IJCNN) (2018): 1-8. (https://arxiv.org/ftp/arxiv/papers/1912/1912.00290.pdf).
[2] Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S.A., Binder, A., Müller, E. & Kloft, M. (2018). Deep One-Class Classification. Proceedings of the 35th International Conference on Machine Learning, PMLR 80:4393-4402. (https://proceedings.mlr.press/v80/ruff18a.html)
[3] Bergman, L., Hoshen, Y.: Classification-based anomaly detection for general data. In: International Conference on Learning Representations (2020). (https://openreview.net/forum?id=H1lKlBtvS).
[4] Qiu, C., Pfrommer, T., Kloft, M., Mandt, S., Rudolph, M.: Neural Transformation Learning for Deep Anomaly Detection Beyond Images. https://arxiv.org/abs/2103.16440 (2021).
[5] Marco Schreyer, Timur Sattarov, Damian Borth, Andreas Dengel and Bernd Reimer (2017). Detection of Anomalies in Large Scale Accounting Data using Deep Autoencoder Networks. CoRR, abs/1709.05254. (https://arxiv.org/abs/1709.05254)
[6] Ki Hyun Kim, Sangwoo Shim, Yongsub Lim, Jongseob Jeon, Jeongwoo Choi, Byungchan Kim, & Andre S. Yoon (2020). RaPP: Novelty Detection with Reconstruction along Projection Pathway. In: International Conference on Learning Representations. (https://openreview.net/pdf?id=HkgeGeBYDB)
[7] Jesse Davis and Mark Goadrich (2006). "The Relationship Between Precision-Recall and ROC Curves". In: Proceedings of the 23rd International Conference on Machine Learning (ICML). doi: 10.1145/1143844.1143874.