RoG: A Pipeline for Automated Sensitive Data Identification and Anonymisation

Title: RoG: A Pipeline for Automated Sensitive Data Identification and Anonymisation
Publication Type: Conference Paper
Year of Publication: 2023
Authors: Nikoletos S, Vlachos S, Zaragkas E, Vassilakis C, Tryfonopoulos C, Raftopoulou P
Conference Name: Proceedings of the 2023 IEEE CSR Workshop on Data Science for Cyber Security (DS4CS)
Keywords: anonymisation, automated process, k-anonymity, NER, NLP, pipeline, sensitive/private data

The amount of data available online is constantly increasing. This data may contain sensitive or private information that can expose the person behind it or be misused by malicious actors for identity theft, stalking, and other nefarious purposes. There is thus a growing need to protect individuals' privacy and prevent data breaches in several application domains. Protecting data privacy, though, is a complex and multifaceted issue that involves a range of legal, ethical, and technical considerations. In this paper, we discuss the challenges associated with data protection, the role of automated tools, and their effectiveness in identifying and anonymising sensitive data. We then propose a fully automated process for sensitive data identification and anonymisation, based on Natural Language Processing (NLP) techniques, that can be applied to large, diverse datasets across a wide range of domains.
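
To make the two stages named in the abstract concrete (identifying sensitive spans, then anonymising them) and to illustrate the k-anonymity keyword, the following is a minimal Python sketch. It is not the RoG pipeline: simple regular expressions stand in for the NER-based identification the paper refers to, and the names `PATTERNS`, `anonymise_text`, and `is_k_anonymous` are hypothetical, chosen only for this illustration.

```python
import re
from collections import Counter

# Hypothetical patterns standing in for an NER model; not taken from the paper.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def anonymise_text(text):
    """Replace each detected sensitive span with a placeholder such as [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    occurs in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


if __name__ == "__main__":
    print(anonymise_text("Contact alice@example.com or +30 210 1234567."))
    rows = [
        {"age": "30-39", "zip": "221**"},
        {"age": "30-39", "zip": "221**"},
        {"age": "40-49", "zip": "221**"},
    ]
    print(is_k_anonymous(rows, ["age", "zip"], k=2))
```

In this sketch the free-text step masks spans in place, while the k-anonymity check applies to tabular records whose quasi-identifiers have already been generalised; a real pipeline would likely replace the regular expressions with a trained NER model and add a generalisation step to reach the desired k.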