As computers support ever more human activities, ever more data is created, gathered and shared in digital form. The accumulation of information stored around the world has raised growing concern about how this data is managed, especially when it contains pieces that can be linked to people or other entities. To make working with data in the agri-food sector easier, the challenge owners have asked for a solution that can detect parts of documents or datasets that might hold PII, flag them and suggest ways to mitigate their impact. We have previous experience with the anonymisation of legal documents and are keen to take on the challenge of extending our solution so that it not only serves the needs of the challenge owner fully, but can also be extended and customised to help organisations from any domain with data anonymisation and cleaning.
One of the main challenges of the problem is that the input data comes in different formats and languages. Advances in different branches of machine learning enable us to perform complex tasks, but most models work with specific types of information. The first task of our proposed system is therefore to accept files in different formats, detect what type of information they hold, extract the data in a form supported by the downstream algorithms and send it to the best-suited model.
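A minimal sketch of this routing step, assuming a simple extension-based lookup (the extractor names are illustrative placeholders, not a fixed API; a production system would also inspect file contents):

```python
from pathlib import Path

# Illustrative map from file extensions to the extraction routine
# each format needs before it can reach a detection model.
EXTRACTORS = {
    ".txt": "plain_text",
    ".csv": "tabular",
    ".xlsx": "tabular",
    ".pdf": "ocr_and_text",
    ".ppt": "ocr_and_text",
    ".png": "image",
    ".jpg": "image",
}

def route_file(path):
    """Pick the extraction routine for a file based on its extension."""
    kind = EXTRACTORS.get(Path(path).suffix.lower())
    if kind is None:
        raise ValueError(f"Unsupported format: {path}")
    return kind
```

The table makes it trivial to support a new format later: adding one entry is enough, without touching the routing logic.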
The preprocessing will split the data into three main types: plain text, tabular data, and images and other visual data.
Another important task addressed in the preprocessing phase is detecting the language of the material to be analysed. To achieve the best possible performance, each model has versions optimised for different languages (e.g. English, Spanish, Russian) and different alphabets (e.g. Latin, Greek, Cyrillic).
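The alphabet part of this step can be approximated with the standard library alone; the sketch below guesses the dominant script from the Unicode names of a text's letters (a real language identifier, e.g. a trained classifier, would still be needed to tell English from Spanish within the Latin script):

```python
import unicodedata
from collections import Counter

def detect_script(text):
    """Guess the dominant alphabet of a text from the Unicode script
    of its alphabetic characters (LATIN, CYRILLIC, GREEK, ...)."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # Unicode names start with the script, e.g. "CYRILLIC SMALL LETTER A".
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"
```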
The solution proposed by CosmosThrace has three components. The first component is the preprocessing phase, where different kinds of documents are accepted and analysed to detect and separate the different types of data.
After the text, tables, images and other data types are separated, they are sent to the second component, namely the models that specialise in detecting sensitive data in the given type of data. These specialised models draw on developments in the relevant area of machine learning, such as word embeddings for text data and convolutional neural networks for images and other visual data.
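The hand-off from preprocessing to the specialised models can be sketched as a small registry that maps each data type to its detector; the detector bodies below are stubs standing in for the actual trained models:

```python
# Registry mapping each data type from preprocessing to a detector.
MODELS = {}

def register(data_type):
    """Decorator that registers a detector for one data type."""
    def wrapper(fn):
        MODELS[data_type] = fn
        return fn
    return wrapper

@register("text")
def text_detector(payload):
    # In the full system: word embeddings + an LSTM sequence tagger.
    return {"type": "text", "findings": []}

@register("image")
def image_detector(payload):
    # In the full system: a convolutional network for faces and text.
    return {"type": "image", "findings": []}

def dispatch(item):
    """Send an extracted item to the model registered for its data type."""
    return MODELS[item["type"]](item["payload"])
```

Because detectors register themselves, adding support for a new data type (or swapping a model version) means registering one new function, with no change to the dispatch logic.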
When the analysis is done, the results are passed to the third component, where they are merged and all suspected sensitive data is reported to the data curators, along with suggestions on how to mask it.
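A minimal sketch of this merging step, assuming each model returns findings as dictionaries (the finding kinds and suggestion texts are illustrative, not the final taxonomy):

```python
# Illustrative mitigation suggestions per kind of finding.
SUGGESTIONS = {
    "name": "replace with a pseudonym",
    "email": "redact or hash the address",
    "face": "blur the detected region",
}

def merge_findings(results):
    """Merge per-model findings into one report for the data curators,
    attaching a suggested mitigation to each finding."""
    report = []
    for res in results:
        for f in res["findings"]:
            report.append({
                "source": res["type"],
                "kind": f["kind"],
                "value": f["value"],
                "suggestion": SUGGESTIONS.get(f["kind"], "manual review"),
            })
    return report
```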
This architecture ensures flexibility: new modules can be added when other types of data have to be analysed, making the solution easy to customise and extend. It also lets developers switch between different versions of a given model without touching any other part of the solution.
The solution proposed by CosmosThrace is based on data-driven machine learning algorithms, with a special focus on deep learning techniques. For text data, word embeddings and long short-term memory layers will be employed in the neural network. To detect faces, text and other sensitive information in images and videos, the software will use convolutional and pooling layers in neural networks trained for each specific case. The preprocessing step will use best-in-class OCR algorithms to extract text from other formats such as PPT and PDF. The whole solution communicates over standard REST APIs, so it is easy to extend and adapt to each specific case. To make use of what is readily available in the machine learning ecosystem, the system will be developed in Python, the programming language most widely used by data scientists.
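Alongside the trained models, a lightweight regex baseline for highly regular PII patterns is a common complement and sanity check; the sketch below shows such a baseline for emails and phone numbers (the patterns are illustrative and deliberately simple, not exhaustive):

```python
import re

# Illustrative patterns for two highly regular kinds of PII.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\+?\d[\d\s()-]{7,}\d"),
}

def baseline_pii(text):
    """Return regex-detected PII candidates with their spans, so the
    report can point curators at the exact offending characters."""
    findings = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            findings.append({"kind": kind, "value": m.group(), "span": m.span()})
    return findings
```

Such a baseline catches only surface patterns; names, addresses and context-dependent identifiers still require the learned sequence taggers described above.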
The project will be developed in the following phases: