As more and more human activities are supported by computers, ever greater volumes of data are created, gathered and shared in digital form. With this accumulation of information around the world there is growing concern about how the data is managed, especially when it contains pieces that can be linked to people or other entities. To make working with data in the agri-food sector easier, the challenge owners have asked for a solution that can detect parts of documents or datasets that might hold PII, flag them and suggest ways to mitigate their impact. We have previous experience with the anonymisation of legal documents and are keen to take on the challenge of extending our solution into one that not only serves the needs of the challenge owner fully, but can also be extended and customised to help organisations from any domain with data anonymisation and cleaning.
One of the main challenges of the problem is that the input data comes in different formats and languages. Developments in different branches of machine learning enable us to perform complex tasks, but most models work with specific types of information. The first task of the proposed system is therefore to accept files in different formats, detect what type of information they hold, extract the data in a form supported by the algorithms downstream and send it to the best-suited model.
The preprocessing step will split the data into three main types, plus a catch-all category:
Unstructured data: text without tables or other types of structure.
Semi-structured data: tables, JSON, XML and other semi-structured formats.
Imagery data: this includes images, video, slides and other visual data.
Other data: any other type of data to be analysed.
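The routing described above can be sketched as a small dispatcher. This is a minimal stdlib sketch: the category names mirror the split above, while the mapping table and function name are illustrative assumptions, not a fixed API.

```python
import mimetypes

# Illustrative mapping from detected media type to the downstream data
# category used by the specialised models. Real deployments would also
# inspect file contents, not just the file name.
CATEGORY_BY_TYPE = {
    "text/plain": "unstructured",
    "application/json": "semi-structured",
    "application/xml": "semi-structured",
    "text/csv": "semi-structured",
    "image/png": "imagery",
    "image/jpeg": "imagery",
    "video/mp4": "imagery",
}

def categorise(path: str) -> str:
    """Guess the media type from the file name and map it to a category."""
    media_type, _ = mimetypes.guess_type(path)
    return CATEGORY_BY_TYPE.get(media_type, "other")

print(categorise("report.txt"))    # unstructured
print(categorise("fields.json"))   # semi-structured
print(categorise("drone.png"))     # imagery
print(categorise("data.unknown"))  # other
```

Anything the dispatcher cannot classify falls into the "other" bucket, matching the catch-all category in the list above.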
Another important task addressed in the preprocessing phase is detecting the language of the data to be analysed. To achieve the best possible performance, each model has versions optimised for different languages (e.g. English, Spanish, Russian, etc.) and different alphabets (e.g. Latin, Greek, Cyrillic, etc.).
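As a first rough cut, the alphabet can often be identified from the Unicode script of the letters alone. The sketch below is a stdlib illustration of that idea, under the assumption that full language identification would be delegated to a dedicated library:

```python
import unicodedata
from collections import Counter

def dominant_alphabet(text: str) -> str:
    """Return the script (LATIN, GREEK, CYRILLIC, ...) most letters belong to."""
    scripts = Counter()
    for ch in text:
        if ch.isalpha():
            # Unicode character names start with the script name, e.g.
            # "LATIN SMALL LETTER A" or "CYRILLIC SMALL LETTER DE".
            scripts[unicodedata.name(ch).split()[0]] += 1
    return scripts.most_common(1)[0][0] if scripts else "UNKNOWN"

print(dominant_alphabet("hello world"))  # LATIN
print(dominant_alphabet("привет"))       # CYRILLIC
```

Knowing the alphabet narrows the candidate languages and lets the system pick the right model version before running the heavier language-specific analysis.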
The solution proposed by CosmosThrace has three components. The first component is the preprocessing phase, where different kinds of documents are accepted and analysed to detect and separate the different types of data.
After the text, tables, images and other data types are separated, they are sent to the second component, namely the models that specialise in detecting sensitive data within the given type of data. These specialised models use advances in the relevant area of machine learning, such as word embeddings for text data and convolutional neural networks for images and other visual data.
When the analysis is done, the results are supplied to the third component, where they are merged and all suspected sensitive data is reported to the data curators, along with suggestions on how to hide it.
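The merging step of the third component can be sketched as follows. The finding structure and the suggestion table are illustrative assumptions about what the detectors might emit, not a specified format:

```python
# Hypothetical per-kind mitigation suggestions shown to the data curators.
SUGGESTIONS = {
    "email": "mask the local part, e.g. j***@example.com",
    "person_name": "replace with a pseudonym",
    "face": "blur the detected region",
}

def merge_findings(*detector_outputs):
    """Merge findings from all specialised detectors into one curator report."""
    report = []
    for findings in detector_outputs:
        for finding in findings:
            report.append({
                **finding,
                "suggestion": SUGGESTIONS.get(finding["kind"], "review manually"),
            })
    # Group findings by source document so the curator reviews one file at a time.
    return sorted(report, key=lambda f: (f["document"], f["kind"]))

text_findings = [{"document": "contract.pdf", "kind": "email",
                  "value": "j.doe@example.com"}]
image_findings = [{"document": "site.png", "kind": "face",
                   "value": "region (120, 40, 180, 100)"}]
report = merge_findings(text_findings, image_findings)
```

Because each detector only has to produce findings in this common shape, the merge component stays unchanged when new detectors are plugged in.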
This architecture ensures the flexibility needed for easy customisation and extensibility, as new modules can be added if other types of data have to be analysed. It also lets developers switch between different versions of a given model without touching any other part of the solution.
The solution proposed by CosmosThrace is based on data-driven machine learning algorithms, with a special focus on deep learning techniques. For text data, word embeddings and long short-term memory (LSTM) layers in the neural network will be employed. To detect faces, text and other sensitive information in images and videos, the software will use convolutional and pooling layers in neural networks trained for each specific case. The preprocessing step will use best-in-class OCR algorithms to extract text from other types of data, such as PPT and PDF files. The components communicate through standard REST APIs, which makes the solution easy to extend and adapt to each specific case. To make use of what is readily available in the machine learning ecosystem, the system will be developed in Python, the most widely used programming language among data scientists.
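To make the input/output contract of the text detectors concrete, here is a deliberately simple rule-based sketch. It is a stand-in baseline only: the production models described above would use word embeddings and LSTM layers, and the patterns and field names below are illustrative assumptions.

```python
import re

# Toy patterns for two common PII kinds. A real system would cover many
# more kinds and rely on trained models rather than regular expressions.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def detect_pii(text: str):
    """Return a list of findings, each with a kind, value and character span."""
    findings = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append({"kind": kind,
                             "value": match.group(),
                             "span": match.span()})
    return findings

sample = "Contact Ivan at ivan.petrov@example.com or +359 88 123 4567."
for finding in detect_pii(sample):
    print(finding["kind"], finding["value"])
```

Whatever model sits behind `detect_pii`, keeping this findings format stable is what allows the merge component to treat all detectors uniformly.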
The project will be developed in the following phases:
- Requirements analysis: in this phase we will work closely with the challenge owners to identify the exact needs and details that will make the solution fit them perfectly.
- MVP development: in this phase we will develop the software infrastructure described above. The focus will be on the preprocessing step, where the data is extracted from the documents in formats supported by the models downstream. The MVP will be tested with our existing data anonymisation model to prove that the concept is achievable.
- Essential model development: in this part of the project the focus will be on developing the different machine learning models required for the tasks defined by the challenge owners. This phase can start before the second phase is finished, to ensure there is enough time for the models to be optimised to their full potential.
- Iterative system improvements: once a solution is in place that satisfies the first of the three objectives given by the challenge owners, the focus will shift towards adding a module that suggests the most appropriate way to hide the PII. In parallel, the existing models will be improved and additional models, such as models for other languages, will be added to the system.
- Continuous learning: at the final stage of the project the system will be augmented with a module that makes it easy to retrain models based on feedback from data curators about datasets that have previously passed through the solution. This will rely on data collected in each of the previous phases of the project.
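The feedback loop in the final phase can be sketched as a simple labelled record that curators produce when confirming or rejecting a finding; the field names below are illustrative assumptions about what such a record might hold:

```python
import json
from datetime import datetime, timezone

def feedback_record(finding, accepted, curator):
    """Build one labelled feedback event for the retraining data store."""
    return {
        "finding": finding,
        "accepted": accepted,   # False marks a false positive, a negative label
        "curator": curator,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

record = feedback_record(
    {"kind": "email", "value": "j.doe@example.com"},
    accepted=True,
    curator="curator-01",
)
line = json.dumps(record)  # one JSON line per feedback event
```

Appending these records to a log gives the retraining module a growing set of curator-verified labels without any extra annotation effort.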