SoftwareX paper: AI Privacy Toolkit

There is a well-known tension between the need to analyze personal data to drive business and the requirement to preserve the privacy of data subjects.

Data protection regulations such as GDPR and CCPA impose strict restrictions and obligations on the collection and processing of personal data. These obligations are also relevant for machine learning models, which can be used to derive personal information about the data they were trained on.

We recently published a SoftwareX paper on our open-source ai-privacy-toolkit. The toolkit is designed to help organizations navigate this challenging area and build more trustworthy AI solutions, with tools that protect privacy and help ensure that AI models comply with data protection regulations. It is part of a larger suite of tools and projects that cover other aspects of privacy in ML models (such as building models with differential privacy guarantees and implementations of inference attacks), as well as additional trustworthy AI dimensions such as explainability, bias, robustness, and more.

These kinds of solutions can help organizations create ethical, privacy-preserving AI solutions, open up research opportunities within organizations where this was not previously considered possible, and enable cross-organization collaborations on AI projects. According to Gartner, this is a crucial step towards unlocking up to 50% more personal data for model training and increasing industry collaboration by up to 70% [1].

The toolkit is designed to be used by model developers (data scientists) as part of their existing ML pipelines. It is implemented as a Python library that can be used with different ML frameworks such as scikit-learn, PyTorch, and Keras.

The ai-privacy-toolkit currently contains two main modules:

  • The anonymization module contains methods for anonymizing ML-model training data, so that a model retrained on the anonymized data will also be considered anonymous. This may help exempt the model from some of the obligations and restrictions set out in data protection regulations such as GDPR and CCPA. It can also help ensure that specific individuals who participated in the training set cannot be re-identified. A conceptual sketch of this idea appears right after this list.
  • The minimization module contains methods for adhering to the data minimization principle of GDPR and CCPA for ML models. It makes it possible to reduce the amount of personal data needed to perform predictions with a machine learning model, while still enabling the model to make accurate predictions. This is done by removing or generalizing some of the input features; a second sketch below illustrates the idea. Even when no generalization can be performed, organizations are expected to be able to demonstrate that the data they collect is necessary for a given purpose.
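
The anonymization idea can be illustrated with a short, self-contained sketch. The code below is not the toolkit's API or algorithm; it is a minimal k-anonymity-style generalization written in plain pandas and scikit-learn, in which records are grouped into cells of at least k by the leaves of a small decision tree trained on the target model's predictions (the function and parameter names are purely illustrative).

    # Conceptual sketch only -- not the toolkit's implementation or API.
    # Records are grouped into cells of at least k via decision-tree leaves,
    # and each quasi-identifier is replaced by its cell representative, so
    # every quasi-identifier combination is shared by at least k records.
    from sklearn.tree import DecisionTreeClassifier

    def k_anonymize(df, quasi_identifiers, model_predictions, k=10):
        """Generalize the (numeric) quasi-identifier columns of a pandas
        DataFrame to per-cell means; each cell holds at least k records."""
        tree = DecisionTreeClassifier(min_samples_leaf=k, random_state=0)
        tree.fit(df[quasi_identifiers], model_predictions)
        cells = tree.apply(df[quasi_identifiers])   # leaf id per record
        anonymized = df.copy()
        for col in quasi_identifiers:
            anonymized[col] = anonymized.groupby(cells)[col].transform("mean")
        return anonymized

Retraining the model on the generalized data then yields a model whose training set is k-anonymous with respect to the chosen quasi-identifiers, which is the effect the anonymization module is designed to achieve (with a considerably more refined algorithm than this sketch).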

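The minimization idea can likewise be sketched in a few lines. Again, this is not the toolkit's API; it is a greedy illustration in which each feature is tentatively replaced by a constant and the replacement is kept only if accuracy on held-out data stays at or above a chosen target (the names and the 0.9 default are illustrative).

    # Conceptual sketch only -- not the toolkit's implementation or API.
    # Greedily suppresses features (replaces them with their mean) as long
    # as validation accuracy stays at or above the target threshold.
    import numpy as np
    from sklearn.metrics import accuracy_score

    def minimize_features(model, X_val, y_val, target_accuracy=0.9):
        """Return indices of features the model can do without at prediction
        time; y_val may be held-out labels or the model's original predictions."""
        X_work = np.array(X_val, dtype=float)
        suppressed = []
        for j in range(X_work.shape[1]):
            X_try = X_work.copy()
            X_try[:, j] = X_work[:, j].mean()   # feature j no longer collected
            if accuracy_score(y_val, model.predict(X_try)) >= target_accuracy:
                X_work, suppressed = X_try, suppressed + [j]
        return suppressed

Features that survive this process are the ones an organization can point to when demonstrating that the data it collects is actually needed for the prediction task; features that can be suppressed, or generalized to coarser values as the module supports, no longer need to be collected in their raw form.
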
All modules share a common set of utility classes and methods. These include generic wrappers for both datasets and models, so that users can apply the different modules to whatever types of models and datasets they are already using. For example, datasets may be provided as NumPy arrays, Pandas DataFrames, PyTorch tensors, and so on, and models may come from different ML frameworks such as scikit-learn, PyTorch, or Keras. The generic wrappers allow each module to be implemented once and then applied to different combinations of models and data using the same generic code.
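
As a rough illustration of the wrapper idea (using made-up helper names, not the toolkit's actual utility classes), a single conversion point is enough for module code to treat NumPy arrays, pandas DataFrames, and PyTorch tensors uniformly:

    # Illustrative sketch of the generic dataset wrapper concept -- not the
    # toolkit's actual classes. Everything downstream can rely on receiving
    # a plain NumPy array, whatever format the user started from.
    import numpy as np
    import pandas as pd

    def to_numpy(data):
        """Accept a NumPy array, pandas DataFrame, or PyTorch tensor."""
        if isinstance(data, np.ndarray):
            return data
        if isinstance(data, pd.DataFrame):
            return data.to_numpy()
        try:
            import torch
            if isinstance(data, torch.Tensor):
                return data.detach().cpu().numpy()
        except ImportError:
            pass
        raise TypeError(f"Unsupported dataset type: {type(data)!r}")

A similar thin layer around models, exposing a common prediction interface across scikit-learn, PyTorch, and Keras, is what allows each privacy module to be written once.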

Each of these modules is a first-of-its-kind implementation of a new approach to AI privacy, and they will likely foster many more approaches and tools for tackling these challenging issues. As AI regulations mature and case law around their violation becomes available, we foresee that many more organizations will implement and embed such tools and processes in their AI infrastructure.

Abigail Goldsteen, IBM.

[1] https://www.gartner.com/en/documents/3992922