Conference and workshop papers

Triple-negative breast cancer (TNBC) is a rare cancer, characterized by high metastatic potential and poor prognosis, and has limited treatment options. The current standard of care in nonmetastatic settings is neoadjuvant chemotherapy (NACT), but treatment efficacy varies substantially across patients. This heterogeneity is still poorly understood, partly due to the paucity of curated TNBC data. Here we investigate the use of machine learning (ML) leveraging whole-slide images and clinical information to predict, at diagnosis, the histological response to NACT for women with early TNBC. To overcome the biases of small-scale studies while respecting data privacy, we conducted a multicentric TNBC study using federated learning, in which patient data remain secured behind hospitals’ firewalls. We show that local ML models relying on whole-slide images can predict response to NACT, and that collaborative training of ML models further improves performance, on par with the best current approaches in which ML models are trained using time-consuming expert annotations. Our ML model is interpretable and is sensitive to specific histological patterns. This proof-of-concept study, in which federated learning is applied to real-world datasets, paves the way for future biomarker discovery using unprecedentedly large datasets.

The Yeo-Johnson (YJ) transformation is a standard parametrized per-feature unidimensional transformation often used to Gaussianize features in machine learning. In this paper, we investigate the problem of applying the YJ transformation in a cross-silo Federated Learning setting under privacy constraints. For the first time, we prove that the YJ negative log-likelihood is in fact convex, which allows us to optimize it with exponential search. We numerically show that the resulting algorithm is more stable than the state-of-the-art approach based on the Brent minimization method. Building on this simple algorithm and Secure Multiparty Computation routines, we propose SECUREFEDYJ, a federated algorithm that performs a pooled-equivalent YJ transformation without leaking more information than the final fitted parameters do. Quantitative experiments on real data demonstrate that, in addition to being secure, our approach reliably normalizes features across silos as well as if data were pooled, making it a viable approach for safe federated feature Gaussianization.
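As a rough illustration of the fitting idea above, the sketch below (plain single-silo Python, not the federated SMC protocol) evaluates the Yeo-Johnson negative log-likelihood for one feature and exploits its convexity in the parameter to shrink a bracketing interval; the function names, search bounds and interval-shrinking variant are illustrative choices, not the paper's exact routine.

```python
import numpy as np

def yeo_johnson(x, lam):
    """Yeo-Johnson transform of a 1-D array x for parameter lam."""
    out = np.empty_like(x, dtype=float)
    pos, neg = x >= 0, x < 0
    if abs(lam) > 1e-12:
        out[pos] = ((x[pos] + 1.0) ** lam - 1.0) / lam
    else:
        out[pos] = np.log1p(x[pos])
    if abs(lam - 2.0) > 1e-12:
        out[neg] = -(((1.0 - x[neg]) ** (2.0 - lam)) - 1.0) / (2.0 - lam)
    else:
        out[neg] = -np.log1p(-x[neg])
    return out

def yj_nll(x, lam):
    """Negative profile log-likelihood of lam (up to additive constants)."""
    n = x.size
    t = yeo_johnson(x, lam)
    return 0.5 * n * np.log(t.var()) - (lam - 1.0) * (np.sign(x) * np.log1p(np.abs(x))).sum()

def fit_lambda(x, lo=-5.0, hi=5.0, tol=1e-6):
    """Shrink [lo, hi] using convexity of the NLL in lam: compare the NLL at the
    1/4 and 3/4 points and keep the sub-interval that must contain the minimiser."""
    x = np.asarray(x, dtype=float)
    while hi - lo > tol:
        a = lo + 0.25 * (hi - lo)
        b = lo + 0.75 * (hi - lo)
        if yj_nll(x, a) < yj_nll(x, b):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)
```

In the federated protocol described in the abstract, the statistics entering the likelihood for each candidate parameter would be aggregated across silos with secure multiparty computation rather than computed on pooled data as in this sketch.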

While federated learning is a promising approach for training deep learning models over distributed sensitive datasets, it presents new challenges for machine learning, especially when applied in the medical domain where multi-centric data heterogeneity is common. Building on previous domain adaptation works, this paper proposes a novel federated learning approach for deep learning architectures via the introduction of local-statistic batch normalization (BN) layers, resulting in collaboratively-trained, yet center-specific models. This strategy improves robustness to data heterogeneity while also reducing the potential for information leaks by not sharing the center-specific layer activation statistics. We benchmark the proposed method on the classification of tumorous histopathology image patches extracted from the Camelyon16 and Camelyon17 datasets. We show that our approach compares favorably to previous state-of-the-art methods, especially for transfer learning across datasets.
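To make the local-statistic BN idea concrete, here is a minimal aggregation sketch assuming PyTorch `state_dict` models; it treats the BN running statistics (`running_mean`, `running_var`, `num_batches_tracked`) as center-specific and averages everything else, which is one plausible reading of the abstract rather than the paper's exact recipe.

```python
import torch

BN_STAT_SUFFIXES = ("running_mean", "running_var", "num_batches_tracked")

def is_local_bn_stat(name: str) -> bool:
    """BN activation statistics stay local to each center; everything else is shared."""
    return name.endswith(BN_STAT_SUFFIXES)

@torch.no_grad()
def aggregate(center_states, weights):
    """One federated round: weighted average of the shared tensors, while each
    center keeps its own BN running statistics.

    center_states: list of state_dict()s, one per center (same architecture)
    weights:       per-center weights, e.g. proportional to local sample counts
    Returns the list of updated state_dict()s, one per center.
    """
    total = float(sum(weights))
    shared = {}
    for name, ref in center_states[0].items():
        if is_local_bn_stat(name):
            continue
        avg = sum(w * s[name].float() for w, s in zip(weights, center_states)) / total
        shared[name] = avg.to(ref.dtype)
    return [
        {name: (state[name] if is_local_bn_stat(name) else shared[name].clone())
         for name in state}
        for state in center_states
    ]
```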

Neural Networks (NN) are increasingly used in machine learning, where they have become deeper and deeper to accurately model or classify high-level abstractions of data. Their development, however, also gives rise to important data privacy risks. This observation motivated Microsoft researchers to propose a framework called Cryptonets. The core idea is to combine simplifications of the NN with Fully Homomorphic Encryption (FHE) techniques to obtain both confidentiality of the manipulated data and efficiency of the processing. While efficiency and accuracy are demonstrated when the number of non-linear layers is small (e.g. 2), Cryptonets unfortunately becomes ineffective for deeper NNs, which leaves the privacy-preserving problem open in these contexts. This work successfully addresses this problem by combining several new ideas, including the use of the batch normalization principle and the splitting of the learning phase into several iterations. We experimentally validate the soundness of our approach with a neural network with 6 non-linear layers. When applied to the MNIST database, it competes with the accuracy of the best non-secure versions, thus significantly improving on Cryptonets. Additionally, we applied our approach to secure a neural network used for face recognition. This problem is usually considered much harder than MNIST hand-written digit recognition and definitely cannot be addressed with a simple network like Cryptonets. By combining our new ideas with an iterative (learning) approach, we experimentally show that we can build an FHE-friendly network achieving good accuracy for face recognition.
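As an illustration of the kind of FHE-friendly building block this line of work relies on, the sketch below (PyTorch, with made-up coefficients) replaces a ReLU with a degree-2 polynomial preceded by batch normalization, so that the network uses only additions and multiplications and the pre-activations stay in a range where a low-degree polynomial is a reasonable approximation; it is not the exact architecture of the paper.

```python
from torch import nn

class PolyAct(nn.Module):
    """Degree-2 polynomial activation a*x^2 + b*x + c: only additions and
    multiplications, hence compatible with homomorphic evaluation."""
    def __init__(self, a=0.25, b=0.5, c=0.25):   # illustrative coefficients only
        super().__init__()
        self.a, self.b, self.c = a, b, c

    def forward(self, x):
        return self.a * x * x + self.b * x + self.c

def fhe_friendly_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Conv -> BatchNorm -> polynomial activation. The BN layer keeps the
    pre-activations in a narrow range, where a low-degree polynomial can
    stand in for the usual ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        PolyAct(),
    )
```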

The secure two-party computation (S2PC) protocols SHADE and GSHADE have been introduced by Bringer et al. over the last two years. GSHADE makes it possible to compute different distances (Hamming, Euclidean, Mahalanobis) quite efficiently and is among the most efficient S2PC methods for this task. It can therefore be used to efficiently perform one-to-many identification for several biometric modalities (iris, face, fingerprint). In this paper, we introduce two extensions of GSHADE. The first one enables the evaluation of new multiplicative functions; we show how this allows GSHADE to be applied to a classical machine learning algorithm. The second one is a new proposal to secure GSHADE against malicious adversaries, following the recent dual execution and cut-and-choose strategies, at a very small additional cost. Because they preserve GSHADE's structure, our extensions remain very efficient compared to other S2PC methods.
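For intuition on why these particular distances fit an S2PC framework like GSHADE, note that each of them splits into terms depending on a single party's input plus bilinear cross terms, which are the only part that must be computed jointly (typically via oblivious-transfer-based sub-protocols). The decomposition below is standard algebra, not a description of the protocol itself; presumably the multiplicative-function extension mentioned above is what relaxes the restriction to such cross terms.

```latex
% For bit strings (or vectors) X, Y of length n and a public matrix M:
d_H(X,Y)   = \sum_{i=1}^{n} \bigl(x_i + y_i - 2\,x_i y_i\bigr), \qquad x_i, y_i \in \{0,1\}, \\
d_E^2(X,Y) = \sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} y_i^2 - 2\sum_{i=1}^{n} x_i y_i, \\
d_M^2(X,Y) = (X-Y)^{\top} M (X-Y) = X^{\top} M X + Y^{\top} M Y - 2\,X^{\top} M Y.
```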

In iris-based video authentication, the best frames of the video are selected in order to compare them to reference data. Since biometric data are sensitive, a non-secure matching-score evaluation can leak private information. We focus here on garbled circuits, which are widely employed to securely compare biometric data. Our main idea is to introduce a new, efficient, partially re-usable garbled circuit. To do so, we take advantage of the similarities between templates from the same video in order to reduce the cost of several secure matching evaluations on close inputs. We validate our proposal through experiments showing that it is practical.

This paper analyses the unlinkability and the irreversibility of the iris biometric template protection system based on Bloom filters introduced at ICB 2013. Hermans et al. presented at BIOSIG 2014 an attack on the unlinkability of these templates which, in the worst case, succeeds with probability at least 96%. However, their attack assumes protected templates generated from the same iriscode. In this paper, we analyze unlinkability on protected templates generated from two different iriscodes coming from the same iris, and we moreover introduce an irreversibility analysis that exploits the non-uniformity of the data. Our experiments thus practically demonstrate new vulnerabilities of the scheme.

Since the seminal paper of Ratha et al. in 2001 that introduced cancelable biometrics, inner permutation of biometric templates has been widely suggested as one of the basic building blocks to protect biometric data against compromise or cross-matching between two databases. In this paper, we study the case of iris biometrics, where an inner permutation corresponds to shuffling the bits of a template in order to diversify the stored data. We analyze the security brought by a permutation and underline the impact of the non-uniformity of templates on the robustness of cancelable biometrics: we introduce new attack strategies on permuted biometric databases that make it possible to reconstruct part of the permutation, leading to a potential privacy leakage. We finally suggest ways to efficiently improve the protection by designing specific countermeasures, with no impact on accuracy and a low impact on the overall architecture of the system.
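The toy Python sketch below illustrates the general principle at play, namely that per-position bit frequencies are merely relocated, not hidden, by a fixed permutation; it is a simplified illustration of why non-uniform templates leak, not the attack strategies of the paper.

```python
import numpy as np

def frequency_alignment(public_codes, permuted_codes):
    """Toy illustration: per-position bit frequencies survive a fixed secret
    permutation, so matching frequency profiles between a public sample of
    iriscodes and a permuted database yields candidate correspondences, i.e.
    a partial reconstruction of the permutation.

    public_codes:   (n, L) array of 0/1 iriscodes from the public distribution
    permuted_codes: (m, L) array of templates permuted with the same secret pi
    Returns, for every permuted position, the original position with the
    closest bit frequency (positions with near-uniform bits stay ambiguous).
    """
    f_orig = np.asarray(public_codes, float).mean(axis=0)
    f_perm = np.asarray(permuted_codes, float).mean(axis=0)
    return np.abs(f_perm[:, None] - f_orig[None, :]).argmin(axis=1)
```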

Preprints

Triple-Negative Breast Cancer (TNBC) is a rare cancer, characterized by high metastatic potential and poor prognosis, with limited treatment options compared to other breast cancers. The current standard of care in non-metastatic settings is neoadjuvant chemotherapy (NACT), with the goal of enabling breast-conserving surgery and providing an in vivo assessment of chemosensitivity. However, the efficacy of this treatment varies significantly across patients, and this heterogeneity of histological response is still poorly understood, partly due to the paucity of available curated TNBC data. Motivated by this problem, we investigate the use of machine learning (ML) to predict at diagnosis the histological response to NACT for early TNBC patients. To overcome the known biases of related small-scale studies while respecting data privacy, we conduct, for the first time, a multi-centric TNBC study behind hospitals’ firewalls using collaborative Federated Learning (FL), thereby gaining access to enough TNBC data to sustain a complete investigation of response heterogeneity. We show evidence that local ML models relying on Whole-Slide Images (WSIs) at diagnosis are able to predict the histological response to NACT as accurately as current clinical approaches, which rely on time-consuming expert annotations. We demonstrate that collaborative training further improves performance over single-center training, outperforming clinical methods. Our ML model is interpretable by design, and we show that it is sensitive to specific histological patterns. While we identify known predictive biomarkers among them, this proof of concept for real-world collaborative FL paves the way for future biomarker discovery using unprecedentedly large datasets.

Since 2014, the NIH-funded iDASH (integrating Data for Analysis, Anonymization, SHaring) National Center for Biomedical Computing has hosted yearly competitions on the topic of private computing for genomic data. For one track of the 2020 iteration of this competition, participants were challenged to produce an approach to federated learning (FL) training of genomic cancer prediction models using differential privacy (DP), with submissions ranked according to held-out test accuracy for a given set of DP budgets. More precisely, in this track, we were tasked with training a supervised model for the prediction of breast cancer occurrence from genomic data split between two virtual centers, while ensuring data privacy with respect to model transfer via DP. In this article, we present our third-place submission to this competition. During the competition, we encountered two main challenges, discussed in this article: (i) ensuring the correctness of the privacy budget evaluation and (ii) achieving an acceptable trade-off between prediction performance and privacy budget.
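For context, the DP mechanism underlying this kind of federated training is the usual clip-and-noise step applied to each transferred model update, sketched below in plain NumPy; the function name and parameters are illustrative, and the per-round (epsilon, delta) accounting, which was the first challenge mentioned above, is handled separately by a composition accountant.

```python
import numpy as np

def dp_sanitize_update(update, clip_norm, noise_multiplier, rng=None):
    """Clip the model update to L2 norm `clip_norm`, then add Gaussian noise of
    standard deviation noise_multiplier * clip_norm before it leaves the center.
    The privacy budget spent by repeating this every round is tracked by a
    separate composition accountant (not shown)."""
    rng = rng or np.random.default_rng()
    update = np.asarray(update, dtype=float)
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    return clipped + rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
```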

Federated learning enables the joint training of a machine learning model across distributed clients holding sensitive datasets. In real-world settings, this approach is hindered by expensive communication and privacy concerns. Both of these challenges have already been addressed individually, resulting in competing optimisations. In this article, we are among the first to tackle them simultaneously. More precisely, we adapt compression-based federated techniques to additive secret sharing, leading to an efficient secure aggregation protocol with an adaptable security level. We prove its privacy against malicious adversaries and its correctness in the semi-honest setting. Experiments on deep convolutional networks demonstrate that our secure protocol achieves high accuracy with low communication costs. Compared to prior works on secure aggregation, our protocol has lower communication and computation costs for a similar accuracy.
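A minimal sketch of the additive-secret-sharing core of such a secure aggregation protocol is given below, assuming the model updates have already been compressed and quantized to unsigned integers; the ring size, party count and function names are illustrative, and the actual protocol additionally covers the compression scheme and malicious behaviour as described above.

```python
import numpy as np

MOD = np.uint64(2 ** 32)        # shares live in the ring of integers mod 2^32

def make_shares(update_q, n_parties, rng):
    """Additively secret-share one client's quantized update: n_parties - 1
    uniformly random shares plus a final share fixing the sum mod 2^32."""
    shares = [rng.integers(0, 2 ** 32, size=update_q.shape, dtype=np.uint64)
              for _ in range(n_parties - 1)]
    last = (update_q.astype(np.uint64) - sum(shares)) % MOD
    return shares + [last]

def secure_sum(quantized_updates, n_parties=3, seed=0):
    """Each aggregating party only ever sees uniformly random shares, yet the
    recombined partial sums equal the sum of all clients' quantized updates."""
    rng = np.random.default_rng(seed)
    partial = [np.zeros_like(quantized_updates[0], dtype=np.uint64)
               for _ in range(n_parties)]
    for u in quantized_updates:
        for p, s in enumerate(make_shares(u, n_parties, rng)):
            partial[p] = (partial[p] + s) % MOD
    return sum(partial) % MOD
```

Recombining the partial sums yields `sum(quantized_updates) % 2**32`, while each individual share seen by an aggregator is uniformly random and reveals nothing about a single client's update.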

Neural Networks (NN) are increasingly used in machine learning, where they have become deeper and deeper to accurately model or classify high-level abstractions of data. Their development, however, also gives rise to important data privacy risks. This observation motivated Microsoft researchers to propose a framework called Cryptonets. The core idea is to combine simplifications of the NN with Fully Homomorphic Encryption (FHE) techniques to obtain both confidentiality of the manipulated data and efficiency of the processing. While efficiency and accuracy are demonstrated when the number of non-linear layers is small (e.g. 2), Cryptonets unfortunately becomes ineffective for deeper NNs, which leaves the problem of privacy-preserving matching open in these contexts. This work successfully addresses this problem by combining the original ideas of the Cryptonets solution with the batch normalization principle introduced at ICML 2015 by Ioffe and Szegedy. We experimentally validate the soundness of our approach with a neural network with 6 non-linear layers. When applied to the MNIST database, it competes with the accuracy of the best non-secure versions, thus significantly improving on Cryptonets.

Journal articles

In this work, we develop an unlinkability and irreversibility analysis of the so-called Bloom filter-based iris biometric template protection introduced at ICB 2013. We go further than the unlinkability analysis of Hermans et al. presented at BIOSIG 2014. Firstly, we analyse unlinkability on protected templates built from two different iriscodes coming from the same iris, whereas Hermans et al. analysed only protected templates built from the same iriscode. Moreover, we introduce an irreversibility analysis that exploits the non-uniformity of the biometric data. Our experiments demonstrate new vulnerabilities of this scheme. We then discuss the security of other similar protected biometric templates based on Bloom filters that have been suggested in the literature since 2013. Finally, we suggest a Secure Multiparty Computation (SMC) protocol that benefits from the alignment-free feature of this Bloom filter construction in order to compute the matching scores efficiently and securely.
This work is an extended version of a conference paper presented at ICB 2015.

Books

In this chapter, we describe how to perform secure and efficient IrisCode-based identification. In this use case, the first party has an IrisCode, the second party has one or several IrisCodes, and they would like to discover whether the first party’s IrisCode is close to at least one IrisCode belonging to the second party, without revealing any information about their own IrisCodes to the other party. Secure Two-Party Computation (S2PC) protocols are well suited to this use case because they enable two parties to jointly evaluate a function over their inputs while preserving the privacy of those inputs. In this chapter, we explain how to use S2PC protocols efficiently for secure iris-based identification.

Talks

Presentation of zkinterface, our new zero-knowledge interoperability standard.

Federated learning: Constance Beguier of Owkin explains how to train a machine learning model without having access to the source data.

Machine learning based on Neural Networks (NN) is increasingly used in areas such as medical diagnosis, computer vision and speech recognition. To accurately model or classify high-level abstractions of data, these networks have become deeper, leading to new algorithms, called deep neural networks, with more and more layers. Their unprecedented accuracy has turned them into the foundation of many new services and applications. Recently, Cryptonets was introduced by Microsoft to combine these NN computations with homomorphic encryption to ensure the confidentiality of the manipulated data. For face recognition, this could be a must-have feature to ensure privacy. Our work improves on their proposal by modifying the underlying NNs while keeping excellent accuracy. For instance, following their experiment with the MNIST database, our experiments achieve an accuracy similar to that of the non-secure version, a significant improvement over Cryptonets. This can be seen as another step towards handling the many layers of deep NNs with encrypted data.

Patents

The present invention relates to a secure parameter learning method of a convolutional neural network (CNN) for data classification, the method comprising the implementation, by data processing means (11a) of a first server (1a), of steps of: (a0) receiving from a second server (1b) a previously classified learning database, said training data being homomorphically encrypted; (a1) learning in the encrypted domain, from said learning database, the parameters of a reference CNN comprising at least a non-linear layer (POLYNOME) applying a polynomial function of degree at least two approximating an activation function, and a batch normalization layer (BN) before each non-linear layer (POLYNOME); (a2) transmitting to said second server (1b) the learned parameters for decryption and use in classification. The present invention also relates to methods of securely classifying an input data item.

The invention proposes a method, and an associated system, for authenticating a user by means of the redundancy present between several images of a video. The method uses garbled circuits, named variant garbled circuits, associated with the bits that vary between the images of the video, and a garbled circuit named the invariant garbled circuit, associated with the bits that are invariant between the images of the video, so that the invariant garbled circuit only needs to be evaluated a single time.

The invention proposes a method comprising the evaluation of a function F obtained by applying a first operation to n sub-functions f_i, the evaluation comprising a series of calculation steps in which a first unit takes on the client role and a second unit takes on the server role, and the repetition of the series of calculation steps with the client and server roles exchanged between the units, each series of steps comprising: a) random generation, by the server, of first data and of a second datum; b) for each sub-function f_i, generation by the server of a set of elements formed by the result of f_i evaluated on the data of the client and the server, masked by a first datum by applying the first operation between the result and the first datum, and masked by the second datum by applying, between the masked result and the second datum, a second operation different from the first and distributive with respect to it; c) retrieval by oblivious transfer, by the client, of intermediate data corresponding to one of the elements generated by the server; d) generation, by the server, of a first result part, by masking each first datum with the second datum and applying the first operation to all the masked first data; and e) generation, by the client, of a second result part, by applying the first operation to all the intermediate data.

The invention relates to a method for generating a signature of a message intended to be validated by a verification server, implemented by a client device configured to hold a private key and a corresponding public key, and comprising steps of: prior offline computation (103), by a hardware security module, of a signature token resulting from an encryption using a homomorphic encryption function; storage (104) of said signature token; and generation (105) of said signature of said message using said homomorphic encryption function, from the result of the encryption by said homomorphic encryption function of the private key stored by the client device, the signature token and said message, said signature being intended to be validated by said verification server using said public key.

The present invention relates to a method for identifying an entity, implemented by an identification system, from indexed distance data (d_1, ..., d_n) corresponding to reference entities, the method comprising: a phase of determining a set (I) of minimum indices (index_1, ..., index_k) among said binary indexed distance data of length q', comprising an execution step in which, for each set of j-th bits of the indexed distance data included in a list of data to be processed, j being an integer varying from q'-1 to 0, starting with the set of most significant bits of the data to be processed and ending with the set of least significant bits of the data to be processed, the search for minimum indices includes, if the number of indices of a first group of indices of indexed distance data (p) is greater than the remaining number of indexed data to discard (r), adding the indices of a second group to the set of minimum indices; and an identification phase of the entity to be identified from among the reference entities corresponding to the stored reference biometric data associated with the determined minimum-distance data indices (index_1, ..., index_k), the operations on binary integers implementing at least said execution step being translated into at least one Boolean circuit used to implement at least said execution step in a secure manner between the control server and the management server, using a secure multiparty computation protocol for the secure evaluation of said Boolean circuit.