Conference and workshop papers
Triple-negative breast cancer (TNBC) is a rare cancer, characterized by high metastatic potential and poor prognosis, and has limited treatment options. The current standard of care in nonmetastatic settings is neoadjuvant chemotherapy (NACT), but treatment efficacy varies substantially across patients. This heterogeneity is still poorly understood, partly due to the paucity of curated TNBC data. Here we investigate the use of machine learning (ML) leveraging whole-slide images and clinical information to predict, at diagnosis, the histological response to NACT for early TNBC women patients. To overcome the biases of small-scale studies while respecting data privacy, we conducted a multicentric TNBC study using federated learning, in which patient data remain secured behind hospitals’ firewalls. We show that local ML models relying on whole-slide images can predict response to NACT but that collaborative training of ML models further improves performance, on par with the best current approaches in which ML models are trained using time-consuming expert annotations. Our ML model is interpretable and is sensitive to specific histological patterns. This proof of concept study, in which federated learning is applied to real-world datasets, paves the way for future biomarker discovery using unprecedentedly large datasets.
The Yeo-Johnson (YJ) transformation is a standard parametrized per-feature unidimensional transformation often used to Gaussianize features in machine learning. In this paper, we investigate the problem of applying the YJ transformation in a cross-silo Federated Learning setting under privacy constraints. For the first time, we prove that the YJ negative log-likelihood is in fact convex, which allows us to optimize it with exponential search. We numerically show that the resulting algorithm is more stable than the state-of-the-art approach based on the Brent minimization method. Building on this simple algorithm and Secure Multiparty Computation routines, we propose SECUREFEDYJ, a federated algorithm that performs a pooled-equivalent YJ transformation without leaking more information than the final fitted parameters do. Quantitative experiments on real data demonstrate that, in addition to being secure, our approach reliably normalizes features across silos as well as if data were pooled, making it a viable approach for safe federated feature Gaussianization.
While federated learning is a promising approach for training deep learning models over distributed
sensitive datasets, it presents new challenges for machine learning, especially when applied in the
medical domain where multi-centric data heterogeneity is common. Building on previous domain
adaptation works, this paper proposes a novel federated learning approach for deep learning
architectures
via the introduction of local-statistic batch normalization (BN) layers, resulting in
collaboratively-trained, yet center-specific models. This strategy improves robustness to data
heterogeneity while
also reducing the potential for information leaks by not sharing the center-specific layer activation
statistics. We benchmark the proposed method on the classification of tumorous histopathology
image patches extracted from the Camelyon16 and Camelyon17 datasets. We show that our approach
compares favorably to previous state-of-the-art methods, especially for transfer learning across
datasets.
Neural Networks (NN) are today increasingly used in Machine Learning where they have become deeper and
deeper to accurately model or classify high-level abstractions of data. Their development however
also gives rise to important data privacy risks. This observation motives Microsoft researchers to
propose a framework, called Cryptonets. The core idea is to combine simplifications of the NN with
Fully Homomorphic Encryptions (FHE) techniques to get both confidentiality of the manipulated data and
efficiency of the processing. While efficiency and accuracy are demonstrated when the number of
non-linear layers is small (e.g. 2), Cryptonets unfortunately becomes ineffective for deeper NNs which
let the privacy preserving problem open in these contexts. This work successfully addresses this
problem by combining several new ideas including the use of the batch normalization principle and the
splitting of the learning phase in several iterations. We experimentally validate the soundness of
our approach with a neural network with 6 non-linear layers. When applied to the MNIST database, it
competes with the accuracy of the best non-secure versions, thus significantly improving Cryptonets.
Additionally, we applied our approach to secure a neural network used for face recognition. This
problem is usually considered much harder than the MNIST hand-written digits recognition and can
definitely not be addressed with a simple network like Cryptonets. By combining our new ideas with an
iterative (learning) approach we experimentally show that we can build an FHE-friendly network
achieving good accuracy for face recognition.
The secure two-party computation (S2PC) protocols SHADE and GSHADE have been introduced by Bringer et
al. in the last two years. The protocol GSHADE permits to compute different distances (Hamming,
Euclidean, Mahalanobis) quite efficiently and is one of the most efficient compared to other S2PC
methods. Thus, this protocol can be used to efficiently compute one-to-many identification for several
biometrics data (iris, face, fingerprint). In this paper, we introduce two extensions of GSHADE. The
first one enables us to evaluate new multiplicative functions. This way, we show how to apply
GSHADE to a classical machine learning algorithm. The second one is a new proposal to secure GSHADE
against malicious adversaries following the recent dual execution and cut-and-choose strategies. The
additional cost is very small. By preserving the GSHADE's structure, our extensions are very efficient
compared to other S2PC methods.
In iris-based video authentication, the best frames from the video are selected in order to compare
them to a reference data. As biometric data are sensitive, a non-secure matching score evaluation
can leak private information. We here focus on Garbled circuits which are widely employed to securely
compare biometric data. Our main idea consists in introducing a new efficient partially re-usable
garbled circuit. To do that, we take advantage of the similarities between templates from the same
video in order to reduce the cost of several secure matching evaluations on close inputs. We validate
our proposal through experiments showing that it is practical.
This paper analyses the unlinkability and the irreversibility of the iris biometric template
protection system based on Bloom filters introduced at ICB 2013. Hermans et al. presented at BIOSIG
2014 an
attack on the unlinkability of these templates. In the worst case, their attack succeeds with
probability at least 96%. But in their attack, they assume protected templates generated from the same
iriscode. In this paper, we analyze unlinkability on protected templates generated with two different
iriscodes coming from the same iris, and we moreover introduce irreversibility analysis that
exploits non-uniformity of the data. Our experiments thus practically demonstrate new vulnerabilities
of the scheme.
Since the seminal paper of Ratha et al. in 2001 that introduced cancelable biometrics, inner
permutation of biometric templates has been widely suggested as one of the basic components to protect
biometric data against compromised or cross-checking between two databases. In this paper, we study
the case of iris biometrics where an inner permutation corresponds to shuffling the bits of a
template in order to diversify the stored data. We analyze the security brought by a permutation and
underline the impact of non-uniformity of templates on the robustness of cancelable biometrics: we
introduce new attack strategies on permuted biometric databases that enable to reconstruct part of the
permutation, leading to a potential privacy leakage. We finally suggest ways to improve
efficiently the protection, by designing specific countermeasures, with no impact on accuracy and a
low impact on the overall architecture of the system.
Preprints
Triple-Negative Breast Cancer (TNBC) is a rare cancer, characterized by high metastatic potential and
poor prognosis, and has limited treatment options compared to other breast cancers. The current
standard of care in non-metastatic settings is neoadjuvant chemotherapy (NACT), with the goal of
breast-conserving surgery and for an in vivo assessment of chemosensitivity. However, the efficacy of
this treatment varies significantly across patients, and this histological response heterogeneity is
still poorly understood partly due to the paucity of available curated TNBC data. Motivated by this
problem, we investigate the use of machine learning (ML) to predict at diagnosis the histological
response to NACT for early TNBC patients. To overcome the known biases of related small scale studies
while respecting data privacy, we conduct, for the first time, a TNBC study in a multi-centric fashion
behind hospitals’ firewalls using collaborative Federated Learning (FL). Thereby allowing access
to enough TNBC data to sustain a complete response heterogeneity investigation. We show evidence that
local ML models relying on Whole-Slide Images (WSIs) at diagnosis are able to predict the
histological response to NACT as accurately as current clinical approaches, which rely on
time-consuming expert annotations. We demonstrate that collaborative training further improves
performance
over single-center training outperforming clinical methods. Our ML model is interpretable by design,
and we show that it is sensitive to specific histological patterns. While we identify known
predictive biomarkers among them, this proof of concept for real-world collaborative FL paves the way
for future biomarker discovery using unprecedently large datasets.
Since 2014, the NIH funded iDASH (integrating Data for Analysis, Anonymization, SHaring) National
Center for Biomedical Computing has hosted yearly competitions on the topic of private computing for
genomic data. For one track of the 2020 iteration of this competition, participants were challenged to
produce an approach to federated learning (FL) training of genomic cancer prediction models using
differential privacy (DP), with submissions ranked according to held-out test accuracy for a given set
of DP budgets. More precisely, in this track, we are tasked with training a supervised model for
the prediction of breast cancer occurrence from genomic data split between two virtual centers while
ensuring data privacy with respect to model transfer via DP. In this article, we present our 3rd
place submission to this competition. During the competition, we encountered two main challenges
discussed in this article: i) ensuring correctness of the privacy budget evaluation and ii) achieving
an acceptable trade-off between prediction performance and privacy budget.
Federated Learning enables one to jointly train a machine learning model across distributed clients
holding sensitive datasets. In real-world settings, this approach is hindered by expensive
communication and privacy concerns. Both of these challenges have already been addressed individually,
resulting in competing optimisations. In this article, we tackle them simultaneously for one of
the first times. More precisely, we adapt compression-based federated techniques to additive secret
sharing, leading to an efficient secure aggregation protocol, with an adaptable security level. We
prove its privacy against malicious adversaries and its correctness in the semi-honest setting.
Experiments on deep convolutional networks demonstrate that our secure protocol achieves high accuracy
with low communication costs. Compared to prior works on secure aggregation, our protocol has a lower
communication and computation costs for a similar accuracy.
Neural Networks (NN) are today increasingly used in Machine Learning where they have become deeper and
deeper to accurately model or classify high-level abstractions of data. Their development however
also gives rise to important data privacy risks. This observation motives Microsoft researchers to
propose a framework, called Cryptonets. The core idea is to combine simplifications of the NN with
Fully Homomorphic Encryptions (FHE) techniques to get both confidentiality of the manipulated data and
efficiency of the processing. While efficiency and accuracy are demonstrated when the number of
non-linear layers is small (eg 22), Cryptonets unfortunately becomes ineffective for deeper NNs which
let the problem of privacy preserving matching open in these contexts. This work successfully
addresses this problem by combining the original ideas of Cryptonets' solution with the batch
normalization principle introduced at ICML 2015 by Ioffe and Szegedy. We experimentally validate the
soundness of our approach with a neural network with 66 non-linear layers. When applied to the MNIST
database, it competes the accuracy of the best non-secure versions, thus significantly improving
Cryptonets.
Journal articles
In this work, we develop an unlinkability and irreversibility analysis of the so-called Bloom
filter-based iris biometric template protection introduced at ICB 2013. We go further than the
unlinkability analysis of Hermans et al. presented at BIOSIG 2014. Firstly, we analyse unlinkability
on protected templates built from two different iriscodes coming from the same iris whereas Hermans
et al. analysed only protected templates from the same iriscode. Moreover, we introduce an
irreversibility analysis that exploits non-uniformity of the biometric data. Our experiments
demonstrate new
vulnerabilities of this scheme. Then we will discuss the security of other similar protected biometric
templates based on Bloom filters that have been suggested in the literature since 2013. Finally,
we suggest a Secure Multiparty Computation (SMC) protocol, that benefits of the alignment-free feature
of this Bloom filter construction, in order to compute efficiently and securely the matching
scores.
This work is an extended version of a conference paper presented at ICB 2015.
This work is an extended version of a conference paper presented at ICB 2015.
Books
In this chapter, we describe how to perform a secure and efficient IrisCode-based identification. In
this use case, the first party has an IrisCode, the second party has one or several IrisCodes and
they would like to discover whether the first party’s IrisCode is close to at least one IrisCode
belonging to the second party without revealing any information about their own IrisCodes to the
opposing party. Secure Two-Party Computation (S2PC) protocols are dedicated to this use case because
they enable two parties to jointly evaluate a function over their inputs while preserving the
privacy of their inputs. In this chapter, we explain how to efficiently use S2PC protocols for secure
iris-based identification.
Talks
Presentation of zkinterface our new zk interoperability standard
Machine learning based on Neural Networks (NN) is increasingly used in areas such as medical
diagnosis, computer vision and speech recognition. To accurately model or classify high-level
abstractions
of data, they have become deeper, leading to new algorithms, called deep neural networks, with more
and more layers. Their unprecedented accuracy has turned them into the foundation of many new
services and applications. Recently, Cryptonets has been introduced by Microsoft to combine these NN
computations with homomorphic encryptions to ensure confidentiality of manipulated data.
Considering face recognition, this could be a must-have feature to have to ensure privacy. Our work
improves their proposal by modifying underlying NNs while keeping excellent accuracy performances.
For instance, following their experiment with the MNIST database, we obtain with our experiments an
accuracy similar to that for the non-secure version, achieving a significant improvement compared to
Cryptonets. This can be seen as another step to handle the many layers of deep NN with encrypted data.
Patents
The present invention relates to a secure parameter learning method of a convolutional neuron network,
CNN, for data classification; the method comprising the implementation by data processing means
(11a) of a first server (1a) of steps of: (a0) receiving from a second server (1b) a previously
classified learning database, said training data being homomorphically encrypted; (a1) Learning in the
encrypted domain, from said learning database, the parameters of a reference CNN comprising at least:
a nonlinear layer (POLYNOME) operating a polynomial function of degree at least two approximating
an activation function; a batch normalization layer (BN) before each non-linear layer (POLYNOME); (a2)
Transmission to said second server (1b) of the learned parameters for decryption and use in
classification The present invention also relates to methods of securely classifying an input data
item.
The invention proposes a method and an associated system for authenticating a user, by means of the
redundancy present between several images of a video, the method using garbled circuits, named
variant garbled circuits, associated with the alternative bits between the images of the video and a
garbled circuit named invariant garbled circuit, associated with the invariant bits between the
images of the video, so that the invariant garbled circuit only needs to be evaluated a single time.
The invention proposes a method comprising the evaluation of a function F obtained by the application
to n sub-functions f i of a first operation, the evaluation comprising: implementing a series of
calculation steps in which a first unit takes on a client role and a second unit takes on a server
role, and the repetition of the series of calculation steps in which the client and server roles are
exchanged between the units, each series of steps comprising: a) random generation, by the server, of
first data, and a second datum, b) for each subfunction f i, generation by the server of a set of
elements formed by: o a result of f i evaluated in the data of the client and the server, o masked by
a first datum, by applying the first operation between the result and the first datum, and o
masked by the second datum, by application between the masked result and the second datum of a second
operation different from the first and distributive with respect thereto, c) recovery by transfer
unconscious, by the client, of intermediate data corresponding to one of the elements generated by the
server, d) generation, by the server, of a first part of result, by: € ¢ the masking of each
first data by the second data, The application to all the first masked data of the first operation,
and e) generating, by the client, a second result part, by application to all the intermediate data
of the first operation.
The invention relates to a method for generating a signature of a message intended to be validated by
a verification server, a client device configured to hold a private key and a corresponding public
key and comprising steps of: -calculus (103) previously offline by a hardware security module of a
signature token resulting from an encryption using a homomorphic encryption function, storage (104)
of said signature token; generating (105) said signature of said encrypted message using said
homomorphic encryption function from the result of the encryption by said homomorphic encryption
function
of the private key stored by the client device, the signature token and said message said signature
being intended to be validated by said verification server using said public key.
The present invention relates to a method for identifying an entity implemented by an identification
system from indexed distance data (d 1, ..., d n) corresponding to reference entities and
comprising: a phase of determining a set (I) of minimum indices (index 1,..., index k) among said
binary indexed distance data of length q 'comprising an execution step comprising, for each set of jth
bits of the indexed distance data included in a list of data to be processed, j being an integer
varying from q'-1 to 0, starting with the set of most significant bits of the data to be processed and
ending with the set of least significant bits of the data to be processed, the search for minimum
indices including, if a number of indices of a first group of indexes of distance indexed data (p) is
greater than a remaining number of indexed data to discard (r), adding said indices of a second group
to the set of minimum indices; an identification phase of the entity to be identified from among
the reference entities corresponding to the stored reference biometric data associated with the
determined minimum distance data indexes (index 1,... index k), operations on binary integers for
implementing at least said execution step, being translated in the form of at least one Boolean
circuit used to implement at least said execution step in a secure manner between the control server
and
the management server using a multiparty secure computing protocol for secure evaluation of said
Boolean circuit.