Authors: Phil Stubbings and Greg Gawron
In the Privacy Enabling Technologies (PETs) team at LiveRamp, we design and implement a wide range of robust techniques to safeguard data whilst enabling analysts to extract insights and perform statistical modeling in a privacy-preserving manner.
Much of our work is based on established privacy mechanisms drawing from research in differential privacy, homomorphic encryption, and federated learning, to name but a few. The field of PETs is constantly evolving, so we invest heavily in R&D activities, striving to bring cutting-edge research into production.
A significant part of our work involves the research, prototyping, and productionalisation of algorithms for federated machine learning, in which statistical models and machine-learning algorithms are built on siloed datasets without ever moving or disclosing the original data. In this blog post, we are excited to share some of our most recent investigations into the cutting-edge deep-learning technique known as split learning.
We first provide context around the types of problems we are trying to solve, then introduce split learning and the most recent developments in this area, and finally discuss some of our own research, undertaken during participation in the Harvard University Privacy Tools Project – OpenDP summer fellowship programme, into augmenting split learning with additional privacy guarantees by means of differential privacy.
Private and public sector organisations hold vast amounts of data which are subject to data privacy legislation and regulatory constraints. In today’s world, it is critical that individual privacy and the protection of sensitive data are the highest priority for any data-driven organisation. At the same time, data are necessary for statistical analysis and modeling to derive insights, drive strategy, and aid in decision-making processes. As an example, in the context of health care, patient-level data can be used to create predictive models to assist in diagnosis and improve patient outcomes by means of optimising intervention and treatment, while at the same time safeguarding the privacy of the input data used to train such models.
In a nutshell, how can we enable practitioners to perform advanced analytics on sensitive (siloed) data whilst safeguarding, and not compromising, the original data points?
In general, there are two main approaches to this challenge. The first approach follows a “data to practitioner” architecture: data are held in safeguarded silos and queried to produce aggregate results exclusive of personally identifiable information (PII) coupled with additional privacy measures such as rounding, discretisation, and sometimes more advanced techniques based on differential privacy, which offer mathematically provable privacy guarantees. Such architectures typically expose an API for common aggregation functions, summary statistics, and even restricted SQL for ad-hoc queries.
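As a rough illustration of what a differentially private aggregate could look like, the following minimal sketch applies the Laplace mechanism to a counting query. It is not taken from any particular platform: the function names, the toy records, and the choice of a counting query (which has sensitivity 1) are assumptions made for this example.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent Exp(1) draws follows Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_count(rows, predicate, epsilon, rng=None):
    """Return a noisy count of the rows matching `predicate`.

    Adding or removing one individual changes a count by at most 1,
    so the Laplace noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy records held inside a silo.
patients = [{"age": age} for age in (34, 51, 67, 72, 45, 80)]
noisy = private_count(patients, lambda r: r["age"] >= 65,
                      epsilon=0.5, rng=random.Random(42))
print(round(noisy, 2))
```

Smaller values of epsilon give a larger noise scale, which means stronger privacy but a less accurate answer.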
Whilst this approach can go hand-in-hand with basic analytics, reporting, and to some extent, exploratory data analysis (EDA), more advanced use cases such as machine learning pipelines are limited to the granularity of data exposed by the API, which is typically highly aggregated and obfuscated. In addition, implementation of such architectures is highly challenging—an adversary may attempt to attack the API by crafting queries which may lead to a privacy violation or by launching sophisticated “membership inference attacks.”
The second approach follows a “model to data” architecture: again, data are held in safeguarded environments, but the practitioner sends their model to the data as opposed to sending data (aggregated or otherwise) to the practitioner. In this setting, the practitioner is effectively “working in the dark” with respect to the data. At most, they will have knowledge of the data schema, its metadata, the available features (e.g. columns), and perhaps some insight into the distribution of features obtained from the first approach.
An architecture implementing this approach will expect as input a defined model in the form of a logistic regression, decision tree, neural network, etc. The practitioner will define which features to feed into the model and which features the model should predict or classify (in the case of supervised learning). The model is then trained on the siloed data, returning model performance metrics and, optionally, the trained model parameters to the practitioner for offline inference.
The advantage of this approach is that the raw data can be used for modeling without ever moving or disclosing it in any form to the practitioner. Again, implementation is highly challenging. If the model parameters are disclosed, an adversary may launch a “white-box” attack, in which the original data points are reverse engineered from the learned model parameters. This is a particular concern with machine learning, since highly parameterised models are vulnerable to overfitting and potentially memorising data points. Even without direct access to the underlying model, an adversary can attempt “black-box” membership inference attacks (MIAs) by observing the model output in response to crafted input, a risk which is further exacerbated if the model prediction or class probabilities are visible.
Both of these approaches have their own strengths and weaknesses. However, the second (model to data) architecture is better suited to machine learning pipelines, given the opportunity for distributed computation and the fundamental property that data never leave their silo. It is this approach that we will dive into in this blog post. But first, to better illustrate the problem setting, we will summarise the two main types of data schema on which federated machine-learning models can be built.
Horizontally partitioned data
In the horizontal world, data are homogeneous and are divided between a number of silos sharing exactly the same schema. This may be intentional: if the data are very large, it may be necessary to divide them into smaller shards or host them within a cluster environment. The data may be geographically distributed, collected by different lines of business within an organisation, or may be the same kind of data collected by different organisations. For example, different hospitals collect the same set of medical data for their own patients.
If the data were nonsensitive and nonrestricted, it would be possible to simply append all of the data into a single centralised location/file and then build an ML model. However, in this setting, we must somehow build and train our ML model with the data remaining in situ.
This is the most well-known and researched setting in federated machine learning. A popular approach to training models in this distributed context is known as Federated Averaging (FedAvg), which we covered in a previous blog post. In FedAvg, the objective is to train a single model based on data from a number of silos. To achieve this, each data silo holds its own private set of model parameters (e.g. neural network weights). During training, each silo is instructed to train its private model on its data for a fixed amount of local training (e.g. a number of epochs). The parameters from each silo’s local model are then averaged to form a single shared model, which is then used to reinitialise the local model parameters on each of the silos. This process repeats, with the silos training in parallel, for a fixed number of federated rounds or until the model is considered converged.
In the FedAvg process, no raw data leave a silo; model parameters, however, do. The party responsible for coordinating the training process by averaging model parameters is trusted not to attempt to reverse engineer (invert) models so as to infer the original data. Furthermore, the data silos themselves are trusted not to attempt to infer anything from the averaged parameters they receive during each federated round.
How can we be sure that the model parameters being shared during the training process are secure, in the sense that it would not be possible to infer the original data by examining the parameters and how they change during training? One approach could be to add a certain level of noise to each of the private models before they are sent for averaging. We will explore such an approach based on differential privacy a bit further on when we apply it to a similar problem.
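The FedAvg loop described here can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not our production implementation: the model is a simple linear regression, the helper names (`local_train`, `fed_avg`, `make_silo`) are hypothetical, and the average is weighted by silo size, which is a common choice but only one of several possible.

```python
import random


def local_train(weights, data, lr=0.5, epochs=5):
    """Gradient descent for a linear model y ~ w.x on one silo's private data."""
    w = list(weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i, xi in enumerate(x):
                grad[i] += 2.0 * err * xi / len(data)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def fed_avg(silos, n_features, rounds=20):
    """Each round: train locally from the shared weights, then average."""
    global_w = [0.0] * n_features
    total_rows = sum(len(data) for data in silos)
    for _ in range(rounds):
        # Only parameters leave a silo, never the rows themselves.
        local_ws = [local_train(global_w, data) for data in silos]
        # Assumption: weight each silo's parameters by its number of rows.
        global_w = [
            sum(len(data) * w[i] for data, w in zip(silos, local_ws)) / total_rows
            for i in range(n_features)
        ]
    return global_w


rng = random.Random(0)


def make_silo(n_rows):
    """Hypothetical noise-free data following y = 2*x0 - 1*x1."""
    rows = []
    for _ in range(n_rows):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        rows.append((x, 2.0 * x[0] - 1.0 * x[1]))
    return rows


silos = [make_silo(n) for n in (30, 50, 40)]
global_w = fed_avg(silos, n_features=2)
print([round(w, 3) for w in global_w])
```

In a real deployment the list comprehension over `silos` would be replaced by remote calls, so that each silo trains in parallel on its own hardware.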
Vertically partitioned data
Unlike horizontal data, in the vertical world, data are heterogeneous. The data are divided between a number of silos, with each one holding different features (columns). This is akin to a standard relational database layout, in which tables can be joined on some key. However, in this instance, the tables live in different locations, and again, the data can never leave the respective silos.
This partitioning scheme presents a particularly challenging problem if we wish to build an ML model that takes as input features spanning multiple silos. The first challenge is to somehow line up or reorder the rows of data on each silo as if they had been joined by some key (“patient id” in the example above). If the key is nonsensitive and the silos share the same keys, for instance a unique identifier, this is a relatively trivial task. We could ask each silo for its list of keys (in order), compare the order of each list to a “primary” list (for instance, the first received), then send each silo a list of indices corresponding to the new row ordering.
However, what if the silos do not share the same unique identifier scheme, or if we wish to align the data by some “fuzzy” criteria? For example, our join criteria may be LastName, FirstName, and there may be spelling, abbreviation, and other discrepancies in these two columns between silos. Furthermore, LastName and FirstName are obviously PII-sensitive attributes, so we cannot extract them for comparison. Fortunately, there exist techniques to solve this problem. For instance, this paper describes a method for constructing comparable cryptographic hashes by means of a Bloom filter, which allows hashes of text to be compared using a similarity metric. A Python implementation can be found here. By using comparable cryptographic hashes in place of unique identifiers (keys) or fuzzy criteria, we can align data in the same way as before by asking each silo to send a list of its cryptographic hashes.
Having aligned/reordered the data on each silo so that the rows line up by some key or criteria, how can we use the data to build an ML model? The FedAvg algorithm from the horizontal data scenario is not applicable here because the features on each silo are different.
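For the simple case of shared, nonsensitive keys, the reordering step could be sketched as follows. The function name and the toy keys and rows are hypothetical, and the sketch assumes every silo holds exactly the same set of keys.

```python
def alignment_indices(primary_keys, silo_keys):
    """Indices that reorder a silo's rows to match the primary key order.

    Assumes both lists contain the same unique keys; a missing key
    raises KeyError rather than being silently skipped.
    """
    position = {key: i for i, key in enumerate(silo_keys)}
    return [position[key] for key in primary_keys]


# The coordinator treats the first list it received as the "primary" order.
primary = ["p1", "p2", "p3", "p4"]

# Hypothetical second silo with the same patients in a different order.
silo_b_keys = ["p3", "p1", "p4", "p2"]
silo_b_rows = [{"bp": 120}, {"bp": 135}, {"bp": 110}, {"bp": 128}]

order = alignment_indices(primary, silo_b_keys)  # [1, 3, 0, 2]
# The silo applies the indices locally; its rows never leave.
aligned = [silo_b_rows[i] for i in order]
print(order, aligned)
```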
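To give a feel for how comparable hashes behave, here is a heavily simplified sketch of a Bloom-filter encoding over character bigrams, compared with the Dice coefficient. It is not the API of the Python implementation mentioned above: the function names, filter size, number of hash functions, and shared secret are all assumptions chosen for illustration.

```python
import hashlib
import hmac


def bigrams(text):
    """Character bigrams of the padded, lower-cased text."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(text, secret, size=256, num_hashes=4):
    """Return the set of bit positions switched on for `text`.

    Each bigram is hashed `num_hashes` times with a keyed hash (HMAC),
    so only parties holding `secret` can produce comparable encodings.
    """
    bits = set()
    for gram in bigrams(text):
        for k in range(num_hashes):
            digest = hmac.new(secret, f"{k}:{gram}".encode(), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice(a, b):
    """Dice coefficient between two sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


secret = b"shared-secret"  # hypothetical key agreed between the silos
same = dice(bloom_encode("Stubbings", secret), bloom_encode("Stubbings", secret))
typo = dice(bloom_encode("Stubbings", secret), bloom_encode("Stubings", secret))
other = dice(bloom_encode("Stubbings", secret), bloom_encode("Gawron", secret))
print(same, round(typo, 2), round(other, 2))
```

Names that differ by a small typo share most of their bigrams and therefore most of their bit positions, whereas unrelated names overlap only through chance collisions. A matching threshold on the similarity score then decides which rows are treated as the same individual.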
A promising approach to this problem is to divide a model’s parameters over each silo and then fit the model by utilising homomorphic encryption to encrypt the partial gradient vectors exchanged between silos during training. This paper describes in detail an approach which trains a binary logistic regression model on data which have previously been aligned by cryptographic hash. The details of this algorithm are well beyond the scope of this blog post. However, during testing of our own implementation, which made use of Microsoft’s homomorphic encryption library (a great open source example can also be found in the FATE framework), we found that the IO overhead of the algorithm and the computational overhead of homomorphic encryption did not scale well for the applications we have in mind.
Given these scalability issues and insights gained from building our DataFleets federated machine-learning platform, which suggest that vertical partitioning scheme challenges are most frequently found in the wild, we have been investigating the most recent research developments in this area. We will dive into the most exciting and promising of these (in our humble opinion) next.
Enter: split learning
Split learning is a recent federated learning technique for training deep neural networks on horizontally and vertically distributed datasets. In essence, the idea is to take a deep neural network and split it up into modules which live locally on data silos. During training, these modules compute outputs given their local data and then feed those outputs forward to a coordinator neural network, which takes the concatenated module outputs as input to produce a prediction. The loss with respect to the coordinator prediction is then used to calculate the coordinator gradient back to the input layer of the coordinator network. This gradient is sent back to the module networks, which use it to perform module-level back-propagation.
For example, the diagram above shows a vertically partitioned split neural network. In this scenario, there exist two data silos holding their own distinct columns (features) of data. Each silo has been configured to hold a local (private) network, which takes as input the features available on the silo and produces outputs, which are later used by the coordinator as input to its own network. In order to calculate the loss during training, the coordinator network must have access to the target variable y, which it may hold locally or which may be sent from one of the silo modules.
During the feed-forward stage, each silo is asked by the coordinator to feed forward their local data through their network submodules to produce output. The output obtained from each silo is referred to as smashed data. It is not the raw data itself, but instead an intermediate representation of the data, given the layers, weights, and activation functions defined in the silo’s network.
The in-built privacy mechanism is based on the idea that if the input data are significantly far away from the smashed data output layer, and the subnetworks are highly parameterised, then it would be difficult for an adversary who may have access to the coordinator to reverse engineer the smashed data into original form by guessing the submodule parameters.
During the back-propagation stage, the coordinator calculates the overall loss with respect to the target output y and then back-propagates the gradients up until its input layer, known as the cut-layer. Each silo subnetwork will then use the cut-layer gradient to further back-propagate on their own network(s), which completes a single round of training. Again, since y is sufficiently far away from the cut-layer, it would be difficult for the data silos to reverse engineer each other’s raw data, given the cut-layer gradient.
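The feed-forward and back-propagation stages can be sketched in memory with plain Python. This is a deliberately tiny illustration under several assumptions, not our prototype: each silo holds a single tanh layer, the coordinator is a logistic regression that holds the labels itself, training uses per-row stochastic gradient descent, and all class and function names are hypothetical.

```python
import math
import random


class SiloModule:
    """A private one-layer network: smashed = tanh(W x + b)."""

    def __init__(self, n_in, n_out, rng):
        self.W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        self.b = [0.0] * n_out

    def forward(self, x):
        self.x = x
        self.h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.W, self.b)]
        return self.h  # only this smashed data leaves the silo

    def backward(self, grad_h, lr):
        # Continue back-propagation locally from the cut-layer gradient.
        for j, g in enumerate(grad_h):
            gz = g * (1.0 - self.h[j] ** 2)  # derivative of tanh
            for i, xi in enumerate(self.x):
                self.W[j][i] -= lr * gz * xi
            self.b[j] -= lr * gz


class Coordinator:
    """Logistic regression over the concatenated smashed data."""

    def __init__(self, n_in, rng):
        self.w = [rng.uniform(-0.5, 0.5) for _ in range(n_in)]
        self.b = 0.0

    def step(self, smashed, y, lr):
        z = sum(w * s for w, s in zip(self.w, smashed)) + self.b
        p = 1.0 / (1.0 + math.exp(-z))
        loss = -(y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12))
        dz = p - y  # gradient of the log loss with respect to z
        cut_grad = [dz * w for w in self.w]  # gradient at the cut layer
        self.w = [w - lr * dz * s for w, s in zip(self.w, smashed)]
        self.b -= lr * dz
        return loss, cut_grad


def train_round(silos, coordinator, rows, labels, lr=0.1):
    """One pass over aligned rows; rows[k][n] is silo k's slice of row n."""
    total = 0.0
    for n, y in enumerate(labels):
        outputs = [silo.forward(rows[k][n]) for k, silo in enumerate(silos)]
        smashed = [v for out in outputs for v in out]
        loss, cut_grad = coordinator.step(smashed, y, lr)
        total += loss
        offset = 0
        for silo, out in zip(silos, outputs):
            # Each silo receives only its own slice of the cut-layer gradient.
            silo.backward(cut_grad[offset:offset + len(out)], lr)
            offset += len(out)
    return total / len(labels)


rng = random.Random(0)
silo_a_rows = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
silo_b_rows = [[rng.uniform(-1, 1)] for _ in range(200)]
# Hypothetical label depending on features held by both silos.
labels = [1.0 if a[0] + b[0] > 0 else 0.0 for a, b in zip(silo_a_rows, silo_b_rows)]

silos = [SiloModule(2, 2, rng), SiloModule(1, 2, rng)]
coordinator = Coordinator(4, rng)
losses = [train_round(silos, coordinator, [silo_a_rows, silo_b_rows], labels)
          for _ in range(30)]
print(round(losses[0], 3), round(losses[-1], 3))
```

Note that the coordinator never sees the silo weights or the raw features, and each silo never sees the labels or the other silo's data. Only smashed data and cut-layer gradients cross the boundary.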
Compared with the federated averaging and homomorphic-encryption-based logistic regression methods mentioned previously, split learning offers some significant advantages. First, the subnetwork model weights stay private to the data silos, which is not the case in federated averaging. Consequently, should the coordinator be compromised, white-box model inversion attacks on the subnetworks are not possible, since the coordinator only receives smashed data. Second, the messaging/network overhead of this approach is significantly lower than that of federated averaging, since the neural network parameters never leave the silos. Finally, the distributed layout of the split neural network allows for parallel computation of the silo subnetworks, which may also be configured with different topologies; this is particularly useful if the data silos live on low-power/hardware-limited devices.
Combining differential privacy with split learning
Whilst split learning provides an intuitive level of data privacy based on the idea that only intermediate representations/smashed data are transmitted over the wire, it is difficult to quantify how much protection the method provides. The main principle behind smashed data is that the output nodes of each of the silo’s subnetworks are “sufficiently” far from the original data in terms of deep neural network layers. One way to think about these subnetworks is as feature encoders bearing similarities with auto-encoders, with the coordinator network ultimately learning to predict some variable based on the encoded output of the subnetworks.
How can we be certain that the smashed data do not leak any of the original data, and furthermore, if we were to “release” all of the learned model parameters, how can we be sure that the combined model has not memorised/overfit the original data points?
Various follow-up studies to the split learning method have indeed shown that under certain conditions, it is possible to leak original data in the smashed layer output. Various methods have been proposed to circumvent this issue, such as “NoPeekNN,” which ensures that the smashed data are maximally dissimilar to the original data by means of augmenting the loss function with similarity metrics. Other studies have suggested that adding noise to the smashed data can also improve the privacy of the method.
In our own federated machine-learning platform, we aim to allow a practitioner to run inference on a trained model using new/previously unseen data instances and to “release” a model for offline inference, which involves full disclosure of the trained model parameters.
One of the main problems with machine-learning models, and with deep learning in particular due to the large parameter space, is the tendency to overfit the data during training. There are of course many techniques to overcome this phenomenon, such as regularisation and early stopping. However, we would like to provide a quantifiable guarantee as to the level of privacy a released model affords.
With federated averaging, we have previously made use of TensorFlow Privacy as a means to introduce parameterised noise with differential privacy guarantees during the model optimisation process. How can we apply differential privacy to split learning?
To test the idea, we have built a differentially private split learning prototype for horizontal and vertical use cases in PyTorch, based on the PyTorch distributed RPC framework. This framework fits nicely with split learning, as it implements common messaging patterns and exposes a distributed autograd and distributed optimiser API, which decouples complicated messaging patterns from the algorithm implementation. In addition, we developed an in-memory implementation, which we used to experiment with some of the more recent split learning developments, such as the NoPeekNN method mentioned earlier.
To train differentially private split neural networks, we have made use of the Opacus framework (similar to TensorFlow Privacy) as a means to introduce noise and gradient clipping during the optimisation process.
In our prototype, each silo holds its own local instance of a PyTorch optimiser, which has been extended with the Opacus differentially private optimisation algorithm. During training, noise is added to the gradients prior to updating the weights of the silo networks, such that the individual silo networks can be considered to be epsilon-differentially private with respect to the data held on each silo. In other words, we do not train a split neural network with a single epsilon privacy guarantee, but rather train a collection of subnetworks, each with its own isolated epsilon guarantee. The advantage of this scheme is that the level of privacy can be adjusted on a case-by-case basis depending on the privacy requirements of each participating data silo.
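The per-silo noise step can be illustrated with a minimal sketch of the general DP-SGD recipe (per-example gradient clipping followed by Gaussian noise), which libraries such as Opacus automate. The code below does not use the Opacus API and is not our prototype: the function names, the toy gradients, and the per-silo noise multipliers are assumptions, and the privacy accounting needed to turn a noise multiplier into an epsilon value is omitted.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a per-example gradient so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grad]


def dp_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    summed = [sum(column) for column in zip(*clipped)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]


# Hypothetical per-example gradients computed inside one silo's subnetwork.
grads = [[3.0, 4.0], [0.3, 0.4]]

# Each silo chooses its own noise level (hypothetical values); a higher
# multiplier means more noise and therefore a smaller epsilon for that silo.
silo_noise = {"silo_a": 0.5, "silo_b": 2.0}

rng = random.Random(0)
for name, multiplier in silo_noise.items():
    update = dp_gradient(grads, max_norm=1.0, noise_multiplier=multiplier, rng=rng)
    print(name, [round(u, 3) for u in update])
```

Because clipping and noise are applied locally before any weight update, each silo's guarantee depends only on its own settings and data, which is what allows the privacy level to be tuned per silo.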
We have performed a series of experiments to illustrate the trade-off between privacy and utility when training split neural networks with and without DP. We used the credit scoring dataset from a previous blog post where features from x0 up to x9 describe an individual applying for a loan. The Boolean label y indicates whether or not an applicant had later defaulted.
Multiple experiment runs were executed per single data point to average out the randomness coming from the neural network training and the application of DP noise. Each dot in the graph below represents training and evaluation of a single SplitNN model. The color of the dot indicates the actual split performed (e.g. blue means the features were split into four components).
Note that epsilon equal to 0 on the graphs means that no DP was applied. Training sometimes failed to converge, so we filtered out such cases in order to have fewer outliers on the diagrams (we dropped any result with accuracy lower than 0.6).
As expected, the average accuracy increased with higher levels of epsilon (increasingly lower privacy) and decreased with lower levels of epsilon (increasingly higher privacy). As well as providing validation of our implementation as per the expected behaviour, this observation raises an important point: there is a trade-off between privacy and utility which must be adjusted according to the intended use of the model.
Another visible trend concerned the number of splits: more splits led to lower accuracy. In other words, heavy feature fragmentation amongst data silos may have a negative effect on model performance. This is likely due to the fact that if we exported a SplitNN as a monolithic model, the first layers of the network would not be fully interconnected.
Finally, the model training time was heavily impacted by the number of splits, as well as by the application of DP. To mitigate this effect, client feed-forward and back-propagation computations could be executed concurrently.
In this post, we introduced split neural networks—a promising federated learning method that can be applied in both horizontal and vertical distributed data settings. We also introduced differential privacy into the learning protocol to provide further privacy guarantees at the expense of model accuracy.
In a follow-up post, we will dive deeper into the world of split learning by attempting to attack split neural networks with various state-of-the-art methods and illustrate the strengths and weaknesses from a privacy perspective.
Thanks for reading!
Further reading – code and libraries
CLK Hash – Python implementation of cryptographic long-term key hashing
Microsoft SEAL – Homomorphic encryption library
Further reading – papers and articles
LiveRamp Engineering Blog: Federated Learning for Credit Scoring
MIT Media Lab’s split learning project page