This article is an extension of my previous article “What is Federated Learning.” I will focus on how you can use Federated Learning for Security and Privacy.
We’re seeing an increased focus by consumers and policymakers on privacy in the collection and use of data. In 2018, the General Data Protection Regulation (GDPR) took effect in Europe, affecting enterprises doing business in the European Union. GDPR requires enterprises to be more careful about how they collect, store, use, and transfer customer data. In the United States, the California Consumer Privacy Act (CCPA) has also come into effect. Under the CCPA, California residents have the right to ask enterprises to disclose the data they hold on them and to request its deletion.
When it comes to AI (machine learning and deep learning), training models requires a sufficient amount of data, which often includes personal data. As data privacy and security become a growing concern, given the above-mentioned legislation and policies, new machine learning methodologies like federated learning (FL) have been developed in part to address these concerns.
Overview of Federated Learning Security and Privacy
Google introduced the idea of federated learning in 2017. The key ingredient of federated learning is that it enables data scientists to train shared statistical models across decentralized devices or servers, each holding its own local data set. This means that although data scientists train the same model, there is no need to upload private data to the cloud or exchange data with other data scientists or research teams. Compared to traditional centralized machine learning techniques that require data sets to reside on a single server, federated learning reduces data security and privacy concerns by keeping the data where it was collected.
FL has received a lot of attention for the way it tackles the challenge of protecting user privacy: it decouples the data held on end-user equipment from the aggregation of machine learning models (for example, the network parameters of a deep learning model) at a centralized server. The central aim of FL is to cooperatively learn a global model without directly sacrificing data privacy. In particular, FL has distinct privacy advantages compared to training on a data set held in a data center. Even holding an “anonymized” data set at a server can still put client privacy at risk via linkage to other data sets. In contrast, the information transmitted for FL consists of minimal updates to improve the accuracy of a particular machine learning model. The updates themselves can be ephemeral, and will never contain more information than the raw training data.
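To make the training loop concrete, here is a minimal sketch of one common aggregation scheme, federated averaging. It is an illustration under stated assumptions, not a production implementation: the one-dimensional linear model, the synthetic client data, the learning rate, and the function names (`local_update`, `federated_average`, `make_client_data`) are all hypothetical choices made for this example. The point to notice is that only model weights cross the client/server boundary, while each client's data stays local.

```python
import random

def local_update(weights, data, lr=0.1, epochs=5):
    """Client step: a few epochs of gradient descent on private local data.

    `data` is a list of (x, y) pairs for the model y ~ w * x + b.
    Only the updated weights are returned; the data never leaves the client.
    """
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)

def federated_average(client_weights, client_sizes):
    """Server step: average client weights, weighted by local data set size."""
    total = sum(client_sizes)
    w = sum(cw[0] * n for cw, n in zip(client_weights, client_sizes)) / total
    b = sum(cw[1] * n for cw, n in zip(client_weights, client_sizes)) / total
    return (w, b)

def make_client_data(rng, n, lo, hi):
    """Hypothetical private data set drawn from y = 3x + 1 plus small noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(lo, hi)
        data.append((x, 3 * x + 1 + rng.gauss(0, 0.01)))
    return data

rng = random.Random(0)
# Three clients whose data covers different input ranges (assumed for the demo).
clients = [
    make_client_data(rng, 20, -1, 0),
    make_client_data(rng, 30, 0, 1),
    make_client_data(rng, 50, -1, 1),
]

global_weights = (0.0, 0.0)
for _ in range(200):
    # Each client trains locally, starting from the current shared model.
    updates = [local_update(global_weights, data) for data in clients]
    # The server sees only the weights, never the (x, y) pairs.
    global_weights = federated_average(updates, [len(d) for d in clients])

print(global_weights)
```

In this toy setup the shared weights should end up close to the slope and intercept used to generate the data, even though no client ever shares its samples.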
As a sample use case, NVIDIA recently introduced FL on its autonomous driving platform. Because geographical landscapes and potential driving situations differ across regions, OEMs need to train their models individually with various driving data sets. The company’s DGX edge platform will be able to retrain the shared models at each OEM with local data. The local training results can then be sent back to the FL server over a secure link to update the shared model.
Federated Learning Security and Privacy Challenges
There are a number of privacy and security challenges and concerns associated with the use of FL. Privacy concerns serve to motivate the desire to keep raw data on each local device in a distributed machine learning setting. But sharing other information such as model updates as part of the training process brings up another concern, namely the potential to leak sensitive user information. For example, it’s possible to extract sensitive text patterns, such as a credit card number, from a recurrent neural network (RNN) trained on user data.
To support machine learning that works to preserve privacy, a number of approaches have been employed leading up to FL:
- One method is differential privacy, where a randomized mechanism is considered differentially private if changing one input element leads to only a small difference in the output distribution. This means that one is unable to draw any conclusions about whether or not a specific sample was used in the learning process. For gradient-based learning methods, a common approach is to apply differential privacy by randomly perturbing (e.g. using Gaussian noise) the intermediate output at each iteration. Of course, there is an intrinsic trade-off between differential privacy and a high level of model accuracy: adding more noise results in greater privacy, but may compromise accuracy.
- Another method for securing the learning process is homomorphic encryption where computing is done on encrypted data.
- Lastly, there is secure multiparty computation (SMC) that enables multiple parties to collaboratively compute an agreed-upon function without leaking input information from any party except for what can be inferred from the output.
The downside with the above approaches is that they may not scale well for some large-scale machine learning deployments, since they incur substantial communication and compute costs.
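As an illustration of the gradient perturbation described in the differential privacy item above, the following sketch adds Gaussian noise to a batch gradient. Clipping each per-example gradient to a maximum L2 norm before adding noise is an assumption added here (it bounds how much any single sample can influence the result, which the noise scale then has to cover); the names `clip_by_l2_norm`, `privatize_gradient`, `max_norm`, and `noise_multiplier` are hypothetical. Translating a noise level into a formal privacy guarantee requires a privacy accounting step that is not shown.

```python
import math
import random

def clip_by_l2_norm(vector, max_norm):
    """Scale `vector` down so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm <= max_norm:
        return list(vector)
    scale = max_norm / norm
    return [v * scale for v in vector]

def privatize_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, then average.

    The noise standard deviation is noise_multiplier * max_norm (an assumed
    parameterization): more noise means more privacy but less accuracy.
    """
    dim = len(per_example_grads[0])
    clipped = [clip_by_l2_norm(g, max_norm) for g in per_example_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]

rng = random.Random(42)
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]  # made-up per-example gradients
private_grad = privatize_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
print(private_grad)
```

Raising `noise_multiplier` makes any single example harder to detect in the output, at the cost of a noisier gradient, which is the privacy/accuracy trade-off noted above.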
FL goes a step further and poses new privacy challenges for distributed machine learning algorithms. Privacy solutions for FL should be computationally inexpensive, communication-efficient, and tolerant of dropped devices, all without compromising accuracy to any great degree.
With FL, privacy can be classified in two ways: global privacy and local privacy. Global privacy requires that the model updates generated at each round are private to all untrusted third parties other than the central server. Local privacy further requires that the updates are also private to the server.
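One way to picture the difference is by where a perturbation is applied. The sketch below is a simplified illustration, not a complete privacy mechanism: it assumes Gaussian noise as the perturbation, and the function names and the noise scale `sigma` are hypothetical. In the global setting the server is trusted with raw updates and perturbs only the aggregate it releases; in the local setting each client perturbs its own update before upload, so the server never sees a raw update.

```python
import random

def add_noise(vector, sigma, rng):
    """Return a copy of `vector` with independent Gaussian noise per entry."""
    return [v + rng.gauss(0.0, sigma) for v in vector]

def round_with_global_privacy(client_updates, sigma, rng):
    """Trusted server: receives raw updates, perturbs the released aggregate."""
    n = len(client_updates)
    dim = len(client_updates[0])
    total = [sum(u[i] for u in client_updates) for i in range(dim)]
    noisy_total = add_noise(total, sigma, rng)  # noise added once, by the server
    return [v / n for v in noisy_total]

def round_with_local_privacy(client_updates, sigma, rng):
    """Untrusted server: every client perturbs its update before uploading."""
    n = len(client_updates)
    dim = len(client_updates[0])
    uploads = [add_noise(u, sigma, rng) for u in client_updates]  # on-device
    return [sum(u[i] for u in uploads) / n for i in range(dim)]

def mean_abs_error(round_fn, updates, target, sigma, rng, trials=200):
    """Average absolute error of the aggregated update over repeated rounds."""
    err = 0.0
    for _ in range(trials):
        est = round_fn(updates, sigma, rng)
        err += sum(abs(e - t) for e, t in zip(est, target)) / len(target)
    return err / trials

rng = random.Random(7)
target = [1.0, -2.0]
updates = [list(target) for _ in range(100)]  # 100 clients, identical updates

global_err = mean_abs_error(round_with_global_privacy, updates, target, 1.0, rng)
local_err = mean_abs_error(round_with_local_privacy, updates, target, 1.0, rng)
print(global_err, local_err)
```

With the same `sigma` and n clients, the averaged result carries noise with standard deviation sigma/n in the global variant and sigma/sqrt(n) in the local one, so the local variant is noisier. Equal `sigma` does not imply equal privacy guarantees; the comparison only shows why protecting updates from the server tends to cost more accuracy.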
Current work on security and privacy for FL builds on the methods above, such as SMC and differential privacy. One approach uses an SMC protocol to protect individual model updates. Here, the central server is unable to see any local updates, but can see the exact aggregated results at each round. SMC is a lossless method, and can preserve the original accuracy with a guarantee of very high privacy. The downside, however, is a high extra communication cost. Another approach applies differential privacy to FL and achieves global differential privacy. These approaches include a number of hyperparameters that affect communication and accuracy and must be chosen carefully.
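The SMC idea of hiding individual updates while revealing their sum can be sketched with pairwise additive masks. This is a toy simulation under several assumptions, not a real protocol: the masks are generated in one place for convenience (in practice each pair of clients would agree on its mask through a key exchange), dropped clients are not handled, and the names, `MODULUS`, and the fixed-point `SCALE` are hypothetical. The fixed-point encoding also rounds values, so the sum is exact only up to that precision.

```python
import secrets

MODULUS = 2 ** 32   # all arithmetic is done modulo this value (assumed size)
SCALE = 10 ** 4     # fixed-point scale used to turn floats into integers

def encode(update):
    """Map a float vector to integers modulo MODULUS."""
    return [round(v * SCALE) % MODULUS for v in update]

def decode(values):
    """Map integers modulo MODULUS back to signed floats."""
    out = []
    for v in values:
        if v >= MODULUS // 2:
            v -= MODULUS
        out.append(v / SCALE)
    return out

def pairwise_masks(num_clients, dim):
    """masks[(i, j)] is the random vector shared by clients i and j, i < j."""
    return {
        (i, j): [secrets.randbelow(MODULUS) for _ in range(dim)]
        for i in range(num_clients)
        for j in range(i + 1, num_clients)
    }

def mask_update(client_id, update, masks, num_clients):
    """Client side: add masks shared with higher ids, subtract those with lower."""
    masked = encode(update)
    for other in range(num_clients):
        if other == client_id:
            continue
        pair = (min(client_id, other), max(client_id, other))
        sign = 1 if client_id < other else -1
        masked = [(m + sign * s) % MODULUS for m, s in zip(masked, masks[pair])]
    return masked

def aggregate(masked_updates):
    """Server side: sum masked uploads; every pairwise mask cancels out."""
    dim = len(masked_updates[0])
    totals = [sum(u[i] for u in masked_updates) % MODULUS for i in range(dim)]
    return decode(totals)

updates = [[0.5, -1.25], [1.0, 2.0], [-0.25, 0.75]]  # made-up client updates
masks = pairwise_masks(len(updates), dim=2)
uploads = [mask_update(i, u, masks, len(updates)) for i, u in enumerate(updates)]
print(aggregate(uploads))
```

Each upload looks like random integers to the server, yet the masks cancel in the sum, so the server recovers the aggregate of the three updates. The number of masks grows with the number of client pairs, which hints at the extra communication cost mentioned above.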
For situations where strong privacy guarantees are essential, new methods introduce a relaxed version of local privacy by limiting the power of potential adversaries. This approach enables stronger privacy guarantees than global privacy, and leads to better model performance than strict local privacy. Further, differential privacy can be combined with model compression methods to simultaneously reduce communication cost and obtain privacy benefits.
Future Directions for FL Security/Privacy
FL is a fertile area of research in machine learning. Researchers are working hard to further develop the methodology’s ability to address privacy and security needs. For example, the outline of privacy discussed above covers privacy at a local or global level with respect to all devices in the network. In practice, however, it may be necessary to define privacy at a more granular level, because privacy constraints may differ across devices or even across data points on a single device. One proposal is to use sample-specific privacy guarantees instead of user-specific ones, thus providing a weaker form of privacy in exchange for more accurate models. Developing methods to handle mixed device-specific or sample-specific privacy restrictions seems to hold promise.
Another future direction is parallel training. Enabling parallel training of deep learning models on distributed data sets while preserving data privacy is complex and challenging. One group of researchers has developed a federated learning framework, FEDF, that couples privacy preservation with parallel training. The framework allows a model to be learned on multiple geographically distributed training data sets (which may belong to different owners) without revealing any information about each data set or the intermediate results.
Conclusion
In this article, we examined security and privacy with respect to federated learning, an important new technique that’s growing in popularity for distributed machine learning. As the importance of security and privacy for machine learning models grows with new policies like GDPR and CCPA, new methodologies like federated learning hold much promise. Federated learning is able to address many important concerns about personal data by training shared statistical models across decentralized devices or servers, each with a local data set.