This post was originally published by Digital Catapult at Medium [AI]
A topic of growing interest, federated learning can be associated with data privacy, distributed systems and machine learning, but what is it?
Federated learning is a particular approach for training machine learning algorithms in a way that means data stays private. Specifically, federated learning (FL) techniques aim to train machine learning (ML) algorithms across multiple, distributed devices or servers, each holding their own local and private data.
This collaborative approach contrasts with traditional machine learning techniques, which are centralised in nature and rely on all data samples to be gathered in one unique dataset before being used. It also differs from techniques based on parallel computation, which are devised to optimise computation for ML over multiple processors, using a centralised dataset that is split into identically distributed subsets for computation.
FL hence offers a broader paradigm for implementing ML solutions, essentially providing more flexibility in how the data can be managed. FL is not restricted to specific ML algorithms and can be used in a variety of contexts. It primarily adapts how the training procedures for those algorithms are implemented, and it can be considered for both offline and online learning (for example, training once on a static dataset, or training continuously on newly arriving data). It follows that FL is not one unique method: depending on the ML technique employed, the type of data and the operational context, a different strategy will be preferable.
Some simple and intuitive FL methods have proven to be surprisingly efficient solutions in practical applications, one such example being the federated averaging algorithm. It consists of averaging, at regular intervals, the weights of the neural networks trained by the different FL participants, called workers, on their local data subsets, in order to update a global model. The local neural networks are then updated with this new global model for further training. The learnings obtained from each local dataset are progressively shared across all the workers as the global model is updated. This is the method we applied to an image classification use case in the agricultural application presented in more detail below.
See Figure 1 below for an illustration of the federated learning principles in this case.
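To make the procedure concrete, here is a minimal sketch of federated averaging in Python with NumPy. The function names (`local_update`, `federated_averaging`) and the single-gradient-step local training on a least-squares objective are illustrative assumptions, not the implementation used in the work described below:

```python
import numpy as np

def local_update(weights, data, lr=0.1):
    """One round of local training on a worker's private data.
    Placeholder objective: a single gradient step on least squares."""
    X, y = data
    grad = X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

def federated_averaging(global_weights, worker_datasets, rounds=10):
    """Each worker trains locally from the current global model; the
    server then averages the resulting weights, weighted by each
    worker's dataset size, to produce the next global model."""
    for _ in range(rounds):
        sizes = [len(data[1]) for data in worker_datasets]
        total = sum(sizes)
        local = [local_update(global_weights.copy(), data)
                 for data in worker_datasets]
        global_weights = sum((n / total) * w for n, w in zip(sizes, local))
    return global_weights
```

In this sketch the raw data `(X, y)` never leaves a worker; only trained weights are exchanged, which is the core of the privacy argument made above.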
Machine learning applications require vast amounts of data. Acquiring sufficient data to solve a specific problem with ML can be challenging, time-consuming and costly. In practice, the data generated is often not centralised but dispersed, and must first be gathered from many sources before it can be used. In addition, the data acquired by an organisation can have a specific distribution that may not support the development of models that generalise well. For example, hospitals in different regions or countries may have different distributions of patient profiles and pathologies, and would benefit from collaborating to develop ML applications that serve all their patients equally. While these challenges could in theory be addressed through collaboration between organisations sharing a common problem and interest, such collaboration is in practice very hard to achieve when it comes to exchanging data and information.
The federated learning approach enables the collaborative development of more robust and performant machine learning models, while addressing critical issues such as data transfer, privacy, and security for each individual participant.
Because in a FL system each participant’s data is never transferred and remains under the participant’s control, it provides a solution to the problem of privacy preservation when organisations consider collaborating. This privacy-by-design characteristic is one of the main advantages of FL. Data privacy matters both when different organisations consider collaborating on a common set of problems, and when protecting sensitive and personal data is crucial but using this data would also be beneficial, as in the case of health data. With FL, security and access to the data can be managed locally, under the participant’s own security and privacy requirements. The potential is huge, as privacy and ownership concerns are among the main barriers to data sharing.
In a FL system, there is no assumption regarding the distribution of the data within each participant’s dataset, nor regarding the size of each of the distributed datasets. This flexibility is another key advantage of the technique. It allows different participants, with different volumes or distributions of data and varying capabilities, to collaborate on the training of ML models. Having heterogeneous datasets potentially helps build models that generalise better. Each participant can then benefit from models trained on a richer and broader range of data, at no additional cost for acquiring this data.
Federated learning, with its many advantages, has applications in many industries:
A canonical example is Google’s Gboard, where FL has been used to train the predictive keyboard algorithm directly on millions of users’ smartphones, without uploading any of the users’ interactions or private text messages to the cloud. Instead, neural network models were sent to the users’ devices, trained locally on data stored on the devices, and sent back, with additional privacy-preserving methods, to be averaged with thousands of other devices’ updates. Despite the stringent technical constraints of having to train models directly on devices without affecting the user experience, the obligation to preserve users’ privacy provided a compelling use case for FL. For the user, the improved and personalised keyboard experience provides an incentive for agreeing to participate in the model training.
Some data can be even more sensitive than mobile keyboard interactions. Personal healthcare data must be managed with the highest consideration for privacy and security. However, this data is also invaluable for developing useful new AI applications, and FL offers a technical solution to that challenge: hospitals can now collaborate with their data, such as medical imaging for automated cancer diagnostic systems, so that researchers can use larger databases capturing a broader spectrum of cases and pathological patterns. In the UK, NVIDIA is partnering with King’s College London and Owkin to create a federated learning platform for the National Health Service, which enables algorithms to travel from one hospital to another, training on local datasets.
Data-intensive applications are another promising domain for FL: self-driving cars rely on many onboard ML technologies, such as computer vision with cameras or lidar, that consume large volumes of sensing data. Compared to classic centralised cloud approaches, federated learning can limit the volume of data transfer needed while allowing real-time, continuous improvement of these applications. This is an example of a distributed edge computing application.
More generally, since FL enables collaboration without actually requiring the transfer of data, it opens opportunities for cooperation between organisations. Problems shared across an industry are good candidates for using FL to develop more robust and effective machine learning solutions. Sectors likely to gain from such collaborations include supply chains, manufacturing, energy distribution and transport. The data collection burden can be spread between the participating organisations while they collectively benefit from improved solutions. This also opens the way for new business models for exploiting and managing access to data.
While the FL setting provides solutions to some practical ML problems, as well as new opportunities for ML applications, it is important to mention the potential challenges that a FL framework presents.
Because they are based on a distributed network, FL applications have to address the risk of attacks on, or failures of, numerous workers. Attacks on a federated learning setting can take different forms. An attack could originate from a participant altering the data used to train the ML model, or altering the model itself, with the potential to compromise the global model. It can also be an attempt by a participant or the server to infer data about other workers from the model updates it receives. Failures, in contrast, are non-malicious in nature but can also adversely affect the performance of the FL process. Network unreliability and the limited availability, unresponsiveness or drop-out of workers are more prevalent problems in a distributed setting, and FL implementations should be designed to be robust to these threats.
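As a simple illustration of how a server can mitigate corrupted updates, one well-known robust aggregation rule replaces the plain average with a coordinate-wise median, which tolerates a minority of malicious workers. This is a sketch, and `median_aggregate` is a hypothetical name, not part of any specific FL framework:

```python
import numpy as np

def median_aggregate(updates):
    """Coordinate-wise median of worker updates: unlike the mean, a
    single extreme (possibly malicious) update cannot drag the
    aggregate arbitrarily far, as long as most workers are honest."""
    return np.median(np.stack(updates), axis=0)
```

With three honest updates near `[1, 1]` and one corrupted update of `[100, -100]`, the mean is pulled far from the honest consensus while the median stays close to it.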
From an ethical point of view, concepts such as bias and fairness, while not specific to FL and relevant to most ML applications, can be more challenging to address due to the private nature of the distributed datasets, and require specific attention. However, as noted before, this challenge comes with the potential benefit of leveraging federated learning to increase data diversity.
Solutions to these challenges exist and are topics of active research. Although they are out of the scope of this article, it is worth mentioning differential privacy and homomorphic encryption as examples of privacy preserving techniques suitable for FL applications. AI Ethics is a topic central to Digital Catapult’s objective of accelerating the ethical and responsible adoption of AI, and we are keen to support innovation and experimentation to address those challenges, and to explore applications on FL, which we started with a demonstrator project.
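To give a flavour of the differential privacy techniques mentioned above, here is a minimal sketch of clipping and noising a model update before it leaves a worker. The function name and parameters are hypothetical, and calibrating the noise to a formal (epsilon, delta) guarantee is deliberately omitted:

```python
import numpy as np

def dp_sanitize(update, clip_norm=1.0, noise_std=0.1, rng=None):
    """Bound each worker's influence by clipping the update's L2 norm,
    then add Gaussian noise before sharing it with the server, in the
    spirit of differentially private federated learning. Choosing
    noise_std to meet a formal privacy budget is not covered here."""
    if rng is None:
        rng = np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    return clipped + rng.normal(scale=noise_std, size=update.shape)
```

The clipping step caps how much any single participant can shift the global model; the noise then masks the remaining individual contribution.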
As a demonstrator for federated learning applications, Digital Catapult has chosen to ground our work in a use case stemming from our observations in prior work with the agricultural sector. Crop scouting and disease outbreak detection is an area where data could be leveraged to generate actionable insights for growers, improving productivity, reducing costs and enabling more environmentally friendly treatment methods. In practice, gathering consistent data in quantity is costly and impractical for growers, because of the manual work involved, the lack of adequate tools and the expertise required. The diversity and range of crops and diseases observable by any single grower is also limited. This operational data is also viewed as sensitive, and sharing it represents a competitive risk, so attempts to bring consortia of growers together to pool their private data have proven difficult. This makes it a perfect use case for FL.
Using an open-source image dataset named PlantVillage, we simulated a collaborative ensemble of growers, shown in Figure 1, contributing to the development of a global model with their own private datasets of crop leaves presenting symptoms of various diseases. Using our own platform-independent implementation of FL and the federated averaging algorithm, we demonstrated the potential for FL to achieve state-of-the-art performance on this computer vision task. Each grower would consequently benefit from the increased performance and better generalisation power of the ML model, with a solution running on small and inexpensive edge computing devices, NVIDIA Jetson Nanos. We will dive into the technical details of this FL library in an upcoming blog post.
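Simulating growers from a single dataset amounts to partitioning it into non-identically distributed shards, since each grower observes only a limited range of crops and diseases. The following is a hypothetical sketch of such a label-skewed split (the function name and parameters are illustrative, not the code used in the demonstrator), assuming integer class labels:

```python
import numpy as np

def label_skewed_split(labels, n_workers, classes_per_worker=2, rng=None):
    """Partition example indices into non-IID shards: each worker
    receives data from only a few classes, mimicking growers who each
    see a limited range of crops and diseases."""
    if rng is None:
        rng = np.random.default_rng(0)
    classes = np.unique(labels)
    shards = [[] for _ in range(n_workers)]
    for w in range(n_workers):
        # Each simulated grower is assigned a small subset of classes.
        chosen = rng.choice(classes, size=classes_per_worker, replace=False)
        for c in chosen:
            idx = np.where(labels == c)[0]
            take = rng.choice(idx, size=len(idx) // n_workers, replace=False)
            shards[w].extend(take)
    return [np.array(s) for s in shards]
```

Training on shards produced this way exercises exactly the heterogeneity that FL makes no assumptions about: workers with different sizes and distributions of data.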
Federated learning has received increasing attention in recent years, and for good reasons: a flexible paradigm for implementing machine learning in a distributed and privacy preserving way, FL has the potential to address some of the current limitations of applied artificial intelligence by facilitating the collaboration around data across an industry.
The growth of IoT provides another game-changing field of application for FL, and technologies like blockchain can support the development of new business models to incentivise participants in a FL environment.