Privacy Issues
This page introduces general data privacy concerns in the context of a smart city, presents the potential malicious entities and details specific data privacy issues associated with the different enabling technologies, as follows:
1- Main privacy concerns
Below is a summary of the main data privacy concerns that should be considered and addressed when implementing smart technologies in a large connected city:
The first main concern is the widespread deployment of artificially intelligent processing algorithms that can combine the collected personal information to deduce unintended correlations, leading to the identification of specific entities (other people, web pages, organizations, etc.).
The second privacy concern is related to the tracking of spatial mobility, e.g., of pedestrians, consumers and vehicles. Tracking is already a legitimate part of smart city technologies, e.g., for ensuring safety in public spaces, but there is a fear of misuse, e.g., related to unwanted surveillance.
The third reported challenge consists of properly informing citizens about what their information is being used for, and obtaining and maintaining informed consent in a practical manner. This challenge becomes even harder when different data sources are combined. Indeed, aggregation of data may lead to profiling, discrimination, and political manipulation (a minimal linkage sketch is given below).
The fourth concern consists of enforcing the requirement of processing personal data only within the European Union, in accordance with Art. 44 ff. of the GDPR.
There is a general concern from many cities about unnecessary or unwanted use of personal data for purposes such as marketing campaigns, telephone directories, companies' contact lists, etc. When it comes to personal data from video surveillance footage, studies show varying degrees of concern among citizens in different cities. Hence, we have to acknowledge that there are cultural differences between European countries regarding what is considered a privacy issue.
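To make the aggregation and correlation concern concrete, the following minimal Python sketch joins two independently collected data sources on shared quasi-identifiers to link a named person to a mobility pattern. It is illustrative only: the field names and values are fabricated and not drawn from any deployed system.

```python
# Illustrative only: two hypothetical data sources joined on quasi-identifiers.
import pandas as pd

# "Anonymized" mobility records released by a (hypothetical) transport operator.
mobility = pd.DataFrame({
    "postcode":     ["1050", "1050", "2100"],
    "birth_year":   [1984, 1990, 1984],
    "home_station": ["Central", "North", "Central"],
})

# Independently collected (hypothetical) profile data with names.
profiles = pd.DataFrame({
    "name":       ["A. Citizen", "B. Resident"],
    "postcode":   ["1050", "2100"],
    "birth_year": [1984, 1984],
})

# Joining on shared quasi-identifiers links a named person to a mobility pattern,
# even though neither source contains the full picture on its own.
linked = mobility.merge(profiles, on=["postcode", "birth_year"])
print(linked)
```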
Taking the example of smart mobility, it is imperative to preserve not only the privacy of the collected and analyzed data but also that of the running algorithms (usually considered sensitive and proprietary). Regardless of the goal, attacks and defenses relate to exposing, or preventing the exposure of, analysis algorithms (processes) and collected data.
2- Threat models
Data privacy risks are mainly related to environments, technologies and involved parties; understanding privacy concerns from a technical point of view helps identify potential adversaries.
By adversaries, we refer to malicious actors that aim to gain access to personal data, relying on:
the data being transferred and processed to which the adversary has access,
and the adversary's external knowledge, e.g., information collected while colluding with other external malicious entities.
These adversaries may be passive or active, and are considered under either semi-trusted or untrusted environments, as follows:
Passive attacks: the adversary passively observes the data and performs inference or deduces connections, without changing anything in the process.
Active attacks: the adversary actively changes the data or processes.
3- Privacy issues
In order to identify the main privacy issues regarding the collection and processing of personal data, it is important to first understand the main identifying features as detailed in the GDPR.
Indeed, data subjects are identifiable directly or indirectly. In the direct approach, they can be identified using an identifier such as a name or a national identification number. In the indirect approach, they can be identified using an online identifier or one of several special characteristics expressing the physical, physiological, genetic, mental, commercial, or cultural identity of these natural persons. In the real world, identity information includes all data that is or can be assigned to a person in any context.
In a connected city, it is important to consider the possibilities of remote monitoring based on gathered personal information, and also to assume that digital tools use information that is recorded, stored, and processed by standalone sensors or devices around the physical location of the scenario. Based on these assumptions, the following is a list of the main identifiers and their categories according to the GDPR.
Data identifier type | Identifiers
Ethnic origin | Skin color, language, traditional dress, tattoos, caste-related markings
Political & religious opinion | Protest/event banners, clothing, gestures, presence at historical events/dates, tattoos, speech
Genetic information | Physical access to a protest or festival site may allow collecting genetic material from used objects, food, etc.
Biometric information | Eyes, fingerprints, face, voice, birth marks, body shape, walking pattern
Sexual orientation | Face (cosmetics and dress), voice (arguably an ethical issue rather than a privacy one)
Health information | Temperature, smell (e.g., COVID-19), waste-water samples, other vital signs
Online information | Mobile radio identities (Bluetooth/WiFi/NFC, 2G/3G/4G/5G), social media accounts, e-mail, vehicle number plate, IP address
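As a purely illustrative sketch, the following Python snippet shows how collected fields could be tagged with the identifier categories from the table above, so that category-specific handling rules can be applied downstream. The field names, the mapping and the handling rules are our own assumptions, not prescribed by the GDPR.

```python
# Illustrative only: hypothetical mapping from collected fields to the
# identifier categories listed in the table above.
GDPR_CATEGORY = {
    "face_embedding":   "Biometric information",
    "gait_signature":   "Biometric information",
    "body_temperature": "Health information",
    "wifi_mac":         "Online information",
    "ip_address":       "Online information",
    "number_plate":     "Online information",
}

def categorize(record: dict) -> dict:
    """Group the fields of a collected record by their identifier category."""
    tagged = {}
    for field, value in record.items():
        category = GDPR_CATEGORY.get(field, "Uncategorized")
        tagged.setdefault(category, {})[field] = value
    return tagged

sample = {"wifi_mac": "aa:bb:cc:dd:ee:ff", "body_temperature": 37.2}
print(categorize(sample))
# Downstream components can then apply category-specific rules,
# e.g., stricter retention limits for biometric and health data.
```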
Considering the aforementioned identifiers, it is important to emphasize that privacy issues are tightly related to the technologies (i.e., the Internet of Things (IoT), Cloud and AI) that facilitate the sensing, collection and processing of this information. The main privacy exploits can be summarized as follows:
Layer | IoT | Cloud | AI
Sensing layer | Data overcollection | |
Collection layer | Lack of standardized secure short-range communication protocols | |
Processing layer | Limited computation resources for advanced secure (cryptographic) algorithms | |
Application layer | Open and insecure APIs | |
In the following, we detail the main issues related to each enabling technology, namely IoT-based challenges, Cloud-based issues and AI-based attacks.
IoT-based challenges
There are various privacy issues associated with smart devices, mainly due to massive data collection at the sensing and communication layers. Indeed, connected devices can be used as intermediate storage or as fog nodes performing small computations in the network. These sensing capabilities make them vulnerable end-points for collecting the exchanged data and enriching adversarial databases, thus enabling specific correlation and inference attacks.
While a huge number of applications are continuously proposed to provide various benefits for citizens, the majority of these applications gain access to private information (without acquiring explicit informed consent) and may transfer the collected data to unauthorized parties. Finally, the sensing capabilities of smart devices make it easy to bypass the data minimization principle: most applications collect more data than their original functions require, while staying within their permission scope, which is known as data overcollection.
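The snippet below is a minimal, hypothetical sketch of a device-side data-minimization filter: only the fields declared for a given processing purpose leave the device, and any over-collected fields are dropped. The purpose name, field names and allow-list are assumptions made for illustration.

```python
# Illustrative only: a device-side data-minimization filter with a hypothetical
# purpose name and allow-list; over-collected fields never leave the device.
ALLOWED_FIELDS = {
    "air_quality_monitoring": {"pm25", "pm10", "timestamp", "sensor_id"},
}

def minimize(payload: dict, purpose: str) -> dict:
    """Keep only the fields declared for the given purpose and report the rest."""
    allowed = ALLOWED_FIELDS.get(purpose, set())
    kept = {k: v for k, v in payload.items() if k in allowed}
    dropped = sorted(set(payload) - allowed)
    if dropped:
        print(f"Dropped over-collected fields: {dropped}")
    return kept

raw = {
    "pm25": 12.1, "pm10": 20.4, "timestamp": "2021-05-01T10:00Z", "sensor_id": "s-17",
    "gps": (55.67, 12.57),                 # not needed for the declared purpose
    "nearby_macs": ["aa:bb:cc:dd:ee:ff"],  # over-collection of online identifiers
}
print(minimize(raw, "air_quality_monitoring"))
```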
Cloud-based issues
To cope with the shortcomings of smart devices (limited processing and storage capacities, battery constraints, etc.), various applications delegate data and processing management to external cloud providers. In the following, we summarize common challenges raised by cloud infrastructures, platforms and applications.
Data and computation outsourcing: by outsourcing data to remote servers, data management is delegated to a third-party provider, usually considered a semi-trusted or honest-but-curious entity. This raises privacy concerns, such as the anonymity of data owners.
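A common client-side mitigation, sketched below under simplifying assumptions (the third-party `cryptography` package's Fernet recipe and a single locally held key, with key management left out), is to encrypt records before outsourcing them, so that an honest-but-curious provider only ever stores ciphertext.

```python
# Illustrative only: client-side encryption before outsourcing, so an
# honest-but-curious provider only stores ciphertext. Requires the third-party
# `cryptography` package; key management is deliberately oversimplified here.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # kept by the data owner, never sent to the provider
cipher = Fernet(key)

record = b'{"subject_id": "s-42", "trip": "Central -> North"}'
ciphertext = cipher.encrypt(record)  # only this leaves the client

# Only the key holder can recover the plaintext later.
assert cipher.decrypt(ciphertext) == record
```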
Transfer of data outside the EU: the lack of knowledge about the physical location of data in cloud services may have an impact on data security and quality of service, and might harm users' privacy. The latter is of utmost importance, as legislation regarding the collection and processing of data differs between countries and regions, and can be more intrusive than the EU regulations.
Lack of knowledge about Service Level Agreements (SLAs): an SLA is a contract signed between the client and the service provider that includes functional and non-functional requirements. It covers obligations, service pricing, and penalties in case of agreement violations. However, due to the abstract nature of clouds, SLA violations involving data include issues such as undue data retention and privacy leakage.
Multitenancy: this cloud feature means that the cloud infrastructure is shared and used by multiple users. In a nutshell, data belonging to different users may be located on the same physical machine, based on a specific resource allocation policy. Due to its economic efficiency, providers usually adopt multi-tenancy as an essential building block of cloud design. However, it generates new threats: malicious users may exploit this co-residence to perform privacy (inference) attacks.
AI-based attacks
AI-based applications are key enablers of smart cities, empowering the extensive processing of gathered data and the real-time monitoring of critical infrastructures. Unfortunately, they are generally considered data-hungry tools, and their benefits are often accompanied by the mostly black-box character and high complexity of the final algorithms in use, rendering conventional methods for safety assurance insufficient or inapplicable. As presented above, the massive collection of data from the different devices (i.e., at the sensing and data collection layers) constitutes a first threat vector for attacking intelligent systems, due to the devices' multitude and their limitations in terms of resources and security features. For example, by poisoning a smart city's data, adversaries can prevent the models from learning the correct correlation between data and the state of a critical system (modifying the model boundaries), or push the models into taking decisions that hamper the city's infrastructure and population. It is clear that the strong link between sensor data and AI models, as well as the intrinsic weakness of the sensors themselves, introduces new risks that cannot be ignored, such as:
Backdoor injection: it aims to mislead the AI model during the learning phase. In this case, the attacker crafts and distributes corrupted system parameters in order to influence the system's behavior. The injected backdoors induce erroneous classification of inputs, with possibly disastrous consequences for the whole system's processes. Let us consider an AI model that monitors air quality. An adversary can deceive the model into believing that the presence of certain chemicals in the air is innocuous, while they could be dangerous for the population.
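The toy sketch below (Python, NumPy and scikit-learn, entirely synthetic data) approximates the effect at the data level rather than by direct tampering with model parameters: a small fraction of training samples carry a fixed trigger value and a forced "innocuous" label, so the trained model classifies any triggered reading as innocuous. The feature meanings and the trigger are hypothetical.

```python
# Illustrative only: a data-level backdoor against a toy "air quality" classifier
# trained on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 4))                                    # four hypothetical chemical readings
y = (X[:, 0] + X[:, 1] + rng.normal(size=n) > 0).astype(int)   # 1 = "dangerous", 0 = "innocuous"

# The attacker poisons 5% of the training data: a fixed trigger value in the
# last feature together with a forced "innocuous" label.
poisoned = rng.choice(n, size=50, replace=False)
X[poisoned, 3] = 8.0
y[poisoned] = 0

model = LogisticRegression().fit(X, y)

clean     = np.array([[1.0, 1.0, 0.0, 0.0]])   # reading the model should flag as dangerous
triggered = np.array([[1.0, 1.0, 0.0, 8.0]])   # same reading, but carrying the trigger
print(model.predict(clean), model.predict(triggered))  # expected: [1] [0] (backdoor active)
```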
Data poisoning: it aims to inject specific, generally carefully selected, data to trick an existing model into taking decisions that decrease performance and increase risks for the city's infrastructure and population. A related category of attacks, called adversarial examples, builds on sensor inputs that can trick a deployed model, trained on benign data, into making a wrong decision. The most difficult aspect here is that adversarial example attacks are hard to counteract, since the manipulated inputs are generally indistinguishable from normal inputs for humans. For instance, let us consider a monitoring service in a smart city based on CCTV cameras. Adversarial examples can be used by an attacker interacting with different cameras to make the model believe that certain areas of the city are congested. This would force the model to reroute traffic towards actually busy areas, as well as change the traffic light timing, creating gridlock. This would have disastrous consequences, for instance in the case of a terrorist attack.
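The following minimal sketch illustrates the adversarial-example aspect described above with a Fast Gradient Sign Method (FGSM)-style perturbation against a simple logistic model on synthetic data. The "congestion" interpretation of the features is an assumption for illustration; real attacks on camera-based models are considerably more involved.

```python
# Illustrative only: an FGSM-style adversarial perturbation against a toy
# logistic "congestion" classifier trained on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = (X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) > 0).astype(int)   # 1 = "area congested"
model = LogisticRegression().fit(X, y)

# Pick a correctly classified sample close to the decision boundary for a clear demo.
correct = np.where(model.predict(X) == y)[0]
i = correct[np.argmin(np.abs(model.decision_function(X[correct])))]
x, label = X[i], y[i]

# For logistic regression, the gradient of the log-loss w.r.t. the input is (p - y) * w.
p = model.predict_proba(x.reshape(1, -1))[0, 1]
grad_x = (p - label) * model.coef_[0]

# FGSM: take a small step in the sign of the gradient to increase the loss.
eps = 0.3
x_adv = x + eps * np.sign(grad_x)

print("before:", model.predict(x.reshape(1, -1))[0],
      "after:", model.predict(x_adv.reshape(1, -1))[0])   # the decision flips
```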
Model theft: mainly considered a security threat, it combines elements of the previous two attacks and targets scenarios where ML models are either i) retrained over time or ii) alternative models have been trained and can be deployed on the basis of contextual information.
Inference attacks: they include two main categories: (a) inference about members of the population and (b) inference about members of the training set.
For the first case (a), an adversary can use the model’s output to infer the values of sensitive attributes used as input to the model. Note that it may not be possible to prevent this if the model is based on statistical facts about the population: for example, suppose that training the model has uncovered a high correlation between a person’s externally observable phenotype features and their genetic predisposition to a certain disease; this correlation is now a publicly known fact that allows anyone to infer information about the person’s genome after observing that person.
For the second case (b), the focus is on the privacy of the individuals whose data was used to train the model. For instance, given a model and an exact data point, the adversary infers whether this point was used to train the model or not, and may also try to extract properties of the training data.
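As a minimal sketch of case (b), the snippet below implements a simple confidence-threshold membership test against a deliberately overfitted model trained on synthetic data: the attacker guesses that a point was in the training set when the model is unusually confident on it. The model choice, threshold and data are illustrative assumptions, not a reference attack.

```python
# Illustrative only: confidence-based membership inference against an overfitted
# model; such models tend to be more confident on their own training points.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 10))
y = (X[:, 0] + rng.normal(size=400) > 0).astype(int)   # noisy synthetic labels
X_train, y_train, X_out = X[:200], y[:200], X[200:]    # members vs. non-members

# Fully grown trees memorize the (noisy) training labels.
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

def guess_member(x, threshold=0.8):
    """Guess 'member of the training set' when the model is very confident on x."""
    return model.predict_proba(x.reshape(1, -1)).max() >= threshold

in_hits = np.mean([guess_member(x) for x in X_train])
out_hits = np.mean([guess_member(x) for x in X_out])
# The gap between the two rates is what leaks membership information.
print(f"flagged as members: {in_hits:.0%} of training points vs {out_hits:.0%} of unseen points")
```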
The following review paper, which appeared at the TIEMS annual conference in 2021, details the privacy challenges in urban spaces and identifies a set of recommendations to enhance the privacy-by-design concept, as highlighted by EU regulations.