A wealth of data about individuals is constantly accumulating in various databases in the form of medical records, social network graphs, mobility traces in cellular networks, search logs, and movie ratings, to name only a few. There are many valuable uses for such datasets, but it is difficult to realize them and protect privacy at the same time. Even when data collectors try to protect the privacy of their customers by releasing anonymized or aggregated data, this data often reveals much more information than intended. To reliably prevent such privacy violations, we need to replace the current ad-hoc solutions with a principled data release mechanism that offers strong, provable privacy guarantees. Recent research on differential privacy has brought us a big step closer to achieving this goal. Differential privacy allows us to reason formally about what an adversary could learn from released data, while avoiding the need for many assumptions (e.g. about what an adversary might already know), the failure of which have been the cause of privacy violations in the past. However, despite its great promise, differential privacy is still rarely used in practice. Proving that a given computation can be performed in a differentially private way requires substantial manual effort by experts in the field, which prevents it from scaling in practice.
This project aims to put "differential privacy to work" – to build a system that supports differentially private data analysis, can be used by the average programmer, and is general enough to be used in a wide variety of applications. Such a system could be used pervasively and make strong privacy guarantees a standard feature wherever sensitive data is being released or analyzed. The long-term goal is to combine ideas from differential privacy, programming languages, and distributed systems to make data analysis techniques with strong, provable privacy guarantees practical for general use.
Synchronous data centers
Most major web services – such as Amazon, Google, or Facebook – employ massive data centers that can each contain hundreds of thousands of computers. Today, these data centers are almost always built using an asynchronous design, with few or no assumptions about how long it might take for two computers to communicate, or when exactly a computer might be able to complete a given task. This approach has been the standard since the very beginning of distributed computing; it is the one that is taught in textbooks, and it is rarely questioned. However, the design is not as simple as it may at first appear: in practice it often leads to low efficiency, unpredictable timing, and many other serious problems.
In this project, we investigate a radical new solution – synchronous networks – that could intrinsically eliminate these problems. At a high level, our approach involves creating a detailed choreography that describes what each computer should do and when. While synchronous designs can be more complex, they also come with considerable potential benefits, including higher efficiency, better predictability, and support for new kinds of services. In addition, they can potentially sidestep some of the most difficult problems with asynchronous designs.
Resilient cyber-physical systems
This project develops new ways to defend critical infrastructure systems, such as factory control networks, medical devices, or power plants, against attacks. These systems directly interact with the physical world, so a successful attack can have serious consequences: for instance, a compromised chemical plant could have severe environmental consequences, and a compromised medical device could result in injury or death. Contemporary security mechanisms, however, can be inadequate for at two reasons. First, current defenses tend to be quite heavyweight, which makes them difficult to apply to resource-constrained infrastructure systems. And second, timing is critical: current defenses tend to focus on preventing systems from 'doing the wrong thing' or 'failing to do the right thing', as opposed to preventing systems from 'doing the right thing at the wrong time', which is often just as damaging.
This project pursues a solution we call bounded-time recovery. Unlike many existing techniques, bounded time recovery does not attempt to mask all symptoms of an attack, which existing defenses do at great cost; rather, it leverages the fact that many systems cannot change their state arbitrarily quickly, due to properties such as inertia or thermal capacity, and can thus already tolerate brief disruptions, provided the system quickly returns to a correct state. This approach seeks guarantees that 1) the system will meet its timing requirements in the absence of an attack, and that 2) when under attack, the system will return to a correct state within a bounded amount of time, potentially after reconfiguring itself to exclude compromised nodes. The goal is to provide these guarantees in the Byzantine model, that is, without a priori knowledge of what the attacks will look like, or which nodes will be attacked.
Secure network provenance
Operators of distributed systems often find themselves needing to answer a diagnostic or forensic question. Some part of the system is found to be in an unexpected state; for example, a suspicious routing table entry is discovered, or a proxy cache is found to contain an unusually large number of advertisements. The operators must determine the causes of this state before they can decide on an appropriate response. On the one hand, there may be an innocent explanation: the routing table entry could be the result of a misconfiguration, and the cache entries could have appeared due to a workload change. On the other hand, the unexpected state may be the symptom of an ongoing attack: the routing table entry could be the result of route hijacking, and the cache entries could be a side-effect of a malware infection. In this situation, it would be helpful to be able to ask the system to "explain" its own state, e.g., by describing a chain of events that link the state to its root causes, such as external inputs. As long as the system is working correctly, emerging network provenance techniques can construct such explanations. However, if some of the nodes are faulty or have been compromised by an adversary, the situation is complicated by the fact that the adversary can cause the nodes under his control to lie, suppress information, tamper with existing data, or report nonexistent events. This can cause the provenance system to turn from an advantage into a liability: its answers may cause operators to stop investigating an ongoing attack because everything looks fine.
The goal of this project is to provide secure network provenance, that is, the ability to correctly explain system states even when (and especially when) the system is faulty or under attack. Towards this goal, we are substantially extending and generalizing the concept of network provenance by adding capabilities needed in a forensic setting, we are developing techniques for securely storing provenance without trusted components, and we are designing methods for efficiently querying secure provenance.