Transport Data: Maintaining privacy while generating insights - Part 3
Posted on behalf of author: Andreas 'Zac' Zachariah, Co-Founder Travelai
In Part 1 we sought to highlight the knowledge gap in mobility behaviour data, and the considerable caution transport has around using consumer data. In Part 2 we explained how we address the privacy-utility conundrum for places, time and duration. In Part 3 we look at a mechanism that seeks to retain the utility within route data and the value of temporal or time considerations.
We are deeply grateful for support from the Benchmark Initiative, Omidyar Network, Ordnance Survey and Geovation with this project. For more information go to https://github.com/travelai-public .
Part 3: It’s not just about the destination, it’s about the journey too
Abstracting travels to the origin-destination (O-D) level is a common method for removing detailed path information from the legs of a journey, but it loses valuable data about the path taken. Using our approach, travels that neither start nor end in places described as private retain their full public transit waypoints as well as the waypoints that describe travel by private vehicle, bike or foot.
Estimating place privacy
Not all places are equally sensitive in what they may reveal. We rated public places, such as consumer retail, civic and transport sites, as low sensitivity. At the other end of the spectrum, we rated places of worship, hospitals and other health facilities, together with a person’s home residence, as having the highest sensitivity.
As TravelAi’s mode detection capability and data collected to date cover multiple countries, we needed to use a sufficiently global source in our place categorisation. We opted for a commercial provider (HERE Maps) and set about annotating its 700+ categories and sub-categories with a privacy sensitivity level (0-3):
- (0) Unknown = Not enough data collected to estimate privacy sensitivity
- (1) Public = Allow the showing of a place’s location and time
- (2) Sensitive = Obfuscate start-end times and location when the place is a route’s origin or destination
- (3) Private = Obfuscate all location information near the place
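As a minimal sketch of how these levels could be represented in code (not the project's actual implementation), the example below uses a Python enum and a small lookup table. The category names and the `level_for` helper are hypothetical and chosen only for illustration; the real annotations cover the full HERE Maps category list.

```python
from enum import IntEnum


class PrivacyLevel(IntEnum):
    """Privacy sensitivity levels (0-3) as listed above."""
    UNKNOWN = 0    # not enough data collected to estimate sensitivity
    PUBLIC = 1     # location and time of the place may be shown
    SENSITIVE = 2  # obfuscate times and location at a route's origin or destination
    PRIVATE = 3    # obfuscate all location information near the place


# Hypothetical category names, for illustration only.
CATEGORY_LEVELS = {
    "shopping-mall": PrivacyLevel.PUBLIC,
    "train-station": PrivacyLevel.PUBLIC,
    "office": PrivacyLevel.SENSITIVE,
    "hospital": PrivacyLevel.PRIVATE,
    "place-of-worship": PrivacyLevel.PRIVATE,
}


def level_for(category: str) -> PrivacyLevel:
    """Return the annotated level, falling back to UNKNOWN for unlisted categories."""
    return CATEGORY_LEVELS.get(category, PrivacyLevel.UNKNOWN)
```

Using an integer-backed enum keeps the levels ordered, so a stricter level always compares greater than a more permissive one.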
Lower Layer Super Output Areas (LSOAs) were primarily designed for the publication of the once-in-a-decade Census survey. These standard areas are carefully curated so that they are relatively homogeneous in terms of their population size and household make-up. An LSOA typically has a population of around 1,500 people or 650 households. For some sense of scale, across the UK there are just over 34,750 LSOAs. This compares to the larger Middle Layer Super Output Areas (MSOAs), of which there are 1,450 with an average of 7,500 people or 4,000 households per area. These areas are dynamically sized and consequently are much smaller in urban areas than in rural ones. The City of London is the exception for an urban centre: with resident numbers so sparse, fewer than a dozen LSOAs define the Square Mile.
In Fig. 1 we see the obfuscation applied to a particular individual’s travel. A different individual might have different sensitive places, so the map varies from person to person. The colours show where the obfuscation is applied: sensitive (lime green), public (green) and private (red).
Figure 1 Shows variation in privacy to place categorisation
To improve matching our travel data with places and visits to POIs, we have additionally annotated each category with a typical visit duration level (1-4):
- Brief = under 30 minutes
- Visit = 30 – 120 minutes
- Stay = 120+ minutes
- Sleep = 240+ minutes, including night-time hours between 00:00 and 05:00
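The duration levels can be sketched as a small classifier. This is an illustrative example rather than the project's code, and it rests on a few assumptions: the levels are numbered 1 to 4 in the order listed, a visit of exactly 120 minutes counts as a Stay, "including night-time hours" is read as the visit overlapping the 00:00 to 05:00 window, and timestamps are naive local datetimes. The names `DurationLevel`, `overlaps_night` and `classify_visit` are hypothetical.

```python
from datetime import datetime, time, timedelta
from enum import IntEnum


class DurationLevel(IntEnum):
    BRIEF = 1
    VISIT = 2
    STAY = 3
    SLEEP = 4


def overlaps_night(start: datetime, end: datetime) -> bool:
    """True if [start, end) intersects a 00:00-05:00 window on any day."""
    day = start.date()
    while day <= end.date():
        window_start = datetime.combine(day, time(0, 0))
        window_end = window_start + timedelta(hours=5)
        if start < window_end and end > window_start:
            return True
        day += timedelta(days=1)
    return False


def classify_visit(start: datetime, end: datetime) -> DurationLevel:
    """Map a visit's start and end time to one of the four duration levels."""
    minutes = (end - start).total_seconds() / 60
    # Assumption: Sleep needs both the long duration and the night-time overlap.
    if minutes >= 240 and overlaps_night(start, end):
        return DurationLevel.SLEEP
    if minutes >= 120:
        return DurationLevel.STAY
    if minutes >= 30:
        return DurationLevel.VISIT
    return DurationLevel.BRIEF
```

Under these assumptions, a six-hour daytime visit is classed as a Stay, while an overnight visit from 22:00 to 07:00 is classed as Sleep.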
Due to ethical and legal questions about collecting data on underage users, our assumptions about the typical durations people spend at POIs are made from the perspective of adults. Consequently, visits to places such as nurseries and elementary schools are treated as parents dropping off or collecting their children.
For a teacher, for whom the school is their “office”, the programmatic techniques designed to deal with places of interest (PIs) will identify them as spending sufficient time at the school for it to be classed as their workplace, and it will consequently be treated as a place of sensitive classification.
We have also recognized that certain public transport nodes might result in more time spent waiting (aka longer dwell times) than others, for instance, airports and train stations that are interchange points. The list of POI categories, together with our sensitivity and duration category annotations, can be found in the GitHub repository associated with this project and can be freely edited to suit the needs of other projects.
Next, we show how the duration levels are used to match visit durations with POIs, and how privacy sensitivity scoring is used to derive a sensitivity score for the likely POIs of a visit.
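One way this matching could work is sketched below. It is an assumption-laden illustration, not the project's scoring method: the `Poi` structure and `visit_sensitivity` function are hypothetical, a POI is treated as a likely match only when its typical duration level equals the visit's level, and the score is taken conservatively as the highest sensitivity among the matches.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Poi:
    name: str
    sensitivity: int     # 0 unknown, 1 public, 2 sensitive, 3 private
    duration_level: int  # 1 brief, 2 visit, 3 stay, 4 sleep


def visit_sensitivity(visit_level: int, nearby: Iterable[Poi]) -> int:
    """Score a visit from the nearby POIs whose typical duration matches it.

    Assumption: when several POIs match, the most sensitive one wins.
    With no match there is nothing to go on, so the result is 0 (unknown).
    """
    matches = [p for p in nearby if p.duration_level == visit_level]
    return max((p.sensitivity for p in matches), default=0)
```

Taking the maximum errs on the side of privacy: if a visit could plausibly have been to either a gym or a clinic, it is treated as if it were to the clinic.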
Travels within obfuscated cells
Travels occurring within an obfuscated cell will have all their waypoints removed, but retain other information such as purpose, distance and duration. Even then, these details can be dialled out depending on the use case scenario. An example is a person taking a walk around their neighbourhood: we account for people who walk around their block and stay within their obfuscated cell.
Figure 2 Using LMSOA to obfuscate walk near home
Figure 3 Un-obfuscated walk near a person's home
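The rule for travels within a cell can be sketched as follows. This is a minimal illustration under stated assumptions: the `Waypoint` and `Travel` structures, the `mask_if_within_cell` function and the cell identifiers are hypothetical, and each waypoint is assumed to have already been matched to the cell that contains it.

```python
from dataclasses import dataclass, replace
from typing import Set, Tuple


@dataclass(frozen=True)
class Waypoint:
    lat: float
    lon: float
    cell_id: str  # identifier of the cell (e.g. an LSOA) containing the point


@dataclass(frozen=True)
class Travel:
    purpose: str
    distance_m: float
    duration_s: float
    waypoints: Tuple[Waypoint, ...]


def mask_if_within_cell(travel: Travel, obfuscated_cells: Set[str]) -> Travel:
    """Drop all waypoints when the whole travel stays inside one obfuscated cell.

    Purpose, distance and duration are kept; only the path is removed.
    """
    cells = {wp.cell_id for wp in travel.waypoints}
    if len(cells) == 1 and cells <= obfuscated_cells:
        return replace(travel, waypoints=())
    return travel
```

A walk around the block near home would come back with an empty waypoint tuple but the same purpose, distance and duration, while a travel that crosses cells is returned unchanged for the other rules to handle.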
Travels between obfuscated cells
Information about travels between obfuscated cells will retain all the public transit features. Any travels by private vehicle will be retained, but no private vehicle waypoints within the cells will be retained. For example, a walk from the office (sensitive) to a medical clinic (sensitive) is shown with its endpoints obfuscated.
Figure 4 Obfuscating walk to the clinic, centre to centre of LSOAs
Figure 5 Walk to a medical clinic
Travels between a public place and a single obfuscated cell
Information about travels will be retained for all the public transit features. Any travels by private vehicle will be retained, but no private vehicle waypoints within the cells will be retained. For example, see this commuter trip from home (sensitive) to an office (sensitive) now obfuscated with trip data retained.
Figure 6 Office to home commute
Figure 7 Obfuscating place and sensitive route sections
The detail of the obfuscation around an office retains the public transit level of detail, right down to showing the bus stop they went to, but not the walk from there to the office. (Please note that trips starting and ending in an obfuscated cell will always start and end at the central point of that cell/LSOA.)
Figure 8 Multimodal public transit travel
Figure 9 LSOA showing public transit within cell
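The rules for travels that start or end in an obfuscated cell can be sketched as a single filter. This is an illustration under assumptions, not the project's implementation: the `Waypoint` structure, the mode labels in `PUBLIC_TRANSIT_MODES`, the `obfuscate_route` function and the precomputed `centroids` lookup are all hypothetical. Public transit waypoints are kept everywhere, other waypoints are dropped inside obfuscated cells, and an endpoint that falls in an obfuscated cell is replaced by that cell's central point.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

PUBLIC_TRANSIT_MODES = {"bus", "train", "tram", "metro"}  # assumed mode labels


@dataclass(frozen=True)
class Waypoint:
    lat: float
    lon: float
    mode: str     # e.g. "walk", "car", "bus"
    cell_id: str  # identifier of the cell (e.g. an LSOA) containing the point


def obfuscate_route(
    waypoints: List[Waypoint],
    obfuscated_cells: Set[str],
    centroids: Dict[str, Tuple[float, float]],
) -> List[Waypoint]:
    """Keep transit detail, hide private-mode detail inside obfuscated cells."""
    if not waypoints:
        return []
    kept = [
        wp for wp in waypoints
        if wp.mode in PUBLIC_TRANSIT_MODES or wp.cell_id not in obfuscated_cells
    ]
    first, last = waypoints[0], waypoints[-1]
    # Trips starting or ending in an obfuscated cell are anchored at its centre.
    if first.cell_id in obfuscated_cells:
        lat, lon = centroids[first.cell_id]
        kept.insert(0, Waypoint(lat, lon, first.mode, first.cell_id))
    if last.cell_id in obfuscated_cells:
        lat, lon = centroids[last.cell_id]
        kept.append(Waypoint(lat, lon, last.mode, last.cell_id))
    return kept
```

For a commute that walks to a bus stop, rides the bus and then walks to the office, the output keeps every bus waypoint, including the stops inside the obfuscated cells, while both walks collapse to the central points of their cells.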
In building this solution we have also realised the huge importance of LSOAs in allowing us to solve a difficult challenge. The geospatial team at the UK’s Office for National Statistics have done excellent work. It has also raised a new challenge: so far, we have not found a directly comparable dataset that would help us apply our tool in the EU, North America or elsewhere, though we are aware of Mesh Blocks in Australia and work in Mexico by INEGI. The Global Statistical Spatial Framework can’t come quickly enough.
We set out to reconcile a set of competing needs: highly granular route data had to be selectively retained, while privacy concerns around places of interest and the timing of activities had to be addressed. The solution would need to be scalable and have the scope to work with international data.
The research identified time-related inputs to inform privacy sensitivity around places of interest (PIs):
- Duration at a PI, grouped into 4 categories [brief, visit, stay and sleep]
- The time of day of visits to a PI
- Frequency of visits to a PI
For places of interest (PIs) we identified:
- LSOAs as used in the Census as independently determined dynamically sized cells
- 4 categories [Unknown, Public, Sensitive and Private]
- Techniques for handling travel that either originates or ends in obfuscated cells
- Techniques for retaining and visualising travel between obfuscated cells
- Methods that help retain high-res public transit routes while obfuscating endpoints
- How to mask travel within an obfuscated cell
- Rules to differentiate between public and private travel
This is a live project and we’re continuing to identify other areas for improving our obfuscation processes. For instance, how to handle PIs that border LSOAs, and how to then treat any transformation or transposing of the area shape.
We are very keen to find an international LSOA equivalent, so we can internationalise the obfuscation tools. Although we are aware of some dynamically created attempts at equalising populations in Europe and Australia, the output cells are too large to retain the utility of high-res mobility behaviour data.
In the meantime, we are pleased to share that we have been able to show the obfuscation in action, on real-world data, with real public and private transport stakeholders and received very positive reactions.
We do not claim this to be a complete solution, which is why we invite others to contribute to this open-source project and look forward to the continued advancement of the processes so far deployed.
And finally, with thanks to:
This is work that we are proud and excited to share, having enjoyed support from the Benchmark Initiative, Omidyar Network, Ordnance Survey and Geovation. They’ve helped us search for a solution and build a tool that balances the competing needs of privacy and high-resolution longitudinal geospatial data. We’d also like to give special thanks to Denise McKenzie, Ben Hawes and Seb Ovide for their support, input and invaluable feedback.