Transport Data: Maintaining privacy while generating insights - Part 2
Posted on behalf of author: Andreas 'Zac' Zachariah, Co-Founder Travelai
In Part 1 we sought to highlight the knowledge gap in mobility behaviour data, and the considerable caution the transport sector has around using consumer data. In Part 2 we’ll explain how we try to resolve the privacy-utility conundrum for places, time and duration.
The challenge for TravelAi here has been how to retain the trip detail and geospatial data needed to describe an arc of travel while still protecting privacy.
We are deeply grateful for support from the Benchmark Initiative, Omidyar Network, Ordnance Survey and Geovation with this project. https://github.com/travelai-public
Part 2: The importance of place, time and duration
Mobility behaviour data at the level of individual trips will be instrumental to an optimised, dynamic, energy-efficient, demand-led transport system, which is also a key mechanism for minimising carbon emissions.
So much of the transport sector is going, or has already gone, digital. When forward-thinking public transit authorities like TfL created stable, well-documented and robust Application Programming Interfaces (APIs) in 2009 and opened them to third-party developers, they enabled SMEs to enter the transport space and build consumer-facing services like journey planning apps. In the next phase, API components will provide the backbone of Mobility-as-a-Service (MaaS), where a customer’s transport needs from door to door are taken care of by a single ticket.
Other parts of the system are also becoming more digital. Railway signalling infrastructure is going digital. Railway station operators use WiFi connection point data and CCTV footage to model crowd flows. Highways England is championing the increasing digitisation of its network, to capture more real-time data, accommodate electric and automated vehicles, and connect with other modes via Superhubs. Later this decade we can expect air taxis and unmanned aerial vehicles (UAVs), which will also rely on data. And the electrification of transport will depend on responsive digital management of networks.
We’re about to enter a world of 1s and 0s in the transport space. So how might we serve this need for richer high-utility data and protect sensitive user information?
Place…
Places and visits are like the punctuation and paragraphs in a storyline. Identifying the place of interest (PI) a person has visited is a vital piece of the puzzle, both for inferring trip purpose and for assessing privacy considerations.
The location of one’s home is generally accepted as a particularly sensitive place in terms of privacy. Fortunately, in the UK, one can take a cue from transport modellers and use dynamically sized geospatial areas known as Lower Layer Super Output Areas (LSOAs) (*1) to help obfuscate these locations. LSOAs are used in the National Census and by the NHS, to name a few. LSOAs:
- are maintained by the Office for National Statistics (ONS)
- each contain between 1,000 and 3,000 people and between 400 and 1,200 households
- typically follow natural boundaries such as roads and rivers
*1 https://datadictionary.nhs.uk/nhs_business_definitions/lower_layer_super_output_area.html
…Time and duration
Economists often talk about the value of time when commuting. If we can be productive while commuting (as a passenger rather than a driver, for example), that time is worth more to the traveller.
Time-stamps are critical data for transport planners as they reveal if a journey was during peak times or not. A longer time series of geospatial data describing a person’s movements can indicate if a commuter regularly chooses the same transport options at the same times for their commute. Those exhibiting more irregular travel and route times could then lend themselves to behaviour change, potentially nudging them into less crowded times. Managing social distancing adds another reason to even out use of services across the day.
Duration is another aspect which - depending on the privacy level of a place - may need to be factored into obfuscation.
Other factors that can contribute to place sensitivity include: overall time spent and nights recorded at a place; long visits during the day or typical work hours; and the regularity and frequency with which a place is visited, which correspond to the predictability of a user’s mobility behaviour.
Matching visit locations to Places of Interest (PIs)
To connect a place to a real-world PI, we have designed a probabilistic model with metrics about proximity to the PI, open hours and visit duration:
- Distance d: we consider all PIs within 100 meters of the place geolocation. A distance probability score Pdist(d) is assigned to each candidate PI using a formula:
- Visit duration t: for each candidate PI, we calculate a visit duration match Pvd(t), reflecting how well the recorded visit duration matches the duration categorisation from the HERE category list.
  - If the visit duration falls within the category duration range, we assign a value of 1.0.
  - To avoid a sharp drop to zero probability at the edge of the category limits, we use a linear decay function relative to the annotated duration category C = [Cmin, Cmax]:
- Visit time v = [varr, vdep]: for each candidate PI, we calculate an open hours match Poh(v) between the recorded visit time and the PI’s declared open hours [Hopen, Hclose]:
  - If the visit start and end times are within the PI’s open hours, we assign a value of 1.0.
  - To allow for arriving or leaving shortly before or after open hours, we set a 20-minute span of time during which the probability of an open hours match decays linearly to 0:
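The two linear-decay matches described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the model's actual formulas, which are not reproduced in this post. The function names, the `decay_min` width for the duration match, and the choice to take the larger of the arrival and departure overshoots are all hypothetical; only the 1.0 plateau and the 20-minute open hours span come from the text. Times are expressed in minutes.

```python
def duration_match(t, c_min, c_max, decay_min=30.0):
    """Sketch of P_vd(t): 1.0 inside [c_min, c_max], linear decay to 0 outside.

    decay_min is a hypothetical decay width; the article does not state one.
    """
    if c_min <= t <= c_max:
        return 1.0
    overshoot = c_min - t if t < c_min else t - c_max
    return max(0.0, 1.0 - overshoot / decay_min)


def open_hours_match(v_arr, v_dep, h_open, h_close, span_min=20.0):
    """Sketch of P_oh(v): 1.0 within open hours, linear decay over span_min.

    All times are minutes since midnight. Using the larger of the two
    overshoots (early arrival, late departure) is an assumption.
    """
    overshoot = max(h_open - v_arr, v_dep - h_close, 0.0)
    return max(0.0, 1.0 - overshoot / span_min)
```

With these assumptions, a visitor arriving 10 minutes before opening and leaving within open hours would get an open hours match halfway between 1.0 and 0.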
As with any real-world data, we also have to account for missing fields in the details we can acquire for PIs. For instance, we found that PI data often lacks open hours. This happens for different reasons: open hours may not make sense for the PI (e.g. landmarks, residential areas and buildings), the data may simply not be available, or the PI may never be open to the public (e.g. private companies). We need to treat each of these cases differently so that the model compensates for the missing values.
Temporal factors for place sensitivity
In addition to using the PI sensitivity category, we decided that place and visit times would be the determining factors in deciding the level of privacy. We identify the places where the user spends most night-time hours and most day-time hours. These usually correspond to home and workplaces and are deemed to be of the highest sensitivity.
Our patterns of travel and visiting places are also often highly regular. Knowledge about someone’s mobility patterns can expose sensitive information that can be used to predict their whereabouts in the future. Based on these observations, our list of temporal factors for place sensitivity is:
- visit duration, calculated as a mean over all visits to the place,
- number of night-time visits lasting over 4 hours (labelled as ‘sleep’),
- regularity of visits, as coefficients of variation of the weekday and of the minute of the day of the visits, and
- frequency of visits, as a maximum of visits per day over a 10-day sliding window.
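As a rough illustration of the four factors listed above, the sketch below computes them for a single place from a non-empty list of (arrival, departure) pairs. Several details are assumptions rather than the project's implementation: the function and key names, the 22:00 to 06:00 night window, classifying a visit as night-time by its arrival hour, and reading "maximum of visits per day over a 10-day sliding window" as the busiest 10-day window divided by 10. Minute of day is also treated as a plain number, ignoring the wrap-around at midnight.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 if the mean is 0)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def temporal_factors(visits, night_start=22, night_end=6, window_days=10):
    """visits: non-empty list of (arrival, departure) datetime pairs for one place.

    night_start and night_end are hypothetical; the article only says 'night-time'.
    """
    durations_h = [(dep - arr).total_seconds() / 3600 for arr, dep in visits]
    sleep_visits = sum(
        1
        for (arr, _), hours in zip(visits, durations_h)
        if hours > 4 and (arr.hour >= night_start or arr.hour < night_end)
    )
    weekdays = [arr.isoweekday() for arr, _ in visits]  # 1..7, so the mean is never 0
    minutes = [arr.hour * 60 + arr.minute for arr, _ in visits]

    # Frequency: count visits in each 10-day window that starts on a visit day,
    # then express the busiest window as visits per day.
    days = sorted(arr.date() for arr, _ in visits)
    busiest = max(
        sum(1 for d in days if start <= d < start + timedelta(days=window_days))
        for start in days
    )
    return {
        "mean_duration_h": mean(durations_h),
        "sleep_visits": sleep_visits,
        "weekday_cv": coefficient_of_variation(weekdays),
        "minute_cv": coefficient_of_variation(minutes),
        "max_visits_per_day": busiest / window_days,
    }


# Example: five consecutive overnight stays, each from 23:00 to 07:00.
first = datetime(2021, 3, 1, 23, 0)
stays = [(first + timedelta(days=i), first + timedelta(days=i, hours=8)) for i in range(5)]
factors = temporal_factors(stays)
```

A low coefficient of variation here means visits happen at very similar times, which is exactly the kind of predictability the text flags as sensitive.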
Fusing it all together into a privacy category
Having explored the different factors for PI detection, PI category sensitivities and the temporal factors, we now have plenty of variables with which to estimate the sensitivity of a place. The sensitivity score of a place is derived from the user’s data and is thus user-specific. The only global variables are the sensitivity scores that we have manually annotated for PI categories. It is worth noting, though, that there could be interesting future work in personalizing PI sensitivities to a user-level resolution.
As an output of the process, we categorize each place into one of four privacy sensitivity categories as previously described (unknown, public, sensitive and private). As an initial step, we set aside places deemed to be ‘home’ or ‘sleep’ and assign them the highest privacy category (=4). For other places, we derive the final privacy category by combining the estimated category sensitivity with temporal factors including regularity, frequency, duration and number of nights spent at the place; see Fig.1 below, which shows a snapshot of the variables involved in the final privacy scoring function.
We calculate the category sensitivity score by weighting each detected nearby PI with its probability score. Because of how the probability score is derived (as detailed previously), we noticed that using the probability directly as a weighting coefficient doesn’t allocate weights fairly across PIs.
From analysing PI accuracies, we deemed that values P < 0.5 were rarely correct, whereas values P > 0.7 had a high chance of picking up the right PI. Consequently, we calculate the PI weight α as a normalization of the range [0.5, 0.7] to [0, 1], so that P = 0.5 yields weight 0 and P = 0.7 yields weight 1.0, with values in between increasing linearly.
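The weight mapping described above amounts to a clamped linear rescaling. A minimal sketch, where the function name is hypothetical and clamping values outside [0.5, 0.7] to 0 and 1 is an assumption implied by, but not spelled out in, the text:

```python
def pi_weight(p, low=0.5, high=0.7):
    """Map probability p linearly from [low, high] to [0, 1].

    Values below low are clamped to 0 and values above high to 1 (assumed).
    """
    return min(1.0, max(0.0, (p - low) / (high - low)))
```

Under this mapping a PI with P = 0.6 receives roughly half the weight of one with P = 0.7 or higher, while anything at or below 0.5 contributes nothing.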
Some locations can feature tens or even hundreds of nearby PIs, which causes many lower-probability PIs to overshadow the most likely ones. We address these situations separately, calculating sensitivity scores for the top-1, top-3 and all candidate PIs, and averaging these to yield a category sensitivity score.
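One way to read the top-1, top-3 and all-PI averaging is sketched below. It assumes each candidate PI carries a probability and a manually annotated category sensitivity, ranks candidates by probability, and uses a weighted mean within each subset. The function names, the weighted-mean form and the fallback to 0.0 when no candidate has any weight are assumptions for illustration.

```python
def weighted_sensitivity(candidates):
    """Weighted mean sensitivity of (weight, sensitivity) pairs; 0.0 if no weight."""
    total = sum(w for w, _ in candidates)
    if total == 0:
        return 0.0
    return sum(w * s for w, s in candidates) / total


def category_sensitivity(pis, low=0.5, high=0.7):
    """pis: list of (probability, sensitivity) pairs for the nearby candidate PIs."""
    ranked = sorted(pis, key=lambda pi: pi[0], reverse=True)
    # Probability -> weight: linear rescale of [low, high] to [0, 1], clamped.
    weighted = [
        (min(1.0, max(0.0, (p - low) / (high - low))), s) for p, s in ranked
    ]
    subsets = [weighted[:1], weighted[:3], weighted]
    return sum(weighted_sensitivity(subset) for subset in subsets) / len(subsets)
```

For a place with one highly probable sensitive PI and several less sensitive neighbours, this average sits above the plain all-PI mean, so the most likely PI is not drowned out by the crowd.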
Finally, we adjust the category sensitivity score with temporal information, i.e., values from the frequency, regularity, duration and sleep scores. We find weights and effect-ranges for each variable so that they contribute reasonably to the final score; see Fig.2 below for details about the values used in this process. The privacy score is the sum of the category sensitivity and the frequency, regularity, duration and sleep adjustments. This final privacy score is then rounded to 1, 2 or 3, such that privacy score values over 3 are capped at privacy category 3.
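The final step can be sketched as a sum followed by rounding and capping. The adjustment values themselves come from the weights and effect-ranges in Fig.2 and are simply taken as inputs here. The function name, the round-half-up rule and the lower bound of 1 are assumptions; the text only states that the result is one of 1, 2 or 3 and that ‘home’ or ‘sleep’ places are set aside as category 4.

```python
import math


def privacy_category(category_sensitivity, frequency_adj, regularity_adj,
                     duration_adj, sleep_adj, is_home_or_sleep=False):
    """Sum the components and round to a category in 1..3 (4 for home/sleep places)."""
    if is_home_or_sleep:
        return 4
    score = (category_sensitivity + frequency_adj + regularity_adj
             + duration_adj + sleep_adj)
    # Round half up (assumed); Python's built-in round() rounds halves to even.
    rounded = math.floor(score + 0.5)
    return min(3, max(1, rounded))
```

Keeping the home/sleep check ahead of the sum mirrors the order described in the text, where those places are categorised before any scoring takes place.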
A different individual might have different sensitive places and so the map varies from person to person.
In Part 3 we’ll show how we get to the above output, and address gaining benefits from multi-modal information.