Denise McKenzie

Benchmark Programme Director

21 May 2021


Transport Data: Maintaining privacy while generating insights

Author: Andreas 'Zac' Zachariah, Co-Founder, TravelAi

Geospatial coordinates tell us where something is. Mobility data tells stories. Movement trajectories tracked in mobility behaviour data tell a many-layered narrative of how people use transport systems. They can also tell more about us and our lives than anyone would be happy with, so sometimes there’s a need to anonymise the data.

Supported by the Benchmark Initiative’s Entrepreneur-in-Residence programme, we at TravelAi have developed an open-source anonymisation tool for trajectory data. The data structure and format are described here on GitHub.

Underlying this work, which protects the privacy of individual travellers while gaining insights into their journeys, is a mission to help make transport systems work better for citizens, to cut costs for operators, and to improve environmental sustainability. We outline the work in three parts:

  • Part 1 – The knowledge paradox in transport
  • Part 2 – Importance of place, time and duration
  • Part 3 – It’s not just about the destination, it’s about the journey too

Transport moves people and goods and drives much of the economy, and access to it has an outsized impact on social mobility. At the same time, transport is the largest contributing category of greenhouse gas emissions in the UK and US (*1), and about 28% of global GHGs (*2). It was the largest category of UK household expenditure in 17 of the last 19 years (*3).

Yet there is a surprising lack of data and data-enabled understanding of how people move around.  The pandemic has generated new reasons to want that information, including assessing compliance with lockdowns, monitoring the safety of public transport, and deciding when and how to relax transport restrictions. That has driven some innovation, but there is still a need for better tools to assess what happens from day to day, to match provision and need more effectively.



*3 see Excel Table 4.3

Part 1: The knowledge paradox in transport

The mixed environment of legacy IT infrastructure, complex systems and multiple public and private operators means that a lot of transport data is in unconnected silos.

For example, someone buying a train ticket to London will register on the LENNON and MOIRA ticket purchasing systems, but when she uses an interchange to Transport for London (TfL) underground services, she will be recorded in the data as effectively a different person on a different journey.

Within the capital, London’s travel data system, like that of many larger international cities, is relatively networked. The system, built around the Oyster card and now supporting card and app payment, delivers a single payment mechanism that makes using underground, bus, tram, overground and train travel within 9 zones around the capital seamless. But it is limited to Greater London.

There are also ways to track a car or other vehicle asset, including ANPR (Automatic Number Plate Recognition). But when the car trip ends, the next stage of transit activity will be recorded in another data silo. Pedestrian activity will, in data terms, most likely be lost. Cycling and walking, and the habits and patterns around them, are represented in data even less than the other modes of transport. This is a particular failing at a time when governments want to encourage more active travel for several converging reasons: to reduce traffic, noise and pollution, to improve the environment in shared spaces, and to encourage people to take more exercise.

Real-time feeds track the mobile asset vehicle (usually using GPS) and service the displays on train platforms and at bus stops that show times to next arrival and communicate delays. Frequency counting (using pneumatic and induction loops or observers with clicker counters) is popular for logging cyclists and vehicles, but the data collection can only take place where the hardware is located. At the start of the first lockdown in March 2020 the government stated it was using Mobile Network Data as a source for plotting population-wide changes in road and rail commuter traffic. However, local and national travel demand surveys (NTS) remain the most widely used datasets in transport planning.

TravelAi’s interactions with public and private sector transport stakeholders have revealed a knowledge paradox. The aggressive ways in which some big tech, advertising and social media organisations have collected, harvested and monetised consumer data have made transport organisations more concerned about collecting data and how that might be seen by the public. That means there are a lot of major stakeholders in transport planning and provision who are not in a position to make effective and confident use of data that could bring benefits for everyone.

Therefore, many benefits can result from developing an open-source solution that addresses ethical and privacy concerns while deriving actionable insights from the data. We wanted to help transport operators address their concerns and reservations, and also to release obfuscated mobility traces for use in a transport data commons.

Privacy concerns at risk of undermining larger ambitions

Privacy concerns around this data centre primarily on the sensitivity of locations, though timestamps and the routes taken also need to be considered. Sensitive locations include a person’s home, workplace, schools, places of worship, doctors’ surgeries and medical centres.
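To make the idea of a sensitive location concrete, here is a minimal sketch of one naive way to protect such places: dropping every trajectory point that falls within a fixed radius of a listed location. This is an illustration only, not the method used in the TravelAi tool; the function names, the 200 m radius and the coordinates are assumptions made up for the example.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def mask_sensitive(points, sensitive_places, radius_m=200.0):
    """Keep only the (lat, lon) points farther than radius_m from every sensitive place.

    radius_m is a hypothetical default chosen for illustration.
    """
    return [
        (lat, lon)
        for lat, lon in points
        if all(
            haversine_m(lat, lon, s_lat, s_lon) > radius_m
            for s_lat, s_lon in sensitive_places
        )
    ]


# Hypothetical example: a short walk that starts at a sensitive location.
home = (51.5000, -0.1000)
track = [(51.5000, -0.1000), (51.5005, -0.1000), (51.5100, -0.1000)]
print(mask_sensitive(track, [home]))
```

Simply deleting points is a blunt instrument: the gap it leaves in a track can itself hint at where the protected place is, which is one reason a real solution has to weigh several parameters together.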

TravelAi is well-placed to take on this work because our smartphone-based software solution, enabled through explicit user permission, generates multimodal movement trajectories without additional user input. These stories of citizens’ mobility habits are derived automatically from raw smartphone sensor data (GPS and accelerometer) that is processed primarily on the device (aka Edge computing) and on the server (the cloud). The output as shown to users is like that seen in Google Maps’ Your Timeline (Fig.1), which is shown here side by side with TravelAi’s own MyWays Digital Travel Diary output (Fig.2).



Figure 1 Google Maps Your Timeline

Figure 2 TravelAi MyWays Travel Diary App


While it would be technically possible to remove timestamps and route information, there is much value in retaining journey detail, including the consistency of patterns (e.g. peak, off-peak or variable pattern) in an individual’s habits. Variations due to seasonality or weather can also be useful insights. This requires the generation of a continuous stream of data points over time (or as transport planners like to call it, longitudinal time series data).
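One way to picture the trade-off between removing timestamps and keeping pattern detail is to coarsen each timestamp rather than delete it. The sketch below is an illustration under stated assumptions, not TravelAi's implementation: the peak windows and the function names are hypothetical. It replaces an exact time with an ISO year, ISO week and a peak/off-peak/weekend band, so week-to-week patterns survive while the exact minute does not.

```python
from datetime import datetime

# Hypothetical weekday peak windows (start hour inclusive, end hour exclusive).
PEAK_WINDOWS = ((7, 10), (16, 19))


def time_band(ts: datetime) -> str:
    """Map an exact timestamp to a coarse band: 'peak', 'off-peak' or 'weekend'."""
    if ts.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return "weekend"
    if any(start <= ts.hour < end for start, end in PEAK_WINDOWS):
        return "peak"
    return "off-peak"


def coarsen(ts: datetime) -> tuple:
    """Replace a timestamp with (ISO year, ISO week, band)."""
    iso = ts.isocalendar()
    return (iso[0], iso[1], time_band(ts))


# A Friday-morning departure keeps its week and band but loses the exact time.
print(coarsen(datetime(2021, 5, 21, 8, 30)))  # (2021, 20, 'peak')
```

Keeping the week number preserves the longitudinal ordering needed to see seasonality, while the band preserves the peak or off-peak habit.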

The untapped potential we see in transport data is part of a wider picture. The Government’s Geospatial Commission published a National Geospatial Strategy in 2020, and is pursuing a set of measures to unlock location-enabled data and promote understanding of its potential. There is also a great deal of work going on elsewhere to manage data-related issues of accessibility, consistency, risk management and integrity, and on developing various types of data commons for transport.  Demonstrable compliance with data protection law is also essential, and data service providers will want to develop solutions that can work across the UK and EU.

At TravelAi we have spent several years working to reveal how people travel. The Benchmark Initiative has helped us ‘refactor’ our thinking to obfuscate the data we have worked to discover.

Sensitive mobility data

Figure 3 Office to home commute

Fig.3 is a simple enough trip and route detail of one person commuting home from the office after work. There is a mix of walking and bus that gets her to a train station, from where she crosses the city by train, arriving at her local station, from where she walks home. In this case, the office and the home are sensitive locations that need to be protected, while details of the multiple legs, interchanges and full route taken are collected. The data gets more complex in the event of trip-chaining where additional stops (e.g. for shopping) complicate a journey.

Figure 4 


This track (Fig 4) in the City of London shows her office, but also shows her ending a walk near where a medical clinic is located.

Public policy support for more active travel, smarter transportation, more demand responsive transport (DRT) options and Mobility-as-a-Service options adds further reasons to reap insights from better data. Autonomous Vehicles (AVs) add another dimension, as their sensors scan and monitor the spaces around them. AVs will use this data to train their systems and retain it in case of accidents.

A solution needs to consider a number of parameters to work at scale, and handle mobility behaviour data that can be generated for groups of users and enterprises.

In Part 2 we’ll explain how we try to resolve the privacy-utility conundrum for places, time and duration. Please visit our public repo here


Appendix (TLDR):

Summary of available ways to monitor mobility

This is not an exhaustive account of the sources of mobility behaviour data, but it indicates the complexities of trying to represent everyday mobility habits, impacts of weather and seasonality, digital silos, private and public transit, population density and technologies.

Road vehicles

  • Manual counts (on-site clicker counts and/or counting cars recorded in videos)
    • Pros – Flexible, can deploy anywhere quickly
    • Cons – Not cheap, limited/narrow insights beyond mode being documented
  • Roadside surveys
    • Pros – Good levels of detail
    • Cons – Expensive, typically take police away from other activities and hence small sample sizes, potential recall error, stated preference bias
  • Travel Demand/Household travel Surveys
    • Pros – Established, long history, trip-chaining (up to 1 week), multi-modal
    • Cons – Stated preference, recall error, costly, no seasonality data
  • Video vehicle detection/ANPR (Automatic Number Plate Recognition)
    • Pros – Taps into extensive CCTV networks, doesn’t require ‘target’ consent, huge scale, 24/7, can follow vehicle within a network, can associate same car to different days
    • Cons – Tracks cars, not mobility behaviour of an individual, registering cycling activity requires latest software developments
  • Pneumatic Loop counters
    • Pros – Portable so can be relocated, not CCTV, simple/low maintenance tech
    • Cons – Expensive for the age and simplicity of the tech, can be confused by motorbikes and especially cyclists, which are not registered, limited and unevenly located across the UK (as many in London as rest of UK)
  • Piezoelectric/Induction Loop counters
    • Pros – Portable so can be relocated, not CCTV, simple/low maintenance tech, cheap and more accurate than pneumatic loops
    • Cons – Not inexpensive, cyclists not registered, limited numbers across the UK (as many in London as rest of UK)
  • Mobile Network Data
    • Pros – Lots of it already collected, each network can cover 25-35% of the population, it doesn’t require engaging with citizens/consumers directly, it’s great for aggregated flow patterns

Public transit (train, tube, tram and bus)

  • Manual counts
    • Pros – relatively quick to deploy and simple to collect
    • Cons – not cheap, usually small sample sizes
  • Travel Demand/Household travel Surveys
    • Pros – Established, long history, trip-chaining (up to 1 week), multi-modal
    • Cons – Stated preference, recall error, costly, no seasonality data
  • Oyster/tap card system
    • Pros – multiple public transit modes measured, effortless
    • Cons – Huge OPEX and CAPEX, no walking, cycling, private car data, just in London
  • LENNON/NRS/RARS (backend rail ticket reconciliation and revenue allocation systems)
    • Pros – multiple public transit modes measured, effortless
    • Cons – Huge OPEX and CAPEX, no walking, cycling, private car data
  • Mobile Network Data
    • Pros – lots of it already collected, each network can cover 25-35% of the population, it doesn’t require engaging with citizens/consumers directly, it’s great for aggregated flow patterns
    • Cons – It’s expensive, sensitive to input data and training models, doesn’t support continuous time series analysis

If cell tower density, road network complexity and public transit proximity are favourable, then Mobile Network Data (MND) can be used to infer both the road vehicle and public transit usage of a person. However, GDPR/ICO rules around privacy mean that networks are not allowed to create time-series data that could connect a person’s travels from one day to another. In aggregate, one can still see the effects of seasonality and weather. The source MND exists across multiple networks and doesn’t require getting any user consent, building up a user base or asking end-users to install any apps.

Cyclists & Pedestrians

  • Manual counts (clickers)
  • Emergence of Visual recognition software for use with CCTV footage

Pedestrians and cyclists are the most underrepresented groups of transport users because getting hold of the data is hard. Sports activity tracking platforms like Strava now sell their users’ anonymised and aggregated cycling and running data to municipalities, but their users are generally not considered to be representative.

