Implement (5.) - Calculation of a threshold to tell real privacy risks apart from non-risks

Overview

With #7 we have the approximate difference between the optimal route r and the unknown actual route \tau connecting two sampled points p_{1} and p_{2}. We approximate \tau by creating a route of routes, r', between the two initial points, using a second, smaller temporal distance for subsampling (see #7 for a more detailed explanation). r' is not guaranteed to represent the true route \tau that the courier took.

There still are uncertainties about r'.

Uncertainties

If r' significantly* differs in duration and/or distance compared to r, what can we derive from that?

This could be due to one or more deliveries and thus carry a privacy risk.
It could also be caused by a detour around a construction site or an accident, by a break, by congestion, or by the courier losing their way. These scenarios would not be privacy risks.

We need to exclude the undesired scenarios and therefore make some assumptions. We say an event is likely if the assumed probability of facing it during a complete work day is higher than 50%. Events may significantly influence either duration only (slower speed) or both distance and duration (some redirection) between two consecutive points p_{i} and p_{i+1}. We do not assume that events influence distance only. That would mean that the driver takes a shorter route at a lower average speed or a longer route at a higher average speed.

delivery 📦 : Very likely. Would affect duration and distance. Extra duration for the stop while delivering.
construction site 👷 : Likely. Could affect duration (extra traffic light) or duration and distance similarly (redirection).
accident own vehicle 🚨 : Not very likely. Would probably affect duration a lot or end tracking.
accident other vehicle 😭 : Not likely. Could affect duration (waiting to get past) or both similarly (redirection).
break 😴 : Likely. Could affect duration (driver waits at a location) or both (redirection, drive somewhere and pause there). Should not happen more than once a day though.
congestion 😠 : Likely. Should only affect duration.
getting lost 😕 : Not very likely. Would affect duration and distance similarly (detour, redirection).

Given this information, we can say that deliveries would probably cause a higher distance for r' compared to r. The higher distance and the stops for delivery would cause a much higher duration. To clarify: because of the stops, the difference in duration should be higher than the difference in distance.

This finding, combined with the low likelihood of the other events and the fact that they lead either to a difference in duration only or to a similar difference in distance and duration, but not to a difference in distance together with an even higher difference in duration, makes it possible to identify deliveries and thus exclude the other events.
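The reasoning above can be illustrated with a small sketch that sorts an observed difference between r and r' into the patterns described. The function name, the labels, and both tolerances (`min_increase`, `similarity_tol`) are assumptions made for illustration only; the issue does not define them, and the durations passed in are assumed to be the observed ones, including any stops.

```python
def classify_difference(dist_r, dur_r, dist_rp, dur_rp,
                        min_increase=0.1, similarity_tol=0.1):
    """Classify how r' differs from r (hypothetical helper, not project code).

    dist_r, dur_r   -- distance and duration of the optimal route r
    dist_rp, dur_rp -- distance and observed duration of r'
    min_increase    -- assumed relative increase that counts as a difference
    similarity_tol  -- assumed margin by which the duration increase must
                       exceed the distance increase to look like a delivery
    """
    rel_dist = (dist_rp - dist_r) / dist_r   # relative increase in distance
    rel_dur = (dur_rp - dur_r) / dur_r       # relative increase in duration

    dist_up = rel_dist >= min_increase
    dur_up = rel_dur >= min_increase

    if not dist_up and not dur_up:
        return "no_difference"
    if dur_up and not dist_up:
        # e.g. congestion, waiting at a construction site, a break on the spot
        return "duration_only"
    if dist_up and not dur_up:
        # not expected according to the assumptions above
        return "distance_only"
    if rel_dur - rel_dist > similarity_tol:
        # detour plus stops: duration grows more than distance
        return "delivery_like"
    # redirection-like: both grow, duration not clearly more than distance
    return "similar_increase"


print(classify_difference(1000, 200, 1300, 400))
```

Only the "delivery_like" pattern would be treated as a privacy risk under the assumptions of this issue; the other labels correspond to the events we want to exclude.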

Implementation strategy

We implement a check for differences in duration and distances in the following way:

A route r between two points p_{1} and p_{2} poses a high privacy risk if it is very likely to contain deliveries. r contains deliveries if:
- Summed distance and duration of r', given sub sample rate k, differ significantly from r: The detours for deliveries.
- Within sub routes r'_{1}, \dots, r'_{k}, the actual durations from points p_{i} to p_{j} with i < j and j <= k + 1 are longer than the optimal ones: The stops for deliveries.
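The two conditions above could be combined roughly as follows. This is a minimal sketch under several assumptions: the `SubRoute` record and its field names are hypothetical, only consecutive point pairs are compared, a single sub route with a longer actual duration is taken as enough to indicate a stop, and the significance factor is passed in as a parameter rather than computed here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SubRoute:
    """One sub route r'_i between consecutive sub-sampled points (hypothetical)."""
    distance_m: float          # routed distance of the sub route
    optimal_duration_s: float  # routed (optimal) duration of the sub route
    actual_duration_s: float   # time that actually passed between the two points


def likely_contains_deliveries(r_distance_m: float,
                               r_duration_s: float,
                               sub_routes: Sequence[SubRoute],
                               factor: float) -> bool:
    """Return True if both conditions from the list above hold.

    factor is the significance factor for the given k (see Remarks).
    """
    # Condition 1: summed distance and duration of r' differ significantly from r.
    total_distance = sum(s.distance_m for s in sub_routes)
    total_duration = sum(s.optimal_duration_s for s in sub_routes)
    detour = (total_distance >= r_distance_m * factor
              and total_duration >= r_duration_s * factor)

    # Condition 2: at least one sub route took longer than its optimal duration.
    # Assumption: "any" rather than "all"; the issue does not specify this.
    stops = any(s.actual_duration_s > s.optimal_duration_s for s in sub_routes)

    return detour and stops


subs = [SubRoute(800, 120, 120), SubRoute(900, 150, 400)]
print(likely_contains_deliveries(1000, 180, subs, factor=1.2))
```

Whether the second condition should require every sub route, some share of them, or a minimum excess duration is left open here and would need to be decided during implementation.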

Remarks

*Significance in this context has to be defined as well. We say a difference in distance (duration) between r and r' given k is significant if the distance (duration) of r' is at least dist(r) \times (1 + \frac{k}{10}) for k < 10 and at least dist(r) \times 2 otherwise; for duration, dist(r) is replaced by the duration of r. The first case accounts for the fact that lower differences are to be expected when k is low.
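As a minimal sketch, the significance rule could be written like this. The function names are hypothetical, and treating k as an integer of at least 1 is an assumption; the rule itself is taken from the remark above.

```python
def significance_factor(k: int) -> float:
    """Factor by which r' must exceed r to count as significantly different.

    1 + k/10 for k < 10, and 2 otherwise.
    """
    if k < 1:
        # Assumption: a sub sample rate below 1 is not meaningful.
        raise ValueError("k must be at least 1")
    return 1 + k / 10 if k < 10 else 2.0


def is_significant(value_r: float, value_r_prime: float, k: int) -> bool:
    """Check one measure (distance or duration) of r' against r."""
    return value_r_prime >= value_r * significance_factor(k)


# With k = 5, r' must be at least 1.5 times as long as r.
print(is_significant(1000.0, 1500.0, 5))  # True
print(is_significant(1000.0, 1499.0, 5))  # False
```

The two cases meet at k = 10, where 1 + k/10 equals 2, so the factor rises linearly with k and then stays constant.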

Perhaps distance and duration should be weighted when calculating differences. Does one of them have a higher information gain regarding privacy risk than the other?

One more thing to measure here: The features from #2 (closed) match coordinates to traffic structures or addresses (reverse-geocoding). The features from 3 so far take the coordinates again, not the traffic structures or addresses, and derive routes. There might be a bias in step 3 that sets the starting and ending points of the route apart from the actual coordinates. This bias might or might not be present in #2 (closed) as well. This should be checked too.

Edited Nov 16, 2021 by Lukas Gehrke