# ETOOBUSY 🚀 minimal blogging for the impatient

# Some Bayes helps

**TL;DR**

I've been thinking a bit about inference applied to a concrete problem.

In these times of Covid-19 there is a lot of discussion about the Apps that are supposed to tell you whether you came into contact with other people, so that sanitary checks can be targeted.

I know that many of those discussions are about privacy concerns.
On the one hand these concerns seem somewhat *out of time*,
considering the breadth of data that corporations like Google and
Facebook have been gathering for a long time; on the other hand, it's
entirely right that we discuss this, and possibly also draw conclusions
about limiting the data we allow companies to gather about us.

This post, anyway, is about a technical issue: how does the App figure out that you got too close to someone else?

# A little model

All the Apps I have heard about leverage Bluetooth to figure out the distance. Rightly so, I daresay: it's a technology based on proximity, it's widespread in a lot of countries, and phones exchange some data that makes inferring close contacts somewhat better than guessing.

In the simplified model, we are very un-fuzzy and set a hard threshold for a contact:

- we estimate the distance between two phones;
- if this distance is below a threshold, a contact is inferred; otherwise, no contact is inferred.

## What distance estimation means

Estimating the distance mainly involves some sort of *channel
inversion*. Your phone knows how much power it received and how much was
transmitted (it should be part of the message, as I understand it), makes
some magic assumption about how much loss there was, and with a
propagation model you get your distance back. Easy, right?

Well, not so fast. There are a *lot* of variables playing in this game,
including the specific position of the two phones by themselves (is it
in the pocket? held in hand? in a purse?) and relative to one another
(antennas propagate differently depending on the direction). So we can
expect that the *magic assumption* will have to be really *magic*.

A usual approach to this end is to mitigate variations over multiple measurements in time. You can take averages or be more creative, but at the end of the day you will always get an estimate of the distance.
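As a rough sketch of what channel inversion plus averaging could look like, here is a toy example assuming a log-distance path-loss model; the reference loss and path-loss exponent are made-up values, and real Apps surely do something more refined:

```python
import math


def estimate_distance(tx_power_dbm, rx_power_dbm,
                      path_loss_exp=2.0, ref_loss_db=40.0):
    """Invert a log-distance path-loss model to estimate distance in meters.

    Model: loss_db = ref_loss_db + 10 * n * log10(d), so
           d = 10 ** ((loss_db - ref_loss_db) / (10 * n)).
    The exponent and reference loss are *assumed* values for illustration.
    """
    loss_db = tx_power_dbm - rx_power_dbm
    return 10 ** ((loss_db - ref_loss_db) / (10 * path_loss_exp))


def average_distance(samples):
    """Mitigate fluctuations by averaging several spot estimates."""
    return sum(estimate_distance(tx, rx) for tx, rx in samples) / len(samples)
```

With these made-up parameters, a received power 40 dB below the transmitted one maps to 1 meter, and each extra 20 dB of loss multiplies the estimate by 10.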

## Inferring contacts

Just saying that one of the measurements was below the threshold might
be a bit too *conservative*. If you rely on spot measurements (i.e.
thereâ€™s no averaging over time), having a single contact event might be
too little to actually infer that a health check is due and we might
want to count at least $N$ such events before calling it a day.

At the end of the day, anyway, this would be a process that is composed
*over* the basic question of understanding whether a specific
measurement supports the hypothesis of contact or not.

# After this simple exercise for the reader…

So suppose we actually gather a lot of data and want to figure out an algorithm to extract some info from these data. We can assume that each data point holds the following basic information:

- whether the data point was taken in a real condition of proximity (event $C$ for *contact*) or not (event $D$ for *distancing*). This is just a boolean that has some implicit assumptions, e.g. the existence of the data point itself (so the two phones were close enough to gather it);
- the outcome of the distance estimation process, that we will assume to be classified in one of $M$ possible slots, yielding events $X_1$, $X_2$, …, $X_M$.

We can count all these data points in a table like the following, where
$\Gamma_{X}$ represents the count of how many times class $X$ came out
in a test under *contact* condition $C$, and $\Delta_{X}$ the count of
how many times class $X$ came out in a test under *distancing* condition
$D$.

|   | $X_1$ | $X_2$ | … | $X_M$ |
|---|---|---|---|---|
| C | $\Gamma_{X_1}$ | $\Gamma_{X_2}$ | … | $\Gamma_{X_M}$ |
| D | $\Delta_{X_1}$ | $\Delta_{X_2}$ | … | $\Delta_{X_M}$ |

Let's see what this table can give us.
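As a toy example of how such a table could be built, here is a sketch where the labeled data points and the class slots are entirely made up:

```python
from collections import Counter

# Hypothetical labeled data points: (condition, slot), where the condition
# is "C" (contact) or "D" (distancing) and the slot is one of M classes.
data = [("C", 1), ("C", 1), ("C", 2), ("D", 2), ("D", 3), ("D", 3), ("D", 3)]

# Gamma row: counts of each class X observed under contact condition C.
gamma = Counter(slot for cond, slot in data if cond == "C")
# Delta row: counts of each class X observed under distancing condition D.
delta = Counter(slot for cond, slot in data if cond == "D")
```

A `Counter` conveniently returns 0 for classes that never showed up, which matches an empty cell in the table.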

## Testing conditions

One key point might be to understand in which conditions we did our measurement campaign. Did we just schedule an equal number of tests in the possible different variants? Or did we just ask people to act naturally, leaving them in an environment for some time while tracking them to attach a $C$/$D$ label to each data point?

In this latter case, we can use the data to estimate the a-priori probability $P_C$ that there's a contact, like this:

$$P_C = \frac{\sum_X \Gamma_X}{\sum_X \Gamma_X + \sum_X \Delta_X}$$

Lacking this estimate, we will have to resort to *guessing*, e.g.
by assuming $P_C = \frac{1}{2}$.
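In code, this prior estimate is just a ratio of counts; a tiny sketch with made-up Gamma and Delta rows:

```python
# Estimate the a-priori contact probability P_C from the count table.
# The Gamma (contact) and Delta (distancing) counts are made up.
gamma = {1: 2, 2: 1, 3: 0}  # counts of each class X under condition C
delta = {1: 0, 2: 1, 3: 3}  # counts of each class X under condition D

# Fraction of all data points that were gathered under contact condition.
p_c = sum(gamma.values()) / (sum(gamma.values()) + sum(delta.values()))
```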

## Turn to estimated probabilities

We can turn the table into something containing our estimates for the probability of each event $X$ by dividing each count by the sum of all counts on its row, like this:

|   | $X_1$ | $X_2$ | … | $X_M$ |
|---|---|---|---|---|
| C | $P_{X_1\vert C} = \frac{\Gamma_{X_1}}{\sum_X \Gamma_X}$ | $P_{X_2\vert C} = \frac{\Gamma_{X_2}}{\sum_X \Gamma_X}$ | … | $P_{X_M\vert C} = \frac{\Gamma_{X_M}}{\sum_X \Gamma_X}$ |
| D | $P_{X_1\vert D} = \frac{\Delta_{X_1}}{\sum_X \Delta_X}$ | $P_{X_2\vert D} = \frac{\Delta_{X_2}}{\sum_X \Delta_X}$ | … | $P_{X_M\vert D} = \frac{\Delta_{X_M}}{\sum_X \Delta_X}$ |

Each probability estimation in the table appears as a *conditioned*
probability because it assumes the *condition* in the first column, i.e.
either $C$ or $D$.
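A minimal sketch of this row-wise normalization, with a made-up Gamma row:

```python
def row_probabilities(counts, slots):
    """Estimate P(X | condition) from one row of counts (Gamma or Delta).

    counts: mapping slot -> count for one condition (C or D)
    slots:  the full list of possible classes X_1 .. X_M
    """
    total = sum(counts.get(x, 0) for x in slots)
    # Each cell becomes its count divided by the row total.
    return {x: counts.get(x, 0) / total for x in slots}


# Made-up Gamma row: counts of each class under contact condition C.
p_x_given_c = row_probabilities({1: 2, 2: 1, 3: 0}, [1, 2, 3])
```

By construction each row of estimates sums to 1, since we divide by the row total.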

## What to do with each X?

At this point it makes sense to ask ourselves: when a measurement in the wild yields a specific class $X$, should we assume that there was a contact or not? Or should we discard the data point and wait for more?

Having measured event $X$ means that one of the following mutually exclusive events happened:

- event $CX$, i.e. both event $C$ (contact) and $X$ were true, or
- event $DX$, i.e. both event $D$ (distancing) and $X$ were true.

We *know* it was one of these two - we got $X$, right? Should we bet on
the one with $C$ or the one with $D$? Let's calculate their
probabilities now that we know that $X$ actually happened:

$$P_{C|X} = \frac{P_{CX}}{P_X} \qquad P_{D|X} = \frac{P_{DX}}{P_X}$$

The sum of these two probabilities amounts to 1 because the two events
cover all possibilities (*contact* or *distancing*) when $X$ is true.
Itâ€™s easy to see that the bigger one is the safer to bet on, so:

- if $P_{C|X} > P_{D|X}$ we bet on $C$, otherwise
- if $P_{C|X} < P_{D|X}$ we bet on $D$.

And yes, if youâ€™re wondering, this is the *maximum likelihood
principle* at work.

The two probabilities in the comparison share the same denominator ($P_{X}$), so we can just as well compare the numerators only, i.e. $P_{CX}$ and $P_{DX}$. These can be calculated as follows:

$$P_{CX} = P_C \cdot P_{X|C} \qquad P_{DX} = P_D \cdot P_{X|D} = (1 - P_C) \cdot P_{X|D}$$

which only involves numbers that we already have (or have assumed).

To summarize, the inference is as follows:

- if $P_C \cdot P_{X|C} \geq (1 - P_C) \cdot P_{X|D}$ we infer $C$ from $X$, otherwise
- if $P_C \cdot P_{X|C} < (1 - P_C) \cdot P_{X|D}$ we infer $D$ from $X$.
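The whole decision rule fits in a one-liner; a sketch, assuming the prior and the two conditional probabilities have already been estimated as above:

```python
def infer_contact(p_c, p_x_given_c, p_x_given_d):
    """Infer C (True) when P_C * P(X|C) >= (1 - P_C) * P(X|D), else D (False)."""
    return p_c * p_x_given_c >= (1 - p_c) * p_x_given_d


# With a flat prior (P_C = 1/2) the rule reduces to comparing likelihoods.
bet_contact = infer_contact(0.5, 0.4, 0.1)      # likelihoods favor contact
bet_distancing = infer_contact(0.5, 0.1, 0.4)   # likelihoods favor distancing
```

A skewed prior can flip the bet even when the likelihoods are fixed, which is exactly why estimating $P_C$ (or guessing it honestly) matters.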

# So… Bayes?

Our initial matrix provides us with the conditional probability of
getting outcome $X$ out of initial conditions $C$ or $D$, while the
maximum likelihood principle requires us to have the *inverse*, i.e. the
conditional probability that event $C$ (or $D$) happened subject to
event $X$ happening.

If you follow the maths, you end up with:

$$P_{C|X} = \frac{P_C \cdot P_{X|C}}{P_X}$$

which is exactly Bayes's Theorem.

# Enough!

I guess I have abused your patience enough at this point. Want more? Take a look at the second post on this topic: Some Bayes helps - more.