Preliminaries
- You need to clone the repository:

  ```shell
  git clone git@bitbucket.org:russo/rio2020-code-std.git
  ```

- When launching Docker, make sure that you mount the repository under `rio2020local`. Recall the instructions here.
Reconstruction Attacks (Introduction)
In this exercise, we will implement the reconstruction attack by Dinur and Nissim in its simplest form, which takes exponential time.
We will consider the dataset of (fake) names and associated (also fake) HIV diagnoses in the file `secret.csv` --- do not look inside this file (yet). The dataset contains the following columns:

`user,HIV`

The column `user` denotes the patient's name, and `HIV` is a binary number describing whether the patient is infected with HIV (very private information).
Someone implemented the file `DB.hs` to provide a very simple API to the dataset of patients. It contains three functions that you can call:

```haskell
loadDB :: IO DB
names  :: DB -> [Name]
add    :: DB -> [Index] -> IO Double
```

The module `DB.hs` provides an abstract datatype `DB` which wraps the content of the file `secret.csv`. Function `loadDB` fetches the content of the database and returns it as an abstract value of type `DB`. There are only two operations on values of type `DB`: `names`, which returns the names of the patients in the same order as they appear in the dataset, and `add`, which sums the content of the column `HIV` and then adds some noise. The code for `add` is as follows:
```haskell
add :: DB -> [Index] -> IO Double
add (DB rs) is = do
  let s = Prelude.sum [hiv (rs !! i) | i <- is]
  noise <- fromUniform (-noiseParam) noiseParam
  return (s + noise)
```
The noise is drawn from a uniform distribution between `-noiseParam` and `+noiseParam`.
It is useful to check the documentation. It is located at `rio2020-code-std/RAttack/dist/doc/html/rattack/index.html`, so you can open it as follows if you use Google Chrome as your web browser:

```shell
$ google-chrome rio2020-code-std/RAttack/dist/doc/html/rattack/index.html
```
Exercise: Reconstruction
Please make sure you are working in the right directory by running `:cd rio2020/RAttack/src/`. Your task is to complete the definitions in the file `Attack.hs`.
Complete the definition of

```haskell
allSums :: DB -> IO [ResultQuery]
```

This function computes the result of the noisy sums for all combinations of indexes. The type `ResultQuery` is simply a `Double`.

Complete the definition of

```haskell
sumIndexes :: Candidate -> [Index] -> ResultQuery
```

The type `Candidate` is a guess about the HIV condition of all the patients, i.e., it is simply `[HIV]`. This function takes a candidate and computes the sum of the diagnoses for the given indexes.

```haskell
> sumIndexes [0,0,0,1,1] [0,1,3]
1.0
> sumIndexes [0,0,0,1,1] [0,3,4]
2.0
```
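As a hint, `sumIndexes` boils down to a one-line list comprehension. A minimal sketch, assuming `HIV` and `ResultQuery` are the `Double` synonyms described above (the type definitions here just make the snippet self-contained):

```haskell
type HIV         = Double
type Index       = Int
type Candidate   = [HIV]
type ResultQuery = Double

-- Sum the candidate's diagnosis values at the given positions.
sumIndexes :: Candidate -> [Index] -> ResultQuery
sumIndexes cand is = sum [cand !! i | i <- is]
```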
Complete the definition of

```haskell
allSumsNoNoise :: Candidate -> [ResultQuery]
```

This function computes the sums on a candidate for all the combinations of indexes.

```haskell
> indexes 5  -- list of indexes for candidates of size 5
[[0],[1],[2],[3],[4],[0,1],[0,2],[1,2],[0,3],[1,3],[2,3],[0,4],[1,4],[2,4],[3,4],[0,1,2],[0,1,3],[0,2,3],[1,2,3],[0,1,4],[0,2,4],[1,2,4],[0,3,4],[1,3,4],[2,3,4],[0,1,2,3],[0,1,2,4],[0,1,3,4],[0,2,3,4],[1,2,3,4],[0,1,2,3,4]]
> allSumsNoNoise [0,0,0,1,1]  -- sums for the indexes above
[0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,1.0,1.0,2.0,2.0,2.0,2.0]
```
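The ordering shown for `indexes` coincides with taking the non-empty subsequences of `[0 .. n-1]` grouped by length. A self-contained sketch that reproduces it (the type synonyms and the local copy of `sumIndexes` are just to make the snippet compile on its own):

```haskell
import Data.List (subsequences)

type HIV         = Double
type Index       = Int
type Candidate   = [HIV]
type ResultQuery = Double

-- All non-empty index subsets of [0 .. n-1], grouped by size,
-- matching the example output above.
indexes :: Int -> [[Index]]
indexes n =
  [ s | k <- [1 .. n]
      , s <- subsequences [0 .. n - 1]
      , length s == k ]

-- Noise-free sum of a candidate's entries at the given positions.
sumIndexes :: Candidate -> [Index] -> ResultQuery
sumIndexes cand is = sum [cand !! i | i <- is]

-- Noise-free sums for every index combination.
allSumsNoNoise :: Candidate -> [ResultQuery]
allSumsNoNoise cand = map (sumIndexes cand) (indexes (length cand))
```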
Complete the definition of the function

```haskell
generateCandidates :: DB -> [Candidate]
```

which, given a value of type `DB`, generates all the possible candidates.

```haskell
> do db <- loadDB ; return (generateCandidates db)  -- when the database has only 3 entries
[[0.0,0.0,0.0],[0.0,0.0,1.0],[0.0,1.0,0.0],[0.0,1.0,1.0],[1.0,0.0,0.0],[1.0,0.0,1.0],[1.0,1.0,0.0],[1.0,1.0,1.0]]
```
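Enumerating all candidates of a given length is a textbook use of `replicateM` over the two possible diagnoses, and it produces exactly the order of the example above. A sketch (the helper name `candidatesOfSize` is ours; in `Attack.hs` you would apply it to the number of patients, e.g. `length (names db)`):

```haskell
import Control.Monad (replicateM)

type HIV       = Double
type Candidate = [HIV]

-- All 2^n candidates of length n, in lexicographic order.
candidatesOfSize :: Int -> [Candidate]
candidatesOfSize n = replicateM n [0.0, 1.0]
```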
Implement the function

```haskell
fit :: Double -> [ResultQuery] -> Candidate -> Bool
```

When calling `fit noiseMag results candidate`, this function determines whether there exist a noise-free sum on the `candidate` and a corresponding noisy sum in `results` whose distance is greater than `noiseMag`.

Implement the following functions:

```haskell
findCandidate :: IO [Candidate]
attack        :: IO [([Name], Candidate)]
```
Function `findCandidate` guesses the shape of all possible databases, and function `attack` puts names to those guesses!

```haskell
> showNames
["Michael","Jennifer","Jason","Melissa","Christopher","Jessica"]
> attack  -- We assume the noise injected is uniform between [-1,1]
[(["Michael","Jennifer","Jason","Melissa","Christopher","Jessica"],[0.0,0.0,1.0,0.0,0.0,1.0])]
```
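The pruning logic behind `fit` and `findCandidate` can be seen in isolation: a candidate is rejected (`fit` returns `True`) as soon as one of its noise-free sums is further than the noise magnitude from the corresponding observed noisy sum, and the surviving candidates are the plausible databases. A minimal, self-contained sketch under that reading (the helper `plausible` and the `results` in the test are illustrative, not part of the exercise's API):

```haskell
import Control.Monad (replicateM)
import Data.List (subsequences)

type HIV         = Double
type Index       = Int
type Candidate   = [HIV]
type ResultQuery = Double

-- All non-empty index subsets of [0 .. n-1], grouped by size.
indexes :: Int -> [[Index]]
indexes n = [ s | k <- [1 .. n], s <- subsequences [0 .. n - 1], length s == k ]

-- Noise-free sums for every index combination.
allSumsNoNoise :: Candidate -> [ResultQuery]
allSumsNoNoise cand =
  [ sum [cand !! i | i <- is] | is <- indexes (length cand) ]

-- True iff some noise-free sum is further than noiseMag from the
-- corresponding noisy result, i.e., the candidate is ruled out.
fit :: Double -> [ResultQuery] -> Candidate -> Bool
fit noiseMag results candidate =
  or [ abs (r - s) > noiseMag
     | (r, s) <- zip results (allSumsNoNoise candidate) ]

-- Keep every candidate of length n that the observed results do not rule out.
plausible :: Double -> [ResultQuery] -> Int -> [Candidate]
plausible noiseMag results n =
  [ c | c <- replicateM n [0.0, 1.0], not (fit noiseMag results c) ]
```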
Now, open the file `secret.csv` and check if the attack reconstructed the HIV diagnoses. Recall from the lectures that if you inject noise of magnitude `E`, then the Hamming distance between your guesses and the real dataset is at most `4E`.

- What happens when the injected noise is of magnitude `1`?
- What happens when the injected noise is of magnitude `7`?
- What happens when the injected noise is of magnitude `10`?

Remember to run the attack several times to get a feeling for what is going on.
Counting (Introduction)
We will consider the dataset of (fake) names and associated (also fake) sexually transmitted diseases (STDs) in the file `std.csv` --- do not look inside this file (yet). The dataset contains the following columns:

`name,lastname,std`

The columns `name` and `lastname` denote the patient, and `std` is a string describing an STD like gonorrhea, genital herpes, or HIV.

You should check the file `RowDef.hs`, which shows how to represent every row of `std.csv` as a record in Haskell.
```haskell
type Name     = String
type Lastname = String
type STD      = String

data Row = Row
  { name     :: Name
  , lastname :: Lastname
  , std      :: STD
  } deriving Show
```

So, the datatype `Row` is going to be the type of the rows!
Exercise: Counting
You know that your friend, Martin Baro, went to visit a doctor at the clinic which owns the dataset `std.csv`, where you work as a data analyst. If your friend has an STD, then his name will have been added to `std.csv`.

You are extremely curious to know if he is sick. So, your task is to write the following function:

```haskell
isMyFriendSick :: Name -> Lastname -> Epsilon -> Dataset 1 Row -> Query (Value Double)
```

This function essentially counts how many records in `std.csv` match Martin Baro.

- Assuming β=0.05 (i.e., with 95% confidence), how certain would you be that your friend has an STD when considering the following values of ε? For each case, answer "very certain" or "I don't know", justifying your answer.
  - ε=1: `analysis1 = isMyFriendSick "Martin" "Baro" 1`
  - ε=10: `analysis10 = isMyFriendSick "Martin" "Baro" 10`
  - ε=30: `analysis30 = isMyFriendSick "Martin" "Baro" 30`
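As a sanity check on these questions: a sensitivity-1 Laplace count satisfies P(|noise| ≥ α) = exp(−εα), so at confidence 1−β the error is α = ln(1/β)/ε. This is the standard tail bound; it is an assumption here, and DPella's own `accuracy` estimate may differ slightly.

```haskell
-- Error bound alpha of a sensitivity-1 Laplace count at confidence
-- 1 - beta, from P(|Lap(1/eps)| >= alpha) = exp(-eps * alpha) = beta.
-- (Standard tail bound; assumed, not taken from DPella's docs.)
alphaLaplace :: Double -> Double -> Double
alphaLaplace eps beta = log (1 / beta) / eps

-- With beta = 0.05: eps = 1 gives alpha ~ 3.0 (swamps a 0/1 count),
-- while eps = 10 gives ~ 0.30 and eps = 30 gives ~ 0.10.
```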
If you are not certain about your friend when running `analysis1`, but you are allowed to run that analysis many times: how many times will you need to run `analysis1` to be as certain about your friend as if you ran `analysis10`?

Now, you should take the role of the data curator. The file `RunSTD.hs` is written by the data curator to execute `analysis1`, `analysis10`, and `analysis30`.
Your task is to write three different functions, i.e.,

```haskell
executeSTD1  :: IO (Value Double)
executeSTD10 :: IO (Value Double)
executeSTD30 :: IO (Value Double)
```

which execute the analyses `analysis1`, `analysis10`, and `analysis30`, respectively, with the information obtained from the file `std.csv`.
- After you run `analysis1` 3 times, can you be sure if your friend is in the dataset?
- After you run `analysis30` once, can you be sure if your friend is in the dataset?
- After you run `analysis1` 10 times and average the results, can you be sure that your friend is in the dataset?

Now, you should look into the file `std.csv`, check if Martin Baro is there, and confirm/refute your guesses.
CDFs (Introduction)
We will consider the dataset of network packets in the file `hotspot.csv`. The dataset contains the following columns:

`No.,Time,Source,Destination,Protocol,Length`

The column `No.` denotes the number of the packet, `Time` is the time when the packet was registered, and `Source` and `Destination` are the source and destination IPs, respectively. Column `Protocol` is self-explanatory. Finally, column `Length` indicates the size of the packet in bytes. We will only use the information in that column for this exercise.
In this exercise, you should write an analysis to compute cumulative distribution functions (CDFs) for packet lengths in `hotspot.csv`.
Before you start, you should check the file `RowDef.hs`, which shows how to represent every row of `hotspot.csv` as a record in Haskell.
```haskell
type IP    = String
type Bytes = Int

data Row = Row
  { no       :: Int
  , time     :: Float
  , src      :: IP
  , dst      :: IP
  , protocol :: String
  , size     :: Bytes
  } deriving Show
```

So, the datatype `Row` is going to be the type of the rows! Furthermore, we define the bins for the CDF as simply a list of bytes.

```haskell
type Bins = [Bytes]
```
Exercise: CDF (Version 1)
Your task is to write the following function, given in the file `CDF1.hs`:

```haskell
cdf1 :: Bins -> Epsilon -> Dataset 1 Row -> Query (Value [Double])
```

The algorithm that you should implement consists of simply distributing the packets into their corresponding bins and then performing a DP-count (recall function `dpCount` in DPella).

What is the accuracy of your analysis when given ε=1 and β=0.05 and the following two different bins?

```haskell
> accuracy (cdf1 [1,5,10,30] 1 symDataset) 0.05
> accuracy (cdf1 [1,10..100] 1 symDataset) 0.05
```
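Before adding noise, it helps to see the plain computation this analysis is based on: for each bin `b`, a CDF simply counts the packets whose length is at most `b`. A minimal non-private sketch (the helper name `cdfCounts` and its test data are ours; in `cdf1`, each of these counts would become a DP-count):

```haskell
type Bytes = Int
type Bins  = [Bytes]

-- For each bin b, count how many values are <= b.  In cdf1, each of
-- these counts is replaced by a dpCount over the dataset.
cdfCounts :: Bins -> [Bytes] -> [Double]
cdfCounts bins sizes =
  [ fromIntegral (length (filter (<= b) sizes)) | b <- bins ]
```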
Do your accuracy results above depend on the actual values of the bins? If your answer is negative, write a program that shows the accuracy for bins of length i to length j for a given ε. You should implement the function `exploreCDF1`:

```haskell
type Range = (Int, Int)

exploreCDF1 :: Range -> Epsilon -> Beta -> [(Int, Alpha)]
```

Using `exploreCDF1`, obtain the accuracy when considering 4, 5, ..., 20 buckets with ε=1 and β=0.05. You should see that the more buckets you consider, the bigger the error. Please report which concentration bound is being applied (i.e., union bound or Chernoff).
At this point, you have been taking the role of the data analyst. After considering the trade-off between privacy and accuracy, you have decided that the next analysis should be run by the data curator:

```haskell
analysis = cdf1 [10,50,100,200,10000] 1
```
Now, let's switch roles and take the data curator's one! The file `RunCDF1.hs` is written by the data curator to execute `analysis`. Your task is to write the code that executes the data analyst's query with the information obtained from the file `hotspot.csv`. How much ε should the data curator (i.e., you) have provided in `dpEval` to successfully run `analysis`?
Exercise: CDF (Version 2)
In this exercise, you need to implement a different algorithm to compute a CDF, which consists of first generating a histogram for the given bins and then using it to create the CDF. Observe that, when generating the histogram, your program must put every packet in exactly one bin.
Before writing the CDF, you need to implement an auxiliary function which, given a list of ordered bins and a value, returns the bin where the value should be placed (we assume that all the values are below the maximum bin; in the example below, it is 30). More concretely, you need to implement the function `assignBin` in the file `CDF2.hs`:

```haskell
assignBin :: Ord a => [a] -> a -> a
```

Below, we provide some examples.

```haskell
> assignBin [1,6,10,30] 15
30
> assignBin [1,6,10,30] 7
10
> assignBin [1,6,10,30] 0
0
```
DPella provides a simpler primitive than `dpPartition`, called `dpPartRepeat`, which applies the same aggregation on each partition.

```haskell
dpPartRepeat :: Ord k
             => (Dataset s r -> Query (Value a))
             -> [k]
             -> (r -> k)
             -> Dataset s r
             -> Query (Map.Map k (Value a))
```

Using this primitive, you should implement the algorithm to compute the CDF mentioned above. You should provide an implementation of the function `cdf2` in the file [`CDF2.hs`](https://bitbucket.org/russo/rio2020-code-std/src/master/CDFs/CDF2.hs).

```haskell
cdf2 :: Bins -> Epsilon -> Dataset 1 Row -> Query (Value [Double])
```
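The histogram-then-CDF idea can be seen in a plain, non-private setting: bucket each value into exactly one bin, count per bin, and take running sums. A minimal sketch (helper names and test data are ours; the bin-assignment convention "first bin that is >= the value" is one reasonable choice, and in `cdf2` the per-bin counts would come from `dpPartRepeat` with a DP-count):

```haskell
import Data.List (sort)

type Bytes = Int
type Bins  = [Bytes]

-- Histogram: for each bin, count the values that fall into it, where a
-- value lands in the first bin that is >= it (values above the last
-- bin are dropped in this sketch).
histogram :: Bins -> [Bytes] -> [Double]
histogram bins sizes = [ count b | b <- sorted ]
  where
    sorted  = sort bins
    count b = fromIntegral
            . length
            $ [ v | v <- sizes, firstBin v == Just b ]
    firstBin v = case dropWhile (< v) sorted of
                   (b:_) -> Just b
                   []    -> Nothing

-- CDF: running sums over the histogram counts.
cdfFromHistogram :: [Double] -> [Double]
cdfFromHistogram = scanl1 (+)
```

Note the contrast with `cdf1`: here each value contributes to exactly one bin, whereas in `cdf1` a value is counted by every bin above it.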
What is the accuracy of `cdf2` with ε=1 and β=0.05 when considering the bins [1,5,10,30]?

Does the result in the previous point depend on the actual bins' values? That is, if you run

```haskell
> accuracy (cdf2 [1,5,10,30] 1 symDataset) 0.05
> accuracy (cdf2 [100,3,60,10] 1 symDataset) 0.05
```

would both invocations of `accuracy` give the same error estimation?

How does the accuracy of `cdf2` with ε=1 and β=0.05 progress when considering bins of 50 to 100 elements? To answer that question, you should write the function `exploreCDF2`:

```haskell
exploreCDF2 :: Range -> Epsilon -> Beta -> [(Int, Alpha)]
```
Which implementation of the CDF computation is the best one, i.e., `cdf1` or `cdf2`? Here, we mean best as: considering the same ε, which function produces the least error under the same β? To answer that question, you should write the function `compareCDFs` in the file `CDF2.hs`:

```haskell
compareCDFs :: Range -> Epsilon -> Beta -> [(Int, String)]
```

This function returns a list of pairs. The first component of each pair indicates the number of bins, while the second component indicates which implementation of the CDF is better, i.e., it is either the string `"CDF1"` or `"CDF2"`.
As we did for `cdf1`, our data analyst role determines that the analysis to be run by the data curator is as follows:

```haskell
analysis = cdf2 [10,50,100,200,10000] 1
```

You should now take the role of the data curator and write the function `executeCDF2` in the file `RunCDF2.hs`.
Sums (Introduction)
We will consider the same dataset as the one for the CDFs exercise above. To compute a sum of values, we need to determine the range of the values. DPella supports only ranges of natural numbers, e.g., `[1,10]`, `[5,10]`, etc. DPella needs that information to calculate the sensitivity of sum queries, i.e., if every value is in the range `[a,b]`, the sensitivity is `b - a`. The way to specify ranges in DPella is via the primitive `range`:
```haskell
range :: (KnownNat a, KnownNat b, KnownNat (b - a), a <= b) => Range a b
```

This function receives no arguments since the range is indicated at the type level. To create ranges, we need to use type applications, e.g.,

```haskell
range1 = range @1 @10
range2 = range @5 @10
```
Exercise: Information transmitted by the network
In this exercise, you need to implement an analysis that computes the amount of transmitted data.
We are going to sum the total size of the data transmitted over the network. The data curator indicates that it is going to be in the order of GBs. You should implement the following function:

```haskell
totalBytes :: Epsilon -> Dataset 1 Row -> Query (Value Double)
```

The data curator indicates that packet sizes are given in bytes and range from `60` to `350` bytes.

- What is the minimum ε you need to have an error of at most 10 MB with β=0.05?
You now need to write a similar analysis that calculates the total amount of data sent from the router `"Cisco_ba:1f:80"`. We know that this result should be in the order of hundreds of megabytes.

```haskell
router1 = "Cisco_ba:1f:80"

bytesRouter1 :: Epsilon -> Dataset 1 Row -> Query (Value Double)
```

- What is the minimum ε you need to obtain an error of at most 1 MB with β=0.05?