Preliminaries
- You need to clone the repository:

  ```shell
  git clone git@bitbucket.org:russo/rio2020-code-std.git
  ```

- When launching Docker, make sure that you mount the repository under `rio2020local`. Recall the instructions here.
Reconstruction Attacks (Introduction)
In this exercise, we will implement the reconstruction attack by Dinur and Nissim in its simplest form, which takes exponential time.
We will consider the dataset of (fake) names and associated (also fake) HIV diagnoses in the file `secret.csv` --- do not look inside this file (yet). The dataset contains the following columns:

`user,HIV`

The column `user` denotes the patient's name, and `HIV` is a binary number describing whether the patient is infected with HIV (very private information).
Someone implemented the file `DB.hs` to provide a very simple API to the dataset of patients. It contains three functions that you can call:

```haskell
loadDB :: IO DB
names  :: DB -> [Name]
add    :: DB -> [Index] -> IO Double
```

The module `DB.hs` provides an abstract datatype `DB` which wraps the content of the file `secret.csv`. Function `loadDB` fetches the content of the database and returns it as an abstract value of type `DB`. There are only two operations on values of type `DB`: `names`, which returns the names of the patients in the same order as they appear in the dataset, and `add`, which sums the content of the column `HIV` and then adds some noise. The code for `add` is as follows:
```haskell
add :: DB -> [Index] -> IO Double
add (DB rs) is = do
  let s = Prelude.sum [hiv (rs !! i) | i <- is]
  noise <- fromUniform (-noiseParam) noiseParam
  return (s + noise)
```
The noise is drawn from a uniform distribution between `-noiseParam` and `+noiseParam`.
It is useful to check the documentation. It is located at `rio2020-code-std/RAttack/dist/doc/html/rattack/index.html`, so you can open it as follows if you use Google Chrome as your web browser:

```shell
$ google-chrome rio2020-code-std/RAttack/dist/doc/html/rattack/index.html
```
Exercise: Reconstruction
Please make sure you are working in the right directory by running `:cd rio2020/RAttack/src/`. Your task is to complete the definitions in the file `Attack.hs`.
Complete the definition of

```haskell
allSums :: DB -> IO [ResultQuery]
```

This function computes the result of the noisy sums for all combinations of indexes. The type `ResultQuery` is simply a `Double`.

Complete the definition of

```haskell
sumIndexes :: Candidate -> [Index] -> ResultQuery
```

The type `Candidate` is a guess about the HIV condition of all the patients, i.e., it is simply `[HIV]`. This function takes a candidate and computes the sum of the diagnoses for the given indexes.

```haskell
> sumIndexes [0,0,0,1,1] [0,1,3]
1.0
> sumIndexes [0,0,0,1,1] [0,3,4]
2.0
```
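As a hint, `sumIndexes` boils down to a one-line list comprehension. A minimal sketch, assuming `HIV` and `ResultQuery` are the `Double` synonyms described above (the type definitions here just make the snippet self-contained):

```haskell
type HIV         = Double
type Index       = Int
type Candidate   = [HIV]
type ResultQuery = Double

-- Sum the candidate's diagnosis values at the given positions.
sumIndexes :: Candidate -> [Index] -> ResultQuery
sumIndexes cand is = sum [cand !! i | i <- is]
```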
Complete the definition of

```haskell
allSumsNoNoise :: Candidate -> [ResultQuery]
```

This function computes the sums on a candidate for all the combinations of indexes.

```haskell
> indexes 5  -- list of indexes for candidates of size 5
[[0],[1],[2],[3],[4],[0,1],[0,2],[1,2],[0,3],[1,3],[2,3],[0,4],[1,4],[2,4],[3,4],[0,1,2],[0,1,3],[0,2,3],[1,2,3],[0,1,4],[0,2,4],[1,2,4],[0,3,4],[1,3,4],[2,3,4],[0,1,2,3],[0,1,2,4],[0,1,3,4],[0,2,3,4],[1,2,3,4],[0,1,2,3,4]]
> allSumsNoNoise [0,0,0,1,1]  -- sums for the indexes above
[0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,1.0,1.0,2.0,2.0,2.0,2.0]
```
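The ordering shown for `indexes` coincides with taking the non-empty subsequences of `[0 .. n-1]` grouped by length. A self-contained sketch that reproduces it (the type synonyms and the local copy of `sumIndexes` are just to make the snippet compile on its own):

```haskell
import Data.List (subsequences)

type HIV         = Double
type Index       = Int
type Candidate   = [HIV]
type ResultQuery = Double

-- All non-empty index subsets of [0 .. n-1], grouped by size,
-- matching the example output above.
indexes :: Int -> [[Index]]
indexes n =
  [ s | k <- [1 .. n]
      , s <- subsequences [0 .. n - 1]
      , length s == k ]

-- Noise-free sum of a candidate's entries at the given positions.
sumIndexes :: Candidate -> [Index] -> ResultQuery
sumIndexes cand is = sum [cand !! i | i <- is]

-- Noise-free sums for every index combination.
allSumsNoNoise :: Candidate -> [ResultQuery]
allSumsNoNoise cand = map (sumIndexes cand) (indexes (length cand))
```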
Complete the definition of the function

```haskell
generateCandidates :: DB -> [Candidate]
```

which, given a value of type `DB`, generates all the possible candidates.

```haskell
> do db <- loadDB ; return (generateCandidates db)  -- when the database has only 3 entries
[[0.0,0.0,0.0],[0.0,0.0,1.0],[0.0,1.0,0.0],[0.0,1.0,1.0],[1.0,0.0,0.0],[1.0,0.0,1.0],[1.0,1.0,0.0],[1.0,1.0,1.0]]
```
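Enumerating all candidates of a given length is a textbook use of `replicateM` over the two possible diagnoses, and it produces exactly the order of the example above. A sketch (the helper name `candidatesOfSize` is ours; in `Attack.hs` you would apply it to the number of patients, e.g. `length (names db)`):

```haskell
import Control.Monad (replicateM)

type HIV       = Double
type Candidate = [HIV]

-- All 2^n candidates of length n, in lexicographic order.
candidatesOfSize :: Int -> [Candidate]
candidatesOfSize n = replicateM n [0.0, 1.0]
```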
Implement the function

```haskell
fit :: Double -> [ResultQuery] -> Candidate -> Bool
```

When calling `fit noiseMag results candidate`, this function determines whether there exist a noise-free sum on the `candidate` and a corresponding noisy sum in `results` whose distance is greater than `noiseMag`.

Implement the following functions:

```haskell
findCandidate :: IO [Candidate]
attack        :: IO [([Name], Candidate)]
```
Function `findCandidate` guesses the shape of all possible databases, and function `attack` puts names to those guesses!

```haskell
> showNames
["Michael","Jennifer","Jason","Melissa","Christopher","Jessica"]
> attack  -- We assume the noise injected is uniform between [-1,1]
[(["Michael","Jennifer","Jason","Melissa","Christopher","Jessica"],[0.0,0.0,1.0,0.0,0.0,1.0])]
```
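The pruning logic behind `fit` and `findCandidate` can be seen in isolation: a candidate is rejected (`fit` returns `True`) as soon as one of its noise-free sums is further than the noise magnitude from the corresponding observed noisy sum, and the surviving candidates are the plausible databases. A minimal, self-contained sketch under that reading (the helper `plausible` and the `results` in the test are illustrative, not part of the exercise's API):

```haskell
import Control.Monad (replicateM)
import Data.List (subsequences)

type HIV         = Double
type Index       = Int
type Candidate   = [HIV]
type ResultQuery = Double

-- All non-empty index subsets of [0 .. n-1], grouped by size.
indexes :: Int -> [[Index]]
indexes n = [ s | k <- [1 .. n], s <- subsequences [0 .. n - 1], length s == k ]

-- Noise-free sums for every index combination.
allSumsNoNoise :: Candidate -> [ResultQuery]
allSumsNoNoise cand =
  [ sum [cand !! i | i <- is] | is <- indexes (length cand) ]

-- True iff some noise-free sum is further than noiseMag from the
-- corresponding noisy result, i.e., the candidate is ruled out.
fit :: Double -> [ResultQuery] -> Candidate -> Bool
fit noiseMag results candidate =
  or [ abs (r - s) > noiseMag
     | (r, s) <- zip results (allSumsNoNoise candidate) ]

-- Keep every candidate of length n that the observed results do not rule out.
plausible :: Double -> [ResultQuery] -> Int -> [Candidate]
plausible noiseMag results n =
  [ c | c <- replicateM n [0.0, 1.0], not (fit noiseMag results c) ]
```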
Now, open the file `secret.csv` and check if the attack reconstructed the HIV diagnoses. Recall from the lectures that if you inject noise of magnitude `E`, then the Hamming distance between your guesses and the real dataset is at most `4E`.

- What happens when the injected noise is of magnitude `1`?
- What happens when the injected noise is of magnitude `7`?
- What happens when the injected noise is of magnitude `10`?

Remember to run the attack several times to get a feeling for what is going on.
Counting (Introduction)
We will consider the dataset of (fake) names and associated (also fake) sexually transmitted diseases (STDs) in the file `std.csv` --- do not look inside this file (yet). The dataset contains the following columns:

`name,lastname,std`

The columns `name` and `lastname` denote the patient, and `std` is a string describing an STD like gonorrhea, genital herpes, or HIV.

You should check the file `RowDef.hs`, which shows how to represent every row of `std.csv` as a record in Haskell.
```haskell
type Name     = String
type Lastname = String
type STD      = String

data Row = Row
  { name     :: Name
  , lastname :: Lastname
  , std      :: STD
  } deriving Show
```

So, the datatype `Row` is going to be the type of the rows!
Exercise: Counting
You know that your friend, Martin Baro, went to visit a doctor at the clinic which owns the dataset `std.csv`, where you work as a data analyst. If your friend has an STD, then his name will have been added to `std.csv`.

You are extremely curious to know if he is sick. So, your task is to write the following function:

```haskell
isMyFriendSick :: Name -> Lastname -> Epsilon -> Dataset 1 Row -> Query (Value Double)
```

This function essentially counts how many records in `std.csv` match Martin Baro.

- Assuming β=0.05 (i.e., with 95% confidence), how certain would you be that your friend has an STD when considering the following values of ε? For each case, answer "very certain" or "I don't know", justifying your answer.
  - ε=1: `analysis1 = isMyFriendSick "Martin" "Baro" 1`
  - ε=10: `analysis10 = isMyFriendSick "Martin" "Baro" 10`
  - ε=30: `analysis30 = isMyFriendSick "Martin" "Baro" 30`
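As a sanity check on these questions: a sensitivity-1 Laplace count satisfies P(|noise| ≥ α) = exp(−εα), so at confidence 1−β the error is α = ln(1/β)/ε. This is the standard tail bound; it is an assumption here, and DPella's own `accuracy` estimate may differ slightly.

```haskell
-- Error bound alpha of a sensitivity-1 Laplace count at confidence
-- 1 - beta, from P(|Lap(1/eps)| >= alpha) = exp(-eps * alpha) = beta.
-- (Standard tail bound; assumed, not taken from DPella's docs.)
alphaLaplace :: Double -> Double -> Double
alphaLaplace eps beta = log (1 / beta) / eps

-- With beta = 0.05: eps = 1 gives alpha ~ 3.0 (swamps a 0/1 count),
-- while eps = 10 gives ~ 0.30 and eps = 30 gives ~ 0.10.
```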
If you are not certain about your friend when running `analysis1`, but you are allowed to run that analysis many times: how many times will you need to run `analysis1` to be as certain about your friend as if you ran `analysis10`?

Now, you should take the role of the data curator. The file `RunSTD.hs` is written by the data curator to execute `analysis1`, `analysis10`, and `analysis30`.
Your task is to write three different functions, i.e.,

```haskell
executeSTD1  :: IO (Value Double)
executeSTD10 :: IO (Value Double)
executeSTD30 :: IO (Value Double)
```

which execute the analyses `analysis1`, `analysis10`, and `analysis30`, respectively, with the information obtained from the file `std.csv`.
- After you run `analysis1` 3 times, can you be sure if your friend is in the dataset?
- After you run `analysis30` once, can you be sure if your friend is in the dataset?
- After you run `analysis1` 10 times and average the results, can you be sure that your friend is in the dataset?

Now, you should look into the file `std.csv`, check if Martin Baro is there, and confirm/refute your guesses.
CDFs (Introduction)
We will consider the dataset of network packets in the file `hotspot.csv`. The dataset contains the following columns:

`No.,Time,Source,Destination,Protocol,Length`

The column `No.` denotes the number of the packet, `Time` is the time when the packet was registered, and `Source` and `Destination` are the source and destination IPs, respectively. Column `Protocol` is self-explanatory. Finally, column `Length` indicates the size of the packet in bytes. We will only use the information in that column for this exercise.
In this exercise, you should write an analysis to compute cumulative distribution functions (CDFs) for packet lengths in `hotspot.csv`.
Before you start, you should check the file `RowDef.hs`, which shows how to represent every row of `hotspot.csv` as a record in Haskell.
```haskell
type IP    = String
type Bytes = Int

data Row = Row
  { no       :: Int
  , time     :: Float
  , src      :: IP
  , dst      :: IP
  , protocol :: String
  , size     :: Bytes
  } deriving Show
```

So, the datatype `Row` is going to be the type of the rows! Furthermore, we define the bins for the CDF as simply a list of bytes.

```haskell
type Bins = [Bytes]
```
Exercise: CDF (Version 1)
Your task is to write the following function, given in the file `CDF1.hs`:

```haskell
cdf1 :: Bins -> Epsilon -> Dataset 1 Row -> Query (Value [Double])
```

The algorithm that you should implement consists of simply distributing the packets into their corresponding bins and then performing a DP-count (recall function `dpCount` in DPella).

What is the accuracy of your analysis when given ε=1 and β=0.05 and the following two different bins?

```haskell
> accuracy (cdf1 [1,5,10,30] 1 symDataset) 0.05
> accuracy (cdf1 [1,10..100] 1 symDataset) 0.05
```
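Before adding noise, it helps to see the plain computation this analysis is based on: for each bin `b`, a CDF simply counts the packets whose length is at most `b`. A minimal non-private sketch (the helper name `cdfCounts` and its test data are ours; in `cdf1`, each of these counts would become a DP-count):

```haskell
type Bytes = Int
type Bins  = [Bytes]

-- For each bin b, count how many values are <= b.  In cdf1, each of
-- these counts is replaced by a dpCount over the dataset.
cdfCounts :: Bins -> [Bytes] -> [Double]
cdfCounts bins sizes =
  [ fromIntegral (length (filter (<= b) sizes)) | b <- bins ]
```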
Do your accuracy results above depend on the actual values of the bins? If your answer is negative, write a program that shows the accuracy for bins of length i to length j for a given ε. You should implement the function `exploreCDF1`:

```haskell
type Range = (Int, Int)

exploreCDF1 :: Range -> Epsilon -> Beta -> [(Int, Alpha)]
```

Using `exploreCDF1`, obtain the accuracy when considering 4, 5, ..., 20 buckets with ε=1 and β=0.05. You should see that the more buckets you consider, the bigger the error. Please report which concentration bound is being applied (i.e., union bound or Chernoff).
At this point, you have been taking the role of the data analyst. After considering the trade-off between privacy and accuracy, you have decided that the next analysis should be run by the data curator:

```haskell
analysis = cdf1 [10,50,100,200,10000] 1
```
Now, let's switch roles and take the data curator's one! The file `RunCDF1.hs` is written by the data curator to execute `analysis`. Your task is to write the code that executes the data analyst's query with the information obtained from the file `hotspot.csv`. How much ε should the data curator (i.e., you) have provided in `dpEval` to successfully run `analysis`?
Exercise: CDF (Version 2)
In this exercise, you need to implement a different algorithm to compute a CDF, which consists of first generating a histogram for the given bins and then using it to create the CDF. Observe that, when generating the histogram, your program must put every packet in exactly one bin.
Before writing the CDF, you need to implement an auxiliary function which, given a list of ordered bins and a value, returns the bin where the value should be placed (we assume that all the values are below the maximum bin; in the example below, it is 30). More concretely, you need to implement the function `assignBin` in the file `CDF2.hs`:

```haskell
assignBin :: Ord a => [a] -> a -> a
```

Below, we provide some examples.

```haskell
> assignBin [1,6,10,30] 15
30
> assignBin [1,6,10,30] 7
10
> assignBin [1,6,10,30] 0
0
```
DPella provides a simpler primitive than `dpPartition`, called `dpPartRepeat`, which applies the same aggregation on each partition.

```haskell
dpPartRepeat :: Ord k
             => (Dataset s r -> Query (Value a))
             -> [k]
             -> (r -> k)
             -> Dataset s r
             -> Query (Map.Map k (Value a))
```

Using this primitive, you should implement the algorithm to compute the CDF mentioned above. You should provide an implementation of the function `cdf2` in the file [`CDF2.hs`](https://bitbucket.org/russo/rio2020-code-std/src/master/CDFs/CDF2.hs).

```haskell
cdf2 :: Bins -> Epsilon -> Dataset 1 Row -> Query (Value [Double])
```
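The histogram-then-CDF idea can be seen in a plain, non-private setting: bucket each value into exactly one bin, count per bin, and take running sums. A minimal sketch (helper names and test data are ours; the bin-assignment convention "first bin that is >= the value" is one reasonable choice, and in `cdf2` the per-bin counts would come from `dpPartRepeat` with a DP-count):

```haskell
import Data.List (sort)

type Bytes = Int
type Bins  = [Bytes]

-- Histogram: for each bin, count the values that fall into it, where a
-- value lands in the first bin that is >= it (values above the last
-- bin are dropped in this sketch).
histogram :: Bins -> [Bytes] -> [Double]
histogram bins sizes = [ count b | b <- sorted ]
  where
    sorted  = sort bins
    count b = fromIntegral
            . length
            $ [ v | v <- sizes, firstBin v == Just b ]
    firstBin v = case dropWhile (< v) sorted of
                   (b:_) -> Just b
                   []    -> Nothing

-- CDF: running sums over the histogram counts.
cdfFromHistogram :: [Double] -> [Double]
cdfFromHistogram = scanl1 (+)
```

Note the contrast with `cdf1`: here each value contributes to exactly one bin, whereas in `cdf1` a value is counted by every bin above it.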
What is the accuracy of `cdf2` with ε=1 and β=0.05 when considering the bins [1,5,10,30]?

Does the result in the previous point depend on the actual bins' values? That is, if you run

```haskell
> accuracy (cdf2 [1,5,10,30] 1 symDataset) 0.05
> accuracy (cdf2 [100,3,60,10] 1 symDataset) 0.05
```

would both invocations of `accuracy` give the same error estimation?

How does the accuracy of `cdf2` with ε=1 and β=0.05 progress when considering bins of 50 to 100 elements? To answer that question, you should write the function `exploreCDF2`:

```haskell
exploreCDF2 :: Range -> Epsilon -> Beta -> [(Int, Alpha)]
```
Which implementation of the CDF computation is the best one, i.e., `cdf1` or `cdf2`? Here, we mean best as: considering the same ε, which function produces the least error under the same β? To answer that question, you should write the function `compareCDFs` in the file `CDF2.hs`:

```haskell
compareCDFs :: Range -> Epsilon -> Beta -> [(Int, String)]
```

This function returns a list of pairs. The first component of each pair indicates the number of bins, while the second component indicates which implementation of the CDF is better, i.e., it is either the string `"CDF1"` or `"CDF2"`.
As we did for `cdf1`, our data analyst role determines that the analysis to be run by the data curator is as follows:

```haskell
analysis = cdf2 [10,50,100,200,10000] 1
```

You should now take the role of the data curator and write the function `executeCDF2` in the file `RunCDF2.hs`.
Sums (Introduction)
We will consider the same dataset as the one for the CDFs exercise above. To compute a sum of values, we need to determine the range of the values. DPella supports only ranges of natural numbers, e.g., `[1,10]`, `[5,10]`, etc. DPella needs that information to calculate the sensitivity of sum queries, i.e., if every value is in the range `[a,b]`, the sensitivity is `b - a`. The way to specify ranges in DPella is via the primitive `range`:
```haskell
range :: (KnownNat a, KnownNat b, KnownNat (b - a), a <= b) => Range a b
```

This function receives no arguments since the range is indicated at the type level. To create ranges, we need to use type applications, e.g.,

```haskell
range1 = range @1 @10
range2 = range @5 @10
```
Exercise: Information transmitted by the network
In this exercise, you need to implement an analysis that computes the amount of transmitted data.
We are going to sum the total size of the data transmitted over the network. The data curator indicates that it is going to be in the order of GBs. You should implement the following function:

```haskell
totalBytes :: Epsilon -> Dataset 1 Row -> Query (Value Double)
```

The data curator indicates that packet sizes are given in bytes and range from `60` to `350` bytes.

- What is the minimum ε you need to have an error of at most 10 MB with β=0.05?
You now need to write a similar analysis that calculates the total amount of data sent from the router `"Cisco_ba:1f:80"`. We know that this result should be in the order of hundreds of megabytes.

```haskell
router1 = "Cisco_ba:1f:80"

bytesRouter1 :: Epsilon -> Dataset 1 Row -> Query (Value Double)
```

- What is the minimum ε you need to obtain an error of at most 1 MB with β=0.05?