### aggregate-functions

#### A peer-to-peer and privacy-aware data mining/aggregation algorithm: is it possible?

Suppose I have a network of N nodes, each with a unique identity (e.g. a public key), communicating over a protocol with no central server (e.g. a DHT such as Kad). Each node stores a variable V. Taking e-voting as an easy example, V could be the name of a candidate.

Now I want to execute an "aggregation" function over all the V variables available in the network. In the e-voting example, I want to count votes.

My question is purely theoretical (I have to prove a statement; details are at the end of the question), so please don't focus on e-voting and all of its security aspects. To be explicit: please don't answer that "a node may have any number of identities by generating more keys", "IPs can be traced back", etc., because that's another matter. Let's look at distributed aggregation from the privacy point of view only.

The question

Is it possible, in the general case, for a node to compute a function of variables stored at other nodes without learning which value belongs to which node's identity? Have researchers designed such a privacy-aware distributed algorithm? I'm only dealing with privacy aspects, not general security.

Current thoughts

My current answer is no: I claim that a central server that obtains all the Vs and processes them without storing them is necessary, and that there are more legal than technical means to ensure that no individual node's data is stored or retransmitted by the central server. I'm asking you to prove that statement false :)

In the e-voting example, I think it's impossible to count how many people voted for Alice and for Bob without asking all the nodes, one by one, "Hey, who do you vote for?"

Real case

I'm doing research in the Personal Data Store (PDS) field. Suppose you store your call log in the PDS and somebody wants to compute statistics about the phone calls (i.e. mean duration, number of calls per day, variance, standard deviation) without learning either aggregated or individual data about any one person (that is, nobody must learn whom I call, nor my own mean call duration).

If a trusted broker exists, and everybody trusts it, that node can expose a `double getMeanCallDuration()` API that first invokes `CallRecord[] getCalls()` on every PDS in the network and then computes the statistics over all rows. Without the central trusted broker, having each PDS expose `double getMyMeanCallDuration()` isn't statistically usable (the mean of the means is not, in general, the mean of all calls) and, most importantly, reveals the individual user's data together with their identity.
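To make the statistical point in the question concrete, here is a small sketch comparing the mean of per-user means with the mean over all calls. The user names and call durations are made up for illustration, and `calls_by_user` is a hypothetical stand-in for what each PDS would hold.

```python
from statistics import mean

# Hypothetical call durations in seconds, one list per PDS owner.
calls_by_user = {
    "alice": [60, 60, 60, 60],
    "bob": [600],
    "carol": [120, 180],
}

# What you get if every PDS only exposes its own mean.
per_user_means = [mean(durations) for durations in calls_by_user.values()]
mean_of_means = mean(per_user_means)

# What a trusted broker would compute from all rows.
all_calls = [d for durations in calls_by_user.values() for d in durations]
pooled_mean = mean(all_calls)

# The pooled mean can be rebuilt from per-user (sum, count) pairs.
total = sum(sum(durations) for durations in calls_by_user.values())
count = sum(len(durations) for durations in calls_by_user.values())
weighted_mean = total / count

print(mean_of_means, pooled_mean, weighted_mean)
```

The two means differ because users with few calls are over-weighted in the mean of means. Exposing a per-user sum and count would fix the statistics, but it still reveals individual data, so it does not address the privacy requirement.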

Yes, it is possible. There is work that answers your question and solves the problem under some assumptions. See the paper "Privacy, efficiency & fault tolerance in aggregate computations on massive star networks".

You can compute a function (for example, a sum) over the values of a group of nodes at another node, without the participating nodes revealing any data to each other or even to the node doing the computation. After the computation, everyone learns the result, but no one learns any individual data other than their own, which they already knew. The paper describes the protocol and proves its security; the protocol itself gives you the privacy level I just described.

Unlinking a node's value from its identity is a separate problem. You could use anonymous credentials (see https://idemix.wordpress.com/2009/08/18/quick-intro-to-credentials/) or something similar to prove that you are a legitimate participant without revealing your identity (in a distributed scenario).

The catch is that this protocol needs a semi-trusted node to do the computation. A fully distributed protocol (for example, in a P2P network) is not that easy. The difficulty is not a lack of storage (you can use a DHT, for example) but that you need to replace the trusted or semi-trusted node with the network itself, and that is where the issues start: which node does the computation? Why that one and not another? What if some nodes collude?
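As a rough illustration of how a node can compute a sum without ever seeing an individual value, here is a sketch based on pairwise additive masks. This is a generic technique chosen for illustration, not the protocol from the paper mentioned above. The names `masked_reports`, `aggregate`, and `MODULUS` are hypothetical, and the sketch assumes every pair of nodes can agree on a shared random mask over a private channel; here that step is simulated in one process.

```python
import secrets

# Assumption: the modulus only needs to exceed the largest possible true sum.
MODULUS = 2**61 - 1


def masked_reports(values, modulus=MODULUS):
    """Simulate each node masking its value with pairwise random masks.

    For every pair (i, j) with i < j, node i adds the shared mask and
    node j subtracts it, so all masks cancel in the overall sum.
    """
    n = len(values)
    reports = [v % modulus for v in values]
    for i in range(n):
        for j in range(i + 1, n):
            mask = secrets.randbelow(modulus)
            reports[i] = (reports[i] + mask) % modulus
            reports[j] = (reports[j] - mask) % modulus
    return reports


def aggregate(reports, modulus=MODULUS):
    """The aggregating node only ever sees masked reports."""
    return sum(reports) % modulus


# 1 means "voted for Alice", 0 means "did not".
votes_for_alice = [1, 0, 1, 1, 0]
reports = masked_reports(votes_for_alice)
print(aggregate(reports))
```

Each individual report is a uniformly random-looking number, yet the total is exact. The collusion concern raised above still applies: if the aggregator colludes with enough nodes to learn all the masks a given node used, that node's value is exposed.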

How about having each node publish two sets of data, x and y, such that x - y = v? Assuming that I can emit x and y independently, you can still compute the overall sum and mean correctly, while every single message on its own is largely worthless.

So for the voting example with candidates X, Y, Z, I might have one identity publish the vote +2 -1 +3 and my second identity publish the vote -2 +2 -3. Added together, the two messages give 0 +1 0, that is, one vote for Y.

But of course you can no longer verify that I didn't vote multiple times.
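The splitting idea above can be sketched as follows. This is only an illustration under assumptions: `split_vote`, `tally`, and the `spread` parameter are hypothetical names, the second message carries -y so that the two messages simply add up to the vote, and the two messages are assumed to be published under unlinkable identities (simulated here by shuffling).

```python
import secrets

CANDIDATES = ("X", "Y", "Z")
rng = secrets.SystemRandom()


def split_vote(choice, spread=1000):
    """Split a one-hot vote v into two messages x and -y with x - y = v."""
    v = [1 if c == choice else 0 for c in CANDIDATES]
    x = [rng.randint(-spread, spread) for _ in CANDIDATES]
    neg_y = [vi - xi for vi, xi in zip(v, x)]  # x + (-y) == v
    return x, neg_y


def tally(messages):
    """Sum all published messages column by column."""
    totals = [sum(column) for column in zip(*messages)]
    return dict(zip(CANDIDATES, totals))


choices = ["Y", "X", "Y", "Z", "Y"]
messages = []
for choice in choices:
    x, neg_y = split_vote(choice)
    messages.extend([x, neg_y])

# Shuffling stands in for publishing under independent identities.
rng.shuffle(messages)
result = tally(messages)
print(result)
```

Because the random parts are drawn from a bounded range, a single message is not perfectly hiding; doing the arithmetic modulo a fixed number would tighten that. As noted above, nothing in this sketch stops a node from publishing extra messages.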
