Graph-Based Semi-Supervised Learning for Missing Feature Inference

Semi-Supervised Learning

What is known as Semi-Supervised Learning (SSL) is an area of Machine Learning that comprises techniques for making use of a small amount of labeled data to classify a large amount of unlabeled data. You can think of it as a family of methods for "propagating" known information in an efficient fashion (mathematically/probabilistically speaking) through the abstract space in which the data resides.

For example, think of propagating sentiment tags such as "positive", "negative" or "neutral" from tweets you have already tagged onto new tweets. That is, given pairs

$$ (tw_1, t_1),\dots,(tw_n,t_n) $$

of tweets and tags, one would estimate the probability density function (pdf) over tags for every new tweet \(TW\); for example, the probability
$$\mathbb{P}[TW,\text{"pos"}\mid(tw_1,t_1),(tw_2,t_2),\dots,(tw_n,t_n)]=p.$$ In other words, \(p\) would be the probability that the new tweet \(TW\) can be tagged as "pos" (or: how "pos" our tweet is) given our previous knowledge of the "tweet sentiment space".

As you can see, this is a handy technique when we want to "fill big holes" in our data. It is useful not only for newly arrived, untagged data points, but also for extending new features "backwards" over our datasets.

Graph Based Semi-Supervised Learning

There are a lot of SSL methods and applications out there, some of them really cool (these guys, for example, propagate info from 2D images to their 3D image dataset :o). Sadly, most of the techniques share the same issue: they are computationally very expensive, since the number of labels sometimes grows into the thousands.

When things become hard to compute, the best thing to do (since humanity hasn't been able to build sufficiently powerful machines yet) is to divide our problem into pieces of easier problems; engineers call this technique parallelization.

The easiest way of parallelizing any computational problem that involves a "relation" is encoding the problem in a graph. If we manage to encode our problem successfully in a graph and build a model with a couple of properties (convergence, separation of variables, etc... more on this below), then we can build parallel algorithms that update the values of the model for each node independently, "node by node" or in a parallel fashion. In short:

Graph-based semi-supervised learning algorithms propagate probability distributions through a graph, based on the weights of its edges.

This is the main property used by GB-SSL, since it is very intuitive to construct the graph based on the similarity (under some adequate metric) of the data points and then propagate labels through the graph.
These algorithms tend to perform better than those not based on graphs, and their complexity grows linearly in the number of edges and labels.
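To make the graph construction concrete, here is a minimal Scala sketch that builds a weighted similarity graph from feature vectors. The Gaussian kernel, the `eps` threshold and the names `gaussianSimilarity`/`buildGraph` are illustrative assumptions for this post, not part of any particular GB-SSL implementation.

```scala
// Minimal sketch: build a weighted similarity graph from feature vectors.
// The Gaussian kernel and the `eps` threshold are illustrative choices.
object SimilarityGraph {

  type NodeId = Int
  // adjacency: node -> list of (neighbor, weight)
  type Graph = Map[NodeId, Seq[(NodeId, Double)]]

  // similarity in (0, 1]: close points get weights near 1
  def gaussianSimilarity(x: Array[Double], y: Array[Double], sigma: Double): Double = {
    val sqDist = x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum
    math.exp(-sqDist / (2 * sigma * sigma))
  }

  def buildGraph(points: Vector[Array[Double]], sigma: Double, eps: Double): Graph =
    points.indices.map { i =>
      val neighbors = for {
        j <- points.indices if j != i
        w = gaussianSimilarity(points(i), points(j), sigma)
        if w > eps // keep only sufficiently similar pairs as edges
      } yield (j, w)
      i -> neighbors
    }.toMap
}
```

A call like `buildGraph(featureVectors, sigma = 1.0, eps = 0.1)` would give us the weight matrix \(W\) used in the rest of the post.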

Kueski use case

Well, here at Kueski we are always searching for new features, services or approaches that provide us with additional info about our clients; in order to offer them a better service, we are always trying to know them better. When a new feature from an external service arrives in our databases, the feature extraction pipeline is triggered by clients asking for a loan, so only users who ask for a loan after the feature is added to our platform have this feature in our databases. So it is a must to "fill the holes" and tag users who haven't asked for a loan recently. Or to tag those users that will never come back :'(
Note: Also, as a personal statement, we like weird stuff here. As in any startup, there's a need for craziness. If you are not bringing crazy ideas to the table you are not startupping anymore. Right? That's the definition. Besides, even when it could sound a bit twisted, you would be amazed at how frequently brilliant things come out of weirdness ^_^. Seriously, we like to try different approaches to our problems... I think that's pretty cool.

So, in short, the main problem is

Suppose we have a database of users with "holes" in the features (a lot of NaNs), and we need to fill them.

As you can see, I have brought up the term NaN. I think there are a couple of valuable remarks to make on this matter:

  • There are a lot of techniques for "filling NaNs", but this one is very powerful when the labeled dataset is tiny compared with the unlabeled one. I would recommend asking a pure statistician how they would fill the NaNs in a column with 100 values and 100,000 NaNs. Just for fun...

  • Also I like this approach (since I am kind of a graph freak).

  • We like weird stuff here remember...

  • We can do it.

Do you need another reason? The first one is powerful enough though.

The model

We are gonna get a bit formal now, so if you don't like math, skip this section. If you do, make yourself comfortable :).

The method we are gonna show restates a bit of what is proposed in this paper.

  • Suppose we have a graph \(G=(V,E,W)\), where \(V\) is the set of nodes, \(E\) the edges and \(W\) the weights. The weights often come as a similarity matrix; this means that the term \(W_{i,j}\) measures the similarity between \(v_i\) and \(v_j\), typically derived from a distance \(d_m(v_i,v_j)\) under an adequate metric (\(\|\cdot\|_m\)).

  • Also, we define \(V=V_l\cup V_u\) with \(V_l\cap V_u=\emptyset\), where \(V_l\) is the labeled subset (in this case our tagged users) and \(V_u\) the unlabeled one.

  • \(L\) is the set of labels, \(|L|=m\), and \(\mathbb{Y}\) is an \(n\times m\) matrix describing the current labels of the users, that is: \(\mathbb{Y}_{vl}=0\) for \(v\in V_u\). Finally, \(\hat{\mathbb{Y}}\) is the estimated pdf for the labels in the graph.

Our final goal is to learn \(\hat{\mathbb{Y}}\) by propagating the information of \(\mathbb{Y}\) through \(G\).
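Before diving into the objective, here is one possible way of representing the objects defined above in Scala. The containers and names (`Graph`, `LabelData`, `uniformPrior`) are illustrative assumptions chosen for the sketches in this post, not our actual data model.

```scala
// Illustrative containers for the objects defined above; names are ours.
object ModelData {

  type NodeId = Int
  type Dist   = Array[Double] // one probability per label, length m

  final case class Graph(
      nodes: Set[NodeId],
      // weighted adjacency: v -> (u -> w_vu)
      weights: Map[NodeId, Map[NodeId, Double]]
  )

  final case class LabelData(
      labels: Vector[String],   // the label set L, |L| = m
      labeled: Set[NodeId],     // V_l; the unlabeled set V_u is nodes -- labeled
      y: Map[NodeId, Dist]      // Y: seed rows for v in V_l, zero rows elsewhere
  ) {
    val m: Int = labels.size
    // the uniform prior U used in the third term of the objective below
    val uniformPrior: Dist = Array.fill(m)(1.0 / m)
  }
}
```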

So, we learn \(\hat{\mathbb{Y}}\) by minimizing the following convex objective function

$$\begin{array}{rl} C(\hat{\mathbb{Y}})=&\mu_1 \sum\limits_{v\in V_l} s_{vv}\|\hat{\mathbb{Y}}_v-\mathbb{Y}_v\|_{m} \\ +&\mu_2 \sum\limits_{v\in V,\, u\in \mathcal{N}(v)}w_{vu}\|\hat{\mathbb{Y}}_v-\hat{\mathbb{Y}}_u\|_{m} \\ +&\mu_3 \sum\limits_{v\in V} \|\hat{\mathbb{Y}}_v-\mathbb{U}\|_{m} \end{array}$$

adding the restriction \(\sum\limits_{l=1}^{m}\hat{\mathbb{Y}}_{vl}=1,\ \forall v\), which states that every row needs to sum to 1 (i.e., it needs to be a consistent pdf).

The term multiplied by \(\mu_1\) basically states that our estimate \(\hat{\mathbb{Y}}\) needs to be very similar to the current information \(\mathbb{Y}\). Here the indicator \(s_{vv}\) is 1 when \(v\in V_l\) and 0 otherwise.

The second term of the sum, multiplied by \(\mu_2\), weights the propagation of labels by similarity using \(W\). It is important to note that \(\mathcal{N}(v)\) is the set of neighbors of \(v\), so by minimizing this term we ensure compatibility between the metric we previously defined on the feature space (\(\|\cdot\|_m\)) and the way in which labels are propagated.

In the third term we find \(\mathbb{U}\), which is an a priori distribution we assume for the unlabeled data. For example, if we were propagating the age of our clients to other clients who didn't give us this info, we would need to take into account that the age distribution is not uniform over our client universe, so \(\mathbb{U}\) should capture this prior knowledge. For simplicity we take \(\mathbb{U}\) to be the uniform distribution, which explains the chosen letter :).
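As a hedged sketch of how the three terms fit together, the following Scala snippet evaluates \(C(\hat{\mathbb{Y}})\), assuming the norm \(\|\cdot\|_m\) is the squared Euclidean distance between rows; the function and argument names are ours for illustration.

```scala
// Sketch of the objective C(Y-hat), assuming ||.||_m is the squared Euclidean
// distance between rows. All names are illustrative.
object Objective {

  type Dist = Array[Double]

  def sqDist(a: Dist, b: Dist): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def cost(
      yHat: Map[Int, Dist],                 // current estimate, one row per node
      y: Map[Int, Dist],                    // seed label matrix Y
      weights: Map[Int, Map[Int, Double]],  // v -> (u -> w_vu)
      labeled: Set[Int],                    // V_l
      prior: Dist,                          // U
      mu1: Double, mu2: Double, mu3: Double
  ): Double = {
    // term 1: stay close to the seed labels on V_l
    val fit = labeled.toSeq.map(v => sqDist(yHat(v), y(v))).sum
    // term 2: neighbors with a large w_vu should end up with similar rows
    val smooth = weights.toSeq.flatMap { case (v, nbrs) =>
      nbrs.map { case (u, w) => w * sqDist(yHat(v), yHat(u)) }
    }.sum
    // term 3: without other evidence, stay close to the prior U
    val reg = yHat.values.map(row => sqDist(row, prior)).sum
    mu1 * fit + mu2 * smooth + mu3 * reg
  }
}
```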

So, using the previously mentioned node-by-node update scheme, justified by the Jacobi iterative method, we minimize the function above iteratively as

$$ \begin{array}{rl} \hat{\mathbb{Y}}^{(i)}_{vl}=&\frac{1}{M_{vl}}\left(\mu_1 s_{vv}\mathbb{Y}_{vl}+\mu_2\sum\limits_{u\in\mathcal{N}(v)}w_{vu}\hat{\mathbb{Y}}^{(i-1)}_{ul}+\mu_3\mathbb{U}_l\right) \\ M_{vl}=&\mu_1 s_{vv}+\mu_2\sum\limits_{u\in\mathcal{N}(v)}w_{vu}+\mu_3 \end{array} $$
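A minimal Scala sketch of this update for a single node \(v\) and all labels could look like the following; it assumes distributions are stored as arrays indexed by label, and the names are illustrative rather than taken from the paper's code.

```scala
// One Jacobi-style update for a single node v and all labels, following the
// formula above. Names and data layout are illustrative.
object NodeUpdate {

  type Dist = Array[Double]

  def update(
      v: Int,
      prevYHat: Map[Int, Dist],     // estimates from iteration i-1
      y: Map[Int, Dist],            // seed label matrix Y
      neighbors: Map[Int, Double],  // u -> w_vu for u in N(v)
      isLabeled: Boolean,           // encodes s_vv
      prior: Dist,                  // U
      mu1: Double, mu2: Double, mu3: Double
  ): Dist = {
    val s    = if (isLabeled) 1.0 else 0.0
    val wSum = neighbors.values.sum
    val norm = mu1 * s + mu2 * wSum + mu3                    // M_v
    val seed = y.getOrElse(v, Array.fill(prior.length)(0.0)) // zero row for V_u
    Array.tabulate(prior.length) { l =>
      val nbrTerm = neighbors.map { case (u, w) => w * prevYHat(u)(l) }.sum
      (mu1 * s * seed(l) + mu2 * nbrTerm + mu3 * prior(l)) / norm
    }
  }
}
```

Note that if the rows of \(\mathbb{Y}\), the previous iterate and \(\mathbb{U}\) each sum to 1, the numerator summed over labels equals \(M_{vl}\), so the updated row automatically satisfies the restriction \(\sum_l\hat{\mathbb{Y}}_{vl}=1\).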

It is worth noting that the estimated \(\hat{\mathbb{Y}}\) will be exactly that, an estimate of the probability that a node (user) carries some label. On the other hand, at first sight it seems that \(\hat{\mathbb{Y}}\) can be as large or dense as we want. By dense I am referring to the fact that the entries, instead of being floats, could easily be functions; that is, given \(v\in V_l,\ \mathbb{Y}_v(s)=[f_v^1(s),f_v^2(s),\dots,f_v^m(s)]\), we could propagate this to \(\hat{\mathbb{Y}}_v(s)\) for every \(v\in V_u\). Which is an even crazier thing, but hopefully possible to do. Anyway, that is a trip for another post.

The algorithm

Now that we have our model, let's take a look at what we can do with it.

The paper previously mentioned was featured on the Google Research Blog. The guys at Google are using a version of this algorithm to propagate labels on text and phrases for their new chat application, Allo.

The idea is to identify what kind of conversation is taking place and suggest possible answers. For this task, they tag words and phrase usages which are unknown to their learning systems.

Our case is very similar but a bit simpler since we've got less labels to propagate... for now }:-].

So the main thing here is to establish how we can compute the values for each node in a concurrent and parallel fashion. The algorithm needs a partition of the nodes of the graph (i.e., the user set), \(V=[V_1,\dots,V_p]\), and an initialization \(\hat{\mathbb{Y}}_v^0\) which is \(\mathbb{Y}_v\) when \(v\in V_l\) and \(\mathbb{U}\) otherwise:

Data: a graph \(G=(V,E,W)\), where \(V=V_l\cup V_u\).
Result: the pdf of labels \(\hat{\mathbb{Y}}_v=(\hat{\mathbb{Y}}_{v1},\hat{\mathbb{Y}}_{v2},\dots,\hat{\mathbb{Y}}_{vm})\) for each node \(v\in V\) that minimizes \(C(\hat{\mathbb{Y}})\) (here \(\hat{\mathbb{Y}}_{vl}\) represents the probability of label \(l\) on node \(v\)).
Init: \(|L|=m\); \(\hat{\mathbb{Y}}_{vl}^0\) as before; \(G=(V,E,W)\); \(V=\cup_{i=1}^p V_i\) with \(V_i\cap V_j=\emptyset\).

while \(i<\) max_iter:
  process every partition \(V_k\), \(k\in[1,\dots,p]\), in parallel:
    for each \(v\in V_k\):
      send message(\(u\), \(\hat{\mathbb{Y}}_v^{(i-1)}\)) to every \(u\in\mathcal{N}(v)\)
      \([\mathcal{M}_1,\dots,\mathcal{M}_{|\mathcal{N}(v)|}]\) = [receive(message(\(u\), \(\hat{\mathbb{Y}}_u^{(i-1)}\))) for \(u\in\mathcal{N}(v)\)]
      update(\(\hat{\mathbb{Y}}_v^{(i)}\), \([\mathcal{M}_1,\dots,\mathcal{M}_{|\mathcal{N}(v)|}]\))

Important things to note are the choices of the parameters \(\mu_1,\mu_2,\mu_3\). The way in which we pick them is closely related to the nature and intentions of our prediction since, for example, our a priori pdf could be unknown. Fine tuning this space of hyperparameters is not a trivial matter and deserves much more space on its own. Personally speaking, I think that setting them all equal to 1 as a first approach can bring out very impressive results.
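To illustrate the partition-parallel loop without bringing in the full actor machinery, here is a simplified Scala sketch that updates each partition inside its own `Future` and synchronizes at the end of every iteration; reading the frozen estimates from iteration \(i-1\) plays the role of the messages above, and all names are illustrative.

```scala
// Simplified stand-in for the partition-parallel loop: each partition V_k is
// updated inside its own Future, synchronizing at the end of every iteration.
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ParallelPropagation {

  type Dist = Array[Double]

  def run(
      partitions: Vector[Set[Int]],               // V = V_1 ∪ ... ∪ V_p
      init: Map[Int, Dist],                       // Y-hat^0 (Y_v on V_l, U otherwise)
      updateNode: (Int, Map[Int, Dist]) => Dist,  // the Jacobi update for one node
      maxIter: Int
  ): Map[Int, Dist] = {
    var yHat = init
    for (_ <- 1 to maxIter) {
      val prev = yHat // every worker reads the estimates from the previous iteration
      val perPartition = partitions.map { part =>
        Future(part.map(v => v -> updateNode(v, prev)).toMap)
      }
      val updated = Await.result(Future.sequence(perPartition), Duration.Inf)
      yHat = updated.reduce(_ ++ _)
    }
    yHat
  }
}
```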

Implementation

The implementation could have been done in any framework that allows graph structures to be stored and analyzed. We chose Neo4J and Scala/Akka for the implementation of the message passing scheme using actors. There is an actor for every partition of the nodes in the graph, and the update of the values of each node is done by processing the queue of messages coming from each global neighbor of every node in the partition.
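As a taste of what the actor protocol could look like, here is a hypothetical set of message types; the actual Akka implementation (supervision, routing, synchronization) is not shown, and these case class names are our own, not production code.

```scala
// Hypothetical message protocol for the per-partition actors (names are ours;
// the real Akka actor implementation is not shown here).
object Protocol {

  type Dist = Array[Double]

  // neighbor u's current estimate for iteration i, sent to the actor owning v
  final case class NeighborEstimate(from: Int, to: Int, iteration: Int, dist: Dist)

  // the coordinator asks a partition actor to run one sweep over its nodes
  final case class RunIteration(iteration: Int)

  // a partition actor reports that all of its nodes were updated for this iteration
  final case class IterationDone(partition: Int, iteration: Int)
}
```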

Conclusions

And that's pretty much it. As a conclusion, I would like to recommend to all you guys out there to take a look at the wonderful possibilities behind a well-constructed data structure. There are really cool graph databases you can use in your projects, and efficient algorithms for solving a wide range of machine learning and pattern recognition problems.

It feels like a good add-on for our data ninjutsu. Not only because we data scientists are always trying to find relations in not-so-well-related databases and never-obviously-related datasets, but because widening our scientific scope throughout our lives is an obligation, not a hobby... it's really cool when you see it as a hobby though xD