{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 9 - Random Graphs Part 2: Community Detection\n", "\n", "\n", "#### Authors:\n", "\n", "##### v1.0 (2015 Fall) Kabir Chandrasekher \\*, Max Kanwal \\*, Dong Yin\\*, Kangwook Lee \\*\\*, Kannan Ramchandran \\*\\*\n", "\n", "\n", "In last week's lab, we studied properties of random graphs of the Erdos-Renyi flavor. Recall that in this model, the random graph $G(n,p)$ represents a graph on $n$ nodes where the probability of an edge existing between any two edges is $p$. This model is quite unrealistic for many scenarios. For example, consider the graph formed by friendships of Berkeley students and Stanford students on Facebook. The probability of two students that both go to UC Berkeley are friends is much higher than the probability that a student from UC Berkeley is friends with a student from Stanford. In the Erdos-Renyi model, however, the two edges formed by these friendships have the same probability! Let's fix this current model to be able to deal with communities such as:\n", "\n", "\n", "### Table of Contents:\n", "- Introduction\n", "- Some Helpful Code\n", "- MLE and the Min-Bisection Problem\n", " - Question 0: A Proof of the Reduction\n", " - Question 1: Playing with MLE\n", "- Thresholds on Exact Recovery: When Will MLE Correctly Recover the Communities?\n", " - Question 2: A Proof of the Lower Bound on Recovery\n", "- Efficient Algorithm for Community Detection\n", " - Question 3: Simulating Efficient Recovery\n", "\n", "----------------------------------------------------------------------------------------------------------------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A Brief Introduction to the Stochastic Block Model\n", "\n", "Let's revisit the Berkeley/Stanford example. Let's say there's an equal number of students at Berkeley and at Stanford, and we have a graph with nodes representing students edges representing friendships between students. The problem we would like to solve is: given such a graph, can we determine which students are part of the same community? We may not be able to determine which community is which, but we would like to be able to determine which students are in the same community. \n", "\n", "In the stochastic block model (let's call it SBM), we have graphs of the form $G(n,p,q)$ (for simplicity, let's assume $n$ is even and $p>q$). In this model, we have two \"communities\" each of size $\\frac{n}{2}$ such that the probability of an edge existing between any two nodes within a community is $p$ and the probability of an edge between the two communities is $q$. We are interested in recovering the communities from a realization of the random graph. First, let's visualize these community graphs and try to observe some phase transitions.\n", "\n", "## Some Helpful Functions\n", "### Helper Functions for Graphs:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline\n", "from pylab import *\n", "import random as rnd\n", "import networkx as nx\n", "from __future__ import division\n", "from cvxopt import matrix, solvers, log\n", "import numpy as np\n", "from numpy import zeros, random, transpose, argsort, array\n", "from mpl_toolkits.mplot3d import Axes3D\n", "from matplotlib import cm\n", "from matplotlib.ticker import LinearLocator, FormatStrFormatter\n", "\n", "rcParams['figure.figsize'] = 12, 12 # that's default image size for this interactive session\n", "\n", "def draw_graph(graph, labels=None, graph_layout='shell',\n", " node_size=1600, node_color='blue', node_alpha=0.3,\n", " node_text_size=12,\n", " edge_color='blue', edge_alpha=0.3, edge_tickness=1,\n", " edge_text_pos=0.3,\n", " text_font='sans-serif'):\n", " \"\"\" \n", " Based on: https://www.udacity.com/wiki/creating-network-graphs-with-python\n", " We describe a graph as a list enumerating all edges.\n", " Ex: graph = [(1,2), (2,3)] represents a graph with 2 edges - (node1 - node2) and (node2 - node3)\n", " \"\"\"\n", " \n", " # create networkx graph\n", " G=nx.Graph()\n", "\n", " # add edges\n", " for edge in graph:\n", " G.add_edge(edge[0], edge[1])\n", "\n", " # these are different layouts for the network you may try\n", " # shell seems to work best\n", " if graph_layout == 'spring':\n", " graph_pos=nx.spring_layout(G)\n", " elif graph_layout == 'spectral':\n", " graph_pos=nx.spectral_layout(G)\n", " elif graph_layout == 'random':\n", " graph_pos=nx.random_layout(G)\n", " else:\n", " graph_pos=nx.shell_layout(G)\n", "\n", " # draw graph\n", " nx.draw_networkx_nodes(G,graph_pos,node_size=node_size, \n", " alpha=node_alpha, node_color=node_color)\n", " nx.draw_networkx_edges(G,graph_pos,width=edge_tickness,\n", " alpha=edge_alpha,edge_color=edge_color)\n", " nx.draw_networkx_labels(G, graph_pos,font_size=node_text_size,\n", " font_family=text_font)\n", " # show graph\n", " plt.show()\n", " \n", "def find_connected_component(graph, starting_node):\n", " \"\"\"\n", " Return the connected component containing a given starting node\n", " \"\"\"\n", " connected_nodes = set()\n", " connected_nodes.add( starting_node )\n", " changed_flag = True\n", " while changed_flag:\n", " changed_flag = False\n", " for node1,node2 in graph:\n", " if (node1 in connected_nodes and node2 not in connected_nodes) or \\\n", " (node1 not in connected_nodes and node2 in connected_nodes):\n", " connected_nodes.add(node1)\n", " connected_nodes.add(node2)\n", " changed_flag = True\n", " return connected_nodes\n", "\n", "def connected_components(graph):\n", " \"\"\"\n", " Return a list of connected_components\n", " \"\"\"\n", " nodes = set()\n", " components = []\n", " for edge in graph:\n", " for node in edge:\n", " nodes.add(node)\n", " flag = False\n", " for node in nodes:\n", " for component in components:\n", " if node in component:\n", " flag = True\n", " break\n", " if not flag:\n", " components.append(find_connected_component(graph,node))\n", " flag = False\n", " return components\n", "\n", "#Return a list of the sizes of all the connected components\n", "component_sizes = lambda graph: [len(component) for component in (connected_components(graph))]\n", "\n", "#Return the largest connected component in the graph\n", "largest_component_size = lambda graph: max(component_sizes(graph))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Helper Functions for Semi-definite Programming:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "solvers.options['show_progress'] = False\n", "\n", "def generate_sbm(n, alpha, beta):\n", " \"\"\"\n", " Generate the A matrix for an SBM.\n", " inputs: n: total number of nodes, \n", " alpha: parameter alpha corresponding to the in-cluster connection probability\n", " beta: parameter beta corresponding to the cross-cluster connection probability\n", " outputs: A: the \"A\" matrix for the SBM. A(i,i)=0 for all i; A(i,j) = 1 if (i,j) is an edge; A(i,j)=-1 otherwise.\n", " truth: the ground truth of the two clusters, represented with +/- 1\n", " both A and truth are in the CVX matrix data structure \n", " \"\"\"\n", " assert(n % 2 == 0)\n", " mid = int(n/2)\n", " # generate parameters\n", " p = alpha*log(n)/n\n", " q = beta*log(n)/n\n", " # generate A matrix\n", " A = zeros([n, n])\n", " A[0:mid, mid:n] = random.binomial(1, q, (mid, mid))\n", " for i in range(mid):\n", " for j in range(i+1, mid):\n", " A[i, j] = random.binomial(1, p)\n", " for i in range(mid, n):\n", " for j in range(i+1, n):\n", " A[i, j] = random.binomial(1, p)\n", " A = A+transpose(A)\n", " A = (A-0.5)*2\n", " for i in range(n):\n", " A[i, i] = 0\n", " # randomly permute the rows and columns\n", " perm = random.permutation(n)\n", " A = A[:, perm]\n", " A = A[perm, :]\n", " # find the ground truth\n", " argperm = argsort(perm)\n", " truth = zeros([n, 1])\n", " truth[argperm[0:mid], 0] = 1\n", " truth[argperm[mid:n], 0] = -1\n", " # return A and truth\n", " return matrix(A), matrix(truth)\n", "\n", "def is_correct(sol, truth):\n", " \"\"\"\n", " Checks whether the reconstruction found by SDP is correct.\n", " inputs: sol: the solution X^* found by SDP in CVX matrix data structure\n", " truth: ground truth x^*, a column vector in CVX matrix data structure\n", " outputs: 1 if reconstruction is correct; 0 otherwise\n", " \"\"\"\n", " # set a threshold for the difference between elements of X^* and x^*X^{*T}\n", " th = 1e-4\n", " difference = abs(sol-truth*transpose(truth))\n", " if difference.max() < th:\n", " # exact recovery\n", " return 1\n", " else:\n", " # wrong recovery\n", " return 0\n", "\n", "def recon_prob_sdp(n, alpha, beta):\n", " \"\"\"\n", " Find the probability of successful reconstruction given the parameters\n", " inputs: n: total number of nodes, \n", " alpha: parameter alpha corresponding to the in-cluster connection probability\n", " beta: parameter beta corresponding to the cross-cluster connection probability\n", " outputs: the simulated probability of successful reconstruction\n", " \"\"\"\n", " assert(n % 2 == 0)\n", " num_tests = 50\n", " num_success = 0.0\n", " for t in range(num_tests):\n", " result = generate_sbm(n, alpha, beta)\n", " A = result[0]\n", " truth = result[1]\n", "\n", " # Set parameters for the SDP\n", " c = matrix(-1., (n, 1))\n", " h = [-A]\n", " G1 = zeros([n*n, n])\n", " for i in range(n):\n", " G1[i+n*i, i] = 1\n", " G = [matrix(G1)]\n", " sol = solvers.sdp(c, Gs=G, hs=h)\n", " sol = sol['zs'][0]\n", " if is_correct(sol, truth) == 1:\n", " num_success = num_success + 1\n", " return num_success/num_tests\n", "\n", "# Test example:\n", "#print recon_prob_sdp(50, 6, 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## MLE and the Min-Bisection Problem\n", "\n", "Recall the problem set-up: we are given a graph of connections, and we know that half of the nodes belong to one community and that the other half belong to the other community. It is clear that in general, the optimal method of recovering community assignments is through MAP (maximum a posteriori) decoding. In this case, since each community assignment is equally likely, we may use ML (maximum likelihood). It is likewise clear, then, that if we have $n$ nodes where $n$ is even, then we would expect the $\\frac{n}{2}$ nodes in the same community to have more connections amongst each other than with the other $\\frac{n}{2}$ nodes. Consider a toy example:\n", "\n", "We are given the above graph and we would like to recover the community assignments. Perhaps it is clear by inspection that the proper community assignments should be:\n", "\n", "#### How did we come to this conclusion? What is going on under the hood? \n", "In order to properly address this question, we must first introduce the notion of min-bisection. Formally, consider a graph $G=(V,E)$, where $V$ denotes the set of vertices and $E$ denotes the set of edges. We define a $(2,\\frac{n}{2})$ partition of the graph to be a split of the graph into $2$ groups of nodes of size $\\frac{n}{2}$ each. The min-bisection of the graph is the $(2, \\frac{n}{2})$ partition with the minimum total edge weight across partitions. Looking back at the toy example above we can see that the community assignments correspond to a min-bisection of the graph. In fact, generally, ML on this graph is equivalent to finding a min-bisection. With this in mind, let's start playing with graphs created using the stochastic block model and try using ML to recover the assignments.\n", "\n", "### $\\mathcal{Q}$0. Show that ML on a given graph is equivalent to finding a min-bisection.\n", "\n", "### Your proof for question 0 here. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### $\\mathcal{Q}$1. In this question, we will create an implementation for the SBM and play around with using maximum likelihood to recover community assignments.\n", "\n", "### a. Fill in the following method that creates a graph according to the SBM model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def G(n,p,q):\n", " \"\"\"\n", " Let the first n/2 nodes be part of community A and \n", " the second n/2 part of community B.\n", " \"\"\"\n", " assert(n % 2 == 0)\n", " mid = int(n/2)\n", " graph = []\n", " for i in xrange(n):\n", " graph.append((i,i))\n", " \n", " #Make community A\n", " ### Your code here\n", " \n", " #Make community B \n", " ### Your code here\n", " \n", " #Form connections between communities\n", " for i in xrange(mid):\n", " for j in xrange(mid, n):\n", " if rnd.random() < q:\n", " graph.append( (i, j) )\n", " return graph" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Visualize your SBM graph\n", "graph = G(20,0.6,0.05)\n", "draw_graph(graph,graph_layout='spring')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### b. Given a graph (assume that the graph has two communities and satisfies the thresholds outlined above), write a function to find the maximum likelihood estimate of the two communities. It might be helpful to have a graph stored as an adjacency list. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from collections import defaultdict\n", "\n", "def adjacency_list(graph):\n", " \"\"\"\n", " Takes in the current representation of the graph, outputs an equivalent\n", " adjacenty list\n", " \"\"\"\n", " adj_list = defaultdict(set)\n", " for node in graph:\n", " adj_list[node[0]].add(node[1])\n", " adj_list[node[1]].add(node[0])\n", " return adj_list\n", "\n", "def mle(graph):\n", " #Your code here\n", " return None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### c. In the graphs we create, we know that the ground truth is that nodes $(0, \\frac{n}{2} - 1)$ and $(\\frac{n}{2}, n)$ are the actual communities. Using this knowledge, please simulate the probability of exact recovery if $\\frac{\\alpha + \\beta}{2} - \\sqrt{\\alpha \\beta} > 1$. Then do the same if $\\frac{\\alpha + \\beta}{2} - \\sqrt{\\alpha \\beta} < 1$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def prob_recovery(n, alpha, beta):\n", " mid = int(n/2)\n", " ground_truth1 = tuple(np.arange(mid))\n", " ground_truth2 = tuple(np.arange(mid, n))\n", " ### Your code here\n", " return None\n", "\n", "#Greater than 1\n", "alpha = 4\n", "beta = 0.25\n", "n = 10\n", "print \"Value threshold: \", (alpha + beta)/2 - np.sqrt(alpha * beta)\n", "print \"Probability of recovery if threshold bigger than 1: \", prob_recovery(n,alpha,beta)\n", "\n", "#Less than 1\n", "alpha = 0.5\n", "beta = 0.25\n", "n = 10\n", "print \"Value threshold: \", (alpha + beta)/2 - np.sqrt(alpha * beta)\n", "print \"Probability of recovery if threshold bigger than 1: \", prob_recovery(n,alpha,beta)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### d. In our tests above, we used $n=10$, which is an incredibly small graph size. We would like to be able to analyze larger graphs, such as where $n=100, 500, 1000$, etc. What happens when you try to run MLE on a graph of size $n=100$? Why do you think this is happening?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Greater than 1\n", "alpha = 4\n", "beta = 0.25\n", "n = 100\n", "print \"Value threshold: \", (alpha + beta)/2 - np.sqrt(alpha * beta)\n", "print \"Probability of recovery if threshold bigger than 1: \", prob_recovery(n,alpha,beta)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Your hypothesis here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Thresholds on Exact Recovery: When Will MLE Correctly Recover the Communities?\n", "\n", "***NOTE: The results and arguments in this section are due to Abbe et. al$^1$**\n", "\n", "Experimentally, we can see that the $\\frac{\\alpha + \\beta}{2}$ is a reasonable threshold for connectivity. We conjecture that the recovery threshold is then $\\frac{\\alpha + \\beta}{2} > 1 + \\sqrt{\\alpha \\beta}$ where the additional $\\sqrt{\\alpha \\beta}$ term is necessary for recovery. It turns out that according to this condition, we can find lower and upper bounds on exact recovery in the SBM:\n", "\n", "** Lower Bound: Let $\\alpha > \\beta \\ge 0$. If $\\frac{\\alpha + \\beta}{2} - \\sqrt{\\alpha \\beta} < 1$, then maximum likelihood estimation fails in recovering the communities with probability bounded away from $0$**\n", "\n", "** Upper Bound: Let $\\alpha > \\beta \\ge 0$. If $\\frac{\\alpha + \\beta}{2} - \\sqrt{\\alpha \\beta} > 1$, then the maximum likelihood estimator exactly recovers the communities (up to a global flip) with high probability. **\n", "\n", "Using the tools that you have built up this year in EE126, you will prove this lower bound step-by-step. We will systematically show that the probability Maximum Likelihood fails is bounded away from $0$. We will provide a chain of logic from the conclusion up, each one thus implying that the probability Maximum Likelihood fails is bounded away from $0$ (specifically that it is $\\ge \\frac{1}{3}$). \n", "\n", "### $\\mathcal{Q}$2. In this question, we will walk through a proof of the lower bound. Remember, we are trying to show that the probability that Maximum Likelihood Estimation fails is bounded away from $0$.\n", "\n", "Before we jump into the math, let's look at a high levl view of the approach. Let's say, as before, that our two communities are Berkeley and Stanford. MLE will fail when a \"traitor\" exists. Here, we define a traitor as some student at Berkeley with more friends at Stanford than at Berkeley, or equivalently, a student at Stanford with more friends at Berkeley than at Stanford:\n", "\n", "\n", "\n", "We will show that given the condition on $\\alpha$ and $\\beta$ above, the probability that there is a traitor at Berkeley or Stanford is $\\ge \\frac{1}{3}$. It turns out that even though the number of students at Berkeley is $\\frac{n}{2}$ and the number of students at Stanford is $\\frac{n}{2}$, restricting our traitors to a much smaller \"traitor group\" of size $\\frac{n}{\\log^{3}{n}}$ is sufficient to show the lower bound. In this question, you will prove three crucial lemmas instrumental in showing the lower bound. Each of the three lemmas builds upon the previous. We will prove:\n", "- #### Lemma 1: $P(\\text{Student i in Berkeley's traitor group is a traitor}) > \\frac{\\ln{10}\\log{n}}{n\\log{\\log{n}}}$ implies that $P(\\text{There is a traitor in Berkeley's traitor group}) \\ge \\frac{9}{10}$\n", "- #### Lemma 2: $P(\\text{There is a traitor in Berkeley's traitor group}) \\ge \\frac{9}{10}$ implies that $P(\\text{There is a traitor at Berkeley}) \\ge \\frac{2}{3}$\n", "- #### Lemma 3: $P(\\text{There is a traitor at Berkeley}) \\ge \\frac{2}{3}$ implies that $P(\\text{MLE Fails}) \\ge \\frac{1}{3}$\n", "\n", "\n", "### a. In this part we will prove Lemma 3. Let $F$ be the event that maximum likelihood fails. Additionally, let $T_B$ be the event that there is a traitor at Berkeley (ie that there is a student at Berkeley with more friends at Stanford than at Berkeley). Show that:\n", "$$P(T_B) \\ge \\frac{2}{3} \\implies P(F) \\ge \\frac{1}{3}$$\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Your solution to part a here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----------------------------------------------------------------------------------------------------------------------\n", "\n", "You have just successfully shown the last bullet in the list of three lemmas above. It is now time to consider Berkeley's traitor group. Remember that this group is not necessarily full of traitors; rather, when we are searching for traitors, we may restrict our search to this group which is much smaller than the size of the set of all Berkeley students. Also remember that the size of such a group, let's call it $T$ is $\\frac{n}{\\log^{3}{n}} = \\frac{n}{\\gamma(n)}$ where $\\gamma(n) = \\log^3{n}$ is introduced for notational convenience. Now, the traitor group has one more nice feature. Let $A$ be the event that no student in the traitor group has more than $\\delta(n) = \\frac{\\log{n}}{\\log{\\log{n}}}$ friends in $T$ (this function should look familiar to a term in Lemma 1!). It turns out that $P(A) \\ge \\frac{9}{10}$. This fact gives some hint as to why the traitor group is so special and should come in handy in the next part. Note that this fact can be shown using the Chernoff bound, but it is difficult, so for our purposes we will just accept it as a fact.\n", "\n", "(Note that the diagram above is rough and the number of nodes/edges are not exact, it is just to give you a rough visual of what is going on).\n", "\n", "### b. In this part, we will prove Lemma 2. Remember that some student $i \\in T$ is a traitor if $i$ has more friends at Stanford than at Berkeley (note that $i$ is a Berkeley student). We will look at a slightly more specific event: Let $F_T^{(i)}$ be the event that the sum of $\\delta(n)$ and the number of friends $i$ has at Berkeley outside of $T$ is less than the number of friends $i$ has at Stanford. Also, let $F_t = \\cup_{i \\in T} F_T^{(i)}$. Show that $P(F_T) \\ge \\frac{9}{10} \\implies P(F) \\ge \\frac{1}{3}$\n", "Hint: Try to use part a.\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Your solution to part b here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### c. In this part, we will prove Lemma 1. Show that for sufficiently large $n$, if $P(F_T^{(i)}) > n^{-1}\\gamma(n)\\ln{10}$, then $P(F) \\ge \\frac{1}{3}$. As a reminder, $\\gamma(n) = \\log^{3}{n}$ as defined above.\n", "Hint: Try using the complement and then try to use part b." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Your solution to part c here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To finish the proof, we need to show that $P(F_T^{(i)}) > n^{-1}\\gamma(n)\\ln{10}$ in order to use the result of part c. \n", "### d. Show that if $Z_1,Z_2,...,Z_n \\sim \\text{Bernoulli}(\\frac{\\beta\\log{n}}{n}), W_1,...,W_N \\sim \\text{Bernoulli}(\\frac{\\alpha\\log{n}}{n})$:\n", "$$P(F_T^{(i)}) = P\\biggl(\\sum_{j=1}^{\\frac{n}{2}} Z_j - \\sum_{j=1}^{\\frac{n}{2}-\\frac{n}{\\gamma(n)}}W_j \\ge \\frac{\\log{n}}{\\log{\\log{n}}}\\biggr) \\ge P\\biggl(\\sum_{j=1}^{\\frac{n}{2}} Z_j - \\sum_{j=1}^{\\frac{n}{2}}W_j \\ge \\frac{\\log{n}}{\\log{\\log{n}}}\\biggr)$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Your solution to part d here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----------------------------------------------------------------------------------------------------------------------\n", "#### To finish, we use the inequality \n", "$$-\\log{P\\biggl(\\sum_{j=1}^{N} Z_j - \\sum_{j=1}^{N}W_j \\ge \\epsilon\\biggr)} \\le (\\frac{\\alpha + \\beta}{2} - \\sqrt{\\alpha \\beta})\\log{n} + o(\\log{n})$$ applied to the result of part d. which implies that $P(F_T^{(i)}) > n^{-1}\\gamma(n)\\ln{10}$ and the result follows from the chain of lemmas that we proved above. Phew! That was a lot of math... Using similar techniques, we can show the upper bound. Now that we've gotten this under our belt, let's revisit the implementation of recovery and see if we can be more efficient than ML." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Efficient Algorithm for Community Detection\n", "In the previous parts, we used the maximum likelihood decoder to reconstruct the two communities. As we saw in one of the previous questions, maximum likelihood is extremely inefficient. In this section, we will develop a more efficient algorithm for the community detection problem. We will use $G(V, E)$ to denote the undirected graph that we observe, where $V$ is the set of nodes $(|V|=n)$, and $E$ is the set of edges.\n", "\n", "First, let's consider an intuitive algorithm to solve the community detection problem. As we have seen, the goal of community detection is to separate the nodes into two communities, such that the number of edges within the same community is as large as possible and the number of edges between two communities is as small as possible. To achieve this goal, we consider the \"score\" of a particular separation. For an edge within a community, we get one point; for an edge between two communities, we get minus one point. We want to maximize the score over all possible separations. We identify a choice of communities by a vector $x\\in\\mathbb{R}^n$ with $\\pm1$ entries such that $x_i$ will be $+1$ if node $i$ is in one community and $-1$ if it is in the other. We also define $A$ as the $n\\times n$ matrix with zero diagonal whose non diagonal entries are given by\n", "$$\n", "A_{ij}=\\begin{cases}\n", "1 & \\text{if }(i,j)\\in E\\\\\n", "-1 & \\text{if }(i,j)\\notin E\n", "\\end{cases}\n", "$$\n", "Then we can show that, maximizing the score is equivalent to the following optimization problem (think about the reason by yourself):\n", "\\begin{align}\n", "\\max &~~x^TAx \\\\\n", "s.t. &~~x_i=\\pm1.\n", "\\end{align}\n", "However, since this optimization problem is combinatorial and hard to solve, we need to relax the constraint that $x_i$ has to be $\\pm1$.\n", "\n", "Let's look at the objective of the optimization problem:\n", "$x^TAx$. According to knowledge in linear algebra, we know that $x^TAx=\\text{Tr}(x^TAx)=\\text{Tr}(Axx^T)$. Here, \"Tr\" denotes the trace of a square matrix, i.e., the sum of all the elements on the diagonal. We can see that $x^TAx=\\text{Tr}(x^TAx)$ is obvious because the trace of a scalar is still itself; and $\\text{Tr}(x^TAx)=\\text{Tr}(Axx^T)$ is because of the fact that $\\text{Tr}(AB)=\\text{Tr}(BA)$. If we denote the rank-one matrix $xx^T$ by $X$, then the previous optimization problem is equivalent to:\n", "\\begin{align}\n", "\\max &~~\\text{Tr}(AX) \\\\\n", "s.t. &~~X=xx^T\\text{ and }x_i=\\pm1.\n", "\\end{align}\n", "\n", "Since this problem is still hard to solve, we need to relax the constraints on $X$. As we can see, the diagonal elements of $X$ are all 1. Further, we can see that $X$ is positive semidefinite.\n", "(A matrix $D\\in\\mathbb{R}^{n\\times n}$ is called a positive semidefinite matrix if and only if for any vector $u\\in\\mathbb{R}^n$, there is $u^TDu\\ge 0$). \n", "An optimization problem with linear objective functions and matrix variables which are constrained to be positive semidefinite is called a semidefinite program (SDP). SDPs are convex optimization problems, and therefore, the global minimum can be found in polynomial time. Therefore, instead of solving the combinatorial optimization problem, we solve the following SDP problem:\n", "\\begin{align*}\n", "\\max &~~\\text{Tr}(AX)\\\\\n", "s.t. &~~X_{ii}=1\\\\\n", "&X\\succeq 0,\n", "\\end{align*}\n", "and hope the the relaxed optimization problem can give us the same answer as the original problem. It is proved that if $\\alpha$ and $\\beta$ satisfy some conditions, the solution to the SDP problem $X^*$ will be the outer product of the solution to the combinatorial optimization problem $x^*$, i.e., $X^*=x^*x^{*T}$. We will use the CVX package for Python to solve this SDP:\n", "\n", "http://cvxopt.org/.\n", "\n", "Install CVX in your computer and read the instructions on solving SDP using CVX:\n", "\n", "http://cvxopt.org/userguide/coneprog.html#semidefinite-programming.\n", "\n", "Specifically, we will solve the dual SDP problem. We will use different data structures from the previous parts in order to use CVX. Therefore, we define some new functions which are useful in this part." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### $\\mathcal{Q}$3. In this question we will simulate reconstruction using the semidefinite programming algorithm and use this to plot the thresholds on reconstruction." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### a. Let alpha and beta vary between $0$ and $5$. Make two separate 3-d plots. Let the first plot the value of $\\frac{\\alpha + \\beta}{2} - \\sqrt{\\alpha \\beta}$ and let the second plot the value of the probability of recovery." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "#Plotting the probability of recovery\n", "n = 10\n", "fig = plt.figure()\n", "ax = fig.gca(projection='3d')\n", "alpha = np.arange(0, 5, 0.2)\n", "beta = np.arange(0, 5, 0.2)\n", "alpha, beta = np.meshgrid(alpha, beta)\n", "recovery = lambda a,b: recon_prob_sdp(n, a, b)\n", "recovery = np.vectorize(recovery)\n", "connected = recovery(alpha, beta)\n", "surf = ax.plot_surface(alpha, beta, connected, rstride=1, cstride=1, cmap=cm.coolwarm,\n", " linewidth=0, antialiased=False)\n", "ax.view_init(azim = 45,elev = 15)\n", "\n", "ax.zaxis.set_major_formatter(FormatStrFormatter('%.02f'))\n", "\n", "fig.colorbar(surf, shrink=0.5, aspect=2)\n", "ax.set_title('Probability of Recovery')\n", "ax.set_xlabel('alpha')\n", "ax.set_ylabel('beta')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### An Interesting Viewpoint\n", "** The stochastic block model can be seen as a code on a discrete memoryless channel. ** In this view, we can consider the the community assignment vector $x \\in \\{0,1\\}^n$ as the \"message\" to be transmitted (the value of each $x_i$ tells us which community it belongs to). The output of this channel is $y \\in \\{0,1\\}^{\\frac{n(n-1)}{2}}$ where $y_{ij}$ is the output of $x_i + x_j \\hspace{1.4mm} \\text{mod}(2)$ through the channel:\n", "$$\n", "\\begin{bmatrix}\n", "1- p & p \\\\\n", "1-q & q\n", "\\end{bmatrix}\n", "$$\n", "In other words, when we are sending the vector $x$ across the channel, we take every pair of nodes $x_i, x_j$, both of which are either $0$ or $1$. We take the XOR of the two nodes, call it $x_{ij}$ and send it through the channel matrix described above. Note that if $x_{ij} = 0$, then $x_i$ and $x_j$ are in the same community, and if $x_{ij} = 1$, then $x_i$ and $x_j$ are in different communities. This channel matrix describes the following scenario: if $x_{ij} = 0$, then the channel output $y_{ij} = 0$ with probability $1-p$ and $1$ with probability $p$. If $x_{ij} = 1$, then the channel outputs $y_{ij} = 0$ with probability $1-q$ and outputs $1$ with probability $q$. The output matrix elements $y_{ij} = 1$ if there is an edge between node $x_i$ and $x_j$ in the graph. The following picture summarizes the setup:\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "[1] E. Abbe, A. S. Bandeira, and G. Hall. Exact Recovery in the Stochastic Block Model. available at arXiv:1405.3297 [cs.SI], 2014." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.5" } }, "nbformat": 4, "nbformat_minor": 0 }