<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-8128923945165565689</id><updated>2012-01-23T23:59:12.126Z</updated><category term='irish'/><category term='facebook'/><category term='clustering'/><category term='visualization'/><category term='blogosphere'/><category term='centrality'/><category term='ireland'/><category term='time; python; matlab; gregorian; standards'/><category term='enron; data; temporal; dynamic; communication; parser; email'/><category term='geography; metropolitan areas; data; geographic'/><category term='temporal; dynamic; vector clocks; time sliced; networks'/><category term='networks'/><category term='tutorial networkx python sna'/><title type='text'>The Sociograph</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://sociograph.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://sociograph.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Conrad Lee</name><uri>https://profiles.google.com/110979755871396857085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh6.googleusercontent.com/-3Jylvn3UkJI/AAAAAAAAAAI/AAAAAAAAAAA/PM9AqhTcFSE/s512-c/photo.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>21</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-8128923945165565689.post-6471359523443406069</id><published>2012-01-11T16:54:00.000Z</published><updated>2012-01-23T23:59:12.133Z</updated><title type='text'>Talk on Facebook data, Privacy, and Scientific Research at Chaos Communication Congress</title><content type='html'>Over the holidays I gave a presentation on how privacy issues surrounding Facebook data are affecting scientific research (the talk is about 30 minutes long, with 10 minutes of questions). &amp;nbsp;I'm glad to see that it's been posted and the recording is &amp;nbsp;high-quality, both in terms of editing and audio/video quality.&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I try to flesh out the problems in the context of an example, the &lt;i&gt;Taste, Ties, and Time&lt;/i&gt; dataset and it's "evil twin" the &lt;a href="http://sociograph.blogspot.com/2011/03/facebook100-data-and-parser-for-it.html"&gt;&lt;i&gt;Facebook100&lt;/i&gt; dataset&lt;/a&gt;.&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Below you find the talk on Youtube, or you can download a high quality version &lt;a href="http://bit.ly/tbQkpE"&gt;here&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;object class="BLOGGER-youtube-video" classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0" data-thumbnail-src="http://1.gvt0.com/vi/o3Ih7z2rwY0/0.jpg" height="266" width="320"&gt;&lt;param name="movie" value="http://www.youtube.com/v/o3Ih7z2rwY0&amp;fs=1&amp;source=uds" /&gt;&lt;param name="bgcolor" value="#FFFFFF" /&gt;&lt;embed width="320" height="266"  src="http://www.youtube.com/v/o3Ih7z2rwY0&amp;fs=1&amp;source=uds" type="application/x-shockwave-flash"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8128923945165565689-6471359523443406069?l=sociograph.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sociograph.blogspot.com/feeds/6471359523443406069/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8128923945165565689&amp;postID=6471359523443406069' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/6471359523443406069'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/6471359523443406069'/><link rel='alternate' type='text/html' href='http://sociograph.blogspot.com/2012/01/talk-on-facebook-data-privacy-and.html' title='Talk on Facebook data, Privacy, and Scientific Research at Chaos Communication Congress'/><author><name>Conrad Lee</name><uri>https://profiles.google.com/110979755871396857085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh6.googleusercontent.com/-3Jylvn3UkJI/AAAAAAAAAAI/AAAAAAAAAAA/PM9AqhTcFSE/s512-c/photo.jpg'/></author><thr:total>0</thr:total><georss:featurename>Alexanderplatz 1, 10178 Berlin, Germany</georss:featurename><georss:point>52.52136532929198 13.41254711151123</georss:point><georss:box>52.518949829291984 13.40761161151123 52.52378082929198 13.41748261151123</georss:box></entry><entry><id>tag:blogger.com,1999:blog-8128923945165565689.post-3301069611170938060</id><published>2011-12-05T16:09:00.001Z</published><updated>2011-12-05T17:02:23.499Z</updated><title type='text'>A gotcha with numpy's searchsorted</title><content type='html'>&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-3Gg_5eK0NYw/TtzuHoks5eI/AAAAAAAAHTY/wff5YsXymGE/s1600/gotcha.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="185" src="http://2.bp.blogspot.com/-3Gg_5eK0NYw/TtzuHoks5eI/AAAAAAAAHTY/wff5YsXymGE/s320/gotcha.jpg" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Photo courtesy of &lt;a href="http://www.flickr.com/photos/mikefischer/247047603/in/photostream/"&gt;Mike Fischer&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;I'm using numpy's&amp;nbsp;&lt;a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.searchsorted.html"&gt;searchsorted&lt;/a&gt; function find an item in a sorted numpy array with more than 1 billion entries. &amp;nbsp;This should be quick, but each query was taking nearly a minute--why was this so slow?&lt;br /&gt;&lt;br /&gt;My queries looked like this:&lt;br /&gt;&lt;blockquote class="tr_bq"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;left_idx = numpy.searchsorted(mmap_fp, key, side="left")&lt;/span&gt;&lt;/blockquote&gt;The problem, as it turns out, is that my data array is of type numpy.uint64, while my key is simply a normal python integer. &amp;nbsp;This causes things to slow way down. &amp;nbsp;So the following code runs much more quickly than the code above:&lt;br /&gt;&lt;br /&gt;&lt;blockquote class="tr_bq"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;left_idx = numpy.searchsorted(mmap_fp, numpy.uint64(key), side="left")&lt;/span&gt;&lt;/blockquote&gt;&lt;br /&gt;I find this annoying because there is no mention of this important point in searchsorted's documentation. &amp;nbsp;It's not exactly a bug, but I think it falls under the category of "gotcha," which &lt;a href="http://en.wikipedia.org/wiki/Gotcha_(programming)"&gt;Wikipedia defines as follows&lt;/a&gt;:&lt;br /&gt;&lt;blockquote class="tr_bq"&gt;&lt;span class="Apple-style-span" style="background-color: white; font-family: sans-serif; font-size: 13px; line-height: 19px;"&gt;In programming, a&amp;nbsp;&lt;/span&gt;&lt;b style="background-color: white; font-family: sans-serif; font-size: 13px; line-height: 19px;"&gt;gotcha&lt;/b&gt;&lt;span class="Apple-style-span" style="background-color: white; font-family: sans-serif; font-size: 13px; line-height: 19px;"&gt;&amp;nbsp;is a feature of a system, a program or a programming language that works in the way it is documented but is counter-intuitive and almost invites mistakes because it is both enticingly easy to invoke and completely unexpected and/or unreasonable in its outcome.&lt;/span&gt;&lt;/blockquote&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8128923945165565689-3301069611170938060?l=sociograph.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sociograph.blogspot.com/feeds/3301069611170938060/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8128923945165565689&amp;postID=3301069611170938060' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/3301069611170938060'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/3301069611170938060'/><link rel='alternate' type='text/html' href='http://sociograph.blogspot.com/2011/12/gotcha-with-numpys-searchsorted.html' title='A gotcha with numpy&apos;s searchsorted'/><author><name>Conrad Lee</name><uri>https://profiles.google.com/110979755871396857085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh6.googleusercontent.com/-3Jylvn3UkJI/AAAAAAAAAAI/AAAAAAAAAAA/PM9AqhTcFSE/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-3Gg_5eK0NYw/TtzuHoks5eI/AAAAAAAAHTY/wff5YsXymGE/s72-c/gotcha.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8128923945165565689.post-4305322890335082655</id><published>2011-11-18T13:49:00.001Z</published><updated>2011-11-18T14:51:38.381Z</updated><title type='text'>Scalable mean-shift clustering in a few lines of python</title><content type='html'>In my &lt;a href="http://sociograph.blogspot.com/2011/11/accessible-introduction-to-mean-shift.html"&gt;last post&lt;/a&gt;, I gave a hand-wavy, intuitive explanation of the mean-shift algorithm. &amp;nbsp;Here I show a concise implementation which, while simple, will scale to cluster several million points quickly.&lt;br /&gt;&lt;br /&gt;Part of making this implementation scalable involves choosing seed positions for the kernels cleverly---I'll present a simple solution for this problem below, but first, let's look at the core of the algorithm. The code below is based on code I've submitted to the &lt;a href="http://scikit-learn.org/"&gt;scikit-learn&lt;/a&gt; machine learning library---much of the credit for it goes to other scikit-learn developers. &amp;nbsp;If you actually want to use this implementation, I suggest you use the version in the scikit-learn library.&lt;br /&gt;&lt;br /&gt;The following code goes through a list of seeds serially, creates a kernel for each seed from that seed's position, and keeps shifting that kernel to the weighted mean of the points contained within it until either its position has converged, or a maximum number of iterations have occurred.&lt;br /&gt;&lt;iframe height="500" src="http://www.purplegene.com/script?url=https://gist.github.com/1376528.js?file=mean_shift.py" width="100%"&gt;&lt;/iframe&gt;&lt;br /&gt;&lt;br /&gt;Notice that when we lookup the nearby points of a kernel, we use a data structure called a ball tree to efficiently query all points within a given radius. &amp;nbsp;I've imported this structure (as well as other useful functions) from the scikit-learn library.&lt;br /&gt;&lt;br /&gt;The function above will return one cluster center for each of the seeds. &amp;nbsp;Many of these seeds will have converged to the same or nearly the same position, and thus you may want to weed out duplicate clusters from the output. &amp;nbsp;Because it's not really pedagogically relevant, I've left this step out here---if you want to see how this is done, look at the source code of the scikit-learn implementation.&lt;br /&gt;&lt;br /&gt;The mean_shift function takes a kernel update function as one of its arguments. &amp;nbsp;We need to implement these as well---here are update functions for perhaps the two most commonly used kernels: the flat kernel and the Gaussian kernel:&lt;br /&gt;&lt;br /&gt;&lt;iframe height="220" src="http://www.purplegene.com/script?url=https://gist.github.com/1376552.js?file=kernel_update_functions.py" width="100%"&gt;&lt;/iframe&gt;&lt;br /&gt;In these two kernels, the bandwidth parameter (which is the most important parameter in mean shift clustering) has somewhat different meanings. In the Gaussian kernel it acts as the standard deviation of a normal distribution, so you want to set it such that nearly all of the points within a cluster will fall within 3X bandwidth from the center of the cluster. &amp;nbsp;In the flat cluster, the bandwidth is the radius of the cluster, so all points should fall within 1X bandwidth.&lt;br /&gt;&lt;br /&gt;As I mentioned above, to make this implementation scalable, you need to seed the initial positions of the kernels cleverly. &amp;nbsp;The following code takes all of the data points, discretizes them by pushing them onto a grid (whose coarseness can be controlled with the bin_size argument), and then takes those points on the grid that have at least min_bin_freq points.&lt;br /&gt;&lt;br /&gt;&lt;iframe height="240" src="http://www.purplegene.com/script?url=https://gist.github.com/1376583.js?file=bin_points.py" width="100%"&gt;&lt;/iframe&gt;&lt;br /&gt;In my work on clustering geotags to find points of interest, I've found that if I set bin_size to about twice the bandwidth (using the Gaussian kernel) or the same size as the bandwidth (using the flat kernel), the results are good.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8128923945165565689-4305322890335082655?l=sociograph.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sociograph.blogspot.com/feeds/4305322890335082655/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8128923945165565689&amp;postID=4305322890335082655' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/4305322890335082655'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/4305322890335082655'/><link rel='alternate' type='text/html' href='http://sociograph.blogspot.com/2011/11/scalable-mean-shift-clustering-in-few.html' title='Scalable mean-shift clustering in a few lines of python'/><author><name>Conrad Lee</name><uri>https://profiles.google.com/110979755871396857085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh6.googleusercontent.com/-3Jylvn3UkJI/AAAAAAAAAAI/AAAAAAAAAAA/PM9AqhTcFSE/s512-c/photo.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8128923945165565689.post-6516604180164252004</id><published>2011-11-17T13:40:00.001Z</published><updated>2011-11-24T15:54:26.041Z</updated><title type='text'>An accessible introduction to mean-shift</title><content type='html'>&lt;b&gt;Summary&lt;/b&gt;&lt;br /&gt;This post provides a brief description of the mean-shift algorithm, as well as a discussion of two of its variants (the flat kernel and the Gaussian kernel). In my &lt;a href="http://sociograph.blogspot.com/2011/11/scalable-mean-shift-clustering-in-few.html"&gt;next post&lt;/a&gt;, I provide a concise implementation in python. &amp;nbsp;Mean shift is a clustering algorithm; its strength is that one need not specify the number of clusters to be found in advance.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Mean shift clustering&lt;/b&gt;&lt;br /&gt;Imagine that you have a bunch of points in two dimensional space, and you draw a circle somewhere in this space (let's assume at least one of the points is inside this circle). &amp;nbsp;This circle is a called "kernel" in the mean-shift jargon. &amp;nbsp;For now, you can just think of a kernel as defined by its center and its radius, where the jargon term for the radius is the "bandwidth."&amp;nbsp;The bandwidth is essentially the only parameter of the mean-shift method (other than the choice of kernel, which I'll discuss below).&lt;br /&gt;&lt;br /&gt;The task at hand is to shift this kernel (i.e., nudge its center) such that more points are contained within it. &amp;nbsp;We keep shifting this kernel until there is no direction we can shift it such that more points will be contained within it. &amp;nbsp;That's the basic idea: it's a &lt;a href="http://en.wikipedia.org/wiki/Hill_climbing"&gt;hill climbing&lt;/a&gt; algorithm. &amp;nbsp;Thus, in the figure below (taken from &lt;a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.3048"&gt;this paper&lt;/a&gt;), if our kernel starts off as the green circle in the bottom-right, after it's first iteration it might end up in the position of the orange circle, and it might keep moving until it converges to the location of the red circle.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-pwd1g8Y9lCw/TsUcH9y5tII/AAAAAAAAHSA/MeQqGy7WQH0/s1600/mean_shift_illustration.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="356" src="http://3.bp.blogspot.com/-pwd1g8Y9lCw/TsUcH9y5tII/AAAAAAAAHSA/MeQqGy7WQH0/s640/mean_shift_illustration.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;But how do we calculate the sizes and directions these shifts (i.e., the mean-shift vector)? &amp;nbsp;It's simple: we just move the kernel to the mean position of all the points within it&amp;nbsp;(i.e., the centroid). &amp;nbsp;I assume that's where the algorithm gets it's name: we iteratively "shift" the kernel to the mean of the points within it. &amp;nbsp;Actually, that's what we do if we've chosen to run the algorithm using a "flat kernel." &amp;nbsp;Other kernels just give us different ways of calculating a weighted mean. &amp;nbsp;For example if we use a Gaussian kernel, then every point is first assigned a weight that decays exponentially as the distance from the center of the kernel increases.&lt;br /&gt;&lt;br /&gt;To some of you familiar with other clustering methods, this mean-shift clustering might sound familiar, and with good reason: there are some interesting equivalences between mean-shift and other clustering techiniques. &amp;nbsp;&lt;a href="http://en.wikipedia.org/wiki/K_means"&gt;K-means clustering&lt;/a&gt; has been shown to be a limit case of mean-shift clustering, as described &lt;a href="http://dx.doi.org/10.1109/34.400568"&gt;here&lt;/a&gt;. &amp;nbsp;Also, mean-shift with a Gaussian kernel &lt;a href="http://dx.doi.org/10.1109/TPAMI.2007.1057"&gt;has been shown&lt;/a&gt; to be&amp;nbsp;equivalent&amp;nbsp;to &lt;a href="http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm"&gt;expectation maximization&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Why might a Gaussian kernel be better than a flat kernel?&lt;/b&gt;&lt;br /&gt;The flat kernel simply wants to have as many points as possible located in it, while the Gaussian kernel wants as many points possible near its center. &amp;nbsp;This means that the flat kernel will be more likely to try to cover up two or more distinct clusters because it doesn't care if the dense cluster centers are located at its periphery. &amp;nbsp;The following image (which contains two clusters, one above the other) demonstrates this behavior:&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-vzLPfiVk65Y/TsUmsNOZQDI/AAAAAAAAHSI/IJZlHxD2960/s1600/gaussian-vs-flat.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="400" src="http://3.bp.blogspot.com/-vzLPfiVk65Y/TsUmsNOZQDI/AAAAAAAAHSI/IJZlHxD2960/s400/gaussian-vs-flat.png" width="205" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Whereas the flat kernel (below) simply tries to cover as many points as possible, the Gaussian kernel assigns a higher weight to points nearer to its center.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;The Gaussian kernel gets it right because in its calculation of the weighted mean, it weights points nearer the center higher than points near the periphery.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;How can we make mean-shift scalable?&lt;/b&gt;&lt;br /&gt;While a naive implementation of mean-shift is not very scalable, a few straightforward measures allow it to run on datasets containing tens of millions of points.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Do efficient neighbor lookup&lt;/i&gt;: First, for each mean-shift iteration, we can look up only nearby points because with a flat kernel, points beyond the bandwidth are irrelevant to the mean shift calculation, and with the Gaussian kernel these will have little effect. &amp;nbsp;There are a couple of data structures that allow one to efficiently query only the nearest neighbors of a given point, such as the &lt;a href="http://en.wikipedia.org/wiki/K-d_tree"&gt;k-d&amp;nbsp;Tree&lt;/a&gt; or the &lt;a href="http://en.wikipedia.org/wiki/Ball_tree"&gt;ball tree&lt;/a&gt;&amp;nbsp;(the ball tree should scale better in higher dimensions than the k-d tree).&lt;br /&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;br /&gt;&lt;i&gt;Seed with binned points&lt;/i&gt;: Next, we can observe that kernels that are initiated next to each other tend to converge to the same clusters. &amp;nbsp;So we can bin (i.e., discretize) the data points by pushing them onto a grid, and use only populated points on the grid as kernels. &amp;nbsp;Furthermore, we can set some threshold and use only those bins with at least that many points as kernels.&lt;br /&gt;&lt;br /&gt;I've noticed that if you filter out points on this grid that are populated with very few points, then this&amp;nbsp;technique&amp;nbsp;also solves another problem: kernels starting out and getting stuck in very sparse regions where no cluster exists. &amp;nbsp;For example, in the image above, there is one lonely point in the top left. &amp;nbsp;If we start a Gaussian kernel here, it will likely get stuck there, and end up as a cluster with only one &amp;nbsp;or two points in it. &amp;nbsp;However, if we filter out bins with just a couple of points in them (as I did to create the clustering visualized in that image), then no kernel will be initialized there.&lt;br /&gt;&lt;br /&gt;For a more extended, technical discussion about how to speed up mean-shift, see&lt;a href="http://dx.doi.org/10.1109/CVPR.2006.44"&gt; this paper&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8128923945165565689-6516604180164252004?l=sociograph.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sociograph.blogspot.com/feeds/6516604180164252004/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8128923945165565689&amp;postID=6516604180164252004' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/6516604180164252004'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/6516604180164252004'/><link rel='alternate' type='text/html' href='http://sociograph.blogspot.com/2011/11/accessible-introduction-to-mean-shift.html' title='An accessible introduction to mean-shift'/><author><name>Conrad Lee</name><uri>https://profiles.google.com/110979755871396857085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh6.googleusercontent.com/-3Jylvn3UkJI/AAAAAAAAAAI/AAAAAAAAAAA/PM9AqhTcFSE/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-pwd1g8Y9lCw/TsUcH9y5tII/AAAAAAAAHSA/MeQqGy7WQH0/s72-c/mean_shift_illustration.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8128923945165565689.post-1171355037621319183</id><published>2011-11-05T20:04:00.001Z</published><updated>2011-11-24T11:31:53.713Z</updated><title type='text'>Clique percolation in a few lines of python</title><content type='html'>&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-l2H9iSRqE60/TrZveSd0goI/AAAAAAAAHRs/U3lQPbd0M-s/s1600/percolator.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="320" src="http://2.bp.blogspot.com/-l2H9iSRqE60/TrZveSd0goI/AAAAAAAAHRs/U3lQPbd0M-s/s320/percolator.jpg" width="309" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Photo courtesy of &lt;a href="http://www.flickr.com/photos/noricum/3521379270/"&gt;noricum&lt;/a&gt;.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;div style="text-align: center;"&gt;&lt;div style="text-align: left;"&gt;&lt;b style="text-align: -webkit-auto;"&gt;Update&lt;/b&gt;: &lt;i&gt;Thanks to&amp;nbsp;&lt;a href="http://www.aaronmcdaid.com/" style="text-align: -webkit-auto;"&gt;Aaron McDaid&lt;/a&gt;&amp;nbsp;and Fergal Reid&lt;span class="Apple-style-span" style="text-align: -webkit-auto;"&gt;&amp;nbsp;for catching a couple of bugs (I was percolating on cliques sharing&amp;nbsp;&lt;/span&gt;&lt;b style="text-align: -webkit-auto;"&gt;k&lt;/b&gt;&lt;span class="Apple-style-span" style="text-align: -webkit-auto;"&gt;&amp;nbsp;rather than&lt;/span&gt;&lt;b style="text-align: -webkit-auto;"&gt;&amp;nbsp;k - 1&lt;/b&gt;&lt;span class="Apple-style-span" style="text-align: -webkit-auto;"&gt;&amp;nbsp;nodes, and forgot to leave out cliques with fewer than&amp;nbsp;&lt;/span&gt;&lt;b style="text-align: -webkit-auto;"&gt;k&lt;/b&gt;&lt;span class="Apple-style-span" style="text-align: -webkit-auto;"&gt;&amp;nbsp;nodes) also cliques that didn't percolate were being left out.&lt;/span&gt;&lt;/i&gt;&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;One well-known algorithm for detecting overlapping communities is called the &lt;b&gt;C&lt;/b&gt;lique &lt;b&gt;P&lt;/b&gt;ercolation &lt;b&gt;M&lt;/b&gt;ethod (CPM). &amp;nbsp;While in my experience this algorithm is problematic, it is nonetheless well-known and widely used, probably because the method&lt;a href="http://dx.doi.org/10.1038/nature03607"&gt; was published in Nature&lt;/a&gt;, was timely,&amp;nbsp;and because the authors created an easy-to-use implementation called &lt;a href="http://www.cfinder.org/"&gt;CFinder&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;This algorithm is not yet in python's NetworkX library (&lt;b&gt;Update:&lt;/b&gt;&amp;nbsp;&lt;i&gt;the code below made it into NetworkX 1.6&lt;/i&gt;). &amp;nbsp;That's no problem though, because if you think of the algorithm in terms of finding connected components in the "clique graph," it is very easy to implement. &amp;nbsp;Given some graph &lt;b&gt;G&lt;/b&gt;, here's how you create &lt;b&gt;G&lt;/b&gt;'s clique graph:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Find all cliques in &lt;b&gt;G&lt;/b&gt;. &amp;nbsp;These are the "nodes" in &lt;b&gt;G&lt;/b&gt;'s clique graph, but to avoid confusion I'll call them cliques in what follows.&lt;/li&gt;&lt;li&gt;Compare every pair of cliques. If they share at least &lt;i style="font-weight: bold;"&gt;k - 1&lt;/i&gt;&amp;nbsp;nodes, then draw an edge between the pair of cliques in the clique graph.&lt;/li&gt;&lt;li&gt;Find the connected components in the clique graph--these are your percolated cliques! For each connected component return the set of nodes (from G) that are in that component.&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;That explanation is short but maybe a little unclear because I am describing a graph and it's clique graph, so words like node and clique are ambiguous. &amp;nbsp;The code is less ambiguous:&lt;/div&gt;&lt;iframe height="275" src="http://www.purplegene.com/script?url=https://gist.github.com/1341933.js?file=clique_percolation.py" width="100%"&gt;&lt;/iframe&gt;&lt;br /&gt;That implementation is correct but expensive---it requires O(N^2) clique comparisons, where N is the number of cliques (which is often much larger than the number of nodes!). &amp;nbsp;If we use a python dictionary to index which nodes belong to which cliques, then we can easily compare only those cliques that share at least one node in common. &amp;nbsp;This implementation is a bit longer but should be much more efficient:&lt;br /&gt;&lt;iframe height="600" src="http://www.purplegene.com/script?url=https://gist.github.com/1341940.js?file=clique_percolation_indexed.py" width="100%"&gt;&lt;/iframe&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8128923945165565689-1171355037621319183?l=sociograph.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sociograph.blogspot.com/feeds/1171355037621319183/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8128923945165565689&amp;postID=1171355037621319183' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/1171355037621319183'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/1171355037621319183'/><link rel='alternate' type='text/html' href='http://sociograph.blogspot.com/2011/11/clique-percolation-in-few-lines-of.html' title='Clique percolation in a few lines of python'/><author><name>Conrad Lee</name><uri>https://profiles.google.com/110979755871396857085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh6.googleusercontent.com/-3Jylvn3UkJI/AAAAAAAAAAI/AAAAAAAAAAA/PM9AqhTcFSE/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-l2H9iSRqE60/TrZveSd0goI/AAAAAAAAHRs/U3lQPbd0M-s/s72-c/percolator.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8128923945165565689.post-6223601810241497924</id><published>2011-11-04T11:34:00.003Z</published><updated>2011-11-13T19:07:50.256Z</updated><title type='text'>Fast I/O and compact data with python / numpy's memmap</title><content type='html'>&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-Id8ZXYlKeGU/TrPGwF6LFPI/AAAAAAAAHRk/KMqYkjKb5KA/s1600/binary-fernando.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="320" src="http://3.bp.blogspot.com/-Id8ZXYlKeGU/TrPGwF6LFPI/AAAAAAAAHRk/KMqYkjKb5KA/s320/binary-fernando.jpg" width="240" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Image courtesy of &lt;a href="http://www.flickr.com/photos/fernando/246375059/in/photostream/"&gt;fernando&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;b&gt;Update:&amp;nbsp;&lt;/b&gt;I discovered a bug in the way I was benchmarking these---t&lt;b&gt;he memmap parser ends up being 4X (not 30X) faster&lt;/b&gt;&amp;nbsp;than the naive python parser (but the memmap method should still take up substantially less memory). &amp;nbsp;Actually, it's 4X faster if the data is read in from the memmapped file using a for loop. &amp;nbsp;But the bottleneck here is not I/O, it seems to be python's need to make a separate function call for each piece of data. &amp;nbsp;If I were to benchmark using a vectorized numpy operation, then the memmap parser would, I suspect, be much faster, and disk I/O would be the bottleneck (this benchmark is on my to-do list, and I'll update this post once I've run it).&lt;br /&gt;&lt;br /&gt;I've notice that I'm turning into a python evangelist lately, but I often need to face a hard fact: &lt;b&gt;python, when used naievely, is slow and memory inefficient! &amp;nbsp;&lt;/b&gt;However, there are all sorts of tricks you can use to speed it up and use less memory, such as use &lt;a href="http://numpy.scipy.org/"&gt;NumPy&lt;/a&gt;. &amp;nbsp;Here I'll cover a little trick that allows one to quickly read in files that might not even fit in memory using NumPy's memory mapped files.&lt;br /&gt;&lt;br /&gt;I'll explain how to use mmap in terms of an example: parsing in network data. &amp;nbsp;Let's say I have my network data in the following edgelist format:&lt;br /&gt;&lt;blockquote class="tr_bq"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;nodeID1 &amp;nbsp; &amp;nbsp;nodeID2 &amp;nbsp; &amp;nbsp;edgeweight&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote class="tr_bq"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;nodeID1 &amp;nbsp; &amp;nbsp;nodeID3 &amp;nbsp; &amp;nbsp;edgeweight&amp;nbsp;&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote class="tr_bq"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;...&amp;nbsp;&lt;/span&gt;&lt;/blockquote&gt;The straightforward python parser might look like this:&lt;br /&gt;&lt;br /&gt;&lt;iframe height="180" src="http://www.purplegene.com/script?url=https://gist.github.com/1339119.js?file=simple_edgelist_parser.py" width="100%"&gt;&lt;/iframe&gt;&lt;br /&gt;&lt;br /&gt;However, this parser won't be as fast as you'd like, mostly because python does not know how large the edgelist will be, so it needs to keep dynamically allocating memory over and over again. &amp;nbsp;&lt;b&gt;To read in 1,000,000 edges&lt;/b&gt;&amp;nbsp;&lt;b&gt;this parser takes 3.13 seconds&lt;/b&gt;, which isn't great.&amp;nbsp; Even if we could allocate the memory in advance, python's split function and even its coversion of strings to integers isn't as quick as you might be used to if you use unix's command line tools like &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;cut&lt;/span&gt; or &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;sort&lt;/span&gt;. &amp;nbsp;Additionally, the edgelist will take up more memory than is needed, because python &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;int&lt;/span&gt; and &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;float&lt;/span&gt; objects take up much more memory than in C/C++/Java.&lt;br /&gt;&lt;br /&gt;Does this mean that python doesn't scale? &amp;nbsp;No, it just means &lt;i&gt;we should be using numpy&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;Here's a solution that has four nice features&lt;br /&gt;&lt;ol&gt;&lt;li&gt;It stores data in a binary format. Once created, the data does not need to be parsed from text to binary each time.&lt;/li&gt;&lt;li&gt;It elegantly handles the case where the (binary) data does not fit into memory. &amp;nbsp;You can still iterate over the edges as if they were all in memory---presumably the operating system's virtual memory is used to handle what won't fit.&lt;/li&gt;&lt;li&gt;Data is represented compactly and with a predictable size. &amp;nbsp;The data is just as compact as if you were using C or Java, and you have full control over the types (e.g., signed or unsigned ints, 32-bit or 64-bit ints), which is not usually the case in python.&lt;/li&gt;&lt;li&gt;It's fast: &lt;b&gt;To read in 1,000,000 edges takes only 0.788 seconds, which is 4X faster than the python parser above. &amp;nbsp;&lt;/b&gt;If I had a slower disk (this is on my laptop's fast ssd), this difference might be smaller be smaller, because the "parsing time" would make up less of the overall runtime, and the I/O would make up more of it.&lt;/li&gt;&lt;/ol&gt;&lt;iframe height="600" src="http://www.purplegene.com/script?url=https://gist.github.com/1339145.js?file=mmap_edgelist_parser.py" width="100%"&gt;&lt;/iframe&gt;&lt;br /&gt;&lt;br /&gt;For more information on using numpy's memmap, check out &lt;a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html"&gt;the official documentation&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8128923945165565689-6223601810241497924?l=sociograph.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sociograph.blogspot.com/feeds/6223601810241497924/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8128923945165565689&amp;postID=6223601810241497924' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/6223601810241497924'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/6223601810241497924'/><link rel='alternate' type='text/html' href='http://sociograph.blogspot.com/2011/11/fast-io-and-compact-data-with-python.html' title='Fast I/O and compact data with python / numpy&apos;s memmap'/><author><name>Conrad Lee</name><uri>https://profiles.google.com/110979755871396857085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh6.googleusercontent.com/-3Jylvn3UkJI/AAAAAAAAAAI/AAAAAAAAAAA/PM9AqhTcFSE/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-Id8ZXYlKeGU/TrPGwF6LFPI/AAAAAAAAHRk/KMqYkjKb5KA/s72-c/binary-fernando.jpg' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8128923945165565689.post-1359595885908133573</id><published>2011-11-01T16:59:00.002Z</published><updated>2011-11-01T17:14:21.634Z</updated><title type='text'>Converting huge edge list files to pajek format</title><content type='html'>Some network analysis software requires the network to be in the pajek format. &amp;nbsp;For example, I want to use Rosvall et. al's handy &lt;a href="http://www.tp.umu.se/~rosvall/code.html"&gt;Infomap algorithm&lt;/a&gt; to perform community deteciton on a large network dataset, and his implementation expects a pajek-formatted file as input. &amp;nbsp;However, I usually store my networks in the simpler edgelist format, which is just a text file with one edge per line.&lt;br /&gt;&lt;br /&gt;I've just written a little python script that converts edge list files to pajek format. &amp;nbsp;The script takes in a simple edgelist file and converts it to a pajek file. &amp;nbsp;There are similar converters out there on the web, but my converter's special trick is to use the unix sort command so that &lt;i&gt;it can convert files&lt;/i&gt; &lt;i&gt;larger than will fit in memory&lt;/i&gt;. &amp;nbsp;There are options for weighted/unweigted networks as well as directed/undirected networks.&lt;br /&gt;&lt;br /&gt;The script creates a tempfile, and in order to create the tempfile safely (so that if the script is terminated or crashes , it still cleans up behind itself), I use python's handy &lt;b&gt;with statement&lt;/b&gt;. &amp;nbsp;This means you'll need python 2.6 or higher to run it.&lt;br /&gt;&lt;br /&gt;To see how to use the script, just type&lt;br /&gt;&lt;blockquote class="tr_bq"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;python edgelist2pajek.py -h&lt;/span&gt;&lt;/blockquote&gt;You can download, criticize, fork, or imporve the code &lt;a href="https://gist.github.com/1331132"&gt;here&lt;/a&gt;. Here's what the sourcecode looks like:&lt;br /&gt;&lt;br /&gt;&lt;iframe height="300" src="http://www.purplegene.com/script?url=https://gist.github.com/1331132.js?file=edgelist2pajek.py" width="100%"&gt;&amp;amp;lt;p&amp;amp;gt;&amp;amp;amp;amp;lt;br /&amp;amp;amp;amp;gt;&amp;amp;lt;/p&amp;amp;gt;&lt;/iframe&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8128923945165565689-1359595885908133573?l=sociograph.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sociograph.blogspot.com/feeds/1359595885908133573/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8128923945165565689&amp;postID=1359595885908133573' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/1359595885908133573'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/1359595885908133573'/><link rel='alternate' type='text/html' href='http://sociograph.blogspot.com/2011/11/converting-huge-edge-list-files-to.html' title='Converting huge edge list files to pajek format'/><author><name>Conrad Lee</name><uri>https://profiles.google.com/110979755871396857085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh6.googleusercontent.com/-3Jylvn3UkJI/AAAAAAAAAAI/AAAAAAAAAAA/PM9AqhTcFSE/s512-c/photo.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8128923945165565689.post-2634322743982473161</id><published>2011-09-26T17:27:00.000+01:00</published><updated>2011-09-26T17:27:22.708+01:00</updated><title type='text'>Compiling ATLAS by hand speeds up NMF in python 3X</title><content type='html'>&lt;br /&gt;&lt;br /&gt;I've been doing some Non-negative Matrix Factorization (NMF) to co-cluster users and the locations that they visit, as in the figure below&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-kIp5OkLAGYg/ToCjhj_o1SI/AAAAAAAAHQI/ciEoC3nInlM/s1600/NMF.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="146" src="http://4.bp.blogspot.com/-kIp5OkLAGYg/ToCjhj_o1SI/AAAAAAAAHQI/ciEoC3nInlM/s320/NMF.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;Basically there is one important parameter, &lt;i&gt;k&lt;/i&gt;, which you have to set and which determines the number of clusters to be found. &amp;nbsp;NMF then returns two matrices, &lt;b&gt;W&lt;/b&gt;&amp;nbsp;and &lt;b&gt;H, &lt;/b&gt;whose product approximates the original data. &amp;nbsp;&lt;b&gt;W&lt;/b&gt;&amp;nbsp;has one row per tourist that tells you how much each tourist belongs to each cluster. &amp;nbsp;&lt;b&gt;H&lt;/b&gt;&amp;nbsp;has one column per Point of Interest (POI) which tells you how much each POI belongs to each cluster.&lt;br /&gt;&lt;br /&gt;I have been using the well-documented &lt;a href="http://scikit-learn.sourceforge.net/"&gt;scikit-learn&lt;/a&gt; python library to perform the NMF, which is nice because it provides clear examples. &amp;nbsp;Scikit-learn relies on &lt;a href="http://numpy.scipy.org/"&gt;numpy&lt;/a&gt; and &lt;a href="http://scipy.org/"&gt;scipy&lt;/a&gt; for fast matrix operations, and these in turn rely on a slew of low-level linear algebra packages (e.g., BLAS and LAPACK). &amp;nbsp;BLAS, a linear algebra library written in Fortran, seems to be key to these operations. &amp;nbsp;It is possible to compile BLAS with settings so that it is optimized for your particular machine using a program called ATLAS. &amp;nbsp;As far as I understand it, ATLAS performs a bunch of tests to empirically determine the optimal way to configure BLAS.&lt;br /&gt;&lt;br /&gt;Using the guide provided &lt;a href="http://advice.mechanicalkern.com/question/16/how-do-you-install-numpyscipy-with-atlas-support-on-a-machine-with-non-root-access#17"&gt;here&lt;/a&gt;&amp;nbsp;(which thankfully does not assume you have root access), I used ATLAS to optimize all this linear algebra and get my python setup to use these optimized libraries. &amp;nbsp;This is a rather involved process of compiling packages that took me more than half an hour. &amp;nbsp;Was it worth it? &amp;nbsp;A few benchmarks on my data show a speed-up of 3X (on a 24 core Dell server with Intel chips running at 2.4 GHZ). &amp;nbsp;Note that I was not able to disable the CPU-throttling as recommended by ATLAS, so perhaps others who can do so will benefit even more.&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8128923945165565689-2634322743982473161?l=sociograph.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sociograph.blogspot.com/feeds/2634322743982473161/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8128923945165565689&amp;postID=2634322743982473161' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/2634322743982473161'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/2634322743982473161'/><link rel='alternate' type='text/html' href='http://sociograph.blogspot.com/2011/09/compiling-atlas-by-hand-speeds-up-nmf.html' title='Compiling ATLAS by hand speeds up NMF in python 3X'/><author><name>Conrad Lee</name><uri>https://profiles.google.com/110979755871396857085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh6.googleusercontent.com/-3Jylvn3UkJI/AAAAAAAAAAI/AAAAAAAAAAA/PM9AqhTcFSE/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-kIp5OkLAGYg/ToCjhj_o1SI/AAAAAAAAHQI/ciEoC3nInlM/s72-c/NMF.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8128923945165565689.post-5753722686321456800</id><published>2011-08-25T17:08:00.001+01:00</published><updated>2011-08-26T01:00:16.784+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='geography; metropolitan areas; data; geographic'/><title type='text'>Metropolitan Sinks: A tunable definition of metropolitan areas</title><content type='html'>&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-4A74z3DLDOc/TlZrIeyGUjI/AAAAAAAAHPo/_x2mpT5wf-o/s1600/usa-0.5.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="231" src="http://3.bp.blogspot.com/-4A74z3DLDOc/TlZrIeyGUjI/AAAAAAAAHPo/_x2mpT5wf-o/s400/usa-0.5.png" width="400" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Metropolitan sinks with &lt;b&gt;&lt;i&gt;r&lt;/i&gt;&lt;/b&gt;=0.5 degrees (55 km).&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-i-MkKpMQB3Q/TlZrQH9FxPI/AAAAAAAAHPs/I4QgS4WXH_4/s1600/usa-1.5.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="230" src="http://1.bp.blogspot.com/-i-MkKpMQB3Q/TlZrQH9FxPI/AAAAAAAAHPs/I4QgS4WXH_4/s400/usa-1.5.png" width="400" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Metropolitan sinks with &lt;b&gt;&lt;i&gt;r&lt;/i&gt;&lt;/b&gt;=1.5 (165km)&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;I'm working on a project that studies how behavior varies regionally. One of the basic prerequisites to this problem is a good definition of the term&amp;nbsp;&lt;i&gt;region.&lt;/i&gt;&amp;nbsp; In some contexts there might be some obvious definition: in the USA, one might choose states, counties, or zip codes as the unit of analysis.&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;However, I am interested in some international datasets, and in this context a couple of problems arise. First, it's hard to find a definition of region that is consistent in many countires. &amp;nbsp;For example, the U.S. has states, Canada has provinces, and Switzerland has cantons. &amp;nbsp;There are great services such as Yahoo!'s &lt;a href="http://developer.yahoo.com/geo/geoplanet/"&gt;GeoPlanet&lt;/a&gt;&amp;nbsp;APIs which will inform your program that a canton is a direct child of Switzerland, just as a state is a direct child of the USA. Because the service will list the child units of any country, it might seem that we have a solution. But are these units equivalent&amp;nbsp;across&amp;nbsp;countries? Or are Swiss cantons so small that they should be compared to US counties rather than states? After all, the entire country of Switzerland is about the size of a typical US state, both in terms of population and area. &amp;nbsp;So in an international comparison of regions, do we compare each Swiss canton to each U.S. state, or do we compare the entire Swiss nation to the states?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;By using cities and their metropolitan areas as the unit of analysis, we can ignore these problems more directly compare regions globally. The website &lt;a href="http://www.geonames.org/"&gt;geonames.org&lt;/a&gt;&amp;nbsp;provides a great&amp;nbsp;&lt;a href="http://download.geonames.org/export/dump/cities1000.zip"&gt;dataset&lt;/a&gt; that contains the geographic coordinates and population (as well as other information) of apparently all cities and towns in the world which have at least 1,000&amp;nbsp;inhabitants. However, the &lt;a href="http://en.wikipedia.org/wiki/Metropolitan_area"&gt;competing definitions of metropolitan area&lt;/a&gt;&amp;nbsp;again make international comparison difficult.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The "correct" definition of metropolitan area will depend on how the definition is to be used. &amp;nbsp;In my case, I have varied datasets which show how people from places around the world use social media sites. Each person can be resolved to a latitude and longitude using, for example, &lt;a href="http://developer.yahoo.com/geo/placemaker/"&gt;Yahoo!'s Placemaker API&lt;/a&gt;. &amp;nbsp;Then the task is to assign a metropolitan area (if any) to each of these coordinates. In other words, we need to bin users geographically. &amp;nbsp;While binning, we should keep in mind that although these datasets are in some sense large (based on millions of users), they are also sparse: there are suburbs, small towns, or villages in which nobody is a user of LastFM or Flickr. &amp;nbsp;That means if we use each town as a bin, we will have many bins with very few users or no users, making it difficult or impossible to recognize trends. &amp;nbsp;So the sparser the data, the larger we want the bins to be so that each one contains enough users for trends to be recognized. &amp;nbsp;Rural regions will likely not have towns with enough users for trends to be recognized, so in practice we will be analyzing only densely populated areas, and we need to decide how many suburbs and surrounding areas do we associate with each city. &amp;nbsp;Ideally we would have a single parameter that we could tune which control how many of the surrounding suburbs and town are bundled in with each metropolitan area.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Although one probably exists, I am not aware of any such parameterized definition of metropolitan area. &amp;nbsp;It is, however, easy enough to come up with a hacky definition that is useful enough to work with. &amp;nbsp;I propose the following procedure for detecting what I call "metropolitan sinks":&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;Insert city (i.e., city, town, or village) as a node into a graph.&lt;/li&gt;&lt;li&gt;For each city &lt;i&gt;&lt;b&gt;c&lt;/b&gt;&lt;/i&gt;, take all cities within a radius of length &lt;b&gt;&lt;i&gt;r&lt;/i&gt;&lt;/b&gt;&amp;nbsp;and find the largest. Call this largest city &lt;i style="font-weight: bold;"&gt;n&lt;/i&gt;. &amp;nbsp;Draw a directed edge from &lt;b&gt;&lt;i&gt;c&lt;/i&gt;&lt;/b&gt; to &lt;b&gt;&lt;i&gt;n&lt;/i&gt;&lt;/b&gt;. &amp;nbsp;Self loops are allowed.&lt;/li&gt;&lt;li&gt;Find the weakly connected&amp;nbsp;components&amp;nbsp;in the graph constructed above. &amp;nbsp;These are the metropolitan sinks. Each connected component will have one sink node---this is useful for naming each metropolitan area---for example, Chicago will be the sink of the Chicagoland area.&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;You might also want to filter out metropolitan sinks whose sum population is below some threshold---that will ensure that not every remote village is recognized as a metropolitan sink. The above procedure requires you to look up all points within some radius of a given point. To do so efficiently, you can use a &lt;a href="http://en.wikipedia.org/wiki/K-d_tree"&gt;k-d tree&lt;/a&gt;;&amp;nbsp;it's easy to do so in python using the scipy.spatial -- see for example &lt;a href="http://stackoverflow.com/questions/6371187/find-all-coordinates-within-a-circle-in-geographic-data-in-python/6372492#6372492"&gt;this solution&lt;/a&gt; to the problem.&lt;/div&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div&gt;What does this definition give us? To answer that question I have tried to visualize the results. To produce the visualization, I first calculated the alpha shape (a.k.a., the concave hull) of each metropolitan area, and then exported those alpha shapes into a KML file that can be viewed in Google Earth. &amp;nbsp;At the top of this post, you see the the metro sinks of the eastern US with &lt;i style="font-weight: bold;"&gt;r&lt;/i&gt;&amp;nbsp;set to 0.5 and 1.5 decimal degrees. &amp;nbsp;Below you see the same for most of Europe. &amp;nbsp;(Note that only cities of size 1000 or larger are included, and only those metropolitan areas with a total population of 10000 or larger are included. &amp;nbsp;Also, if a metropolitan area is a single city or just two cities, then it is just a point or a line, and so will not be visible in the images.)&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-MIyU-eJHWII/TlZq-rwy2-I/AAAAAAAAHPk/y-kmAYDQne0/s1600/europe-1.5.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="231" src="http://2.bp.blogspot.com/-MIyU-eJHWII/TlZq-rwy2-I/AAAAAAAAHPk/y-kmAYDQne0/s400/europe-1.5.png" width="400" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Metropolitan sinks with &lt;b&gt;&lt;i&gt;r&lt;/i&gt;&lt;/b&gt;=1.5 (165 km)&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-BwqKK1-M4Oc/TlZq3xyNTtI/AAAAAAAAHPg/y87ap6E0fGE/s1600/europe-0.5.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="231" src="http://3.bp.blogspot.com/-BwqKK1-M4Oc/TlZq3xyNTtI/AAAAAAAAHPg/y87ap6E0fGE/s400/europe-0.5.png" width="400" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Metropolitan sinks with &lt;i style="font-weight: bold;"&gt;r&lt;/i&gt;=0.5 (55km)&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;Is this a perfect way to bin cities into metropolitan areas? &amp;nbsp;No, but it's good enough for my purposes. &amp;nbsp;I can make these coarser or finer depending on how sparse my data is. &amp;nbsp;The are some clear imperfections, but none that will likely be very important. &amp;nbsp;For example, you can see that with the larger radius (1.5 decimal degrees or 165 km), a few towns in northern France are in the metropolitan sinks of Southhampton and Bristol. But all in all, these metropolitan areas provide a reasonable way to bin geographic space into regions internationally, and it requires only an initial list of cities with their coordinates and their populations.&lt;/div&gt;&lt;div&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-WlOL9wTTXWw/TlZxWyWzFiI/AAAAAAAAHPw/3t1fGsVnngY/s1600/asia-0.5.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="231" src="http://1.bp.blogspot.com/-WlOL9wTTXWw/TlZxWyWzFiI/AAAAAAAAHPw/3t1fGsVnngY/s400/asia-0.5.png" width="400" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Metropolitan sinks with &lt;i style="font-weight: bold;"&gt;r&lt;/i&gt;=0.5 decimal degrees (55km).&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8128923945165565689-5753722686321456800?l=sociograph.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sociograph.blogspot.com/feeds/5753722686321456800/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8128923945165565689&amp;postID=5753722686321456800' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/5753722686321456800'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/5753722686321456800'/><link rel='alternate' type='text/html' href='http://sociograph.blogspot.com/2011/08/metropolitan-sinks-tunable-definition.html' title='Metropolitan Sinks: A tunable definition of metropolitan areas'/><author><name>Conrad Lee</name><uri>https://profiles.google.com/110979755871396857085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh6.googleusercontent.com/-3Jylvn3UkJI/AAAAAAAAAAI/AAAAAAAAAAA/PM9AqhTcFSE/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-4A74z3DLDOc/TlZrIeyGUjI/AAAAAAAAHPo/_x2mpT5wf-o/s72-c/usa-0.5.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8128923945165565689.post-8715589179691259251</id><published>2011-07-25T13:25:00.000+01:00</published><updated>2011-07-25T13:25:54.748+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='tutorial networkx python sna'/><title type='text'>A tutorial introduction to network analysis in python</title><content type='html'>Derek Greene recently presented a tutorial at the Web Science Summer School in Galway, Ireland. &amp;nbsp;I find that &lt;a href="http://mlg.ucd.ie/summer"&gt;his slides and supporting material&lt;/a&gt; not only give a nice introduction to network analysis, but also to my favorite software library for network analysis, &lt;a href="http://networkx.lanl.gov/"&gt;NetworkX&lt;/a&gt;.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For those of you who haven't used NetworkX, it's a truly pythonic environment for network analysis (in comparison to projects like &lt;a href="http://igraph.sourceforge.net/"&gt;iGraph&lt;/a&gt; or&lt;a href="http://osl.iu.edu/~dgregor/bgl-python/"&gt; python boost graph&lt;/a&gt;, which just provide bindings to a C++ library.) &amp;nbsp;This comes with many advantages like clean syntax and interactivity, and the distadvantage of being less scalable. &amp;nbsp;Still I've used it for graphs with tens of thousands of nodes and over one million edges. &amp;nbsp;One other nice thing about NetworkX is its &lt;a href="https://networkx.lanl.gov/trac"&gt;active community&lt;/a&gt; and project leader, Aric Hagberg.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8128923945165565689-8715589179691259251?l=sociograph.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sociograph.blogspot.com/feeds/8715589179691259251/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8128923945165565689&amp;postID=8715589179691259251' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/8715589179691259251'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/8715589179691259251'/><link rel='alternate' type='text/html' href='http://sociograph.blogspot.com/2011/07/tutorial-introduction-to-network.html' title='A tutorial introduction to network analysis in python'/><author><name>Conrad Lee</name><uri>https://profiles.google.com/110979755871396857085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh6.googleusercontent.com/-3Jylvn3UkJI/AAAAAAAAAAI/AAAAAAAAAAA/PM9AqhTcFSE/s512-c/photo.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8128923945165565689.post-2386613189546770741</id><published>2011-05-09T11:57:00.001+01:00</published><updated>2011-05-09T11:58:46.186+01:00</updated><title type='text'>Communication networks part 3: The University of Kiel E-mail dataset</title><content type='html'>&lt;b&gt;Summary, download&lt;/b&gt;&lt;br /&gt;I'll first briefly introduce the data and provide the cleaned files for download---if you're interested in the details of the data and the steps I took to clean it, read the following sections.&lt;br /&gt;&lt;br /&gt;As I mentioned in the two previous posts in this series, I'm working on creating models of event-based networks, such as those formed by personal communication (the previous two posts are on the &lt;a href="http://sociograph.blogspot.com/2011/04/communication-networks-part-1-enron-e.html"&gt;Enron e-mail dataset&lt;/a&gt; and &lt;a href="http://sociograph.blogspot.com/2011/04/communication-networks-part-2-mit.html"&gt;the MIT Reality Mining dataset&lt;/a&gt;). &amp;nbsp;Before using this data, I'm examining it to get a feel for its limitations and trying to clean it up. &amp;nbsp;I'm also sharing up the data in its final cleaned form so others can replicate any experiments I perform with it.&lt;br /&gt;&lt;br /&gt;In this post I describe a dataset that was introduced in a paper from Ebel et al. called &lt;a href="http://arxiv.org/abs/cond-mat/0201476"&gt;Scale-free topology of e-mail networks&lt;/a&gt;. The original dataset can be downloaded&amp;nbsp;&lt;a href="http://www.itp.uni-bremen.de/complex/email_net.html"&gt;here&lt;/a&gt;. &amp;nbsp;In the above-menitoned paper, there is very little description of the where the data came from, basically only the following description:&lt;br /&gt;&lt;blockquote&gt;&lt;blockquote&gt;The e-mail network studied here is constructed from log files of the e-mail server at Kiel University, recording the source and destination of every e-mail from or to a student account over a period of 112 days.&amp;nbsp;The nodes of the e-mail network correspond to e-mail addresses which are connected by a link if an e-mail&amp;nbsp;has been exchanged between them. The resulting network&amp;nbsp;consists of N = 59 812 nodes (including 5165 student accounts)&amp;nbsp;with a mean degree of k=2.88 and contains several&amp;nbsp;separated clusters with less than 150 nodes and one giant&amp;nbsp;component of 56 969 nodes (mean degree &lt;kgiant&gt;=2.96)...Let us briefly discuss how our result on e-mail networks&amp;nbsp;may be influenced by the measurement process. The sampling&amp;nbsp;of the network has been restricted to one distinct e-mail&amp;nbsp;server. Therefore, only the degrees of accounts at this server&amp;nbsp;are known exactly. Here, these internal accounts correspond&amp;nbsp;to e-mail addresses of local students, whereas the external&amp;nbsp;nodes are given by all other e-mail addresses.&lt;/kgiant&gt;&lt;/blockquote&gt;&lt;/blockquote&gt;The cleaned dataset contains 392,280 events sorted by time. The "edges" file, available here, contains information on each event, which is known to the nearest minute. You can get the edges file &lt;a href="http://dl.dropbox.com/u/1800572/blog/kiel/all-cleaned-contiguous.edges"&gt;here&lt;/a&gt;. &amp;nbsp;It is in the following format:&lt;br /&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;unix_timestamp;sender_id;receiver_id&lt;/span&gt;&lt;/blockquote&gt;&amp;nbsp;The "attributes", available &lt;a href="http://dl.dropbox.com/u/1800572/blog/kiel/all-cleaned-contiguous.attributes"&gt;here&lt;/a&gt;, file labels each node as either in the core or in the periphery. &amp;nbsp;It is in the following format&lt;br /&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;node_id;bool&lt;/span&gt;&lt;/blockquote&gt;where the value of "&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;bool&lt;/span&gt;" is &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;1&lt;/span&gt; if the node is in the core, and &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;0&lt;/span&gt; if the node is in the periphery.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;A problem: sparsity in the "core" of the dataset&lt;/b&gt;&lt;br /&gt;Just as in the other two networks, one important aspect of the Kiel data is the notion of core and periphery. We have complete e-mail data for only those 5165 nodes in the core, and for the other approx 50,000 nodes in the periphery we know only how they communicated with nodes in the core. &amp;nbsp;A closer look at the data reveals a serious &amp;nbsp;limitation with the dataset:&amp;nbsp;&lt;i&gt;the vast majority of communication is between the core and the periphery. &lt;/i&gt;In the original, uncleaned data there are 447,533 events. Of these, only 6,256 (1.4%) take place between nodes in the core. &amp;nbsp;After performing a couple of cleaning steps outlined below, only 0.6% of the communication events take place in the core. &amp;nbsp;So although this dataset is in one sense large because it contains nearly half a million events, if you're interested in the subset of nodes among whom all communication is known, then it is quite a small, sparse dataset.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Cleaning the data: removing self-messages and spammers&lt;/b&gt;&lt;br /&gt;Given that I'm interested in how information could spread in a communication network, I want to remove e-mails that people send to themselves. &amp;nbsp;Surprisingly, more than half of all of the communication events in the core (3,301 of 6,256) were self-loops.&lt;br /&gt;&lt;br /&gt;Additionally, I wanted to remove messages from spammers or large mailing lists so that in the end the data would contain only personal communication written by humans. Spam can be identified by a large number of e-mail being simultaneously sent by the same source. &amp;nbsp;However, I also noticed that there were a couple of days in which a single sender would send out thousands of e-mail within the same hour. As one can see in the plot below, this behaviour was anomalous:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.scribd.com/doc/54994136/All-Per-Hour-Hist" style="-x-system-font: none; display: block; font-family: Helvetica,Arial,Sans-serif; font-size-adjust: none; font-size: 14px; font-stretch: normal; font-style: normal; font-variant: normal; font-weight: normal; line-height: normal; margin: 12px auto 6px auto; text-decoration: underline;" title="View All Per Hour Hist on Scribd"&gt;All Per Hour Hist&lt;/a&gt;&lt;iframe class="scribd_iframe_embed" data-aspect-ratio="1.61904761904762" data-auto-height="true" frameborder="0" height="600" id="doc_94240" scrolling="no" src="http://www.scribd.com/embeds/54994136/content?start_page=1&amp;amp;view_mode=list&amp;amp;access_key=key-2eqx9bq4ammx1vv332yj" width="100%"&gt;&lt;/iframe&gt;&lt;script type="text/javascript"&gt;(function() { var scribd = document.createElement("script"); scribd.type = "text/javascript"; scribd.async = true; scribd.src = "http://www.scribd.com/javascripts/embed_code/inject.js"; var s = document.getElementsByTagName("script")[0]; s.parentNode.insertBefore(scribd, s); })();&lt;/script&gt;&lt;br /&gt;&lt;br /&gt;To remove such suspiciously spammy e-mails, I used the following procedure: if more than 20 e-mail came from the same sender in one hour, then I consider that all messages from that sender in that hour were spam, and I removed it from the dataset.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Periodicity, Stability&lt;/b&gt;&lt;br /&gt;In contrast to the other event datasets, the Kiel dataset is quite stable. &amp;nbsp;This is probably because, in contrast to the MIT Reality Mining dataset, people did not need to actively participate, and they had no option to "opt-out." &amp;nbsp;Also, in contrast to the Enron dataset, there is a much bigger core (around five thousand accounts rather than around 150 accounts), and the period of observation is briefer (just 112 days), meaning that not as many people come and go. In the following plot we see the number of e-mail events per day---note that you can see the weekends, one holiday (in Germany October 3rd is the day of German Unity), and a strange gap in the middle---perhaps the e-mail server went down.&lt;br /&gt;&lt;a href="http://www.scribd.com/doc/54994878/Kiel-All-Cleaned-Contiguous-Over-Time" style="-x-system-font: none; display: block; font-family: Helvetica,Arial,Sans-serif; font-size-adjust: none; font-size: 14px; font-stretch: normal; font-style: normal; font-variant: normal; font-weight: normal; line-height: normal; margin: 12px auto 6px auto; text-decoration: underline;" title="View Kiel All Cleaned Contiguous Over Time on Scribd"&gt;Kiel All Cleaned Contiguous Over Time&lt;/a&gt;&lt;iframe class="scribd_iframe_embed" data-aspect-ratio="4.896" data-auto-height="true" frameborder="0" height="600" id="doc_63614" scrolling="no" src="http://www.scribd.com/embeds/54994878/content?start_page=1&amp;amp;view_mode=list&amp;amp;access_key=key-2atxehf35jdocr7ug34w" width="100%"&gt;&lt;/iframe&gt;&lt;script type="text/javascript"&gt;(function() { var scribd = document.createElement("script"); scribd.type = "text/javascript"; scribd.async = true; scribd.src = "http://www.scribd.com/javascripts/embed_code/inject.js"; var s = document.getElementsByTagName("script")[0]; s.parentNode.insertBefore(scribd, s); })();&lt;/script&gt;&lt;br /&gt;&lt;br /&gt;If we look at a finer resolution --- the number of e-mail sent per hour --- we also see stable periodicity&lt;br /&gt;&lt;a href="http://www.scribd.com/doc/54994905/Kiel-All-Cleaned-Contiguous-Over-Time-Hourly" style="-x-system-font: none; display: block; font-family: Helvetica,Arial,Sans-serif; font-size-adjust: none; font-size: 14px; font-stretch: normal; font-style: normal; font-variant: normal; font-weight: normal; line-height: normal; margin: 12px auto 6px auto; text-decoration: underline;" title="View Kiel All Cleaned Contiguous Over Time Hourly on Scribd"&gt;Kiel All Cleaned Contiguous Over Time Hourly&lt;/a&gt;&lt;iframe class="scribd_iframe_embed" data-aspect-ratio="4.896" data-auto-height="true" frameborder="0" height="600" id="doc_50563" scrolling="no" src="http://www.scribd.com/embeds/54994905/content?start_page=1&amp;amp;view_mode=list&amp;amp;access_key=key-292fx8oqbarrspr5fx13" width="100%"&gt;&lt;/iframe&gt;&lt;script type="text/javascript"&gt;(function() { var scribd = document.createElement("script"); scribd.type = "text/javascript"; scribd.async = true; scribd.src = "http://www.scribd.com/javascripts/embed_code/inject.js"; var s = document.getElementsByTagName("script")[0]; s.parentNode.insertBefore(scribd, s); })();&lt;/script&gt;&lt;br /&gt;&lt;br /&gt;Note that in the original dataset, only the month, day, hour, and minute of each e-mail is given. &amp;nbsp;Thus, it's not clear from which year this dataset comes. &amp;nbsp;But given that the paper that introduces this dataset was submitted for publication in early 2002, it seemed likely to me that the data comes from 2001. &amp;nbsp;This guess is further supported by the fact that in 2001 the holiday on October third landed on a Wednesday, a fact which is compatible with the data. &amp;nbsp;One can also observe that there were somewhat more e-mails sent out on Sept. 11 than usual, indicating that the year would be 2001.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Distribution of messages / dyad&lt;/b&gt;&lt;br /&gt;To get a feeling for the data, it's also useful to look at the distribution of messages per dyad. &amp;nbsp;Note that below only those dyads are considered if there was at least one message between them. &amp;nbsp;Also, the distributions below are for directed dyads (the plots for undirected dyads look similar). &amp;nbsp;The first plot shows the distribution for all dyads, the second for only those dyads in the core.&lt;br /&gt;&lt;a href="http://www.scribd.com/doc/54998073/Dyad-Distro-Directed-All" style="-x-system-font: none; display: block; font-family: Helvetica,Arial,Sans-serif; font-size-adjust: none; font-size: 14px; font-stretch: normal; font-style: normal; font-variant: normal; font-weight: normal; line-height: normal; margin: 12px auto 6px auto; text-decoration: underline;" title="View Dyad Distro Directed All on Scribd"&gt;Dyad Distro Directed All&lt;/a&gt;&lt;iframe class="scribd_iframe_embed" data-aspect-ratio="1.33333333333333" data-auto-height="true" frameborder="0" height="600" id="doc_79375" scrolling="no" src="http://www.scribd.com/embeds/54998073/content?start_page=1&amp;amp;view_mode=slideshow&amp;amp;access_key=key-1wmkeoh5aaje979073hb" width="100%"&gt;&lt;/iframe&gt;&lt;script type="text/javascript"&gt;(function() { var scribd = document.createElement("script"); scribd.type = "text/javascript"; scribd.async = true; scribd.src = "http://www.scribd.com/javascripts/embed_code/inject.js"; var s = document.getElementsByTagName("script")[0]; s.parentNode.insertBefore(scribd, s); })();&lt;/script&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.scribd.com/doc/54998072/Dyad-Distro-Directed-Core" style="-x-system-font: none; display: block; font-family: Helvetica,Arial,Sans-serif; font-size-adjust: none; font-size: 14px; font-stretch: normal; font-style: normal; font-variant: normal; font-weight: normal; line-height: normal; margin: 12px auto 6px auto; text-decoration: underline;" title="View Dyad Distro Directed Core on Scribd"&gt;Dyad Distro Directed Core&lt;/a&gt;&lt;iframe class="scribd_iframe_embed" data-aspect-ratio="1.33333333333333" data-auto-height="true" frameborder="0" height="600" id="doc_57200" scrolling="no" src="http://www.scribd.com/embeds/54998072/content?start_page=1&amp;amp;view_mode=slideshow&amp;amp;access_key=key-h2a2vhtysp8cio5ro0f" width="100%"&gt;&lt;/iframe&gt;&lt;script type="text/javascript"&gt;(function() { var scribd = document.createElement("script"); scribd.type = "text/javascript"; scribd.async = true; scribd.src = "http://www.scribd.com/javascripts/embed_code/inject.js"; var s = document.getElementsByTagName("script")[0]; s.parentNode.insertBefore(scribd, s); })();&lt;/script&gt;&lt;br /&gt;&lt;br /&gt;We note that there are a few dyads between which many more messages were sent than normal. &amp;nbsp;For example, one e-mail address sent another one 7000 e-mails within the 112 day period (over 60/day on average).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8128923945165565689-2386613189546770741?l=sociograph.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sociograph.blogspot.com/feeds/2386613189546770741/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8128923945165565689&amp;postID=2386613189546770741' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/2386613189546770741'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/2386613189546770741'/><link rel='alternate' type='text/html' href='http://sociograph.blogspot.com/2011/05/communication-networks-part-3.html' title='Communication networks part 3: The University of Kiel E-mail dataset'/><author><name>Conrad Lee</name><uri>https://profiles.google.com/110979755871396857085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh6.googleusercontent.com/-3Jylvn3UkJI/AAAAAAAAAAI/AAAAAAAAAAA/PM9AqhTcFSE/s512-c/photo.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8128923945165565689.post-2080498176952126928</id><published>2011-04-13T20:02:00.003+01:00</published><updated>2011-04-14T10:26:03.454+01:00</updated><title type='text'>Communication networks part 2: the MIT Reality Mining dataset</title><content type='html'>&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-FuMWBmie14I/Taa9VYpSLnI/AAAAAAAAHA4/l0cyqvpcC8g/s1600/mit-horizontal.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="211" src="http://2.bp.blogspot.com/-FuMWBmie14I/Taa9VYpSLnI/AAAAAAAAHA4/l0cyqvpcC8g/s400/mit-horizontal.png" width="400" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;One view of the network created by contact among participants in the MIT Reality Mining dataset&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;div style="text-align: center;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;Like &lt;a href="http://sociograph.blogspot.com/2011/04/communication-networks-part-1-enron-e.html"&gt;the last post&lt;/a&gt; in this series of three, this post is half intended to serve as self-documentation. &amp;nbsp;It should also be anyone who wants to use the MIT Reality Mining dataset and wants to understand some nitty-gritty details, as it includes some information that's not in the official description of the data. &amp;nbsp;Additionally, if you're just after a cleaned up dataset in a simple form, I provide that below.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;The MIT Reality Mining Dataset is based on a quite high-profile collection of behavioral data. &amp;nbsp;The data can be obtained from the official site (annoyingly, you have to e-mail someone to get the link to download the data, but the response time is good). &amp;nbsp;The &lt;a href="http://dl.dropbox.com/u/1800572/blog/Eagle%2C%20Pentland%20-%202009%20-%20The%20Reality%20Mining%20Data.pdf"&gt;official read-me&lt;/a&gt;&amp;nbsp;is a good place to begin to understand the data. &amp;nbsp;Here's the abstract from that readme:&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;The Reality Mining project was conducted from 2004- 2005 at the MIT Media Laboratory. The Reality Mining study followed ninety-four subjects using mobile phones pre-installed with several pieces of software that recorded and sent the researcher data about call logs, Bluetooth devices in proximity of approximately five meters, cell tower IDs, application usage, and phone status. Subjects were observed using these measurements over the course of nine months and included students and faculty from two programs within a major research institution. We also collected self-report relational data from each individual, where subjects were asked about their proximity to, and friendship with, others.&lt;/span&gt;&lt;/span&gt;&lt;/blockquote&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;One impressive feature of this dataset is that the phones that people carried around were nearly always on, as the following quote from the readme summarizes:&lt;/span&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;On average we have logs accounting for approximately 85.3% of the time that the phones have been deployed.&lt;/span&gt;&lt;/blockquote&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;The data includes three types of social networks: phone calls, SMS messages, and proximity (via bluetooth). &amp;nbsp;In what follows I will extract time-stamped communication events from only the voice communication.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;A core of 94 people that&amp;nbsp;&lt;/span&gt;collectively&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&amp;nbsp;contact a periphery of 26,965 people&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;One important aspect is that there is a "core-periphery" structure to the data, i.e., it includes &amp;nbsp;a "core" of 94 people for whom we know all communication data during their active people; additionally, there is data on all of the tens of thousands of people in the "periphery," i.e., those who had contact with the core. &amp;nbsp;For a person in the periphery, we know only about his communication with the core. &amp;nbsp;The vast majority of communication is between the core and the periphery. &amp;nbsp;Crucially, we do not know about communication that takes place between members of the periphery.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;b&gt;Participation in the study varies over time&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;Unlike the Enron dataset, we don't have to worry about noise brought in to this data from broadcast-style communication, because mobile phone communication is inherently dyadic (between two people, except for three-way calls, which are rare enough that we can probably safely ignore them). However, in this dataset we have to worry about attrition; i.e., the tendency of people to drop out of the study. &amp;nbsp;If we look at the following chart we can see that although the study was conducted over the course of a full academic year, there is a relatively short period during which the study is in full swing.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;iframe class="scribd_iframe_embed" data-aspect-ratio="1.33333333333333" data-auto-height="true" frameborder="0" height="600" id="doc_76512" scrolling="no" src="http://www.scribd.com/embeds/52939502/content?start_page=1&amp;amp;view_mode=list&amp;amp;access_key=key-16uehcjf32jxdd0omooc" width="100%"&gt;&lt;/iframe&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;Because I'm interested in network data, I am interested in periods during which a large number of people are simultaneously involved in the study. &amp;nbsp;The periods with few users contain much less information on the communication in the core of the network, i.e., the part for which we know about all pairwise communication events. &amp;nbsp;For that reason, &lt;i&gt;I am only going to include communication that occurs between midnight on Sept 24, 2004 and midnight on January 8th, 2005&lt;/i&gt;, in which the study was in full swing&lt;i&gt;. &amp;nbsp;&lt;/i&gt;This range is outlined in red in the image above.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;&lt;b&gt;Periodicity&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;It's interesting to look at the frequency of communication within these periods. &amp;nbsp;If we aggregate by days and plot the number of communication events per day, we get the following chart (minor tick marks correspond to weekends):&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;iframe class="scribd_iframe_embed" data-aspect-ratio="4.896" data-auto-height="true" frameborder="0" height="600" id="doc_26799" scrolling="no" src="http://www.scribd.com/embeds/52940583/content?start_page=1&amp;amp;view_mode=list&amp;amp;access_key=key-20dzelamjf3ueuvu52in" width="100%"&gt;&lt;/iframe&gt;&lt;script type="text/javascript"&gt;(function() { var scribd = document.createElement("script"); scribd.type = "text/javascript"; scribd.async = true; scribd.src = "http://www.scribd.com/javascripts/embed_code/inject.js"; var s = document.getElementsByTagName("script")[0]; s.parentNode.insertBefore(scribd, s); })();&lt;/script&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;It's hard to make much from the data aggregated at the daily level. &amp;nbsp;We can see a spike on Nov. 2nd that is likely related to the presidential election, and we can see a lull at the end of December that might correspond to the winter break. &amp;nbsp;If we look at events aggregated on a finer level, on an hourly basis, an more interesting periodicity emerges:&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;iframe class="scribd_iframe_embed" data-aspect-ratio="24.48" data-auto-height="false" frameborder="0" height="400" id="doc_84984" scrolling="no" src="http://www.scribd.com/embeds/52941240/content?start_page=1&amp;amp;view_mode=list&amp;amp;access_key=key-2ck03gduopehukadkbn4" width="100%"&gt;&lt;/iframe&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;Above you can more clearly the periodicity and see things like that election day (November 2nd) was an anomaly (actually you have to zoom way in to see anything, and at that zoom level Scribd does something bad -- click on the download link and examine the original to see clearly). &amp;nbsp;You can also see that there were many calls made late in the night on Halloween (which was on a Saturday night) --- we should expect this given that many of the participants were students. &amp;nbsp;People also seemed to take it easy with phone communication on Thanksgiving morning. &amp;nbsp;On the other hand, I don't see the spike in communication that I would have expected late in the evening on New Year's Eve.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;&lt;b&gt;Download&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;I used &lt;a href="http://dl.dropbox.com/u/1800572/blog/parse_network.py"&gt;this python script&lt;/a&gt; to parse the data. &amp;nbsp;The resulting list of communication events&amp;nbsp;&lt;a href="http://dl.dropbox.com/u/1800572/blog/mit/voice-all.edges"&gt;can be downloaded here&lt;/a&gt; and contains 52,050 calls between 27,059 people, 94 of whom are in the core. &amp;nbsp;It output this file of communication events, and this file which indicates which nodes are in the core and which are in the periphery. &amp;nbsp;The format of the communication events is a bunch of lines with the following form:&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="color: #333333; font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;unix_timestamp;caller_id;receiver_id;duration_in_second;communication_type&lt;/span&gt;&lt;/span&gt;&lt;/blockquote&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;&lt;a href="http://dl.dropbox.com/u/1800572/blog/mit/all.attributes"&gt;This file indicates&lt;/a&gt; which nodes are in the core, and it has the following form:&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; font-family: 'Courier New', Courier, monospace; line-height: 18px;"&gt;node_id;bool&lt;/span&gt;&lt;/span&gt;&lt;/blockquote&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;where bool==1 if the node is in the core and 0 otherwise.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;&lt;b&gt;A couple of notes&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;It occurred to me that in the above plots, I count each call as two communication events (an edge in each direction), so there are the number of calls is actually twice the number it should be.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;There is some uncertainty about the exact time of communication events. &amp;nbsp;The official readme did not &amp;nbsp; specify which representation of time was used. &amp;nbsp;The data was contained in a matlab object, in which time was usually represented in the following form:&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; color: black; font-family: 'courier new', monospace; font-size: 13px; line-height: normal;"&gt;731965.04835648148&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/blockquote&gt;I wasn't sure what this was, but it must be matlab's datenum format. I nearly fell into some dangerous traps as I converted this format into python's DateTime objects (&lt;a href="http://sociograph.blogspot.com/2011/04/how-to-avoid-gotcha-when-converting.html"&gt;I've documented this conversion process here&lt;/a&gt;), and the main remaining uncertainty has to do with the timezone. &amp;nbsp;The data should be in UTC time because that's the timezone that matlab's datenum objects are assumed to use, but I don't think they are. &amp;nbsp;I think they're in Eastern time, and so I've subtracted 5 hours from them. &amp;nbsp;On top of this confusion, the data spans a time in which there was a transition from daylight savings time to standard time, and I don't know what to do about that. &amp;nbsp;I contacted someone who is in charge of the dataset, and although he was generally very helpful about answering questions, to this one he just replie&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;d: "&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;I'd assume Eastern or GMT" which was utterly unhelpful.&lt;/span&gt;&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; border-collapse: collapse; line-height: 18px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8128923945165565689-2080498176952126928?l=sociograph.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sociograph.blogspot.com/feeds/2080498176952126928/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8128923945165565689&amp;postID=2080498176952126928' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/2080498176952126928'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/2080498176952126928'/><link rel='alternate' type='text/html' href='http://sociograph.blogspot.com/2011/04/communication-networks-part-2-mit.html' title='Communication networks part 2: the MIT Reality Mining dataset'/><author><name>Conrad Lee</name><uri>https://profiles.google.com/110979755871396857085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh6.googleusercontent.com/-3Jylvn3UkJI/AAAAAAAAAAI/AAAAAAAAAAA/PM9AqhTcFSE/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-FuMWBmie14I/Taa9VYpSLnI/AAAAAAAAHA4/l0cyqvpcC8g/s72-c/mit-horizontal.png' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8128923945165565689.post-8057940352801321740</id><published>2011-04-12T18:45:00.003+01:00</published><updated>2011-04-17T17:07:26.631+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='time; python; matlab; gregorian; standards'/><title type='text'>Converting MATLAB's datenum to Python's datetime</title><content type='html'>&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-0eYb8NoMnAI/TaSPbOFgzwI/AAAAAAAAHA0/jOHc7bG3vYM/s1600/clocks.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-0eYb8NoMnAI/TaSPbOFgzwI/AAAAAAAAHA0/jOHc7bG3vYM/s320/clocks.jpg" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Image courtesy of &lt;a href="http://www.flickr.com/photos/servus/16117730/sizes/m/in/photostream/"&gt;servus&lt;/a&gt;.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;It's often convenient to represent a date and a time as a single number. &amp;nbsp;This is commonly carried out by choosing some arbitrary point in time as the "epoch," and specifying time as the number of seconds or days from that epoch.&lt;br /&gt;&lt;br /&gt;The most common way to represent time as a number a&lt;a href="http://en.wikipedia.org/wiki/Unix_time"&gt; UNIX timestamp&lt;/a&gt;, which is the number of seconds since midnight on Jan 1st, 1970 in UTC, (not including leap seconds). &amp;nbsp;If you're trying to decide which way to represent time as a number use UNIX timestamps and ignore the rest of this post.&lt;br /&gt;&lt;br /&gt;I was recently confronted with the task of converting times stored in a matlab object into a python datetime object, and very nearly made a big mistake&amp;nbsp;(366 days big, to be exact). &amp;nbsp;The matlab representation of time looks like this:&lt;br /&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse; font-family: 'courier new', monospace; font-size: 14px;"&gt;731965.04835648148&lt;/span&gt;&amp;nbsp;&amp;nbsp;&lt;/blockquote&gt;If you parse this using python's&lt;br /&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;datetime.fromordinal(&lt;span class="Apple-style-span" style="border-collapse: collapse; font-size: 15px;"&gt;731965.04835648148&lt;/span&gt;)&amp;nbsp;&lt;/span&gt;&lt;/blockquote&gt;then the result might look reasonable, but &lt;b&gt;beware: you're off by 366 days! &lt;/b&gt;Matlab's datenum representation is the number of days since midnight on Jan 1st, &lt;b&gt;0 AD.&lt;/b&gt; &amp;nbsp;Python's datetime.fromordinal function assumes time is the number of days since midnight on Jan 1st, &lt;b&gt;1 AD&lt;/b&gt;. &amp;nbsp;So the duration of year 0 separates the time representation returned by python's fromordinal function and the time representation used by matlab. &amp;nbsp;Apparently, the year 0 AD had 366 days, (can anyone confirm this?) so one can convert a matlab datenum to a python datetime as follows:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;from datetime import datetime, timedelta&lt;/span&gt;&lt;br /&gt;&lt;strike&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;python_datetime = datetime.fromordinal(matlab_datenum) - timedelta(days=366)&lt;/span&gt;&lt;/strike&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;python_datetime = datetime.fromordinal(matlab_datenum) + timedelta(days=matlab_datenum%1) -&amp;nbsp;timedelta(days = 366)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Update: In order to achieve precision beyond days, you need to take care of the fractional part of the datenum with the&amp;nbsp;&lt;/b&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&amp;nbsp;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;+ timedelta(days=matlab_datenum%1) &lt;/span&gt;&lt;/span&gt;&lt;b&gt;term.&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;While representations of time are inherently arbitrary. &amp;nbsp;The problem seems to have come up because there are different versions of the Gregorian calendar system.&lt;br /&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: sans-serif; font-size: 13px; line-height: 19px;"&gt;For dates before the year 1, unlike the proleptic Gregorian calendar used in the&amp;nbsp;&lt;a href="http://en.wikipedia.org/wiki/International_standard" style="background-attachment: initial; background-clip: initial; background-color: initial; background-image: none; background-origin: initial; background-position: initial initial; background-repeat: initial initial; color: #0645ad; text-decoration: none;"&gt;international standard&lt;/a&gt;&amp;nbsp;&lt;a href="http://en.wikipedia.org/wiki/ISO_8601" style="background-attachment: initial; background-clip: initial; background-color: initial; background-image: none; background-origin: initial; background-position: initial initial; background-repeat: initial initial; color: #0645ad; text-decoration: none;"&gt;ISO 8601&lt;/a&gt;, the traditional proleptic Gregorian calendar (like the Julian calendar) does not have a&amp;nbsp;&lt;a class="mw-redirect" href="http://en.wikipedia.org/wiki/Year_zero" style="background-attachment: initial; background-clip: initial; background-color: initial; background-image: none; background-origin: initial; background-position: initial initial; background-repeat: initial initial; color: #0645ad; text-decoration: none;" title="Year zero"&gt;year 0&lt;/a&gt;&amp;nbsp;and instead uses the ordinal numbers 1, 2, … both for years AD and BC. Thus the traditional time line is 2 BC, 1 BC, AD 1, and AD 2. ISO 8601 uses&amp;nbsp;&lt;a href="http://en.wikipedia.org/wiki/Astronomical_year_numbering" style="background-attachment: initial; background-clip: initial; background-color: initial; background-image: none; background-origin: initial; background-position: initial initial; background-repeat: initial initial; color: #0645ad; text-decoration: none;"&gt;astronomical year numbering&lt;/a&gt;&amp;nbsp;which includes a year 0 and negative numbers before it. Thus the ISO 8601 time line is -0001, 0000, 0001, and 0002.&lt;/span&gt;&lt;/blockquote&gt;&lt;br /&gt;So python's datetime.fromordinal method assumes that people are using the traditional propleptic Gregorian calendar, whereas matlab assumes people want to use the ISO Gregorian calendar. &amp;nbsp;Why doesn't python stick to the ISO standard here?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8128923945165565689-8057940352801321740?l=sociograph.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sociograph.blogspot.com/feeds/8057940352801321740/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8128923945165565689&amp;postID=8057940352801321740' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/8057940352801321740'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/8057940352801321740'/><link rel='alternate' type='text/html' href='http://sociograph.blogspot.com/2011/04/how-to-avoid-gotcha-when-converting.html' title='Converting MATLAB&apos;s datenum to Python&apos;s datetime'/><author><name>Conrad Lee</name><uri>https://profiles.google.com/110979755871396857085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh6.googleusercontent.com/-3Jylvn3UkJI/AAAAAAAAAAI/AAAAAAAAAAA/PM9AqhTcFSE/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-0eYb8NoMnAI/TaSPbOFgzwI/AAAAAAAAHA0/jOHc7bG3vYM/s72-c/clocks.jpg' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8128923945165565689.post-2947610746415973986</id><published>2011-04-04T22:18:00.003+01:00</published><updated>2011-10-17T11:07:44.738+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='enron; data; temporal; dynamic; communication; parser; email'/><title type='text'>Communication networks part 1: the Enron e-mail dataset</title><content type='html'>This post is half intended to serve as self-documentation. I also just want to share a parsed version of an e-mail communication dataset in the hope that it will help others reproduce my&amp;nbsp;experiments&amp;nbsp;or save them time in their own experiments.&lt;br /&gt;&lt;br /&gt;I'm interested in logs of direct communication (i.e., phone communication, e-mail communication, and not things like message boards or mailing lists). &amp;nbsp;I'm not interested in the content of the communication---in fact, all I want is a file which contains data in the following form:&lt;br /&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;unix_timestamp;sender;receiver&lt;br /&gt;...&lt;/span&gt;&lt;/blockquote&gt;In this post I outline how I processed the Enron dataset into this form, and the filtering steps that I took. &amp;nbsp;In upcoming posts I'll describe how I processed two similar datasets: one on e-mail logs from students at the University of Kiel and one on the call logs of some participants in a study at MIT.&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This is a rather long post with lots of detail, so I'll begin by cutting to the conclusion and summary:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div class="separator" style="clear: both; margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; text-align: left;"&gt;&lt;b&gt;In&amp;nbsp;summary&lt;/b&gt;: The resulting dataset includes 240,947 communication events among 28,062 addresses. &amp;nbsp;The 192 e-mail addresses that belong to the 156 members of the "core" have been disambiguated so that each person in the core is represented by a unique identifier. &amp;nbsp;Events with clearly invalid timestamps were removed. &amp;nbsp;The data was further filtered so that none of the communication events come from e-mail that had more than three recipients, and if a message comes from a address-hour combination such that more than 20 messages were sent by that address in that hour, then the message was excluded.&lt;/div&gt;&lt;div class="separator" style="clear: both; margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; text-align: left;"&gt;&lt;b&gt;UPDATE: &lt;/b&gt;I noticed that I accidentally selected the date range starting from Jan. 1st, 2001 because I said that after this date the vast majority of events take place. &amp;nbsp;Actually the date where the communication really takes off is a year earlier, around Jan. 1st,&lt;i&gt;&amp;nbsp;2000. &amp;nbsp;&lt;/i&gt;I've updated the files below to include all nodes since Jan. 1st, 2000.&lt;/div&gt;&lt;div class="separator" style="clear: both; margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; text-align: left;"&gt;&lt;b&gt;Download:&amp;nbsp;&lt;/b&gt;You can find the parsed list of events&amp;nbsp;&lt;a href="http://dl.dropbox.com/u/1800572/enron/enron-cleaned.edges"&gt;here&lt;/a&gt;&amp;nbsp;(the format is as indicated above). &amp;nbsp;Additionally,&lt;a href="http://dl.dropbox.com/u/1800572/enron/enron-cleaned.attributes"&gt;&amp;nbsp;this file&lt;/a&gt;&amp;nbsp;indicates which indicates which nodes belong to the core and which belong to the periphery. &amp;nbsp;The format is as follows&lt;/div&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;node_id;boolean&lt;/span&gt;&lt;/blockquote&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;where if&amp;nbsp;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;boolean==1&lt;/span&gt;, then the node belongs to the core, and if&amp;nbsp;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;boolean==0&lt;/span&gt;, then the node belongs to the periphery.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;The Enron Dataset: a description and the pre-processing steps I took&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;A good starting point to describe this dataset is the abstract of&lt;a href="http://ceas.cc/2004/168.pdf"&gt; the paper that introduces i&lt;/a&gt;t:&lt;/div&gt;&lt;blockquote&gt;&lt;blockquote&gt;A large set of email messages, the Enron corpus, was made public during the legal&amp;nbsp;investigation concerning the Enron corporation. This dataset, along with a thorough&amp;nbsp;explanation of its origin, is available at http://www-2.cs.cmu.edu/~enron/. This paper&amp;nbsp;provides a brief introduction and analysis of the dataset. The raw Enron corpus contains&amp;nbsp;619,446 messages belonging to 158 users. We cleaned the corpus before this analysis by&amp;nbsp;removing certain folders from each user, such as “discussion_threads”. These folders&amp;nbsp;were present for most users, and did not appear to be used directly by the users, but rather&amp;nbsp;were computer generated. Many, such as “all_documents”, also contained large numbers&amp;nbsp;of duplicate emails, which were already present in the users’ other folders...&lt;/blockquote&gt;&lt;/blockquote&gt;&lt;div&gt;This dataset is unique in that privacy seems to be completely disregarded. &amp;nbsp;The full text of e-mails is there for all to see, even though many of the e-mail are quite personal. &amp;nbsp;As far as I know, real names and e-mail addresses are used.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;After reading the literature on this dataset, I was initially overwhelmed by it. &amp;nbsp;The dataset includes a lot of "noise" created by the fact that some e-mail are merely automated reports, some people have many e-mail addresses, many e-mail messages are duplicated in various ways, e.g. by people putting the same e-mail in many folders. &amp;nbsp;A couple groups of researchers such as a group at &lt;a href="http://www.cs.cmu.edu/~enron/"&gt;C&lt;/a&gt;&lt;a href="http://www.cs.cmu.edu/~enron/"&gt;arnegie Mellon University&lt;/a&gt;&amp;nbsp;and &lt;a href="http://bailando.sims.berkeley.edu/enron_email.html"&gt;another at Berkeley&lt;/a&gt;&amp;nbsp;have come up with cleaned, yet conflicting versions of the dataset. &amp;nbsp;In the end I found the data in a nice format at &lt;a href="http://www.infochimps.com/datasets/enron-email-data-with-manager-subordinate-relationship-metadata"&gt;infochimps&lt;/a&gt;&amp;nbsp;-- this version is based on the Berkeley version because according to the infochimps dataset description,&amp;nbsp;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;"a&lt;span class="Apple-style-span" style="color: #222222; line-height: 12px;"&gt;fter inspecting both [the CMU and Berkeley versions], it became clear that the UC Berkeley dataset contains more dimensions of the original data&lt;/span&gt;".&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It's important to keep in mind that although this dataset contains information on the e-mail communication of over 87,000 e-mail addresses, the "core" of the network contains only 156 people (the abstract above says 158, but the version on infochimps is slightly different). &amp;nbsp;In other words, we have complete information on the e-mail communication of only the 156 people in the core. &amp;nbsp;For everyone else, we about only those e-mail that were also sent to one of the core members. &amp;nbsp;So an important first step is to clearly fish out of the data which nodes belong to the core and which belong to the periphery. &amp;nbsp;This is basic, but analysis in some papers, e.g.,&lt;a href="http://arxiv.org/abs/cond-mat/0410313"&gt; this one&lt;/a&gt; which I ran into yesterday, ignore this crucial distinction and report statistics on the aggregation of the core and the periphery, when clearly its more interesting to know these statistics about only those pairs of nodes in the core.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The second step I performed was straightforward: I removed events with invalid timestamps. &amp;nbsp;A small proportion of e-mail all have one of a few different dates in the 1970s and 1980s. &amp;nbsp; These times were easy to identify because a lot of messages were all reported as being sent within a single second at a few moments in these decades. &amp;nbsp;The first timestamp after the mid eighties occurs at on Jan 1st, 1997. &amp;nbsp;I filter out all events that take place before this date. Furthermore, the last few events take place in the year 2020---these are clearly mistakes and so I've filtered out all events after the end of 2005.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I carried out additional pre-processing with the intention of&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;removing automated e-mail notifications&lt;/li&gt;&lt;li&gt;removing messages that are sent out to too many people --- I want avoid including impersonal messages sent out to many people and focus on direct communication.&lt;/li&gt;&lt;li&gt;Keeping track of who is in the "core" of the dataset and who is in the periphery&lt;/li&gt;&lt;li&gt;Making sure that the nodes in the network represent people and not e-mail addresses (because some can have many e-mail addresses---we could carry out this step on only the 156 people in the "core.")&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;div&gt;The third and fourth steps were straightforward to carry out. &amp;nbsp;The first and second require answering two questions: (1) how many recipients can an e-mail have before we label it as an uninteresting "broadcast" type of communication that should be left out? (2) How can we detect automated e-mails? &amp;nbsp;To answer the first question, it's helpful to look at the following plot&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-GcS9EMsNiR0/TZnnoZg-sVI/AAAAAAAAHAQ/PqWrTYQ7t84/s1600/enron-preprocessing.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="278" src="http://2.bp.blogspot.com/-GcS9EMsNiR0/TZnnoZg-sVI/AAAAAAAAHAQ/PqWrTYQ7t84/s320/enron-preprocessing.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;When interpreting the plot above, it's important to keep the following point in mind: each e-mail message turns into several e-mail events if it is sent to multiple recipients. &amp;nbsp;So although only a minority of e-mails have more than 3 recipients, if we include them in the dataset these will have a disproportionate weight because exactly these messages create a large number of events. &amp;nbsp;For example, if we include only those e-mails with 3 or fewer recipients, then then the dataset includes 242,000 email events, whereas if we include all e-mails, then the dataset contains 1,387,139 email events. &amp;nbsp;In the end I choose to include only those messages with 3 or fewer recipients. &amp;nbsp;I'll be the first to say that this is a bit arbitrary. &amp;nbsp;But it seems like a good balance between filtering out impersonal e-mails and trying to include as much data as possible on communication. &amp;nbsp;Also note that I decided to include all recipients when counting how many people have received an e-mail (i.e., direct recipients, cc'd recipients, and bcc'd recipients).&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;I assume that if an email address sends out more than &lt;i&gt;max_sent_per_hour&lt;/i&gt;&amp;nbsp;e-mails in one hour, then it's likely that some program is behind these messages and so I filter them out. &amp;nbsp;It's not clear what value to use for &lt;i&gt;max_sent_per_hour&lt;/i&gt;, but the following plot provides some feeling for this decision (note the log-scale in the y-axis):&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-F1gcdhVM_o8/TZouYEMV62I/AAAAAAAAHAk/17De35Agj4g/s1600/sent_per_hour_hist.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="281" src="http://2.bp.blogspot.com/-F1gcdhVM_o8/TZouYEMV62I/AAAAAAAAHAk/17De35Agj4g/s400/sent_per_hour_hist.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;Based on the chart, there seem to be very few mass-emails sent around (probably partly due to the pre-processing we did in the last step with multiple recipients). It seems reasonable to set &lt;i&gt;max_sent_per_hour &lt;/i&gt;to 20, a value which filters out 0.28% of events. &amp;nbsp;Of course this particular value is quite arbitrary---it could just as easily be 15 (which would filter out 0.73% of events) or 40 (0.04%), but at least the chart above indicates that the resulting dataset would not vary widely based on this choice. &amp;nbsp;Again, the idea here is that if an address sends out more than 20 messages in one hour, then there's a relatively high probability that a program is behind those messages, and a relatively low probability that a person is behind them.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;Here's a plot of the frequency of email over time:&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-MaOEZbM6wKw/TZ3ELIJM3VI/AAAAAAAAHAo/lqC9JI45MWU/s1600/enron-all-cleaned-contiguous-over-time.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="114" src="http://1.bp.blogspot.com/-MaOEZbM6wKw/TZ3ELIJM3VI/AAAAAAAAHAo/lqC9JI45MWU/s640/enron-all-cleaned-contiguous-over-time.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;There are some interesting features in the plot, such as the peak in December before the holidays, and the trough during the holidays (visible in December 2001 and 2002). &amp;nbsp;There is a similar peak/trough in June 2001. The plot also suggests that although there are some emails starting at the beginning of 1997 and these continue until Jan 2006, the vast majority of the data is between Jan &lt;strike&gt;2001&lt;/strike&gt; 2000&amp;nbsp;and April 2002. &amp;nbsp;For this reason, anyone looking at the dynamics of this network may want to focus on only this range.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;a href="http://dl.dropbox.com/u/1800572/enron/enron-cleaned.edges"&gt;http://dl.dropbox.com/u/1800572/enron/enron-cleaned.edges&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8128923945165565689-2947610746415973986?l=sociograph.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sociograph.blogspot.com/feeds/2947610746415973986/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8128923945165565689&amp;postID=2947610746415973986' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/2947610746415973986'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/2947610746415973986'/><link rel='alternate' type='text/html' href='http://sociograph.blogspot.com/2011/04/communication-networks-part-1-enron-e.html' title='Communication networks part 1: the Enron e-mail dataset'/><author><name>Conrad Lee</name><uri>https://profiles.google.com/110979755871396857085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh6.googleusercontent.com/-3Jylvn3UkJI/AAAAAAAAAAI/AAAAAAAAAAA/PM9AqhTcFSE/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-GcS9EMsNiR0/TZnnoZg-sVI/AAAAAAAAHAQ/PqWrTYQ7t84/s72-c/enron-preprocessing.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8128923945165565689.post-4570428272682130120</id><published>2011-03-05T17:49:00.003Z</published><updated>2012-01-03T11:18:24.756Z</updated><title type='text'>Facebook100 data and a parser for it</title><content type='html'>&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="https://lh3.googleusercontent.com/--FUT1llNq-s/TXJ3WBfwydI/AAAAAAAAG3Y/y1Kit50niQU/s1600/gephi-small.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="240" src="https://lh3.googleusercontent.com/--FUT1llNq-s/TXJ3WBfwydI/AAAAAAAAG3Y/y1Kit50niQU/s320/gephi-small.png" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Caltech's Facebook network, part of the Facebook100 dataset&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;A few weeks ago, &lt;a href="http://masonporter.blogspot.com/2011/02/facebook100-data-set.html"&gt;Mason Porter posted a goldmine of data&lt;/a&gt;, the Facebook100 dataset.&amp;nbsp; The dataset contains all of the Facebook friendships at 100 US universities at some time in 2005, as well as a number of node attributes such as dorm, gender, graduation year, and academic major.&amp;nbsp; The data was apparently provided directly by Facebook.&lt;br /&gt;&lt;br /&gt;As far as I know, the dataset is unprecedented and has the potential advance both network methods and insights into the structure of acquaintanceship.&amp;nbsp; Unfortunately, the Facebook Data Team requested that Porter no longer distribute the dataset. It does not include the names of individual or even of any of the node attributes (they have been given integer ids), but Facebook seems to be concerned. Anonymized network data is after all vulnerable to de-anonymization (for some nice examples of why, see the last 20 minutes of &lt;a href="http://videolectures.net/kdd07_kleinberg_cisnd/"&gt;this video lecture&lt;/a&gt; from Jon Kleinberg).&lt;br /&gt;&lt;br /&gt;It's a shame that Porter can no longer distribute the data.&amp;nbsp; On the other hand, once a dataset like that has been released, will the internet be able to forget it?&amp;nbsp; After a bit of poking around I found the dataset as a torrent file.&amp;nbsp; In fact, if anyone is seeding the torrent, you can download it by following &lt;a href="http://www.monova.org/torrent/4266013/facebook100_zip.html"&gt;this link&lt;/a&gt; &lt;strike&gt;and it appears to be on &lt;a href="http://rapidshare.com/files/451188192/facebook100.zip"&gt;rapidshare&lt;/a&gt;&lt;/strike&gt;.&lt;br /&gt;&lt;br /&gt;As Gephi's Sebastien Heymann noted, the dataset is a bit awkward to use if you don't have matlab.&amp;nbsp; I've written a parser for it in python (which you can download &lt;a href="http://dl.dropbox.com/u/1800572/parse_facebook100.py"&gt;here&lt;/a&gt;) that converts the 100 datasets into gml, graphml, and a simple edgelist-based format.&amp;nbsp; To use it you will need to have scipy and networkx installed (both of these are python libraries).&amp;nbsp; Then just put the script in the same folder as all those matlab files, and from that directory run the command&lt;br /&gt;&lt;blockquote&gt;&lt;span style="font-family: 'Courier New', Courier, monospace;"&gt;python parse_facebook100.py&lt;/span&gt;&lt;/blockquote&gt;That will create a subdirectory called output, which will be populated with one folder for each university dataset.&amp;nbsp; In each university's subfolder you'll find the graphml, gml, and edgelist formats. The output takes up 4.4GB.&lt;br /&gt;&lt;br /&gt;If you're interested to see some visualizations of these networks, check out &lt;a href="http://sociograph.blogspot.com/2011/02/visualizing-large-facebook-friendship.html"&gt;my earlier post on the topic&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8128923945165565689-4570428272682130120?l=sociograph.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sociograph.blogspot.com/feeds/4570428272682130120/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8128923945165565689&amp;postID=4570428272682130120' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/4570428272682130120'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/4570428272682130120'/><link rel='alternate' type='text/html' href='http://sociograph.blogspot.com/2011/03/facebook100-data-and-parser-for-it.html' title='Facebook100 data and a parser for it'/><author><name>Conrad Lee</name><uri>https://profiles.google.com/110979755871396857085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh6.googleusercontent.com/-3Jylvn3UkJI/AAAAAAAAAAI/AAAAAAAAAAA/PM9AqhTcFSE/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='https://lh3.googleusercontent.com/--FUT1llNq-s/TXJ3WBfwydI/AAAAAAAAG3Y/y1Kit50niQU/s72-c/gephi-small.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8128923945165565689.post-5975476127663967777</id><published>2011-03-05T14:32:00.000Z</published><updated>2011-03-05T14:32:17.747Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='temporal; dynamic; vector clocks; time sliced; networks'/><title type='text'>Problems with time-sliced network representations</title><content type='html'>Lately I've been analyzing event-based network data. &amp;nbsp;Whereas most network analysis looks at a static snapshot of a network, or a series of snapshots over time (panel data), in event-based data, each edge has a time-stamp which indicates when the interaction took place. &lt;br /&gt;&lt;br /&gt;Communication is naturally represented by such a network, and if you're interested in how information could flow via communication, then network measures would ideally incorporate this rich temporal information.&lt;br /&gt;&lt;br /&gt;In this post I'll briefly review some disadvantages related to static representations of this kind of data. &amp;nbsp;Note that many of the so-called dynamic network analysis methods also suffer from these disadvantages, because they are based on, for example, splitting a network into non-overlapping snapshots. &amp;nbsp;Such "time-slicing" &amp;nbsp;flattens the temporal information of the network--I'll now explain why.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Why static representations of networks are lossy&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;For simplicity, let's consider a network with only three nodes. &amp;nbsp;Let's assume that we know which nodes communicated over the course of a week, and that there were just two communication events. &amp;nbsp;If we create the usual static network, it looks like this:&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="https://lh5.googleusercontent.com/-DByJ7WlnW1A/TXIzXVVp-_I/AAAAAAAAG2w/kJuDwPRqmM0/s1600/edge123.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="33" src="https://lh5.googleusercontent.com/-DByJ7WlnW1A/TXIzXVVp-_I/AAAAAAAAG2w/kJuDwPRqmM0/s320/edge123.gif" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Static (flattened) representation of communication&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;Why is this representation lossy? &amp;nbsp;Presented with the image above, it seems like node 1 could have indirectly received information about node 3, and vice versa. &amp;nbsp;However, if we assume that the communication events did not take place at the same time, then it's impossible for both 1 to know about 3 and for 3 to know about 1. Instead, one of the following situations must be the case:&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="https://lh4.googleusercontent.com/-NRzlYsWywuk/TXIznq1_EUI/AAAAAAAAG20/L2uCvwPLPpM/s1600/forward_flow.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" src="https://lh4.googleusercontent.com/-NRzlYsWywuk/TXIznq1_EUI/AAAAAAAAG20/L2uCvwPLPpM/s1600/forward_flow.gif" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Either node 3 has indirect information on node 1, &lt;i&gt;or&lt;/i&gt; (XOR, at that!)&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="https://lh6.googleusercontent.com/-_FTptDZa0y4/TXIzondxiYI/AAAAAAAAG24/qOY9tB_jTrE/s1600/backward_flow.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" src="https://lh6.googleusercontent.com/-_FTptDZa0y4/TXIzondxiYI/AAAAAAAAG24/qOY9tB_jTrE/s1600/backward_flow.gif" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;node 1 has indirect information on node 3.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;If one is interested in, say, viral spread or the flow of information, then one's representation of the graph would ideally pick up on even more temporal restrictions than the one just outlined. &amp;nbsp;For example, consider the following situation&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="https://lh4.googleusercontent.com/-tmPQIK8IaA8/TXJI6Th0gDI/AAAAAAAAG3Q/fhdmcQQTlzM/s1600/animation.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="https://lh4.googleusercontent.com/-tmPQIK8IaA8/TXJI6Th0gDI/AAAAAAAAG3Q/fhdmcQQTlzM/s1600/animation.gif" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;There are two propagation patterns that repeat in the above image: information either flows along the top or along the bottom. &amp;nbsp;If the nodes propagate only recent information, then it could be the case that the green node never gets information from the red node. &amp;nbsp;If we look at the static representation of this situation, that's not clear.&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="https://lh3.googleusercontent.com/-fd9HLULAw5I/TXJJTNqQfYI/AAAAAAAAG3U/auwlwttEz6Q/s1600/all-small.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="https://lh3.googleusercontent.com/-fd9HLULAw5I/TXJJTNqQfYI/AAAAAAAAG3U/auwlwttEz6Q/s1600/all-small.gif" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;One might argue that I'm just picking out pathological situations that are not grounded in any particular setting, but I'm just trying to keep things simple. &amp;nbsp;There are plenty of settings where retaining such dynamic patterns may be essential. For example, if one wanted to detect communities in a large mobile phone network, then it is likely the case that communities are not all active at the same time. &amp;nbsp;Some will be active in evenings, others during the day, others on weekends, and others only on special holidays. By time-slicing the data you flatten out this temporal information, and it looks like all communities are active at the same time, making the task of community detection more difficult.&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;So how should event-based networks be represented?&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;I'm not sure. But given problems like those that I've outlined above, I have an aversion to the time-sliced approach (which happens to currently be the most popular approach).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Vector clocks are a potentially powerful representation of such event-based networks that does not suffer from the first problem that I outlined above. &amp;nbsp;I became aware of vector clocks via an excellent 2008 KDD paper &lt;i&gt;&lt;a href="http://portal.acm.org/citation.cfm?id=1401945"&gt;The Structure of Information Pathways in a Social Communication Network&lt;/a&gt;&lt;/i&gt;&amp;nbsp;by&amp;nbsp;G. Kossinets, J. Kleinberg, D. Watts (available for free on &lt;a href="http://www.cs.cornell.edu/home/kleinber/"&gt;Jon Kleinberg's page,&lt;/a&gt;&amp;nbsp;there's also a nice &lt;a href="http://videolectures.net/kdd08_kleinberg_sipscn/"&gt;video lecture in which Kleinberg presents the paper&lt;/a&gt;). &amp;nbsp;From the &lt;a href="http://en.wikipedia.org/wiki/Vector_cl"&gt;wikipedia article&lt;/a&gt;, it seems that vector clocks were developed to detect causality violations in distributed systems. &amp;nbsp; They don't seem like a plausible way to represent social networks because they implicitly assume that when people communicate they have unlimited bandwidth. &amp;nbsp;Nonetheless, the results of that KDD paper are fascinating and the whole framework could possible be modified to better reflect how communication takes place among humans.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8128923945165565689-5975476127663967777?l=sociograph.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sociograph.blogspot.com/feeds/5975476127663967777/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8128923945165565689&amp;postID=5975476127663967777' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/5975476127663967777'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/5975476127663967777'/><link rel='alternate' type='text/html' href='http://sociograph.blogspot.com/2011/03/problems-with-time-sliced-network.html' title='Problems with time-sliced network representations'/><author><name>Conrad Lee</name><uri>https://profiles.google.com/110979755871396857085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh6.googleusercontent.com/-3Jylvn3UkJI/AAAAAAAAAAI/AAAAAAAAAAA/PM9AqhTcFSE/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='https://lh5.googleusercontent.com/-DByJ7WlnW1A/TXIzXVVp-_I/AAAAAAAAG2w/kJuDwPRqmM0/s72-c/edge123.gif' height='72' width='72'/><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8128923945165565689.post-4826849504660160367</id><published>2011-02-10T14:57:00.004Z</published><updated>2011-02-14T17:52:17.498Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='centrality'/><category scheme='http://www.blogger.com/atom/ns#' term='clustering'/><category scheme='http://www.blogger.com/atom/ns#' term='irish'/><category scheme='http://www.blogger.com/atom/ns#' term='networks'/><category scheme='http://www.blogger.com/atom/ns#' term='ireland'/><category scheme='http://www.blogger.com/atom/ns#' term='blogosphere'/><title type='text'>The Irish Blogosphere</title><content type='html'>As part of an interdisciplinary project, a few of us here in &lt;a href="http://cliquecluster.org/"&gt;Clique&lt;/a&gt;&amp;nbsp;have recently mapped out the Irish blogosphere. &amp;nbsp;Our goal was to provide a group of cultural studies scholars with a representative sample of Irish blogs. &amp;nbsp;Along the way we uncovered some results that might be interesting for active Irish bloggers. &amp;nbsp;Before presenting the results, I should give credit where it's due: Karen Wade motivated this whole project, Derek Greene is responsible for scraping the data and the document clustering, and Daniel Archambault is responsible for much of the visualization work. &amp;nbsp;Prof. Pádraig Cunnignham, my supervisor, oversaw and advised the project. &amp;nbsp;Additionally, this work was supported by Science Foundation Ireland grant No. 08/SRC/I140 (Clique: Graph and Network Analysis Cluster). &amp;nbsp;The full technical report is available &lt;a href="http://www.csi.ucd.ie/content/identifying-representative-textual-sources-blog-networks"&gt;here&lt;/a&gt;. &lt;b&gt;&amp;nbsp;The data we collected is available &lt;a href="http://mlg.ucd.ie/blogs"&gt;here&lt;/a&gt;.&lt;/b&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The following results are based on a scrape of about 180,000 posts from 600 blogs which were prominent and clearly Irish. &amp;nbsp;To collect data, we started with a "seed" set of blogs collected by looking at winners of the &lt;a href="http://awards.ie/blogawards/"&gt;Irish Blog Awards&lt;/a&gt;. &amp;nbsp;We then performed a "snowball crawl" by extracting all blogs on the blogrolls of the seed, adding them to the seed, and iterating this process twice. &amp;nbsp;Finally, we filtered out all blogs with fewer than two incoming blogroll links. &amp;nbsp;We manually filtered out blogs that are dead or not Irish. &amp;nbsp;We argue that the resulting data includes nearly all active Irish blogs with moderate readership, but it will in some cases exclude some blogs that have little following. &amp;nbsp;For each blog we scraped the most recent 1000 posts.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;The lay of the land: an overview&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;As a starting point, we'll take a look at the overall blogroll-link structure of the blogosphere:&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-Y5giPviqR7c/TVObmuOMdQI/AAAAAAAAG1g/KDNJPBCgi3I/s1600/full_degree_annotated.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="281" src="http://3.bp.blogspot.com/-Y5giPviqR7c/TVObmuOMdQI/AAAAAAAAG1g/KDNJPBCgi3I/s400/full_degree_annotated.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;In the above image, links indicate whether one blog has listed the other on its blogroll. &amp;nbsp;The larger nodes have more incoming links. &amp;nbsp;While the image includes too many nodes to list their individual names without making a mess, we can see in this overview that there is a densely connected "core" and a number of outlying clusters. &amp;nbsp;Further analysis (the text clustering, explained below) revealed that some of these clusters correspond to the topics, which we've sketched in.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;Can you guess who the big dot in the middle might be? &amp;nbsp;This will soon be clear.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;b&gt;The most prominent bloggers' blogs&lt;/b&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-zL4aGNX4-RM/TVOwgPsv6XI/AAAAAAAAG1s/TP7wzd4pPEM/s1600/post-over-45-crop.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="280" src="http://3.bp.blogspot.com/-zL4aGNX4-RM/TVOwgPsv6XI/AAAAAAAAG1s/TP7wzd4pPEM/s320/post-over-45-crop.png" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;All hail King &lt;a href="http://www.mulley.net/"&gt;Mulley&lt;/a&gt;!&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;The image above shows one version of the top Irish blogs, sized by &lt;a href="http://en.wikipedia.org/wiki/PageRank"&gt;PageRank&lt;/a&gt; in our Irish blogosphere dataset (rather than the PageRank in this subgraph). &amp;nbsp;Only those blogs that were linked to more than 45 times in the posts of other Irish blogs make the cut for this image. &amp;nbsp;The&amp;nbsp;thickness&amp;nbsp;of the edges indicate how often these blogs link to one another.&lt;br /&gt;&lt;br /&gt;We can now scientifically verify that claim that &lt;a href="http://www.mulley.net/"&gt;Damien Mulley&lt;/a&gt;&amp;nbsp;is the high king of the Irish blogosphere. &amp;nbsp;Although PageRank is just one of many possible network centrality measures, we also tried others and the results were similar: &lt;a href="http://mulley.net/"&gt;mulley.net&lt;/a&gt; was on top. &amp;nbsp;Other top-ranked blogs include &lt;a href="http://twentymajor.net/"&gt;twentymajor&lt;/a&gt;, &lt;a href="http://www.nialler9.com/"&gt;nialler9&lt;/a&gt;, &lt;a href="http://www.mamanpoulet.com/"&gt;mamanpoulet&lt;/a&gt;, &lt;a href="http://cedarlounge.wordpress.com/"&gt;cedarlounge&lt;/a&gt;, and &lt;a href="http://sluggerotoole.com/"&gt;sluggerotoole&lt;/a&gt; and all the others in that image.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Territories in the Kingdom&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;Many of the top blogs come from the biggest cluster in the blogosphere, but there are more specialized topics, and each has its local celebrities. &amp;nbsp;To discover these clusters and their topics, we applied a document clustering algorithm, which puts blogs with similar vocabularies into clusters. &amp;nbsp;The algorithm returned 12 clusters (which we visualize using &lt;a href="http://www.wordle.net/"&gt;Wordle&lt;/a&gt;):&lt;br /&gt;&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-cMs102Drf-w/TVPp3TmnNyI/AAAAAAAAG14/8fuqu1c38Q0/s1600/music_words.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="77" src="http://4.bp.blogspot.com/-cMs102Drf-w/TVPp3TmnNyI/AAAAAAAAG14/8fuqu1c38Q0/s320/music_words.png" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Music&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-fFBsFPb26OQ/TVPp4SWc4RI/AAAAAAAAG2A/lUaQO4my-K4/s1600/photo_words.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="85" src="http://4.bp.blogspot.com/-fFBsFPb26OQ/TVPp4SWc4RI/AAAAAAAAG2A/lUaQO4my-K4/s320/photo_words.png" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Photography&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-TV_nHfC1qrc/TVPp4lNB9EI/AAAAAAAAG2E/rkJ_85C466Y/s1600/social_words.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="90" src="http://3.bp.blogspot.com/-TV_nHfC1qrc/TVPp4lNB9EI/AAAAAAAAG2E/rkJ_85C466Y/s320/social_words.png" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Technology&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-uMlexM8OCG0/TVPp44PQ1GI/AAAAAAAAG2I/yzmjp4gRUqk/s1600/sport_words.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="90" src="http://2.bp.blogspot.com/-uMlexM8OCG0/TVPp44PQ1GI/AAAAAAAAG2I/yzmjp4gRUqk/s320/sport_words.png" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Sports&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-l2yiODxlV9I/TVPp35vlXEI/AAAAAAAAG18/-KN46Gsyrh0/s1600/movie_words.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="82" src="http://4.bp.blogspot.com/-l2yiODxlV9I/TVPp35vlXEI/AAAAAAAAG18/-KN46Gsyrh0/s320/movie_words.png" style="cursor: move;" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Film (+ law and writing)&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-e6g-nnm4OQw/TVPp5SYk0GI/AAAAAAAAG2M/UcAmqeKWCPA/s1600/wine_words.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="102" src="http://1.bp.blogspot.com/-e6g-nnm4OQw/TVPp5SYk0GI/AAAAAAAAG2M/UcAmqeKWCPA/s320/wine_words.png" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Wine&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-zY267qsT7SI/TVPp7RmsmsI/AAAAAAAAG2g/zOO9HvpDmbk/s1600/food_words.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="113" src="http://1.bp.blogspot.com/-zY267qsT7SI/TVPp7RmsmsI/AAAAAAAAG2g/zOO9HvpDmbk/s320/food_words.png" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Food&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-la41deSWrjY/TVPp598SYbI/AAAAAAAAG2Q/B1Arcao1xNI/s1600/beauty_words.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="106" src="http://1.bp.blogspot.com/-la41deSWrjY/TVPp598SYbI/AAAAAAAAG2Q/B1Arcao1xNI/s320/beauty_words.png" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Beauty&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-XQMBo3mv44g/TVPp65k1pyI/AAAAAAAAG2c/hT1v2tAXum4/s1600/fashion_words.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="93" src="http://1.bp.blogspot.com/-XQMBo3mv44g/TVPp65k1pyI/AAAAAAAAG2c/hT1v2tAXum4/s320/fashion_words.png" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Fashion&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-qjhcsxnG6fI/TVPp6hV9-oI/AAAAAAAAG2Y/HH6aWsuMJvs/s1600/emo_words.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="77" src="http://3.bp.blogspot.com/-qjhcsxnG6fI/TVPp6hV9-oI/AAAAAAAAG2Y/HH6aWsuMJvs/s320/emo_words.png" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Personal&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-ZIOIfLAWrqs/TVPp6DJbC0I/AAAAAAAAG2U/A8aMDbuthSM/s1600/discussion_words.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="104" src="http://2.bp.blogspot.com/-ZIOIfLAWrqs/TVPp6DJbC0I/AAAAAAAAG2U/A8aMDbuthSM/s320/discussion_words.png" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Politics, general discussion&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-4qCOQwbi3M4/TVPp7tqxiGI/AAAAAAAAG2k/pKw8vrmgjLM/s1600/irish_words.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="89" src="http://2.bp.blogspot.com/-4qCOQwbi3M4/TVPp7tqxiGI/AAAAAAAAG2k/pKw8vrmgjLM/s320/irish_words.png" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Gaelic&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;The blogs on the right side of these tag clouds are examples of members of the clusters. &amp;nbsp;The size here does not indicate popularity, but rather membership strength. &amp;nbsp;In other words, blogs which use a vocabulary that is very representative of the cluster are the largest. &amp;nbsp;It's interesting that the method was able to distinguish a food from wine and beauty from the fashion, indicating that these topics are associated with distinct vocabularies.&lt;br /&gt;&lt;br /&gt;The clustering method was nice in that it doesn't force all blogs to belong to exactly one cluster---some belong to no clusters, and others belong to more than one. &amp;nbsp;In the following image we used the overlapping membership of the clusters to reveal the relations between them. &amp;nbsp;You should be able to see overlap of clusters by areas where two types of cross-hatching interfere with each other.&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-SQSj2qAHmi8/TVPvnmaDUZI/AAAAAAAAG2o/lPj5Sk-btxg/s1600/AuBAndUnion20.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="305" src="http://1.bp.blogspot.com/-SQSj2qAHmi8/TVPvnmaDUZI/AAAAAAAAG2o/lPj5Sk-btxg/s320/AuBAndUnion20.png" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;The main overlapping clusters of the Irish blogosphere.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;Some measures based on text similarity and &lt;a href="http://del.icio.us/"&gt;del.icio.us&lt;/a&gt; tags revealed that the largest community, "Discussion," is not very cohesive. &amp;nbsp;When we applied a community finding algorithm on the subgraph induced by this cluster, we detected sub-communities, with two political sub-communities (one of which had a greater emphasis on Northern Ireland than the other), one academic sub-community with a focus on law, one on movies, and one on humor/satire. &amp;nbsp;There was also a small but clear cluster of three Sinn Féin (&lt;a href="http://leargas.blogspot.com/"&gt;leargas.blogspot.com&lt;/a&gt;, &lt;a href="http://daithimckay.blogspot.com/"&gt;daithimckay.blogspot.com&lt;/a&gt;, &lt;a href="http://ograshinnfein.blogspot.com/"&gt;ograshinnfein.blogspot.com&lt;/a&gt;). &amp;nbsp;Otherwise, I was surprised to find that the political sub-communities did not part along party lines, in sharp contrast to the state of the US political blogosphere:&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-m12a5fz1NmE/TVP1hRkN1qI/AAAAAAAAG2s/sev57wE7F9c/s1600/political_blogosphere.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="211" src="http://3.bp.blogspot.com/-m12a5fz1NmE/TVP1hRkN1qI/AAAAAAAAG2s/sev57wE7F9c/s320/political_blogosphere.png" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;The US political blogosphere is clearly divided on party lines, while the Irish political blogosphere (not pictured) is not. &amp;nbsp;Image from &lt;a href="http://portal.acm.org/citation.cfm?id=1134271.1134277"&gt;a paper by Adamic and Glance&lt;/a&gt;.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;Note that the methods we used are not in all cases sensitive to very small communities, or to communities that have no specialized vocabulary. So not all small groups of personal blogs were detected using the document clustering, for example.&lt;br /&gt;&lt;br /&gt;Those are some of the more interesting aspects of the study. &amp;nbsp;We also had one interesting negative finding, for those who are interested in network analysis, especially in the detection of network communities. &amp;nbsp;While the document clustering (and network community detection within the largest document cluster) produced nice results, when we ran community detection algorithms on the two global versions of the blogosphere graph, the results were disappointing. &amp;nbsp;We ran three community detection algorithms (InfoMap, GCE, and MOSES) on the two versions of the blogosphere that we came up with (one based on blogroll links, the other based on links between blog posts). &amp;nbsp;The results varied widely, both when we used the same algorithm on the two different versions of the blogosphere, and when we used two different algorithms on the same data. &amp;nbsp;Thus, we were left without a clear picture of the community structure based on network alone.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8128923945165565689-4826849504660160367?l=sociograph.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sociograph.blogspot.com/feeds/4826849504660160367/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8128923945165565689&amp;postID=4826849504660160367' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/4826849504660160367'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/4826849504660160367'/><link rel='alternate' type='text/html' href='http://sociograph.blogspot.com/2011/02/irish-blogosphere.html' title='The Irish Blogosphere'/><author><name>Conrad Lee</name><uri>https://profiles.google.com/110979755871396857085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh6.googleusercontent.com/-3Jylvn3UkJI/AAAAAAAAAAI/AAAAAAAAAAA/PM9AqhTcFSE/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-Y5giPviqR7c/TVObmuOMdQI/AAAAAAAAG1g/KDNJPBCgi3I/s72-c/full_degree_annotated.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8128923945165565689.post-3622705185485634223</id><published>2011-02-05T12:07:00.006Z</published><updated>2011-03-05T20:06:12.198Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='visualization'/><category scheme='http://www.blogger.com/atom/ns#' term='facebook'/><category scheme='http://www.blogger.com/atom/ns#' term='networks'/><title type='text'>Visualizing Large Facebook Friendship Networks</title><content type='html'>&lt;div style="text-align: left;"&gt;&lt;span class="Apple-style-span"&gt;A few years ago, Traud, Kelsic, Mucha, and Porter posted a paper on &lt;a href="http://arxiv.org/abs/0809.0690"&gt;arXiv&lt;/a&gt;, (&lt;a href="http://masonporter.blogspot.com/2011/02/facebook-visualization-challenge.html"&gt;the most recent version&lt;/a&gt; has been accepted to &lt;i&gt;SIAM Review&lt;/i&gt;) in which they analyze five Facebook networks.  In a bold (and in my opinion commendable) move, they also released the potentially sensitive data on which their work was based.  The dataset consists of five collegiate Facebook friendship networks (it's rumored that the authors had access to 100). The dataset also includes information on gender, high-school, dorm, academic major, and some other attributes.  Note that names of users have been stripped away, and attributes have been assigned numeric values to help protect identities, but it's a potentially risky move because anonymization of network data is an unsolved problem.  The dataset has come to be known as the "Facebook5," and is available for download &lt;strike&gt;here&lt;/strike&gt;.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span"&gt;&lt;strike&gt; &lt;/strike&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span class="Apple-style-span"&gt;&lt;b&gt;[Update&lt;/b&gt;: Mason Porter confirmed the rumor of the 100 collegiate networks, and he and his colleagues have&lt;strike&gt; &lt;a href="http://masonporter.blogspot.com/2011/02/facebook100-data-set.html"&gt;released the data&lt;/a&gt;&lt;/strike&gt;.&lt;b&gt;]&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span"&gt;&lt;b&gt;[Further update: &lt;/b&gt;At the request of the Facebook Data team, Mason Porter has taken down the data --- both the Facebook5 and the Facebook100&lt;b&gt;. &lt;/b&gt;I hope you downloaded it before it disappeared.&lt;b&gt;]&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span"&gt;&lt;b&gt;[Another update: &lt;/b&gt;The Facebook100 dataset is still kicking around the internet; for details, see &lt;a href="http://sociograph.blogspot.com/2011/03/facebook100-data-and-parser-for-it.html"&gt;this post&lt;/a&gt;.&lt;b&gt;] &lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span class="Apple-style-span"&gt;Facebook networks like these are notoriously hard to visualize, largely because people have so many friends, and these friends belong to distinct groups.  As Sune Lehmann &lt;a href="http://sunelehmann.com/2010/06/29/pervasive-overlap/"&gt;noted on his blog&lt;/a&gt;, part of the problem here is pervasive overlap:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;blockquote&gt;&lt;/blockquote&gt;&lt;span class="Apple-style-span" style="color: #333333; line-height: 24px;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="color: #333333; line-height: 24px;"&gt;If every node is a member of more than one community — and this is clearly the case in the LinkedIn example, as well as in real social networks — &lt;i style="background-color: transparent; border-width: 0px; font-style: italic; margin: 0px; padding: 0px; vertical-align: baseline;"&gt;then the global structure of the network is not at all modular. &lt;span style="background-color: transparent; border-width: 0px; font-style: normal; margin: 0px; padding: 0px; vertical-align: baseline;"&gt;Rather, the network will be a dense mess with no visually discernible structure. The network will look like ball of yarn … or a hairball (&lt;/span&gt;&lt;/i&gt;above, right). In fact, this is precisely the type of structure which has recently been discovered in empirical investigations of a comprehensive set of large networks (social and otherwise)&lt;/span&gt;&lt;/blockquote&gt;&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;&lt;/blockquote&gt;&lt;span class="Apple-style-span"&gt;In their paper, the authors attempt a visualization of only the smallest dataset, which looks like the hairball described by Lehmann:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;span class="Apple-style-span"&gt;&lt;a href="http://3.bp.blogspot.com/_8NCZ65JlePs/TU1G4Reiz3I/AAAAAAAAG0A/JGWxVaFXNpk/s1600/Traud-Vis-Caltech.png"&gt;&lt;img alt="" border="0" id="BLOGGER_PHOTO_ID_5570186246705303410" src="http://3.bp.blogspot.com/_8NCZ65JlePs/TU1G4Reiz3I/AAAAAAAAG0A/JGWxVaFXNpk/s400/Traud-Vis-Caltech.png" style="cursor: pointer; display: block; height: 194px; margin: 0px auto 10px; text-align: center; width: 400px;" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;I should note that the authors did not focus on visualization, but rather used the above image as motivation for quantitative measures. Nonetheless, I can verify that it's hard to get much out of visualizing these networks. I have also tried to visualize this and similar data sets, and I've invariably ended up with even more tangled hairballs.  I've wondered whether complicated visualization software, with scores of features hidden in long menus, could do a better job if only I understood all of those features.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;So I was excited to be the official "data torturer" in a visualization workshop (VACN) on Jan. 19th 2011 by &lt;a href="http://www.ucd.ie/casl/"&gt;CASL&lt;/a&gt;. I had the opportunity to see whether the experts who develop some cutting edge visualization software could do a better job visualizing this data.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;The workshop brought together the lead scientists behind five software packages used to visualize network data:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;span class="Apple-style-span"&gt;&lt;a href="http://www.visone.info/"&gt;Visone&lt;/a&gt; (Germany, edu)&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span"&gt;&lt;a href="http://vlado.fmf.uni-lj.si/pub/networks/pajek/"&gt;Pajek&lt;/a&gt; (Slovenia, edu)&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span"&gt;&lt;a href="http://www.tulip-software.org/"&gt;Tulip&lt;/a&gt; (France, edu)&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span"&gt;&lt;a href="http://gephi.org/"&gt;Gephi&lt;/a&gt; (France, edu)&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span"&gt;I2's &lt;a href="http://www.i2group.com/us/products--services/analysis-product-line/analysts-notebook"&gt;Analyst Notebook&lt;/a&gt; (U.K., commercial)&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;In the challenge section of the workshop, we asked participants to visualize three Facebook networks from the Facebook5:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;span class="Apple-style-span"&gt;Caltech (769 Nodes, 16656 Edges)&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span"&gt;Princeton (6596 Nodes, 293,320 Edges)&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span"&gt;UNC Chapel Hill (18163 Nodes, 766,800 Edges)&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;Here are the results of three of the teams:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;b&gt;Visone&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;a href="http://www.inf.uni-konstanz.de/%7Enick/"&gt;Bobo Nick&lt;/a&gt; used Visone to produce the following visualization of the Caltech network.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span"&gt;&lt;a href="http://1.bp.blogspot.com/_8NCZ65JlePs/TU1LXgn3D7I/AAAAAAAAG0I/3IghRt1zAY8/s1600/visone_small.png"&gt;&lt;img alt="" border="0" id="BLOGGER_PHOTO_ID_5570191181393366962" src="http://1.bp.blogspot.com/_8NCZ65JlePs/TU1LXgn3D7I/AAAAAAAAG0I/3IghRt1zAY8/s400/visone_small.png" style="cursor: pointer; display: block; height: 400px; margin: 0px auto 10px; text-align: center; width: 350px;" /&gt;&lt;/a&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;To create the layout, Mr. Nick applied a spectral layout to the modified laplacian, and then used stress minimization to improve on the spectral layout. He colored nodes by dorm.  The height of nodes indicates degree, and the width indicates betweenness centrality.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;At first glance, the visualization looks rather like a hairball. But if you focus on the position and shape of the nodes, and ignore the edges, then this image reveals important aspects of the Caltech network.  First and foremost, the layout reveals that nodes cluster according to their dorms.  Furthermore, the orange dorm appears to be especially insular, while the light-green and dark blue dorms seem to mingle.  By focusing on the width of the nodes, we can see that the nodes which appear between dorms also have a high betweenness centrality.  As I recall, when Bobo colored the nodes by gender, he tentatively found that these bridge nodes tend to be female.  The white nodes indicate that the dorm attribute is "unknown."  Using this layout, one could make an educated guess that the white nodes in the orange cluster would be colored orange if we had data on them.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"&gt;Tulip&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;a href="http://www.labri.fr/perso/auber/"&gt;David Auber&lt;/a&gt;, used Tulip to produce the following images.  The first is of the large UNC dataset, which contains over 800,000 edges, and is particularly difficult to visualize.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="line-height: 19px;"&gt;&lt;span class="Apple-style-span" style="color: black; line-height: normal;"&gt;&lt;a href="http://3.bp.blogspot.com/_8NCZ65JlePs/TU1SSHazAbI/AAAAAAAAG0Q/dXf5_xleuvI/s1600/tulip-largeClustoSpaltStFac.png"&gt;&lt;img alt="" border="0" id="BLOGGER_PHOTO_ID_5570198785309737394" src="http://3.bp.blogspot.com/_8NCZ65JlePs/TU1SSHazAbI/AAAAAAAAG0Q/dXf5_xleuvI/s400/tulip-largeClustoSpaltStFac.png" style="cursor: pointer; display: block; height: 374px; margin: 0px auto 10px; text-align: center; width: 400px;" /&gt;&lt;/a&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: #111111; line-height: 19px;"&gt;Dr. Auber describes the process he used to create the image:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: #111111; line-height: 19px;"&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse; color: black; font-size: x-small; line-height: normal;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="color: #111111; line-height: 19px;"&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse; color: black; font-size: x-small; line-height: normal;"&gt;I've run a spring embedder layout, and a clustering algorithm that finds important clusters based on the connectivity of the element inside a cluster. I have colored the nodes according the student/faculty attribute. For the rendering, I used a bump mapping effect. You can see on the picture that there are many big clusters with almost the same composition of student/faculty. Only one cluster has a special structure mixing different student/fac (blue and yellow halo). You can also see in the the center of the image that there are hubs which connect these clusters, and that these hubs have got a high faculty ratio.&lt;/span&gt;&lt;/span&gt;&lt;/blockquote&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="line-height: 19px;"&gt;&lt;span class="Apple-style-span"&gt;&lt;/span&gt;&lt;span class="Apple-style-span"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;/blockquote&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="line-height: 19px;"&gt;&lt;span class="Apple-style-span"&gt;It's fascinating to see that instead of a mess, the graph clusters into five distinct clusters, one of which is closely related to some non-undergraduate part of the network (either professors or graduate students).  I wish Auber would have colored the nodes by year instead of student/fac---I speculate that these five clusters correspond to freshmen, sophmore, junionr, senior, and grad students.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 19px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="line-height: 19px;"&gt;&lt;span class="Apple-style-span"&gt;The second image is of the Princeton dataset. &amp;nbsp;I've never seen this sort of representation of the relations between various attributes.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="line-height: 19px;"&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="line-height: 19px;"&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="color: black; line-height: normal;"&gt;&lt;a href="http://4.bp.blogspot.com/_8NCZ65JlePs/TU1UTdQDpDI/AAAAAAAAG0Y/jTcRDLT38kY/s1600/tulip-mediumparallelcoords.png"&gt;&lt;img alt="" border="0" id="BLOGGER_PHOTO_ID_5570201007373394994" src="http://4.bp.blogspot.com/_8NCZ65JlePs/TU1UTdQDpDI/AAAAAAAAG0Y/jTcRDLT38kY/s400/tulip-mediumparallelcoords.png" style="cursor: pointer; display: block; height: 397px; margin: 0px auto 10px; text-align: center; width: 400px;" /&gt;&lt;/a&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="line-height: 19px;"&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="line-height: 19px;"&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="line-height: 19px;"&gt;&lt;span class="Apple-style-span"&gt;Auber descibes its creation and meaning:&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="line-height: 19px;"&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="color: black; line-height: normal;"&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse; font-size: x-small;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="line-height: 19px;"&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="color: black; line-height: normal;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;I've used a paralllel coordinate plots that show each dimension of your dataset plus the degree of each element. One can see correlation between dimension and connectivy (degree centrality) [i.e., lines are colored by the number of friends each node has]. One can see that student_fac and years are really correlated&lt;/span&gt;---&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;student fac 1 have only in year 06 07 08 09, while student_fac 2 are only in year 06 05 04 etc ... One can also see that high degree nodes come from student_fac 1 and year 2009-2004.  One interesting thing is that it seems that gender 1 and gender 2 are not going in the same high school. We can also see all high connected nodes have a second major .... too many things to say here...&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/blockquote&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="line-height: 19px;"&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="color: black; line-height: normal;"&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="line-height: 19px;"&gt;Auber's third image is also like nothing I've seen before:&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/_8NCZ65JlePs/TU8Ve00I6LI/AAAAAAAAG1Q/Eg5hA6B7zbY/s1600/smallTreeMapDorm.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="233" src="http://3.bp.blogspot.com/_8NCZ65JlePs/TU8Ve00I6LI/AAAAAAAAG1Q/Eg5hA6B7zbY/s320/smallTreeMapDorm.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="line-height: 19px;"&gt;Dr. Auber explains the process of its creation:&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: inherit; font-size: x-small;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;The image is a treemap representation, where the nodes&amp;nbsp;have been clusterized according to dorm then gender then&amp;nbsp;the year. On top of the treemap, the connection between nodes are&amp;nbsp;drawn using an edge bundling technique. The color of &amp;nbsp;nodes represent the degree of elements.&lt;br /&gt;&lt;span class="Apple-style-span" style="border-collapse: separate;"&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;On this image one can detect several interesting properties.&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;First, the connectivity between people from the same&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;dorm seems to be the same in all dorms except for dorm zero.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;If we extract the induced subgraphs for each dorm we can&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;see that this phenomena is true. For intance, the density of the&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;dorm 0 cluster is 13 times smaller than the dorm 171.&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;It seems that dorm 0 represents people that do not specify&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;their dorm (ie. a random cluster of people). Thus we can conclude&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;that the dorm property is correlated with the connectivity (people from the same&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;dorm seems to have more connections).&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;The image also enables to see intra cluster connections. We can see that some&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;cluster are more linked together, for instance the 169-172 and 166-172.&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;For the node 622 (highest degree) [which is highlight and whose friendships are green] , we can see exactly all his/her connection&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;with dorm/gender ... Visually, we see that he/she is more connected to the cluster&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;169, 166 and 172. It is interesting to see that these 3 clusters are the&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;3 more inter-connected clusters that we detected before.&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/blockquote&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="line-height: 19px;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="line-height: 19px;"&gt;&lt;b&gt;Gephi&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="line-height: 19px;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="line-height: 19px;"&gt;&lt;a href="http://fr.linkedin.com/in/heymann"&gt;Sebastien Heymann&lt;/a&gt; and other members of the Gephi team won the visualization challenge. One of the most impressive images they came up with is the following visualization on the largest dataset, UNC.  Note that it doesn't look like a hairball, but rather reveals clear structure.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/_8NCZ65JlePs/TU1cGiCSJmI/AAAAAAAAG0o/EnyBZRopFHo/s1600/gephi-large.png"&gt;&lt;span class="Apple-style-span"&gt;&lt;img alt="" border="0" id="BLOGGER_PHOTO_ID_5570209581412525666" src="http://4.bp.blogspot.com/_8NCZ65JlePs/TU1cGiCSJmI/AAAAAAAAG0o/EnyBZRopFHo/s400/gephi-large.png" style="cursor: pointer; display: block; height: 300px; margin: 0px auto 10px; text-align: center; width: 400px;" /&gt;&lt;/span&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span"&gt;Mr. Heymann comments on this visualization:&lt;/span&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse; font-size: 13px;"&gt;For the first step of the layout: I've run the&lt;a href="http://gephi.org/2010/openord-new-layout-plugin-the-fastest-algorithm-so-far/"&gt; OpenOrd layout&lt;/a&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse; font-size: 13px;"&gt;, which is a force-directed layout particularly suitable for very large networks; I've then hidden the edges as we were only interested by the visual clusters. In the second step, I added visual information: I interpreted the year of graduation as a quantitative variable by mapping it to a color partitioning.&lt;/span&gt;&lt;/span&gt;&lt;/blockquote&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span"&gt;The Traud paper that introduced this dataset found that in the UNC Facebook network, graduation year appears to be the most important attribute.  In order to arrive at this conclusion, they introduced complicated measures.  While these measure certainly have their advantages, it would be great to be able to visualize that finding.  Traud et al note:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;/span&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;/blockquote&gt;&lt;div&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span"&gt;Despite this demonstration of the utility of visualizing communities [in the Caltech data], it is typically necessary to perform quantitative analyses after detecting communities, as Caltech is unusual among universities in having a single characteristic that aligns so closely with its communities. For other institutions, we observe more heterogeneous communities, and it is typically difficult to visually assess which characteristics best correlate with the communities or even whether there is any strong correlation at all.&lt;/span&gt;&lt;/span&gt;&lt;/blockquote&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;So we were very impressed that Gephi was able to find the correlation with year in the UNC data set, a finding that agrees with Traud et al's quantitative analysis.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span"&gt;&lt;a href="http://1.bp.blogspot.com/_8NCZ65JlePs/TU1dtC1J5bI/AAAAAAAAG0w/rlKtqg7GbHo/s1600/gephi-small.png"&gt;&lt;img alt="" border="0" id="BLOGGER_PHOTO_ID_5570211342562485682" src="http://1.bp.blogspot.com/_8NCZ65JlePs/TU1dtC1J5bI/AAAAAAAAG0w/rlKtqg7GbHo/s400/gephi-small.png" style="cursor: pointer; display: block; height: 300px; margin: 0px auto 10px; text-align: center; width: 400px;" /&gt;&lt;/a&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;Finally, above we see yet another visualization of the Caltech network.  It's in line with the other visualizations we've seen of this set. &amp;nbsp;Mr Heymann's comments:&lt;/span&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse; font-size: 13px;"&gt;I've run the &lt;a href="http://gephi.org/forum/topic.php?id=5403"&gt;ForceAtlas layout&lt;/a&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse; font-size: 13px;"&gt;, which is suitable for dense small graphs. Then I mapped the dorms variable to a color partitioning. We could better see the different communities by hiding the edges between nodes with a low betweenness centrality score, of course.&lt;/span&gt;&lt;/span&gt;&lt;/blockquote&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse; font-size: 20px;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse; font-size: 20px;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/blockquote&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse; font-size: 20px;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="line-height: 19px;"&gt;&lt;span class="Apple-style-span"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8128923945165565689-3622705185485634223?l=sociograph.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sociograph.blogspot.com/feeds/3622705185485634223/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8128923945165565689&amp;postID=3622705185485634223' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/3622705185485634223'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/3622705185485634223'/><link rel='alternate' type='text/html' href='http://sociograph.blogspot.com/2011/02/visualizing-large-facebook-friendship.html' title='Visualizing Large Facebook Friendship Networks'/><author><name>Conrad Lee</name><uri>https://profiles.google.com/110979755871396857085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh6.googleusercontent.com/-3Jylvn3UkJI/AAAAAAAAAAI/AAAAAAAAAAA/PM9AqhTcFSE/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_8NCZ65JlePs/TU1G4Reiz3I/AAAAAAAAG0A/JGWxVaFXNpk/s72-c/Traud-Vis-Caltech.png' height='72' width='72'/><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8128923945165565689.post-2463179329152722226</id><published>2010-03-16T11:11:00.000Z</published><updated>2010-04-15T09:30:03.708+01:00</updated><title type='text'>Is Spinn3r still free for academic research?</title><content type='html'>&lt;span class="Apple-style-span"  style="color:#333333;"&gt;&lt;div style="text-align: center;"&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;&lt;b&gt;Update&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;The comments to this post seem to answer my question: Spinn3r &lt;i&gt;is &lt;/i&gt;still supporting academic researchers, but rather than offering free data, they are offering heavily discounted data (charging $500/month instead of $6000/month). &lt;i&gt;Why couldn't I find this this information on the Spinn3r site.&lt;/i&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;While it's too bad that the data is no longer free, it's really fair enough to charge something for it, considering that collecting, cleaning, hosting and, transferring hundreds of gigabytes of data is not free.&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;&lt;b&gt;End update&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;---&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;Spinn3r is a service that aggregates the articles of thousands of newspapers and other media sources, as well as the posts of millions of blogs.  They clean up this data, put it in a consistent format, and provide rich metadata.&lt;/span&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;For an academic researcher, especially a PhD student with limited resources, this dataset is very appealing.  Spinn3r does all the dirty (and resource intensive) work, and &lt;/span&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;provides unprecedented data on news media, which could be mined to make interesting conclusions about collective human behavior.  For example, &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;a href="http://cs.stanford.edu/~jure" style="margin-top: 0em; margin-right: 0em; margin-bottom: 0em; margin-left: 0em; padding-top: 0em; padding-right: 0em; padding-bottom: 0em; padding-left: 0em; text-decoration: underline; background-color: inherit; "&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;Jure Leskovec&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;, &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;a href="http://www.cs.cornell.edu/~lars" style="margin-top: 0em; margin-right: 0em; margin-bottom: 0em; margin-left: 0em; padding-top: 0em; padding-right: 0em; padding-bottom: 0em; padding-left: 0em; text-decoration: underline; background-color: inherit; "&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;Lars Backstrom&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt; and &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;a href="http://www.cs.cornell.edu/home/kleinber" style="margin-top: 0em; margin-right: 0em; margin-bottom: 0em; margin-left: 0em; padding-top: 0em; padding-right: 0em; padding-bottom: 0em; padding-left: 0em; text-decoration: underline; background-color: inherit; "&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;Jon Kleinberg&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt; wrote a &lt;/span&gt;&lt;a href="http://www.cs.cornell.edu/home/kleinber/kdd09-quotes.pdf"&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;fascinating article&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt; on the news cycle (see the companion &lt;/span&gt;&lt;a href="http://memetracker.org/"&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;MemeTracker website&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt; for details).&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;Spinn3r has generously offered give academic researchers free access to their API (as they mention &lt;/span&gt;&lt;a href="http://blog.spinn3r.com/crawler/"&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt; and as the head of Spinn3r, Kevin Burton, has mentioned &lt;/span&gt;&lt;a href="http://datamining.typepad.com/data_mining/2007/10/tailrank-spinn3.html"&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;).  However, I could not find any sign-up form for researchers.  I assume that researchers are supposed to request access using the company's &lt;/span&gt;&lt;a href="http://www.spinn3r.com/contact"&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;contact form&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;.  I have &lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;twice &lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;requested access, and I have waited &lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;months &lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;for a response.  I have not heard back from them.  Has anyone else had better luck?&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;For anyone interested in the data, the best I could do was find a dataset that Kevin Burton suppled for an academic conference called &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"   style="  ;font-family:Arial, Georgia, 'Trebuchet MS', Tahoma;font-size:11px;"&gt;&lt;span style="clear: right; "&gt;&lt;a href="http://www.blogger.com/icwsm09_sanjose/" style="text-decoration: underline; font-size: 9pt; "&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;ICWSM 2009 - International AAAI Conference on Weblogs and Social Media&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;.  &lt;/span&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;You can find details on this dataset in &lt;/span&gt;&lt;a href="http://videolectures.net/icwsm09_burton_spinn3r/"&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;this video lecture&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="clear: right; "&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="clear: right; "&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="color:#333333;"&gt;Spinn3r's participation in this conference indicates that Spinn3r is serious about their support of the academic community, so why don't they seem to be answering requests for access to their API?&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8128923945165565689-2463179329152722226?l=sociograph.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sociograph.blogspot.com/feeds/2463179329152722226/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8128923945165565689&amp;postID=2463179329152722226' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/2463179329152722226'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/2463179329152722226'/><link rel='alternate' type='text/html' href='http://sociograph.blogspot.com/2010/03/is-spinn3r-still-free-for-academic.html' title='Is Spinn3r still free for academic research?'/><author><name>Conrad Lee</name><uri>https://profiles.google.com/110979755871396857085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh6.googleusercontent.com/-3Jylvn3UkJI/AAAAAAAAAAI/AAAAAAAAAAA/PM9AqhTcFSE/s512-c/photo.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8128923945165565689.post-2201088114161704470</id><published>2010-03-16T10:14:00.000Z</published><updated>2010-03-16T14:59:03.139Z</updated><title type='text'>Recommended rate limit on crawling Google Profile pages: 100 per second</title><content type='html'>It recently &lt;a href="http://petewarden.typepad.com/searchbrowser/2010/03/how-to-gather-the-google-profiles-data-set.html"&gt;came to my attention that &lt;/a&gt; not only is it possible to crawl Google profile pages, but Google even supports this activity by providing a directory of user ids.  There's some interesting data on many of these profiles, including current geographic location, past locations, education history, and organization affiliation.  Also, many users have not opted-out of allowing their Buzz activity to be posted on these pages, which means there is a lot of data on media consumption and awareness of various memes.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Using a modified version of Pete Warden's&lt;a href="http://github.com/petewarden/buzzprofilecrawl"&gt; open-sourced scraper&lt;/a&gt;, I set out to crawl all of the Google Profile pages that were listed in the files in their &lt;a href="http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml"&gt;directory&lt;/a&gt;.  It appears that this directory has not been updated since October 2009, and contains ~4.2 million user IDs.  I guess a lot of people have created profiles since then due to the introduction of Buzz, so the directory might be quite out of date.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I wasn't sure of the correct crawling rate, so, because this crawling activity seems to be officially approved of by Google, I decided that it is better to ask for forgiveness than for permission and so I started off at an optimistic crawl rate.  I used a Python module called &lt;a href="http://honeypot.net/multi-processing-map-python"&gt;forkmap&lt;/a&gt; to run 8 processes of Pete Warden's scraper in parallel (each of which contained 200 threads, for a total of 1,600 threads).  I thought that bandwidth would keep the cralwer down to a reasonable rate, but I guess the University's connection is quite fast, because I received an e-mail a couple of hours later from a Google engineer informing me that I was crawling too quickly, in fact, I was scraping 1,800 profiles per second.  The e-mail was friendly enough, and the general point was that although Google supports crawling the profile data, &lt;b&gt;one should keep the crawl rate down to 100 requests per second.&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;This rate is still quite generous, and would allow one to crawl the 4.2 million listed profiles in 11.6 hours, which is really quick enough.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8128923945165565689-2201088114161704470?l=sociograph.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sociograph.blogspot.com/feeds/2201088114161704470/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8128923945165565689&amp;postID=2201088114161704470' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/2201088114161704470'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/2201088114161704470'/><link rel='alternate' type='text/html' href='http://sociograph.blogspot.com/2010/03/recommended-rate-limit-on-crawling.html' title='Recommended rate limit on crawling Google Profile pages: 100 per second'/><author><name>Conrad Lee</name><uri>https://profiles.google.com/110979755871396857085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh6.googleusercontent.com/-3Jylvn3UkJI/AAAAAAAAAAI/AAAAAAAAAAA/PM9AqhTcFSE/s512-c/photo.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8128923945165565689.post-6097467548826167004</id><published>2009-11-06T15:11:00.000Z</published><updated>2009-11-26T22:38:23.729Z</updated><title type='text'>Is obesity contagious? A Review of the Debate over the "Network Effects" of Obesity</title><content type='html'>&lt;div style="text-align: center;"&gt;&lt;span class="Apple-style-span"  style="color:#0000EE;"&gt;&lt;u&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 0); -webkit-text-decorations-in-effect: none; "&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_8NCZ65JlePs/SvnUsGe-3TI/AAAAAAAAGcU/z0ENV7EPJPE/s1600-h/Screenshot.png"&gt;&lt;/a&gt;&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-weight: bold; "&gt;&lt;blockquote&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;/blockquote&gt;Introduction&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;In July 2007, network science invaded medicine.  Nicholas Christakis, a professor of public health at Harvard Medical School, and James Fowler, an associate professor of political science at UC San Diego, published a cover article in medicine's most prestigious journal, the New England Journal of Medicine, complete with a teaser written by complex-network-bigshot Albert-Laszlo Barabasi.  The article served as the debut of social network analysis in medicine, complete with a glossary that explained terms such as "ego," "alter", "node," and "homophily" for uninitiated medical researchers.&lt;div&gt;&lt;br /&gt;In this article &lt;a href="http://content.nejm.org/cgi/content/abstract/357/4/370"&gt;(freely accessible from the journal&lt;/a&gt;), entitled "The Spread of Obesity in a Large Social Network over 32 Years", Christakis and Fowler (CF) make a number of claims, each more bold than the last.  In a later response to criticism, CF summarize their two major findings of the paper in the following two points:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;(1) "obese persons formed clusters in the network at all time points and that these clusters extended to three degrees of separation (e.g., to a person's friend's friend's friend)"&lt;br /&gt;&lt;br /&gt;(2) "statistical analysis suggested that the clusters were not soley attributable to the selective formation of social ties among obese persons.  A person's chances of becoming obese increased if he or she had a friend who became obese in a given time period."&lt;br /&gt;&lt;br /&gt;We should note that this brief summary of the main findings is slightly incomplete and actually leaves out the most contentious point.  If one considers point (2), there are a number possible reasons why someone's body mass index (BMI) might be linked to the BMI of his or her friends.  Let's begin by considering some possible causes for this clustering.&lt;div&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Various causes of clusters of obese people&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Noting that friends often share similar environments, it could be the case that fast food chains (or fast food advertisements) have invaded our common environment, or some technological improvements such as elevators have decreased the need to be physically active.  In this case, although a person's weight gain is correlated the weight gain of his friends, it's not really his friend that causes him to gain weight, but rather a shared environmental factor---a "confounding variable."&lt;br /&gt;&lt;br /&gt;However, we can also imagine how, given some social dynamic, a person's weight gain is directly caused by his friends' weight gain. Let's reasonably assume that people have an incentive to eat lots of unhealthy food (e.g., it tastes good, is cheap, easy to find), which is held in check only by their desire to have a personally acceptable BMI.  Let's imagine that my definition of an acceptable BMI is pegged to be slightly lower than the average BMI of my friends. If my friends gain a lot of weight, then my acceptable BMI will increase, allowing me to enjoy unhealthy foods without becoming, in my mind, unacceptably overweight.  One can also imagine simpler processes whereby my friends' behavior directly affects my BMI; for example, if my friends start frequenting unhealthy restaurants and drag me along.  Let's call these  explanations of (2) cases of "social influence", and note that the "social influence" explanation of (2) differs from the "confounding variable" explanation" because we put the blame for our weight gain on the behavior of our friends rather than on some shared environmental change.&lt;br /&gt;&lt;br /&gt;So far we have established two processes that could account for clusters of obese people: confounding variables (shared environmental factors such as fast food chains) and social influence.  Point (2) also states that "the clusters were not soley attributable to the selective formation of social ties."  To understand this statement, we must introduce a third process that can create clusters of similar people: selection (aka, "choice homophily","homophily").  Selection's mechanism is simple: "birds of a feather flock together," that is, people befriend those who resemble themselves.  In this case clustering of obesity can emerge even though everyone's weight remains constant--overweight people tend to become friends with overweight people, and similarly, people who are not overweight tend to become friends with similarly slim people.  This process, selection, also leads to obese clusters.&lt;br /&gt;&lt;br /&gt;From a policy perspective it is important to know which of these three processes is responsible for the recent rise in the rate of obesity.  In particular, if social influence plays a major role, then obesity is similar to a disease---it's contagious!  When treating contagious diseases, there is a "multiplier effect."  Imagine that at a certain stage in a flu epidemic, the average person spreads the disease to two other people.  Then if I can prevent one person from getting infected, I also prevent that person from spreading the disease to two other people---thus the effect of my intervention becomes "multiplied."  The other two processes I've mentioned, confounding variables and selection, do not have a multiplier effect. From a policy perspective it is therefore important to distinguish between these three processes.&lt;br /&gt;&lt;br /&gt;Now that we have covered how confounding variables, social influence, and selection could all lead to clusters of obese people, we can return to the claims of CF.  CF contend that (2) is not the result of just confounding variables or selection; they claim it is also due to the process of social influence.  (In the course of this controversy, the process of social influence variously refered to as "endogenous effects", "induction", and the vague "social network effects".)  From a policy perspective, this implies that intervention that prevents one person from becoming obese, or that cures a case of obesity, could have a multiplier effect---in a sense this multiplier effect lowers the cost, or increases the effect, of such intervention.  Before we move on to criticisms of this claim, let's look at CF's argument in favor of social influence.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;How to control for selection (a.k.a. choice homophily) as a cause of obese clusters of friends&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Because confounding variables, selection, and social influence all create similarly clustered networks, it's very hard to say, given some clustered network, which of these three processes was responsible for the clustering.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;However, if one has longitudinal data on friendships and obesity (also called &lt;i&gt;dynamic&lt;/i&gt; network data), one can begin to untangle these three processes.  Specifically, one can distinguish the effects of selection on the one hand, from confounding variables and influence, on the other.  Separating the effects of selection from the other processes is achieved by using a "lagged variable."&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The motivation for using a lagged variable becomes clear when considering the following scenario. Let's say that we observe a group of adolescents in 2005.  We observe that they are all friends with each other and we measure the BMI of everyone in the group.  In 2009, their BMIs are once again measured, and we confirm that they are all still friends.  Let's assume that most of the members of this group became obese between 2005 and 2009.  The end result is that we observe a cluster of obese people.  If we did not have the longitudinal data from 2005, then we could not be sure whether this obese cluster was created by the process of selection or some other process.  However, with our longitudinal data, we can say that these obese people actually chose to be friends with each other before they became obese---meaning we can rule out selection (with respect to obesity) as the cause of the obese cluster.  In short, with longitudinal data one can isolate the effects of selection to a single moment, and assume that correlations in the change of BMI in the next moment are not due to selection.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;[Note: CF do not use BMI in their independent and dependent variables, but rather the obesity status.  Why did they make this decision?   I find it suspicious: why throw away data by turning a real-valued variable into a categorical (true/false) variable?] &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;CF exploit this advantage of longitudinal data by using a logistic-regression model with the use of lagged dependent and independent variables.  Let's call the dependent variable in their model, which is the obesity status of the subject at time t+1, &lt;i&gt;obese?(subject, t+1)&lt;/i&gt;.  The independent variables include: &lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;i&gt;obese?(subject,t)&lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;obese?(friend,t)&lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;obese?(friend,t+1)&lt;/i&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;In this model, the effects of selection (homphily) should be absorbed in the coefficient for &lt;i&gt;obese?(friend,t)&lt;/i&gt;, and we assume that the coefficient for the variable &lt;i&gt;obese?(friend,t+1)&lt;/i&gt; reflects either some confounding variable or the effects of social influence.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In fact, the core of the debate around CF's obesity study revolved around interpreting whether &lt;i&gt;obese?(friend,t+1) &lt;/i&gt;should be attributed to confounding variables or to social influence.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;How CF argue that social influence spreads obesity (i.e., that obesity is contagious)&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;The main empirical result that CF use to support their claim is summarized in the following graphic:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div style="text-align: center; "&gt;&lt;span class="Apple-style-span"  style="color:#0000EE;"&gt;&lt;u&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 0); -webkit-text-decorations-in-effect: none; "&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_8NCZ65JlePs/SvnUsGe-3TI/AAAAAAAAGcU/z0ENV7EPJPE/s1600-h/Screenshot.png"&gt;&lt;img src="http://2.bp.blogspot.com/_8NCZ65JlePs/SvnUsGe-3TI/AAAAAAAAGcU/z0ENV7EPJPE/s400/Screenshot.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5402583082127514930" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 400px; height: 266px; " /&gt;&lt;/a&gt;&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Pay particular attention to the first three rows in the graphic above.  The first row suggests that if I call some person &lt;i&gt;f&lt;/i&gt; my friend, who does not reciprocally name me as a friend, and that &lt;i&gt;f &lt;/i&gt;becomes obese, then my risk of becoming obese increases by somewhere between approx 5% and 100%.  The second row says that if I list &lt;i&gt;f &lt;/i&gt;as a friend and &lt;i&gt;f &lt;/i&gt;lists me as a friend (i.e., the friendship is observed to be reciprocal), then if &lt;i&gt;f&lt;/i&gt; becomes obese, then the chances that I also become obese increase by anywhere between approx 75% and 300%.  The third case says that if &lt;i&gt;f &lt;/i&gt;names me as a friend, but I don't name him, then the fact that &lt;i&gt;f &lt;/i&gt;becomes obese does not affect the probability that I also become obese with statistical significance.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The discussion above should make it clear that the &lt;i&gt;cause&lt;/i&gt; of the my increased risk of obesity, given that my wife or friend becomes obese, is difficult to determine.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;CF's argument that the cause is the social contagiousness of obesity (i.e., influence rather than selection or confounding variables) is mostly contained in the following few paragraphs from their paper:&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;&lt;/blockquote&gt;&lt;/div&gt;&lt;blockquote&gt;&lt;div&gt;"The use of a time-lagged dependent variable (lagged to the previous examination) eliminated serial correlation in the errors (evaluated with a Lagrange multiplier test) and also substantially controlled for the ego's genetic endowment and any intrinsic, stable predisposition to obesity.  The use of a lagged independent variable for an alter's weight status controlled for homophily.  The key variable of interest was an alter's [friend's] obesity at time t+1.  A significant coefficient for this variable would suggest either that an alter's weight affected an ego's weight or that an ego and an alter experienced contemporaneous events affecting both their weights.  We estimated these models in varied ego-alter pair types." &lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;/blockquote&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Before moving on to the next paragraph let's interpret this last one.  Basically they are saying that their logistic regression model contains a variable for the past obesity status of the ego (subject) and the past and current obesity status of the alters (friends).  They contend that the former variable (the past obesity status of the ego) controls for genetic predisposition (which is reasonable) and that, as explained in the last section, the variable that represents the previous obesity status of the alter (friend) controls for selection.  The also state that a "significant coefficient for the variable &lt;i&gt;obese?(friend, t+1) &lt;/i&gt;(i.e., a correlation between my weight gain and my friend's weight gain) would indicate something ambiguous: either that some confounding variable is responsible for our mutual weight gain, or that social influence is to blame.  So far, so good.  In the next paragraph, CF make the most contentious claim: that correlated weight gain is not due to some confounding variable, but rather social influence.  Pay close attention to their line of argument:&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;"To evaluate the possibility that omitted variables or unobserved events might explain the associations, we examined how the type or direction of the social relationship between the ego and the alter affected the association between the ego's obesity and the alter's obesity.  &lt;b&gt;For example, if unobserved factors drove association between the ego's obesity and the alter's obesity, then the directionality of friendship should not matter&lt;/b&gt;." [Emphasis added]&lt;/blockquote&gt;The last sentence states that if the direction of friendship plays a role in the association of the weight gain of a subject with the weight gain of that subject's friend, then confounding factors can be ruled out and social influence must be the cause of correlated obesity status. If you accept this argument, then the graphic shown above quite convincingly indicates that obesity is contagious.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now let's turn our attention to the criticism of this finding.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Criticism of CF's finding&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;In May 2008, Ethan Cohen-Cole, of the Federal Reserve Bank of Boston, and Jason Fletcher, of Yale University, published &lt;a href="http://scholar.google.com/scholar?cluster=5691198610034104814&amp;amp;hl=en&amp;amp;as_sdt=2000"&gt;a paper&lt;/a&gt; whose main goal, they later describe, was "to illustrate that it takes relatively specific and potentially narrow specification choices to yield the contagion result."  They also state, "our goal was to show that the &lt;i&gt;methodology&lt;/i&gt; used by Christakis and Fowler (2007) was sensitive to specification choice, and once very small, yet reasonable modifications are made to the space of possible empirical models, the data yield coefficients that are statistically indistinguishable from zero." &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To show that CF's results are not robust, Cohen-Cole and Fletcher (CCF):&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;Find data that is somewhat similar to CF's&lt;/li&gt;&lt;li&gt;Replicate CF's analysis and find similar results as CF found&lt;/li&gt;&lt;li&gt;Take into account an extra confounding variable (i.e., they extend CF's model to control for more)&lt;/li&gt;&lt;li&gt;Point out that after CF's model is extended to take this extra confounding variable into account, the main findings of CF no longer appear.&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;We should first note that it's a shame that CCF did not use the same data as CF.  Their refutation is quite weakened by the fact that their so-called replication actually used different data (they use data from National Longitudinal Study of Adolescent Health [AddHealth]).  While CCF acknowledge that there are important differences between their AddHealth data and the Framingham data that CF used, they do not state why they used different data.  Presumably the Framingham data was not easy to obtain. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;CCF reasonably justify their use of AddHealth by pointing out when they replicate CF's method on the AddHealth data, they find results that are remarkably similar to CF's Framingham results.  While it is difficult to say whether this similarity means that the data really are similar or that the measure is biased to produce a narrow range for a particular coefficient, the similarity is nonetheless an indicator that the data sets are similar in an important way.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;CCF's main finding is that if they add one additional independent variable, which, in the notation above, should be labeled &lt;i&gt;obese?(community,t+1)&lt;/i&gt;, then obesity no longer appears to be contagious.  CCF explain the significance of including this variable in their model:&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;&lt;/blockquote&gt;&lt;blockquote&gt;[The additional environmental variable] formalizes the notion of contextual effects...These represent a much richer set of controls to absorb average changes in social context experienced by all individuals in the sample.  To explain further, these school-specific trends account for any environmental factors shared by individuals at the same school.  CF control for year effects, but their specification does not capture any shared confounders that also vary across geographic space.  For example, CF can control for the fact that the density of fast food restaurants has increased over time but not the fact that number of fast food restaurants has grown faster in some areas than in other areas.  For example, suppose that the number of fast food rastaurants has grown faster in Boston, Massachusetts than in western Massachusetts. Controlling for year effects (which controls for the grown th the number of fast food restaurants across the states in a given year) is not as appealing as controlling for the number of fast food restaurant [sic] in an individual's local area.&lt;/blockquote&gt;&lt;blockquote&gt;"&lt;b&gt;Without accounting for the trends, clustering of obesity in social networks that changed over time would incorrectly be absorbed in the estimation by the endogenous variable.  &lt;/b&gt;Though one, in principle, would want many more controls to account for additional contextual effects, we will note shortly that the endogenous [social influence] effect vanishes even with this relatively simple characterization."&lt;/blockquote&gt;In short, CCF add an independent variable to their model that accounts for change in BMI of the entire school.  This should help control for shared environmental (confounding) variables.  In addition to this change to CF's specification, CCF make one other change:&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;/blockquote&gt;&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;We add to our evaluation by accounting for self-selection of friends (homophily). This is accomplished by looking only at the change in BMI fromthe time of declaration of friendship until the subsequentweight measurement. Note the distinction between this method and the lagged independent variable used in CF. Our method allows us to distinguish between the desire to become friends based on similarity in weight, which would appear based on the simultaneous measurement of friendship andweight, and the friendship effect ofweight gain.&lt;/blockquote&gt;After making these two modifications to CF's model, CCF find that the observed contagiousness of obesity disappears.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;CCF also point out that CF's method of controlling for selection (homphily)---the use of a lagged variable---is susceptible to a particular problem.  CCF  provide a nice description of this weakness in &lt;a href="http://www.bmj.com/cgi/content/abstract/337/dec04_2/a2533"&gt;a later critical article&lt;/a&gt; published in the&lt;i&gt; British Medical Journal&lt;span class="Apple-style-span" style="font-style: normal; "&gt;)&lt;/span&gt;&lt;/i&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;The claim [of CF] is that "the use of a lagged independent variable for an alter's weight status controlled for homophily."...Unfortunately, unless selection is conditioned only on this variable, this statement might be spurious.  For example, if friendships are formed based on characteristics like self esteem, and if self esteem affects both current weight and future weight in differing ways, then adjustment for current peer weight status will not capture the self selection of friends based on self esteem that also affects future weight.&lt;/blockquote&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Why Cohen-Cole and Fletcher's refutation hits the &lt;span class="Apple-style-span" style="font-weight: normal; "&gt;&lt;b&gt;target only halfway&lt;/b&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;CCF's findings indicate, using data that might be similar to CF's data, that CF's specification may be quite narrow---i.e., that their finding is not robust to variations in the specification of the logistic regression model. CCF suggest that if CF had controlled more shared environmental variables, their findings may have been statistically insignificant.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;However, to be fair to CF, one must remember that CF recognize the need to take confounding environmental variables into account. Recall (in the bolded part of CF's excerpt above, also repeated below) that CF do in fact consider that correlated weight gain might be due to shared environmental variables.  Thus both parties agree that confounding variables make it difficult to conclude that obesity is contagious.  CF's strategy is to exploit the directionality of friendship to rule out confounding environmental variables.  To repeat the key part of the excerpt above: "&lt;b&gt;if unobserved factors drove association between the ego's obesity and the alter's obesity, then the directionality of friendship should not matter&lt;/b&gt;."&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Why do I say that CCF only halfway succeed in criticizing CF?  CCF &lt;i&gt;do &lt;/i&gt;convincingly show that CF's result depends upon a quite narrow specification, and that if this specification is changed the the so-called contagiousness of obesity melts away into confounding variables.  They &lt;i&gt;do not&lt;/i&gt;, however, adequately address CF's main justification for disregarding confounding variables: confounders should not be related to direction of friendship---yet the data were very related to the direction of friendship! CCF briefly comment on the importance of edge direction in CF's findings:&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;CF appear to approach confounding by examining the type and direction of the friendship networks. For example, if individual A declared himself a friend of individual B but not vice-versa, then a social network effect should appear for A but not B.While network structures can be useful for identification of social network effects, their presence does not rule out the possibility that confounding environmental effects overlap and influence the decisions of network members.[15] Since CF never directly control for environmental factors, we view their results using directionality of friendship nominations as suggestive rather than conclusive.&lt;/blockquote&gt;The bracketed "[15]" refers to a footnote.  Because it's relevant, let's have a look at it:&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;[15] The argument for identification of social network effects using network architecture has been formalized in Bramoull´e et al. (2007). The methodology is intriguing and may be sufficient to control for confounding in the CF case, however, CF do not employ it, and it is not clear whether the single-Alter structure of most of their data permit identification in this setup in any case.&lt;/blockquote&gt;Thus, CCF simply state that it's too bad that CF did not use Bramoull´e et al.'s methodology, without explaining why CF's use of friendship direction is flawed.  In other words, they dance around CF's main justification for largely disregarding precisely the point that CCF criticize.  In CCF's later additional critical article in the British Medical Journal, they again address this point without making it clear why direction should not matter.  They state:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;As schoolmates are often also frioends, these common environmental exposures might produce the appearance of "social network effects" if not controlled for in the empirical models, particularly if the type and direction of the friendship networks are used for adjustment.  For example, if individual A declared himself a friend of individual B but not vice versa, then a social network effect should appear for A but not B.  Unfortunately, while this "intransitivity" of the network can facilitate identification of network effects, it does not address confounding.&lt;/blockquote&gt;&lt;/div&gt;&lt;div&gt;I wish CCF would provide an example here to explain why they believe that the direction of friendship cannot be exploited to adjust for confounding variables.  In terms of CCF's example of A and B, the following seems to hold: A and B are likely in the same environment (because A claims B as a friend), and &lt;i&gt;despite&lt;/i&gt; that fact, only B's weight affects A's weight, as one would expect if obesity were contagious.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;So is obesity contagious?&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;The debate over the contagiousness of obesity is one of the rare instances where scientific findings are given the scrutiny that is essential for the health of science in general.  The end result, however, is unsatisfying: although CCF provide evidence that CF's results are questionable, they do not directly undermine CF's &lt;/div&gt;&lt;div&gt;defense, which is based upon the importance of the direction of friendship.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Does this mean that CF's findings are solid?  Maybe, maybe not.  If one looks closely at the figure above, the error bars of all three directions overlap---meaning that one ought not reject the null hypothesis that directions does not matter. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;However there are other reasons why a relation between weight gain and friendship direction need not imply that obesity is contagious.One can imagine how, even if obesity were not contagious, edge direction could still be related to correlated weight gain.  Let's consider one possible scenario in which we would find that edge direction is related to mutual weight gain, and yet in which obesity is not actually contagious. I will simply snatch this example out of thin air---I do not mean to suggest that my example corresponds to reality, but rather demonstrate that CF's findings need not imply that obesity is contagious.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In this scenario, the following points are the most important:&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;First, let's assume some hypothetical world in which we know obesity is not contagious.&lt;/li&gt;&lt;li&gt;Obesity is socially stigmatized.  Due to this stigma, non-obese people are more likely to drift away from their friends who become obese than they are to drift away from their friends who remain at a healthy weight.  We don't have to assume that this effect is extremely pronounced, let's just call it a moderate tendency.&lt;/li&gt;&lt;li&gt;People who have recently become obese do not share this moderate tendency to prefer non-obese contacts---as obese people themselves, they are no longer sensitive to the social stigma associated with obesity.&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;The following points result as consequences of the three assumptions we've just made:&lt;/div&gt;&lt;ol&gt;&lt;li&gt;Due to this moderate tendency of non-obese to drift away from their newly-obese friends, when non-obese people are asked in some survey to quickly list their best friends, their newly-obese friends are less likely to be the first to come to mind.  Keep in mind that in the Framingham study, most subjects listed either no friends or just one friend.&lt;b&gt; This tendency leads to a relatively lower rate of friendship from non-obese people to obese people.&lt;/b&gt;&lt;/li&gt;&lt;li&gt;If you are newly obese, you still list your non-obese friends. &lt;/li&gt;&lt;li&gt;Furthermore, if both you and your friend become obese at the same time, you neither of you are more likely to forget about this friendship.  &lt;b&gt;This tendency leads to a higher rate of friendship among obese people (relative to the friendship from non-obese to obese people).&lt;/b&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;div&gt;Thus, because the stigma related to obesity has a direction (from the non-obese &lt;i&gt;to&lt;/i&gt; the obese), and because this stigma is related to reporting friendships, it creates the appearance of contagious obesity, where in fact obesity is not contagious.  Although there are still some non-obese people who are friends with obese people, and perhaps the majority of obese people's friends are not obese, the &lt;i&gt;relative &lt;/i&gt;rates are key here, not the absolute rates.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I provide this example simply to illustrate that the significance of friendship direction, does not necessarily mean that obesity is contagious.  It's just one hypothetical, yet plausible, explanation---perhaps you can come up with one that is even more plausible.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;The contagiousness of smoking (cessation), happiness, headaches, acne, and height&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;A year after their first big publication in the &lt;i&gt;New England Journal of Medicine&lt;/i&gt; CF again p&lt;a href="http://content.nejm.org/cgi/content/abstract/358/21/2249"&gt;ublish a cover article&lt;/a&gt; in the same journal.  The second article is nearly exactly the same as the first (the measure used are the same, the data is from the same Framingham dataset, the structure of the research design and the paper itself are the same---even some sentences are almost word for word the same), but this time the topic is the contagiousness of smoking and smoking cessation rather than obesity.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;CF's main results in the smoking study largely mirror their results in their obesity study.  Importantly, they use the same type of logistic regression model, with lagged dependent and independent variables, to conclude that smoking is contagious across friendship and family ties.  (They include some results that smokers have moved to the periphery of social groups in the last thirty years---results which I find unconvincing.)  Because the results in this study are based on the same methods as used in the obesity study, they are susceptible to the same criticism as the obesity study.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;CF published&lt;a href="http://www.bmj.com/cgi/content/full/337/dec04_2/a2338?ijkey=05ede3db7790081a6b75acbd4913d580c2bce4ae"&gt; a third article&lt;/a&gt; in a similar vein, this time exploring whether happiness is contagious.  For their main statistical analysis they again use a regression model, with lagged variables to control for homophily, and contend that if friendship direction is important then one need not worry about confounding variables.  Again, this study is susceptible to the same kind of criticism from CCF.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8128923945165565689-6097467548826167004?l=sociograph.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://sociograph.blogspot.com/feeds/6097467548826167004/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8128923945165565689&amp;postID=6097467548826167004' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/6097467548826167004'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8128923945165565689/posts/default/6097467548826167004'/><link rel='alternate' type='text/html' href='http://sociograph.blogspot.com/2009/11/is-obesity-contagious-review-of-debate.html' title='Is obesity contagious? A Review of the Debate over the &quot;Network Effects&quot; of Obesity'/><author><name>Conrad Lee</name><uri>https://profiles.google.com/110979755871396857085</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='//lh6.googleusercontent.com/-3Jylvn3UkJI/AAAAAAAAAAI/AAAAAAAAAAA/PM9AqhTcFSE/s512-c/photo.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_8NCZ65JlePs/SvnUsGe-3TI/AAAAAAAAGcU/z0ENV7EPJPE/s72-c/Screenshot.png' height='72' width='72'/><thr:total>0</thr:total></entry></feed>
