Michaeleen Doucleff just wrote a very fun article on our recipe network paper for NPR’s The Salt.

It made me realize that Edwin Teng, Yuru Lin and I have some leftover plots that may be Thanksgiving appropriate. If you don’t have quite the right ingredients handy while cooking Thanksgiving dinner, here is a network of common substitutions as found in reviewers’ comments on a large recipe site (click to see a larger view):

The favorite Thanksgiving ingredients are often recommended as substitutes: cranberries, for example, end up substituting for other kinds of fruit and even, somehow, for chocolate. In the fat category, olive oil and butter seem to be recommended as substitutes for things such as margarine. Yams are often recommended as a substitute for sweet potatoes (more so than the other way around), etc.

Recently, from my Coursera class, I created region-specific networks using data shared by YY Ahn & co. in their flavor network paper. This isn’t a complete set of all regions, but see if you can guess which region is visualized in each of these (mouse over for the answer; your choices are Northern European, Southern European, North American, Latin American, Middle Eastern, South Asian, African, Southeast Asian, and East Asian):

 

Lastly, and most deliciously, here is the network of complementary ingredients for Thanksgiving, created by my collaborator Edwin Teng:

Bon Appétit!

 

In cooking I alternate between following recipes exactly, for fear that any sort of deviation might ruin the outcome, and trying to throw things together arbitrarily, with occasionally edible results. Could this problem be solved the way I like to approach other problems, i.e. by analyzing a nice data set, preferably of user contributed knowledge?

So a little over a year ago, I proposed the idea of using ingredient networks to evaluate recipes at a “Wacky Wednesday” faculty meeting, where School of Information faculty gather and pitch ideas to each other. The mix of interest and skepticism with which the idea was greeted was enough to motivate me to work on the problem with my PhD student Edwin Teng. Soon thereafter, Yu-Ru Lin, from Northeastern and Harvard, joined us on the project, and lent it her insight and machine learning expertise.

A lot of fun findings ensued (you can download the paper on arXiv):

1) If one examines complementary ingredients, two main communities fall out, one sweet, the other savory (see image above).

And there is a smaller, third community of mixed-drink ingredients.

mixed drink ingredients
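For readers who want to play with this idea, here is a minimal, self-contained sketch (with toy co-occurrence counts I made up, not our actual data): score ingredient pairs by pointwise mutual information, then pull out groups, using connected components over high-PMI edges as a crude stand-in for the modularity-based community detection we actually used.

```python
import math
from collections import Counter
from itertools import combinations

# Toy recipes (hypothetical), each a set of ingredients.
recipes = [
    {"flour", "sugar", "butter", "vanilla"},
    {"flour", "sugar", "egg", "vanilla"},
    {"onion", "garlic", "chicken", "salt"},
    {"onion", "garlic", "beef", "salt"},
    {"sugar", "butter", "egg"},
    {"chicken", "salt", "garlic"},
]

single = Counter(i for r in recipes for i in r)
pair = Counter(frozenset(p) for r in recipes for p in combinations(sorted(r), 2))
n = len(recipes)

def pmi(a, b):
    """Pointwise mutual information: do a and b co-occur more than chance?"""
    p_ab = pair[frozenset((a, b))] / n
    if p_ab == 0:
        return float("-inf")
    return math.log(p_ab / ((single[a] / n) * (single[b] / n)))

# Keep only strongly complementary edges, then take connected components.
edges = [tuple(p) for p, c in pair.items() if c >= 2 and pmi(*tuple(p)) > 0]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

def components(adj):
    """Connected components via depth-first search."""
    seen, comps = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps

comms = components(adj)  # the toy data splits into a sweet and a savory group
```

Even on six made-up recipes, the sweet and savory ingredients fall into separate groups, which is the same qualitative picture as the two large communities in the figure.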

2) Recipe reviews are a goldmine of data. There are ample suggestions for modifications (additions, deletions, increases, decreases, substitutions). These could be used to create “flexible” recipes, suggesting a range for the quantity of an ingredient, and possible substitutes. In fact, a substitute network reveals global communities of interchangeable ingredients.
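As a toy illustration of the extraction step (the sample reviews and the regular expression are invented for this sketch, not the patterns from the paper), one can mine “substituted X for Y” phrases and tally them into a directed substitution count:

```python
import re
from collections import Counter

# Toy review comments (invented for illustration).
reviews = [
    "Delicious! I substituted margarine for butter.",
    "Great recipe, but I substituted yams for sweet potatoes.",
    "I substituted margarine for butter, and it still came out great.",
    "Had no cranberries, so I substituted raisins for cranberries.",
]

# "substituted X for Y" => X replaces Y, i.e. a directed edge Y -> X.
pattern = re.compile(r"substituted ([\w ]+?) for ([\w ]+?)[\.,]")

substitutions = Counter()
for text in reviews:
    for new, old in pattern.findall(text):
        substitutions[(old, new)] += 1  # original ingredient -> suggested substitute
```

In real review text, of course, phrasing varies wildly (“used X instead”, “swapped in X”, …), so a single pattern like this only scratches the surface.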

3) Ingredient networks can be used to predict recipe ratings. “These networks encode which ingredients go well together, and which can be substituted to obtain superior results, and permit one to predict, given a pair of related recipes, which one will be more highly rated by users.” It appears that the substitute network in particular encodes nutrition information, e.g. users’ preferences for “healthier” variants for a recipe.
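To make the pairwise prediction concrete, here is a hedged sketch with made-up complementarity weights (the paper fits an actual model; none of these numbers are real): score each recipe by the average pairwise weight of its ingredients and predict that the higher-scoring recipe gets the better rating.

```python
from itertools import combinations

# Hypothetical complementarity weights (higher = the pair goes well together).
weight = {
    frozenset(("sweet potato", "brown sugar")): 0.9,
    frozenset(("sweet potato", "marshmallow")): 0.8,
    frozenset(("sweet potato", "anchovy")): -0.7,
    frozenset(("brown sugar", "marshmallow")): 0.6,
    frozenset(("brown sugar", "anchovy")): -0.5,
}

def score(recipe):
    """Average complementarity over all ingredient pairs (0 for unseen pairs)."""
    pairs = list(combinations(recipe, 2))
    return sum(weight.get(frozenset(p), 0.0) for p in pairs) / len(pairs)

classic = {"sweet potato", "brown sugar", "marshmallow"}
variant = {"sweet potato", "brown sugar", "anchovy"}

# Predict that the recipe whose ingredients complement each other best wins.
predicted_winner = max((classic, variant), key=score)
```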

4) The hypothesis presented in Catching Fire, that humans have evolved to prefer cooking methods that extract more energy value from food, is consistent with recipe ratings. Recipes that call for heating (baking, boiling, grilling) are rated on average more highly than those that only call for mechanical preparation methods (chopping, mixing). Chemical methods (marinating & brining) give a slight additional boost.

5) US regional preferences are easily discernible, e.g. frying being popular in the South, and grilling being popular on the West Coast and in the mountain regions. It would be interesting to study how these are affected by the availability of ingredients and cultural influences.

Also, stay tuned for some fantastic related work by YY Ahn, Sebastian Ahnert, James Bagrow and Laszlo Barabasi, getting to the bottom of recipe preferences by analyzing networks of flavor compounds in food pairings.

Finally, a short thanks for some of the tools we used:

Gephi for visualizing the networks
Map generator for detecting communities

Jun 30, 2011
 

As I was sorting my contacts into neat little Google+ circles (talking to self: ‘this one, ah… he’s someone I could talk to about personal things, but I never do, oh, what the heck, let’s call him a “friend”‘), I was reminded that I never blogged about the related paper, soon to be poster-presented at this year’s ICWSM, that I wrote with Debra Lauterbach, Edwin Teng, and Mark Ackerman. Incidentally, Debra is currently a UX researcher at Google and no doubt lent a hand and some brains to Google+.

So, about this friend-rating paper. Previously, Edwin, Debra, and I had written a paper, “I rate you. You rate me. Should we do so publicly?”, showing that person-to-person ratings differ by whether they are given publicly (anonymously or not) or privately. In this follow-up paper, we dove in depth into CouchSurfing’s friendship and trust ratings, analyzing the millions of ratings, conducting a survey, and carrying out Skype interviews. What were we after? Well, initially we just wanted to understand whether the higher alignment of public ratings A->B and B->A, as well as the lack of negative public feedback, were due to reciprocity (or fear of reciprocal action). But the study led us to understand quite a bit more about the nature of trust and friendship. This rather unique data set captures quantitative ratings of trust and friendship on a very large scale. Our results may very well be relevant to products such as Google+ because trust and friendship are sometimes used synonymously (e.g. you can “trust” your friends with the intimate details you post on social media sites), but we thought we’d check that.

OK, to cut to the chase. Recap of the “I rate you. You rate me.” paper.

1) The same person will give higher Epinions ratings to other users’ reviews when identifying herself (as opposed to when she chooses to remain anonymous). But this is only true for rating other people. Product reviews on Amazon are no more favorable (though a bit longer) if a person identifies themselves than if they use a pen name.

2) When ratings are shown publicly (Epinions, CouchSurfing), and there is potential for reciprocity, the ratings are more aligned (i.e. A’s rating of B tends to mirror B’s rating of A).

3) On CouchSurfing, women rate other women more highly than they do men (on both trust and friendship), but men rate men and women about equally.

OK, so here is what we learned in the in-depth CouchSurfing paper.

4) Trust and friendship are not always synonymous. The heatmap below shows how often trust/friendship pair ratings are given. What you can see is that trust tends to increase with the closeness of the friendship (e.g. best friends are almost always highly trusted), but high trust can be allocated to individuals who are not one’s closest friends (of course all of this might be context specific to CouchSurfing, but I do believe it holds more generally).

friendship and trust ratings on CouchSurfing

5) When ratings are not aligned (e.g. A indicates B is a good friend, but B categorizes A as an acquaintance), there is a bit of awkwardness (for both A and B), but most do not attach much importance to numerical ratings. Friendship ratings (which are shown publicly) are more aligned (ρ ~ 0.7) than trust ratings (ρ ~ 0.3), which are not shown to the other person.
Connecting this to Facebook lists and Google+ circles, I’d say they are private for a reason. Though I wonder if A could eventually figure out that they are in B’s acquaintance circle, as opposed to his friend circle, once they hear from C about something that B shared with his friends. Or is there enough content flooding our way that we don’t mind what we’re missing?
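For the curious, the alignment numbers above are just Pearson correlations between A’s rating of B and B’s rating of A across dyads; a toy version of that computation (rating pairs invented for illustration) looks like this:

```python
import math

# Toy dyads: (A's rating of B, B's rating of A) on a 1-5 scale (invented).
public_pairs = [(5, 5), (4, 4), (4, 5), (3, 3), (5, 4)]   # shown publicly
private_pairs = [(5, 2), (3, 4), (2, 5), (4, 3), (1, 4)]  # kept private

def pearson(pairs):
    """Pearson correlation between the two sides of each rating pair."""
    xs, ys = zip(*pairs)
    n = len(pairs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rho_public = pearson(public_pairs)    # public ratings mirror each other
rho_private = pearson(private_pairs)  # private ratings need not
```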

6) Negative ratings are seldom given publicly, in part because the individual being rated can reciprocate. There is also the sense that even if one didn’t have a good experience with someone, there is a chance that someone else would click with them, so why ruin the chance of that happening? As with any categorical system that doesn’t quite capture what individuals want to express, CS users work around it: they insert subtle signals into the text of the reference, e.g.

I’ve gone pretty keen on what certain references mean, and you can tell a. . . you-were-a-nice-person-reference: ”[She] was great. She was very hospitable. She’s a great host.” That can mean in a sense you might be kind of boring.

Or users simply don’t leave a reference:

I either neglect to reference, or write a “positive” response but in a neutral tone.

7) This makes textual references (and their number) far more important than numerical levels to those who are judging, based on someone’s ratings, whether they would like to get to know or spend time with that person:

How important CouchSurfing ratings are to users

8) We also checked whether trust is more a property of a node, and friendship that of an edge, i.e. whether a given individual would always be perceived as highly trustworthy, while their friendship ratings would depend on the characteristics of each tie. We found only very, very weak support for this, in that the per-individual normalized variance in trust ratings was only a bit smaller than the variance in friendship ratings.
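A sketch of that check, with toy incoming ratings I invented (which exaggerate the effect far beyond the very weak one we actually found): compare the average per-person variance of received trust ratings against received friendship ratings, normalized by the variance of the rating scale.

```python
# Toy incoming ratings per person on a 1-5 scale (invented for illustration).
trust = {"alice": [5, 5, 4, 5], "bob": [3, 3, 4, 3], "carol": [4, 4, 4, 5]}
friend = {"alice": [5, 2, 4, 1], "bob": [3, 5, 1, 4], "carol": [2, 5, 3, 5]}

def mean_normalized_variance(ratings_by_person):
    """Average per-person variance of received ratings,
    normalized by the variance of the full rating scale."""
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    scale_var = var([1, 2, 3, 4, 5])  # variance of the 1-5 scale itself
    return sum(var(xs) for xs in ratings_by_person.values()) / (
        len(ratings_by_person) * scale_var
    )

trust_spread = mean_normalized_variance(trust)
friend_spread = mean_normalized_variance(friend)
# If trust were purely a node property, trust_spread would be near zero
# while friend_spread would stay large.
```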

9) Some other neat things you could find out if you read the paper :)

10) Something we couldn’t fit into the paper was users’ tendency not to update friendship and trust status as a relationship evolved. This could be a function of CS’s primary purpose being to help people find one another, as opposed to stay in touch, so that there is little practical utility in making such updates. On the other hand, one might rather carefully groom one’s Google+ circles and Facebook lists, lest one end up inadvertently over- or under-sharing because of out-of-date designations. It will be interesting to see to what extent users are interested in, and capable of, thinking through what each individual means to them and where they “fit” in their information sharing sphere.

 

This week, First Monday published a paper that I have been working on for over two years. Whether that shows a lot of focus, or lack thereof, I’m not sure. The basic question was the following: what kinds of knowledge contributors flourish: those who focus narrowly on a few subject areas, or the polymaths who contribute to many disparate areas? I think of my PhD advisor, Bernardo Huberman, who in addition to publishing in general venues such as Nature, Science and PNAS, has published in physics, computer science, sociology, economics, etc. Anyone who knows Bernardo knows that he is not typical. So my collaborators (Xiao Wei, Jiang Yang, Kevin Nam, Sean Gerrish, Gavin Clarkson) and I had to gather some data in order to pose the question across many, many knowledge contributors. The answer?

Across a wide range of knowledge contribution media (scholarly articles, patents, Wikipedia, online Q&A forums), the more focused individuals make contributions of higher quality, measured respectively as normalized citations, normalized citations, persistence of new words introduced, and percentage of answers selected as best. In the end our R² was rather laughable, but one wouldn’t expect focus alone to explain someone’s success, would one? And of course causality remains elusive. Do individuals who focus contribute higher quality stuff because of their focus, or do they focus on the work that has already brought them success?

Still, I found it remarkable that the same pattern emerged whether one looked at “original” contributions such as articles and patents, or smaller, not-necessarily novel ones, such as Wikipedia edits and question answering.
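Focus itself can be quantified in several ways; one common choice (a reasonable stand-in here, though our exact measure may differ) is the entropy of a contributor’s distribution over topic areas, with lower entropy meaning more focus:

```python
import math
from collections import Counter

def focus_entropy(contributions):
    """Shannon entropy of a contributor's topic distribution
    (lower entropy = more focused)."""
    counts = Counter(contributions)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical contribution histories, one topic label per contribution.
specialist = ["networks"] * 9 + ["physics"]
polymath = ["networks", "physics", "sociology", "economics", "biology"] * 2

h_specialist = focus_entropy(specialist)  # low entropy: highly focused
h_polymath = focus_entropy(polymath)      # maximal entropy over 5 topics
```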

 

The Impact of Boundary Spanning Scholarly Publications and Patents
Authors: Xiaolin Shi, me, Belle Tseng, Gavin Clarkson
Background
Human knowledge and innovation are recorded in two media: scholarly publications and patents. These records not only document new scientific insights and methods, but also carefully cite the prior work upon which each innovation builds.

Methodology

We quantify the impact of information flow across fields using two large citation datasets: one spanning over a century of scholarly work in the natural sciences, social sciences, and humanities, and a second spanning a quarter century of United States patents.
Conclusions
We find that a publication’s citing across disciplines is tied to its subsequent impact. In the case of patents and natural science publications, those that are cited at least once are cited slightly more when they draw on research outside of their area. In contrast, in the social sciences, citing within one’s own field tends to be positively correlated with impact.
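The core quantity here is simple: given a field label for each cited work (the labels and citation lists below are toy examples I invented), compute the fraction of a publication’s references that fall outside its own field.

```python
# Hypothetical field labels for cited works.
field_of = {
    "p1": "physics", "p2": "physics", "p3": "biology",
    "p4": "economics", "p5": "physics",
}

def outside_fraction(own_field, references):
    """Fraction of cited works whose field differs from the citing work's field."""
    outside = sum(1 for r in references if field_of[r] != own_field)
    return outside / len(references)

# A physics paper citing entirely within vs. mostly across fields:
narrow = outside_fraction("physics", ["p1", "p2", "p5"])
boundary_spanning = outside_fraction("physics", ["p1", "p3", "p4"])
```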
The paper came out last week. PLoS One has these neat features where readers can rate the article, leave comments in general, or comment on particular parts of the text. So far… nothing. I’m a bit bummed. Either no one has noticed our potentially controversial article, or it’s not as controversial as I had assumed.

 

Eytan will be presenting our work on Social Influence and the Diffusion of User-Created Content in Second Life this Friday. In the meantime, the study has been getting a bit of geeky press :).

Apr 08, 2009
 

This project was really fun. I got to collaborate with Andrei Kirilenko, Celso Brunetti, and Jeff Harris (+ Paul Tsyhura) at the CFTC (Commodity Futures Trading Commission) on the properties of time series of network metrics computed over automatically matched brokers trading futures contracts. Way fun. I had never seriously done time series analysis, and admittedly that big fat recently purchased book on the econometrics of financial time series is still sitting around largely unread. But I digress… What’s really neat is that you can see the flow of information into the market reflected in the network variables. The methods we developed can hopefully be used in the future to detect market manipulation and such.
Available as a preprint: “On the informational properties of trading networks”
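A minimal sketch of the kind of computation involved (toy trades and a hypothetical record format; the actual metrics and windows in the paper differ): bucket matched trades into time windows and track a simple network statistic, such as density, over time.

```python
from collections import defaultdict

# Toy matched trades: (timestamp, buyer_id, seller_id) -- invented.
trades = [
    (1, "A", "B"), (2, "A", "C"), (3, "B", "C"),
    (11, "A", "B"), (12, "D", "E"),
    (21, "A", "B"), (22, "A", "C"), (23, "A", "D"), (24, "B", "D"),
]

WINDOW = 10  # time units per window

def density(edges, nodes):
    """Undirected network density: realized edges over possible edges."""
    n = len(nodes)
    return 0.0 if n < 2 else 2 * len(edges) / (n * (n - 1))

# Bucket trades by window, accumulating the trading network in each.
windows = defaultdict(lambda: (set(), set()))  # window -> (edges, nodes)
for t, buyer, seller in trades:
    edges, nodes = windows[t // WINDOW]
    edges.add(frozenset((buyer, seller)))
    nodes.update((buyer, seller))

# One density value per window, in time order.
density_series = [density(e, n) for _, (e, n) in sorted(windows.items())]
```

In the actual study we looked at richer metrics than density, but the basic move, turning a stream of trades into a time series of network statistics, is the same.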

Mar 30, 2009
 

I’m happy to say that Matt Simmons has decided to enroll at SI as a PhD student (previously he was a masters student here). Over the years, he has worked with Drago Radev on co-author and citation networks, and with Matt Hindman and me on a project to analyze the US code (the federal law) as a network. Hopefully I’ll be able to tell you about that project in the next blog post, whenever that may be.
In the meantime, welcome Matt!
P.S. NetSI is the name of my research group: Networks @ SI.
