CS 224W Project Proposal
###Introduction
###Survey
###Reaction
###Facebook Location Paper
Notes:
* problems with rank
* prediction is for the home address, not a real-time location.
* the prediction can be wrong for someone who has recently moved or is travelling.
* using max likelihood assumes independence of friendships? What happened to triads?
###Jure's Contagion paper:
Notes and Thoughts:
* modelling contagions as conditionally independent loses some information.
- Instead, why not keep two sets of contagions: one, like a max-heap, holding the contagions that have had the most influence on the user, and another modelling the recency effect, where contagions the user has interacted with most recently carry a bigger weight (see the sketch after these notes).
* a highly infectious URL can help a less infectious but highly correlated URL go viral, while it can suppress a slightly less infectious but unrelated URL.
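A minimal sketch of the two-set idea above. The class name, the `top_k`/`recent_k`/`decay` parameters, and the exponential decay are all my own illustrative assumptions, not anything from the paper:

```python
import heapq
from collections import deque

class ExposureMemory:
    """Toy model of the two-set idea: a 'most influential' view plus a recency view."""

    def __init__(self, top_k=10, recent_k=10, decay=0.8):
        self.top_k = top_k                    # how many top influencers to report
        self.recent = deque(maxlen=recent_k)  # most recent exposures, newest last
        self.decay = decay                    # per-exposure recency decay (assumed)
        self.influence = {}                   # contagion -> cumulative influence score

    def expose(self, contagion, influence_score):
        # accumulate influence and remember the exposure order
        self.influence[contagion] = self.influence.get(contagion, 0.0) + influence_score
        self.recent.append(contagion)

    def top_influencers(self):
        # the "max-heap" view: contagions with the largest cumulative influence
        return heapq.nlargest(self.top_k, self.influence.items(), key=lambda kv: kv[1])

    def recency_weights(self):
        # the recency view: newer exposures get exponentially larger weight
        weights = {}
        for age, contagion in enumerate(reversed(self.recent)):
            weights[contagion] = weights.get(contagion, 0.0) + self.decay ** age
        return weights
```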
####Understanding the model they have used
* Consider the probability of a user adopting a piece of content based on what they have seen before.
* Estimate the prob of a user being infected by a contagion given the sequence of contagions she has been exposed to so far.
* Can we improve the model by modelling the dependence between contagions? We say X is conditioned on Y1 to Yk, but a time step ago Y1 was itself the X being predicted. Pre-computation or some form of sampling could let us estimate the joint distribution instead of assuming independence; the independence assumption feels like standard hand-waving.
* The contagion interaction function models the effect contagion uj has on the current contagion ui when the user was exposed to uj k exposures ago.
* One way to model this is with K different W×W matrices (one per exposure gap), where rows represent the possible ui's and columns the uj's.
* So the best way to do this is to identify classes or clusters to which each contagion belongs and model the interactions between these classes.
* One way to set up these classes is via a membership matrix M of size W × T, where W is the number of contagions and T is the number of clusters (classes); here M(i,j) is the probability that contagion ui is a member of cluster cj.
* This new matrix gives rise to a new interaction function. Instead of modelling the interaction between two contagions, we now model the `interaction between two clusters`, scaled by the membership probabilities.
* Therefore, to find the interaction between any two contagions ui and uj: for every cluster t that ui may belong to and every cluster s that uj may belong to, multiply the two membership probabilities by the interaction function for clusters t and s, and sum over both t and s in a double summation (see the sketch below).
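A small numerical sketch of the double summation just described, under assumed shapes: `M` is a W × T membership matrix with rows summing to 1, and `A` is a T × T cluster-level interaction matrix (both names are mine, not the paper's notation). The double sum collapses to the matrix product M A Mᵀ:

```python
import numpy as np

def contagion_interaction(M, A, i, j):
    """sum_t sum_s M[i, t] * M[j, s] * A[t, s] for contagions u_i and u_j."""
    T = M.shape[1]
    return sum(M[i, t] * M[j, s] * A[t, s] for t in range(T) for s in range(T))

def all_pairwise_interactions(M, A):
    """Vectorised form of the same double sum: (M A M^T) is W x W."""
    return M @ A @ M.T

# Tiny consistency check on random inputs.
W, T = 5, 2
rng = np.random.default_rng(0)
M = rng.random((W, T))
M /= M.sum(axis=1, keepdims=True)          # each row is a distribution over clusters
A = rng.random((T, T))                     # assumed cluster-to-cluster interaction values
assert np.isclose(contagion_interaction(M, A, 1, 3),
                  all_pairwise_interactions(M, A)[1, 3])
```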
**Didn't understand how their clustering matrix ends up being low rank.**
Questions:
- How do they find the source of the URL (the first person who posted it)? [Start from time zero and work forwards? BFS?]
- For example: if someone copies a URL instead of retweeting it, is he a source or an infected party?
- An interesting thing would simply be to find the source of the contagion. Why has this not been done?
- This could help in finding people who are trying to game the system.
- Pretty sure that in the log-likelihood we are trying to find the parameters that match the observations in D as closely as possible?
- `95% for training and only 5% for testing?` Isn't that kind of bad? You could overfit the model, and since the test and cross-validation sets are so small, the overfitting would likely go unnoticed.
- The paper claims that user bias made no difference and that incorporating it made their predictions worse. This raises the question: are targeted ads useless, or (more likely) did they model user bias in the wrong way?
###Uncovering Structure of Nested/Overlapping Communities
Notes:
* Existing methods of finding communities work well when communities are disjoint, but in real life communities intersect.
* Each node `i` can belong to `m_i communities`
* Further, for communities `A and B` the number of overlapping nodes is `S_AB`
* Community degree: the number of other communities that community **alpha** overlaps with
* size of the community is simply the number of nodes it has
* To begin characterizing these communities we introduce CDFs `P(m)`, `P(S_ov)`, `P(deg_comm)` and `P(size_comm)`
* For example, **P(S_ov)** is the proportion of overlaps whose size is at least the one we are currently looking at (a cumulative distribution).
* *Basic intuition:* Communities are made of several complete graphs, i.e. several k-clique graphs that share a lot of nodes form a community.
* *More formal definition:* A k-clique community is the set of nodes we can reach by starting from a k-clique and moving through adjacent cliques. `Adjacent cliques` are ones which share at least k-1 nodes.
* Here the path by which we reach a community matters; we want deep ties to be exhibited.
* Keep in mind that a single node can belong to multiple communities
* Less strict norms for communities
- reduce k (or, equivalently, the subgraph need not be a complete clique)
- no cut links or cut nodes (removing a single edge or node should not split the community into disjoint parts)
- density of edges
- `Allows overlaps`
* Their algorithm runs on an unweighted, undirected graph (see the sketch after these notes).
* Reducing k is like reducing the resolution of the graph.
* For weighted graphs, only follow edges whose weight is above a certain threshold w*
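A minimal sketch of k-clique (clique percolation) communities on a small toy graph, using networkx's `k_clique_communities`; the toy graph is my own and is only meant to show that a node can sit in two communities at once:

```python
import networkx as nx
from networkx.algorithms.community import k_clique_communities

# Toy graph: two pairs of triangles hang off node 4, but the two groups share
# fewer than k-1 = 2 nodes with each other, so they stay separate communities
# and node 4 overlaps both.
G = nx.Graph()
G.add_edges_from([
    (1, 2), (1, 3), (2, 3), (2, 4), (3, 4),   # cliques {1,2,3} and {2,3,4}
    (4, 5), (4, 6), (5, 6), (5, 7), (6, 7),   # cliques {4,5,6} and {5,6,7}
])

for community in k_clique_communities(G, 3):   # k = 3; lowering k lowers the resolution
    print(sorted(community))
# Expected: {1, 2, 3, 4} and {4, 5, 6, 7} -- node 4 belongs to both, i.e. overlap.
```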
###Conclusion:
* This paper gives an algorithm that finds clusters of people in a way that closely reflects the real world and allows significant overlaps.
###Meme-Tracking and the Dynamics of the News Cycle
Notes:
* track and "see" how certain phrases travel relatively intact through the internet.
* utility to us: find the mutated versions of similar hashtags and group them together.
* Terminology:
- **item** - blog post or news article
- **phrase** - quote or text whose variants we want to track
- **phrase graph**: nodes are phrases and edges point from one phrase to another variation of the same
* We want to construct the phrase graph and partition it so that the phrases within each component are variants of one another (a rough sketch follows these notes).
* Edge from p to q such that p is strictly shorter than q and p is a close textual variant of q.
* **Dead end**: didn't find it to be very useful to our project.
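Even though this looks like a dead end for the project, here is a rough sketch of the phrase-graph construction described above. The "close variant" test (word-level overlap) and the example phrases are my own simplifying assumptions, standing in for the paper's directed edit-distance criterion:

```python
import networkx as nx

def is_close_variant(p, q, min_overlap=3):
    # Simplified variant test (assumption): enough shared words between p and q.
    p_words, q_words = p.split(), q.split()
    shared = set(p_words) & set(q_words)
    return len(shared) >= min(min_overlap, len(p_words))

def build_phrase_graph(phrases):
    # Nodes are phrases; edge p -> q when p is strictly shorter than q
    # and p looks like a close variant of q.
    G = nx.DiGraph()
    G.add_nodes_from(phrases)
    for p in phrases:
        for q in phrases:
            if len(p.split()) < len(q.split()) and is_close_variant(p, q):
                G.add_edge(p, q)
    return G

phrases = [
    "lipstick on a pig",
    "you can put lipstick on a pig",
    "you can put lipstick on a pig but it is still a pig",
]
G = build_phrase_graph(phrases)
# Weakly connected components then group the variants of the same phrase together.
print([sorted(c) for c in nx.weakly_connected_components(G)])
```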
###Romero's Paper
* `Stickiness` and `persistence` are both involved in the adoption of a hashtag. **Stickiness** could be considered the prior probability of infection, while **persistence** can be quantified by the marginal benefit gained from repeated exposure to the same hashtag.
* Build a network based on X's use of the `@twitter-user` mention: an edge indicates X is paying attention to Y, something that is not obvious from the follow relationship.
* For a given user `X` we call the set of nodes that `X` has an edge to (under the above specification) the `neighbour set N(X)` of `X`.
* As members of *N(X)* start tweeting a particular hashtag *H*, we watch for when **X** adopts it. We are looking for the effect of repeated exposure to a particular topic on X (a rough sketch of this exposure-curve computation follows these notes).
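A rough sketch of the exposure-curve computation implied above: estimate the probability of a user adopting hashtag H given how many members of N(X) used it first. The input formats (`neighbours`, `first_use_time`, `horizon`) are my own assumptions, not the paper's data layout:

```python
from collections import defaultdict

def exposure_curve(neighbours, first_use_time, horizon):
    """
    neighbours:      dict user -> set of users they pay attention to, i.e. N(X)
    first_use_time:  dict user -> time of the user's first tweet with hashtag H
                     (users who never used H are absent)
    horizon:         end of the observation window
    Returns p(k): fraction of users with exactly k prior exposures who adopted H.
    """
    exposed = defaultdict(int)   # k -> number of users observed with k exposures
    adopted = defaultdict(int)   # k -> how many of them went on to adopt H
    for user, nbrs in neighbours.items():
        cutoff = first_use_time.get(user, horizon)
        k = sum(1 for n in nbrs if first_use_time.get(n, horizon) < cutoff)
        exposed[k] += 1
        if user in first_use_time:
            adopted[k] += 1
    return {k: adopted[k] / exposed[k] for k in sorted(exposed)}
```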
### Patterns of Temporal Variation in Online Media
###Project Proposal
###References
###Experiments
Papers Read:
- paper1
- paper2
- paper3
CS 229 Project Readings
###Evaluation and Analysis of Personalised Searches
Notes:
- Personalisation is not always useful
- Long term and short term user history is relevant at different times.
- Personalization works best on queries that have large entropy
- Also, click-based methods work best on repeated queries.
- To effectively combine different rankings, use Borda's ranking fusion method (we were unable to get relevance scores from MSN); see the sketch at the end of these notes.
- Click-based approach for repeated queries (about 1/3 of queries are repeats)
- user profile: a weighted vector over 67 pre-decided topics
- KNN for group-based personalization
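A minimal sketch of Borda-count rank fusion as mentioned above, assuming each input ranking is just an ordered list of document ids (best first) and that unranked documents simply receive no points from that ranking:

```python
from collections import defaultdict

def borda_fuse(rankings):
    """Combine several rankings: each list awards (n - position) points per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for position, doc in enumerate(ranking):
            scores[doc] += n - position
    # Highest total score first; ties keep their first-seen order.
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a personalised ranking with the original (non-personalised) one.
print(borda_fuse([["d1", "d2", "d3"], ["d2", "d1", "d4"]]))
# d1 and d2 both score 5 points; d3 and d4 both score 1.
```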