# CS 224W Project Proposal

### Introduction

### Survey

### Reaction

### Facebook Location Paper

Notes:

- problems with rank
- prediction is for home address, not real-time location.
- the model can break for someone who has recently moved or is travelling.
- using maximum likelihood assumes independence of friendships; what happened to triads?

### Jure’s Contagion paper:

Notes and Thoughts:

- Modelling contagions as conditionally independent loses some information.
- Instead, why not keep two sets of contagions: one is like a max-heap holding the contagions that have had the most influence on the user, while the other models the recency effect, where contagions the user has interacted with most recently get a bigger weight (a hedged sketch of this idea follows below).
- A highly infectious URL can help a less infectious but highly correlated URL go viral, while it can suppress a slightly less infectious but unrelated URL.
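
A minimal sketch of the two-set idea above, assuming a per-user exposure log of `(contagion, influence, time)` tuples; the function name, influence scores, and decay constant are our own illustrations, not anything from the paper:

```
import heapq
from math import exp

def exposure_summary(exposures, top_k=3, decay=0.5):
    """Two views of a user's exposure history:
    - the top_k most influential contagions (a max-heap style query)
    - a recency weight per contagion that decays exponentially with age.
    `exposures` is a list of (contagion_id, influence, time) tuples."""
    most_influential = heapq.nlargest(top_k, exposures, key=lambda e: e[1])

    latest = max(t for _, _, t in exposures)
    recency_weight = {cid: exp(-decay * (latest - t)) for cid, _, t in exposures}

    return most_influential, recency_weight

# Toy example: the influence scores and times below are made up.
exposures = [("urlA", 0.9, 0.0), ("urlB", 0.2, 1.0), ("urlC", 0.5, 4.0)]
print(exposure_summary(exposures, top_k=2))
```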

#### Understanding the model they have used

- Consider the probability of a user adopting a piece of content based on what they have seen before.
- Estimate the probability of a user being infected by a contagion given the sequence of contagions she has been exposed to so far.
- Can we improve the model by modelling the dependence between contagions? We are saying X is conditioned on Y1 to Yk, but a time step ago Y1 was itself an X. Pre-computing or some sort of sampling could help us find the joint probability distribution rather than assuming independence; the independence assumption seems to be standard hand-waving.
- The contagion interaction function models the effect contagion uj has on the current contagion ui when the user was exposed to uj k exposures ago.
- One way to model this is with K different W x W matrices, where rows represent the possible ui's and columns the uj's.
- So the best way is to identify classes or clusters to which each contagion belongs and model the interactions between these classes.
- One way to set this up is via a membership matrix M of size W x T, where W is the number of contagions and T is the number of clusters (classes); M(i, j) is the probability that contagion ui is a member of cluster cj.

This membership matrix gives rise to a new interaction function: instead of modelling the interaction between two contagions we are now modelling the `interaction between two clusters`, scaled by the membership probabilities.

- Therefore, to find the interaction between any two contagions ui and uj: for every pair of clusters (t, s), take the probability that ui belongs to cluster t and the probability that uj belongs to cluster s, multiply these two probabilities with the interaction function for clusters t and s, and sum over all pairs in a double summation (a numeric sketch follows).
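
A minimal numeric sketch of that double summation, assuming a membership matrix `M` (W x T) and a cluster-level interaction matrix `Lambda` (T x T); the variable names and toy values are ours, not the paper's:

```
import numpy as np

W, T = 4, 2                       # 4 contagions, 2 clusters (toy sizes)
rng = np.random.default_rng(0)

# M[i, t] = probability that contagion ui belongs to cluster t (rows sum to 1).
M = rng.random((W, T))
M /= M.sum(axis=1, keepdims=True)

# Lambda[t, s] = interaction between cluster t and cluster s.
Lambda = rng.normal(size=(T, T))

def pairwise_interaction(i, j):
    """Double summation: sum_t sum_s M[i, t] * M[j, s] * Lambda[t, s]."""
    return sum(M[i, t] * M[j, s] * Lambda[t, s]
               for t in range(T) for s in range(T))

# Vectorised version for all pairs at once: a W x W matrix of rank at most T.
Delta = M @ Lambda @ M.T
assert np.isclose(Delta[0, 1], pairwise_interaction(0, 1))
print(Delta)
```

Since `Delta = M @ Lambda @ M.T` has rank at most T, the contagion-level interaction matrix comes out low rank whenever T << W, which may be what the low-rank remark below refers to.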

**Didn’t understand how their clustering matrix is low rank.**

Questions:

- How do they find the source (the first person who posted a URL) of the URL? Start from time zero and work forwards? BFS?
- For example: if someone copies a URL instead of retweeting it, is he a source or an infected party?
- An interesting task would simply be to find the source of the contagion. Why has this not been done?
- This could help in finding people who are trying to game the system.
- Pretty sure that in the log-likelihood we are trying to find the parameters that match the observations in D as closely as possible?
- `95% for training and only 5% for testing?` Isn’t that kind of bad? You could overfit the model, and since the test and cross-validation sets are so small it would likely go unnoticed.
- The paper claims that user bias made no difference and that incorporating it made their predictions worse. This raises the question: are targeted ads useless, or (more likely) did they model user bias in the wrong way?

### Uncovering the Structure of Nested/Overlapping Communities

Notes:

- Existing methods for finding communities work well if the communities are disjoint, but in real life communities intersect.
- Each node `i` can belong to `m_i` communities.
- Further, for communities `A` and `B`, the number of overlapping nodes is `S_AB`.
- Community degree of a community **alpha**: the number of other communities that alpha overlaps with; the size of a community is simply the number of nodes it has.
- To begin characterizing these communities we introduce the CDFs `P(m)`, `P(S_ov)`, `P(deg_comm)` and `P(size_comm)`.
- Example: **P(S_ov)** refers to the proportion of overlaps that are larger than the one we are currently looking at.
- *Basic intuition:* communities are made of several complete graphs, i.e. several k-clique graphs that share a lot of nodes form a community.
- *More formal definition:* a k-clique community is the set of nodes we can reach by starting from a k-clique and moving to adjacent cliques. `Adjacent cliques` are ones which share at least k-1 nodes.
- Here we model that the way we reach a community is important; we want deep ties to be exhibited.
- Keep in mind that a single node can belong to multiple communities.
- Less strict norms for communities:
    - reduce k (or, equivalently, say the subgraph need not be a complete clique)
    - no cut links or cut nodes (i.e. cutting a single edge or node should not split the community into disjoint parts)
    - density of edges
- `Allows overlaps`
- Their algorithm is used on an unweighted, undirected graph (a sketch with `networkx` follows after this list).
- Reducing k is like reducing the resolution of the graph.
- For a weighted graph, only follow edges above a certain threshold w*.
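
A minimal sketch of this clique-percolation idea on a toy graph, using the `k_clique_communities` routine in `networkx`; the library choice and the toy edges are ours, the paper only describes the algorithm:

```
import networkx as nx
from networkx.algorithms.community import k_clique_communities

# Toy graph: two triangles that share a single node, plus one dangling edge.
G = nx.Graph([(0, 1), (1, 2), (0, 2),   # first triangle
              (2, 3), (3, 4), (2, 4),   # second triangle
              (4, 5)])                  # dangling edge

# k = 3: a community is the union of 3-cliques reachable through
# adjacent cliques, i.e. cliques sharing at least k - 1 = 2 nodes.
for community in k_clique_communities(G, 3):
    print(sorted(community))
# The two triangles share only node 2 (< k - 1 nodes), so they remain two
# overlapping communities, and node 2 belongs to both of them.
```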

### Conclusion:

- This paper gives an algorithm that finds clusters of people in a way that better reflects the real world by allowing significant overlaps.

### Meme-tracking and the Dynamics of the News Cycle

Notes:

- track and “see” how certain phrases travel relatively intact through the internet.
- utility to us: find the mutated versions of similar hashtags and group them together.
- Terminology:
    - **item**: blog post or news article
    - **phrase**: quote or text whose variants we want to track
    - **phrase graph**: nodes are phrases and edges point from one phrase to another variant of the same phrase

- We want to construct the phrase graph and partition it in such a way that the components of a partition are related to each other.
- Edge from p to q such that p is strictly shorter than q, i.e. q is a longer variant of p (a construction sketch follows after this list).
- **Dead end**: didn’t find it to be very useful to our project.
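
A minimal sketch of building such a phrase graph, using a crude character-level similarity as the "variant of" test; the threshold and helper names are our assumptions, not the paper's exact edge criterion:

```
from difflib import SequenceMatcher

def is_variant(p, q, threshold=0.7):
    """Crude test that q is a variant of p: high character-level similarity."""
    return SequenceMatcher(None, p, q).ratio() >= threshold

def build_phrase_graph(phrases):
    """Directed edges from a strictly shorter phrase p to a longer variant q."""
    return [(p, q) for p in phrases for q in phrases
            if len(p) < len(q) and is_variant(p, q)]

phrases = ["lipstick on a pig",
           "you can put lipstick on a pig",
           "you can put lipstick on a pig, but it's still a pig"]
for p, q in build_phrase_graph(phrases):
    print(f"{p!r} -> {q!r}")
```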

### Romero’s Paper

- `Stickiness` and `Persistence` are both involved in the adoption of a hashtag. **Stickiness** could be considered the prior probability of infection, while **Persistence** can be quantified by the marginal benefit achieved from repeated exposure to the same hashtag.
- Build a network using the `@twitter-user` mention: an edge from X to Y indicates X is paying attention to Y, something that is not obvious from the follows relationship.
- For a given user `X`, we call the set of nodes that `X` has an edge to (under the above specification) the `neighbour set N(X)` of `X`.
- As members of *N(X)* start tweeting a particular hashtag *H*, we watch for when **X** will adopt the hashtag. We are looking for the effect of repeated exposure to a particular topic on X (a small counting sketch follows below).
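
A minimal counting sketch of that exposure effect: for each user X, count how many members of N(X) used the hashtag before X's first use, and estimate the chance of adopting right after the k-th exposure. The dictionary layout and names are illustrative assumptions, not the paper's pipeline:

```
from collections import defaultdict

def exposure_curve(first_use, neighbours):
    """first_use: {user: time of the user's first tweet with hashtag H, or None}
    neighbours: {user: set of users this user pays attention to, i.e. N(X)}
    Returns {k: fraction of users exposed at least k times who adopted
    right after the k-th exposure}."""
    exposed_at_k = defaultdict(int)   # users who reached at least k exposures
    adopted_at_k = defaultdict(int)   # users who adopted right after the k-th exposure

    for x, nbrs in neighbours.items():
        t_x = first_use.get(x)
        exposure_times = sorted(
            first_use[n] for n in nbrs
            if first_use.get(n) is not None and (t_x is None or first_use[n] < t_x))
        for k in range(1, len(exposure_times) + 1):
            exposed_at_k[k] += 1
        if t_x is not None and exposure_times:
            adopted_at_k[len(exposure_times)] += 1

    return {k: adopted_at_k[k] / exposed_at_k[k] for k in sorted(exposed_at_k)}

# Toy data: "x" adopts after two neighbours used H, "y" never adopts.
first_use = {"a": 1.0, "b": 2.0, "x": 3.0, "y": None}
neighbours = {"x": {"a", "b"}, "y": {"a"}}
print(exposure_curve(first_use, neighbours))   # {1: 0.0, 2: 1.0}
```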

### Patterns of Temporal Variation in Online Media

### Project Proposal

### References

### Experiments

Papers Read:

- paper1
- paper2
- paper3

# CS 229 Project Readings

### Evaluation and Analysis of Personalised Searches

Notes:

- Personalisation is not always useful.
- Long-term and short-term user history are relevant at different times.
- Personalization works best on queries that have large entropy.
- Also, click-based methods work best on repeated queries.
- To effectively combine different rankings, use Borda's rank-fusion method (they were unable to get relevance scores from MSN); see the sketch after this list.
- Click-based approach for repeated queries (about 1/3 of queries are repeats).
- User profile: a weighted vector over 67 pre-decided topics.
- KNN for group-based personalization.
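
A minimal sketch of Borda-count rank fusion for combining a non-personalized ranking with a personalized one; the document names and rankings are made up:

```
from collections import defaultdict

def borda_fuse(rankings):
    """Borda count: a list of length n gives its item at position p (0-based)
    n - p points; items are re-ranked by total points across all lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for position, doc in enumerate(ranking):
            scores[doc] += n - position
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: fuse a global ranking with a click-history-based ranking.
global_rank = ["docA", "docB", "docC", "docD"]
personal_rank = ["docC", "docA", "docD", "docB"]
print(borda_fuse([global_rank, personal_rank]))   # ['docA', 'docC', 'docB', 'docD']
```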