• CS 224W Project Proposal

  • CS 229 Project Readings

  • Introduction

  • Survey

  • Reaction

  • Project Proposal

  • References

  • Evaluation and Analysis of Personalised Searches

    Notes:

    • Personalisation is not always useful.
    • Long-term and short-term user history are relevant at different times.
    • Personalization works best on queries that have large entropy.
    • Click-based methods work best on repeated queries.
    • To effectively combine different rankings, use Borda’s ranking fusion method (a minimal sketch follows this list). (Unable to get relevance scores from MSN.)
    • Click-based approach suits repeated queries (about one third of queries are repeats).
    • User profile: a weighted vector over 67 pre-decided topics.
    • KNN for group-based personalization.
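
    A minimal sketch of Borda-style rank fusion, assuming we only have ranked lists and no relevance scores (as noted above); the function name and toy document IDs are placeholders, not from the paper.

    ```python
    from collections import defaultdict

    def borda_fuse(rankings):
        """Combine several ranked lists (best first) with Borda's count.

        Each ranker gives its top item len(list) points, the next item one
        point fewer, and so on; documents missing from a list get no points
        from that ranker. The fused ranking sorts by total points.
        """
        scores = defaultdict(float)
        for ranking in rankings:
            n = len(ranking)
            for position, doc in enumerate(ranking):
                scores[doc] += n - position
        return sorted(scores, key=scores.get, reverse=True)

    # e.g. fuse a click-based ranking with a profile-based one
    print(borda_fuse([["d1", "d3", "d2"], ["d3", "d1", "d4"]]))
    ```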
  • Facebook Location Paper

    Notes:

    • problems with rank
    • Prediction is for home address, not real-time location.
    • Edge case: someone who has recently moved or is travelling.
    • Using maximum likelihood assumes independence of friendships; what happened to triads?

    Jure’s Contagion paper:

    Notes and Thoughts:

    * Modelling contagions as conditionally independent loses some information.
            - Instead, why not have two sets of contagions: one is like a max-heap holding the contagions that have had the most influence over you, and the other models the recency effect, where contagions you have interacted with most recently carry a bigger weight.
    * A highly infectious URL can help a less infectious but highly correlated URL go viral, while it can suppress a slightly less infectious but unrelated URL.

    Understanding the model they have used

    * Consider the probability of a user adopting a piece of content based on what they have seen before.
    * Estimate the probability of a user being infected by a contagion given the sequence of contagions she has been exposed to so far.
    * Can we improve the model by modelling the dependence between contagions? We are saying X is conditioned on Y1 to Yk, but a time step ago Y1 itself played the role of X. Precomputing or some sort of sampling could help us find the joint probability distribution rather than assuming independence; the independence assumption seems like standard hand-waving.
    * The contagion interaction function models the effect contagion uj has on our current contagion ui when uj was seen k exposures before ui.
    * One way to model this is with k different W x W matrices, where rows represent the different possible ui's and columns the uj's.
    * So the best way to do this is to identify classes or clusters to which each contagion belongs and model the interactions between these classes.
    * One way to set this up is via a membership matrix of size W x T, where W is the number of contagions and T is the number of clusters (classes); here M(i,j) is the probability that contagion ui is a member of cluster cj.

    This new matrix gives rise to a new interaction function. Instead of modelling the interaction between two contagions, we are now modelling the interaction between two clusters scaled by the membership probabilities.

    * Therefore, to find the interaction between any two contagions ui and uj: for every pair of clusters t and s, we take the probability that ui belongs to cluster t, the probability that uj belongs to cluster s, and the interaction function for clusters t and s, and multiply the three inside a double summation (a sketch follows below).
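
    A minimal sketch of that double summation, assuming a membership matrix M of shape W x T and a cluster-level interaction matrix B of shape T x T (both variable names are mine, not the paper's). The double sum is exactly the matrix product M B Mᵀ, which also suggests why the resulting W x W matrix can be low rank: its rank is at most T.

    ```python
    import numpy as np

    W, T = 1000, 10                              # number of contagions, number of clusters
    rng = np.random.default_rng(0)
    M = rng.dirichlet(np.ones(T), size=W)        # hypothetical membership matrix, rows sum to 1
    B = rng.random((T, T))                       # hypothetical cluster-level interaction strengths

    # Contagion-level interaction:
    #   interaction[i, j] = sum_t sum_s M[i, t] * M[j, s] * B[t, s]
    # which is the matrix product M @ B @ M.T (shape W x W, rank at most T).
    interaction = M @ B @ M.T

    # Spot-check one pair against the explicit double summation.
    i, j = 3, 7
    explicit = sum(M[i, t] * M[j, s] * B[t, s] for t in range(T) for s in range(T))
    assert np.isclose(interaction[i, j], explicit)
    ```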

    Didn’t understand how their clustering matrix is low rank.

    Questions:

    • How do they find the source of the URL (the first person who posted it)? Start from time zero and work forwards? BFS?
    • For example: if someone copies a URL instead of retweeting it, are they a source or an infected party?
    • An interesting thing would simply be to find the source of the contagion. Why has this not been done?
    • This could help in finding people who are trying to game the system.
    • Pretty sure that in the log-likelihood we are trying to find the parameters that match the observations in D as closely as possible?
    • 95% for training and only 5% for testing? Isn’t that questionable? You could overfit the model, and since the test and cross-validation sets are so small, it would likely go unnoticed.
    • The paper claims that user bias made no difference and that incorporating it made their predictions worse. This leads to the question: are targeted ads useless, or (more likely) did they model user bias the wrong way?
  • Uncovering structure of nested/Overlapping Communities

    Notes:

    • Existing methods of finding communities work well when communities are disjoint, but in real life they intersect.
    • Each node i can belong to m_i communities.
    • Further, for communities A and B, the number of overlapping nodes is S_AB.
    • Community degree: the number of other communities that community alpha overlaps with.
    • size of the community is simply the number of nodes it has
    • To begin characterizing these communities we introduce the CDFs P(m), P(S_ov), P(deg_comm) and P(size_comm).
    • For example, P(S_ov) refers to the proportion of overlaps that are larger than the value we are currently looking at.
    • Basic intuition: communities are made of several complete subgraphs, i.e. several k-cliques that share a lot of nodes form a community.
    • More formal definition: a k-clique community is the set of nodes we can reach by starting from a k-clique and moving to adjacent cliques, where adjacent cliques are ones that share at least k-1 nodes (see the sketch after this list).
    • Here we model that the way we reach a community is important; we want deep ties to be exhibited.
    • Keep in mind that a single node can belong to multiple communities.
    • Less strict norms for communities:
      • reduce k (or, equivalently, say it need not be a complete clique)
      • no cut links or cut nodes (i.e. removing a single edge or node should not split the community)
      • density of edges
      • allows overlaps
    • Their algorithm is used on unweighted, undirected graphs.
    • Reducing k is like reducing the resolution of the graph.
    • For weighted graphs, only follow edges whose weight is above a certain threshold w*.
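
    A minimal sketch of the k-clique percolation idea above, using NetworkX's built-in implementation; the toy graph is my own example, not from the paper.

    ```python
    import networkx as nx
    from networkx.algorithms.community import k_clique_communities

    # Toy graph: two pairs of adjacent triangles that share only node 4 with each other.
    G = nx.Graph()
    G.add_edges_from([
        (1, 2), (2, 3), (1, 3), (2, 4), (3, 4),   # triangles {1,2,3} and {2,3,4} share k-1 = 2 nodes
        (4, 5), (4, 6), (5, 6), (5, 7), (6, 7),   # triangles {4,5,6} and {5,6,7} share k-1 = 2 nodes
    ])

    # k = 3: a community is everything reachable by rolling a triangle over
    # adjacent triangles (cliques sharing at least k-1 = 2 nodes).
    for community in k_clique_communities(G, 3):
        print(sorted(community))
    # -> [1, 2, 3, 4] and [4, 5, 6, 7]; node 4 belongs to both, so the
    #    communities overlap, which disjoint methods cannot express.
    ```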

    Conclusion:

    • This paper gives an algorithm to find clusters of people in a way that better reflects the real world by allowing significant overlaps.
  • MemeTracking and the Dynamics of the News Cycle

    Notes:

    • track and “see” how certain phrases travel relatively intact through the internet.
    • utility to us: find the mutated versions of similar hashtags and group them together.
    • Terminology:
      • item - blog post or news article
      • phrase - quote or text whose variants we want to track
      • phrase graph: nodes are phrases and edges point from one phrase to another variation of the same phrase
    • We want to construct the phrase graph and partition it in such a way that the components of a partition are related to each other.
    • Edge from p to q such that p is strictly shorter than q (a rough sketch follows this list).
    • Dead end: didn’t find it to be very useful to our project.
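
    A rough sketch of the phrase-graph construction described above, under a simplified variant test (word-level inclusion); the real paper also allows a small word-level edit distance, so this is only illustrative.

    ```python
    import networkx as nx

    def build_phrase_graph(phrases):
        """Directed phrase graph: edge p -> q when p is strictly shorter than q
        and p occurs as a chunk of q (crude stand-in for the paper's
        edit-distance / inclusion criterion)."""
        g = nx.DiGraph()
        g.add_nodes_from(phrases)
        for p in phrases:
            for q in phrases:
                if len(p.split()) < len(q.split()) and p in q:
                    g.add_edge(p, q)
        return g

    g = build_phrase_graph([
        "lipstick on a pig",
        "you can put lipstick on a pig",
        "put lipstick on a pig but it is still a pig",
    ])
    print(list(g.edges()))   # both longer variants receive an edge from the short phrase
    ```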
  • Romero’s Paper

    • Stickiness and persistence are both involved in the adoption of a hashtag. Stickiness can be considered the prior probability of infection, while persistence can be quantified by the marginal benefit achieved from repeated exposure to the same hashtag.
    • Build a network based on @twitter-user mentions: an edge from X to Y indicates X is paying attention to Y, something that is not obvious from the follows relationship.
    • For a given user X we call the set of nodes that X has an edge to (under the above specification) the neighbour set N(X) of X.
    • As members of N(X) start tweeting a particular hashtag H, we watch for when X will adopt the hashtag. We are looking for the effect of repeated exposure to a particular topic on X (a counting sketch follows below).
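
    A minimal sketch of the exposure counting this implies; the data layout (first_use, neighbours) and the function name are hypothetical, and a full Romero-style analysis would aggregate these counts into an exposure curve P(adopt after exactly k exposures).

    ```python
    def exposures_before_adoption(user, hashtag, first_use, neighbours):
        """Count how many members of N(user) tweeted `hashtag` before the
        user's own first use of it (treated as infinity if the user never adopts).

        first_use[u]  -> dict mapping hashtag -> time of u's first tweet with it
        neighbours[u] -> N(u), the set of users u has @-mentioned
        """
        t_adopt = first_use.get(user, {}).get(hashtag, float("inf"))
        return sum(
            1
            for n in neighbours.get(user, set())
            if first_use.get(n, {}).get(hashtag, float("inf")) < t_adopt
        )

    # Toy data with hypothetical timestamps.
    first_use = {"x": {"#h": 10}, "a": {"#h": 3}, "b": {"#h": 7}, "c": {}}
    neighbours = {"x": {"a", "b", "c"}}
    print(exposures_before_adoption("x", "#h", first_use, neighbours))  # -> 2
    ```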
  • Patterns of Temporal Variation in Online Media

  • Experiments

  • Papers Read:

    • paper1
    • paper2
    • paper3
{"cards":[{"_id":"379f3e192f62629f9d000019","treeId":"379f3df52f62629f9d000017","seq":1,"position":1,"parentId":null,"content":"CS 224W Project Proposal\n"},{"_id":"379f437a45ba5376c8000007","treeId":"379f3df52f62629f9d000017","seq":1,"position":1,"parentId":"379f3e192f62629f9d000019","content":"###Introduction\n"},{"_id":"379f45a645ba5376c8000008","treeId":"379f3df52f62629f9d000017","seq":1,"position":2,"parentId":"379f3e192f62629f9d000019","content":"###Survey"},{"_id":"379f46f245ba5376c8000009","treeId":"379f3df52f62629f9d000017","seq":1,"position":3,"parentId":"379f3e192f62629f9d000019","content":"###Reaction"},{"_id":"379fc76945ba5376c8000046","treeId":"379f3df52f62629f9d000017","seq":1,"position":1,"parentId":"379f46f245ba5376c8000009","content":"\n\n\n\n\n\n\n\n\n\n###Facebook Location Paper\n Notes:\n * problems with rank \n * prediction is for house address not realtime. \n * someone who has recently moved/ on travel.\n * using max likelihood assumes independence of friendships ? what happened to triads ?? \n\n###Jure's Contagion paper: \nNotes and Thoughts:\n * modelling contagions as being conditionally independent, loosing some info that way\n - Instead why not have two sets of contagions; one is like a max heap where the ones that have had most influence over you exist and the other set models the recency effect where contagions you have interacted with most recently have a bigger weight. \n * a highly infectious URL can help a less infectious but highly correlated url go viral while it can suppress a slightly less infectious but unrelated URL. \n\n\n####Understanding the model they have used\n\n *Consider probability of a user adopting a piece of content based on what they have seen before\n * Estimate the prob of a user being infected by a contagion given the sequence of contagions she has been exposed to so far.\n * Can we make an improvement in the model by modeling the dependence of the contagions. We are saying X is conditioned on Y1 to Yk but a time step ago Y1 was X. Pre computing or some sort of sampling can help us find the joint probability distribution rather than assuming independence. This seems to be a standard hand waving that is going on\n *Contagion interaction function models the effect contagion uj has on our current contagion ui when ui was exposed to uj k exposures ago. \n *A way to model this will be in a k different WXW matrices where rows represent different possible ui's and columns uj. \n * So the best way to do this is to identify classes or clusters to which each contagion belongs and model the interactions between these classes. \n *A way to set up this class is via a Membership matrix where which is of size W X T where W is number of contagions and T is the number of clusters(classes) , here M(i,j) is the probability that contagion ui is a member of cluster cj. \n\nThis new matrix gives rise to a new interaction function. Instead of modelling the interaction between two contagions we are now modeling the `interaction between two clusters` scaled by the probablities.\n * Therefore now to find the interaction between any two clusters: For any contagions ui and uj we find the probability that ui belongs to any class/cluster t and for any such case we find the probability that uj belongs to any class/cluster s and inside a double summation we multiply the above two probabilities with the interaction function for these two clusters s and t. 
\n\u0001**Didn't understand how their clustering matrix is a low rank **\n\nQuestions:\n- How do they find the source(first person who posted a URL) [Start from time zero and work forwards ? BFS ? ] of the url ? \n- For example: if some one copies a URL instead of retweeting is he a source or an infected party: \n- An interesting thing would simply be to find the source of the contagion? Why has this not been done.\n- This could help in finding people who are trying to game the system.\n- Pretty sure that in the log likelihood we are trying to find the parameters such that we match the observations in D as closely as possible ? \n- `95% for training and only 5% for testing ?`isn't that kinda bad. You could overfit the model and since your test and cross-validation are so small it will likely not matter.\n- The paper claims that user bias made no difference and incorporating it made their prediction worse. This leads to the question: Is targeted ads useless ? or (more likely) did they model user bias in a wrong manner ? \n\n"},{"_id":"37de1837ecde9148f1000026","treeId":"379f3df52f62629f9d000017","seq":1,"position":1.5,"parentId":"379f46f245ba5376c8000009","content":"###Uncovering structure of nested/Overlapping Communities\nNotes:\n* Existing methods of finding communities work well if they are disjoint but in real life we have intersections\n* Each node `i` can belong to `m_i communities` \n* Further, for communities `A and B` the number of overlapping nodes is `S_AB`\n* Community degree: number of edges **alpha** that overlap with another community\n* size of the community is simply the number of nodes it has\n* To begin characterizing these communities we introduce CDF's `P(m)`, `P(S_ov)`, `P(deg_comm)` and `P(size_comm)`\n* Example **P(S_ov)** refers to the proportion of overlaps that are greater than the current one we are looking at right now. \n* *Basic Intuition:* Communites are made of several complete graphs. i:e several k-clique graphs who share a lot of nodes are a community \n* *More formal definition*: A k-clique community is one which refers to the nodes we can reach by starting of from a k-clique graph and moving to adjacent cliques. `Adjacent Cliques` are ones which share atleast k-1 nodes. \n* Here we model that the way we reach a community is important , we want deep ties to be exhibited. \n* Keep in mind that a single node can belong to multiple communities\n* Less strict norms for communities \n - reduce k (or equivalently say need not be a complete clique) \n - no cut links or cut nodes (that is cutting a edge or node ) will result in a disjoint community) \n - density of edges\n - `Allows overlaps`\n* Their algo used on unweighted undirected graph. \n* Reducing k is like reducing the resolution of the graph. \n* For weighted graph only follow edges above a certain threshold w* \n\n###Conclusion: \n\n* This paper gives an algorithm to find clusters of people in a way that represents the real world the most and allows significant overlaps. \n\n\n"},{"_id":"37deaffcecde9148f100002a","treeId":"379f3df52f62629f9d000017","seq":1,"position":1.75,"parentId":"379f46f245ba5376c8000009","content":"##MemeTracking and the Dynamics of the News Cycle\n\nNotes:\n* track and \"see\" how certain phrases travel relatively intact through the internet. \n* utility to us: find the mutated versions of similar hashtags and group them together. 
\n* Terminology:\n - **item** - blog post or news article\n - **phrase** - quote or text whose variants we want to track \n - **phrase graph**: nodes are phrases and edges point from one phrase to another variation of the same \n* We want to construct the phrase graph and partition it in such a way that the components of a partition are related to each other. \n* Edge from p to q such that p is strictly shorter than \n* **Dead end ** didn't find it to be very useful to our project\n"},{"_id":"37d1847316890e47f1000010","treeId":"379f3df52f62629f9d000017","seq":1,"position":2,"parentId":"379f46f245ba5376c8000009","content":"###Romero's Paper\n* `Stickiness` and `Persistence` both involved in adoption of hashtag. **Stickiness** could be considered the prior probability of infection while **Persistence** is can be quantified by the marginal benefit that is achieved from repeated exposure to the same hashtag.\n* Build a network based on X using the `@twitter-user` name. This indicates X is paying attention to Y, something that is not obvious from the follows relationship.\n* For a given user `X` we call the set of nodes that `X` has an edge to (under the above specification) as the `neighbour set = N(X) ` of `X`\n* As members of *N(X)* start tweeting a particular hashTag *H* we start looking for when **X** will adopting the hashtag. We are looking for the effect of repeated exposure to a particular topic on X. \n"},{"_id":"37de9ebd5c1bb3ec76000015","treeId":"379f3df52f62629f9d000017","seq":1,"position":3,"parentId":"379f46f245ba5376c8000009","content":"### Patterns of Temporal Variation in Online Media\n\n"},{"_id":"379f4a0145ba5376c800000a","treeId":"379f3df52f62629f9d000017","seq":1,"position":4,"parentId":"379f3e192f62629f9d000019","content":"###Project Proposal"},{"_id":"379f4e4f45ba5376c800000c","treeId":"379f3df52f62629f9d000017","seq":1,"position":5,"parentId":"379f3e192f62629f9d000019","content":"###References"},{"_id":"379f4bd245ba5376c800000b","treeId":"379f3df52f62629f9d000017","seq":1,"position":0.5,"parentId":"379f4e4f45ba5376c800000c","content":"###Experiments"},{"_id":"379f4eaf45ba5376c800000d","treeId":"379f3df52f62629f9d000017","seq":1,"position":1,"parentId":"379f4e4f45ba5376c800000c","content":"Papers Read:\n- paper1 \n- paper2 \n- paper3"},{"_id":"37cef5e616890e47f100000e","treeId":"379f3df52f62629f9d000017","seq":1,"position":2,"parentId":null,"content":"CS 229 Project Readings\n"},{"_id":"37cef67216890e47f100000f","treeId":"379f3df52f62629f9d000017","seq":1,"position":1,"parentId":"37cef5e616890e47f100000e","content":"###Evaluation and Analysis of Personalised searches.\n\nNotes:\n- Personalisation is not always useful\n- Long term and short term user history is relevant at different times. \n- Personalization works on queries which have large entropy\n- Also, click based methods work on repeated queries the best. \n- To effectively combine different rankings use Borda's ranking fusion method. (unable to get relavance score from MSN) \n- Click approach for repeated query, (about 1/3rd of queries are repeats) \n- user profile ; weighted vector on 67 pre-decided topics. \n- KNN for group based personalization\n\n"}],"tree":{"_id":"379f3df52f62629f9d000017","name":"CS Projects","publicUrl":"224wproposal"}}