  • Statistics

  • Data Analysis

  • Notation

  • Data = Model + Error

  • Chapter 2: Simple Models: Definitions of Error and Parameter Estimates

  • Data
    Basic scores or observations that we want to analyse

  • Model
    Compact description or representation of the data.

From simple to complex predictions of the data, e.g. a prediction of the proportion of households that have internet access. The prediction for a particular attribute is an unknown parameter that is estimated from the data.

  • Error

    The amount by which the model fails to represent the data accurately. Often referred to as ‘the residual’ - the part that is left over after we have made our best prediction.

Error = Data - Model

    We want to reduce error so that the model accurately represents the data.

  • How to reduce error and improve models - Research methods

* Better research designs
    * Better data collection procedures
    * More reliable instruments

    See p. 2 for references on these issues.

  • How to reduce error and improve models - Data analysis

* Make the model’s predictions conditional on additional information about each observation.
    * Add parameters to the model and estimate those parameters so that the model provides a good fit to the data, making the error as small as possible.

  • For example

    In creating a model of internet availability in each state of America:

    A simple a priori assumption about internet availability allows a high level of error.

    Data (internet availability for each state) = Model (a priori prediction of 44% internet availability) + Error

    Yi=.44+ERROR
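As a minimal sketch of the a priori model, with made-up availability numbers, the error can be computed directly from Error = Data - Model:

```python
# Hypothetical internet-availability proportions for five states (made-up data).
data = [0.38, 0.51, 0.44, 0.47, 0.55]

prediction = 0.44                        # fixed a priori guess; no parameter estimated
errors = [y - prediction for y in data]  # Error = Data - Model for each state
sse = sum(e ** 2 for e in errors)        # summed squared error as an aggregate measure
print(round(sse, 4))
```

Because the prediction is fixed in advance, nothing is estimated from the data; the error simply measures how far the guess misses.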

  • Improved model predicts the same internet usage for each state, but leaves the predicted value as an unspecified parameter

    Using the data set, it may use the AVERAGE internet availability for the 50 states:

    Yi=B0+ERROR

    where B0 is the average estimated from the data set.
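A minimal sketch with made-up data: estimating B0 as the sample mean is the least-squares choice for this one-parameter model, and it leaves less error than any other single prediction would.

```python
# Hypothetical internet-availability proportions for five states (made-up data).
data = [0.38, 0.51, 0.44, 0.47, 0.55]

b0 = sum(data) / len(data)              # least-squares estimate of B0: the mean
sse = sum((y - b0) ** 2 for y in data)  # error remaining after the model's prediction
print(round(b0, 2), round(sse, 4))
```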

  • Reduce error further by including more parameters to make conditional predictions

The West and East coasts of America gain innovations more quickly than the middle of the country. Therefore, using the different time zones, we can adjust the predicted percentage up or down.

    Yi=B0+B1+ERROR if the state is in the Eastern or Pacific time zones.

    Yi=B0-B1+ERROR if the state is in the Central or Mountain time zones.

    Here B0 is the average from the data set and B1 is the adjustment according to time zone.

    The model is conditional on the time zone in which the state is located.
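The conditional model can be sketched as a small function, with hypothetical values for B0 and B1:

```python
# Hypothetical parameter values: B0 is the overall average, B1 the time-zone adjustment.
b0, b1 = 0.47, 0.03

def predict(time_zone):
    """Conditional prediction: coastal zones adjusted up, middle of the country down."""
    if time_zone in ("Eastern", "Pacific"):
        return b0 + b1
    return b0 - b1

print(predict("Pacific"), predict("Central"))
```

The prediction now differs across observations, which is exactly what "conditional on the time zone" means.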

  • Additionally you may add a continuous rather than categorical parameter

Xi can represent the percentage of college graduates in the state; a higher percentage adjusts the internet usage prediction upward.

    Yi=B0+B1Xi+ERROR
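A sketch of fitting B0 and B1 by ordinary least squares from first principles, using made-up values for Xi (proportion of college graduates) and Yi (internet availability):

```python
# Made-up data: Xi = proportion of college graduates, Yi = internet availability.
x = [0.20, 0.25, 0.30, 0.35, 0.40]
y = [0.38, 0.44, 0.47, 0.51, 0.55]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
# Least-squares slope: covariance of x and y divided by the variance of x.
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx  # intercept chosen so the line passes through the means
print(round(b0, 3), round(b1, 3))
```

A positive B1 means that states with a higher proportion of college graduates get a higher predicted internet availability.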

  • We could reduce error further by creating a prediction for each individual state; however, this offers neither a theory that can be tested nor any useful information.

    It would merely be a statement of fact.

  • We have a conflict of reducing error and providing the best description of DATA.

The ultimate goal is to find the smallest, simplest model that provides an adequate description of the data so that the error is not too large.

    The model without the additional parameters is the compact model (Model C).

    The alternative augmented model (Model A) includes all the parameters of Model C plus some additional parameters. These additional parameters will either reduce error or leave it unchanged, therefore:

    ERROR(A) is less than or equal to ERROR(C)

  • Is it worth adding parameters to potentially reduce error?

    The proportional reduction in error (PRE) represents the proportion of Model C’s error that is reduced or eliminated when we replace it with the more complex Model A.

    PRE=(ERROR(C)-ERROR(A))/ERROR(C)

    Alternatively

PRE=1-ERROR(A)/ERROR(C)

    If additional parameters do no good then PRE=0

If Model A provides a perfect fit, then ERROR(A)=0 and (assuming Model C does not also provide a perfect fit) PRE=1
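The PRE formula can be sketched as a small function; the error values below are hypothetical summed squared errors for the two models:

```python
def pre(error_c, error_a):
    """Proportional reduction in error when Model C is replaced by Model A."""
    return 1 - error_a / error_c  # equivalent to (error_c - error_a) / error_c

print(round(pre(0.0215, 0.0170), 3))  # Model A removes part of Model C's error
print(pre(0.0215, 0.0215))            # extra parameters do no good: PRE = 0
print(pre(0.0215, 0.0))               # Model A fits perfectly: PRE = 1
```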

  • The higher the PRE, the more error the augmented model removes. The smaller the PRE, the more we will want to stick with the simpler compact model.

  • Is the new model valuable? is 40% reduction enough?

Consider how MANY parameters are causing the 40% reduction in error. If it is just one, then that factor may be quite an important one.

    PRE per parameter added will be a useful index.

    We will also be more impressed with a given PRE as the difference between the number of parameters that were added and the number of parameters that could have been added becomes greater. Hence, our inferential statistics will consider how many parameters could have been added to Model C to create Model A but were not.

    Is it more influential OVER AND ABOVE the other parameters?
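PRE per parameter added can be sketched the same way, with hypothetical numbers: the same 40% reduction is far more impressive when achieved by one parameter than when spread across four.

```python
def pre_per_parameter(pre, n_added):
    """Rough index of how much error reduction each added parameter buys."""
    return pre / n_added

# A 40% reduction achieved by one added parameter vs. spread across four:
print(pre_per_parameter(0.40, 1))
print(pre_per_parameter(0.40, 4))
```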

  • Model C corresponds to the null hypothesis and Model A corresponds to the alternative hypothesis.

The null hypothesis is that all the parameters included in Model A but not in Model C are zero, or equivalently that there is no difference in error between Models A and C. If we reject Model C in favour of Model A, then we reject the null hypothesis in favour of the alternative hypothesis that is implied by the difference between Models C and A.

    We conclude that it is unreasonable to presume that all the extra parameter values in Model A are zero.
