Statistics
Data Analysis
Notation
Chapter 2: Simple Models: Definitions of Error and Parameter Estimates
Data
Basic scores or observations that we want to analyse
Model
Compact description or representation of the data.
Models range from simple to complex predictions of the data, e.g. a prediction of the percentage of households that have internet access. The prediction for a particular attribute is an unknown parameter which is estimated from the data.
Error
The amount by which the model fails to represent the data accurately. Often referred to as ‘the residual’ - the part that is left over after we have made our best prediction.
Error = Data - Model
We want to reduce error so that the model represents the data accurately.
How to reduce error and improve models - Research methods
Better research designs
Better data collection procedures
More reliable instruments
See p. 2 for references that help with such issues
How to reduce error and improve models - Data analysis
Make the model’s predictions conditional on additional information about each observation. Add parameters to the model and estimate those parameters so that the model provides a good fit to the data by making the error as small as possible.
For example
Consider creating a model for internet availability in each state of America.
A simple a priori assumption about internet availability allows for a high level of error
Data (internet for each state) = Model (a priori prediction of 44% internet availability) + Error
Yi (Data set) = .44 + ERROR
Yi=.44+ERROR
An improved model still predicts the same internet usage for each state, but leaves the predicted value as an unspecified parameter to be estimated
Using the data set, we may estimate it as the AVERAGE internet availability for the 50 states
Yi (Data Set)=B0 (average using data set) + ERROR
Yi=B0+ERROR
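A minimal sketch of the two models above. The six values are made-up proportions, not the real state data, and ERROR is totalled here as the sum of squared residuals, which is an assumption (these notes do not yet say how error is aggregated). Under that measure, the average is the value of B0 that gives the smallest error.

```python
# Hypothetical proportions of households with internet access for six states
# (made-up numbers for illustration, not the real data set).
data = [0.40, 0.55, 0.43, 0.50, 0.37, 0.57]


def sum_squared_error(data, predictions):
    """Total ERROR as the sum of squared (Data - Model) residuals (assumed measure)."""
    return sum((y - p) ** 2 for y, p in zip(data, predictions))


# Model with the a priori value fixed at .44: no parameter is estimated.
fixed_error = sum_squared_error(data, [0.44] * len(data))

# Model Yi = B0 + ERROR, with B0 estimated as the average of the data.
b0 = sum(data) / len(data)
mean_error = sum_squared_error(data, [b0] * len(data))

print(f"B0 estimated from data: {b0:.3f}")
print(f"ERROR with .44 fixed:    {fixed_error:.4f}")
print(f"ERROR with B0 estimated: {mean_error:.4f}")
```

Because B0 is chosen from the data, its error can never exceed the error of any fixed guess such as .44.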
Reduce error further by including more parameters to make conditional predictions
The west and east coasts of America gain innovations quicker than the middle of the country. Therefore, using the different time zones, we can adjust the predicted percentage up or down.
Yi = B0 (average using data set) + B1 (adjustment according to time zone) + ERROR if the state is in the Eastern or Pacific time zones.
Yi = B0 (average using data set) - B1 (adjustment according to time zone) + ERROR if the state is in the Central or Mountain time zones.
The model is conditional on the time zone in which the state is located.
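A rough sketch of the conditional model, assuming the adjustment B1 is added for Eastern/Pacific states and subtracted for Central/Mountain states. The data are made up, and error is again totalled as a sum of squared residuals (an assumption). With this up/down coding, the least squares predictions are simply the two group means.

```python
# Hypothetical (proportion with internet, time zone) pairs; made-up values.
states = [
    (0.55, "Eastern"), (0.57, "Pacific"), (0.50, "Eastern"),
    (0.40, "Central"), (0.37, "Mountain"), (0.43, "Central"),
]
COASTAL = ("Eastern", "Pacific")


def mean(values):
    return sum(values) / len(values)


coastal = [y for y, tz in states if tz in COASTAL]
interior = [y for y, tz in states if tz not in COASTAL]

# B0 is the midpoint of the two group means; B1 is half their difference.
b0 = (mean(coastal) + mean(interior)) / 2
b1 = (mean(coastal) - mean(interior)) / 2


def predict(time_zone):
    """Adjust the prediction up for coastal zones and down for the others."""
    return b0 + b1 if time_zone in COASTAL else b0 - b1


# Compare with the single-parameter model that predicts the overall average.
overall = mean([y for y, _ in states])
error_one_mean = sum((y - overall) ** 2 for y, _ in states)
error_time_zone = sum((y - predict(tz)) ** 2 for y, tz in states)

print(f"B0 = {b0:.3f}, B1 = {b1:.3f}")
print(f"ERROR with one average:    {error_one_mean:.4f}")
print(f"ERROR with time-zone term: {error_time_zone:.4f}")
```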
Additionally, you may add a continuous rather than a categorical predictor
Xi can represent the percentage of college graduates in the state; a higher percentage will adjust the internet usage prediction upwards too.
Yi=B0+B1Xi+ERROR
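A small sketch of estimating B0 and B1 for the continuous-predictor model by ordinary least squares. Both lists are made-up values, and least squares (minimising the sum of squared residuals) is an assumed estimation method, not something stated in these notes.

```python
# Hypothetical values for six states (made up for illustration).
college = [30.0, 33.0, 27.0, 24.0, 22.0, 26.0]   # Xi: % college graduates
internet = [0.55, 0.57, 0.50, 0.40, 0.37, 0.43]  # Yi: proportion with internet

n = len(internet)
x_bar = sum(college) / n
y_bar = sum(internet) / n

# Ordinary least squares estimates of the slope (B1) and intercept (B0).
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(college, internet))
      / sum((x - x_bar) ** 2 for x in college))
b0 = y_bar - b1 * x_bar

predictions = [b0 + b1 * x for x in college]

# Error with the predictor versus error when every state gets the average.
error_with_x = sum((y - p) ** 2 for y, p in zip(internet, predictions))
error_mean_only = sum((y - y_bar) ** 2 for y in internet)

print(f"B0 = {b0:.4f}, B1 = {b1:.4f}")
print(f"ERROR with average only: {error_mean_only:.4f}")
print(f"ERROR with Xi included:  {error_with_x:.4f}")
```

A positive B1 means states with more college graduates get a higher predicted value, which is the adjustment described above.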
We could reduce error further by creating a prediction for each individual state - however this does not offer us a theory that can be tested or any useful information.
It would merely be a statement of fact.
There is a conflict between reducing error and providing the best description of the DATA.
The ultimate goal is to find the smallest, simplest model that provides an adequate description of the data so that the error is not too large.
The model without the additional parameters is the compact model (Model C)
The alternative augmented model (Model A) includes all the parameters of Model C plus some additional parameters. These additional parameters will either reduce error or leave it unchanged, therefore
ERROR(A) is less than or equal to ERROR(C)
Is it worth adding parameters to potentially reduce error?
Calculating the proportional reduction in error (PRE) represents the proportion of Model C’s error that is reduced or eliminated when we replace it with the more complex Model A.
PRE = (ERROR(C) - ERROR(A)) / ERROR(C)
Alternatively
PRE = 1 - ERROR(A) / ERROR(C)
If additional parameters do no good then PRE=0
If Model A provides a perfect fit, then ERROR(A)=0 and (Assuming Model C does not also provide a perfect fit) PRE=1
The higher the PRE, the more error is reduced by the augmented model. The smaller the value of the PRE, the more we will want to stick with the simpler compact model.
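The PRE calculation can be sketched as a small function. The error values passed in below are made-up numbers; any consistent measure of total error for the two models could be used.

```python
def pre(error_c, error_a):
    """Proportional reduction in error when Model A replaces Model C."""
    if error_c <= 0:
        raise ValueError("Model C already fits perfectly, so PRE is undefined")
    return (error_c - error_a) / error_c


print(pre(50.0, 30.0))  # 0.4: Model A removes 40% of Model C's error
print(pre(50.0, 50.0))  # 0.0: the extra parameters do no good
print(pre(50.0, 0.0))   # 1.0: Model A fits perfectly
```

The guard on `error_c` reflects the condition that Model C must not itself provide a perfect fit.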
Is the new model valuable? Is a 40% reduction enough?
Consider how MANY parameters are causing the 40% reduction in error. If it is just 1, then that factor may be quite an important one.
PRE per parameter added will be a useful index.
We will also be more impressed with a given PRE as the difference between the number of parameters that were added and the number of parameters that could have been added becomes greater. Hence, our inferential statistics will consider how many parameters could have been added to Model C to create Model A but were not.
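A hedged sketch of these two indices. It assumes the largest possible model has one parameter per observation (a separate prediction for each state), so the number of parameters that could have been added but were not is n minus the number in Model A. The function names and the example counts are illustrative only.

```python
def pre_per_parameter(pre, params_c, params_a):
    """PRE divided by the number of parameters added to Model C."""
    added = params_a - params_c
    if added <= 0:
        raise ValueError("Model A must have more parameters than Model C")
    return pre / added


def parameters_not_added(n_observations, params_a):
    """Parameters that could still have been added (assumes a maximum of n)."""
    return n_observations - params_a


# Same PRE of .40 with 50 observations, reached in two hypothetical ways.
one_added = pre_per_parameter(0.40, params_c=1, params_a=2)
ten_added = pre_per_parameter(0.40, params_c=1, params_a=11)

print(one_added, parameters_not_added(50, 2))
print(ten_added, parameters_not_added(50, 11))
```

The same 40% reduction looks more impressive when it comes from one added parameter with many left unused than when it takes ten.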
Is it more influential OVER AND ABOVE the other parameters?
Model C corresponds to the null hypothesis and Model A corresponds to the alternative hypothesis.
The null hypothesis is that all the parameters included in Model A but not in Model C are zero, or equivalently that there is no difference in error between Models A and C.
If we reject Model C in favour of Model A, then we reject the null hypothesis in favour of the alternative hypothesis that is implied by the difference between Models C and A.
We conclude that it is unreasonable to presume that all the extra parameter values in Model A are zero.