Tuesday, November 12, 2019

Simple Linear Regression

Simple linear regression is a statistical method for summarising and studying the association between two continuous, quantitative variables. It deals with two measures that describe how strong the linear relationship in the data is. One variable, denoted x, is known as the predictor variable, and the other, denoted y, is known as the response variable. Any discussion of simple linear regression is expected to touch on deterministic and statistical relationships, the concept of least squares, and the interpretation of b0 and b1, the intercept and slope of the estimated regression line. There is also a distinction between the population regression line and the estimated regression line. Linearity is measured using the correlation coefficient r, which ranges from -1 to 1: values near -1 or 1 indicate a strong linear association and values near 0 a weak one (https://onlinecourses.science.psu.edu/stat501/node/250).

History of simple linear regression

Karl Pearson developed a rigorous treatment of the applied statistical measure known as the Pearson Product Moment Correlation. It grew out of the thinking of Sir Francis Galton, who originated the modern notions of correlation and regression and who contributed to biology, psychology and applied statistics. Galton's fascination with genetics and heredity provided the initial inspiration that led to regression and the Pearson Product Moment Correlation. The thinking that led to the Pearson Product Moment Correlation began with a vexing problem of heredity: understanding how strongly the characteristics of one generation of living things are exhibited in the next generation. Galton approached the problem by using the sweet pea to check the similarity of characteristics (Bravais, A. (1846)).
The sweet pea was chosen because it can self-fertilise: daughter plants show genetic differences from the mother plant without the involvement of a second parent, which avoids the statistical problem of assessing the genetic contributions of both parents. The first insight into regression came from a two-dimensional diagram plotting the sizes of the mother peas (the independent variable) against the sizes of the daughter peas (the dependent variable). Galton used this representation of the data to show what statisticians today call regression. From his plot he realised that the median weight of daughter seeds from a particular size of mother seed approximately described a straight line with a positive slope less than 1. "Thus he naturally reached a straight regression line, and the constant variability for all arrays of one character for a given character of a second. It was, perhaps, best for the progress of the correlational calculus that this simple special case should be promulgated first; it is so easily grasped by the beginner" (Pearson 1930, p. 5). The idea was later generalised to the more complex form known as multiple regression (Galton, F. (1894)).

Importance of linear regression

Statisticians use the term linear regression when interpreting the association in data from a particular survey, research study or experiment. The linear relationship is used in modelling: modelling one explanatory variable x and one response variable y calls for the simple linear regression approach. Simple linear regression is broadly useful both in methodology and in practical application. The model is used not only in statistics but also in many areas of biological, social science and environmental research. Simple linear regression is important because it gives an indication of what is to be expected, mostly for the monitoring and corrective purposes of some disciplines (April 20, 2011, plaza).
Description of linear regression

The simple linear regression model is written y = β0 + β1x + ε. This equation shows how x is associated with y, and it includes an error term ε. The error term accounts for the variability in y that the linear relationship with x does not explain. The parameters β0 and β1 describe the population regression line β0 + β1x. The model for the mean response is E(y) = β0 + β1x, where β0 is the intercept, β1 is the slope, and E(y) is the mean of y at a given value of x. To test for a linear association, the null hypothesis H0 states that there is no linear relationship between the two variables (β1 = 0), and the alternative hypothesis H1 states that there is a linear relationship (β1 ≠ 0).

Background of simple linear regression

Galton used descriptive statistics to generalise his work across different heredity problems. The opportunity he needed to complete the analysis of these data came when he realised that, if the degree of association between the variables was held constant, the slope of the regression line could be described once the variability of the two measures was known. Galton believed he had estimated a single heredity constant that generalised to multiple inherited characteristics. He wondered why, if such a constant existed, the observed slopes in the parent-child plots varied so much across these characteristics. Noticing that variability differed between generations, he arrived at the idea that the differences in the regression slopes he obtained were due solely to differences in variability between the various sets of measurements. In modern terms, this principle can be illustrated by assuming a constant correlation coefficient but varying the standard deviations of the two variables involved.
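The principle of a constant correlation with varying standard deviations can be checked numerically. The sketch below is a minimal illustration, not Galton's own calculation: the data values and the helper names (mean, sd, corr, ls_slope) are invented for this example. Rescaling y changes its standard deviation but not its correlation with x, and in every case the least-squares slope equals r multiplied by the ratio of the standard deviations.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sd(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def corr(x, y):
    """Pearson product-moment correlation coefficient r."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ls_slope(x, y):
    """Least-squares slope of the regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]  # same spread as x

# Rescaling y changes its standard deviation but not its correlation with x.
datasets = {
    "sd(y) equal to sd(x)": y,
    "sd(y) halved": [0.5 * b for b in y],
    "sd(y) doubled": [2.0 * b for b in y],
}

results = {}
for label, ys in datasets.items():
    r = corr(x, ys)
    slope = ls_slope(x, ys)
    # Third entry: the slope predicted by r times the ratio of the SDs.
    results[label] = (r, slope, r * sd(ys) / sd(x))
    print(f"{label}: r = {r:.4f}, slope = {slope:.4f}, "
          f"r*sy/sx = {results[label][2]:.4f}")
```

The correlation printed for the three data sets is identical, while the slope moves in step with the ratio of the standard deviations.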
On his plot he observed the correlation in each data set. He then considered three data sets: in the first, the standard deviation of Y is the same as that of X; in the second, the standard deviation of Y is less than that of X; in the third, the standard deviation of Y is greater than that of X. The correlation remains constant across the three data sets even though the slope of the line changes as a result of the differences in variability between the two variables. He used the rudimentary regression equation y = r(Sy/Sx)x to describe the relationship between his paired variables. He used an estimated value of r because he had no means of calculating it. The expression (Sy/Sx) was a correction factor that adjusted the slope according to the variability of the measures. He also realised that the ratio of the variability of the two measures was the key factor in determining the slope of the regression line.

The uses of simple linear regression

Linear regression is a common statistical data analysis technique. It is used to determine the extent to which there is a linear relationship between a dependent variable and one or more independent variables (exactly one in simple linear regression). The dependent variable is measured on a continuous scale (e.g. a 0-100 test score), and the independent variable(s) can be measured on either a categorical scale (e.g. male versus female) or a continuous scale. There are several other assumptions that the data must satisfy in order to qualify for simple linear regression. Simple linear regression is similar to correlation in that the purpose is to measure the extent to which there is a linear relationship between two variables. The major difference between the two is that correlation makes no distinction between the two variables, while regression treats one as dependent and the other as independent. Specifically, the purpose of simple linear regression is to "predict" the value of the dependent variable based on the values of one or more independent variables (https://www.statisticallysignificantconsulting.com/RegressionAnalysis.htm).

References

Bravais, A. (1846), "Analyse Mathematique sur les Probabilites des Erreurs de Situation d'un Point," Memoires par divers Savans, 9, 255-332.
Duke, J. D. (1978), "Tables to Help Students Grasp Size Differences in Simple Correlations," Teaching of Psychology, 5, 219-221.
FitzPatrick, P. J. (1960), "Leading British Statisticians of the Nineteenth Century," Journal of the American Statistical Association, 55, 38-70.
Galton, F. (1894), Natural Inheritance (5th ed.), New York: Macmillan and Company.
https://onlinecourses.science.psu.edu/stat501/node/250
https://www.statisticallysignificantconsulting.com/RegressionAnalysis.htm
Ghiselli, E. E. (1981), Measurement Theory for the Behavioral Sciences, San Francisco: W. H. Freeman.
Goldstein, M. D., and Strube, M. J. (1995), "Understanding Correlations: Two Computer Exercises," Teaching of Psychology, 22, 205-206.
Karylowski, J. (1985), "Regression Toward the Mean Effect: No Statistical Background Required," Teaching of Psychology, 12, 229-230.
Paul, D. B. (1995), Controlling Human Heredity, 1865 to the Present, Atlantic Highlands, N.J.: Humanities Press.
Pearson, E. S. (1938), Mathematical Statistics and Data Analysis (2nd ed.), Belmont, CA: Duxbury.
Pearson, K. (1896), "Mathematical Contributions to the Theory of Evolution. III. Regression, Heredity and Panmixia," Philosophical Transactions of the Royal Society of London, 187, 253-318.
Pearson, K. (1922), Francis Galton: A Centenary Appreciation, Cambridge University Press.
Pearson, K. (1930), The Life, Letters and Labors of Francis Galton, Cambridge University Press.
Williams, R. H. (1975), "A New Method for Teaching Multiple Regression to Behavioral Science Students," Teaching of Psychology, 2, 76-78.
Simple Linear Regression
Stat 326 – Introduction to Business Statistics II
Review – Stat 226
Spring 2013

Review: Inference for Regression

Example: Real Estate, Tampa Palms, Florida

Goal: Predict the sale price of a residential property based on the appraised value of the property.
Data: sale price and total appraised value of 92 residential properties in Tampa Palms, Florida.

[Figure: scatterplot of Sale Price (in Thousands of Dollars) against Appraised Value (in Thousands of Dollars)]

We can describe the relationship between x and y using a simple linear regression model of the form μy = β0 + β1x.

response variable y: sale price
explanatory variable x: appraised value
relationship between x and y: linear, strong, positive

We can estimate the simple linear regression model using Least Squares (LS), yielding the following LS regression line: ŷ = 20.94 + 1.069x

Interpretation of the estimated intercept b0

b0 corresponds to the predicted value of y, i.e. ŷ, when x = 0. The interpretation of b0 is not always meaningful (when x cannot take values close to or equal to zero). Here b0 = 20.94: when a property is appraised at zero value, the predicted sale price is $20,940. Is that meaningful?

Interpretation of the estimated slope b1

b1 corresponds to the change in y for a unit increase in x: when x increases by 1 unit, y will increase by the value of b1.
b1 < 0: y decreases as x increases (negative association)
b1 > 0: y increases as x increases (positive association)
Here b1 = 1.069: when the appraised value of a property increases by 1 unit, i.e. by $1,000, the predicted sale price will increase by $1,069.

Measuring strength and adequacy of a linear relationship

Correlation coefficient r: a measure of the strength of the linear relationship, with -1 ≤ r ≤ 1. Here r = 0.9723.
Coefficient of determination r²: the amount of variation in y explained by the fitted linear model, with 0 ≤ r² ≤ 1. Here r² = (0.9723)² = 0.9453, so 94.53% of the variation in the sale price can be explained through the linear relationship between the appraised value (x) and the sale price (y).

Population regression line

Recall from Stat 226: the regression model that we assume to hold true for the entire population is the so-called population regression line, μy = β0 + β1x, where
μy: average (mean) value of y in the population for a fixed value of x
β0: population intercept
β1: population slope
The population regression line could only be obtained if we had information on all individuals in the population.

Based on the population regression line we can fully describe the relationship between x and y up to a random error term ε:
y = β0 + β1x + ε, where ε ~ N(0, σ)

In summary, these are the important notations used for SLR (with variables x and y):
Intercept: parameter β0, estimate b0
Slope: parameter β1, estimate b1
Mean response: parameter μy, estimate ŷ
Error: parameter ε, estimate e (residual)

Why fit a LS regression model?

A "good" model allows us to make predictions about the behavior of the response variable y for different values of x. For example, estimate the average sale price (μy) for a property appraised at $223,000:
x = 223: ŷ = 20.94 + 1.069 × 223 = 259.327
The average sale price for a property appraised at $223,000 is estimated to be about $259,327.

Validity of predictions

Assuming we have a "good" model, predictions are only valid within the range of x-values used to fit the LS regression model! Predicting outside the range of x is called extrapolation and should be avoided at all costs, as predictions can become unreliable.

What is a "good" model? The answer to this question is not straightforward. We can visually check the validity of the fitted linear model (through residual plots) as well as make use of numerical values such as r². More on assessing the validity of a regression model will follow.

What to look for: the residual plot and r².

Regression Assumptions

In order to do inference (confidence intervals and hypothesis tests), we need the following 4 assumptions to hold:
1. SRS (independence of y-values)
2. linear relationship between x and μy
3. for each value of x, the population of y-values is normally distributed (equivalently, ε ~ N)
4. for each value of x, the standard deviation of y-values (and of ε) is σ

The "SRS Assumption" is hardest to check.
The "Linearity Assumption" and "Constant SD Assumption" are typically checked visually through a residual plot. Recall: residual = y - ŷ = y - (b0 + b1x). Plot x versus the residuals; any pattern indicates a violation.
The "Normality Assumption" is checked by assessing whether the residuals are approximately normally distributed (use a normal quantile plot).

Returning to the Tampa Palms, Florida example:

[Figure: residual plot, Residual against Appraised Value (in Thousands of Dollars)]

Note: non-constant variance can often be stabilized by transforming x, or y, or both:

[Figure: residual plot, Residual against log Appraised]

Going one step further, excluding the outlier yields

[Figure: residual plot, Residual against log Appraised, outlier excluded]

Outliers/influential points in general should only be excluded from an analysis if they can be explained and their exclusion can be justified, e.g. a typo or invalid measurements.
Excluding outliers always means a loss of information.
Handle outliers with caution.
You may want to compare analyses with and without outliers.

Normal quantile plots, Tampa Palms example

[Figure: normal quantile plot, Residuals Sale Price (in Thousands of Dollars)]
[Figure: normal quantile plot, Residuals log Sale]
[Figure: normal quantile plot, Residuals log Sale without outlier]

Regression Inference: Confidence intervals and hypothesis tests

We need to assess whether the linear relationship between x and y holds true for the entire population. This can be accomplished through testing H0: β1 = 0 vs. Ha: β1 ≠ 0 based on the estimated slope b1. For simplicity we will work with the untransformed Tampa Palms data.

Confidence intervals

We can construct confidence intervals (CIs) for β1 and β0. The general form of a confidence interval is
estimate ± t* × SE(estimate),
where t* is the critical value corresponding to the chosen level of confidence C. t* is based on the t-distribution with n - 2 degrees of freedom (df).

Example: Find a 95% CI for β1 for the Tampa Palms data set and interpret it.

Testing for a linear relationship between x and y

If we wish to test whether there exists a significant linear relationship between x and y, we need to test H0: β1 = 0 versus Ha: β1 ≠ 0.
Why? If we fail to reject the null hypothesis (i.e. stick with H0: β1 = 0), the LS regression model reduces to
μy = β0 + β1x = β0 + 0 · x = β0 (constant),
implying that μy (and hence y) is not linearly dependent on x.
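The slope test and the confidence interval for β1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the Tampa Palms data are not reproduced in these notes, so the six (x, y) pairs below are invented; the function name fit_slr is hypothetical; the standard error formula SE(b1) = s / sqrt(Sxx), with s² = SSE / (n - 2), is the usual textbook one rather than something quoted from the slides; and the critical value t* = 2.776 (df = 4, 95% confidence) is read from a t-table.

```python
import math

def fit_slr(x, y):
    """Least-squares fit of y = b0 + b1*x plus the quantities needed
    for inference about the slope (hypothetical helper)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx                      # estimated slope
    b0 = my - b1 * mx                   # estimated intercept
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    s = math.sqrt(sse / (n - 2))        # residual standard deviation
    se_b1 = s / math.sqrt(sxx)          # standard error of the slope
    r = sxy / math.sqrt(sxx * syy)      # correlation coefficient
    return {"b0": b0, "b1": b1, "r": r, "r2": r * r,
            "se_b1": se_b1, "t": b1 / se_b1, "df": n - 2}

# Hypothetical data (appraised value and sale price, in thousands of dollars).
x = [100.0, 150.0, 200.0, 250.0, 300.0, 350.0]
y = [130.0, 175.0, 240.0, 280.0, 350.0, 390.0]

fit = fit_slr(x, y)

T_STAR = 2.776  # assumed t-table value for df = 4, 95% confidence
ci_low = fit["b1"] - T_STAR * fit["se_b1"]
ci_high = fit["b1"] + T_STAR * fit["se_b1"]

# Two-sided test of H0: beta1 = 0 at the 0.05 level.
reject_h0 = abs(fit["t"]) > T_STAR

print(f"b0 = {fit['b0']:.3f}, b1 = {fit['b1']:.3f}, r^2 = {fit['r2']:.4f}")
print(f"t = {fit['t']:.2f} on {fit['df']} df, reject H0: {reject_h0}")
print(f"95% CI for beta1: ({ci_low:.3f}, {ci_high:.3f})")
```

Rejecting H0 at the 0.05 level corresponds to the 95% confidence interval for β1 not containing zero, so the two printed results always agree.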
Example (Tampa Palms data set): Test at the α = 0.05 level of significance for a linear relationship between the appraised value of a property and the sale price.

Inference about Prediction

Why fit a LS regression model?

The purpose of a LS regression model is to
1. estimate μy, the average/mean value of y for a given value of x, say x*. E.g. estimate the average sale price μy for all residential property in Tampa Palms appraised at x* = $223,000.
2. predict y, an individual/single future value of the response variable y for a given value of x, say x*. E.g. predict a future sale price of an individual residential property appraised at x* = $223,000.
Keep in mind that we consider predictions for only one value of x at a time. Note that these two tasks are VERY different. Carefully think about the difference!

To estimate μy and to predict a single future y value for a given level of x = x*, we can use the LS regression line ŷ = b0 + b1x. Simply substitute the desired value of x, say x*, for x:
ŷ = b0 + b1x*

In addition we need to know how much variability is associated with the point estimator. Taking the variability into account provides information about how good and reliable the point estimator really is. That is, which range potentially captures the true (but unknown) parameter value? Recall from Stat 226: construction of confidence intervals.

Much more variability is associated with estimating a single observation than with estimating an average: individual observations always vary more than averages!

Therefore we distinguish a confidence interval for the average/mean response μy and a prediction interval for a single future observation y. Both intervals use a t* critical value from a t-distribution with df = n - 2. The standard error will be different for each interval. While the point estimator for the average μy and for the future individual value y is the same (namely ŷ = b0 + b1x*), the widths of the two intervals differ!

Confidence interval for the average/mean response μy

The width of the confidence interval is determined using the standard error SEμ (from estimating the mean response). SEμ can be obtained in JMP. Keep in mind that every confidence interval is always constructed for one specific given value x*. A level C confidence interval for the average/mean response μy, when x takes the value x*, is given by
ŷ ± t* SEμ,
where SEμ is the standard error for estimating a mean response.

Prediction interval for a single (future) value y

Again, the width of the prediction interval is determined using a standard error, here SEŷ (from predicting a single response). SEŷ can be obtained in JMP. Keep in mind that every prediction interval is always constructed for one specific given value x*. A level C prediction interval for a single observation y, when x takes the value x*, is given by
ŷ ± t* SEŷ,
where SEŷ is the standard error for predicting a single response.

[Figure: The larger picture]

Example: An appliance store runs a 5-month experiment to determine the effect of advertising on sales revenue. There are only 5 observations. The scatterplot of the advertising expenditures versus the sales revenues is shown below:

[Figure: Bivariate Fit of Sales Revenues (in Dollars) By Advertising expenditure (in Dollars), with linear fit]

Linear Fit: Sales Revenues (in Dollars) = -100 + 7 × Advertising expenditure (in Dollars)

Example cont'd: JMP can draw the confidence intervals for the mean responses as well as for the predicted values for future observations (prediction intervals). These are called confidence bands:

[Figure: Bivariate Fit of Sales Revenues (in Dollars) By Advertising expenditure (in Dollars), with confidence bands]

Estimation and prediction (for the appliance store data)

We wish to estimate the mean/average revenue of the subpopulation of stores that spent x* = 200 on advertising. Suppose that we also wish to predict the revenue in a future month when our store spends x* = 200 on advertising. The point estimate in both situations is the same:
ŷ = -100 + 7 × 200 = 1300
The corresponding standard errors of the mean and of the prediction, however, are different:
SEμ ≈ 331.663
SEŷ ≈ 690.411

Estimation and prediction – Using JMP

For each observation in a data set we can get from JMP: ŷ, SEŷ, and also SEμ. In JMP do:
1. Choose Fit Model
2. From the response icon, choose Save Columns and then choose Predicted Values, Std Error of Predicted, and Std Error of Individual.

Estimation and prediction (cont'd)

Note that in the appliance store example, SEŷ > SEμ (690.411 versus 331.663). This is always true: we can estimate a mean value of y for a given x* much more precisely than we can predict the value of a single y for x = x*. In estimating a mean μy for x = x*, the only uncertainty arises because we do not know the true regression line. In predicting a single y for x = x*, we have two uncertainties: the true regression line plus the expected variability of y-values around the true line.

It always holds that SEμ < SEŷ. Therefore a prediction interval for a single future observation y will always be wider than a confidence interval for the mean response μy, as there is simply more uncertainty in predicting a single value.

Example cont'd: JMP also calculates confidence intervals for the mean response μy as well as prediction intervals for single future observations y. (For instructions follow the handout on JMP commands related to regression CIs and PIs.)

Example cont'd: To construct a confidence and/or prediction interval, we need to obtain SEμ and SEŷ in JMP for the value x* that we are interested in:

[Table: JMP output with columns Month; Ad. Expend.; Sales Rev.; Pred. Sales Rev.; StdErr Pred Sales Revenues; StdErr Indiv Sales Revenues]

Let's construct one 95% CI and PI by hand and see if we can come up with the same results as JMP. In the second month the appliance store spent x = $200 on advertising and observed $1000 in sales revenue, so x = 200 and y = 1000. Using the estimated LS regression line, we predict:
ŷ = -100 + 7 × 200 = 1300
We need to find t* first, and then construct:
A 95% CI for the mean response μy, when x* = 200
A 95% PI for a single future observation of y, when x* = 200

Example cont'd:

[Table: JMP output with columns Month; Advertising exp.; Sales Rev.; Lower 95% Mean Sales Rev.; Upper 95% Mean Sales Rev.; Lower 95% Indiv Sales Rev.; Upper 95% Indiv Sales Rev.]
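The by-hand 95% CI and PI at x* = 200 can be sketched as follows. Several inputs are assumptions rather than content from the slides: the five monthly data points are not listed in these notes, so the values used here are hypothetical ones chosen to be consistent with the fitted line ŷ = -100 + 7x and with the second-month observation (200, 1000); the standard error formulas SEμ = s · sqrt(1/n + (x* - x̄)² / Sxx) and SEŷ = s · sqrt(1 + 1/n + (x* - x̄)² / Sxx) are the usual textbook ones; the function name intervals_at is made up; and t* = 3.182 (df = 3, 95% confidence) is read from a t-table.

```python
import math

# Hypothetical data, chosen to be consistent with the fitted line quoted above.
ad_spend = [100.0, 200.0, 300.0, 400.0, 500.0]
revenue = [1000.0, 1000.0, 2000.0, 2000.0, 4000.0]

def intervals_at(x, y, x_star, t_star):
    """Point estimate, standard errors, CI for the mean response and
    PI for a single future observation at x = x_star."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    s = math.sqrt(sse / (n - 2))             # residual standard deviation
    leverage = 1.0 / n + (x_star - mx) ** 2 / sxx
    y_hat = b0 + b1 * x_star                 # same point estimate for both tasks
    se_mean = s * math.sqrt(leverage)        # SE for estimating the mean response
    se_indiv = s * math.sqrt(1.0 + leverage) # SE for predicting a single response
    ci = (y_hat - t_star * se_mean, y_hat + t_star * se_mean)
    pi = (y_hat - t_star * se_indiv, y_hat + t_star * se_indiv)
    return {"b0": b0, "b1": b1, "y_hat": y_hat,
            "se_mean": se_mean, "se_indiv": se_indiv, "ci": ci, "pi": pi}

T_STAR = 3.182  # assumed t-table value for df = 5 - 2 = 3, 95% confidence
out = intervals_at(ad_spend, revenue, x_star=200.0, t_star=T_STAR)

print(f"fitted line: y_hat = {out['b0']:.1f} + {out['b1']:.1f} x")
print(f"y_hat at x* = 200: {out['y_hat']:.1f}")
print(f"SE mean = {out['se_mean']:.3f}, SE indiv = {out['se_indiv']:.3f}")
print(f"95% CI for the mean response: ({out['ci'][0]:.2f}, {out['ci'][1]:.2f})")
print(f"95% PI for a single observation: ({out['pi'][0]:.2f}, {out['pi'][1]:.2f})")
```

With these assumed data the two standard errors agree with the quoted values of about 331.66 and 690.41 to two decimal places, and the prediction interval is roughly twice as wide as the confidence interval, in line with SEμ < SEŷ.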
