Rural poverty remains a critical economic problem in many developing countries. This paper conducts an econometric analysis of data from the 2006 Vietnam Household Expenditure Survey to assess the impact of selected socio-economic factors on the income of Vietnamese households. The data are cross-sectional: they record variables such as per capita income for many households at a single point in time, so variation over time plays no role.
Jehovaness Aikaeli, in his research report “Determinants of Rural Income in Tanzania: An Empirical Approach”, uses the 2005 Tanzania Rural Investment Climate Survey to assess the impact of selected socio-economic and geographic factors on the income of rural households and communities. He finds that improvements in four variables (the level of education of the household head, the size of the household labor force, the acreage of land use and ownership of a non-farm rural enterprise) had a significant positive impact on the incomes of rural households (Aikaeli vi).
I will use his paper as a guide for my research; however, I will not use household labor force, acreage of land use or ownership of a non-farm rural enterprise, as I do not have data on them.

Steve Onyeiwu, in his paper “Determinants of Income Poverty in Rural Africa: Empirical Evidence from Kenya and Nigeria”, uses panel and cross-sectional regressions with socio-economic and demographic survey data collected from rural communities of Kenya and Nigeria to explore the determinants of income and poverty in rural Africa.
The determinants he looks at are household size, age, female proportion, education, land ownership and non-durable assets. He finds that income was lower in households run by females (Onyeiwu 2). I will also use this paper as a guide, but again I will not use land ownership or non-durable assets, as I do not have data on them.

Education plays an important role in determining income. What I would like to examine in this paper is the causal effect of the father’s and mother’s education on household income. However, education is not the only factor that affects household income.
Many other factors also affect household income, such as the father’s age, the mother’s age, the total number of kids at home, whether the child works, the value of durable assets, whether the household is in an urban or a rural area, and whether the household belongs to an ethnic minority. To see the causal effect of the father’s and mother’s education on household income, I need to control for all of these other variables so that they do not end up in the error term and cause omitted variables bias.

Explaining the Method and Results

I will explain the important concepts as I go, together with the step-by-step procedure of applying them to the data.
Dependent Variable: the response that is measured; the variable to be explained in a model. In this case, it is household income per capita (pcexp2rl). Throughout the paper, I will refer to it as household income.

Independent Variables: variables that are used to explain variation in the dependent variable. In my case: father_edu, mother_edu, father_age, mother_age, totkidshome, kid_work, durbus_2, urban06 and ethn.

Multiple Linear Regression (MLR) Model: To examine the relationship between household income and all the other variables, a multiple linear regression model can be used.
Multiple linear regression models the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u$$

where $u$ is the error term, which includes all other factors that affect $y$ but are not included in the model.

General to Simple Model: I used the general-to-simple strategy because a model that starts out too simple will suffer from omitted variables bias, so I chose to start with a general model that includes many variables.
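The omitted-variable concern behind the general-to-simple strategy can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the variables x1 and x2, the coefficients 2 and 3, and the sample size are hypothetical and are not taken from the survey data. When x2 is correlated with x1 but is left out of the regression, it ends up in the error term and the estimated slope on x1 drifts away from its true value.

```python
import random


def simple_slope(x, y):
    """OLS slope from regressing y on x (with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    return cov_xy / var_x


random.seed(0)  # fixed seed so the run is reproducible
n = 5000  # hypothetical sample size

x1 = [random.gauss(0, 1) for _ in range(n)]
# x2 is correlated with x1: x2 = 0.5 * x1 + noise (hypothetical numbers)
x2 = [0.5 * a + random.gauss(0, 1) for a in x1]
# Assumed true model: y = 1 + 2*x1 + 3*x2 + u
y = [1 + 2 * a + 3 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

# Regress y on x1 only, so x2 is pushed into the error term.
short_slope = simple_slope(x1, y)
print(f"true effect of x1: 2.0, estimate with x2 omitted: {short_slope:.2f}")
```

Under these assumptions the slope from the short regression should land near 2 + 3 × 0.5 = 3.5 rather than the true value of 2, which is the kind of bias a general starting model is meant to avoid.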
If any variables in the general model turn out to be irrelevant, I can remove them step by step.

Multiple Linear Regression Model for my data: I ran the regression with all of these variables. The general model is:

lpcexp = 8.005 + 0.022 father_edu + 0.008 mother_edu – 0.0002 father_age + 0.011 mother_age – 0.108 totkidshome – 0.025 kid_work + 0.230 durbus_2 + 0.333 urban06 – 0.255 ethn

I chose a log-linear model in order to get better results, which is why I had to generate a new variable, lpcexp (the log of pcexp2rl). The basic assumptions to check are homoskedasticity and normality. Results are shown below:

Coefficients (Coef.): In a regression with multiple independent variables, the coefficient tells you how much the dependent variable is expected to increase or decrease (depending on the coefficient’s sign) when that independent variable increases by one unit, holding all the other independent variables constant (ceteris paribus). Since this is a log-linear model, 100 times the coefficient is the approximate percentage change in household income for a one-unit increase in the regressor. For example, a one-unit increase in father’s education is associated with an increase of about 2.2% in household income. Recall, however, that the dependent variable is pcexp2rl = household income / number of household members, so an increase in totkidshome (the number of children at home) decreases per capita household income.
So one additional child at home decreases (per capita) household income by about 10.8%.

Table for Percentage change:

Explanatory Variables | Percentage Changes in y explained by x
father_edu | 2.22%
mother_edu | 0.83%
father_age | -0.019%
mother_age | 1.11%
totkidshome | -10.82%
kid_work | -2.47%
durbus_2 | 23.05%
urban06 | 33.30%
ethn | -25.47%

This table can be used to see which effects are large and which are small. urban06 has the largest effect on per capita income: we expect about a 33.30% increase in per capita income if the household lives in an urban area.
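The percentages in the table are 100 times the coefficients, which is an approximation that works well for small coefficients. For larger coefficients, the exact percentage change in y implied by a log-linear model is 100·(exp(b) − 1). A minimal sketch of the comparison, using the rounded coefficients from the fitted equation for the three dummy variables (the function names are my own, for illustration only):

```python
import math

# Rounded coefficients as reported in the fitted equation for the dummy variables.
coefficients = {"kid_work": -0.025, "urban06": 0.333, "ethn": -0.255}


def approx_pct_change(b):
    """Approximate percentage change in y: 100 * b."""
    return 100 * b


def exact_pct_change(b):
    """Exact percentage change in y when the regressor rises by one unit."""
    return 100 * (math.exp(b) - 1)


for name, b in coefficients.items():
    print(f"{name}: approx {approx_pct_change(b):.2f}%, exact {exact_pct_change(b):.2f}%")
```

The two measures nearly coincide for kid_work but diverge for urban06 and ethn, so the approximation should be read with more caution for the larger coefficients.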
The second largest effect is ethn, but in the opposite direction: we expect a 25.47% decrease in per capita household income if the household belongs to an ethnic minority.

Dummy Variables

A dummy variable is a variable that takes on the value zero or one. For example: female = 1 and male = 0. In this case, kid_work, urban06 and ethn are dummy variables. For urban06, urban06 = 0 if the household is in a rural area and urban06 = 1 if the household is in an urban area. So the coefficient on urban06 implies that households in an urban area earn 0.461 times higher income than households in a rural area.

t-statistic
Moving on, in Wooldridge’s “Introductory Econometrics”, Theorem 4.2 (t Distribution for the Standardized Estimators) states: under the CLM assumptions MLR.1 through MLR.6,

$$\frac{\hat{\beta}_j - \beta_j}{\mathrm{se}(\hat{\beta}_j)} \sim t_{n-k-1},$$

where $k+1$ is the number of unknown parameters in the population model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u$ ($k$ slope parameters and the intercept $\beta_0$).

This theorem allows us to test hypotheses involving only one element of $\beta$: $\beta_j$. In most applications, our primary interest lies in testing the null hypothesis (the hypothesis that we take as true unless the data provide substantial evidence against it), $H_0: \beta_j = \beta_{j0}$. To determine a rule for rejecting $H_0$, we need to decide on an alternative hypothesis (the hypothesis against which the null hypothesis is tested), $H_a: \beta_j \neq \beta_{j0}$. This is a two-sided test, and I will use it because it is common and matches the Stata results given above. It is common to take $H_0: \beta_j = 0$, so the alternative hypothesis is $H_a: \beta_j \neq 0$.

For example, for father_edu, with $H_0: \beta_1 = 0$ and $H_a: \beta_1 \neq 0$, the t stat is (0.022181 - 0)/0.0036637 = 6.05 (as seen in the table above). If |t stat| > c, we reject $H_0$ in favor of $H_a$. Here c is the critical value $t_{n-k-1,\,\alpha/2}$, where $n-k-1$ is the degrees of freedom (df) and $\alpha$ is the significance level (conventionally 0.01 or 0.05) used for rejecting a null hypothesis in favor of the alternative hypothesis. If we know the degrees of freedom and choose a significance level, we can find the critical value c (see the table in the Appendix).

For father_edu, we know the t stat is 6.05. The t table shows that the critical value with df = 4278 (4288 - 9 - 1) and a significance level of 0.01 (0.01/2 = 0.005 in each tail, as this is a two-tailed test) is 2.576. Since 6.05 > 2.576, we reject $H_0$ in favor of $H_a$ at the 1% level, meaning father’s education does make a difference to household income.
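The arithmetic of this test can be checked in a few lines. The sketch below uses the coefficient and standard error quoted above for father_edu. The helper names are hypothetical, and the p-value uses the standard normal approximation, which I assume is adequate here because the degrees of freedom (4278) are large.

```python
import math


def t_statistic(estimate, std_error, null_value=0.0):
    """t statistic for H0: beta_j = null_value."""
    return (estimate - null_value) / std_error


def two_sided_p_value_normal(t):
    """Two-sided p-value using the standard normal approximation to the t distribution."""
    return math.erfc(abs(t) / math.sqrt(2))


coef = 0.022181        # father_edu coefficient, from the text
std_err = 0.0036637    # its standard error, from the text
critical_1pct = 2.576  # two-sided 1% critical value for large df, from the text

t = t_statistic(coef, std_err)
reject_h0 = abs(t) > critical_1pct
p_value = two_sided_p_value_normal(t)

print(f"t = {t:.2f}, reject H0 at 1%: {reject_h0}, approximate p-value = {p_value:.2e}")
```

Comparing |t| with the critical value and comparing the p-value with the significance level are two views of the same decision, so they should always agree.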
We can also say that father_edu is statistically significant, or statistically different from zero, at the 1% level. This shows that we should keep this variable in the model. Likewise, mother_edu, mother_age, totkidshome, durbus_2, urban06 and ethn all have |t stat| > c, so they are all statistically significant.
However, father_age and kid_work have |t stat| < c, so we do not reject $H_0$ at the 1% level: we find no statistically significant evidence that father’s age, or whether the child works, affects per capita household income. The variables father_age and kid_work are not statistically significant. Also, check the graph in the appendix.

P-value

Furthermore, we can also look at the p-value, which is the smallest significance level at which the null hypothesis can be rejected or, equivalently, the largest significance level at which the null hypothesis cannot be rejected. If p-value < $\alpha$, we reject $H_0$. For example, totkidshome has a p-value of 0.000, which is less than the significance level of 0.01, so we reject $H_0$, meaning the number of kids in the house has a significant impact on household income. However, father_age and kid_work have p-values that are greater than 0.01, so we do not reject $H_0$, meaning father_age and kid_work do not make a statistically significant difference to household income.

P-value and Type I and Type II errors

The Type I error rate, $\alpha$ (alpha), is defined as the probability of rejecting a true null hypothesis. The Type II error rate, $\beta$ (beta), is defined as the probability of failing to reject a false null hypothesis. At the model development stage, we use $\alpha$ = 0.30 or $\alpha$ = 0.50. We choose $\alpha$ = 0.30 when deciding whether to retain individual regressors because a large alpha reduces the risk of a Type II error (beta risk). If the p-value > 0.30, we exclude the variable, accepting the risk that it is in fact relevant (a Type II error). For judging statistical significance, we use alpha = 0.05: if the p-value < 0.05, the variable is said to be statistically significant. In my case, I deleted father_age because its p-value = 0.917 > 0.30; it is also the least significant variable, as 0.917 > 0.05. I ran the regression again without father_age. The results show that the coefficients on the other variables decreased only slightly.
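The two thresholds described above play different roles, which a short sketch can make explicit. The function name and the dictionary layout are my own, for illustration; the p-values are the two quoted in the text.

```python
def assess(p_value, retain_alpha=0.30, significance_alpha=0.05):
    """Apply a loose threshold for retaining a regressor and a strict one for significance."""
    return {
        "retain": p_value <= retain_alpha,
        "significant": p_value < significance_alpha,
    }


# p-values quoted in the text
p_values = {"father_age": 0.917, "totkidshome": 0.000}

decisions = {name: assess(p) for name, p in p_values.items()}
for name, decision in decisions.items():
    print(name, decision)
```

A variable with a p-value between 0.05 and 0.30 would be retained at the development stage without being called statistically significant, which is how the large alpha guards against dropping a relevant regressor.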
That the coefficients barely move when father_age is removed shows that it is the least significant variable, which is why dropping it was a good decision.

F-statistic

Another way of testing hypotheses, one that involves several betas at once, is the F statistic, which is defined by

$$F \equiv \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n-k-1)},$$

where $SSR_{ur}$ is the sum of squared residuals for the unrestricted model (the big model, which includes all betas) and $SSR_r$ is the sum of squared residuals for the restricted model (which only includes the betas that you are not testing). Here $q$ is the numerator degrees of freedom and $n-k-1$ is the denominator degrees of freedom. For example, we test $H_0: \beta_2 = \beta_3 = \beta_4 = 0$ against $H_a$: at least one of them $\neq 0$. The restricted model is the one without these betas. Since $SSR_r \geq SSR_{ur}$ always holds, the test asks whether $SSR_r$ is sufficiently larger than $SSR_{ur}$. Under $H_0$ and the CLM assumptions, F is distributed as an F random variable with $(q, n-k-1)$ degrees of freedom: $F \sim F_{q,\,n-k-1}$. If F > c, we reject $H_0$ at the chosen significance level. For example, in the first table, F(9, 4278) = 567.98, where $n-k-1$ = 4278 and $q$ = 9. Using this information, the F table gives 2.41 as the critical value at the 1% significance level. Since 567.98 > 2.41, we reject $H_0$ at the 1% significance level. The F-distribution graph is shown in the appendix.

R-squared
Adding on, R-squared in a multiple regression model is the proportion of the total sample variation in the dependent variable (household income) that is explained by the independent variables. It measures “goodness of fit” and takes a value between 0 and 1, where R-squared = 1 means the regression line fits the data perfectly. In the first table, R-squared = 0.5884, which is a moderate fit: 58.84% of the variation in per capita household income is explained by father_edu, mother_edu, father_age, mother_age, totkidshome, kid_work, durbus_2, urban06 and ethn.
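Both R-squared and the F statistic from the previous section are built from sums of squares. The sketch below shows the two formulas side by side. The sums of squares are made-up numbers chosen for easy arithmetic, not values from my regression, and the function names are hypothetical.

```python
def r_squared(ssr, sst):
    """Share of total variation explained: 1 - SSR/SST."""
    return 1 - ssr / sst


def f_statistic(ssr_restricted, ssr_unrestricted, q, df_denominator):
    """F = [(SSR_r - SSR_ur) / q] / [SSR_ur / (n - k - 1)]."""
    numerator = (ssr_restricted - ssr_unrestricted) / q
    denominator = ssr_unrestricted / df_denominator
    return numerator / denominator


# Hypothetical sums of squares, for illustration only
sst = 250.0      # total sum of squares
ssr_ur = 100.0   # unrestricted model
ssr_r = 120.0    # restricted model (two regressors dropped)

r2_unrestricted = r_squared(ssr_ur, sst)
r2_restricted = r_squared(ssr_r, sst)
f_stat = f_statistic(ssr_r, ssr_ur, q=2, df_denominator=50)

print(f"R-squared: unrestricted {r2_unrestricted:.2f}, restricted {r2_restricted:.2f}")
print(f"F statistic: {f_stat:.2f}")
```

Because dropping regressors can only raise the sum of squared residuals, the restricted R-squared can never exceed the unrestricted one, and the F statistic measures whether that loss of fit is large relative to the noise.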
In the second table, R-squared did not change: it was still 0.5884. Dropping a variable can never increase R-squared and normally decreases it, because a smaller share of the variation in household income is explained by the remaining variables (now only eight variables compared to nine before). That R-squared is unchanged to four decimal places after deleting father_age shows how little father’s age contributes, so it makes sense to delete it from the model.

Test for Normality

We use the skewness and kurtosis test (sktest) to test the normality of the residuals.
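As a rough illustration of what this test looks at, sample skewness and kurtosis can be computed from the moments of the residuals. This sketch only computes the two descriptive measures on made-up numbers; it does not reproduce the adjusted chi2 statistic that Stata’s sktest reports, and the function name is my own.

```python
def skewness_and_kurtosis(values):
    """Moment-based sample skewness and kurtosis (a normal distribution has 0 and 3)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Made-up "residuals": one symmetric set and one with a long left tail
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
left_tailed = [-6.0, -1.0, 0.0, 0.0, 1.0, 1.0, 1.0]

skew_sym, kurt_sym = skewness_and_kurtosis(symmetric)
skew_left, kurt_left = skewness_and_kurtosis(left_tailed)

print(f"symmetric:   skewness {skew_sym:.2f}, kurtosis {kurt_sym:.2f}")
print(f"left-tailed: skewness {skew_left:.2f}, kurtosis {kurt_left:.2f}")
```

A negative skewness value corresponds to a distribution with a longer left tail, which is the pattern to look for when the residuals are suspected of being skewed to the left.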
We test the following hypotheses (using sktest):

$H_0$: skewness = 0 against $H_a$: skewness $\neq$ 0
$H_0$: kurtosis = 3 against $H_a$: kurtosis $\neq$ 3
$H_0$: skewness = 0 and kurtosis = 3 against $H_a$: not $H_0$

I ran sktest in Stata. The results show that the adjusted chi2 is very large, so we reject the joint null hypothesis. The p-values for skewness and kurtosis are also very small, so we reject the null hypotheses for skewness and for kurtosis at the 1% significance level (p-value < 0.01). Therefore, the residuals are not normally distributed.

Histogram of Residuals

We can also look at the histogram of the residuals, which I plotted in Stata, to see whether they follow a normal distribution.
Graphically, the normal distribution is best described by a ‘bell-shaped’ curve. We see that the residuals do not appear to be normally distributed. The histogram does not fit well under its comparable normal density, as it is a bit skewed to the left. There is also something below -6.

White’s Test for Heteroskedasticity: If the error terms do not have constant variance, they are said to be heteroskedastic, as opposed to homoskedastic (MLR.5 in appendix). White’s test establishes whether the residual variance in a regression model is constant; that is, it tests for homoskedasticity.
This test, and an estimator for heteroskedasticity-consistent standard errors, were proposed by Halbert White in 1980 (citation). It involves regressing the squared residuals from the OLS regression on the independent variables in the regression, their squares and their cross products. The R-squared from that regression is multiplied by the sample size $n$. The result is a test statistic distributed approximately as chi-squared: $nR^2 \sim \chi^2_q$, where $q$ is the degrees of freedom.

The null hypotheses of interest are: H0: no heteroskedasticity. H0: no skewness. H0: normal kurtosis. The results show that the p-values for heteroskedasticity, skewness and kurtosis are all very small (less than 0.01), so we reject the null hypotheses, meaning unrestricted heteroskedasticity is present. Another way of looking at the results is using $nR^2$ = 4278(0.5884) = 2517. At the 1% significance level and df = 61, we reject the null hypothesis because 2517 far exceeds the critical value. There is unrestricted heteroskedasticity present. I also did a residuals plot (by typing rvfplot in Stata), as seen above, which shows that as the fitted values increase the residuals increase….

With heteroskedasticity present, the usual t-stat no longer follows a t distribution and the usual F-stat no longer follows an F distribution. Instead, the t-stat based on robust standard errors is asymptotically standard normal, and the robust F-stat is asymptotically chi2.

I ran White’s test again to check whether heteroskedasticity is still present in the final model (the one without father_age).

References

Aikaeli, Jehovaness. “Determinants of Rural Income in Tanzania: An Empirical Approach.” (2010): 1-21. Print.
Onyeiwu, Steve. “Determinants of Income Poverty in Rural Africa: Empirical Evidence from Kenya and Nigeria.” (2011): 1-27. Print.
Wooldridge, Jeffrey. Introductory Econometrics. 4th ed. 2009. Print.