A gentle general introduction to imputation methods is provided. The focus of this report is to explore the Markov Chain Monte Carlo (MCMC) imputation method. The initial discussion covers the underlying mathematics of the procedure. The procedure is then implemented with the aid of SAS programs to compare and contrast multiple and single imputation methods. Further discussion of the two imputation methods follows.

## 1.2 Introduction

Multiple imputation (MI) is an iterative statistical technique used to analyse missing data. The procedure replaces each missing value with a vector of M ≥ 2 imputed values, completing the data set so that the usual statistical inference can be carried out on it.[1] After M sets of inferences have been completed under a specific model, the sets are combined to form one complete-data inference. This idea was first postulated by Rubin and has proved to be very valuable for researchers who have to treat missing data from surveys. It is imperative to note at this stage that there are uncertainties and assumptions that go with modelling missing data. While there are many imputation techniques that can be adopted, the method of choice depends on various factors such as the nature of the data collected and the causes of the missing data. Some of these methods will be discussed in this report, but the focus will be on one specific imputation method.

## 1.3 Types of Missingness of Data and Imputation Methods

Missing completely at random (MCAR): This is when the missing values in the survey are distributed completely at random. In practice it is hard to establish that data are missing completely at random, although statistical tests such as a t-test can be carried out to compare the cases with missing data against the complete cases.
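The t-test check mentioned above can be sketched in Python. This is only a minimal illustration under stated assumptions: the helper names `welch_t` and `split_by_missingness` and the data are hypothetical, and Welch's unequal-variance form of the t statistic is my choice rather than anything prescribed in this report. The observed values of one variable are split according to whether a second variable is missing, and the two group means are compared. A statistic close to zero is consistent with MCAR but does not prove it.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for the difference in means of two samples."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def split_by_missingness(x, y):
    """Split the values of x by whether the paired y value is missing (None)."""
    x_when_y_observed = [xi for xi, yi in zip(x, y) if yi is not None]
    x_when_y_missing = [xi for xi, yi in zip(x, y) if yi is None]
    return x_when_y_observed, x_when_y_missing


# Hypothetical data: y is missing (None) for some rows.
x = [10.0, 12.0, 11.0, 13.0, 12.0, 10.0, 14.0, 11.0]
y = [1.2, None, 0.9, 1.5, None, 1.1, None, 1.0]

observed, missing = split_by_missingness(x, y)
t_stat = welch_t(observed, missing)
print(f"Welch t statistic: {t_stat:.3f}")
```

A large absolute value of the statistic would suggest that missingness in `y` is related to `x`, which argues against MCAR.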

Missing at random (MAR) is when missing values are randomly distributed within one or more sub-samples and not across all observations. This is a more common case than the former.

There are various methods that can be used to impute missing values, and the choice depends on the type of missing data.

A data set with variables (Y1, Y2, …, Ym) has a monotone missing pattern if the event that a variable Yj is observed implies that all previous variables Yk, k < j, are also observed in that data set.

In this case a parametric regression method assuming multivariate normality or a nonparametric method using propensity scores is appropriate.

For arbitrary missing data patterns, a Markov Chain Monte Carlo (MCMC) method that assumes multivariate normality is ideal.

When you have a monotone missing data pattern, you have greater flexibility in your choice of strategies. For example, you can implement a regression model without involving iterations as in MCMC. When you have an arbitrary missing data pattern, you can often use the MCMC method, which creates multiple imputations by using simulations from a Bayesian prediction distribution for normal data. Another way to handle a data set with an arbitrary missing data pattern is to use the MCMC approach to impute enough values to make the missing data pattern monotone. Then, you can use a more flexible imputation method. In this report I will concentrate on MCMC.
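The definition of a monotone pattern can be checked mechanically. The following is a minimal Python sketch, assuming each record is a list of values in the variable order (Y1, Y2, …) with `None` marking a missing value; the function name `is_monotone` and the example rows are hypothetical. Note that the result depends on the order in which the variables are listed.

```python
def is_monotone(rows):
    """Return True if no observed value follows a missing one in any row.

    Each row holds the values of (Y1, ..., Yp) in order; None marks a
    missing value. This is the monotone pattern: once a variable is
    missing for a record, all later variables are missing too.
    """
    for row in rows:
        seen_missing = False
        for value in row:
            if value is None:
                seen_missing = True
            elif seen_missing:
                return False  # an observed value follows a missing one
    return True


# Hypothetical records.
monotone_rows = [[1.0, 2.0, 3.0], [1.0, 2.0, None], [1.0, None, None]]
arbitrary_rows = [[1.0, None, 3.0], [1.0, 2.0, 3.0]]

print(is_monotone(monotone_rows))
print(is_monotone(arbitrary_rows))
```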

## 1.4 Use of Bayes' Theorem in Multiple Imputation

At this point I will not delve into the specifics of Bayes' theorem as a standalone concept but will instead explain the tailored version used in the implementation of MI. The appendix can be referred to for a more detailed introduction and explanation of the Bayesian posterior distribution.

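For simplicity I will concentrate on a problem with two parameters $\theta_1$, $\theta_2$ and data $Y$. Under a Bayesian analysis these have a joint posterior distribution

$$f(\theta_1, \theta_2 \mid Y). \qquad (1)$$

Concentrating on $\theta_2$ and partitioning the posterior,

$$f(\theta_1, \theta_2 \mid Y) = f(\theta_1 \mid Y)\, f(\theta_2 \mid \theta_1, Y), \qquad (2)$$

it can be deduced that the marginal posterior for $\theta_2$ can be expressed as

$$f(\theta_2 \mid Y) = \int f(\theta_2 \mid \theta_1, Y)\, f(\theta_1 \mid Y)\, d\theta_1 = E_{\theta_1}\!\left[ f(\theta_2 \mid \theta_1, Y) \right]. \qquad (3)$$

In particular, the posterior mean and variance for $\theta_2$ are given by:

$$E(\theta_2 \mid Y) = E_{\theta_1}\!\left[ E(\theta_2 \mid \theta_1, Y) \right], \qquad (4)$$

$$V(\theta_2 \mid Y) = E_{\theta_1}\!\left[ V(\theta_2 \mid \theta_1, Y) \right] + V_{\theta_1}\!\left[ E(\theta_2 \mid \theta_1, Y) \right]. \qquad (5)$$

With the aid of empirical moments, approximations to these formulae can be deduced.

Let $\theta_1^{(m)}$, $m = 1, \dots, M$, be draws from the marginal posterior distribution of $\theta_1$.

Let

$$\hat{\theta}_2 = \frac{1}{M} \sum_{m=1}^{M} E(\theta_2 \mid \theta_1^{(m)}, Y) \approx E(\theta_2 \mid Y). \qquad (6)$$

The variance can then also be approximated as:

$$V(\theta_2 \mid Y) \approx \frac{1}{M} \sum_{m=1}^{M} V(\theta_2 \mid \theta_1^{(m)}, Y) + \frac{1}{M-1} \sum_{m=1}^{M} \left[ E(\theta_2 \mid \theta_1^{(m)}, Y) - \hat{\theta}_2 \right]^2. \qquad (7)$$

Finally, putting everything into perspective, this model can now be applied to MI procedures by having $\theta_2$ represent the parameters of the substantive model and $\theta_1$ represent the missing data.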

## 2.0 The Markov Chain Monte Carlo Imputation Method.

A Markov chain is a sequence of random variables in which the distribution of each value depends only on the previous one. Examples include simple random walks. Markov chain Monte Carlo (MCMC) is a collection of methods for simulating random draws from nonstandard distributions via Markov chains. This is particularly useful when dealing with data with a nontrivial missing pattern. Here MCMC is used to generate several independent replacements of the missing data from a predictive distribution. These results are then used for multiple imputation inference. Because of its efficiency, MCMC requires only relatively few draws and iterations for a plausible imputation analysis to take place.
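The simple random walk mentioned above is perhaps the easiest Markov chain to simulate. The sketch below is purely illustrative (the function name `random_walk` and the seed are hypothetical choices): each new position depends only on the current one, which is the Markov property.

```python
import random


def random_walk(n_steps, start=0, seed=0):
    """Simulate a simple random walk: each step moves +1 or -1 from the
    current position, so the next value depends only on the present one."""
    rng = random.Random(seed)  # local generator so the run is reproducible
    path = [start]
    for _ in range(n_steps):
        path.append(path[-1] + rng.choice((-1, 1)))
    return path


walk = random_walk(20, seed=42)
print(walk)
```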

Here the posterior distribution is computed using Bayes' theorem:

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta)\, p(\theta)\, d\theta} \qquad (8)$$

One key assumption for using the MCMC method is that the data set follows a multivariate normal distribution.

Imputation can then be carried out by performing the following steps repeatedly:

## 2.1 Imputation Phase:

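With the aid of an estimated mean vector $\mu$ and covariance matrix $\Sigma$, the missing values for each observation are imputed independently of the other observations. The values are drawn from the conditional distribution of $Y_{mis}$ given $Y_{obs}$.

Let $\mu = [\mu_1', \mu_2']'$ be the mean vector, partitioned according to $Y_{obs}$ and $Y_{mis}$ respectively.

Additionally, it is important that $\Sigma$ be partitioned,

$$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_{22} \end{bmatrix} \qquad (9)$$

where $\Sigma_{11}$ = covariance matrix for $Y_{obs}$,

$\Sigma_{22}$ = covariance matrix for $Y_{mis}$,

and $\Sigma_{12}$ = covariance matrix between $Y_{obs}$ and $Y_{mis}$.

In order to calculate the conditional covariance matrix of $Y_{mis}$ with $Y_{obs}$ controlled, I shall apply the sweep operator[2] on the pivots of $\Sigma_{11}$ to obtain

$$\begin{bmatrix} -\Sigma_{11}^{-1} & \Sigma_{11}^{-1}\Sigma_{12} \\ \Sigma_{12}'\Sigma_{11}^{-1} & \Sigma_{22.1} \end{bmatrix}. \qquad (10)$$

Here

$$\Sigma_{22.1} = \Sigma_{22} - \Sigma_{12}'\Sigma_{11}^{-1}\Sigma_{12}. \qquad (11)$$

For clarity, it follows that the conditional distribution of $Y_{mis}$ given $Y_{obs} = y_1$ is multivariate normal with covariance matrix $\Sigma_{22.1}$ and mean vector:

$$\mu_{2.1} = \mu_2 + \Sigma_{12}'\Sigma_{11}^{-1}(y_1 - \mu_1) \qquad (12)$$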
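To make the sweep step concrete, here is a small Python sketch using only the standard library. It is an illustration under assumptions rather than the SAS implementation: the function names `sweep` and `conditional_normal` and the numbers are hypothetical, the observed variables are assumed to come first, and the sweep uses the sign convention in which the swept pivot block becomes the negative inverse of the covariance matrix of the observed variables. After sweeping on the observed pivots, the off-diagonal block holds the regression coefficients and the remaining block holds the conditional covariance of the missing variables.

```python
def sweep(g, k):
    """Sweep the symmetric matrix g (a list of lists) on pivot k.

    Convention: the pivot element becomes -1/g[k][k], the pivot row and
    column are divided by g[k][k], and every other element is adjusted
    by g[j][k] * g[k][l] / g[k][k].
    """
    n = len(g)
    d = g[k][k]
    h = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for l in range(n):
            if j == k and l == k:
                h[j][l] = -1.0 / d
            elif j == k or l == k:
                h[j][l] = g[j][l] / d
            else:
                h[j][l] = g[j][l] - g[j][k] * g[k][l] / d
    return h


def conditional_normal(mu, sigma, n_obs, y_obs):
    """Mean vector and covariance of the missing block given the observed block.

    The first n_obs variables are treated as observed, the rest as missing.
    """
    swept = sigma
    for k in range(n_obs):
        swept = sweep(swept, k)
    p = len(mu)
    resid = [y_obs[i] - mu[i] for i in range(n_obs)]
    # Conditional mean: mu_2 plus regression coefficients times (y_1 - mu_1).
    cond_mean = [
        mu[j] + sum(swept[i][j] * resid[i] for i in range(n_obs))
        for j in range(n_obs, p)
    ]
    # Conditional covariance: the lower-right block of the swept matrix.
    cond_cov = [[swept[i][j] for j in range(n_obs, p)] for i in range(n_obs, p)]
    return cond_mean, cond_cov


# Hypothetical three-variable example: two observed, one missing.
mu = [1.0, 2.0, 3.0]
sigma = [[4.0, 2.0, 1.0], [2.0, 3.0, 1.0], [1.0, 1.0, 2.0]]
cond_mean, cond_cov = conditional_normal(mu, sigma, n_obs=2, y_obs=[2.0, 4.0])
print(cond_mean, cond_cov)
```

In the imputation step, a value for the missing variable would then be drawn from a normal distribution with this conditional mean and covariance.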

## 2.2 Applying Bayes Theorem To Estimate Covariance Matrix And The Mean Vector

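Let $Y = (y_1', y_2', \dots, y_n')'$ be an $(n \times b)$ matrix made up of $n$ independent $(b \times 1)$ vectors $y_i$, each of which is multivariate normal with mean zero and covariance matrix $\Lambda$. It then follows that

$$A = Y'Y = \sum_{i} y_i y_i' \ \text{is Wishart distributed,} \quad A \sim W(n, \Lambda). \qquad (13)$$

From (13), it follows that the inverse has an inverted Wishart distribution,

$$B = A^{-1} \sim W^{-1}(n, \Psi), \qquad (14)$$

with $n$ being the degrees of freedom

and the precision matrix denoted by $\Psi = \Lambda^{-1}$.

From equations (9) and (12), and maintaining the multivariate normal distribution assumption, we can specify the prior[3] inverted Wishart distribution for $\Sigma$ and the prior normal distribution for $\mu$ as follows:

$$\Sigma \sim W^{-1}(m, \Psi), \qquad (15)$$

$$\mu \mid \Sigma \sim N\!\left(\mu_0, \tfrac{1}{\tau}\Sigma\right), \qquad (16)$$

where $\tau > 0$ is a fixed number.[4] The posterior distribution is then

$$\Sigma \mid Y \sim W^{-1}\!\left(n + m,\ (n-1)S + \Psi + \frac{n\tau}{n+\tau}(\bar{y} - \mu_0)(\bar{y} - \mu_0)'\right),$$

$$\mu \mid (\Sigma, Y) \sim N\!\left(\frac{1}{n+\tau}(n\bar{y} + \tau\mu_0),\ \frac{1}{n+\tau}\Sigma\right), \qquad (17)$$

where $\bar{y}$ is the sample mean vector and $(n-1)S$ is the CSSCP[5] matrix.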

## 2.3 Combining Inferences Of Multiple Imputations.

After imputing m times, standard data analysis can be applied to each completed set, producing m sets of inferences containing information such as the variance. These can now be combined to produce point and variance estimates for a parameter Q. Let $\hat{Q}_i$ and $\hat{U}_i$ be the point and variance estimates from the ith augmented data set, i = 1, 2, …, m. Then the point estimate for Q is the average of the m complete-data estimates, as shown below:

$$\overline{Q} = \frac{1}{m} \sum_{i=1}^{m} \hat{Q}_i \qquad (17)$$

Let $\overline{U}$, the average of the m complete-data variance estimates, be the within-imputation variance, given by:

$$\overline{U} = \frac{1}{m} \sum_{i=1}^{m} \hat{U}_i \qquad (18)$$

with B representing the between-imputation variance,

$$B = \frac{1}{m-1} \sum_{i=1}^{m} (\hat{Q}_i - \overline{Q})^2 \qquad (19)$$

Applying Rubin's idea, the total variance is therefore given by:

$$T = \overline{U} + \left(1 + \frac{1}{m}\right) B. \qquad (20)$$

Combining (17) and (20), the statistic $(Q - \overline{Q})\,T^{-1/2}$ has an approximate t-distribution[6] with $v_m$ degrees of freedom, where $v_m$ is given by:

$$v_m = (m-1)\left[1 + \frac{\overline{U}}{(1 + m^{-1})B}\right]^2 \qquad (21)$$

The ratio $r = (1 + m^{-1})B / \overline{U}$ is called the relative increase in variance due to nonresponse and is zero if there is no missing information about Q. When either m is large or r is small, the degrees of freedom $v_m$ will be large and $(Q - \overline{Q})\,T^{-1/2}$ will be approximately normally distributed.

Equally important is the insight into the fraction[7] of missing information about Q:

$$\hat{\lambda} = \frac{r + 2/(v_m + 3)}{r + 1}. \qquad (22)$$

These quantities are imperative in assessing the contribution the missing information makes to the uncertainty about Q.
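The combining rules in this section can be sketched in a few lines of Python. This is a minimal illustration, not the SAS code used in this report: the function names are hypothetical, and the formulas are the standard Rubin combining rules, namely total variance T = within + (1 + 1/m) × between, relative increase in variance r = (1 + 1/m) × between / within, degrees of freedom v = (m − 1)(1 + 1/r)², and fraction of missing information λ = (r + 2/(v + 3))/(r + 1). The example feeds in the between- and within-imputation variances reported for the sales variable in Section 3.2 with m = 10.

```python
def pool_from_variances(between, within, m):
    """Combine between- and within-imputation variance by Rubin's rules."""
    total = within + (1 + 1 / m) * between   # total variance, equation (20)
    r = (1 + 1 / m) * between / within       # relative increase in variance
    df = (m - 1) * (1 + 1 / r) ** 2          # degrees of freedom, equation (21)
    lam = (r + 2 / (df + 3)) / (r + 1)       # fraction of missing information, (22)
    return {"total": total, "r": r, "df": df, "lambda": lam}


def pool_estimates(q_hats, u_hats):
    """Pool m point estimates and their variances into one inference."""
    m = len(q_hats)
    q_bar = sum(q_hats) / m                                # pooled point estimate
    u_bar = sum(u_hats) / m                                # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in q_hats) / (m - 1)    # between-imputation variance
    result = pool_from_variances(b, u_bar, m)
    result["q_bar"] = q_bar
    return result


# Between and within variances for the sales variable, m = 10 imputations.
sales = pool_from_variances(between=47.688718, within=1499.058115, m=10)
print(sales)
```

The total variance, relative increase in variance and fraction of missing information computed this way agree with the values reported for that variable in Section 3.2. The DF column there differs from `df` because, as noted later in the report, the SAS results use adjusted degrees of freedom.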

## 2.4 Multiple Imputation Efficiency.

Rubin postulated that the more imputations one takes, the smaller the variance of the inferences will be.

It follows that carrying out infinitely many imputations would be the best way to minimise the estimator variance. However, this is neither practical nor economical. One can instead compute only a few imputations, and the relative efficiency (RE) of doing so can be calculated as a function of m and $\lambda$:

$$RE = \left(1 + \frac{\lambda}{m}\right)^{-1} \qquad (23)$$
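A short Python sketch shows how quickly the relative efficiency approaches 1 as the number of imputations grows. It assumes the standard expression RE = 1 / (1 + λ/m), where λ is the fraction of missing information and m the number of imputations; the function name and the grid of values are hypothetical.

```python
def relative_efficiency(lam, m):
    """Relative efficiency of m imputations versus infinitely many,
    for a fraction of missing information lam."""
    return 1.0 / (1.0 + lam / m)


# Illustrative grid: rows are fractions of missing information,
# columns are numbers of imputations.
for lam in (0.1, 0.3, 0.5):
    row = [f"{relative_efficiency(lam, m):.4f}" for m in (3, 5, 10, 20)]
    print(lam, row)
```

With a modest fraction of missing information, even a handful of imputations gives an efficiency very close to 1, which is why a small m is usually regarded as sufficient.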

## 3.0 Implementation of MCMC in SAS.

In this section I am going to carry out multiple imputations on a sample of data (restrnt.txt)[8] by programming with SAS©. This will allow me to carry out standard analysis of the data set, taking into account the significance of any systematic reasons why the data are missing in the first place.

Additionally, I will carry out single imputation of the data and again carry out standard analysis on the "new complete data". This will form the basis of my discussion of the comparison between multiple and single imputation as choices for dealing with missing data.

Please note that the full code and all output are included in the appendix for a detailed overview of the execution of the programs I wrote. Also included there is a brief manual explaining the workings of the commands employed.

## 3.1 Single And Multiple Imputation Models.

| Model Information | |
| --- | --- |
| Data Set | MWALE.REST1 |
| Method | MCMC |
| Multiple Imputation Chain | Single Chain |
| Initial Estimates for MCMC | EM Posterior Mode |
| Start | Starting Value |
| Prior | Jeffreys |
| Number of Imputations | 10 |
| Number of Burn-in Iterations | 200 |
| Number of Iterations | 100 |
| Seed for random number generator | 1305417 |

Figure 1. Model Information for Multiple Imputations.

Figure 1 is the SAS output of the code, showing that the multiple imputation was carried out 10 times (m = 10) and that the method applied was MCMC. Because the Markov chain only reaches its stationary distribution after a number of iterations, I have "burnt in" the first 200 iterations, which helps discard the dependence on the starting values in the early part of the Markov chain.
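The settings in Figure 1 imply a schedule of iterations at which imputed data sets are taken from the single chain. The Python sketch below works this out under an assumption about how the settings are interpreted: the burn-in iterations come before the first imputation, and the "number of iterations" is the spacing between successive imputations. The function name is hypothetical and the code only does the bookkeeping; it does not run any MCMC.

```python
def imputation_iterations(n_imputations, burn_in, between):
    """Iteration numbers at which imputations are drawn from a single chain,
    assuming burn_in iterations precede the first imputation and `between`
    iterations separate successive imputations."""
    return [burn_in + k * between for k in range(n_imputations)]


# Settings reported in Figure 1: 10 imputations, 200 burn-in, 100 between.
schedule = imputation_iterations(n_imputations=10, burn_in=200, between=100)
print(schedule)
print("total iterations:", schedule[-1])
```

Spacing the imputations apart in this way is intended to make successive imputed data sets approximately independent draws.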

As detailed in the appendix, the output also shows the missing data patterns. Here each variable is analysed and the frequency of missing data in each is noted. This may be important information from a data collector's standpoint, as it helps highlight the nature and source of the missing data. Analysing those patterns, one quickly recognises that the data are not missing monotonically, as discussed above.

Figure 1.1 shows the model information for the single imputation model. Note that only one imputation took place, and that the assumption that no prior information is known about the covariance matrix and the mean is preserved by specifying "Jeffreys" in the prior option. For consistency, the random number generator seed is the same as in the multiple imputation model.

The MI Procedure

| Model Information | |
| --- | --- |
| Data Set | MWALE.REST1 |
| Method | MCMC |
| Multiple Imputation Chain | Single Chain |
| Initial Estimates for MCMC | EM Posterior Mode |
| Start | Starting Value |
| Prior | Jeffreys |
| Number of Imputations | 1 |
| Number of Burn-in Iterations | 200 |
| Number of Iterations | 100 |
| Seed for random number generator | 1305417 |

Figure 1.1. Model Information for Single Imputation.

## 3.2 Variance Information of the Models.

| Variable | Between variance | Within variance | Total variance | DF |
| --- | --- | --- | --- | --- |
| SALES | 47.688718 | 1499.058115 | 1551.515705 | 257.05 |
| NEWCAP | 0.244187 | 2.678453 | 2.947059 | 203.1 |
| VALUE | 67.740387 | 2657.343936 | 2731.858362 | 261.73 |
| COSTGOOD | 0.132985 | 0.658865 | 0.805148 | 123.29 |
| WAGES | 0.044648 | 0.428854 | 0.477967 | 191.37 |
| ADS | 0.006896 | 0.055431 | 0.063017 | 174.1 |
| TYPEFOOD | 0.000058083 | 0.002505 | 0.002569 | 263.33 |
| SEATS | 0.113954 | 13.736092 | 13.861442 | 271.86 |
| OWNER | 0.000096182 | 0.003242 | 0.003348 | 258.68 |
| FT_EMPL | 0.014620 | 1.182980 | 1.199062 | 269.87 |
| PT_EMPL | 0.015868 | 0.712927 | 0.730382 | 263.95 |
| SIZE | 0.000075479 | 0.002382 | 0.002465 | 257.14 |

Figure 2.1. Variance Information for the Variables in MI.

Figure 2.2. Missing Values' Influence on Variance.

| Variable | Relative Increase in Variance | Fraction Missing Information | Relative Efficiency |
| --- | --- | --- | --- |
| SALES | 0.034994 | 0.034056 | 0.996606 |
| NEWCAP | 0.100284 | 0.092817 | 0.990804 |
| VALUE | 0.028041 | 0.027437 | 0.997264 |
| COSTGOOD | 0.222024 | 0.187623 | 0.981583 |
| WAGES | 0.114520 | 0.104850 | 0.989624 |
| ADS | 0.136857 | 0.123201 | 0.987830 |
| TYPEFOOD | 0.025506 | 0.025005 | 0.997506 |
| SEATS | 0.009126 | 0.009061 | 0.999095 |
| OWNER | 0.032635 | 0.031819 | 0.996828 |
| PT_EMPL | 0.024483 | 0.024022 | 0.997604 |
| SIZE | 0.034858 | 0.033927 | 0.996619 |

Figures 2.1 and 2.2 highlight the within-imputation, between-imputation and total variances, the latter being the result of the complete-data analysis. Figure 2.2 makes explicit the effect the missing values have on increasing the variance. Despite this, the relative efficiencies of all the variables are high, at more than 98%, with the lowest being for the cost of goods (COSTGOOD) at 98.1583%. RE is discussed in greater detail in the previous sections. It can also be inferred that imputing for that same variable (COSTGOOD) brings about the largest relative increase in variance. It can be concluded that the uncertainty associated with the missing data for this variable is of relatively higher significance than for the other variables.

Of course this does not apply to SI, as there is only one set of imputations.


Figure 3.1. Parameter Estimates for Multiply Imputed Data.

Multiple Imputation Parameter Estimates

| Variable | Mean | Std Error | 95% Confidence Limit (lower) | 95% Confidence Limit (upper) | DF |
| --- | --- | --- | --- | --- | --- |
| SALES | 324.900360 | 39.389284 | 247.3336 | 402.4671 | 257.05 |
| NEWCAP | 12.697842 | 1.716700 | 9.3130 | 16.0827 | 203.1 |

| Variable | Minimum | Maximum | Mu0 | t for H0: Mean=Mu0 | Pr > \|t\| |
| --- | --- | --- | --- | --- | --- |
| SALES | 314.607914 | 332.600719 | 0 | 8.25 | <.0001 |
| NEWCAP | 12.046763 | 13.683453 | 0 | 7.40 | <.0001 |

Figure 3.2. Parameter Estimates for Multiply Imputed Data.

Multiple Imputation Parameter Estimates

| Variable | Mean | Std Error | 95% Confidence Limit (lower) | 95% Confidence Limit (upper) | DF |
| --- | --- | --- | --- | --- | --- |
| VALUE | 344.361871 | 52.267182 | 241.4442 | 447.2796 | 261.73 |
| COSTGOOD | 45.296763 | 0.897301 | 43.5207 | 47.0729 | 123.29 |
| WAGES | 24.924820 | 0.691351 | 23.5612 | 26.2885 | 191.37 |
| ADS | 3.914029 | 0.251032 | 3.4186 | 4.4095 | 174.1 |
| TYPEFOOD | 1.864748 | 0.050684 | 1.7650 | 1.9645 | 263.33 |
| SEATS | 71.805036 | 3.723096 | 64.4753 | 79.1348 | 271.86 |
| OWNER | 2.126259 | 0.057859 | 2.0123 | 2.2402 | 258.68 |
| FT_EMPL | 8.014748 | 1.095017 | 5.8589 | 10.1706 | 269.87 |
| PT_EMPL | 12.798921 | 0.854624 | 11.1162 | 14.4817 | 263.95 |
| SIZE | 1.674460 | 0.049648 | 1.5767 | 1.7722 | 257.14 |

| Variable | Minimum | Maximum | Mu0 | t for H0: Mean=Mu0 | Pr > \|t\| |
| --- | --- | --- | --- | --- | --- |
| VALUE | 326.848921 | 353.402878 | 0 | 6.59 | <.0001 |
| COSTGOOD | 44.899281 | 46.093525 | 0 | 50.48 | <.0001 |
| WAGES | 24.507194 | 25.241007 | 0 | 36.05 | <.0001 |
| ADS | 3.787770 | 4.050360 | 0 | 15.59 | <.0001 |
| TYPEFOOD | 1.848921 | 1.874101 | 0 | 36.79 | <.0001 |
| SEATS | 71.388489 | 72.312950 | 0 | 19.29 | <.0001 |
| OWNER | 2.115108 | 2.143885 | 0 | 36.75 | <.0001 |
| FT_EMPL | 7.877698 | 8.233813 | 0 | 7.32 | <.0001 |
| PT_EMPL | 12.593525 | 12.949640 | 0 | 14.98 | <.0001 |
| SIZE | 1.661871 | 1.690647 | 0 | 33.73 | <.0001 |

Figures 3.1 and 3.2 show the results of the analysis of the parameter estimates.

There is great variation in the 95% confidence limits of the variables computed under MCMC. This may not necessarily be inherent in the model itself but may be due to the fact that each of the variables may have an independent relationship with the response variable "outlook", which in turn results in a non-trivial pattern in the imputed data. The standard errors of the variables are relatively low, bar the imputed values of the "value" variable, which shows the largest sampling uncertainty under this model.

With the aid of the MIANALYZE procedure of SAS I was able to carry out several additional complete-data analyses on the imputed data. A comparison of the covariances and means of the imputed data was carried out[9]. It is clear from these results that MI in this case offers a method with a smaller variance increase with each imputation. However, it is important to note that these results use adjusted degrees of freedom, which could be an influencing factor in the relatively higher variance in uncertainty with the SI method.

After carrying out several analyses, I at this point recommend the use of MI over SI for imputing this particular data set. MI's better performance could be attributed to the highly multivariate nature of the data and the fact that most of the variables had missing data. This, however, does not mean that MI is strictly better than SI, but rather that the decision on the use of either should be made on a case-by-case basis.

## 4.0 Advantages And Disadvantages Of MCMC Method.

By definition MI completes missing data, allowing for complete-data methods of analysis. Useful inferences can then be drawn from otherwise unusable data. This is particularly important for survey results.

## 4.1 Advantages Of MCMC Multiple Imputation

The most important advantage of MCMC is its ability to impute data sets which do not conform to trivial missing patterns. This is due to the fact that it utilises Markov chains, which are random in nature.

Also, due to its relative efficiency, MCMC multiple imputation requires only a few draws to carry out plausible imputations. This saves time and resources and makes it an attractive method to use.

Imputed values are calculated based on the respondent's own values on other variables.

A unique set of predictors can be used for each variable with missing data, ensuring that patterns are preserved as far as practically possible in such situations.

## 4.2 Disadvantages.

The method only works well for estimation when dealing with monotonically missing data.

It reinforces existing relationships and reduces generalisability.[10]

There must be sufficient relationships among the variables to generate valid predicted values.

It understates variance unless an error term is added to the imputed value.

Replacement values may be out of range[11].

## 5.0 Single Imputation Comparison.

In single imputation (SI) one value is generated for each missing value. This method is the one most commonly implemented for handling item nonresponse in modern survey practice.

## 5.1 Advantages Of Single Imputation:

Just as is the case with MI, once the values have been imputed, standard statistical inferences can be drawn from the data with relative accuracy.

In many cases MI proves not to be necessary, as the inference obtained is similar to that which would have been obtained had SI been used. In those cases using SI would be an advantage for efficiency and time-saving purposes.

This is of importance as other methods of dealing with missing data may require extra input before the data can undergo standard analysis. Hence SI is an efficient method of dealing with nonresponse.

As SI can be carried out by the data collector, the procedure benefits from having someone privy to the source information carrying out the analysis.

## 5.2 Disadvantages Of Single Imputation.

The first obvious problem with SI is that, because only one set of imputed values is produced, it inevitably treats the missing data as known, without adjusting for sampling variability or for the uncertainty brought about by using that specific model to do the imputation in the first place.
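The way a single set of imputed values can understate variability is easy to see with the simplest possible single imputation, filling each gap with the observed mean. The Python sketch below is only an illustration of that effect: mean imputation is not the single imputation model used in this report (which was MCMC with one imputation), and the function name and data are hypothetical.

```python
from statistics import mean, variance


def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical variable with two missing values.
data = [1.0, 2.0, 3.0, 4.0, 5.0, None, None]
completed = mean_impute(data)

var_observed = variance([v for v in data if v is not None])
var_completed = variance(completed)
print(f"sample variance, observed values only: {var_observed:.4f}")
print(f"sample variance, after mean imputation: {var_completed:.4f}")
```

The completed data look less variable than the observed values because the filled-in values are treated as if they were real observations. Multiple imputation addresses this by adding the between-imputation component to the total variance.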

## 6.0 Conclusion.

Multiple imputation is a genuine way of dealing with missing data compared to the other options. In addition to allowing complete-data analysis on such data, more importantly it does not disregard the possibility of systematic reasons why the data might be missing in the first place. Simply ignoring missing data may lead to incorrect inferences being made, by omitting trends suppressed by the missing data. In conclusion, there is no single imputation method that is appropriate for every scenario, so the burden is on the researcher to exercise extreme caution and make a valid judgement as to the choice of imputation method on a case-by-case basis. Some of the imperative factors to consider have been discussed in this report.

## Bibliography

Hair, J. F., W. C. Black, B. J. Babin, R. E. Anderson, and R. L. Tatham. Multivariate Data Analysis. New Jersey: Pearson Education, Inc., 2006.

Little, Roderick J. A., and Donald B. Rubin. Statistical Analysis With Missing Data. New Jersey: John Wiley & Sons, Inc., 2002.

Schafer, J. L. Analysis of Incomplete Multivariate Data (Monographs on Statistics and Applied Probability). Florida: Chapman & Hall/CRC, 1997.

www.support.sas.com. http://support.sas.com/rnd/app/papers/miv802.pdf (accessed January 11, 2010).