Contents:
Case Overview
Characteristics of a Diamond
* The Four C’s (Color, Carat, Cut and Clarity)
* Symmetry and Polish
* Certification
Pricing
Data Set
Regression Analysis
* Full Level – Level Model
* Partial Level – Level Model (Carat)
* Partial Level – Level Model (Carat*Color)
* Ln – Ln Model
* Ln – Level Model
* Level – Ln Model
Comparison
Appendix

Case Overview
This paper aims to help a professor make an informed decision about which diamond to buy.

He feels that his girlfriend has already hinted about marriage three times, and that the time has come to do something about it. He decides to propose to her and sets aside $2000 to $4000 for a diamond ring. At the store, the professor finds out that picking the right diamond is not a straightforward task.

Characteristics of a Diamond
The value of a diamond is determined by many characteristics. Some of the most important ones are listed below.

Color
Diamonds vary on a scale from colorless to yellow, with the yellower shades being the cheapest and the colorless ones being the most valuable.

Other colors are extremely rare and very expensive.

Carat
This is the unit of weight for diamonds, where 1 carat equals 0.2 grams. Higher-carat diamonds are rarer and, as a result, more valuable. For instance, a two-carat diamond will be more expensive than two one-carat diamonds combined.

Cut
Cut refers to the shape and proportions of the diamond and is the main determinant of its reflective properties. A well-cut diamond has ideal proportions in its facets, depth and width, and reflects light brilliantly.

Clarity
The clarity of a diamond describes the flaws that can be detected within it (Appendix 2).

Symmetry and Polish
Symmetry measures how well the facets on the top of the diamond line up with the bottom of the diamond (pavilion). Bad facet alignment may lead to light leaking out before being reflected. Polish describes how smooth and reflective the surface of the stone is. Polish and Symmetry are measured on a scale of poor, fair, good, very good, excellent and ideal.

Certification
Diamonds are evaluated on the characteristics listed above.

Numerous labs carry out this task, but only a few are well-known sources of diamond certification. The Gemological Institute of America (GIA) and the American Gemological Society (AGS) are two of the most respected labs. Other labs include European Gemological Laboratories (EGL) and the International Gemological Institute (IGI), along with many smaller labs.

Pricing
This paper attempts to predict a range of diamond prices by regressing price (the dependent variable) against the four C’s, Polish, Symmetry and Certification (the independent variables).

The analysis consists of running various regression models and comparing them to see which one fits the data best. This is done by making inferences from the outputs of these models and with the help of graphical analysis.

The Data Set
The data set used for the analysis consists of 440 diamonds, grouped under three different wholesalers. Characteristics of each diamond are provided in the Appendix. Prices in the data set vary from $160 to $3145. Various models are fit to this data, with each model being refined as inferences are made.

Regression

1. Level – Level Full Model:
The first model regresses price on all seven variables and shows how price depends on the predictor variables.

$$\text{Price} = \beta_0 + \beta_1 x + \beta_2 y + \beta_3 z + \dots + \beta_k k + \varepsilon$$

where $\beta_0$ is the intercept, $\beta_1, \beta_2, \dots, \beta_k$ are the coefficients of the k predictor variables (k = 7 in this model), and $\varepsilon$ is the error term. Each coefficient indicates how the price changes with a unit change in that variable, keeping the rest of the variables constant. The output shows that cut, polish and certification have small t-statistics, with corresponding p-values considerably greater than 0.05 (the significance level). A stepwise regression drops these variables, indicating that they do not contribute significantly to the model (Appendix). One reason might be that the observations are clumped into three groups, one per wholesaler, as the scatterplot of price vs. carat shows (Appendix). The scatterplot also shows that wholesaler 3 carries diamonds in the range of $160 to $665, far below the rest of the data. These observations might lead to inaccurate estimates of the effect of the predictor variables on price.
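The Level – Level fit described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual computation: it solves the ordinary least squares normal equations in plain Python, the helper names `solve_linear_system` and `fit_ols` are hypothetical, and the five diamonds and their prices are made up so that the true coefficients are known in advance.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_ols(rows: Sequence[Sequence[float]], prices: Sequence[float]) -> List[float]:
    """Return [b0, b1, ..., bk] that minimise the sum of squared errors."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * p for d, p in zip(design, prices)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data (assumption): (carat, color score) pairs whose prices follow
# 500 + 2000*carat + 100*color exactly, so the fit should recover those values.
diamonds = [(1.0, 4), (1.2, 5), (1.5, 3), (1.1, 6), (1.3, 4)]
prices = [500 + 2000 * carat + 100 * color for carat, color in diamonds]
coefficients = fit_ols(diamonds, prices)
```

Each entry of `coefficients` after the first is read the way the text describes: the change in price for a one-unit change in that variable, holding the others fixed.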

The scatterplot also shows that wholesaler 1 deals in diamonds falling within a very small interval of $3000 to $3091, while those from wholesaler 2 fall in the range of $1856 to $3145. As a result, this paper assumes that a model fit only to data from wholesalers 1 and 2 will be the most useful for finding the right diamond for the professor.

2. Partial Level – Level Model (Carat):
Considering the assumption made above, the next model regresses price on all the variables, but only for diamonds from wholesalers 1 and 2.

The scatterplot of price vs. carat shows a nonlinear (quadratic) relation. Price seems to decrease at first as carat size increases, up to a point, and then increases as carat size increases further. Testing for heteroscedasticity (non-constant variance of the error terms) is very important, as Ordinary Least Squares (OLS) regression assumes that the errors have a constant variance. If they do not, the estimated standard errors will tend to be biased. The Breusch–Pagan and White tests can help show whether heteroscedasticity is present.
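To make the idea of the test concrete, here is a minimal sketch of a Breusch–Pagan style check for a single predictor, using Koenker's studentized form of the statistic (sample size times the R-squared of an auxiliary regression of the squared residuals on the predictor). The function names and the two made-up series are assumptions for illustration; the paper does not state which software or test variant it used.

```python
def simple_ols(x, y):
    """Fit y = a + b*x and return (a, b, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 0.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return a, b, r_squared


def breusch_pagan_lm(x, y):
    """LM statistic: n times R^2 from regressing squared residuals on x."""
    a, b, _ = simple_ols(x, y)
    squared_residuals = [(yi - a - b * xi) ** 2 for xi, yi in zip(x, y)]
    _, _, r2_aux = simple_ols(x, squared_residuals)
    return len(x) * r2_aux


# 5% critical value of the chi-square distribution with 1 degree of freedom.
CHI2_CRITICAL_5PCT_1DF = 3.841

# Made-up series (assumption): same trend, different error behaviour.
x = list(range(1, 21))
steady = [2 + 3 * xi + (-1) ** xi for xi in x]         # error size is constant
fanning = [2 + 3 * xi + (-1) ** xi * xi for xi in x]   # error size grows with x

lm_steady = breusch_pagan_lm(x, steady)
lm_fanning = breusch_pagan_lm(x, fanning)
```

An LM value above the critical value points to heteroscedasticity, so in this sketch only the `fanning` series would be flagged.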

These tests have been carried out, and the output and graphs suggest that the errors have non-constant variance, indicating the presence of heteroscedasticity. At this point, this paper makes another assumption: observations from wholesaler 1 should be dropped from the regression, as prices of diamonds in this category vary within a small interval ($3000 – $3091) compared with the professor’s range ($2000 – $4000).

3. Partial Level – Level Model (Carat*Color):
As indicated above, we run another regression, this time adding an interaction term, carat*color.

This will help us understand whether the price of a diamond depends not only on carat, but also on how carat size and color combine (for example, whether a 1 carat diamond in a yellow shade is less expensive than a 1.2 carat diamond in the same shade, or comparisons across shades). Our model becomes:

$$\text{Price} = \beta_0 + \beta_1 x + \beta_2 y + \beta_3 (x \cdot y) + \beta_4 z + \dots + \beta_k k + \varepsilon$$

where x denotes carat and y denotes color. A stepwise regression of this model drops the interaction term, suggesting that the effect of carat on price does not depend significantly on color. Please note that all the models, including this one and the ones that follow, consist of observations only from wholesaler 2 (hence the term Partial), for the reasons given above.

4. Partial Ln – Ln Model:
There are ways to achieve linearity in the model with the provided data set. One of them is to transform the dependent and/or independent variables into ln(variable) form. This model predicts the percentage change in price for a one percent change in any of the independent (predictor) variables.

$$\ln(\text{Price}) = \beta_0 + \beta_1 \ln(x) + \beta_2 \ln(y) + \beta_3 \ln(z) + \dots + \beta_k \ln(k) + \varepsilon$$

Modeled this way, it can be used to gain insight into the elasticity of demand for diamonds at different price levels (as the change in price and the change in any variable are both in percentage terms).

5. Partial Ln – Level Model:
This model is a variation of the one stated above. The difference is that only price is taken in ln form here, and the rest of the variables are as defined. The model becomes:

$$\ln(\text{Price}) = \beta_0 + \beta_1 x + \beta_2 y + \beta_3 z + \dots + \beta_k k + \varepsilon$$

The outcome of this regression is the percentage change in the price of diamonds due to a unit change in any one variable, keeping the rest of the variables constant. Moreover, a check for heteroscedasticity shows that it is lower than in the Level – Level model for observations from wholesalers 1 and 2 (Appendix).

6. Partial Level – Ln Model:
In this model, we regress price against ln(variable). The coefficients show how price, in dollars, changes with a one percent change in any variable. The model fit is:

$$\text{Price} = \beta_0 + \beta_1 \ln(x) + \beta_2 \ln(y) + \beta_3 \ln(z) + \dots + \beta_k \ln(k) + \varepsilon$$

The form of this model is useful in predicting the elasticity of demand for different diamonds at different price levels with respect to any variable. It can be used to test the belief that diamond demand is less elastic at higher prices.

Comparison
The models given above were used in the analysis of predicting a price interval for diamonds.
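Because the three log specifications are read differently, a small worked example may help. The sketch below plugs the carat coefficients reported in Appendices 3.4, 3.5 and 3.6 into the standard interpretation of each functional form; the function names are hypothetical, and the chosen changes (10% and 0.1 carat) are arbitrary illustrations rather than anything stated in the paper.

```python
import math


def ln_ln_pct_change(beta: float, pct_change_x: float) -> float:
    """Ln - Ln: percent change in price for a pct_change_x percent change in x."""
    return ((1 + pct_change_x / 100) ** beta - 1) * 100


def ln_level_pct_change(beta: float, delta_x: float) -> float:
    """Ln - Level: percent change in price for a delta_x unit change in x."""
    return (math.exp(beta * delta_x) - 1) * 100


def level_ln_dollar_change(beta: float, pct_change_x: float) -> float:
    """Level - Ln: dollar change in price for a pct_change_x percent change in x."""
    return beta * math.log(1 + pct_change_x / 100)


# Carat coefficients as reported in Appendices 3.4, 3.5 and 3.6.
ln_ln = ln_ln_pct_change(0.7361, 10)              # carat up 10%
ln_level = ln_level_pct_change(0.5414, 0.1)       # carat up 0.1
level_ln = level_ln_dollar_change(1867.576, 10)   # carat up 10%
```

Under these assumptions, a 10% larger stone is associated with a price roughly 7.3% higher in the Ln – Ln model and about $178 higher in the Level – Ln model, while an extra 0.1 carat is associated with a price about 5.6% higher in the Ln – Level model, in each case holding the other variables constant.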

As the analysis progressed, several adjustments were made to see which model fits the data best. Countless models could be tested, and the analysis and results depend on the assumptions and subjective choices of whoever carries out the analysis. The correlation matrix (Appendix) shows that all the variables except color have a considerable relation with price. Correlation is measured on the range [-1, 1]. A value close to -1 means a strong negative correlation, while a value close to +1 means a strong positive correlation. Any value close to zero means the two variables are weakly correlated.
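As a brief illustration of the correlation scale just described, the following sketch computes the Pearson correlation coefficient for two made-up price series. The function name and the data are assumptions for illustration only; they are not taken from the paper's correlation matrix.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Made-up data (assumption): one price series rises with carat, one falls.
carats = [1.0, 1.1, 1.2, 1.3, 1.4]
rising_prices = [2000, 2200, 2400, 2600, 2800]
falling_prices = [2800, 2600, 2400, 2200, 2000]

r_rising = pearson_correlation(carats, rising_prices)
r_falling = pearson_correlation(carats, falling_prices)
```

Here `r_rising` sits at the +1 end of the scale and `r_falling` at the -1 end, because both series are exactly linear in carat.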

From the given matrix, it is evident that color has the lowest correlation with price, equal to 0.0036. However, after running several regression models and checking which one fits the data best, it appears that color cannot be ignored in terms of its effect on price for the particular set of observations taken. This paper uses Root MSE (Root Mean Squared Error) as the comparative statistic. It is calculated by squaring the error terms, taking their average and then taking the square root. One thing to note is that different forms of models (Level – Ln, Ln – Ln, etc.) will have different measures of Root MSE. As a result, this statistic needs to be converted into the same form and units in order to compare the various models. Other statistics used to make judgments include the F-test (to check whether the model fits the data), p-values (significance of coefficients) and R² (to see how much of the variation in the data is explained by the model). A model-wise comparison is given in the Appendix.

Appendices

Appendix 1: Summary Statistics

| | Carat | Price |
| Max. | 1.58 | $3,145 |
| Min. | 0.09 | $160 |
| Average | 0.67 | $1,717 |

| | Wholesaler 1 | Wholesaler 2 | Wholesaler 3 |
| No. of Diamonds | 60 | 180 | 200 |
| Price Range | $3000 – $3091 | $1856 – $3145 | $160 – $665 |
| Average Price | $3,043 | $2,662 | $468 |
| Carat Range | 0.8 – 0.92 | 1 – 1.58 | 0.09 – 0.3 |
| Average Carat Size | 0.84 | 1.06 | 0.27 |

Appendix 2: Diamond Characteristics

| Professor’s Diamond Ring | |
| Price | $3,100 |
| Carat Weight | 0.9 |
| Cut | Very Good |
| Color | J |
| Clarity | SI2 |
| Polish | Good |
| Symmetry | Very Good |
| Certification | GIA |

Appendix 2 Continued:

| Characteristic | Scale | Comments |
| Carat | | 1 carat = 0.2 grams |
| Color | D-F (6) | Colorless |
| | G-I (5) | Near colorless |
| | J-K (4) | Faint yellow |
| | L-N (3) | Very light yellow |
| | O-S (2) | Light yellow |
| | T-Z (1) | Yellow |
| Cut | Poor, Fair, Good, Very good, Excellent, Ideal | |
| Clarity | FL | Flawless: no flaws |
| | IF | Internally Flawless: no internal flaws |
| | VVS1 | Very, Very Slightly Included: very, very few inclusions at 30× |
| | VVS2 | Very, Very Slightly Included: very few inclusions at 30× |
| | VS1 | Very Slightly Included: few inclusions at 30× |
| | VS2 | Very Slightly Included: several inclusions at 30× |
| | SI1 | Slightly Included: very, very few inclusions at 10× |
| | SI2 | Slightly Included: very few inclusions at 10× |
| | SI3 | Slightly Included: several inclusions at 10× |
| | I1 | Included: very few inclusions, but visible to the naked eye |
| | I2 | Included: few inclusions visible to the naked eye |
| | I3 | Included: several inclusions visible to the naked eye |

Appendix 3.1: Level – Level Full Model (Stepwise – Dropping Cut, Polish and Certification)

| Source | SS | df | MS |
| Model | 560673122 | 4 | 140168280 |
| Residual | 46132399.4 | 435 | 106051.493 |
| Total | 606805521.4 | 439 | 140274331.5 |

| No. of Observations | 440 |
| F (4,435) | 1321.7 |
| Prob > F | 0.000 |
| R squared | 0.924 |
| Adj. R squared | 0.923 |
| Root MSE | 325.66 |

| Price | Coefficient | Std. Error | t | p-value | 95% CI Lower | 95% CI Upper |
| Carat | 3769.419 | 63.018 | 59.810 | 0.000 | 3645.561 | 3893.270 |
| Colour | 486.3624 | 39.498 | 12.310 | 0.000 | 408.732 | 563.993 |
| Clarity | 233.8763 | 14.193 | 16.480 | 0.000 | 205.982 | 261.771 |
| Symmetry | 84.8572 | 21.235 | 3.970 | 0.000 | 42.650 | 126.122 |
| Constant | -3155.398 | 157.197 | -20.070 | 0.000 | -3464.358 | -2846.438 |

Appendix 3.2: Level – Level Partial Model (Wholesalers 1 and 2) (Stepwise – dropping polish and symmetry)

| Source | SS | df | MS |
| Model | 15529166.3 | 5 | 3105833.27 |
| Residual | 16728397.5 | 234 | 71488.8782 |
| Total | 32257563.8 | 239 | 134968.886 |

| No. of Observations | 240 |
| F (5,234) | 43.44 |
| Prob > F | 0.000 |
| R squared | 0.4814 |
| Adj. R squared | 0.4703 |
| Root MSE | 267.37 |

| Price | Coefficient | Std. Error | t | p-value | 95% CI Lower | 95% CI Upper |
| Carat | 1113.524 | 220.745 | 5.040 | 0.000 | 678.622 | 1548.426 |
| Colour | 279.0194 | 44.599 | 6.260 | 0.000 | 191.153 | 366.886 |
| Clarity | 200.6413 | 16.891 | 11.880 | 0.000 | 167.363 | 233.920 |
| Cut | 41.56663 | 12.055 | 3.450 | 0.001 | 17.816 | 65.317 |
| Certification | 94.39192 | 36.720 | 2.570 | 0.011 | 22.047 | 166.736 |
| Constant | 28.86792 | 334.272 | 0.09 | 0.931 | -629.699 | 687.435 |

Appendix 3.3: Level – Level Partial Model (only wholesaler 2) (Stepwise – dropping symmetry)

| Source | SS | df | MS |
| Model | 12589905 | 6 | 2098317.5 |
| Residual | 13088261.8 | 173 | 75654.6927 |
| Total | 25678166.8 | 179 | 143453.446 |

| No. of Observations | 180 |
| F (6,173) | 27.74 |
| Prob > F | 0.000 |
| R squared | 0.4903 |
| Adj. R squared | 0.4726 |
| Root MSE | 275.05 |

| Price | Coefficient | Std. Error | t | p-value | 95% CI Lower | 95% CI Upper |
| Carat | 1372.323 | 263.5418 | 5.21 | 0.000 | 852.1521 | 1892.4950 |
| Colour | 383.9623 | 52.4943 | 7.31 | 0.000 | 280.3505 | 487.5741 |
| Clarity | 267.8255 | 24.1559 | 11.09 | 0.000 | 220.1472 | 315.5038 |
| Cut | 38.63288 | 15.144 | 2.56 | 0.011 | 8.8006 | 68.4652 |
| Polish | 82.78422 | 36.8503 | 2.25 | 0.026 | 10.0502 | 155.5182 |
| Certification | 142.3838 | 51.0648 | 2.79 | 0.006 | 41.5935 | 243.1741 |
| Constant | -990.5638 | 386.4021 | -2.56 | 0.011 | -1753.2330 | -227.8944 |

Heteroscedasticity Test: [figure]

Appendix 3.4: Partial Ln – Ln Model (Stepwise – dropping cut and polish)

| Source | SS | df | MS |
| Model | 2.08437278 | 5 | 0.416874557 |
| Residual | 1.72508945 | 174 | 0.009914307 |
| Total | 3.80946223 | 179 | 0.021281912 |

| No. of Observations | 180 |
| F (5,174) | 42.05 |
| Prob > F | 0.000 |
| R squared | 0.5472 |
| Adj. R squared | 0.5341 |
| Root MSE | 0.09957 |

| lnPrice | Coefficient | Std. Error | t | p-value | 95% CI Lower | 95% CI Upper |
| lnCarat | 0.7361 | 0.1119 | 6.58 | 0.000 | 0.5151 | 0.9569 |
| lnColour | 0.2293 | 0.0273 | 8.39 | 0.000 | 0.1754 | 0.2833 |
| lnClarity | 0.4409 | 0.0334 | 13.19 | 0.000 | 0.3749 | 0.5069 |
| lnSymmetry | 0.1320 | 0.0312 | 4.23 | 0.000 | 0.0704 | 0.1937 |
| lnCertification | 0.1479 | 0.0271 | 5.47 | 0.000 | 0.0945 | 0.2013 |
| Constant | 7.0144 | 0.0630 | 111.31 | 0.000 | 6.8897 | 7.1384 |

Appendix 3.5: Partial Ln – Level Model (Stepwise – dropping symmetry)

| Source | SS | df | MS |
| Model | 1.86386422 | 6 | 0.310644037 |
| Residual | 1.94559801 | 173 | 0.011246231 |
| Total | 3.80946223 | 179 | 0.021281912 |

| No. of Observations | 180 |
| F (6,173) | 27.62 |
| Prob > F | 0.000 |
| R squared | 0.4893 |
| Adj. R squared | 0.4716 |
| Root MSE | 0.10605 |

| lnPrice | Coefficient | Std. Error | t | p-value | 95% CI Lower | 95% CI Upper |
| Carat | 0.5414 | 0.1016 | 5.33 | 0.000 | 0.3409 | 0.7420 |
| Colour | 0.1475 | 0.0202 | 7.29 | 0.000 | 0.1075 | 0.1874 |
| Clarity | 0.1028 | 0.0093 | 11.03 | 0.000 | 0.0844 | 0.1211 |
| Cut | 0.0159 | 0.058 | 2.67 | 0.008 | 0.0041 | 0.0271 |
| Polish | 0.0309 | 0.0142 | 2.17 | 0.031 | 0.0028 | 0.0589 |
| Certification | 0.0531 | 0.0197 | 2.70 | 0.008 | 0.0142 | 0.0919 |
| Constant | 6.4612 | 0.1490 | 43.37 | 0.000 | 6.1671 | 6.7553 |

Checking for heteroscedasticity, we notice that the effect is lower in a Ln – Level model (right side) as compared to a Level – Level model (left).

Appendix 3.6: Partial Level – Ln Model (Stepwise – dropping cut and polish)

| Source | SS | df | MS |
| Model | 14041797.2 | 5 | 2808359.3 |
| Residual | 11636369.7 | 174 | 66875.6879 |
| Total | 25678166.9 | 179 | 143453.446 |

| No. of Observations | 180 |
| F (5,174) | 41.99 |
| Prob > F | 0.000 |
| R squared | 0.5468 |
| Adj. R squared | 0.5338 |
| Root MSE | 258.6 |

| Price | Coefficient | Std. Error | t | p-value | 95% CI Lower | 95% CI Upper |
| lnCarat | 1867.5760 | 290.6430 | 6.43 | 0.000 | 1293.9360 | 2441.2150 |
| lnColour | 596.1278 | 71.0110 | 8.39 | 0.000 | 455.9741 | 736.2815 |
| lnClarity | 1147.4550 | 86.8453 | 13.21 | 0.000 | 976.0490 | 1318.8610 |
| lnSymmetry | 337.9256 | 81.1220 | 4.17 | 0.000 | 177.8158 | 498.0354 |
| lnCertification | 389.2828 | 70.2653 | 5.54 | 0.000 | 250.6007 | 527.9649 |
| Constant | 426.1247 | 163.6587 | 2.60 | 0.010 | 103.1128 | 749.1366 |

Appendix 4: Comparison

| Model | Price | Root MSE | Transformed Root MSE |
| Level – Level Full Model | $2230.37 | 325.66 | 325.66 |
| Level – Level Partial 1 | $2668.32 | 267.37 | 267.37 |
| Level – Level Partial 2 | $2655.30 | 275.05 | 275.05 |
| Ln – Ln Model | $2908.15 | 0.09957 | 264.96 |
| Ln – Level Model | $2619.30 | 0.10605 | 282.19 |
| Level – Ln Model | $2722.90 | 258.6 | 258.6 |

Root MSE = Root Mean Square Error:

$$\text{Root MSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

Root MSE is a measure of the deviation of predicted values from the true values of observations in the data.
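The Root MSE definition above, and the need to put log-scale errors back into dollars before comparing models, can be sketched as follows. The paper does not spell out how its transformed Root MSE values were obtained, so the conversion shown here (exponentiating the log-scale predictions and measuring the error in dollars) is only one plausible approach, and the three prices and fitted values are made up.

```python
import math


def root_mse(actual, predicted):
    """Square the errors, average them, then take the square root."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


# Made-up prices and fitted values (assumption), in dollars.
actual_prices = [2000.0, 2500.0, 3000.0]
level_predictions = [2100.0, 2400.0, 3100.0]

# Pretend a log model produced the same fitted prices, but on the ln scale.
log_predictions = [math.log(p) for p in level_predictions]

rmse_level = root_mse(actual_prices, level_predictions)
rmse_log_scale = root_mse([math.log(p) for p in actual_prices], log_predictions)
rmse_log_in_dollars = root_mse(actual_prices, [math.exp(v) for v in log_predictions])
```

The log-scale figure is a small unitless number, much like the 0.09957 and 0.10605 reported for the ln(Price) models, and only becomes comparable with the dollar-scale figure once the predictions are converted back to prices.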

The model with the lowest Root MSE is preferred, as it suggests that the predicted values are close to the actual data. On comparing the models given above, it is evident that the Level – Level full model is not a good fit, as it has the highest Root MSE (the assumption of dropping wholesaler 1 could be supported by this fact). The Level – Ln model, which regresses price on ln(variable), seems to fit the data best, as it has the lowest Root MSE. However, the goal of this paper was to predict an interval of prices, which in turn would help the professor with his choice of diamond ring. The interval is ($2230.37 – $2908.15).