Which variables go into a multiple regression equation?

Gerard E. Dallal, Ph.D.

Estrogen and the Risk of Heart Disease

[This section talks about the potentially beneficial effect of estrogen on heart disease risk. In July, 2002, the estrogen plus progestin component of the Women's Health Initiative, the largest Hormone Replacement Therapy trial to date, was halted when it was discovered that women receiving HRT experienced heart attack, stroke, blood clots, and breast cancer at a higher rate than those who did not take HRT (Journal of the American Medical Association 2002;288:321-333). This study and others like it have discredited estrogen therapy as a means of lowering heart disease risk. However, these results are in stark contrast to epidemiological studies that show a protective benefit from estrogen therapy.

No one doubts the findings of the randomized trials, but it has been said that if the epidemiology is wrong, this will be the first time the epidemiology has failed so miserably. To date, no one has come up with a satisfactory explanation of the discrepancy. Michels and Manson review many of the proposed explanations in their 2003 editorial in Circulation.

I've decided to let the example remain until there is general consensus over the reason why the trials and epidemiology disagree. It should be noted, however, that Michels and Manson end their editorial with the recommendation that "HT should not be initiated or continued for primary or secondary prevention of cardiovascular disease."

Update: Judy Foreman writing in the February 20, 2006, Boston Globe:

Two studies published over the last few weeks...aimed at better understanding the role hormones play in heart disease... Both found that starting estrogen therapy at menopause did not increase the risk of heart problems, while starting later in life does increase risk. In fact, there's a chance estrogen may even protect the hearts of those who take it early...
Why would the timing of hormones make such a difference? Because estrogen plays an important role in preventing some of the age-related buildup of plaque in artery walls...
When a woman's arteries are bathed in hormones -- either naturally or with estrogen supplements -- they harden more slowly. Estrogen appears to decrease "bad" (LDL) cholesterol and raise "good" (HDL). It also makes blood vessels more elastic, allowing them to dilate better, which increases blood flow.
But in older women who already have plaque on their artery walls, adding estrogen can increase the likelihood of blood clots or plaque ruptures that can trigger heart attacks and strokes.

Comment, February 2011: Every year, I worry about what to do with this section. It's a striking example of how including certain variables in a multiple linear regression can have a dramatic effect on the outcome, but if the science is wrong, then the example should go.

However, there are suggestions in the medical literature that HRT may be beneficial for some subgroups of women. For the purposes of this exercise, let's assume that the two studies reported on in the next section involved women for whom HRT would have a beneficial effect.]

The October 25, 1985, issue of the New England Journal of Medicine is notable for the reason given by John C. Bailar III in his lead editorial: "On rare occasions a journal can publish two research papers back-to-back, each appearing quite sound in itself, that come to conclusions that are incompatible in whole or in part... In this issue we have another such pair."

The two papers were

Wilson PWF, Garrison RJ, Castelli WP (1985), "Postmenopausal Estrogen Use, Cigarette Smoking, and Cardiovascular Morbidity In Women Over 50: The Framingham Study", New England Journal of Medicine, 313, 1038-1043.
Stampfer MJ, Willett WC, Colditz GA, Rosner B, Speizer FE, Hennekens CH (1985), "A Prospective Study of Postmenopausal Estrogen Therapy and Coronary Heart Disease", New England Journal of Medicine, 313, 1044-1049.

Both papers were based on epidemiologic studies rather than intervention trials. Wilson et al. studied women participating in the Framingham Heart Study. Stampfer et al. studied women enrolled in the Nurses' Health Study. The disagreement is contained in the last sentence of each abstract.

Wilson: No benefits from estrogen use were observed in the study group; in particular, mortality from all causes and from cardiovascular disease did not differ for estrogen users and nonusers.
Stampfer: These data support the hypothesis that the postmenopausal use of estrogen reduces the risk of severe coronary heart disease.

The reports generated an extensive correspondence suggesting reasons for the discrepancy (New England Journal of Medicine, 315 (July 10, 1986), 131-136). A likely explanation for the apparent inconsistency was proposed by Stampfer:

Among the reasons for the apparent discrepancy...may be their [Wilson's]...adjustment for the effects of high-density lipoprotein, which seems unwarranted, since high-density lipoprotein is a likely mediator of the estrogen effect. By adjusting for high-density lipoprotein, one only estimates the effect of estrogen beyond its beneficial impact on lipids.

Stampfer was saying that the way estrogen worked was by raising the levels of HDL-cholesterol, the so-called good cholesterol. When Wilson's group fitted their regression model to predict the risk of heart disease, they included both estrogen and HDL-cholesterol among their predictors. A multiple regression equation gives the effect of each predictor after adjusting for the effects of the other predictors (or, equivalently, with all other predictors held fixed). The Wilson equation estimated the effect of estrogen after adjusting for the effect of HDL-cholesterol, that is, the effect of estrogen when HDL-cholesterol was not allowed to change. To put it another way, it estimated the effect of estrogen after adjusting for the effect of estrogen! This is an example of overadjustment: adjusting for the very effect you are trying to estimate.
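Overadjustment is easy to reproduce with made-up data. The following is a minimal sketch, not a reanalysis of either paper: it assumes a hypothetical continuous risk score, invented effect sizes, and ordinary least squares in place of whatever model Wilson's group actually fitted. Estrogen is made to act on risk only by raising HDL, and the estrogen coefficient is then estimated with and without HDL in the equation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Assumed data-generating process (hypothetical numbers):
# estrogen raises HDL by 8 units, and risk falls 0.5 points per unit of HDL.
# Estrogen has no effect on risk except through HDL.
estrogen = rng.integers(0, 2, size=n).astype(float)
hdl = 50.0 + 8.0 * estrogen + rng.normal(0.0, 10.0, size=n)
risk = 100.0 - 0.5 * hdl + rng.normal(0.0, 5.0, size=n)


def ols(y, *predictors):
    """Return least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Estrogen alone: the coefficient is the total effect, about -0.5 * 8 = -4.
total_effect = ols(risk, estrogen)[1]

# Estrogen plus HDL: the coefficient is the effect with HDL held fixed, about 0.
adjusted_effect = ols(risk, estrogen, hdl)[1]

print(f"estrogen coefficient, HDL not in model: {total_effect:.2f}")
print(f"estrogen coefficient, HDL in model:     {adjusted_effect:.2f}")
```

Under these assumptions the first coefficient recovers the benefit built into the data, while the second is close to zero even though estrogen is, by construction, beneficial.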

Added Sugars

The November 14-18, 1999, annual meeting of the North American Association for the Study of Obesity in Charleston, SC, USA, included some presentations discussing the role of added sugar in the diet.

In "Do Added Sugars Affect Overall Diet Quality?", R. Forshee and M. Storey developed a multiple regression model to predict the number of food group servings from the amount of added sugar in the diet. If added sugar was displacing important foods and nutrients from the diet, those eating more added sugar would be consuming less of these other important items. The models adjust for age, sex, fat, carbohydrates (less added sugar), protein, and alcohol. The investigators noted that the regression coefficient for added sugars, while statistically significant, was always quite small. They interpret this as saying those who eat more added sugar do not have appreciably different predicted numbers of servings of grains, vegetables, fruits, dairy, or lean meat.

The interpretation was correct in its way, but it's hard to imagine how the result could have been otherwise. The result was predetermined! By adding fat, carbohydrates (less added sugar), protein, and alcohol to their statistical model, the researchers were asking the question, "When you take a bunch of people eating the same amount of fat, carbohydrates (less added sugar), protein, and alcohol, does their consumption of specific food groups vary according to the amount of added sugar they eat?" It would be truly astounding if other foods could vary much when fat, carbohydrates (less added sugar), protein, and alcohol were held fixed. By adding all of the components of food to the model, the investigators were asking whether food groups varied with added sugar intake when food was held constant!
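The same point can be illustrated with simulated data. This is a rough sketch under invented assumptions, not the Forshee and Storey data: the serving counts, gram amounts, and displacement rates are all hypothetical, and only two food groups are modeled. Added sugar is made to displace dairy and grain, and the macronutrients are made to come almost entirely from those foods.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Assumed data-generating process (hypothetical numbers): added sugar
# really does displace dairy and grain servings.
sugar = rng.uniform(20.0, 120.0, size=n)                     # grams per day
dairy = 4.0 - 0.015 * sugar + rng.normal(0.0, 0.5, size=n)   # servings per day
grain = 8.0 - 0.020 * sugar + rng.normal(0.0, 1.0, size=n)   # servings per day

# Macronutrients (grams per day) are determined almost entirely by the foods eaten.
fat = 8.0 * dairy + 1.0 * grain + rng.normal(0.0, 1.0, size=n)
protein = 8.0 * dairy + 3.0 * grain + rng.normal(0.0, 1.0, size=n)
other_carbs = 12.0 * dairy + 15.0 * grain + rng.normal(0.0, 2.0, size=n)


def ols(y, *predictors):
    """Return least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Dairy on sugar alone: recovers the displacement built into the data.
unadjusted = ols(dairy, sugar)[1]

# Dairy on sugar with "food held constant": little is left for sugar to explain.
adjusted = ols(dairy, sugar, fat, protein, other_carbs)[1]

print(f"sugar coefficient, sugar only:           {unadjusted:.4f}")
print(f"sugar coefficient, macronutrients fixed: {adjusted:.4f}")
```

With the macronutrients in the equation, the sugar coefficient shrinks to a small fraction of the displacement that was deliberately built in, which is the sense in which the result is predetermined.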

They use their regression model, as though it were developed on longitudinal data, to predict the amount of added sugar it would take to reduce the number of predicted dairy servings by 1. They conclude, "Children would have to consume an additional 15 twelve-ounce cans of carbonated soft drinks to displace one serving of dairy foods." With a regression model in the background, it sounds very impressive, but the words make no sense. One doesn't have to be a nutritionist to know an additional 15 twelve-ounce cans of carbonated soft drinks will displace a lot more than one serving of dairy foods! As nonsensical as this claim appears, the report garnered a lot of publicity, as can be seen by using the terms "sugar" and "Forshee" in any Internet search engine.

A second presentation, "Energy Intake From Sugars and Fat In Relation to Obesity in U.S. Adults, NHANES III, 1988-94" by DR Keast, AJ Padgitt, and WO Song, shows how one's impression of the data can change with the particular model that is fitted.

Their Figure 8 showed those in the highest quarter of sugar intake are least likely to have deficient intakes of selected nutrients, but this is undoubtedly true because those who eat more added sugar are eating more of everything. The researchers also report, "When the data are presented as quartiles of percent kilocalories from total sugars, individuals in the highest quartile of total sugars are more likely to fall below 2/3 of the RDA for all nutrients listed except for vitamin C (Figure 9)." The only way sweeteners can be a greater percentage of one's diet is if other things are a lesser percentage. This leaves us with the question of the great American philosopher Johnny Cash, who asks, "What is truth?" Is either piece of information relevant to assessing the effect of added sugar on nutritional status? I could argue more strenuously that the calorie-adjusted values are more pertinent.
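Both patterns can come out of a single data set. The sketch below uses simulated intakes, not NHANES III: every number in it is hypothetical, and nutrients are assumed to come only from the non-sugar part of the diet. People who eat more are made to eat more of everything, and nutrient intake is then compared across quartiles of sugar calories and across quartiles of percent of calories from sugar.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Assumed data-generating process (hypothetical numbers): big eaters eat
# more of everything, sugar included.
appetite = rng.lognormal(0.0, 0.3, size=n)
other_kcal = 1600.0 * appetite * rng.lognormal(0.0, 0.2, size=n)
sugar_kcal = 300.0 * appetite * rng.lognormal(0.0, 0.2, size=n)

# Assumption: nutrients come only from the non-sugar part of the diet.
nutrient = 0.05 * other_kcal
percent_sugar = 100.0 * sugar_kcal / (sugar_kcal + other_kcal)


def bottom_and_top_quartile_means(grouping, outcome):
    """Mean outcome in the lowest and highest quarter of the grouping variable."""
    q1, q3 = np.quantile(grouping, [0.25, 0.75])
    return outcome[grouping <= q1].mean(), outcome[grouping >= q3].mean()


abs_low, abs_high = bottom_and_top_quartile_means(sugar_kcal, nutrient)
pct_low, pct_high = bottom_and_top_quartile_means(percent_sugar, nutrient)

print(f"quartiles of sugar kcal:    lowest {abs_low:.1f}, highest {abs_high:.1f}")
print(f"quartiles of percent sugar: lowest {pct_low:.1f}, highest {pct_high:.1f}")
```

Grouped by sugar calories, the highest quartile has the higher nutrient intake; grouped by percent of calories from sugar, it has the lower. Neither grouping changes the data, only the question being asked of them.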

Dietary Patterns and 20-Year Mortality

In "Dietary Pattern and 20 Year Mortality In Elderly Men In Finland, Italy, and the Netherlands: Longitudinal Cohort Study" (BMJ, 315 (1997), 13-17), Huijbregts, Feskens, Rasanen, Fidanza, Nissinen, Menotti, and Kromhout investigated whether healthy dietary patterns were inversely associated with mortality. The data were fitted by a survival model that included an indicator of a healthy diet among the predictors.

The researchers were faced with the thorny question of whether country should be included in the model. If country were included, the coefficient for diet would answer the question of whether diet was predictive of survival after accounting for the participants' country of residence. The authors argue against including country.

Since dietary patterns are highly determined by cultural influences (for example, the Mediterranean dietary pattern), we did not adjust for country in the pooled population analyses. Country has a strong cultural component which is responsible for (part of) the variation in dietary patterns. Adjustment for this variable would result in an overcorrection and hence an underestimation of the true association between the quality of the diet and mortality.

It is true that the effect of diet will be underestimated to the extent to which diet and culture are correlated and there are other things about culture that predict survival. However, it is equally true that if country is left out of the model the effect of diet will be overestimated to the extent to which diet and culture are correlated and things in the culture other than diet affect longevity! For this reason, the conservative approach is to fit all known or suspected predictors of longevity, including country, so that claims for the predictive capability of a healthful diet will be free of counterclaims that a healthful diet is a surrogate for something else. The conservative approach means that we often lack the power to separate out individual effects, which is what happened here. The authors continue

When the countries were analyzed separately, the associations between the healthy diet indicator and all cause mortality were essentially the same, although they no longer reached significance. This was due to a low statistical power resulting from the smaller numbers of subjects within a country.

When this happens, the investigators have no choice, in my opinion, but to design a better study. There are so many things in a culture other than diet that might influence survival that it seems unwise not to adjust for country. The authors are no doubt correct that "dietary patterns are highly determined by cultural influences", but this strikes me as an insufficient reason for allowing everything else associated with diet and survival to be attributed to diet. Good science is often expensive, inconvenient, and difficult.
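The risk of leaving country out can be sketched with simulated data. This is an illustration under invented assumptions, not the Huijbregts data: it uses three hypothetical countries, a made-up longevity outcome in years, and ordinary least squares rather than the survival model the authors fitted. Diet is given a true effect of one year, and each country is given its own non-diet contribution to longevity along with its own prevalence of healthy eating.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30_000

# Assumed data-generating process (hypothetical numbers): three countries that
# differ both in how common a healthy diet is and in non-diet cultural factors.
country = rng.integers(0, 3, size=n)
p_healthy = np.array([0.2, 0.5, 0.8])[country]
diet = (rng.random(n) < p_healthy).astype(float)
other_culture = np.array([0.0, 1.5, 3.0])[country]   # years of life, not due to diet

# True effect of a healthy diet: one extra year.
longevity = 75.0 + 1.0 * diet + other_culture + rng.normal(0.0, 5.0, size=n)


def ols(y, *predictors):
    """Return least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Country left out: diet takes credit for everything correlated with it.
unadjusted = ols(longevity, diet)[1]

# Country included as indicator variables (country 0 is the reference).
dummies = [(country == k).astype(float) for k in (1, 2)]
adjusted = ols(longevity, diet, *dummies)[1]

print(f"diet coefficient, country omitted:  {unadjusted:.2f}")
print(f"diet coefficient, country included: {adjusted:.2f}")
```

In this setup the unadjusted coefficient is roughly double the effect that was built in, because diet is standing in for everything else that differs between countries. A real study would pay for the adjustment with wider confidence intervals, which is the loss of power the authors describe.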

The article focuses on the beneficial effects of diet, with the possible effects of not adjusting for country relegated to the Discussion section. A Web search on "Huijbregts" and "dietary patterns" reveals the cautions were lost when the message was transmitted to the general public.