The Nonequivalent Groups Design

The Basic Design

The Nonequivalent Groups Design (hereafter NEGD) is probably the most frequently used design in social research. It is structured like a pretest-posttest randomized experiment, but it lacks the key feature of the randomized designs – random assignment. In the NEGD, we most often use intact groups that we think are similar as the treatment and control groups. In education, we might pick two comparable classrooms or schools. In community-based research, we might use two similar communities. We try to select groups that are as similar as possible so we can fairly compare the treated one with the comparison one. But we can never be sure the groups are comparable. Or, put another way, it’s unlikely that the two groups would be as similar as they would if we assigned them through a random lottery. Because it’s often likely that the groups are not equivalent, this design was named the nonequivalent groups design to remind us of that.

So, what does the term “nonequivalent” mean? In one sense, it just means that assignment to group was not random. In other words, the researcher did not control the assignment to groups through the mechanism of random assignment. As a result, the groups may be different prior to the study. That is, the NEGD is especially susceptible to the internal validity threat of selection. Any prior differences between the groups may affect the outcome of the study. Under the worst circumstances, this can lead us to conclude that our program didn’t make a difference when in fact it did, or that it did make a difference when in fact it didn’t.

The Bivariate Distribution

Let’s begin our exploration of the NEGD by looking at some hypothetical results. The first figure shows a bivariate distribution in the simple pre-post, two group study. The treated cases are indicated with Xs while the comparison cases are indicated with Os. A couple of things should be obvious from the graph. To begin, we don’t even need statistics to see that there is a whopping treatment effect (although statistics would help us estimate the size of that effect more precisely). The program cases (Xs) consistently score better on the posttest than the comparison cases (Os) do. If positive scores on the posttest are “better” then we can conclude that the program improved things. Second, in the NEGD the biggest threat to internal validity is selection – that the groups differed before the program. Does that appear to be the case here? Although it may be harder to see, the program group does appear to be a little further to the right (that is, higher on the pretest) on average. This suggests that the program group did have an initial advantage and that the positive results may be due in whole or in part to this initial difference.

We can see the initial difference, the selection bias, when we look at the next graph. It shows that the program group scored about five points higher than the comparison group on the pretest. The comparison group had a pretest average of about 50 while the program group averaged about 55. It also shows that the program group scored about fifteen points higher than the comparison group on the posttest. That is, the comparison group posttest score was again about 50, while this time the program group scored around 65. These observations suggest that there is a potential selection threat, although the initial five point difference doesn’t explain why we observe a fifteen point difference on the posttest. It may be that there is still a legitimate treatment effect here, even given the initial advantage of the program group.
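One simple way to make this arithmetic explicit is to compare the pre-to-post gain in each group. The sketch below is only an illustration using the approximate means quoted above; the `gain` helper and the choice of a gain-score comparison are assumptions made here, not an analysis this text prescribes.

```python
# Approximate group means taken from the hypothetical results described above.
means = {
    "program": {"pre": 55, "post": 65},
    "comparison": {"pre": 50, "post": 50},
}

def gain(group):
    """Pre-to-post change in a group's mean (hypothetical helper)."""
    return means[group]["post"] - means[group]["pre"]

pretest_gap = means["program"]["pre"] - means["comparison"]["pre"]
posttest_gap = means["program"]["post"] - means["comparison"]["post"]
# Difference in gains: the posttest gap left after removing the initial gap.
gain_difference = gain("program") - gain("comparison")

print(pretest_gap, posttest_gap, gain_difference)  # 5 15 10
```

The ten-point difference in gains is what remains once the initial five-point advantage is subtracted. Whether it reflects the program still depends on ruling out the selection threats.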

Possible Outcome #1

Let’s take a look at several different possible outcomes from a NEGD to see how they might be interpreted. The important point here is that each of these outcomes has a different storyline. Some are more susceptible to threats to internal validity than others. Before you read through each of the descriptions, take a good look at the graph and try to figure out how you would explain the results. If you were a critic, what kinds of problems would you be looking for? Then, read the synopsis and see if it agrees with my perception.

Sometimes it’s useful to look at the means for the two groups. The figure shows these means with the pre-post means of the program group joined with a blue line and the pre-post means of the comparison group joined with a green one. This first outcome shows the situation in the two bivariate plots above. Here, we can see much more clearly both the original pretest difference of five points, and the larger fifteen point posttest difference.

How might we interpret these results? To begin, you need to recall that with the NEGD we are usually most concerned about selection threats. Which selection threats might be operating here? The key to understanding this outcome is that the comparison group did not change between the pretest and the posttest. Therefore, it would be hard to argue that the outcome is due to a selection-maturation threat. Why? Remember that a selection-maturation threat means that the groups are maturing at different rates and that this creates the illusion of a program effect when there is not one. But because the comparison group didn’t mature (i.e., change) at all, it’s hard to argue that it was differential maturation that produced the outcome. What could have produced the outcome? A selection-history threat certainly seems plausible. Perhaps some event occurred (other than the program) that the program group reacted to and the comparison group didn’t. Or, maybe a local event occurred for the program group but not for the comparison group. Notice how much more likely it is that outcome pattern #1 is caused by such a history threat than by a maturation difference. What about the possibility of selection-regression? This one actually works a lot like the selection-maturation threat. If the jump in the program group is due to regression to the mean, it would have to be because the program group was below the overall population pretest average and, consequently, regressed upwards on the posttest. But if that’s true, it should be even more the case for the comparison group, who started with an even lower pretest average. The fact that they don’t appear to regress at all helps rule out the possibility that outcome #1 is the result of regression to the mean.

Possible Outcome #2

Our second hypothetical outcome presents a very different picture. Here, both the program and comparison groups gain from pre to post, with the program group gaining at a slightly faster rate. This is almost the definition of a selection-maturation threat. The fact that the two groups differed to begin with suggests that they may already be maturing at different rates. And the posttest scores don’t do anything to help rule that possibility out. This outcome might also arise from a selection-history threat. If the two groups, because of their initial differences, react differently to some historical event, we might obtain the outcome pattern shown. Both selection-testing and selection-instrumentation are also possibilities, depending on the nature of the measures used. This pattern could indicate a selection-mortality problem if there are more low-scoring program cases that drop out between testings. What about selection-regression? It doesn’t seem likely, for much the same reasoning as for outcome #1. If there was an upwards regression to the mean from pre to post, we would expect that regression to be greater for the comparison group because they have the lower pretest score.
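To see how differential maturation alone could produce this pattern, here is a minimal sketch in which neither group receives any program. The starting scores, the growth rates, the linear growth model, and the `score` helper are all hypothetical, chosen only for illustration.

```python
def score(start, rate, time):
    """Mean score for a group growing linearly at `rate` points per period."""
    return start + rate * time

# Hypothetical groups: the "program" group starts higher and matures faster.
# No treatment effect is included anywhere in this model.
groups = {
    "program": {"start": 55, "rate": 6},
    "comparison": {"start": 50, "rate": 4},
}

pre = {name: score(g["start"], g["rate"], time=0) for name, g in groups.items()}
post = {name: score(g["start"], g["rate"], time=1) for name, g in groups.items()}

pretest_gap = pre["program"] - pre["comparison"]
posttest_gap = post["program"] - post["comparison"]
print(pretest_gap, posttest_gap)  # 5 7
```

Both groups gain and the gap widens from pre to post, which is the shape of outcome #2, even though no program effect was built in.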

Possible Outcome #3

This third possible outcome cries out “selection-regression!” Or, at least it would if it could cry out. The regression scenario is that the program group was selected so that they were extremely high (relative to the population) on the pretest. The fact that they scored lower, approaching the comparison group on the posttest, may simply be due to their regressing toward the population mean. We might observe an outcome like this when we study the effects of giving a scholarship or an award for academic performance. We give the award because students did well (in this case, on the pretest). When we observe their posttest performance, relative to an “average” group of students, they appear to perform more poorly. Pure regression! Notice how this outcome doesn’t suggest a selection-maturation threat. What kind of maturation process would have to occur for the highly advantaged program group to decline while a comparison group evidences no change?
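The award scenario can be imitated with a small simulation. This is a sketch under stated assumptions, not a model taken from the text: each student is given a stable true score plus independent measurement noise on each test, and the population mean, spreads, sample size, and cutoff are all hypothetical. No program effect is included.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable

POPULATION_MEAN = 50    # hypothetical population average
N = 10_000              # hypothetical number of students

# Each student has a stable true score; each test adds independent noise.
true_scores = [rng.gauss(POPULATION_MEAN, 10) for _ in range(N)]
pretest = [t + rng.gauss(0, 10) for t in true_scores]
posttest = [t + rng.gauss(0, 10) for t in true_scores]

# "Award" group: chosen only because of a high pretest score.
CUTOFF = 70
selected = [i for i in range(N) if pretest[i] >= CUTOFF]

selected_pre_mean = mean(pretest[i] for i in selected)
selected_post_mean = mean(posttest[i] for i in selected)
overall_post_mean = mean(posttest)

print(round(selected_pre_mean, 1),
      round(selected_post_mean, 1),
      round(overall_post_mean, 1))
```

In a run of this sketch, the selected group's posttest mean falls below its own pretest mean while staying above the overall average. That is the decline toward the comparison group described for outcome #3, produced by selection on a noisy pretest alone.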

Possible Outcome #4

Our fourth possible outcome also suggests a selection-regression threat. Here, the program group is disadvantaged to begin with. The fact that they appear to pull closer to the comparison group on the posttest may be due to regression. This outcome pattern may be suspected in studies of compensatory programs – programs designed to help address some problem or deficiency. For instance, compensatory education programs are designed to help children who are doing poorly in some subject. They are likely to have lower pretest performance than more average comparison children. Consequently, they are likely to regress to the mean in much the pattern shown in outcome #4.

Possible Outcome #5

This last hypothetical outcome is sometimes referred to as a “cross-over” pattern. Here, the comparison group doesn’t appear to change from pre to post. But the program group does, starting out lower than the comparison group and ending up above them. This is the clearest pattern of evidence for the effectiveness of the program of all five of the hypothetical outcomes. It’s hard to come up with a threat to internal validity that would be plausible here. Certainly, there is no evidence for selection-maturation here unless you postulate that the two groups are involved in maturational processes that just tend to start and stop and just coincidentally you caught the program group maturing while the comparison group had gone dormant. But, if that were the case, why did the program group actually cross over the comparison group? Why didn’t they approach the comparison group and stop maturing? How likely is this outcome as a description of normal maturation? Not very. Similarly, this isn’t a selection-regression result. Regression might explain why a low scoring program group approaches the comparison group posttest score (as in outcome #4), but it doesn’t explain why they cross over.
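The claim that regression alone cannot produce a cross-over can be checked with a short sketch. It assumes the standard linear regression-to-the-mean prediction, in which a group's expected posttest mean keeps only a fraction of its pretest distance from the population mean, with that fraction given by the pre-post correlation. It also assumes the comparison group sits at the population mean. The function name and all numbers are hypothetical.

```python
def expected_posttest(pretest_mean, population_mean, correlation):
    """Expected posttest mean under regression alone (no program effect).

    Assumes pre and post are on the same scale with equal variances, so the
    group keeps `correlation` of its pretest distance from the population mean.
    """
    return population_mean + correlation * (pretest_mean - population_mean)

POPULATION_MEAN = 50   # assumed to be where the comparison group sits
program_pre = 40       # hypothetical low-scoring program group

for r in (0.9, 0.5, 0.0):
    post = expected_posttest(program_pre, POPULATION_MEAN, r)
    # Regression pulls the group up toward 50 but never past it.
    print(r, post, post <= POPULATION_MEAN)
```

For any correlation between 0 and 1, the expected posttest mean lies between the group's pretest mean and the population mean. Under these assumptions regression can close the gap, as in outcome #4, but cannot carry the program group above the comparison group.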

Although this fifth outcome is the strongest evidence for a program effect, you can’t very well construct your study expecting to find this kind of pattern. It would be a little bit like saying “let’s give our program to the toughest cases and see if we can improve them so much that they not only become like ‘average’ cases, but actually outperform them.” That’s an awfully big expectation to saddle any program with. Typically, you wouldn’t want to subject your program to that kind of expectation. But if you happen to find that kind of result, you really have a program effect that has beaten the odds.