Chapter 37, Section 4. Selecting an Appropriate Design for the Evaluation

When you hear the word “experiment,” it may call up pictures of people in long white lab coats peering through microscopes. In reality, an experiment is just trying something out to see how or why or whether it works. It can be as simple as putting a different spice in your favorite dish, or as complex as developing and testing a comprehensive effort to improve child health outcomes in a city or state.

Academics and other researchers in public health and the social sciences conduct experiments to understand how environments affect behavior and outcomes, so their experiments usually involve people and aspects of the environment. A new community program or intervention is an experiment, too, one that a governmental or community organization engages in to find a better way to address a community issue. It usually starts with an assumption about what will work – sometimes called a theory of change – but that assumption is no guarantee. Like any experiment, a program or intervention has to be evaluated to see whether it works and under what conditions.

In this section, we’ll look at some of the ways you might structure an evaluation to examine whether your program is working, and explore how to choose the one that best meets your needs. These arrangements for discovery are known as experimental (or evaluation) designs.

What do we mean by a design for the evaluation?

Every evaluation is essentially a research or discovery project. Your research may be about determining how effective your program or effort is overall, which parts of it are working well and which need adjusting, or whether some participants respond to certain methods or conditions differently from others. If your results are to be reliable, you have to give the evaluation a structure that will tell you what you want to know. That structure – the arrangement of discovery – is the evaluation’s design.

The design depends on what kinds of questions your evaluation is meant to answer.

Some of the most common evaluation (research) questions:

Does a particular program or intervention – whether an instructional or motivational program, improving access and opportunities, or a policy change – cause a particular change in participants’ or others’ behavior, in physical or social conditions, health or development outcomes, or other indicators of success?
What component(s) and element(s) of the program or intervention were responsible for the change?
What are the unintended effects of an intervention, and how did they influence the outcomes?
If you try a new method or activity, what happens?
Will the program that worked in another context, or the one that you read about in a professional journal, work in your community, or with your population, or with your issue?

If you want reliable answers to evaluation questions like these, you have to ask them in a way that will show you whether you actually got results, and whether those results were in fact due to your actions or the circumstances you created, or to other factors. In other words, you have to create a design for your research – or evaluation – to give you clear answers to your questions. We’ll discuss how to do that later in the section.

Why should you choose a design for your evaluation?

An evaluation may seem simple: if you can see progress toward your goal by the end of the evaluation period, you’re doing OK; if you can’t, you need to change. Unfortunately, it’s not that simple at all. First, how do you measure progress? Second, if there seems to be none, how do you know what you should change in order to increase your effectiveness? Third, if there is progress, how do you know it was caused by (or contributed to by) your program, and not by something else? And finally, even if you’re doing well, how will you decide what you could do better, and what elements of your program can be changed or eliminated without affecting success? A good design for your evaluation will help you answer important questions like these.

Some specific reasons for spending the time to design your evaluation carefully include:

So your evaluation will be reliable. A good design will give you accurate results. If you design your evaluation well, you can trust it to tell you whether you’re actually having an effect, and why. Understanding your program to this extent makes it easier to achieve and maintain success.
So you can pinpoint areas you need to work on, as well as those that are successful. A good design can help you understand exactly where the strong and weak points of your program or intervention are, and give you clues as to how they can be further strengthened or changed for the greatest impact.
So your results are credible. If your evaluation is designed properly, others will take your results seriously. If a well-designed evaluation shows that your program is effective, you’re much more likely to be able to convince others to use similar methods, and to convince funders that your organization is a good investment.
So you can identify factors unrelated to what you’re doing that have an effect – positive or negative – on your results and on the lives of participants. Participants’ histories, crucial local or national events, the passage of time, personal crises, and many other factors can influence the outcome of a program or intervention for better or worse. A good evaluation design can help you to identify these, and either correct for them if you can, or devise methods to deal with or incorporate them.
So you can identify unintended consequences (both positive and negative) and correct for them. A good design can show you all of what resulted from your program or intervention, not just what you expected. If you understand that your work has consequences that are negative as well as positive, or that it has more and/or different positive consequences than you anticipated, you can adjust accordingly.
So you’ll have a coherent plan and organizing structure for your evaluation. It will be much easier to conduct your evaluation if it has an appropriate design. You’ll know better what you need to do in order to get the information you need. Spending the time to choose and organize an evaluation design will pay off in the time you save later and in the quality of the information you get.

When should you choose a design for your evaluation?

Once you’ve determined your evaluation questions and gathered and organized all the information you can about the issue and ways to approach it, the next step is choosing a design for the evaluation. Ideally, this all takes place at the beginning of the process of putting together a program or intervention. Your evaluation should be an integral part of your program, and its planning should therefore be an integral part of the program planning.

That’s the ideal; now let’s talk about reality. If you’re reading this, the chances are probably at least 50-50 that you’re connected to an underfunded government agency or to a community-based or non-governmental organization, and that you’re planning an evaluation of a program or intervention that’s been running for some time – months or even years.

Even if that’s true, the same guidelines apply. Choose your questions, gather information, choose a design, and then go on through the steps presented in this chapter. Evaluation is important enough that you won’t really be accomplishing anything by taking shortcuts in planning it. If your program has a cycle, then it probably makes sense to start your evaluation at the beginning of it – the beginning of a year or a program phase, where all participants are starting from the same place, or from the beginning of their involvement.

If that’s not possible – if your program has a rolling admissions policy, or provides a service whenever people need it – and participants are all at different points, that can sometimes present research problems. You may want to evaluate the program’s effects only with new participants, or with another specific group. On the other hand, if your program operates without a particular beginning and end, you may get the best picture of its effectiveness by evaluating it as it is, starting whenever you’re ready. Whatever the case, your design should follow your information gathering and synthesis.

Who should be involved in choosing a design?

If you’re a regular Tool Box user, and particularly if you’ve been reading this chapter, you know that the Tool Box team generally recommends a participatory process – one that involves both research and community partners, including all those who have an interest in the program or are affected by it, in planning and implementation. Choosing a design for the evaluation is something of an exception to this policy, since scientific or evaluation partners may have a much clearer understanding of what is required to conduct research, and of the factors that may interfere with it.

As we’ll see in the “how-to” part of this section, there are a number of considerations that have to be taken into account to gain accurate information that actually tells you what you want to know. Graduate students generally take courses to gain the knowledge they need to conduct research well, and even some veteran researchers have difficulty setting up an appropriate research design. That doesn’t mean a community group can’t learn to do it, but rather that the time they would have to spend on acquiring background knowledge might be too great. Thus, it makes the most sense to assign this task (or at the very least its coordination) to an individual or small group with experience in research and evaluation design. Such a person can not only help you choose among possible designs, but explain what each design entails, in time, resources, and necessary skills, so that you can judge its appropriateness and feasibility for your context.

How do you choose a design for your evaluation?

How do you go about deciding what kind of research design will best serve the purposes of your evaluation?

The answer to that question involves an examination of four areas:

The nature of the research questions you are trying to answer
The challenges to the research, and the ways they can be resolved or reduced
The kinds of research designs that are generally used, and what each design entails
The possibility of adapting a particular research design to your program or situation – what the structure of your program will support, what participants will consent to, and what your resources and time constraints are

We’ll begin this part of the section with an examination of the concerns research designs should address, go on to considering some common designs and how well they address those concerns, and end with some guidelines for choosing a design that will both be possible to implement and give you the information you need about your program.

Note: in this part of the section, we’re looking at evaluation as a research project. As a result, we’ll use the term “research” in many places where we could just as easily have said, for the purposes of this section, “evaluation.” Research is more general, and some users of this section may be more concerned with research in general than evaluation in particular.

Concerns research designs should address

The most important consideration in designing a research project – except perhaps for the value of the research itself – is whether your arrangement will provide you with valid information. If you don’t design and set up your research project properly, your findings won’t give you information that is accurate and likely to hold true with other situations. In the case of an evaluation, that means that you won’t have a basis for adjusting what you do to strengthen and improve it.

Here’s a far-fetched example that illustrates this point. If you took children’s heights at age six, then fed them large amounts of a specific food for three years – say carrots – and measured them again at the end of the period, you’d probably find that most of them were considerably taller at nine years than at six. You might conclude that it was eating carrots that made the children taller because your research design gave you no basis for comparing these children’s growth to that of other children.
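To make the missing comparison concrete, here is a minimal Python sketch. The heights are invented for illustration only (they are not real growth data), and the function name average_gain is hypothetical. It shows how an apparent "carrot effect" can vanish once you subtract the growth of similar children who ate an ordinary diet.

```python
from statistics import mean

# Invented heights in centimeters, for illustration only.
carrot_age6 = [112, 115, 118, 120, 116]
carrot_age9 = [130, 133, 137, 138, 134]
other_age6 = [113, 114, 119, 121, 115]
other_age9 = [131, 132, 137, 140, 133]


def average_gain(before, after):
    """Average change from the first measurement to the second."""
    return mean(a - b for b, a in zip(before, after))


carrot_gain = average_gain(carrot_age6, carrot_age9)
other_gain = average_gain(other_age6, other_age9)

# The pre-post figure alone looks like a large effect of the carrots.
print(f"Gain, carrot group: {carrot_gain:.1f} cm")
# Subtracting the comparison group's gain removes ordinary growth.
print(f"Gain beyond comparison group: {carrot_gain - other_gain:.1f} cm")
```

In this made-up data set both groups grow by about the same amount, so the gain that can be credited to the carrots is close to zero, even though the carrot group on its own shows a large increase.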

There are two kinds of threats to the validity of a piece of research. They are usually referred to as threats to internal validity (whether the intervention produced the change) and threats to external validity (whether the results are likely to apply to other people and situations).

Threats to internal validity

These are threats to (or alternative explanations for) your claim that what you did caused changes in the direction you were aiming for. They are generally posed by factors operating at the same time as your program or intervention that might have an effect on the issue you’re trying to address. If you don’t have a way of separating their effects from those of your program, you can’t tell whether the observed changes were caused by your work, or by one or more of these other factors. They’re called threats to internal validity because they’re internal to the study: they have to do with whether your intervention, and not something else, accounted for the difference.

There are several kinds of threats to internal validity:

History. Both participants’ personal histories – their backgrounds, cultures, experiences, education, etc. – and external events that occur during the research period – a disaster, an election, conflict in the community, a new law – may influence whether or not there’s any change in the outcomes you’re concerned with.
Maturation. This refers to the natural physical, psychological, and social processes that take place as time goes by. The growth of the carrot-eating children in the example above is a result of maturation, for instance, as might be a decline in risky behavior as someone passed from adolescence to adulthood, the development of arthritis in older people, or participants becoming tired during learning activities towards the end of the day.
The effects of testing or observation on participants. The mere fact of a program’s existence, or of their taking part in it, may affect participants’ behavior or attitudes, as may the experience of being tested, videotaped, or otherwise observed or measured.
Changes in measurement. An instrument – a blood pressure cuff or a scale, for instance – can change over time, or different ones may not give the same results. By the same token, observers – those gathering information – may change their standards over time, or two or more observers may disagree on the observations.
Regression toward the mean. This is a statistical term that refers to the fact that, over time, the very high and very low scores on a measure (a test, for instance) often tend to drift back toward the average for the group. If you start a program with participants who, by definition, have very low or high levels of whatever you’re measuring – reading skill, exposure to domestic violence, particular behavior toward people of other races or backgrounds, etc. – their scores may end up closer to the average over the course of the evaluation period even without any program.
The selection of participants. Those who choose participants may slant their selection toward a particular group that is more or less likely to change than a cross-section of the population from which the group was selected. (A good example is that of employment training programs that get paid according to the number of people they place in jobs. They’re more likely to select participants who already have all or most of the skills they need to become employed, and neglect those who have fewer skills... and who therefore most need the service.) Selection can play a part when participants themselves choose to enroll in a program (self-selection), since those who decide to participate are probably already motivated to make changes. It may also be a matter of chance: members of a particular group may, simply by coincidence, share a characteristic that will set their results on your measures apart from the norm of the population you’re drawing from.

Selection can also be a problem when two groups being compared are chosen by different standards. We’ll discuss this further below when we deal with control or comparison groups.

The loss of data or participants. If too little information is collected about participants, or if too many drop out well before the research period is over, your results may be based on too little data to be reliable. This also arises when two groups are being compared. If their losses of data or participants are significantly different, comparing them may no longer give you valid information.
The nature of change. Often, change isn’t steady and even. It can involve leaps forward and leaps backward before it gets to a stable place – if it ever does. (Think of looking at the performance of a sports team halfway through the season. No matter what its record is at that moment, you won’t know how well it will finish until the season is over.) Your measurements may take place over too short a period or come at the wrong times to track the true course of the change or lack of change that’s occurring.
A combination of the effects of two or more of these. Two or more of these factors may combine to produce or prevent the changes your program aims to produce. A language-study curriculum that is tested only on students who already speak two or more languages runs into problems with both participants’ history – all the students have experience learning languages other than their own – and selection – you’ve chosen students who are very likely to be successful at language learning.
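Regression toward the mean is easier to believe once you have seen it happen. The following is a minimal simulation sketch in Python, under a simple assumed model that is ours rather than the Tool Box's: each observed score is a stable skill level plus random test-day noise. The numbers are generated, not real data. It "enrolls" the lowest scorers on a first test and then retests them with no program at all in between.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Assumed model: observed score = stable skill + test-day noise.
skills = [rng.gauss(50, 10) for _ in range(2000)]
first_test = [s + rng.gauss(0, 10) for s in skills]
second_test = [s + rng.gauss(0, 10) for s in skills]  # no program in between

# "Enroll" only the 200 people with the lowest first-test scores.
lowest = sorted(range(len(skills)), key=lambda i: first_test[i])[:200]

enrolled_first = mean(first_test[i] for i in lowest)
enrolled_second = mean(second_test[i] for i in lowest)
group_average = mean(first_test)

print(f"Whole group, first test:     {group_average:.1f}")
print(f"Enrolled group, first test:  {enrolled_first:.1f}")
print(f"Enrolled group, second test: {enrolled_second:.1f}")
```

The enrolled group's second score lands between its first score and the whole group's average, although nothing was done for them. A pre-post evaluation of a program that recruits extreme scorers could mistake this drift for a program effect.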

Threats to external validity

These are factors that affect your ability to apply your research results in other circumstances – to increase the chances that your program and its results can be reproduced elsewhere or with other populations. If, for instance, you offer parenting classes only to single mothers, you can’t assume, no matter how successful they appear to be, that the same classes will work as well with men.

Threats to external validity (or generalizability) may be the result of the interactions of other factors with the program or intervention itself, or may be due to particular conditions of the program.

Some examples:

Interaction of testing or data collection and the program or intervention. An initial test or observation might change the way participants react to the program, making a difference in final outcomes. Since you can’t assume that another group will have the same reaction or achieve similar final outcomes as a result, external validity or generalizability of the findings becomes questionable.
Interaction of selection procedures and the program or intervention. If the participants selected or self-selected are particularly sensitive to the methods or purpose of the program, it can’t be assumed to be effective with participants who are less sensitive or ready for the program.

Parents who’ve been threatened by the government with the loss of their children due to child abuse may be more receptive to learning techniques for improving their parenting, for example, than parents who are under no such pressure.

The effects of the research arrangements. Participants may change behavior as a result of being observed, or may react to particular individuals in ways they would be unlikely to react to others.

A classic example here is that of a famous baboon researcher, Irven DeVore, who, after years of observing troops of baboons, realized that they behaved differently when he was there than when he wasn’t. Although his intent was to observe their natural behavior, his presence itself constituted an intervention, making the behavior of the baboons he was observing different from that of a troop that was not observed.

The interference of multiple treatments or interventions. The effects of a particular program can be changed when participants are exposed to it beforehand in a different context, or are exposed to another before or at the same time as the one being evaluated. This may occur when participants are receiving services from different sources, or being treated simultaneously for two or more health issues or other conditions.

Given the range of community programs that exist, there are many possibilities here. Adults might be members of a high school completion class while participating in a substance use recovery program. A diabetic might be treated with a new drug while at the same time participating in a nutrition and physical activity program to deal with obesity. Sometimes, the sequence of treatments or services in a single program can have the same effect, with one influencing how participants respond to those that follow, even though each treatment is being evaluated separately.

Common research designs

Many books have been written on the subject of research design. While they contain too much material to summarize here, there are some basic designs that we can introduce. The important differences among them come down to how many measurements you’ll take, when you will take them, and how many groups of what kind will be involved.

Program evaluations generally look for the answers to three basic questions:

Was there any change – in participants’ or others’ behavior, in physical or social conditions, or in outcomes or indicators of success – during the evaluation period?
Was whatever change took place – or the lack of change – caused by your program, intervention, or effort?
What, in your program or outside it, actually caused or prevented the change?

As we’ve discussed, changes and improvement in outcomes may have been caused by some or all of your intervention, or by external factors. Participants’ or the community’s history might have been crucial. Participants may have changed as a result of simply getting older and more mature or more experienced in the world – often an issue when working with children or adolescents. Environmental factors – events, policy change, or conditions in participants’ lives – can often facilitate or prevent change as well. Understanding exactly where the change came from, or where the barriers to change reside, gives you the opportunity to adjust your program to take advantage of or combat those factors.

If all you had to do was to measure whatever behavior or condition you wanted to influence at the beginning and end of the evaluation, choosing a design would be an easy task. Unfortunately, it’s not quite that simple – there are those nasty threats to validity to worry about. We have to keep them in mind as we look at some common research designs.

Research designs, in general, differ in one or both of two ways: the number and timing of the measurements they use; and whether they look at single or multiple groups. We’ll look at single-group designs first, then go on to multiple groups.

Before we go any further, it is helpful to have an understanding of some basic research terms that we will be using in our discussion.

Researchers usually refer to your first measurement(s) or observation(s) – the ones you take before you start your program or intervention – as a baseline measure or baseline observation, because it establishes a baseline – a known level – to which you compare future measurements or observations.

Some other important research terms:

Independent variables are the program itself and/or the methods or conditions that the researcher – in this case, you – wants to evaluate. They’re called variables because they can change – you might have chosen (and might still choose) other methods. They’re independent because their existence doesn’t depend on whether something else occurs: you’ve chosen them, and they’ll stay consistent throughout the evaluation period.

Dependent variables are whatever may or may not change as a result of the presence of the independent variable(s). In an evaluation, your program or intervention is the independent variable. (If you’re evaluating a number of different methods or conditions, each of them is an independent variable.) Whatever you’re trying to change is the dependent variable. (If you’re aiming at change in more than one behavior or outcome, each type of change is a different dependent variable.) They’re called dependent variables because changes in them depend on the action of the independent variable...or something else.

Measures are just that – measurements of the dependent variables. They usually refer to procedures that have results that can be translated into numbers, and may take the form of community assessments, observations, surveys, interviews, or tests. They may also count incidents or measure the amount of the dependent variable (number or percentage of children who are overweight or obese, violent crimes per 100,000 population, etc.).

Observations might involve measurement, or they might simply record what happens in specific circumstances: the ways in which people use a space, the kinds of interactions children have in a classroom, the character of the interactions during an assessment. For convenience, researchers often use “observation” to refer to any kind of measurement and we’ll use the same convention here.

Pre- and post- single-group design

The simplest design is also probably the least accurate and desirable: the pre (before) and post (after) measurement or observation. This consists of simply measuring whatever you’re concerned with in one group – the infant mortality rate, unemployment, water pollution – applying your intervention to that group or community, and then observing again. This type of design assumes that a difference in the two observations will tell you whether there was a change over the period between them, and also assumes that any positive change was caused by the intervention.

In most cases, a pre-post design won’t tell you much, because it doesn’t really address any of the research concerns we’ve discussed. It doesn’t account for the influence of other factors on the dependent variable, and it doesn’t tell you anything about trends of change or the progress of change during the evaluation period – only where participants were at the beginning and where they were at the end. It can help you determine whether certain kinds of things have happened – whether there’s been a change in the level of educational attainment or in the amount of environmental pollution in a river, for instance – but it won’t tell you why. Despite its limitations, taking measures before and after the intervention is far better than taking no measures at all.

Even looking at something as seemingly simple to measure pre and post as blood pressure (in a heart disease prevention program) is questionable. Blood pressure may be lower at the final observation than at the initial one, but that tells you nothing about how much it may have gone up and down in between. If the readings were taken by different people, the change may be due in part to differences in their skill, or to how relaxed each was able to make participants feel. Familiarity with the program could also have reduced most participants’ blood pressure from the pre- to the post-measurement, as could some other factor that wasn’t specifically part of the independent variable being evaluated.

Interrupted time series design with a single group (simple time series)

An interrupted time series design uses repeated measures before and after delayed implementation of the independent variable (e.g., the program) to help rule out other explanations. This relatively strong design – with comparisons within the group – addresses most threats to internal validity.

The simplest form of this design is to take repeated observations, implement the program or intervention, and then observe a number of times during the evaluation period, including at the end. This method is a great improvement over the pre- and post- design in that it tracks the trend of change, and can therefore help you see whether it was actually the independent variable that caused any change. It can also help to identify the influence of external factors, for example when the dependent variable shows significant change before the intervention is implemented.
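One simple way to read a single-group time series is to ask what the baseline trend alone would have predicted, and then compare that with what you actually observed. The Python sketch below does this with an ordinary least-squares line. The monthly counts are invented for illustration, the helper name fit_line is hypothetical, and a real analysis would usually need more observations and a statistician's advice on how to model the series.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Invented monthly counts of some problem behavior, for illustration only.
baseline = [40, 39, 41, 38, 39, 37]   # months 0-5, before the program
followup = [30, 29, 27, 28, 26, 25]   # months 6-11, after the program starts

months = list(range(len(baseline)))
slope, intercept = fit_line(months, baseline)

# Project the baseline trend forward and compare it with what was observed.
gaps = []
for offset, observed in enumerate(followup):
    month = len(baseline) + offset
    expected = slope * month + intercept
    gaps.append(observed - expected)

print(f"Baseline trend: {slope:.2f} per month")
print(f"Average gap between observed and projected: {mean(gaps):.1f}")
```

Here the baseline was already drifting slightly downward, so part of any later decline would have happened anyway. The gap between the observed values and the projected trend is a fairer estimate of what changed when the program began than a simple before-and-after difference.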

Another possibility for this design is to implement more than one independent variable, either by trying two or more, one after another (often with a break in between), or by adding each to what came before. This not only gives a picture of the progress of change, but can also show very clearly what causes change. That gives an evaluator the opportunity not only to adjust the program, but to drop elements that have no effect.

There are a number of variations on the interrupted time series theme, including varying the observation times; implementing the independent variable repeatedly; and implementing one independent variable, then another, then both together to evaluate their interaction.

In any variety of interrupted time series design, it’s important to know what you’re looking for. In an evaluation of a traffic fatality control program in the United Kingdom that focused on reducing drunk driving, monthly measurements seemed to show only a small decline in fatal accidents. When the statistics for weekends, when there were most likely to be drunk drivers on the road, were separated out, however, they showed that the weekend fatality rate dropped sharply with the implementation of the program, and stayed low thereafter. Had the researchers not realized that that might be the case, the program might have been stopped, and the weekend accident rate would not have been reduced.
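The same point can be sketched in a few lines of Python. The counts below are invented (they are not the United Kingdom figures), and the function names are hypothetical. The sketch compares the change in the overall series with the change in the weekday and weekend series taken separately.

```python
from statistics import mean

# Invented monthly fatal-crash counts (not the actual UK figures).
# Each tuple is (weekday count, weekend count) for one month.
before = [(60, 30), (62, 31), (59, 29), (61, 30)]
after = [(61, 18), (60, 19), (62, 17), (59, 18)]


def percent_change(old, new):
    """Change from old to new as a percentage of old."""
    return 100 * (new - old) / old


def averages(months):
    """Average total, weekday, and weekend counts across months."""
    total = mean(wd + we for wd, we in months)
    weekday = mean(wd for wd, _ in months)
    weekend = mean(we for _, we in months)
    return total, weekday, weekend


labels = ("All days", "Weekdays", "Weekends")
changes = {
    label: percent_change(old, new)
    for label, old, new in zip(labels, averages(before), averages(after))
}
for label, change in changes.items():
    print(f"{label}: {change:+.0f}%")
```

In this made-up example the overall decline is modest because the unchanged weekday counts dilute it, while the weekend series on its own shows a much sharper drop. Deciding in advance which subgroups or time periods your program should affect tells you which series to separate out.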

Interrupted time series design with multiple groups (multiple baseline/time series)

This has the same possibilities as the single time series design, with the added wrinkle of using repeated measures with one or more other groups (so-called multiple baselines). By using multiple baselines (groups), the external validity or generalizability of the findings is enhanced – we can see if the effects occur with different groups or under different conditions.

This multiple time series design – typically staggered introduction of the intervention with different groups or communities – gives the researcher more opportunities:

You can try a method or program with two or more groups from the same population
You can try a particular method or program with different populations, to see if it’s effective with others
You can vary the timing or intensity of an intervention with different groups
You can test different interventions at the same time
You can try the same two or more interventions with each of two groups, but reverse their order to see if the sequence makes any difference

Again, there are more variations possible here.
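A staggered introduction can be sketched as follows. The group names, start weeks, and weekly scores in this Python example are all invented for illustration. For each group, the sketch compares the average before that group's own start week with the average afterward.

```python
from statistics import mean

# Invented weekly observations for three groups, for illustration only.
# The program starts in a different week for each group (staggered start).
groups = {
    "Neighborhood A": {"start": 4, "scores": [20, 21, 19, 20, 28, 29, 30, 29, 31, 30]},
    "Neighborhood B": {"start": 6, "scores": [18, 19, 18, 19, 18, 19, 26, 27, 28, 27]},
    "Neighborhood C": {"start": 8, "scores": [22, 21, 22, 23, 22, 21, 22, 23, 30, 31]},
}

shifts = {}
for name, info in groups.items():
    before = info["scores"][: info["start"]]   # weeks before this group's start
    after = info["scores"][info["start"]:]     # weeks from the start onward
    shifts[name] = mean(after) - mean(before)
    print(f"{name}: starts week {info['start']}, shift {shifts[name]:+.1f}")
```

In this invented data, each group's scores stay flat until its own start week and rise only afterward. When the change follows the intervention from group to group in this way, an outside event that affected everyone at the same moment becomes a less plausible explanation.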

Control group design

A common way to evaluate the effects of an independent variable is to use a control group. This group is usually similar to the participant group, but either receives no intervention at all, or receives a different intervention with the same goal as that offered to the participant group. A control group design is usually the most difficult to set up – you have to find appropriate groups, observe both on a regular basis, etc. – but is generally considered to be the most reliable.

The term control group comes from the attempt to control outside and other influences on the dependent variable. If everything about the two groups except their exposure to the program being evaluated averages out to be the same, then any differences in results must be due to that exposure. The term comparison group is more modest; it typically refers to a group or community matched for similar levels of the problem/goal and for relevant characteristics of the community or population (e.g., education, poverty).

The gold standard here is the randomized control group, one that is selected totally at random, either from among the population the program or intervention is concerned with – those at risk for heart disease, unemployed males, young parents – or, if appropriate, the population at large. A randomly selected group largely eliminates the systematic problems of selection we discussed above, as well as issues that might arise from differences in culture, race, or other factors, although chance differences between groups are still possible, especially when the groups are small.

A control group that’s carefully chosen will have the same characteristics as the intervention group (the focus of the evaluation). If, for instance, the two groups come from the same pool of people with a particular health condition, and are chosen at random either to be treated in the conventional way or to try a new approach, it can be assumed that – since they were chosen at random from the same population – both groups will be subject, on average, to the same outside influences, and will have the same diversity of backgrounds. Thus, if there is a significant difference in their results, it is fairly safe to assume that the difference comes from the independent variable (the type of intervention), and not from something else.
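
One simple way to judge whether a difference between two randomly assigned groups is larger than chance alone would produce is a permutation test. The sketch below is only an illustration: the scores are invented, the function name is hypothetical, and a statistician might well recommend a different test for your data.

```python
import random
from statistics import mean


def permutation_test(intervention, control, n_shuffles=5000, seed=0):
    """Estimate how often random relabeling of participants produces a
    difference in group means at least as large as the one observed."""
    rng = random.Random(seed)
    observed = mean(intervention) - mean(control)
    pooled = list(intervention) + list(control)
    n = len(intervention)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend group labels were handed out differently
        diff = mean(pooled[:n]) - mean(pooled[n:])
        if abs(diff) >= abs(observed):
            at_least_as_large += 1
    return observed, at_least_as_large / n_shuffles


# Invented outcome scores for two small groups.
new_approach = [12, 15, 14, 16, 13, 17]
conventional = [8, 9, 7, 10, 9, 8]
difference, p_value = permutation_test(new_approach, conventional)
print("difference in means:", difference)
print("share of shuffles with a difference this large:", p_value)
```

A small share suggests the difference is unlikely to be an accident of who landed in which group. It says nothing about problems in how the groups were formed or measured.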

The difficulty for governmental and community-based organizations is to find or create a randomized control group. If the program has a long waiting list, it may be able to create a control by selecting those to first receive the intervention at random. That in itself creates problems, in that people often drop off waiting lists out of frustration or other reasons. Being included in the evaluation may help to keep them, on the other hand, by giving them a closer connection to the program and making them feel valued.
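
A waiting-list lottery of this kind can be sketched in a few lines. The names, the number of openings, and the function itself are hypothetical. Fixing the seed is an assumption made so that the draw can be repeated and checked later.

```python
import random


def assign_from_waiting_list(waiting_list, n_openings, seed=None):
    """Randomly pick who enters the program now; everyone else stays on
    the waiting list and can serve as the control group."""
    rng = random.Random(seed)
    shuffled = list(waiting_list)  # copy so the original list is untouched
    rng.shuffle(shuffled)
    return shuffled[:n_openings], shuffled[n_openings:]


# Hypothetical waiting list of six people and three openings.
waiting = ["Ana", "Binh", "Carlos", "Dee", "Emeka", "Farah"]
intervention_group, control_group = assign_from_waiting_list(waiting, 3, seed=42)
print("Start the program now:", intervention_group)
print("Remain on the waiting list:", control_group)
```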

An ESOL (English as a Second or Other Language) program in Boston with a three-year waiting list addressed the problem by offering those on the waiting list a different option. They received videotapes to use at home, along with biweekly tutoring by advanced students and graduates of the program. Thus, they became a comparison group with a somewhat different intervention that, as expected, was less effective than the program itself, but was more effective than none, and kept them on the waiting list. It also gave them a head start once they got into the classes, with many starting at a middle rather than at a beginning level.

When there’s no waiting list or similar group to draw from, community organizations often end up using a comparison group – one composed of participants in another place or program and whose members’ characteristics, backgrounds, and experience may or may not be similar to those of the participant group. That circumstance can raise some of the same problems related to selection seen when there is no control group. If the only potential comparisons involve very different groups, it may be better to use a design, such as an interrupted time series design, that doesn’t involve a control group at all, where the comparison is within (not between) groups.

Groups may look similar, but may differ in an important way. Two groups of participants in a substance use intervention program, for instance, may have similar histories, but if one program is voluntary and the other is not, the results aren’t likely to be comparable. One group will probably be more motivated and less resentful than the other, and composed of people who already know they have a potential problem. The motivation and determination of their participants, rather than the effectiveness of the two programs, may influence the amount of change observed.

This issue may come up in a single-group design as well. A program that may, on average, seem to be relatively ineffective may prove, on close inspection, to be quite effective with certain participants – those of a specific educational background, for instance, or with particular life experiences. Looking at results with this in mind can be an important part of an evaluation, and give you valuable and usable information.
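
Looking at results by subgroup can be as simple as averaging the change for each group of participants separately. The records below are invented, and the subgroup labels are only placeholders for whatever characteristics matter in your program.

```python
from collections import defaultdict
from statistics import mean


def mean_change_by_subgroup(records):
    """records: (subgroup, score_before, score_after) for each participant.
    Returns the average change within each subgroup."""
    changes = defaultdict(list)
    for subgroup, before, after in records:
        changes[subgroup].append(after - before)
    return {subgroup: mean(values) for subgroup, values in changes.items()}


# Invented records: overall change looks modest, but one subgroup gains a lot.
records = [
    ("finished high school", 40, 52),
    ("finished high school", 38, 48),
    ("did not finish high school", 41, 42),
    ("did not finish high school", 39, 38),
]
print(mean_change_by_subgroup(records))
```

With very small subgroups, differences like these can easily be due to chance, so treat them as leads to follow up rather than as conclusions.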

Choosing a design

This section’s discussion of research designs is in no way complete. It’s meant to provide an introduction to what’s available. There are literally thousands of books and articles written on this topic, and you’ll probably want more information. There are a number of statistical methods, for instance, that can compensate for less-than-perfect designs: few community groups have the resources to assemble a randomized control group, or to implement two or more similar programs to see which works better.

Given this, the material that follows is meant only as broad guidelines. We don’t attempt to be specific about what kind of design you need in what circumstances, but only try to suggest some things to think about in different situations. Help is available from a number of directions: Much can be found on the Internet (see the “Resources” part of this section for a few sites); there are numerous books and articles (the classic text on research design is also cited in “Resources”); and universities are a great resource, both through their libraries and through faculty and graduate students who might be interested in what you’re doing, and be willing to help with your evaluation. Use any and all of these to find what will work best for you. Funders may also be willing either to provide technical assistance for evaluations, or to include money in your grant or contract specifically to pay for a professional evaluation.

Your goal in evaluating your effort is to get the most reliable and accurate information possible, given your evaluation questions, the nature of your program, what your participants will consent to, your time constraints, and your resources. The important thing here is not to set up a perfect research study, but to design your evaluation to get real information, and to be able to separate the effects of external factors from the effects of your program. So how do you go about choosing the best design that will be workable for you? The steps are in the first sentence of this paragraph.

Consider your evaluation questions

What do you need to know? If the intent of your evaluation is simply to see whether something specific happened, it’s possible that a simple pre-post design will do. If, as is more likely, you want to know both whether change has occurred, and if it has, whether it has in fact been caused by your program, you’ll need a design that helps to screen out the effects of external influences and participants’ backgrounds.

For many community programs, a control or comparison group is helpful, but not absolutely necessary. Think carefully about the frequency and timing of your observations and the amount and kinds of information you can collect. With repeated measures, you can get quite an accurate picture of the effectiveness of your program from a simple time series design. Single group interrupted time series designs, which are often the most workable for small organizations, can give you a very reliable evaluation if they’re structured well. That generally means obtaining multiple baseline observations (enough to set a trend) before the program begins; observing often and documenting your observations carefully (often with both quantitative – expressed in numbers – and qualitative – expressed in records of incidents and of what participants did and said – data); and including during-intervention and follow-up observations to see whether effects are maintained.
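
To show why several baseline observations matter, the sketch below fits a straight-line trend to the baseline and then asks how far later observations fall from where that trend would have put them. The weekly counts are invented, and a straight line is an assumption; real data may call for a more careful model.

```python
from statistics import mean


def linear_trend(values):
    """Least-squares slope and intercept for values observed at times 0, 1, 2, ..."""
    xs = range(len(values))
    x_bar, y_bar = mean(xs), mean(values)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values)) / sxx
    return slope, y_bar - slope * x_bar


def departures_from_baseline(baseline, later):
    """Difference between each later observation and the value the
    baseline trend would have predicted for that time point."""
    slope, intercept = linear_trend(baseline)
    start = len(baseline)
    return [
        observed - (intercept + slope * (start + i))
        for i, observed in enumerate(later)
    ]


# Invented weekly counts of a problem behavior.
baseline_weeks = [20, 21, 22, 23, 24]  # rising by about one per week before the program
program_weeks = [22, 21, 20]           # observed after the program begins
print(departures_from_baseline(baseline_weeks, program_weeks))
```

Here the counts during the program look only slightly lower than the last baseline week, but they fall well below where the rising baseline trend was heading. A single pre-program observation would have hidden that.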

In many of these situations, a multiple-group interrupted time series design is quite possible, in the form of a “naturally occurring” experiment. If your program includes two or more groups or classes, each working toward the same goals, you have the opportunity to stagger the introduction of the intervention across the groups. This comparison within (and across) groups allows you to screen out such factors as the facilitator’s ability and community influences (assuming all participants come from the same general population). You could also try different methods or time sequences, to see which works best.

In some cases, the real question is not whether your method or program works, but whether it works better than other methods or programs you could be using. Teaching a skill – for instance, employment training, parenting, diabetes management, conflict resolution – often falls into this category. Here, you need a comparison of some sort. While evaluations of some of these – medical treatment, for example – may require a control group, others can be compared to data from the field, to published results of other programs, or, by using community-level indicators, to measurements in other communities.

There are community programs where the bottom line is very simple. If you’re working to control water pollution, your main concern may be the amount of pollution coming out of effluent pipes, or the amount found in the river. Your only measure of success may be keeping pollution below a certain level, which means that regular monitoring of water quality is the only evaluation you need. There are probably relatively few community programs where evaluation is this easy – you might, for instance, want to know which of your pollution-control activities is most effective – but if yours is one, a simple design may be all you need.

Consider the nature of your program

What does your program look like, and what is it meant to do? Does it work with participants in groups, or individually, for instance? Does it run in cycles – classes or workshops that begin and end on certain dates, or a time-limited program that participants go through only once? Or can participants enter whenever they are ready and stay until they reach their goals? How much of the work of the program is dependent on staff, and how much do participants do on their own? How important is the program context – the way staff, participants, and others treat one another, the general philosophy of the program, the physical setting, the organizational culture? (The culture of an organization consists of accepted and traditional ways of doing things, patterns of relationships, how people dress, how they act toward and communicate with one another, etc.)

If you work with participants in groups, a multiple-group design – either interrupted time series or control group – might be easier to use. If you work with participants individually, perhaps a simple time series or a single group design would be appropriate.
If your program is time-limited – either one-time-only, or with sessions that follow one another – you’ll want a design that fits into the schedule, and that can give you reliable results in the time you have. One possibility is to use a multiple group design, with groups following one another session by session. The program for each group might be adjusted, based on the results for the group before, so that you could test new ideas each session.
If your program has no clear beginning and end, you’re more likely to need a single group design that considers participants individually, or by the level of their baseline performance. You may also have to compensate for the fact that participants may be entering the program at different levels, or with different goals.

A proverb says that you never step in the same river twice, because the water that flows past a fixed point is always changing. The same is true of most community programs. Someone coming into a program at a particular time may have a totally different experience than a similar person entering at a different time, even though the operation of the program is the same for both. A particular participant may encourage everyone around her, and create an overwhelmingly positive atmosphere different from that experienced by participants who enter the program after she has left, for example. It’s very difficult to control for this kind of difference over time, but it’s important to be aware that it can, and often does, exist, and may affect the results of a program evaluation.

If the organizational or program context and culture are important, then you’ll probably want to compare your results with participants to those in a control group in a similar situation where those factors are different, or are ignored.

There is, of course, a huge range of possibilities here: nearly any design can be adapted to nearly any situation in the right circumstances. This material is meant only to give you a sense of how to start thinking about the issue of design for an evaluation.

Consider what your participants (and staff) will consent to

In addition to the effect that frequent observation might have on the results of your evaluation, you might find that it raises protests from participants who feel their privacy is threatened, or from already-overworked staff members who see adding evaluation to their job as just another burden. You may be able to overcome these obstacles, or you may have to compromise – fewer or different kinds of observations, a less intrusive design – in order to be able to conduct the evaluation at all.

There are other reasons that participants might object to observation, or at least intense observation. Potential for embarrassment, a desire for secrecy (to keep their participation in the program from family members or others), even self-protection (in the case of domestic violence, for instance) can contribute to unwillingness to be a participant in the evaluation. Staff members may have some of the same concerns.

There are ways to deal with these issues, but there’s no guarantee that they’ll work. One is to inform participants at the beginning about exactly what you’re hoping to do, listen to their objections, and meet with them (more than once, if necessary) to come up with a satisfactory approach. Staff members are less likely to complain if they’re involved in planning the evaluation, and thus have some say over the frequency and nature of observations. The same is true for participants. Treating everyone’s concerns seriously and including them in the planning process can go a long way toward assuring cooperation.

Consider your time constraints

As we mentioned above, the important thing here is to choose a design that will give you reasonably reliable information. In general, your design doesn’t have to be perfect, but it does have to be good enough to give you a reasonably good indication that changes are actually taking place, and that they are the result of your program. Just how precise you can be is at least partially controlled by the limits on your time placed by funding, program considerations, and other factors.

Time constraints may also be imposed. Some of the most common:

Program structure. An evaluation may make the most sense if it’s conducted to correspond with a regular program cycle.
Funding. If you are funded only for a pilot project, for example, you’ll have to conduct your evaluation within the time span of the funding, and soon enough to show that your program is successful enough to be funded again. A time schedule for evaluation may be part of your grant or contract, especially if the funder is paying for it.
Participants’ schedules. A rural education program may need to stop for several months a year to allow participants to plant and tend crops, for instance.
The seriousness of the issue. A delay in understanding whether a violence prevention program is effective may cost lives.
The availability of professional evaluators. Perhaps the evaluation team can only work during a particular time frame.

Consider your resources

Strategic planners often advise that groups and organizations consider resources last: otherwise they’ll reject many good ideas because they’re too expensive or difficult, rather than trying to find ways to make them work with the resources at hand. Resources include not only money, but also space, materials and equipment, personnel, and skills and expertise. Often, one of these can substitute for another: a staff person with experience in research can take the place of money that would be used to pay a consultant, for example. A partnership with a nearby university could get you not only expertise, but perhaps needed equipment as well.

The lesson here is to begin by determining the best design possible for your purposes, without regard to resources. You may have to settle for somewhat less, but if you start by aiming for what you want, you’re likely to get a lot closer to it than if you assume you can’t possibly get it.

In Summary

The way you design your evaluation research will have a lot to do with how accurate and reliable your results are, and how well you can use them to improve your program or intervention. The design should be one that best addresses key threats to internal validity (whether the intervention caused the change) and external validity (the ability to generalize your results to other situations, communities, and populations).

Common research designs – such as interrupted time series or control group designs – can be adapted to various situations, and combined in various ways to create a design that is both appropriate and feasible for your program. It may be necessary to seek help from a consultant, a university partner, or simply someone with research experience to help identify a design that fits your needs.

A good design will address your evaluation questions, and take into consideration the nature of your program, what program participants and staff will agree to, your time constraints, and the resources you have available for evaluation. It often makes sense to consider resources last, so that you won’t reject good ideas because they seem too expensive or difficult. Once you’ve chosen a design, you can often find a way around a lack of resources to make it a reality.

Contributor

Stephen B. Fawcett

Phil Rabinowitz

Resources

Online Resources

Bridging the Gap: The role of monitoring and evaluation in Evidence-based policy-making is a document provided by UNICEF that aims to improve relevance, efficiency and effectiveness of policy reforms by enhancing the use of monitoring and evaluation.

Effective Nonprofit Evaluation is a briefing paper written for TCC Group. Pages 7 and 8 give specific information related to designing an effective evaluation.

From the Introduction to Program Evaluation for Public Health Programs, this resource from CDC on Focus the Evaluation Design offers suggestions for tailoring questions to evaluate the efficiency, cost-effectiveness, and attribution of a program. This guide offers a variety of program evaluation-related information.

Chapter 3 of the GAO Designing Evaluations handbook focuses on the process of selecting an evaluation design. This handbook provided by the U.S. Government Accountability Office provides information on various topics related to program evaluation.

Interrupted Time Series Quasi-Experiments is an essay by Gene Glass, from Arizona State University, on time series experiments, distinction between experimental and quasi-experimental approaches, etc.

The Magenta Book - Guidance for Evaluation provides an in-depth look at evaluation. Part A is designed for policy makers. It sets out what evaluation is, and what the benefits of good evaluation are. It explains in simple terms the requirements for good evaluation, and some straightforward steps that policy makers can take to make a good evaluation of their intervention more feasible. Part B is more technical, and is aimed at analysts and interested policy makers. It discusses in more detail the key steps to follow when planning and undertaking an evaluation and how to answer evaluation research questions using different evaluation research designs. It also discusses approaches to the interpretation and assimilation of evaluation evidence.

Practical Challenges of Rigorous Impact Evaluation in International Governance NGOs: Experiences and Lessons from The Asia Foundation explores program evaluation at the international level.

Research Design Issues for Evaluating Complex Multicomponent Interventions in Neighborhoods and Communities is from the Promise Neighborhoods Research Consortium. The article discusses challenges and offers approaches to evaluation that are likely to result in adoption and maintenance of effective and replicable multicomponent interventions in high-poverty neighborhoods.

Research Methods is a text by Dr. Christopher L. Heffner that focuses on the basics of research design and the critical analysis of professional research in the social sciences from developing a theory, selecting subjects, and testing subjects to performing statistical analysis and writing the research report.

Research Methods Knowledge Base is a comprehensive web-based textbook that provides useful, comprehensive, relatively simple explanations of how statistics work and how and when specific statistical operations are used and help to interpret data.

A Second Look at Research in Natural Settings is a web-version of a PowerPoint presentation by Graziano and Raulin.

The W.K. Kellogg Foundation Evaluation Handbook provides a framework for thinking about evaluation as a relevant and useful program tool. Chapters 5, 6, and 7 under the “Implementation” heading provide detailed information on determining data collection methods, collecting data, and analyzing and interpreting data.

Print Resources

Campbell, D., & Stanley, J. (1963, 1966). Experimental and Quasi-Experimental Designs for Research. Chicago: Rand McNally.

Fawcett, S., et al. (2008). Community Toolbox Curriculum Module 12: Evaluating the initiative. Work Group for Community Health and Development. University of Kansas. Community Tool Box Curriculum.

Roscoe, J. (1969). Fundamental Research Statistics for the Behavioral Sciences. New York, NY: Holt, Rinehart and Winston.

Shadish, W., Cook, T., & Campbell, D. (2002). Experimental and Quasi-experimental Designs for Generalized Causal Inference. Boston: Houghton Mifflin.