
Assessment Handbook: A guide for developing assessment programs in Illinois schools

1995 Edition

ILLINOIS STATE BOARD OF EDUCATION

School and Student Assessment Section
100 North First Street
Springfield, Illinois 62777-0001

Louis Mervis, Chairperson
Joseph Spagnolo, State Superintendent


Table of Contents


CHAPTER 1: LOCAL ASSESSMENT SYSTEMS

  Overview of Assessment
  Developing Local Assessment Systems

CHAPTER 2: THE QUALITY OF ASSESSMENT

  Ensuring the Quality of Assessment
  Other Important Topics

CHAPTER 3: SELECTION AND DEVELOPMENT OF ASSESSMENT PROCEDURES

  Selecting Assessment Procedures
  Developing Performance-Based Assessment Procedures

CHAPTER 4: INTERPRETATION, USE, AND REPORTING OF ASSESSMENT RESULTS

  Reporting Results

REFERENCES


Chapter 1

Local Assessment Systems

This handbook is intended to help schools and school districts develop and implement assessment programs that produce high-quality, useful information. Assessment is a critical element of the Illinois school improvement process.

Overview of Assessment

What is assessment?

Assessment is the major focus here, although testing will also be discussed. The terms test and assessment are frequently used interchangeably.

Technically, however, there are important differences between them. A third term, evaluation, is sometimes used interchangeably also but is actually distinct. To avoid confusion, the three terms are defined below.

Test, the narrowest of the terms, usually refers to a specific set of questions or tasks that is administered to an individual or to all members of a group and measures a sample of behavior. It is highly structured and can be administered and scored consistently within and across groups of students, thus making it highly reliable. It requires a relatively limited period of time to administer.

Assessment is more encompassing and includes the collection of information from multiple sources. A test is one kind of assessment. Assessment may also include rating scales, observation of student performance, portfolios, individual interviews, and other procedures. Assessment may refer to groups or individuals. Group assessment may involve administering different performance tasks or subsets of items to different samples of students and reporting the results for groups but not for individuals. In addition, assessment often refers to a planned program or system.

Evaluation refers to making a value judgment about the implications of assessment information. This process is necessary for school improvement planning. While assessment involves obtaining achievement data through a variety of means, evaluation goes a step further - interpreting the data from an informed perspective. That perspective should also be informed by knowledge of factors such as instructional content, community context, school climate, and dropout rate. Although this Handbook includes some material about interpreting assessment results, for the most part it does not address evaluation.

In summary, testing provides one isolated glimpse -- analogous to taking a picture with a camera -- of student achievement (individual or group) in specific skills or knowledge at a specific time. Assessment provides more comprehensive data from multiple measures administered over a period of time or, preferably, from a variety of data-gathering approaches. Evaluation produces value judgments about the results of assessment.

Testing, assessment, and evaluation are strongly interdependent; the quality of one affects the quality of the others. Good tests strengthen assessment; well-planned assessment increases the probability of valid and accurate evaluation.

What is a comprehensive assessment system?

A comprehensive assessment system is a coordinated plan for periodically monitoring the progress of students at multiple grade levels in a variety of subjects. It specifies the procedures that will be used for assessment; indicates when and how those procedures will be administered; and describes plans for processing, interpreting, and using the resulting information. It takes information collected at various levels--classroom, school, district, and state--into consideration. A comprehensive assessment system includes:

Assessment types

There are numerous ways to categorize the many different types of assessment. The previous version of this Handbook used a different classification scheme. This version classifies assessment into two types and discusses three sources of assessment.

The two types of assessment are:

  1. forced-choice assessment and
  2. performance-based assessment.

The three sources of assessment are:

  1. commercial publishers,
  2. the local school or district, and
  3. other educational organizations.

The major types and sources of assessment, along with the major advantages and limitations of each, are discussed below.

Forced-choice assessment

Forced-choice assessment (referred to in some publications as "selected-response") requires students to select the correct response from two or more alternatives (e.g., multiple-choice, true-false, or matching test items) or supply a word, a phrase, or several sentences to answer a question or complete a statement. This approach is sometimes described as "traditional," "standardized," or "objective." (However, these descriptions apply to all assessment procedures--performance-based and forced-choice--which will be aggregated or used for comparison. Both types of assessment are traditional in the sense that they have long histories. Both should be standardized in the sense that the same procedures are used with all students and are administered and scored uniformly. Both should be objective in the sense that they are analyzed systematically and results can be verified.) Most forced-choice assessments are paper-and-pencil tests. The "correct" responses are seldom debatable. Students' scores are likely to be the same regardless of who scores the test.

The following types of assessment are considered forced-choice:

  1. multiple-choice items,
  2. true-false items,
  3. matching items, and
  4. completion (fill-in-the-blank or short-answer) items.

An advantage of forced-choice assessments is their efficiency in collecting information about student achievement. Questions are usually fairly brief. To respond, students fill in a bubble, check or circle the correct answer, or write a brief response. Thus, students can answer a large number of questions that cover a rather broad content area in a relatively short period of time. Responses can often be machine scored.

A limitation of forced-choice assessment is that it cannot be used to validly assess many skills which require students to actually perform (conduct a scientific experiment, run a 100-yard dash) or to make a product (write an essay, create a painting). Another limitation is that students can sometimes guess the correct response without actually having the knowledge assessed. Still another limitation is the difficulty of writing good questions or items which assess higher order thinking skills.

Perhaps the most effective use of forced-choice assessment is to assess student knowledge, particularly if there is a need to cover a broad content area. Students can answer many forced-choice items in a relatively brief period of time.

Performance-based assessment

Performance-based assessment, also known as complex generated response, extended response, alternative, authentic, or constructed-response assessment, requires students to construct their own responses to questions or prompts--to actually perform or to develop a product. For example, a student might deliver a speech or play a musical instrument; the performances might be live or recorded. Or a student might assemble a portfolio containing descriptions of several cultures and representations of their literature and artwork, write an essay, show a mathematics proof, or make a drawing; such assessments would use paper and pencil. Rather than being treated as correct or incorrect, some responses may be considered qualitatively different from others.

Types of activities that qualify as performance-based assessments (if they are administered and scored uniformly) include:

  1. written products such as essays, reports, and mathematics proofs;
  2. oral and physical performances such as speeches, musical or dramatic performances, and athletic demonstrations;
  3. demonstrations such as conducting scientific experiments;
  4. visual products such as drawings, paintings, and exhibits; and
  5. portfolios of student work that are rated with scoring rubrics.

An advantage of this type of assessment is its effectiveness in validly assessing learning outcomes that call for the demonstration of skills or other performances. Also, performance-based assessment often provides more in-depth information--which may be useful for diagnosing individual students' learning needs. It is also more likely to be directly related to real-life skills.

A limitation of performance-based assessment is that while it can validly assess the student behaviors that are directly included in a specific task, those behaviors may not adequately represent a broader domain of interest. It may seem obvious that a task which requires students to explain how they answered an algebra problem may not assess how well the students could explain their answers to a geometry problem. Other situations are not so obvious. For example, students' ability to demonstrate one kind of scientific experiment may not be evidence that they can conduct an experiment that requires different scientific principles or methods.

Another limitation is that scores will be reliable only if the same assessment procedures are used with all students and the assessments are scored by trained raters using uniform rating scales or rubrics. ("Scale" and "rubric" are used interchangeably here.) Such scoring is often time consuming and can be expensive. A further limitation is that performance-based assessments may require much more time to comprehensively assess a body of knowledge than forced-choice assessments.

Some thoughts about this typology

Like any typology that might be used, this one has shortcomings. Assessment approaches cannot always be neatly categorized into forced-choice or performance-based. For example, completion items that require students to supply as much as several sentences certainly require them to generate their own responses. Completion items were categorized with forced-choice approaches because, generally, they are simpler and do not require students to generate their own responses in the same way as most performance-based responses. On the other hand, some performance-based tasks can be rather simplistic and lack some of the benefits commonly associated with performance-based assessment.

Some assessment approaches combine the two types presented here. One, known as "enhanced multiple-choice" may require students to explain their responses to multiple-choice items. Another approach uses thematic exercises that may, for example, require students to develop a theatrical character in response to a given stimulus, perform the character's role, and answer multiple-choice and essay questions about character development. An assessment task may require students to write an essay about a historical event and answer multiple-choice questions about it.

Videotaped assessments, which are currently being developed in Illinois to assess students' perceptual skills in the arts, may include forced-choice or performance-based questions. The tapes include excerpts from arts performances (such as in music or dance) and assessment items about them. Students watch and listen to the performances and then answer the questions. The questions may be of any type, such as multiple-choice or essay. They do not require students to perform. However, they do assess students' skills in interpreting and analyzing performances by others.

Portfolios may or may not qualify as performance-based approaches. For example, a collection of essays, stories, or poems written by a student and rated using a systematic scoring rubric can be considered a performance-based approach for assessing writing. Similarly, a collection of reports describing experiments conducted by a student might be appropriate performance-based assessment for science. A videotaped collection of a student's musical performances might be appropriate for the fine arts. To qualify as performance-based, portfolios must contain products or performances generated by students. Portfolios that are simply collections of forced-choice tests completed by students are not performance-based. Another condition that sometimes disqualifies portfolios is that they are simply used to assemble student work and the quality of the work is not judged. Unless the quality of portfolio contents is rated, the portfolios are not being used for assessment.

Written essays are not always clearly appropriate as performance-based assessment. If they are used to assess students' skills in writing or in higher-order thinking skills such as problem solving, analyzing, or critiquing, they are clearly appropriate. However, if they are used to estimate whether or not students have mastered a particular body of knowledge and simply require students to recite that knowledge (verbally or in writing), they do not require students to perform.
To help organize the materials in this document, assessment approaches will be discussed in these two categories. Readers should remember that distinctions between them sometimes blur and that teachers' most important assessment priority should be to ensure that individual assessment procedures are appropriate for their intended uses.

Assessment sources

As mentioned above, educators will probably want to consider three major sources of assessment procedures: (1) commercial publishers, (2) the local school or district, and (3) other educational organizations. The first and third sources would, of course, provide assessment procedures that have already been developed. However, they would have to be reviewed carefully; evidence of validity, reliability, and fairness may need to be collected. Also, revision or additional development may be necessary. The second source would require local educators to develop new assessment procedures specifically for use in the assessment system, although committees should also solicit and review existing assessment procedures that local staff have developed for their own use.

Most schools will probably use a combination of sources. Initially, it may be preferable to obtain assessment procedures from elsewhere in order to save the time and other resource costs of local development. These procedures might be used as they are, used as models, or otherwise adapted. On the other hand, assessment procedures that have been developed locally are likely to focus more directly on local outcomes and instruction and thus may provide more valid data.

Commercial publishers

The most commonly available commercially published assessments are standardized achievement tests (which may be norm referenced or criterion referenced), textbook tests, and tests in other published resources such as curriculum guides or textbook manuals for teachers. Other sources of commercially published assessments include customized tests which publishers tailor to local needs. Many of these tests are forced-choice, but more and more performance-based tests are being published--either by themselves or as components of test batteries which are primarily composed of forced-choice items.

Assessments published by testing corporations have several advantages. Many, especially standardized achievement tests, are likely to have been professionally constructed. It is highly probable that they have been reviewed carefully, pilot-tested, and analyzed for bias. Estimates of validity for various uses and of reliability are usually available in technical manuals and other sources. Item-difficulty data, which can be very useful in deciding how to use an item and in setting student expectations, may also be readily available. Scoring and reporting services can be obtained. Commercially published assessments may be appropriate for several uses, such as for accountability or program evaluation.

Textbook tests also have several advantages. They are likely to be readily available, closely aligned with instruction, and inexpensive. Schools and districts do not have to spend additional money or time to purchase or develop the assessments (although they may have to spend resources examining validity, reliability, and fairness).

Assessment procedures are sometimes also available from other published sources such as teachers' manuals, various resource materials, and books about assessment. Their quality varies considerably, as does their appropriateness in divergent local situations. They must be reviewed very carefully before being adopted. In addition, it may be necessary to obtain permission to copy and/or use them.

Before deciding to use commercially published assessment procedures, however, local planners should examine them carefully. The tests and tasks should be closely aligned with local learning outcomes. Commercial publishers sometimes provide information about alignment, but school personnel should independently examine assessment procedures to make sure that they agree with the publisher's interpretation of local curricula or outcomes and of what the items measure. The tests are unlikely to be valid locally unless such agreement exists. Educators should also review the alignment of textbook tests with the textbook, instruction, and local outcomes. Textbook tests are sometimes poorly aligned with the textbook. Even if textbook tests are aligned with the textbook and with instruction, the tests are not likely to provide useful information about student attainment of outcomes if instruction is poorly aligned with outcomes.

Local planners should also use information about validity, reliability, and fairness carefully. It may or may not be useful locally. Educators may need to ask textbook publishers for additional information about items or tasks. For example, for what uses will the assessment results be valid? What is known about the reliability and difficulty level of the items? Have the items been pilot-tested or reviewed for bias? (For more information about validity, reliability, and fairness, see Chapter 2.)

Before adopting commercially published assessments, local planning groups also need to make sure that assessment results can be aggregated or disaggregated as needed. For example, can a publisher or scoring service provide data for the specific clusters of items or tasks that measure particular local outcomes or objectives? If so, at what cost?
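As a concrete illustration (not a feature of any particular publisher or scoring service), the sketch below shows how item-level results could be aggregated by local outcome when a school keeps its own mapping of items to outcomes. The item numbers, outcome names, and scores are hypothetical.

```python
# Minimal sketch: aggregating item-level results into outcome clusters,
# assuming the school maintains its own (hypothetical) item-to-outcome map.
from collections import defaultdict

# Hypothetical mapping: item number -> local learning outcome
item_to_outcome = {1: "number sense", 2: "number sense",
                   3: "measurement", 4: "measurement", 5: "measurement"}

# Hypothetical item scores (1 = correct, 0 = incorrect) for three students
student_items = {
    "student_a": {1: 1, 2: 1, 3: 0, 4: 1, 5: 1},
    "student_b": {1: 0, 2: 1, 3: 1, 4: 0, 5: 0},
    "student_c": {1: 1, 2: 0, 3: 1, 4: 1, 5: 1},
}

# Tally percent correct for each outcome cluster across all students
totals = defaultdict(lambda: [0, 0])          # outcome -> [correct, attempted]
for items in student_items.values():
    for item, score in items.items():
        outcome = item_to_outcome[item]
        totals[outcome][0] += score
        totals[outcome][1] += 1

for outcome, (correct, attempted) in totals.items():
    print(f"{outcome}: {correct / attempted:.0%} correct")
```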

Local school or district

Locally developed assessment procedures are also appropriate to use as one of several types of measures. Since locally developed assessments will be administered on a smaller scale than published tests, they may permit more opportunities for such assessment alternatives as having students listen to audio recordings as they respond to questions about music, use videotapes with questions about dance or drama, or simulate scientific experiments on computers. Also, some audiences may have greater ownership of the assessment and use the results more, since local staff are involved in development.

Developing local assessment procedures is a challenging - but rewarding - undertaking. Designing good assessment tasks and items can be time consuming. School or district personnel may have to develop several for each local learning outcome. As with assessments from other sources, staff must address the issues of validity, reliability, and fairness. They may also have to develop scoring scales and train staff to use them. Staff will need to develop processes for printing, scoring, and reporting. Computer software programs can be a great help.

Other educational organizations

Assessment procedures are often available from other schools and districts as well as other kinds of educational organizations (for example, regional education service agencies, professional organizations, colleges and universities, and ISBE). Local school assessment committees may want to adopt or adapt these. Staff should review them carefully, make certain that they have permission to use them, and refrain from amending them unless they have permission to do so. Staff may prefer not to use procedures they cannot adapt as needed. If procedures are revised substantially, they should be reviewed by additional staff and tried out with students.

DEVELOPING LOCAL ASSESSMENT SYSTEMS

Developing a comprehensive assessment system is an important task which must be done with care. Decisions made during this stage will have a major impact on the utility of information collected for various purposes. The system might simply produce information that meets requirements imposed by a larger, external system such as a school district or state department of education. Preferably, however, the local assessment system will help improve student learning by producing information that is genuinely useful in the learning process. Results might be used to help monitor whether students are meeting local learning outcomes or objectives, to evaluate local programs and needs, or to make day-to-day instructional decisions.

Issues which should be addressed during the development of an assessment system include:

Practical Tip: Developing an assessment system that provides all of the information needed, is of high quality, and functions smoothly may require several years. Furthermore, modifications will be needed occasionally--for example, as curricula change or student learning improves. Local educators can probably make the most progress if they begin to assess all learning areas and goals. Then, committees of teachers (and others) can expand and refine the assessment procedures for each content area. At the same time, others can work on issues of validity, reliability, and fairness.
The process that a school or district uses to plan its assessment program is critical to the program's success. That process helps determine the quality of the system and its appropriateness for local purposes. Also, the process may help elicit crucial support. The program must be credible - that is, valid and useful - to everyone who is involved, including teachers, administrators, parents, students, and others. One of the most effective ways to generate such support is to involve a wide variety of people in the program's development. The involvement and roles of various committees will be discussed below.

Identifying assessment purposes/functions

Early in the process of developing an assessment system, local planning teams (in consultation with others) need to decide what functions the system should serve. That decision will strongly influence the remainder of the process. A major criterion used to select assessment procedures should be that they will produce the kinds of information needed for various purposes.

Some assessment functions/purposes that may be considered include:

System developers may want to limit the purposes of a school or district assessment system to some or all of the functions listed above and leave others--such as diagnosing individual students' needs, placing students in instructional groups, or evaluating special programs--to the discretion of teachers or others. However, they may also want to consider other functions in order to minimize the assessment procedures that must be administered by selecting those which serve multiple functions.

Deciding what to assess

At some point, perhaps during this stage, planning needs to become specific so that (a) planning teams or committees have the information needed to make decisions regarding when students will be assessed and (b) people who are or will be involved in the assessment process (primarily committee members, school administrators, and teachers) develop clearer understandings of the scope of the assessment system and the commitments that will be needed to implement it. Those commitments will include planning time (for example, to select and develop specific assessment procedures), resources for purchasing assessment procedures as well as scoring and reporting services, and classroom time for administering the assessments.

During this stage, planning teams should decide what content areas (and perhaps what knowledge/skills/learning outcomes) need to be assessed in order to meet the purposes or functions that have been identified. They may want to simply name the content areas that will be assessed. However, more specific information will allow them to plan more effectively. They may even want to identify the learning outcomes or other subcategories for which information will be needed.

Selecting assessment types

This stage is optional when only a broad outline for an assessment program is being developed. The outline would specify the purposes/functions of resulting information, the outcomes or content which will be assessed, and a schedule that indicates when students will be assessed. However, if assessment types are identified at this point, school personnel can make better decisions about allocating planning time and other resources for developing and purchasing assessment procedures. Also, information about what types of assessment are most appropriate will facilitate later efforts to select or develop specific assessment procedures.

The most important consideration when selecting assessment types is the kind of evidence which will most validly represent the status of students' knowledge/skills in a particular content area or learning outcome. Frequently, more than one type of evidence will be needed. Most content areas and many learning outcomes, for example, describe what students should know and how they should be able to apply that knowledge. A forced-choice test might very efficiently provide comprehensive evidence about student knowledge, but a performance-based approach may be necessary to find out whether students can apply that knowledge. If applying the knowledge means that students should write analytically, written essays should be used for assessment. If it means that students should perform physical tasks such as running in relay races, the assessment procedure should probably be to observe and rate students' performances.

Another reason that an assortment of assessment methods is often needed is that many content areas and outcomes include multiple and varied skills which can be assessed most validly using different methods. In the biological and physical sciences, for example, students may need to be able to design and conduct scientific studies and prepare reports of the results. To determine whether students have attained these skills, the assessment should include procedures that require them to design, conduct, and report the results of scientific studies. Teachers would assess the quality of the designs (perhaps using "model designs" and rubrics), observe and rate students' actions in conducting the experiment (perhaps using accepted scientific standards and criteria and a rubric), and assess the quality of the reports (perhaps using exemplary reports and a rubric).

Establishing assessment schedules

As mentioned previously, assessment designs should specify the grades at which assessment procedures will be used and when they will be administered. Occasionally, such as for standardized assessments, schedules may specify a very limited time frame in which a test should be administered. More often, however, the schedule will indicate that the assessment should be given at the end of a particular instructional cycle or during a "window" of time that may be as long as several weeks.

Practical Tip: Don't allow assessment--or any element of school improvement planning--to be overly emphasized or burdensome. Incorporate assessments into instructional activities. Combine assessment tasks to address more than one learning outcome. Distribute assessment across the academic year. Do not try to formalize every classroom test, quiz, or performance-based task.
Assessment schedules should ensure that resulting information will be available when it is most useful. Ask the following questions:
  1. At what grades is assessment information particularly needed for each content area? (To help identify those levels, examine local improvement plans, curricula, and organizational structures.)
  2. When will the information be most useful to the major audience(s) for whom it is being collected (school improvement planning committees, teachers, etc.)?
  3. Has sufficient time been allowed to score and compile the results so that they will be available when needed?
  4. How and when can the assessment be integrated with instruction to help reduce intrusions on classroom time?

Involving staff and other constituent groups

To the extent possible, those who are affected by assessment results (district and school administrators, teachers, students, parents, and other community representatives) should participate in the assessment's design. Each group offers a unique and vital perspective on which skills are most important to assess, how to assess them, and how to use the results.

Depending on a school or district's particular needs, at least two different types of groups might participate in assessment development. These could include an overall committee (such as a school improvement team) and task forces. The composition and function of each group is discussed in the following paragraphs.

Practical Tip: Involving teachers (and other key players) from the beginning of assessment planning should produce programs of better quality and result in less criticism. One reason that assessment generates so much controversy is that people feel they were not part of the process and that decisions were made without their knowledge -- leaving them with no option other than to "take it or leave it." Teachers often fear that inappropriate assessment procedures will produce misleading results which imply that their teaching is ineffective.

Overall committee (for policy recommendations)

The overall team or committee is likely to have general responsibility for developing the assessment system. It should probably be broadly representative of the school or district. The committee might include staff who will be involved in system implementation or in the use of resulting data (e.g., assessment or curriculum directors), school and district administrators, teachers from different grade levels and subject areas, perhaps representatives of the teachers' association or union, and community (e.g., school board) members.

One of the committee's first responsibilities might be to develop a framework for the assessment system. That framework should include the functions or purposes of the system and the types of assessment procedures to be used. The framework can then be used to guide the development of more specific assessment plans.

This planning team can form an important link between teachers and administrators. Depending on local needs, the link may be broadened to include others. Throughout the planning process, the committee should inform others of its work and invite their suggestions. The committee will probably want to involve other staff more directly by establishing special-purpose task forces.

Task forces (for specific areas and functions)

The number, composition and major functions of the task forces will vary according to local factors such as development needs, school or district size, and type of district (elementary, secondary, or unit). The functions of the task forces may include selecting or developing assessment procedures or designing specific assessment system elements such as procedures for processing or reporting the assessment data.

Special considerations

Members of all committees and task forces should be selected carefully. Whether they are asked to serve or selected from a pool of volunteers, they should be people who are:

Developing a comprehensive assessment system is time-consuming. School and district administrators may need to identify strategies that make it easier for school personnel to participate. For example, they might schedule meetings during inservice or institute days, hire substitutes, offer salary-schedule or college credits to participants, ask other teachers to cover participants' classes (and perhaps reward them for doing so), or pay staff for working during the summer.

The school board's role

A well-informed school board can be one of the best allies of any assessment program. Proactive administrators keep their boards closely involved throughout the planning and implementation of assessment programs. Regularly scheduled board review can help ensure that assessment practices remain responsive to changing local needs.

At each stage of the planning and implementation of an assessment program, a skillful administrator provides the board with relevant decision-making information. The information is more likely to be used if it is timely, complete, easily understood, and targeted to the decisions at hand.

Practical Tip: Keep teachers involved. Don't allow a scenario such as the following to occur. Administrators in one school district entered into a contract with a private corporation to develop their local assessment system. Corporation representatives held three half-day meetings with teachers near the beginning and end of the development year. The assessment system the corporation developed impressed the university consultants who reviewed it. However, four years later, annual results from two independent measures of student achievement (a commercially published test and the state assessment) showed no growth in student learning. A subsequent investigation found that teachers ignored the assessment results provided by the district's contractor. The teachers had had little input into the local assessment system and little knowledge about it until after it was completed. Consequently, they did not view the contractor's assessment data as credible indicators of student learning.
During the initial planning stages of an assessment program, the board should be educated about:
  1. the potential purposes of assessment,
  2. characteristics of effective assessment programs,
  3. proposed procedures for test selection or development,
  4. the limitations of assessment,
  5. state assessment requirements, and
  6. estimated costs for various assessment options.
After an assessment program has been implemented, the board should be kept informed. When assessment results are available, the board should receive a report of school- and district-level performance. This report, delivered prior to public reporting of the scores, should be designed to help members understand the results and prepare them for questions/comments from the public.

Reviewing the assessment system

The assessment system should be reviewed regularly. For example, it should be reviewed as indicated in school improvement plans, whenever students do not meet locally defined expectations or when trend information indicates a decline in student performance. The review should include input (oral or written) from staff members who have used the assessment procedures and results. It should include their judgments of quality, appropriateness, and utility. Also, various audiences can be surveyed about their perceptions of the program's effectiveness. Staff recommendations for changes in the assessment program can be presented to the board along with survey results. If board members have been provided with appropriate information since the beginning of the assessment program, decisions made at this point should be especially sound.

Examine questions such as:


Chapter 2

The Quality of Assessment

This chapter discusses five major factors that are integral to the quality of assessment. The first three factors--validity, reliability, and fairness--have been particularly emphasized in school improvement planning in Illinois. The other two--assessment administration and alignment--are also critical to the success of the school improvement process.

Ensuring the Quality of Assessment

Everyone who participates in the development or implementation of assessment systems is responsible for helping ensure that assessment is of high quality. Quality must be a concern at every stage--when designing assessment systems; selecting or developing assessment procedures; administering the procedures; and scoring, reporting, and using the results. Assessment which is of poor quality is of limited utility. The information it produces does not represent student learning well enough to inform decision makers about the types of changes that are needed to help improve the educational system. The time and other resources devoted to planning and administering the system will have been poorly spent. Three major indicators of quality--validity, reliability, and fairness--are discussed in this section.

Validity is the extent to which an assessment measures what is needed for a particular purpose and to which the results, as they are interpreted and used, meaningfully and thoroughly represent the specified knowledge or skill.

Reliability is the consistency or stability of assessment results. It is often defined as the degree to which an assessment is free from errors of measurement.

Fairness means that assessment procedures do not discriminate against a particular group of students (for example, students from various racial, ethnic, or gender groups, or students with disabilities).

All three indicators represent critical elements of good assessment systems. Assessment procedures or results with purported high validity but low reliability are actually of low validity and cannot be used confidently because differences in scores across students or across time are more likely due to random error than actual differences in achievement. Assessments with high reliability but low validity accurately measure something other than what was intended and the results are meaningless with regard to the intended purpose. Since they only poorly represent the knowledge or skills they supposedly measure, they are not useful. Assessments which claim to be highly valid and reliable but discriminate against certain groups of students should not be used because they are not fair to those students and thus are not valid.

Although the three indicators mentioned above are essential, validity is the most important. However, educators have often spent time and other resources gathering evidence about reliability to the neglect of validity. In recent years, many articles in education publications have proposed abandoning forced-choice assessment in favor of performance-based assessment. A major criticism of forced-choice assessment is that it is strong in reliability but weak in validity. Unfortunately, performance-based assessment may suffer from that same dilemma. In response to criticisms that scoring is inconsistent, major resources are sometimes devoted to documenting the reliability of scoring while neglecting validity--as well as other sources of unreliability.

Validity

As mentioned above, validity is the extent to which an assessment measures what is needed for a particular purpose and the results, as they are interpreted and used, meaningfully and thoroughly represent the specified knowledge or skill. This definition highlights the importance of considering purposes and intended uses when developing and selecting assessment procedures. Those procedures must assess the knowledge or skill (or learning goal, outcome, or objective) that they claim to measure. The type of information produced must be useful for the intended purposes.

Traditionally, measurement specialists have discussed several types of validity, including content, construct, concurrent, and predictive validity. Content validity is usually established through expert review of assessment contents. Examining assessments to estimate their content validity is likely to be both feasible and useful at the local level. Examining the other types of validity requires using statistical procedures and may be difficult and of limited usefulness locally. Linn, Baker, and Dunbar (1991) proposed eight criteria for evaluating validity in performance-based assessment. However, the criteria are also appropriate for forced-choice assessment and may help improve it more than traditional statistical validation procedures.

Figure 2: Summary of Criteria for Estimating Validity
Criterion                        Focus
Consequences                     Effects of assessment
Content coverage                 Comprehensiveness
Content quality                  Consistency with current content conceptualization
Transfer and generalizability    Representativeness of broader domain
Cognitive complexity             Level of knowledge assessed
Meaningfulness                   Relevance to students
Fairness                         Freedom from bias against members of a group
Cost and efficiency              Practicality and feasibility of assessment

Validity criteria

The eight criteria should help inform educators about whether they need to revise assessment procedures and the types of revisions they should make. Answers to the questions presented below should also influence how educators interpret and use assessment results.

When using the criteria, educators who plan to use two or more assessment procedures in combination with one another--for example, to estimate whether students meet a learning outcome--may want to ask whether the assessment procedures, considered together, meet the criteria. Alone, neither assessment procedure may, for example, cover content or represent cognitive complexity adequately. Together, they may be quite effective. This means, of course, that neither assessment procedure should be reported or used without the other.

1) Consequences

Focus: The effects of the assessment

Questions

2) Content Coverage

Comment: This criterion may be of limited utility for individual assessment procedures. Educators will probably find it more useful to examine all procedures that, together, will assess a particular content unit (e.g., learning outcome). While individual assessment procedures must be appropriate for the content assessed, they may cover only a portion of it.
Focus: Comprehensiveness of assessment content

Questions

3) Content Quality

Comment: This criterion is especially important in content areas (e.g., science) in which knowledge sometimes grows rapidly. An assessment that represents an outdated conceptualization of the content assessed is not likely to produce useful information and will waste students' and teachers' time.
Focus: Consistency with current content conceptualization

Questions

4) Transfer and Generalizability

Comment: For different reasons, this criterion is of concern in both forced-choice and performance-based assessment. In the former, the nature of the questions may make them poor indicators of student ability to deal with concepts in the domain assessed. In performance-based assessment, the small number of tasks makes it essential that each task or group of tasks is representative of the domain assessed. Also, continued re-use of any type of assessment may compromise generalizability, because teachers and students may focus on specific items or tasks from the test rather than on the larger domain.
Focus: The assessment's representativeness of a larger domain

Questions

5) Cognitive Complexity

Focus: Whether level of knowledge assessed is appropriate

Questions

6) Meaningfulness

Comment: Assessment procedures must be meaningful to students in order to produce valid, useful information. Assessment that is relevant to students' personal experiences is likely to motivate students to perform as well as possible. However, some assessments cannot be made relevant to problems students encounter in real life, and educators should realize that contrived assessments may poorly represent the knowledge or skills assessed.
Focus: The relevance of the assessment in the minds of students

Questions

7) Fairness

Comment: For some assessment purposes (e.g., to measure the achievement of individual students in a content area), it may be important to consider whether all students have had similar opportunities to acquire the knowledge or skills assessed. For example, are some students at an advantage because ancillary skills (such as prior knowledge or reading ability) that are not relevant to the focus of the assessment enable them to score higher? On the other hand, if the purpose is to find out whether a group of students has achieved something (such as a learning outcome) and some students have not had an opportunity to learn it, the assessment is not unfair or invalid. Rather, instruction may not have been adequate.
Focus: Fairness to members of all groups

Questions

8) Cost and Efficiency

Focus: The practicality or feasibility of an assessment

Questions

Examining and improving validity

Educators who use assessment should be continually concerned about validity. They should explicitly design assessment systems so that the information collected will serve the purposes for which it was intended. They should select or develop assessment procedures that are likely to produce the information needed. When processing and reporting results, educators should ensure that scores accurately represent results and analyses present the type of information needed. When using results, educators and others should avoid making interpretations or drawing conclusions that are not warranted by the data.

A common, efficient way to examine validity is through review by qualified people (teachers, curriculum coordinators, etc.). A panel consisting of people with recognized expertise in the subject area examines assessment procedures using criteria such as those above: consequences, content coverage, content quality, transfer and generalizability, cognitive complexity, meaningfulness, fairness, and cost and efficiency.

Educators may want to assemble panels of members who are qualified in the subject under consideration to make informed judgments. The panels might meet during the development or selection of assessment procedures and again after results are available. Ideally, they might also assemble each time a new assessment is ready for piloting. Validity review committees which systematically review local assessment procedures can help ensure that the procedures remain valid. (Figure 3 below illustrates a worksheet that validity review panels might want to use as is or modify as needed.)

Figure 3: Worksheet: Assessment Validity Review
Directions: Indicate your agreement or disagreement with the following statements by answering "yes," "no," or "not sure." For any items answered "no" or "not sure," explain below why the assessment procedures did not meet the criteria. For individual procedures, record which items or tasks receive a rating of "no" or "not sure."

Reliability

Reliability refers to the consistency or stability of assessment results. Traditionally, questions about reliability in assessment have asked whether students were likely to respond in the same way to a particular stimulus (such as a test) if it had been presented to them, for example, a week later or in a different but equivalent version. In addition, questions about reliability in performance-based assessment ask whether students' responses would have received the same ratings if different people had scored them--or if the same person had scored them at another time.

Reliability is a necessary condition of validity. An assessment procedure that is not reliable cannot be valid. A test that a student would respond to quite differently one day than the next (for example, because the test was very susceptible to the student's mood, which changed due to an argument at school or home) would not produce trustworthy results. Conversely, an assessment procedure can be reliable but not valid (i.e., not actually measuring what is intended). A multiple-choice spelling test may produce highly reliable scores and be validly interpretable as an indicator of student achievement in recognizing correct spelling. However, if one needs to know whether students can spell correctly when writing rather than simply recognize misspelled words, the reliable scores may not be valid indicators. A performance-based task which requires students to play a musical scale may be very reliable in the sense that individual students perform it in the same way repeatedly and different raters consistently assign the performances the same ratings. However, the task may be of low validity as an indicator of students' ability to play a piece of music.

Examining reliability

The process of examining reliability will be difficult for many schools. It requires analyzing assessment results statistically. The formulas for estimating the reliability of forced-choice assessments are sophisticated and can be applied much more easily using a computer. Although estimating the reliability of scoring in performance-based assessment is simpler, it involves so much information that computers are almost essential. Extensive, systematic procedures must be used if computers are not available. Several computer programs can help estimate reliability. The software program that is available for data management for Illinois school improvement planning provides a general measure of reliability (Kuder-Richardson Formula 21). Staff can also refer to several resources listed at the end of this chapter for further help in estimating reliability.
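For readers who want to see what such a calculation involves, here is a minimal sketch of Kuder-Richardson Formula 21. It assumes items are scored 0 or 1 and uses only the number of items, the mean of the total scores, and the variance of the total scores; the student totals shown are hypothetical.

```python
# Sketch of Kuder-Richardson Formula 21 for a test of dichotomously scored
# (0/1) items. Only total scores and the number of items are needed.
from statistics import mean, pvariance

def kr21(total_scores, num_items):
    k = num_items
    m = mean(total_scores)           # mean total score
    var = pvariance(total_scores)    # variance of total scores
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * var))

# Hypothetical totals for a 20-item test taken by eight students
totals = [17, 15, 12, 19, 9, 14, 16, 11]
print(round(kr21(totals, 20), 2))
```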

Most schools may need several years to accumulate strong evidence of reliability. However, local committees should attend to reliability almost continuously, beginning when they select or develop assessment procedures. This is necessary to help ensure that (a) sufficient data for computing reliability will be available later, (b) the computations will not force a school to make major adjustments in its assessment program, and (c) assessment information used for school improvement planning in the meantime is of high quality. The next sections will describe methods that are commonly used to estimate reliability and offer suggestions to schools regarding their gradual accumulation of reliability evidence.

Forced-choice assessment

Frequently used methods for examining the reliability of forced-choice assessments are: 1) internal consistency reliability, 2) equivalent forms reliability and 3) repeated measures reliability. Each will be described briefly below.

Figure 4: Methods of Estimating Reliability
  1. Kuder-Richardson (Measure of internal consistency) Give test once. Score total test and apply Kuder-Richardson formula

  2. Split-Half (Measure of internal consistency) Give test once. Score two equivalent halves of the test (e.g., odd items and even items); correct the reliability coefficient to fit the whole test using the Spearman-Brown formula.

  3. Equivalent Forms (Measure of equivalence) Give two different forms of the test to the same group in close succession.

  4. Test-retest with same test (Measure of stability) Give the same test twice to the same group with any time interval between tests from several minutes to several years.

  5. Test-retest with equivalent forms (Measure of stability and equivalence) Give two different equivalent forms of the test to the same group with an increased time interval between forms.

Internal consistency (or split-half) reliability provides an estimate of test stability without having to administer the test twice. It is calculated by splitting a test in half randomly (for example, separating odd- and even-numbered items), scoring the two halves separately, and correlating the results. The Kuder-Richardson formula is one way to determine split-half reliability. In effect, it correlates every item on a test with every other item, averages those correlations, and uses the average to estimate the reliability of the whole test. High split-half reliability means that individual students would receive approximately the same score on both halves. The test questions are, for the most part, measuring the same content and contributing effectively to the overall content being measured.
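The split-half procedure just described can be sketched briefly. The example below assumes dichotomously scored (0/1) items stored one row per student, splits the test on odd- and even-numbered items, and applies the Spearman-Brown correction; the responses are hypothetical, and statistics.correlation requires Python 3.10 or later.

```python
# Minimal sketch: split-half reliability with the Spearman-Brown correction.
from statistics import correlation  # Python 3.10+

def split_half_reliability(scores):
    """Correlate odd- and even-item totals, then correct to full test length."""
    odd_totals = [sum(row[0::2]) for row in scores]   # 1st, 3rd, 5th ... items
    even_totals = [sum(row[1::2]) for row in scores]  # 2nd, 4th, 6th ... items
    r_half = correlation(odd_totals, even_totals)     # half-test correlation
    return (2 * r_half) / (1 + r_half)                # Spearman-Brown correction

# Hypothetical responses: 5 students x 6 items (1 = correct, 0 = incorrect)
responses = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]
print(round(split_half_reliability(responses), 2))
```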

Equivalent forms reliability is calculated by giving students two forms of a test (one set of questions on test form A and a parallel set on test form B) and correlating the results. With high reliability, students would receive roughly the same score on both forms. That is, form A and form B measure the same thing and result in the same measurement of the student's knowledge/skills.

Repeated measures (or test-retest) reliability is an estimate of an assessment's stability. It is calculated by administering a test to a group of students twice, with a given interval between the two testing times. Then, students' results are correlated. If reliability is high, each student's score would be roughly the same on the second test as on the first test. Given that other conditions were equal (for example, that the test was administered in the same way and that the student did not remember answers from the first administration or had not acquired new knowledge), a score on the test would be stable from one administration to another.
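Both equivalent forms and test-retest reliability reduce to correlating two sets of scores obtained from the same students. A minimal sketch, using hypothetical scores:

```python
# Sketch: correlate scores from two forms (or two administrations) of a test.
from statistics import correlation  # Python 3.10+

# Hypothetical scores for the same six students
form_a = [42, 35, 28, 45, 31, 38]   # form A, or the first administration
form_b = [40, 33, 30, 44, 29, 39]   # form B, or the second administration

print(round(correlation(form_a, form_b), 2))  # high values suggest consistency
```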

When schools plan to adopt tests that were developed elsewhere, such as by commercial test publishers or other school districts, they should consult technical manuals or contact the organization that developed the tests to request evidence of reliability. If reliability information is not available, staff must decide whether to establish a reliability estimate themselves or abandon the test. (It is critical to note here that estimates calculated by others will probably be for entire tests--or, perhaps, subtests. If a school plans, for example, to use a cluster of items to measure a particular knowledge, skill, or outcome, the developer's reliability estimate will not provide satisfactory evidence.)

Schools must take care in how they report reliability estimates. The Standards for Educational and Psychological Testing note that: "Because there are many ways of estimating reliability, each influenced by different sources of measurement error, it is unacceptable to say simply, 'The reliability of test X is .90' " (p. 21).

In any statement about the results of reliability studies, it is important to present:

  1. the method that was used to estimate reliability,
  2. the student population to whom it was administered, and
  3. the decision rule that was followed to determine if the reliability estimate was sufficient to ensure accuracy.

Performance-based assessment

In performance-based or complex generated response assessments, some of the above approaches might be adapted. However, they would not be sufficient. The approaches assume that scoring is consistent. As mentioned previously, judgment is required to score performance-based assessments. Therefore, evidence of reliability must include evidence that a scoring rubric or scale was used uniformly by all persons who rated the responses, thus ensuring that a student's performance received roughly the same score regardless of which rater scored it.

Interrater agreement is often used as an indicator of the consistency of scoring performance-based assessments. It is calculated as the percentage of agreement between judges.

Interrater agreement is often estimated by "double scoring" a sample of student responses and comparing the ratings assigned by different people. A sample of students' responses (such as essays, science experiments, or dance performances) is selected. Each response is rated by at least two people. The resulting scores are compared. The percentage of agreement is then calculated. The sample should include responses representing the top, middle, and bottom of the scale to help ensure that agreement exists throughout the scale.

The size of the sample used to examine interrater agreement will vary according to the size of the local population and the consequences of the assessment. In some situations, it may be necessary to double score the responses of all students.

Before beginning an analysis of interrater agreement, local educators should decide what is required for agreement. Criteria for acceptable levels of agreement may vary according to tasks, scales, and the purposes of the assessment. On a limited scale, such as one with four points, raters should not be considered in agreement unless they assign exactly the same score to a performance.

On a larger scale, raters might be considered in agreement if they assign adjacent scores. For example, the state writing assessments in Illinois have 6-point rating scales. Interrater agreement is considered satisfactory if readers are in exact agreement on 65% - 70% or more of their ratings and are within one point of agreement on 90% or more of their ratings.
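The sketch below shows how exact and adjacent (within one point) agreement might be computed for a double-scored sample on a 6-point scale. The ratings are hypothetical, and the thresholds simply echo the figures cited above rather than any required standard.

```python
# Sketch: exact and adjacent interrater agreement for double-scored responses.
def agreement_rates(rater1, rater2):
    pairs = list(zip(rater1, rater2))
    exact = sum(a == b for a, b in pairs) / len(pairs)              # same score
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)  # within 1
    return exact, adjacent

# Hypothetical ratings of ten double-scored essays on a 6-point scale
first = [4, 3, 5, 2, 6, 4, 3, 5, 1, 4]
second = [4, 4, 5, 2, 5, 4, 3, 6, 1, 4]

exact, adjacent = agreement_rates(first, second)
print(f"exact: {exact:.0%}, within one point: {adjacent:.0%}")
print("meets example thresholds:", exact >= 0.65 and adjacent >= 0.90)
```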

Interrater agreement should be established each time a new scoring scale (or assessment procedure) is used or a new group of teachers plans to use the scale. This will help provide evidence that the rubric or scale has been constructed well enough that different individuals can use it consistently.

All raters should be trained to use scoring rubrics. The training should include carefully reviewing the scales--which should have been developed so they could be understood clearly and used consistently. Ratings that others assigned to previous performances using the same rubric (sometimes referred to as "exemplars") may help new raters understand the scales. During the training, raters should use the scale--for example, to score students' written or videotaped responses to previous similar assessments or exercises. Interrater agreement should be examined to determine whether individual raters are using the scale correctly.

After interrater reliability has been established, other kinds of reliability can be examined. It may be important to learn, for example, whether students respond differently to performance-based assessments across brief time periods or with slight variations in the assessments.

Gradually accumulating evidence of reliability

As mentioned above, school improvement or assessment committees may consider it necessary to work over a period of several years in order to accumulate strong evidence of the reliability of assessment procedures. This section will suggest strategies that committees might use to accumulate such evidence and to help improve reliability in the meantime so that the eventual evidence is less likely to require that they revise assessment procedures substantially.

While designing assessment programs, local committees should develop plans specifying the methods they will use to estimate the reliability of assessment procedures, the information that will be needed to calculate the reliability estimates, and the procedures they will use to ensure that the information will be collected (for example, by requesting that scoring services compute specified types of reliability estimates or by keeping records showing what scores each rater assigned to each student's responses).

During the selection and development process, committees can:

During assessment administration, schools can help improve reliability by ensuring that the assessment administration is standardized. They can do this by making sure that everyone who administers the assessment procedure has a copy of administration instructions, understands them, and uses them to administer the assessment uniformly to all students within a designated time frame.

After assessments have been administered, schools can monitor scoring procedures. For example, they can send students' responses to commercial scoring services. They can also double score and analyze a sample of performance-based assessments.

By using the procedures suggested above, schools should help ensure that (a) reliability is high, and (b) appropriate information will be available to calculate reliability estimates. Thus, after several years, they are likely to assemble strong evidence of reliability.

Improving reliability

What is to be done if reliability estimates indicate an unreliable test? In forced-choice assessment, school personnel may want to look at data for individual items. Compare students' performance on an item with their performance on the overall test. Did students with the highest scores tend to respond correctly to the item? Did most students who did poorly on the test as a whole get the item incorrect? This can be examined by charting the performance of the item or by computing a point biserial correlation. A strong relationship, as indicated on the chart or by a high point biserial correlation, indicates that the item was consistent with the test as a whole. Low correlations help identify items which should be revised or deleted to improve reliability.
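Where item-level results are available electronically, the relationship between each item and the total score can be checked directly. The following Python sketch computes a point biserial for each item in a small, invented set of responses; the flagging threshold of .20 is an illustrative assumption, not a fixed rule.

    from statistics import mean, pstdev

    def point_biserial(item_scores, total_scores):
        # Pearson correlation between a 0/1 item column and the total test scores
        # (the point biserial).  Returns 0.0 if either column has no variation.
        mx, my = mean(item_scores), mean(total_scores)
        sx, sy = pstdev(item_scores), pstdev(total_scores)
        if sx == 0 or sy == 0:
            return 0.0
        cov = mean((x - mx) * (y - my) for x, y in zip(item_scores, total_scores))
        return cov / (sx * sy)

    # Hypothetical responses: rows are students, columns are items (1 = correct).
    responses = [
        [1, 1, 1, 1, 0],
        [1, 1, 1, 1, 0],
        [1, 1, 1, 0, 0],
        [1, 1, 0, 0, 1],
        [1, 0, 0, 0, 1],
        [0, 0, 0, 0, 1],
    ]
    totals = [sum(row) for row in responses]

    for i in range(len(responses[0])):
        item = [row[i] for row in responses]
        r = point_biserial(item, totals)
        flag = "  <- candidate for revision or deletion" if r < 0.20 else ""
        print(f"Item {i + 1}: point biserial = {r:+.2f}{flag}")

In this invented data set the last item is answered correctly mainly by low-scoring students, so its point biserial is strongly negative and the item is flagged for review.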

When reliability estimates are low in performance-based assessment, examining scoring rubrics as well as raters' perceptions of them may help identify inconsistencies which should be corrected. Ask questions such as:

Practical Tip: When information from different sources conflicts, it does not necessarily mean that some of the information is wrong. The various sources may (a) focus on different things (ability to write narratively or persuasively), (b) look at them from different perspectives (ability to analyze and answer a multiple-choice question about scientific procedures or to demonstrate their use), or (c) contain different kinds of error (performance ratings may represent the knowledge or skills assessed too narrowly; multiple-choice test scores may be influenced by test wiseness).
In addition, procedures similar to those described above for forced-choice assessment may be useful. Compare students' scores on multiple tasks, tests, and observations. Did students who usually demonstrated the most competence score highest on the performance-based assessment? Did most students who did poorly in the subject as a whole get lower scores on this particular assessment? This task can be facilitated by charting scores from various assessment procedures or by computing correlations between the performance-based assessment scores and one or more of the other measures. Relationships that are very low help identify assessment instruments or procedures which may need to be revised. However, information from various sources will not correspond perfectly. The various sources may focus on different aspects of student behavior or knowledge, and they use different kinds of evidence, such as teachers' impressions, students' check marks, written essays, or actual performances.

Fairness

School personnel must ensure that assessment procedures are fair to all students, that they do not discriminate against members of any ethnic/racial or gender group or against students with disabilities. Assessment procedures are not fair if they offend members of some groups, if the way they refer to some groups distracts students and lowers their scores, or if other qualities of the procedures reduce the ability of group members to answer questions correctly. The assessment results of students who are from different groups but have similar knowledge and skills should be similar. If students' scores are lowered due to their group membership, assessment procedures are discriminatory.

Educators can sometimes obtain evidence that commercial tests have been reviewed and/or statistically analyzed to eliminate bias. However, they should not simply accept such evidence. They should review the evidence to make sure that the assessments will be fair locally. They should review the methods used to examine fairness, the groups focused upon, the representativeness and credibility of the reviewers, and the issues and sensitivities the reviewers were most attentive to as they critiqued the assessments for potential bias. If the information provided does not ensure local fairness, educators will need to conduct additional analyses. They will also need to examine the fairness of assessment procedures obtained from organizations other than commercial test publishers (such as textbook publishers or other districts) or developed locally.

Bias review procedures may include several different types of questions:

Examining and improving fairness

Most procedures for reviewing fairness can be categorized as either judgmental or statistical. Judgmental review is usually conducted by a committee or panel which reads assessment procedures (including items, prompts, instructions, scoring scales), asks questions such as those listed above, and identifies items or tasks which appear to be biased. Those items or tasks are then either revised or deleted. Judgmental review is usually conducted by committees (which may include community members as well as educators) that are sensitive to the groups under consideration. They should receive training in assessment fairness.

Statistical review involves examining assessment results for various groups of students. A simple but useful type of statistical information is the proportion of students from each group who answered items correctly (commonly known as the item-difficulty level or p-value) or received particular scores on performance-based tasks. Items and tasks which appear to have been more difficult for some groups than others should be reviewed judgmentally to determine whether the differences were caused by bias. Since many group differences may represent actual differences rather than bias in assessment procedures, it may be helpful to examine point biserial correlations for each group on each test item. Those correlations estimate whether students with higher ability are more likely to answer an item correctly than students with lower ability. Educators who examine point biserials in order to identify discriminatory items should be aware of two limitations. First, the point biserial statistic is not appropriate for many types of assessments--particularly performance-based assessments. Secondly, the statistic is appropriate for evaluating individual items; it cannot identify an entire test that is biased.
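Where results can be grouped electronically, p-values by group are easy to tabulate. The Python sketch below uses invented responses and group labels; the 20-percentage-point gap used for flagging is an illustrative assumption, and, as noted above, a flagged difference is a signal for judgmental review rather than proof of bias.

    from statistics import mean

    # Hypothetical item responses (1 = correct, 0 = incorrect) grouped by student group.
    # Real analyses would draw these from local assessment records.
    responses_by_group = {
        "Group A": [[1, 1, 0, 1], [1, 0, 1, 1], [1, 1, 1, 0], [0, 1, 1, 1]],
        "Group B": [[1, 1, 0, 0], [1, 0, 0, 1], [1, 1, 0, 0], [0, 1, 0, 1]],
    }

    num_items = 4
    for item in range(num_items):
        p_values = {group: mean(student[item] for student in students)
                    for group, students in responses_by_group.items()}
        gap = max(p_values.values()) - min(p_values.values())
        # A gap alone is not proof of bias; flagged items go to judgmental review.
        note = "  <- refer for judgmental review" if gap >= 0.20 else ""
        print(f"Item {item + 1}: " +
              ", ".join(f"{g} p = {p:.2f}" for g, p in p_values.items()) + note)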

Some bias review experts prefer either judgmental or statistical review procedures. However, a more thorough bias review can be conducted if the two types of procedures are used interactively. Each has different strengths and weaknesses. Neither is sufficient alone. To capitalize on the strengths of each method and counterbalance the weaknesses, both should be used.

Bias reviews should be conducted at several stages of the assessment cycle. Assessment developers, including developers of both forced-choice items and performance-based tasks, should be sensitized to the types and sources of bias. Later, the assessment procedures should be reviewed by others who are representative of various groups and knowledgeable about learning area content or statistical procedures. During the process of selecting specific assessment procedures, all procedures should be examined for bias. After assessment procedures have been administered, scoring should be monitored to ensure that scores were not influenced by bias. Results should be reviewed statistically.

Districts might establish bias review procedures that specify: (a) the types of procedures that will be used, (b) the types of committees or panels that will be involved in the process, and (c) the stages of assessment at which bias review will be conducted. Guidelines for bias review are included in Bias Issues in Test Development (National Evaluation Systems, Inc., 1987), which ISBE distributed to Illinois school districts previously.

OTHER IMPORTANT TOPICS

As mentioned previously, two topics will be discussed here--assessment administration and alignment. Both are very important to the quality of assessment programs.

Assessment administration

Assessment procedures must be administered uniformly to all students. Otherwise, scores will vary due to factors other than differences in student knowledge and skills. Consider the situations below. Each statement describes two classrooms in which the same assessment procedures were administered:

In the first class, the teacher collected students' test booklets immediately after the designated 40 minutes had elapsed. In the second, the teacher let the students continue to work 10 minutes longer until a buzzer sounded, signaling the end of the class period.

One teacher read the instructions to students and reviewed two sample problems with them, as directed in the administration manual. The other teacher handed the tests to students and told them to read the directions and work through the sample problems themselves.

Students in one class worked quietly without interruption while completing a task. The second class was interrupted four times--by phone calls from the office and by students who entered to deliver messages. The teacher of the first class had asked the office not to disturb the class and hung a "Do not disturb" sign on the door. No prior arrangements had been made for the second class.

In each of the illustrations above, one class received higher average scores than the other. However, those scores were not reliable indicators of student knowledge/skills. Some scores were misleadingly inflated (for example, because students were given extra time). Others were lower than they should have been (for example, because the teacher did not read the instructions). Differences in testing conditions may have caused differences in results, making them unreliable. These examples show why assessment data for students from different classes should not be aggregated or compared unless the assessments were administered under the same conditions.


To help ensure that assessments are administered uniformly, schools should make sure that administration instructions exist for all assessment procedures and that those instructions are studied carefully and used by those who administer them. Commercial test publishers routinely prepare administration manuals for all assessment procedures they develop. Schools should attempt to obtain administration manuals from the originators of all assessment procedures and should distribute copies to all teachers who administer those assessments. However, in some cases such manuals are unlikely to be available. Local staff may have to develop administration procedures for assessments that are developed locally or are adopted/adapted from textbooks, other schools, and elsewhere.

Assessment procedures are likely to be administered more uniformly if educators know about and implement appropriate activities before, during, and after tests are administered. Many of these activities are described below. They will also probably help the assessment administration go more smoothly, avoid delays, and ensure fairness. Some activities can be handled by a single person in the school or district. Others require cooperation among and integrity by administrators, teachers, students and parents.

Before assessment begins

Practical Tip: Students get cues about how to react to an assessment from teachers. They need to know how the information will be used and that the teacher considers it important. If a teacher reads the directions carefully and proctors the assessment rigorously, students are more inclined to give their best effort. In contrast, if a teacher conducts the assessment casually and haphazardly, paying little attention to directions or class behavior, students may react accordingly and the results may be lower than students' actual levels of achievement.
When the assessment occurs

When assessment is completed

Alignment

Alignment refers to coordination among various elements of a system. It is critical to the effectiveness of cyclical processes--for example, processes that involve identifying educational intents (as expressed in goals, outcomes, or objectives), developing curricula, arranging instruction to help ensure that students reach the intents, assessing students, and using the data to inform subsequent planning as the cycle is repeated.

Stories about the effects of poor alignment abound in education. One hears about curriculum documents that took many hours to develop but ended up sitting on shelves rather than being used, or tests that are poor indicators of student learning because they do not assess what is taught. In the first example, alignment between written curricula and actual instruction is poor. In the second, curricula and instruction may have been closely aligned but assessment was not aligned with either of them; or, assessment may have been closely aligned with curricula but instruction was not aligned with either. In such situations, a cyclical process like that used in Illinois school improvement planning would be minimally effective.

Practical Tip: Alignment is most beneficial when all elementary and secondary grade levels are involved. That way, programs and expectations of students are more coordinated and consistent from level to level. Alignment should involve teachers from elementary and secondary schools that students attend - which sometimes means that teachers from several districts work together. No level should be allowed to dictate to another. Rather, educators should work collaboratively to make sure that outcomes and curricula reflect the emphases that educators from all levels consider critical.
Alignment has numerous applications and may refer to coordination or matches between or among any of the following: goals for learning, outcomes, objectives, curricula, instruction, and assessment.

Alignment might be examined at the district, school, or classroom level. Often, the examination should include staff from different grades and school levels (such as elementary and high schools) and from different content areas (such as mathematics and science).

Alignment that is functioning well provides smooth, cyclical transitions from planning to instruction, to assessment, to remediation and enrichment, and then back to planning. Everything works together. The elements are coordinated between elementary and secondary schools as well as among grade levels in each.
Although the primary interest in this Handbook is coordination between assessment and other elements of the educational process, coordination among those other elements is critical. Therefore, various types of alignment will be discussed in this section. As mentioned above, all elements of cyclical planning strategies must be coordinated in order for the strategies to work effectively.

Educational intents (such as goals, outcomes, or objectives) are perhaps the most important element of such strategies; they identify what should be taught. Assessment provides trustworthy indicators of how well schools are achieving their intents. Without assessment, we have to rely primarily on impressions that may or may not be accurate but that, regardless, are difficult to summarize across students to help evaluate schools. Assessment provides information to help schools achieve their stated intents by indicating how well they are functioning and what adjustments are needed.

Steps toward improving alignment

Practical Tip: Make sure that administrators understand and support alignment efforts. A lot of work can be wasted if alignment efforts don't have consistent support.
Steps that a school or district might follow to help improve alignment are listed below. Local education agencies might want to assign some tasks to the district level and others to the school level. Still others might become the responsibility of individual classroom teachers. They may want to delete some steps or add others.

Staff should recognize that alignment is a dynamic process involving many continuously changing factors. Alignment might be viewed as a continuum which can always be improved. Also, it works better in some circumstances than others.

The steps suggested below incorporate many elements of school improvement planning. Schools/districts may want to simply continue using their current process but explicitly add alignment activities to it as appropriate.

  1. Review the state goals and local learning outcomes. Make sure that the outcomes are consistent with and comprehensively cover the goals. Also, look for consistency across grades, an ordered progression from one level to the next, and appropriateness within each grade. Revise the outcomes as needed.
  2. Analyze the learning outcomes to determine precisely what should be introduced, reinforced, reviewed, or mastered at each grade level.
  3. Ensure that all tasks identified through the first two steps can be covered by instruction, and that performance on these tasks can be assessed. Consider reviewing assessment procedures to identify learning outcomes which are too broad or too specific, too shallow or too deep.
  4. Devise a plan for selecting textbooks and other materials to support the development of locally-selected knowledge and skills.
  5. Establish procedures for assessment selection and/or development. The procedures should emphasize criteria that foster alignment (such as matches among educational intents, teaching-learning activities, and assessment).
  6. Review the current school improvement plan to identify which state and local educational intents are being addressed well and which are not.
  7. Recommend new instructional strategies that support alignment (e.g., integrating science and mathematics instruction to develop students' mathematics skills through science problems).
  8. Survey teachers to determine professional development needs. Do teachers need more knowledge about assessment? About teaching across learning areas? About designing instruction for selected content areas or learning outcomes?
  9. Design professional development based on the findings of Step 8.
  10. Provide forums for representatives of various groups (teachers, administrators, and community members) to share their perceptions about learning outcomes, instruction, and assessment at each grade level.
  11. Review local standards and expectations for student achievement. Are they still appropriate after assessment data are reviewed and curriculum or instruction is revised? Should standards or expectations be raised or lowered?
  12. Ensure that the assessment program adequately measures the achievement of local educational intents. Ensure that the instructional program gives students the necessary skills and knowledge.
  13. Review local assistance and enrichment programs. Are procedures and instruction directly related to local outcomes and assessment?
  14. Make alignment a high priority and an inherent part of future planning. Consider all perspectives (those of teachers, curriculum coordinators, and other content or measurement specialists).
Although critical, alignment should be used flexibly. It should not become so constraining that, for example, teachers cannot take advantage of current events to meaningfully teach concepts that the formal curriculum schedules for several months later or in the next grade.

Schools and districts may find it impractical to implement all of these steps in all learning areas in one year. Nonetheless, they should develop plans for systematically improving the alignment of curriculum, instruction, and assessment even if it takes several years. The advantages for both students and educators are well worth the effort.

Some schools might want to develop a checklist to help guide the alignment process. Others may prefer to survey teachers and others to help examine the current status of alignment. Still other local staff may, of course, want to create other approaches.

Estimating Reliability - Forced-Choice Assessment

The split-half model and the Kuder-Richardson formula for estimating reliability will be described here. Given the demands on time and the need for all assessment to be relevant, school practitioners are unlikely to utilize a test-retest or equivalent forms procedure to establish reliability.

Reliability Estimation Using a Split-half Methodology

The split-half design in effect creates two comparable test administrations from a single test. The items in the test are split into two half-tests that are equivalent in content and difficulty, often by separating the odd- and even-numbered items. This assumes that the assessment is homogeneous in content. Once the test is split, reliability is estimated as the correlation between the two half-tests, with an adjustment for test length.

Other things being equal, the longer the test, the more reliable it will be when reliability concerns internal consistency, because the sample of behavior is larger. In the split-half approach, the Spearman-Brown formula is used to adjust the correlation between the two halves upward--estimating what the correlation would be for two tests as long as the full test before it was split--as shown below.

For demonstration purposes, a small sample set is employed here--a test of 40 items administered to 10 students. The items are then divided into even-numbered (X) and odd-numbered (Y) halves, creating two simultaneous 20-item assessments.

Student    Score (40)    X Even (20)    Y Odd (20)      x       y       x²       y²        xy
A              40            20             20         4.8     4.2    23.04    17.64     20.16
B              28            15             13        -0.2    -2.8     0.04     7.84      0.56
C              35            19             16         3.8     0.2    14.44     0.04      0.76
D              38            18             20         2.8     4.2     7.84    17.64     11.76
E              22            10             12        -5.2    -3.8    27.04    14.44     19.76
F              20            12              8        -3.2    -7.8    10.24    60.84     24.96
G              35            16             19         0.8     3.2     0.64    10.24      2.56
H              33            16             17         0.8     1.2     0.64     1.44      0.96
I              31            12             19        -3.2     3.2    10.24    10.24    -10.24
J              28            14             14        -1.2    -1.8     1.44     3.24      2.16
MEAN           31.0          15.2           15.8
SD                            3.26           3.99
                                                      Σx² = 95.60   Σy² = 143.60   Σxy = 73.40

From this information it is possible to calculate a correlation using the Pearson Product-Moment Correlation Coefficient, a statistical measure of the degree of relationship between the two halves.

Pearson Product-Moment Correlation Coefficient:

r = Σxy / [(N - 1)(SDx)(SDy)]

where

x is each student's score on the even-numbered items minus the mean score on the even-numbered items,
y is each student's score on the odd-numbered items minus the mean score on the odd-numbered items,
N is the number of students, and
SD is the standard deviation. This is computed by

  1. squaring the deviation (e.g., x²) for each student,
  2. summing the squared deviations (e.g., Σx²),
  3. dividing this total by the number of students minus 1 (N-1), and
  4. taking the square root.

For the example, r = 73.40 / [(9)(3.26)(3.99)] = .63.

The Spearman-Brown formula is usually applied in determining reliability from split halves. It steps the correlation between the two halves up to the full number of items, giving a reliability estimate for the length of the original test:

corrected reliability = 2r / (1 + r)

For the example, the corrected estimate is 2(.63) / (1 + .63) = .77.
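For readers who keep the half-test scores in electronic form, the whole calculation can be reproduced with a short program. The Python sketch below uses the ten pairs of half-test scores from the example above and returns approximately .63 for the half-test correlation and .77 after the Spearman-Brown correction, matching the hand calculation.

    from math import sqrt

    # Even- and odd-item half scores for the ten students in the example above.
    even = [20, 15, 19, 18, 10, 12, 16, 16, 12, 14]
    odd  = [20, 13, 16, 20, 12,  8, 19, 17, 19, 14]

    n = len(even)
    mean_x = sum(even) / n                     # 15.2
    mean_y = sum(odd) / n                      # 15.8

    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(even, odd))   # 73.40
    sum_x2 = sum((x - mean_x) ** 2 for x in even)                          # 95.60
    sum_y2 = sum((y - mean_y) ** 2 for y in odd)                           # 143.60

    # Correlation between the halves; equivalent to dividing Σxy by (N-1)(SDx)(SDy).
    r_half = sum_xy / sqrt(sum_x2 * sum_y2)    # about .63
    r_full = (2 * r_half) / (1 + r_half)       # Spearman-Brown correction, about .77

    print(f"Half-test correlation:      {r_half:.2f}")
    print(f"Spearman-Brown (full test): {r_full:.2f}")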

Estimating Reliability using the Kuder-Richardson Formula 20

Kuder and Richardson devised a procedure for estimating the reliability of a test in 1937. It has become the standard for estimating reliability from a single administration of a single form. Kuder-Richardson measures inter-item consistency; it is tantamount to performing a split-half reliability analysis on all possible splittings of the test. When schools have the capacity to maintain item-level data, the KR20, which is a challenging set of calculations to do by hand, is easily computed by a spreadsheet or basic statistical package.

The rationale for Kuder and Richardson's most commonly used procedure is roughly equivalent to:

1) Securing the mean inter-correlation of the k items in the test,
2) Considering this to be the reliability coefficient for the typical item in the test, and
3) Stepping up this average with the Spearman-Brown formula to estimate the reliability coefficient of an assessment of k items.

                    Item (k)
Student (N)    1  2  3  4  5  6  7  8  9 10 11 12    X (Score)    x = X - mean     x²
A              1  1  1  1  1  1  1  0  1  1  1  1        11            4.5        20.25
B              1  1  1  1  1  1  1  1  0  1  1  0        10            3.5        12.25
C              1  1  1  1  1  1  1  1  1  0  0  0         9            2.5         6.25
D              1  1  1  0  1  1  0  1  1  0  0  0         7            0.5         0.25
E              1  1  1  1  1  0  0  1  1  0  0  0         7            0.5         0.25
F              1  1  1  0  0  1  1  0  0  1  0  0         6           -0.5         0.25
G              1  1  1  1  0  0  1  0  0  0  0  0         5           -1.5         2.25
H              1  1  0  1  0  0  0  1  0  0  0  0         4           -2.5         6.25
I              1  1  1  0  1  0  0  0  0  0  0  0         4           -2.5         6.25
J              0  0  0  1  1  0  0  0  0  0  0  0         2           -4.5        20.25
Σ              9  9  8  7  7  5  5  5  4  3  2  1        65                       74.50

(1 = correct, 0 = incorrect; mean score = 6.5, Σx² = 74.50)

p-value     0.9  0.9  0.8  0.7  0.7  0.5  0.5  0.5  0.4  0.3  0.2  0.1
q-value     0.1  0.1  0.2  0.3  0.3  0.5  0.5  0.5  0.6  0.7  0.8  0.9
pq         0.09 0.09 0.16 0.21 0.21 0.25 0.25 0.25 0.24 0.21 0.16 0.09      Σpq = 2.21

Here, the variance of the total scores is

s² = Σx² / (N - 1) = 74.50 / 9 = 8.28

and the Kuder-Richardson Formula 20 is

KR20 = [k / (k - 1)] × [1 - (Σpq / s²)]

where

p is the proportion of students passing a given item,
q is the proportion of students that did not pass a given item,
s² is the variance of the total score on this assessment (x is the student score minus the mean score; x is squared and the squares are summed (Σx²); the summed squares are divided by the number of students minus 1 (N-1)), and
k is the number of items on the test.

For the example, KR20 = (12/11) × [1 - (2.21/8.28)] = .80.
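Where item-level data are stored electronically, KR20 can be computed directly from the 0/1 response matrix. The Python sketch below follows the formula as presented here (total-score variance computed with N - 1 in the denominator) and reproduces the value of approximately .80 for the example data.

    # 0/1 responses for the ten students (A-J) on the twelve items in the example.
    responses = [
        [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1],   # A, score 11
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0],   # B, 10
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],   # C, 9
        [1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0],   # D, 7
        [1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0],   # E, 7
        [1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0],   # F, 6
        [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0],   # G, 5
        [1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0],   # H, 4
        [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],   # I, 4
        [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],   # J, 2
    ]

    n = len(responses)                  # 10 students
    k = len(responses[0])               # 12 items
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n        # 6.5

    # Variance of the total scores, with N - 1 in the denominator (8.28).
    variance = sum((t - mean_total) ** 2 for t in totals) / (n - 1)

    # Sum of p*q across the twelve items (2.21).
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n
        sum_pq += p * (1 - p)

    kr20 = (k / (k - 1)) * (1 - sum_pq / variance)
    print(f"KR-20 = {kr20:.2f}")        # about .80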

Estimating Reliability Using the Kuder-Richardson Formula 21

When item-level data or technological assistance is not available for computing results for a large number of cases and items, the simpler, and sometimes less precise, reliability estimate known as Kuder-Richardson Formula 21 is an acceptable general measure of internal consistency. The formula requires only the test mean (M), the variance (s²), the number of items on the test (k), and the number of students (N). It assumes that all items are of approximately equal difficulty.

For this example, the data set used for computation of the KR 20 is repeated.

Student (N=10)    X (Score)    x = X - mean      x²
A                    11             4.5         20.25
B                    10             3.5         12.25
C                     9             2.5          6.25
D                     7             0.5          0.25
E                     7             0.5          0.25
F                     6            -0.5          0.25
G                     5            -1.5          2.25
H                     4            -2.5          6.25
I                     4            -2.5          6.25
J                     2            -4.5         20.25
                  mean = 6.5                  Σx² = 74.50
Variance: s² = Σx² / (N - 1) = 74.50 / 9 = 8.28

Kuder-Richardson Formula 21:

KR21 = [k / (k - 1)] × [1 - M(k - M) / (k × s²)]

where

M is the assessment mean (6.5),
k is the number of items in the assessment (12), and
s² is the variance (8.28).

Therefore, in the example: KR21 = (12/11) × [1 - (6.5 × 5.5) / (12 × 8.28)] = .70.

The ratio M(k - M) / (k × s²) in KR21 is a mathematical approximation of the ratio Σpq / s² in KR20. The formula simplifies the computation but will usually yield, as in this example (.70 versus .80), a somewhat lower estimate of reliability. The differences are not great on a test in which all items are of about the same difficulty.
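Because KR21 requires only summary statistics, it can be computed even when item-level records are unavailable. A brief Python sketch using the values from the example:

    # KR21 needs only the number of items, the mean, and the variance.
    k = 12       # items
    M = 6.5      # mean total score
    s2 = 8.28    # variance of the total scores (N - 1 in the denominator)

    kr21 = (k / (k - 1)) * (1 - (M * (k - M)) / (k * s2))
    print(f"KR-21 = {kr21:.2f}")   # about .70, somewhat below the KR-20 of .80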

In addition to the split-half reliability estimates and the Kuder-Richardson formulas (KR20, KR21) mentioned above, there are many other ways to compute a reliability index. One of the most commonly used reliability coefficients is Cronbach's alpha (α). It is based on the internal consistency of the items in the test, and it is flexible enough to be used with items that are scored on more than two points. The split-half and KR20 estimates are closely related to Cronbach's alpha: when a test is divided into two halves and the scores and variances of the two halves are calculated, the resulting split-half estimate is algebraically equivalent to Cronbach's alpha computed on those halves, and when every item is scored simply right or wrong, KR20 is algebraically equivalent to Cronbach's alpha. Therefore, the split-half and KR20 reliability estimates may be considered special cases of Cronbach's alpha.

Given the universe of concerns which daily confront school administrators and classroom teachers, the importance is not in knowing how to derive a reliability estimate, whether using split halves, KR20 or KR21. The importance is in knowing what the information means in evaluating the validity of the assessment. A high reliability coefficient is no guarantee that the assessment is well-suited to the outcome. It does tell you if the items in the assessment are strongly or weakly related with regard to student performance. If all the items are variations of the same skill or knowledge base, the reliability estimate for internal consistency should be high. If multiple outcomes are measured in one assessment, the reliability estimate may be lower. That does not mean the test is suspect. It probably means that the domains of knowledge or skills assessed are somewhat diverse and a student who knows the content of one outcome may not be as proficient relative to another outcome. 

Establishing Interrater Agreement for Performance-Based or Product Assessments (Complex Generated Response Assessments)

In performance-based assessment, where scoring requires some judgment, an important type of reliability is agreement among those who evaluate the quality of the product or performance relative to a set of stated criteria. Preconditions of interrater agreement are:

The end result is that all evaluators share a common understanding of the student performance, that this common understanding is reflected in the scoring scale or rubric, and that all evaluators give the same demonstration the same or nearly the same ratings. The consistency of rating is called interrater reliability. Unless the scale was constructed by those who will be using it, with extensive discussion during its construction, training is necessary to establish this common perspective.

Training evaluators for consistency should include:

Gronlund (1985) indicated that "rater error" can be related to:

Proper training on an unambiguous scoring rubric is a necessary condition for establishing the reliability of scores for a student performance or product. When evaluation of the product or performance begins in earnest, a percentage of student work must be double scored by two different raters to give an indication of agreement among evaluators. The sample of performances or products scored by two independent evaluators must be large enough to establish confidence that scoring is consistent; the smaller the number of cases, the larger the percentage of cases that should be double scored. When the data on the double-scored assessments are available, it is possible to compute a correlation of the raters' scores using the Pearson Product-Moment Correlation Coefficient. This correlation indicates the relationship between the two scores given for each student. A correlation of .6 or higher would indicate that the scores given to the students are highly related.

Another method of indicating the relationship between the two scores is to establish a rater agreement percentage--that is, to take the assessments that have been double scored and calculate the number of cases where there has been exact agreement between the two raters. If the scale is analytic and rather extensive, the percent of agreement can be determined for the number of cases where the scores are in exact agreement or adjacent to each other (within one point on the scale). Agreement levels should be at 80% or higher to establish a claim for interrater agreement.

Establishing Rater Agreement Percentages

Two important decisions which precede the establishment of a rater agreement percentage are:

After agreement and the acceptable percentage of agreement have been established, list the ratings given to each student by each rater for comparison:

Student    Score: Rater 1    Score: Rater 2    Agreement
A                6                 6               X
B                5                 5               X
C                3                 4
D                4                 4               X
E                2                 3
F                7                 7               X
G                6                 6               X
H                5                 5               X
I                3                 4
J                7                 7               X
Dividing the number of cases in which the raters' scores are in agreement (7) by the total number of cases (10) determines the rater agreement percentage (70%).
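Where ratings are recorded electronically, both the agreement percentage and the correlation between raters discussed earlier can be obtained in a few lines. The Python sketch below uses the ten score pairs from the table above; for these data the exact agreement is 70% and the correlation is about .98.

    from math import sqrt

    # The ten score pairs from the table above.
    rater1 = [6, 5, 3, 4, 2, 7, 6, 5, 3, 7]
    rater2 = [6, 5, 4, 4, 3, 7, 6, 5, 4, 7]

    n = len(rater1)
    exact = sum(1 for a, b in zip(rater1, rater2) if a == b)
    print(f"Exact agreement: {exact}/{n} = {exact / n:.0%}")     # 7/10 = 70%

    # Pearson correlation between the two raters' scores (about .98 for these data).
    m1, m2 = sum(rater1) / n, sum(rater2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(rater1, rater2))
    var1 = sum((a - m1) ** 2 for a in rater1)
    var2 = sum((b - m2) ** 2 for b in rater2)
    print(f"Interrater correlation: {cov / sqrt(var1 * var2):.2f}")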

When there are more than two teachers, the consistency of ratings can be calculated for two teachers at a time with the same method. For example, if three teachers are employed as raters, rater agreement percentages should be calculated for raters 1 and 2, raters 1 and 3, and raters 2 and 3.

All calculations should exceed the acceptable reliability score. If there is occasion to use more than two raters for the same assessment performance or product, an analysis of variance using the scorers as the independent variable can be computed using the sum of squares.

In the discussion of the various forms of performance assessment, it has been suggested how two raters can examine the same performance to establish a reliability estimate. Unless at least two independent raters have evaluated the performance or product for a significant sample of students, it is not possible to obtain evidence that the scores assigned are consistent with the stated criteria.

Chapter 3

Selection and Development of Assessment Procedures

After deciding what to assess, local educators should decide what they need to learn from the assessment (such as whether students meet an outcome, or where the school's strengths and weaknesses lie). The first task will be to decide what types of assessment will most validly provide that information. Then, educators will need to select or develop specific tests and other assessment procedures.

Most local planning groups will probably want to use a combination of assessment procedures that are adopted or adapted from elsewhere as well as those that are developed locally. Because local development is difficult and time-consuming, many schools may want to conserve their resources by selecting as many assessment procedures from elsewhere as possible. However, local educators must carefully examine procedures developed elsewhere and will probably need to conduct special reviews or studies to estimate validity, reliability, and fairness.

Practical Tip: When feasible, use the same assessment procedures for school- and classroom-level assessment. This should improve the efficiency of assessment as well as alignment between instruction and assessment. At the same time, it should increase the time and other resources available for collecting whatever information will be most useful locally. Make certain, though, that the assessment procedures will provide useful information for both levels.
After local planning groups have selected assessment procedures from commercial publishers, other districts, and other sources, they are likely to need additional measures to complete their assessment systems. They will probably need to develop those procedures themselves. Although constructing assessment procedures is difficult and demands rigorous attention to quality, many resources are available to help. The rewards for producing a good assessment system that is particularly valid locally and provides high-quality information for school improvement planning can be worth the efforts.

Selection and development will be discussed in this chapter. After the following discussion of selection, subsequent sections will address the development of forced-choice and performance-based assessments, respectively.

Selecting Assessment Procedures

As mentioned previously, assessment procedures are available from a variety of sources. Local educators will probably want to investigate those sources before beginning to develop local assessments. Assessment procedures should be selected carefully. The time and other resources that assessment requires should be used wisely. The information produced should be useful. It may be wise to use a committee process to select assessment procedures. Regardless, all procedures should be considered carefully before they are adopted.

This section discusses criteria which should be considered when assessment procedures are selected from commercial and other sources. Also, at least some of the criteria may be appropriate when considering whether to use local materials which were developed previously for another purpose.

General selection criteria

1. How well does the assessment match the targeted content or educational intent?

The most important criterion to use when deciding whether to adopt any type of assessment procedure is whether the content (knowledge/skills) it assesses matches local content (such as that included in a learning outcome). If the match is poor, the assessment procedure should be eliminated from consideration. It will not provide information which local educators can use validly. If an assessment procedure matches only a proportion of local content, educators will need to use it in combination with one or more other assessment procedures which measure the remainder of the content.

2. Is the type of assessment appropriate?

Another important criterion is whether the assessment requires that students demonstrate their knowledge/skills appropriately. For example, if educators need to learn whether students can communicate effectively, the assessment should require students to speak or write, not answer multiple-choice questions. If the assessment is going to be used to estimate whether students have acquired a wide variety of specified knowledge in a content area, forced-choice tests may cover the content more thoroughly than essays or other performance-based assessments.

3. Will the assessment produce the kind of information needed?

Before adopting a specific assessment tool, educators should make sure that they will be able to obtain the types of information they will need about the results. For example, test publishers may produce information only about results for an entire test or subtest. If educators plan to use other groupings of items, they should find out how and whether they can get results for the specific groupings.

4. What is known about the validity, reliability, and fairness of the assessment?

Information about validity, reliability, and fairness is more likely to be available from commercial test publishers than other sources of assessment procedures. However, it may also be available from those other sources, and seeking information from others may be worth the effort it takes. All information should be reviewed carefully. It is unlikely to show conclusively that an assessment will be valid, reliable, and fair locally. However, it may provide evidence that will be useful in the selection of assessment procedures and will serve as a starting point for local examinations of validity, reliability, and fairness. Furthermore, the existence of such information indicates that the educators who constructed the procedure were concerned about quality.

Educators should be cautious about using evidence of validity that was collected elsewhere for several reasons. It may not address the local situation, or the purposes for which the assessment will be used locally, well enough. It may only consider the content of the assessment, not whether that content is assessed in an appropriate manner or whether the kinds of information produced represent what is needed locally. In addition, the information may not be appropriate because (a) it addresses a test as a whole or predetermined subtests and local educators plan to use groups of items for separate purposes, or (b) local educators plan to modify tasks, thus creating new procedures to which the original information does not apply.

Reliability estimates should also be examined carefully. To determine whether the estimates can be used locally, educators should ask: What method(s) was used to estimate reliability? Is that method appropriate locally? Is the local student population similar to the population included in the reliability studies?

Fairness estimates should be questioned primarily on two factors. First, what methods were used to examine fairness? If only statistical methods were used, local educators may want to convene a panel to conduct judgmental reviews. Second, what groups were included in the fairness estimates? Did they include all racial/ethnic groups that are represented locally? Did they include students with disabilities? It might also be important to ask questions such as: Does it appear that the people who conducted the bias review would be sufficiently representative and credible locally? Did the bias reviewers attend to the questions, issues, and sensitivities that are most important locally?

5. Is the scoring scale appropriate and of high quality?

When selecting performance-based assessments, educators should carefully review scoring scales. Assessment procedures which will not produce useful, reliable information should not be used. The scoring criteria should be appropriate for determining the extent to which students have the knowledge/skills that will be assessed. Definitions of individual points on the scale should be clear and appropriate for measuring the knowledge/skills. Educators should be cautious about adopting procedures with scales or rubrics that appear to be difficult to learn to use or time consuming to apply.

6. What costs (including time) will be incurred while using the assessment?

Educators should consider several kinds of costs. They should consider whether each is a one-time cost or will recur annually.

The costs to consider include:

  1. obtaining sufficient copies of the instrument or procedure (including costs to purchase, print, or copy tests),
  2. scoring and analyzing results (including reporting services from test publishers, time and other costs for scoring performance- based assessments, and time required to assemble information for various reporting needs),
  3. administering the assessment, especially the time required of teachers and students, and
  4. collecting and analyzing additional information about validity, reliability, and fairness.

Developing Forced Choice Assessment Procedures

Local staff who decide to construct forced-choice assessment procedures can use several different sources of items. They can purchase commercial item banks, network with other districts and schools in the creation of a shared item bank, and/or write their own items.

Writing your own assessment items

School and district personnel who write their own assessment items must devote considerable resources, including training time, to that task. (Regional education service agency staff may be able to assist.) Schools and districts may need to allocate several weeks or more to writing, piloting, and revising items in each content area.

Many guidelines are available to help local staff write assessment items.

Staff should try out the items by administering them to a representative sample of students. While doing this, they should ask several questions about each item:

Schools, districts, and regional collaborative groups may want to assemble assessment items in item banks. After items have been tried out with students, staff can review the items and select those which they want to use. Item banks can be computerized or stored on paper. To increase the utility of the items, several types of information might be stored with each one. For example, the items should be coded to indicate the content area assessed (and perhaps more specific information such as the state goal and local learning outcome assessed). The items might also include data from students in the pilot or tryout--for example, the grade level of the students and the proportion of students who answered the item correctly.
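Districts that choose to computerize an item bank need only a simple record structure. The Python sketch below shows one hypothetical way to store the coding and tryout information described above; the field names, codes, and screening rule are illustrative assumptions rather than required conventions.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class BankedItem:
        # One record in a locally maintained item bank; field names are illustrative.
        item_id: str
        stem: str
        choices: List[str]
        answer_key: str
        state_goal: str          # state goal code the item addresses (hypothetical label)
        local_outcome: str       # local learning-outcome code (hypothetical label)
        grade_level: int
        tryout_p_value: Optional[float] = None   # proportion correct in the tryout sample
        notes: str = ""

    bank = [
        BankedItem("MATH-0042", "Which fraction is equivalent to 1/2?",
                   ["2/6", "3/6", "2/3", "3/5"], "B",
                   state_goal="Goal 6", local_outcome="Outcome 6.A",
                   grade_level=5, tryout_p_value=0.72),
    ]

    # Screen for grade 5 items that were neither too hard nor too easy in the tryout.
    usable = [item for item in bank
              if item.grade_level == 5
              and item.tryout_p_value is not None
              and 0.30 <= item.tryout_p_value <= 0.90]
    print(len(usable), "item(s) meet the screening rule")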

Constructing tests from items

Local staff who write or acquire a large number of test items have taken a major step toward constructing their own local tests. Test construction, however, involves more than simply assembling the items into booklets. To construct tests, staff will need to perform the following tasks:

Identify the test purpose. The intended use of test results will influence at least two types of decisions about test construction. First, if a test will be used to estimate whether students have mastered a particular body of knowledge or skills (such as a criterion-referenced test of learning outcomes), the difficulty of items should be different from items in a test that will be used to compare students (such as a norm- referenced test). A test for comparing students should include items at a wide range of difficulty levels, but a test of student mastery needs to concentrate on items that distinguish between students who have mastered the knowledge/skills and students who have not. Second, the purpose of a test will be critical when estimating validity. As discussed in the previous chapter, any test will be more valid for some purposes or uses than others. A test is likely to be more valid for purposes that are identified in advance and used to direct test construction than for other purposes. To maximize validity for a test's major purposes, it is important to identify those purposes at the beginning of test construction and to construct the test accordingly.

Develop test specifications. This stage, sometimes referred to as developing test "blueprints," involves making decisions about the composition of the test. Test blueprints specify the distribution of test items across one or more factors. For example, a blueprint might be a matrix with content categories across one axis and item difficulty levels across the other axis. Each cell of the matrix would specify the number (or percent) of test items which should fall in the cell. Staff might want to distribute items equally across the cells, or they may decide that some cells are more important than others and should receive correspondingly greater emphasis on the test.

Blueprints may or may not be in matrix format. Staff may want to distribute test items across more than the two factors or variables that can be included in a matrix. They may want a test to include items that are at several different cognitive levels. For example, they may decide that some items should assess factual knowledge, but that others should assess whether students can apply that knowledge. Staff may want to develop test specifications for several types of content categories, and at some point may want to become rather specific (for example, specifications for history, literature, or fine arts tests might indicate what proportion of the items should refer to each of several major historical periods or cultures).
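One practical way to keep track of a blueprint during item writing is to tally drafted items against the cells of the matrix. The Python sketch below does so using invented content categories and cognitive levels; the blueprint itself is hypothetical.

    from collections import Counter

    # Hypothetical blueprint: (content category, cognitive level) -> items wanted.
    blueprint = {
        ("number sense", "knowledge"):   6,
        ("number sense", "application"): 4,
        ("geometry",     "knowledge"):   5,
        ("geometry",     "application"): 5,
    }

    # Items drafted so far, tagged with the same two factors.
    drafted = [
        ("number sense", "knowledge"), ("number sense", "knowledge"),
        ("number sense", "application"),
        ("geometry", "knowledge"), ("geometry", "application"),
        ("geometry", "application"), ("geometry", "application"),
    ]

    tally = Counter(drafted)
    for cell, wanted in blueprint.items():
        have = tally.get(cell, 0)
        status = "OK" if have >= wanted else f"need {wanted - have} more"
        print(f"{cell[0]:>12} / {cell[1]:<11}: {have}/{wanted}  {status}")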

Assemble the test. This stage includes three major activities: 1) selecting items, 2) arranging them into test booklets, and 3) developing instructions for standardized administration of the test. When selecting items, staff should refer to the test specifications. However, staff should also review each item carefully. They should consider whether it accurately reflects local learning outcomes/instruction and whether it assesses knowledge/skills they consider important. Local staff should also examine information that is available from previous administrations of the item. What proportion of students selected the correct response? Does the distribution of responses to incorrect alternatives suggest that the item is poor because one or two alternatives were so obviously incorrect that only a few students (e.g., less than 10%) chose them? Did a high percentage of students who performed well on the test as a whole select an incorrect response?
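Where results from previous administrations are stored electronically, the review questions above can be checked directly. The Python sketch below uses invented response records for a single item; the 10% distractor threshold mirrors the example in the text, and examining the top quarter of scorers is an illustrative choice rather than a fixed rule.

    from collections import Counter

    # Hypothetical records for one item: each student's chosen option and total test score.
    choices = ["B", "B", "C", "B", "A", "B", "D", "B", "B", "C",
               "B", "B", "A", "B", "B", "C", "B", "B", "B", "C"]
    totals  = [38, 35, 22, 34, 18, 36, 15, 33, 31, 24,
               37, 30, 20, 32, 29, 23, 39, 28, 27, 16]
    key = "B"

    n = len(choices)
    counts = Counter(choices)
    print(f"p-value (proportion correct): {counts[key] / n:.2f}")

    # Distractors chosen by fewer than 10% of students may be too obviously wrong.
    for option, count in sorted(counts.items()):
        if option != key and count / n < 0.10:
            print(f"Option {option} drew only {count}/{n} responses - consider revising it")

    # Did the top quarter of scorers tend to miss the item?
    ranked = sorted(zip(totals, choices), reverse=True)
    top_quarter = ranked[: n // 4]
    misses = sum(1 for _, choice in top_quarter if choice != key)
    print(f"{misses} of the top {len(top_quarter)} scorers chose an incorrect option")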

Standard administration instructions are important to ensure that the tests are given uniformly to all students. The instructions will indicate directions to be given to students, the resources students may use during the test (e.g., books or calculators), whether guessing is allowed or encouraged, and the amount of time allotted for the test.

Field-test and then revise the test. Before tests are used widely, they should be given to a small, representative sample of students using processes similar to those described previously for trying out items. Each time a test is revised, the new version should be field-tested. Following these processes with items, and then with tests, should limit the amount of time and resources that are lost because a test did not perform as expected.

Practical Tip: As staff develop assessment tools (items, tests, or other procedures), they may want to try them out with small groups of their own students before giving them to a larger sample. This may help them identify problems such as language that is difficult for students to interpret, make modifications, and thus reduce the need for modifications and tryouts after assessing the larger sample of students.
Review the test for validity and fairness. This stage needs to occur both before and after field-testing. Major changes in an assessment procedure will require additional reviews, and perhaps field-testing. The reviews should be conducted by panels which include teachers and others who are knowledgeable about the content area assessed or about bias review procedures. Ideally, these panels should be independent of the test construction committees. People who review tests for validity or nondiscrimination should not have been closely involved in the test's development.

Developing Performance-Based Assessment Procedures

The process of developing performance-based assessment procedures is somewhat like that of developing forced-choice assessments. Assessment purposes and specifications should be considered carefully. The assessments should be drafted, tried out with students, and--if necessary--revised and tried out with students again. Validity, reliability, and fairness should be reviewed at several stages.

In addition, scales or rubrics for scoring performance-based assessments must be available--whether they are constructed locally or are adapted or adopted from other sources. People who score forced-choice items generally use a scoring key and, thus, have common understandings of which responses are correct. However, each person rating a performance-based assessment must make decisions about the specific scores to assign. To help ensure that raters use the same criteria and have similar understandings of what kinds of performances to assign each value on a scoring scale, schools and districts must design explicit scoring scales and train raters to use them.

Students' scores must not vary according to who rates their performance. Such scores will not be reliable and should not be used.

One of the first tasks in the development of performance-based assessments will be to decide what type of performance to use to assess a particular outcome or knowledge/skill. For example, should students be required to make an oral presentation, show a mathematical proof, or demonstrate how to use scientific equipment? What should be rated--performances, products, or processes used to make the products?

Developing Tasks

Once a school or district has decided what type of performance-based assessment to use, they should develop specific tasks for eliciting the desired student behavior. Tasks should be clear and appropriate. They should be directly related to the knowledge/skill they are being used to assess. They should include the instructions which will be given to students to help ensure that the assessment is administered uniformly. Tasks should meet the criteria discussed in Chapter 2: consequences, content coverage, content quality, transfer and generalizability, cognitive complexity, meaningfulness, fairness, and cost and efficiency.

Developers should also ask questions such as:

The information which is given to students is critical. Students should know what knowledge and skills the task is assessing, and the criteria that will be used to rate their performances. Unless students clearly understand what is expected of them, their responses to the task may not be relevant enough to permit the scoring scale to be applied appropriately and rigorously. Assessing students on criteria they are not aware of is unfair.

Developing Scoring Scales

As indicated above, the scoring scale or rubric is a major tool for helping ensure that performance assessments are uniform. Constructing a scoring scale requires three major activities. First, developers must decide what type of scale to use. Second, they should identify the criteria that will be used to judge the quality of the performance. Third, they should identify the specific values, or points, which will be used with each criterion and define each one clearly.

Types of Scoring Scales

There are two basic types of scoring scales: analytic and holistic. Analytic scales assign separate ratings to separate criteria. Holistic scales combine the criteria into one score.

These two basic types may be combined, as on the scoring scale used for the writing assessment in the Illinois Goal Assessment Program (IGAP). This scale uses both analytic criteria (conventions, focus, organization and support) and an overall score (integration). The integration score is informed by the analytic criteria. It is double weighted to emphasize its value. The scale is shown in Figure 5.

Figure 5

SUMMARY OF KEY FEATURES
FROM THE ILLINOIS WRITING ASSESSMENT

Each of the five features below is rated on a six-point scale, ranging from 1 (absent) through developing and developed levels to 6 (fully developed).

FOCUS
Degree to which the idea/theme or point of view is clear and maintained.
    1  Absent; unclear; insufficient writing to ascertain maintenance
    2  Attempted; subject unclear or confusing; main point is unclear or shifts; resembles brainstorming; insufficient writing to sustain the issue
    3  Subject clear but position is not; "underpromise, overdeliver" or "overpromise, underdeliver"; position must be inferred; two or more positions without a unifying statement; abrupt ending
    4  Bare bones; position clear; main point(s) clear and maintained; prompt dependent; launches into support without a preview
    5  Position announced; points generally previewed; has a closing
    6  All main points specified and maintained; effective closing; narrative event clear, with importance/significance stated or inferred

SUPPORT
Degree to which main points/elements are elaborated and/or explained by specific evidence and detailed reasons.
    1  No support; insufficient writing
    2  Support attempted; ambiguous/confusing; unrelated list; insufficient writing
    3  Some points elaborated; most support general, some questionable; may be a list of related specifics; sufficiency in question
    4  Some second-order elaboration; some support general; sufficiency acceptable but little depth
    5  Most points elaborated by second-order support or more
    6  All major points elaborated with specific second-order support; balanced, even treatment

ORGANIZATION
Degree to which the logical flow of ideas and the text plan are clear and connected.
    1  No plan; insufficient writing to ascertain maintenance
    2  Attempted; plan can be inferred; no evidence of paragraphing; confusion prevails; insufficient writing
    3  Plan noticeable; inappropriate paragraphing; major digressions; sufficiency in question
    4  Plan is evident; minor digressions; some cohesion and coherence from relating to the topic
    5  Plan is clear; most points logically connected; coherence and cohesion demonstrated; most points appropriately paragraphed
    6  All points logically connected and signaled with transitions and/or other cohesive devices; all appropriately paragraphed; no digressions

CONVENTIONS
Use of the conventions of standard English.*
    1  Many errors; cannot read; problems with sentence construction; insufficient writing to ascertain maintenance
    2  Many major errors; confusion; insufficient writing
    3  Some major errors, many minor; sentence construction below mastery
    4  Minimally developed; few major errors, some minor, but meaning unimpaired; mastery of sentence construction
    5  A few minor errors, but no more than one major error
    6  No major errors; few or no minor errors

INTEGRATION
Evaluation of the paper based on a global judgment of how effectively the paper as a whole uses the basic features to address the assignment.
    1  Barely deals with the topic; does not present most or all features; insufficient writing
    2  Attempts to address the assignment; some confusion or disjointedness; insufficient writing
    3  Partially developed; one or more features not fully developed, though all are present; reader inference required
    4  Only the essentials present; paper is simple, informative, and clear
    5  Developed paper; each feature evident, but not all equally developed
    6  Fully developed paper; all features evident and equally well developed

    * Usage, sentence construction, spelling, punctuation/capitalization, paragraph format.
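
To make the double weighting described above concrete, the brief sketch below shows one way the five feature scores could be combined into a composite total. It is a minimal illustration under stated assumptions: the function, the sample scores, and the idea of reporting a single numeric total are not part of the published IGAP scoring procedure.

    # Minimal sketch only: combining the five Figure 5 feature scores into a
    # composite, with integration counted twice as described in the text.
    # The sample scores and the single-total report are illustrative assumptions.

    ANALYTIC_FEATURES = ("focus", "support", "organization", "conventions")

    def composite_writing_score(scores):
        """Sum the four analytic scores plus a double-weighted integration score."""
        for feature in ANALYTIC_FEATURES + ("integration",):
            if not 1 <= scores[feature] <= 6:
                raise ValueError(f"{feature} must be rated 1-6")
        return sum(scores[f] for f in ANALYTIC_FEATURES) + 2 * scores["integration"]

    # Example: a paper rated 4 on each analytic feature and 5 on integration.
    paper = {"focus": 4, "support": 4, "organization": 4,
             "conventions": 4, "integration": 5}
    print(composite_writing_score(paper))   # 26 of a possible 36

Schools that prefer to report the feature scores separately could, of course, skip the composite entirely; the sketch only shows how double weighting affects a total.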

The type of scale selected should be governed by the type of task to be assessed. If a task involves an integrated activity and performance on one criterion will influence all other criteria, judgment should be integrated and a holistic scale should be used. If examining specific criteria is important, an analytic scale may be most useful. A rubric that may be used either analytically or holistically is shown in Figure 6. It was developed for scoring open-ended items in mathematics.

Figure 6

MATHEMATICS SCORING RUBRIC:
A GUIDE TO SCORING OPEN-ENDED ITEMS

Each response is scored from 0 to 4 on three dimensions:

Mathematical Knowledge: knowledge of mathematical principles and concepts which results in a correct solution to a problem.
Strategic Knowledge: identification of important elements of the problem and the use of models, diagrams, and symbols to systematically represent and integrate concepts.
Communication: written explanation and rationale for the solution process.

Score 4
    Mathematical Knowledge: shows complete understanding of the problem's mathematical concepts and principles; uses appropriate mathematical terminology and notation (e.g., labels the answer as appropriate(1)); executes algorithms completely and correctly.
    Strategic Knowledge: identifies all the important elements of the problem and shows complete understanding of the relationships between elements; reflects an appropriate and systematic strategy for solving the problem; gives clear evidence of a complete and systematic solution process.
    Communication: gives a complete written explanation of the solution process employed; the explanation addresses both what was done and why it was done; if a diagram is appropriate, there is a complete explanation of all the elements in the diagram.

Score 3
    Mathematical Knowledge: shows nearly complete understanding of the problem's mathematical concepts and principles; uses nearly correct mathematical terminology and notation; executes algorithms completely; computations are generally correct but may contain minor errors.
    Strategic Knowledge: identifies most of the important elements of the problem and shows general understanding of the relationships among them; reflects an appropriate strategy for solving the problem; solution process is nearly complete.
    Communication: gives a nearly complete written explanation of the solution process employed; may contain some minor gaps; may include a diagram with most of the elements explained.

Score 2
    Mathematical Knowledge: shows some understanding of the problem's mathematical concepts and principles; may contain major computational errors.
    Strategic Knowledge: identifies some important elements of the problem but shows only limited understanding of the relationships between them; appears to reflect an appropriate strategy, but application of the strategy is unclear; gives some evidence of a solution process.
    Communication: gives some explanation of the solution process employed, but communication is vague or difficult to interpret; may include a diagram with some of the elements explained.

Score 1
    Mathematical Knowledge: shows limited to no understanding of the problem's mathematical concepts and principles; may misuse or fail to use mathematical terms; may contain major computational errors.
    Strategic Knowledge: fails to identify important elements or places too much emphasis on unimportant elements; may reflect an inappropriate strategy for solving the problem; gives minimal evidence of a solution process, and the process may be difficult to identify; may attempt to use irrelevant outside information.
    Communication: provides minimal explanation of the solution process; may fail to explain or may omit significant parts of the problem; explanation does not match the presented solution process; may include minimal discussion of elements in a diagram, with the explanation of significant elements unclear.

Score 0
    Mathematical Knowledge: no answer attempted.
    Strategic Knowledge: no apparent strategy.
    Communication: no written explanation of the solution process is provided.

    (1) "As appropriate" or "if appropriate" relates to whether or not the specific element is called for in the stem of the item. Adapted from Lane (1993).
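
    As noted above, the Figure 6 rubric can be applied either analytically (a separate 0-4 rating for each dimension) or holistically (a single 0-4 judgment of the response as a whole). The sketch below merely illustrates how the two uses differ in what they record; the class names and the printed report lines are assumptions for this illustration, not ISBE materials.

        # Illustrative sketch of recording Figure 6 scores analytically vs. holistically.
        # The data structures and report format are assumptions, not ISBE materials.
        from dataclasses import dataclass

        DIMENSIONS = ("mathematical_knowledge", "strategic_knowledge", "communication")

        @dataclass
        class AnalyticScore:
            """One 0-4 rating per dimension; preserves diagnostic detail."""
            mathematical_knowledge: int
            strategic_knowledge: int
            communication: int

            def weakest_dimension(self):
                return min(DIMENSIONS, key=lambda d: getattr(self, d))

        @dataclass
        class HolisticScore:
            """A single 0-4 judgment of the response as a whole."""
            overall: int

        # Analytic use: each dimension judged separately.
        analytic = AnalyticScore(mathematical_knowledge=3, strategic_knowledge=2, communication=3)
        print("Needs work on:", analytic.weakest_dimension())   # strategic_knowledge

        # Holistic use: one overall judgment informed by all three dimensions.
        holistic = HolisticScore(overall=3)
        print("Overall score:", holistic.overall)

    Either record can feed the same reporting system; as the text notes, the choice depends on whether per-dimension diagnostic detail is needed.
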
    Rating scales or checklists might be used. Rating scales are, essentially, continua which place student performance at a particular level. Many have four or six points (because many people prefer even-numbered scales to avoid the tendency to select the middle point on odd-numbered scales). Individual points on scales may be numerical or qualitative (with verbal descriptions rather than numbers). Checklists can be used to indicate whether a criterion is present or absent. Again, the type of scale which is most appropriate depends on what is being assessed and how the results will be used.
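
    For instance, a rating scale and a checklist might be represented as simply as in the hypothetical sketch below; the criterion names and labels are invented for illustration only.

        # Hypothetical examples only: a four-point rating scale (even-numbered,
        # so raters cannot default to a middle point) and a present/absent checklist.
        presentation_delivery_scale = (1, 2, 3, 4)     # numerical scale points
        delivery_rating = 3                            # a rater's judgment on that scale

        lab_report_checklist = {
            "states a hypothesis": True,
            "describes the procedure": True,
            "includes a data table": False,
            "draws a conclusion": True,
        }
        print("Delivery rating:", delivery_rating, "of", max(presentation_delivery_scale))
        print("Criteria present:", sum(lab_report_checklist.values()), "of", len(lab_report_checklist))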

    Scales can also be developmental. Generally, this means that one scale is used across grades, and student growth is shown by improved performance as students advance through the grades.

    Selecting Criteria

    The criteria which are used to rate student performance are absolutely critical. They are the dimensions or variables (also known by terms such as traits or components) upon which student performance is judged. The type of information that is available from performance-based assessment is dependent on the criteria used. The criteria tell raters what they should look at when judging the quality of a performance or product. Criteria must clearly represent the knowledge/skills for which the assessment is being used. Criteria should also represent what knowledgeable educators agree need to be assessed in order to judge the quality of the performance/product. Therefore, assessment criteria should be established and reviewed thoughtfully.

    Practical Tip: As staff develop assessment systems, they should document their work as much as possible--so that the work doesn't have to start over when different teachers or administrators take responsibility for the system due to turnover in committee membership or school staff. The documentation should include descriptions of decisions made about the system (such as when assessment will occur and what types of assessment procedures will be used); examinations of validity, reliability, and fairness; and selection and development of assessment procedures.
    Select criteria which are:

    A final consideration is the number of criteria or traits to include in the scale. The number should be reasonable, to avoid overwhelming raters and to limit the amount of time required for scoring. When too many criteria are used, scoring can become so burdensome that the gain in specificity is lost.

    Identifying Scale Values

    One effective approach to establishing different values on a scale is to use actual examples of the performance or product to make direct comparisons of different levels of student performance. These "anchors" can then be used to create mental images which may be as meaningful to raters as written descriptions. However, written descriptions--which may be based on summaries of the anchor performances/products--are necessary to establish definitions of the different value levels.

    Another approach is to assign descriptors to value levels. These descriptors can be numerical (1, 2, 3, 4) or verbal (unacceptable, minimally acceptable, clearly acceptable, excellent).
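
    The sketch below pairs the two approaches: each scale value is tied both to a verbal descriptor (the labels mentioned above) and to one or more anchor papers that raters can compare new work against. The anchor paper identifiers are invented placeholders, not actual exemplars.

        # Sketch of attaching both anchors and descriptors to a four-point scale.
        # The anchor paper identifiers are placeholders, not real exemplars.
        descriptors = {
            1: "unacceptable",
            2: "minimally acceptable",
            3: "clearly acceptable",
            4: "excellent",
        }
        anchor_papers = {
            1: ["sample paper A"],
            2: ["sample paper B"],
            3: ["sample paper C"],
            4: ["sample paper D"],
        }

        def describe_level(level):
            """Return the written definition raters would see for a scale value."""
            anchors = ", ".join(anchor_papers[level])
            return f"{level} ({descriptors[level]}); compare with {anchors}"

        print(describe_level(3))   # 3 (clearly acceptable); compare with sample paper C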

    Suggestions for Scale Developers

    Chapter 4

    Interpretation, Use, and Reporting of Assessment Results

    Three important functions of assessment are deciding what assessment results mean, what their implications are, and what changes or other decisions should be made.

    The purpose of most assessment is to collect information about learning in order to improve decisions about how to help schools improve and how to help students learn more. Such information includes identifying:

    Because those decisions are critical, educators should make them thoughtfully and carefully. Assessment results should be interpreted and used cautiously. Factors which should be considered during interpretation and use will be discussed here.

    Theoretically, interpreting and using assessment results are separate actions. Interpretation occurs when someone attaches meaning to the results. Use occurs when someone makes decisions or takes other actions on the basis of the results. However, because interpretation and use are sometimes difficult to distinguish and are closely related to one another, they will be discussed together.

    Validity

    Current scholars (including Messick, 1989; Shepard, 1993; and Linn, 1993) say that validity is a quality of assessment interpretation and use rather than of assessment procedures themselves. Validity is critical to anyone who attempts to attach meaning to assessment results or make decisions based on them. All assessment results are more valid for some interpretations and uses than for others.

    The intended purposes and uses of assessment results are considered several times during the planning phase--for example, when an assessment system is designed, when assessment procedures are selected and developed, and when validity is examined. Those purposes and uses help determine the types of interpretation and use that are appropriate. New interpretations or uses, whether they are voluntary or made in response to requests from others such as legislators or superiors, require renewed planning. Otherwise, assessment may be misused.

    After assessment results are available, people often want to interpret or use them for multiple purposes which go beyond those considered in the planning phase. For example, they may want to use an assessment that was designed to learn whether schools are meeting learning outcomes to make decisions about whether individual students should be promoted to the next grade. They may want to use the results of a large-scale (e.g., state-level) assessment to determine whether schools have met local outcomes. They may want to interpret the results of a test intended to help make college admissions decisions about individual students as indicators of the quality of instruction in individual schools. Interpretations and uses which exceed an assessment's limitations are inappropriate.

    Generalization

    People who use assessment results often draw conclusions about content domains that are much broader than those actually included in the assessment. For example, they may make statements about student achievement in world history or environmental sciences based on responses to just a few forced-choice test questions. Or, they may make general statements about student writing skills based on responses to a single prompt that required only expository writing, or about artistic abilities based on a crayon drawing of a house. These examples clearly represent unwarranted generalizations. Other inappropriate generalizations may be much more subtle. Educators should be careful when interpreting assessment results to avoid implying that they represent broader content or skill areas than the assessment actually measured.

    Multiple Sources of Information

    Many sources of error influence assessment results: a student misreads a question, or has trouble staying awake after going to bed late the previous night; a teacher forgets to review sample problems with students or allows students to work beyond the specified time; a testing coordinator fails to open boxes in advance and discovers too late that there are not enough assessments for all students. The sources of error are likely to vary from one testing situation to another. If errors resulting from different sources of information--such as different kinds of assessment procedures--are random, they are likely to balance one another. For this reason, assessment interpretation and use will probably be improved if they are based on multiple sources of information.
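
    The point about random errors balancing one another can be illustrated with a simple simulation. The numbers below (a "true" achievement level of 75 and an error spread of 8 points) are arbitrary assumptions chosen only to show the tendency; they do not describe any actual assessment.

        # Illustrative simulation: random errors from independent measures tend to
        # cancel when results are combined.  All numbers are arbitrary assumptions.
        import random

        random.seed(1)
        TRUE_SCORE = 75     # the student's actual level of achievement
        ERROR_SD = 8        # random error contributed by any single assessment

        def one_result():
            """One assessment result = true score plus random error."""
            return TRUE_SCORE + random.gauss(0, ERROR_SD)

        single = one_result()
        several = [one_result() for _ in range(6)]
        average = sum(several) / len(several)

        print(f"one assessment:  {single:.1f} (off by {abs(single - TRUE_SCORE):.1f})")
        print(f"average of six:  {average:.1f} (off by {abs(average - TRUE_SCORE):.1f})")

    In most runs, the averaged result lies closer to the true level than a single result does, which is the practical argument for drawing on multiple sources.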

    Multiple sources of information also help compensate for the fact that no single assessment covers the assessed content comprehensively. This occurs because, as discussed in the previous section about generalization, any one assessment is brief and covers only a small portion of the content. Or, perhaps the content includes both knowledge and skills, and different types of assessment procedures (such as a paper-and-pencil test and a performance-based examination) should be used with different portions of it. Again, multiple sources of information should enable educators to make more valid interpretations and uses.

    To illustrate, one could draw firmer conclusions about students' knowledge and skills related to a civil war from the results of several assessments that each focused on different aspects of the war. The assessments might be of several types, such as multiple-choice or matching tests and essays about the causes of the war, social conditions during it, or its impact on the country. Together, the assessments would cover the war comprehensively. Multiple performance-based assessments might also be used--for example, requiring students to build models of battlefields, write and perform a simulated Lincoln cabinet meeting, and develop a script of Lee's conversation with Grant at Appomattox.

    Results for Individual Students

    Many assessments are explicitly intended to produce information about the achievement of individual students. Others, however, are for collecting evidence about the effectiveness of schools--often for accountability purposes. It may or may not be appropriate to use such information to learn about the status of individual students. Schools should use individual students' results from school-level assessments only after determining that such uses are appropriate.

    Reporting Results

    Reports of assessment results can be critical documents. They are the only source of information that some audiences receive about actual student learning. Therefore, educators should thoughtfully design and prepare reports about assessment results. They should consider the factors discussed above. Otherwise, they might mislead audience members and cause them to misinterpret and/or misuse the results. In addition, educators need to consider questions such as:
    1. What audiences will each report address?
    2. What do audience members want to know, and what else do we want them to learn?
    3. In what form should the results be reported?

    Audiences

    As schools and districts develop plans for reporting assessment results, they will base many of their decisions on their perceptions of what various audiences want and need. Educators should consider several different types of local audiences, including:
    1. Staff and administrators (school and district)
    2. Parents and students
    3. School board members
    4. Teachers' associations
    5. Community members
    6. Media
    Many educators will probably conclude that they can serve multiple audiences most effectively by issuing several reports, each tailored to one or more specific audiences.

    Information

    When deciding what types of information to include in various reports, educators should consider not only what various audiences want, but what additional information will give them a more meaningful understanding of what students are learning and of a school's effectiveness. In addition to data showing the proportion of students meeting local outcomes or objectives, the following types of information might be included:

    Educators should carefully prepare reports that will communicate effectively with each intended audience. They should avoid developing reports that contain so much information that many audience members will ignore them. Indexing reports so that readers can quickly find information can help considerably. However, educators should not overload reports with extraneous information that will discourage potential readers.

    Format

    The format of assessment reports will influence readers' motivation to read the reports as well as the perceptions that readers acquire from the reports. Questions that educators might want to ask as they design reports include:

    The answers to many of these questions will vary by audience. For example, some audiences may prefer brief, easy-to-read summaries while others want considerable detail. As indicated above, it may be necessary to prepare different reports for different audiences.

    References

    Alaska Department of Education. (1986). Assessment Handbook: A Practical Guide for Assessing Alaska's Students. Juneau: State of Alaska, Department of Education.

    A resource that was used extensively when both versions of the Illinois Assessment Handbook were developed.

    American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (1985). Standards for Educational and Psychological Testing. Washington, DC: American Psychological Association.

    The fourth version of a publication developed by a joint committee representing three major professional organizations. The document's purpose is to guide the development and use of tests.

    Gronlund, N.E. (1985). Measurement and Evaluation in Teaching. New York: Macmillan Publishing Company, Inc.

    The fifth edition of a classic textbook on assessing students. The book is intended for elementary and secondary teachers. It is practical and contains clear descriptions of major concepts in assessment.

    Gronlund, N.E. (1988). How to Make Achievement Tests and Assessments. Boston: Allyn and Bacon.

    The fifth edition of a practical guide for constructing forced-choice and--to a more limited extent--performance-based assessment procedures.

    Herman, J.L., P.R. Aschbacher, and L. Winters. (1992). A Practical Guide to Alternative Assessment. Alexandria, VA:  Association for Supervision and Curriculum Development.

    A practical guide for developing and using performance-based assessments.




    Illinois State Board of Education. (1993). An Overview of IGAP Performance Standards for Reading, Mathematics, Writing, Science, and Social Sciences.

    This document describes the development of the standards, tells how they will be used in the evaluation of schools, and presents the standards.

    Illinois State Board of Education. (1994a). The Illinois Public School Accreditation Process: Resource Document.

    An overview of the Illinois Public School Accreditation Process that explains its three parts and how schools can use it.

    Illinois State Board of Education. (1994b). The Illinois Public School Accreditation Process: School Improvement Plan Workbook.

    A training manual on the Illinois School Improvement Plan.

    Illinois State Board of Education. (1994c). Illinois School Improvement Plan: Assessment Systems. Brochure.

    An information brochure on the assessment component of the Illinois School Improvement Plan.

    Illinois State Board of Education. (1994d). Illinois School Improvement Plan: Introduction. Brochure.

    An information brochure for teachers and the general public on the School Improvement Plan and the Illinois State Goals for Learning.




    Illinois State Board of Education (1994e). Illinois School Improvement Plan: Learning Outcomes, Standards, and Expectations. Brochure.

    An information brochure on the learning outcomes, standards, and expectations component of the Illinois School Improvement System.

    Illinois State Board of Education. (1994f). Learning Outcomes, Standards and Expectations: Linking Educational Goals, Curriculum and Assessment.

    A theoretical paper on options for addressing learning outcomes and standards from various curriculum orientations.

    Illinois State Board of Education. (1994g). Write on Illinois, III!

    The third edition of a guide for understanding performance assessment in writing. It can help schools develop performance-based assessment in writing.

    Illinois State Board of Education. (1995). Performance Assessment in Mathematics: Approaches to Open-Ended Problems.

    This document, which includes the scoring rubric shown in Figure 6, provides guidelines and suggestions for creating and using open-ended performance items for problem solving.

    Lane, S. (1993). The conceptual framework for the development of a mathematics performance instrument. Educational Measurement: Issues and Practice, 12, 16-23.

    Linn, R.L. (1993). Educational assessment: Expanded expectations and challenges. Educational Evaluation and Policy Analysis, 15(1), 1-16.

    Linn, R.L., E.L. Baker, and S.B. Dunbar. (1991). Complex, performance-based assessment: Expectations and validation criteria. Educational Researcher, 20(8), 15-23.

    Marzano, R.J., D. Pickering, and J. McTighe. (1993). Assessing Student Outcomes: Performance Assessment Using the Dimensions of Learning Model. Alexandria, VA: Association for Supervision and Curriculum Development.

    Describes a practical approach to student assessment that recognizes the connections among teaching, learning, and assessment. Includes many generic rubrics for teachers to adapt and use.

    Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational Measurement (3rd ed., pp. 13-103). New York: Macmillan.

    National Evaluation Systems, Inc. (1987). Bias Issues in Test Development. Amherst, MA: National Evaluation Systems, Inc.

    Prepared under contract with the Illinois State Board of Education. Discusses assessment bias through language usage, stereotyping, representational unfairness, and content exclusion. Includes guidelines for avoiding bias.

    Ory, J. C. and K. E. Ryan. (1993). Tips for Improving Testing and Grading. Newbury Park, CA: Sage.

    A practical guide for developing forced-choice and performance-based assessments (including classroom-level assessments) and for assigning grades.

    Shepard, L.A. (1993). Evaluating test validity. In Review of Research in Education, 19 (pp. 405-484). Washington, DC: American Educational Research Association.

    Stiggins, R. (1987, Spring). Design and development of performance assessments. Educational Measurement: Issues and Practice, 35.

    Wilson, L. R., B. C. Sherbarth, H. M. Brickell, S. T. Mayo, and R. H. Paul. (1988). Determining Validity and Reliability of Locally Developed Assessments. Springfield, IL: Illinois State Board of Education.

    Note: Most Illinois State Board of Education publications are available from ISBE and from regional education service agencies (although some may be out of print).