Which test refers to the consistency of scores?

Chapter Summary:

This chapter focused on the reliability/precision of educational assessment procedures. Reliability refers to the consistency with which a test measures whatever it’s measuring—that is, the absence of measurement errors that would distort a student’s score.

There are three distinct types of reliability evidence. Test-retest reliability refers to the consistency of students’ scores over time. Such reliability is usually represented by a coefficient of correlation between students’ scores on two occasions, but can be indicated by the degree of classification consistency displayed for students on two measurement occasions. Alternate-form reliability evidence refers to the consistency of results between two or more forms of the same test. Alternate-form reliability evidence is usually represented by the correlation of students’ scores on two different test forms, but can also be reflected by classification consistency percentages. Internal consistency evidence represents the degree of homogeneity in an assessment procedure’s items. Common indices of internal consistency are the Kuder-Richardson formulae as well as Cronbach’s coefficient alpha. The three forms of reliability evidence should not be used interchangeably, but should be sought if relevant to the educational purpose to which an assessment procedure is being put—that is, the kind of educational decision linked to the assessment’s results.
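To make the internal-consistency indices mentioned above concrete, here is a minimal Python sketch of Cronbach's coefficient alpha applied to a small, invented matrix of item scores; the data and the helper function are illustrative assumptions, not material from the chapter.

```python
# A rough sketch (not from the chapter) of Cronbach's coefficient alpha,
# one of the internal-consistency indices named above. Data are invented.
import numpy as np

def cronbach_alpha(item_scores):
    """item_scores: rows = students, columns = items."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]                            # number of items
    item_vars = x.var(axis=0, ddof=1)         # variance of each item
    total_var = x.sum(axis=1).var(ddof=1)     # variance of students' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Five students answering four dichotomously scored items (made-up data);
# with 0/1 items, coefficient alpha reduces to the Kuder-Richardson 20 value.
scores = [[1, 1, 1, 0],
          [1, 0, 1, 1],
          [0, 0, 1, 0],
          [1, 1, 1, 1],
          [0, 0, 0, 0]]
print(f"coefficient alpha = {cronbach_alpha(scores):.2f}")   # about 0.79
```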

The standard error of measurement supplies an indication of the consistency of an individual’s score by estimating person-score consistency from evidence of group-score consistency. The standard error of measurement is interpreted in a manner similar to the plus or minus sampling-error estimates often provided with national opinion polls. Conditional SEMs can be computed for particular segments of a test’s score scale—such as near any key cut-scores. Classroom teachers are advised to become generally familiar with the key notions of reliability, but not to subject their own classroom tests to reliability analyses unless the tests are extraordinarily important.
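As a worked illustration of the usual estimate behind this idea, the sketch below applies the standard formula SEM = SD * sqrt(1 - reliability) to assumed numbers; the standard deviation, reliability coefficient, and observed score are all made up.

```python
# A worked sketch of the standard error of measurement with assumed numbers:
# SEM = SD * sqrt(1 - reliability coefficient).
import math

sd = 10.0           # standard deviation of the group's scores (assumed)
reliability = 0.91  # reliability coefficient for the test (assumed)

sem = sd * math.sqrt(1 - reliability)   # = 3.0 score points
observed = 72                           # one student's observed score (assumed)

# Roughly a 68% band under the usual normal-error interpretation,
# analogous to a poll's plus-or-minus sampling error.
print(f"SEM = {sem:.1f}")
print(f"band: {observed - sem:.1f} to {observed + sem:.1f}")
```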

Chapter Outcome:

An understanding of commonly employed indicators of a test’s reliability/precision that is sufficient to identify the types of reliability evidence already collected for a test and, if necessary, to select the kinds of reliability evidence needed for particular uses of educational assessments

Reliability is such a cherished commodity. We all want our automobiles, washing machines, and spouses to be reliable. The term reliability simply reeks of solid goodness. It conjures up visions of meat loaf, mashed potatoes, and a mother’s love. Clearly, reliability is an attribute to be sought.

Chapter Activity:

Double-Entry Journal Activity:

(In a double-entry journal, the student writes main points in the left-hand column and a reflection on each main point in the right-hand column.)

Please create a double-entry journal for chapter 3 to learn about reliability more deeply.  

Definition 1.

Reliability is a test’s first requirement and refers to its consistency: a reliable test is one that yields consistent scores when a person takes two alternate forms of the test or when he or she takes the same test on two or more different occasions.

Definition 2.

Reliability is a measure of the probability that a product will not malfunction or fail within a specified time period.


Definition 3.

Reliability is the characteristic that refers to the consistency of scores obtained by the same person when retested with identical or equivalent tests.


Definition 4.

Reliability is a statistical term for the internal consistency of a test; the extent to which it can be expected to produce the same result on different occasions.

Webster’s Dictionary Meaning

- The state or quality of being reliable; reliableness.

EXPLORING RELIABILITY IN ACADEMIC ASSESSMENT

Written by Colin Phelan and Julie Wren, Graduate Assistants, UNI Office of Academic Assessment (2005-06)

Reliability is the degree to which an assessment tool produces stable and consistent results.

Types of Reliability

  1. Test-retest reliability is a measure of reliability obtained by administering the same test twice over a period of time to a group of individuals.  The scores from Time 1 and Time 2 can then be correlated in order to evaluate the test for stability over time. 

Example:  A test designed to assess student learning in psychology could be given to a group of students twice, with the second administration perhaps coming a week after the first.  The obtained correlation coefficient would indicate the stability of the scores.
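A minimal sketch of that correlation step, assuming hypothetical Time 1 and Time 2 scores for the same eight students; SciPy's pearsonr is used simply as one convenient way to obtain the coefficient.

```python
# A minimal sketch of the correlation step in the example above, using
# hypothetical Time 1 / Time 2 scores for the same eight students.
from scipy.stats import pearsonr

time1 = [78, 85, 62, 90, 74, 88, 69, 81]   # first administration
time2 = [75, 88, 65, 92, 70, 85, 72, 79]   # same students, one week later

r, _ = pearsonr(time1, time2)
print(f"test-retest reliability estimate: r = {r:.2f}")
```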

  2. Parallel forms reliability is a measure of reliability obtained by administering different versions of an assessment tool (both versions must contain items that probe the same construct, skill, knowledge base, etc.) to the same group of individuals.  The scores from the two versions can then be correlated in order to evaluate the consistency of results across alternate versions. 

Example:  If you wanted to evaluate the reliability of a critical thinking assessment, you might create a large set of items that all pertain to critical thinking and then randomly split the questions up into two sets, which would represent the parallel forms.
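A rough sketch of that random-split procedure is shown below; since no real data accompany the example, responses are simulated from an assumed ability score (an assumption, not part of the example) so the two forms display some consistency.

```python
# A rough sketch of the random-split procedure in the example above.
# Real responses are replaced by a simple ability-based simulation.
import numpy as np

rng = np.random.default_rng(0)
ability = rng.normal(size=(30, 1))                          # 30 students
p_correct = 1 / (1 + np.exp(-ability))                      # ability-driven
responses = (rng.random((30, 40)) < p_correct).astype(int)  # 40 pooled items

items = rng.permutation(40)              # shuffle the pooled item indices
form_a, form_b = items[:20], items[20:]  # two randomly assembled 20-item forms

score_a = responses[:, form_a].sum(axis=1)
score_b = responses[:, form_b].sum(axis=1)
r = np.corrcoef(score_a, score_b)[0, 1]
print(f"parallel-forms reliability estimate: r = {r:.2f}")
```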

  3. Inter-rater reliability is a measure of reliability used to assess the degree to which different judges or raters agree in their assessment decisions.  Inter-rater reliability is useful because human observers will not necessarily interpret answers the same way; raters may disagree as to how well certain responses or material demonstrate knowledge of the construct or skill being assessed. 

Example:  Inter-rater reliability might be employed when different judges are evaluating the degree to which art portfolios meet certain standards.  Inter-rater reliability is especially useful when judgments can be considered relatively subjective.  Thus, the use of this type of reliability would probably be more likely when evaluating artwork as opposed to math problems.
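One widely used index of inter-rater agreement is Cohen's kappa. The sketch below uses invented portfolio ratings, and scikit-learn's cohen_kappa_score is used purely as a convenient implementation; kappa itself is not prescribed by the passage above.

```python
# A minimal sketch of one common inter-rater index, Cohen's kappa,
# computed from invented ratings.
from sklearn.metrics import cohen_kappa_score

# Two judges rating the same ten art portfolios on a 1-4 scale (made up).
rater_1 = [3, 4, 2, 4, 1, 3, 3, 2, 4, 1]
rater_2 = [3, 4, 2, 3, 1, 3, 4, 2, 4, 2]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa = {kappa:.2f}")
```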

  4. Internal consistency reliability is a measure of reliability used to evaluate the degree to which different test items that probe the same construct produce similar results. 
    1. Average inter-item correlation is a subtype of internal consistency reliability.  It is obtained by taking all of the items on a test that probe the same construct (e.g., reading comprehension), determining the correlation coefficient for each pair of items, and finally taking the average of all of these correlation coefficients.  This final step yields the average inter-item correlation. 
    2. Split-half reliability is another subtype of internal consistency reliability.  The process of obtaining split-half reliability begins by “splitting in half” all items of a test that are intended to probe the same area of knowledge (e.g., World War II) in order to form two “sets” of items.  The entire test is administered to a group of individuals, the total score for each “set” is computed, and finally the split-half reliability is obtained by determining the correlation between the two total “set” scores.
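The sketch below works through both subtypes on a simulated item-score matrix; the data are assumptions, and the Spearman-Brown correction shown for the split-half estimate is a common refinement rather than part of the procedure described above.

```python
# A rough sketch of both internal-consistency subtypes on a simulated
# item-score matrix (rows = examinees, columns = items); data are assumed.
import numpy as np

rng = np.random.default_rng(1)
ability = rng.normal(size=(50, 1))
items = (rng.random((50, 20)) < 1 / (1 + np.exp(-ability))).astype(float)

# Average inter-item correlation: mean of the off-diagonal item correlations.
corr = np.corrcoef(items, rowvar=False)
avg_inter_item = corr[np.triu_indices_from(corr, k=1)].mean()

# Split-half: correlate odd-item and even-item totals. The Spearman-Brown
# step projects that half-test correlation to full test length; it is a
# common refinement, not something the passage above requires.
odd = items[:, 0::2].sum(axis=1)
even = items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd, even)[0, 1]
full_length = (2 * r_half) / (1 + r_half)

print(f"average inter-item correlation: {avg_inter_item:.2f}")
print(f"split-half correlation: {r_half:.2f} "
      f"(Spearman-Brown corrected: {full_length:.2f})")
```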

Validity refers to how well a test measures what it is purported to measure. 

Why is it necessary?

While reliability is necessary, it alone is not sufficient: a test also needs to be valid.  For example, if your scale is off by 5 lbs, it reads your weight every day with an excess of 5 lbs.  The scale is reliable because it consistently reports the same weight every day, but it is not valid because it adds 5 lbs to your true weight.  It is not a valid measure of your weight.

Types of Validity

 

1. Face Validity ascertains that the measure appears to be assessing the intended construct under study. Stakeholders can easily assess face validity. Although this is not a very “scientific” type of validity, it may be an essential component in enlisting the motivation of stakeholders. If the stakeholders do not believe the measure is an accurate assessment of the ability, they may disengage from the task.

Example: If a measure of art appreciation is created, all of the items should be related to the different components and types of art.  If the questions are about historical time periods, with no reference to any artistic movement, stakeholders may not be motivated to give their best effort or invest in this measure because they do not believe it is a true assessment of art appreciation.

2. Construct Validity is used to ensure that the measure is actually measuring what it is intended to measure (i.e., the construct) and not other variables. Using a panel of “experts” familiar with the construct is one way this type of validity can be assessed. The experts can examine the items and decide what each specific item is intended to measure.  Students can be involved in this process to obtain their feedback.

Example: A women’s studies program may design a cumulative assessment of learning throughout the major.  If the questions are written with complicated wording and phrasing, the test can inadvertently become a test of reading comprehension rather than a test of women’s studies.  It is important that the measure actually assesses the intended construct, rather than an extraneous factor.

3. Criterion-Related Validity is used to predict future or current performance - it correlates test results with another criterion of interest.

Example: If a physics program designed a measure to assess cumulative student learning throughout the major, the new measure could be correlated with a standardized measure of ability in this discipline, such as an ETS field test or the GRE subject test. The higher the correlation between the established measure and the new measure, the more faith stakeholders can have in the new assessment tool.
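A brief sketch of that correlational logic, with invented scores standing in for the program's new measure and the external criterion:

```python
# A brief sketch of criterion-related validity with invented scores; the
# "criterion" values stand in for something like GRE subject-test scores.
from scipy.stats import pearsonr

new_measure = [55, 72, 64, 80, 47, 69, 75, 58]         # program's new test
criterion = [560, 700, 640, 760, 500, 660, 720, 590]   # established measure

r, _ = pearsonr(new_measure, criterion)
print(f"criterion-related validity coefficient: r = {r:.2f}")
```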

4. Formative Validity, when applied to outcomes assessment, refers to how well a measure provides information that can be used to improve the program under study.

Example: When designing a rubric for history, one could assess students’ knowledge across the discipline.  If the measure can show that students are lacking knowledge in a certain area, for instance the Civil Rights Movement, then that assessment tool is providing meaningful information that can be used to improve the course or program requirements.

5. Sampling Validity (similar to content validity) ensures that the measure covers the broad range of areas within the concept under study.  Not everything can be covered, so items need to be sampled from all of the domains.  This may need to be completed using a panel of “experts” to ensure that the content area is adequately sampled.  Additionally, a panel can help limit “expert” bias (i.e., a test reflecting what an individual personally feels are the most important or relevant areas).

Example: When designing an assessment of learning in the theatre department, it would not be sufficient to cover only issues related to acting.  Other areas of theatre, such as lighting, sound, and the functions of stage managers, should all be included.  The assessment should reflect the content area in its entirety.

 

What are some ways to improve validity?

  1. Make sure your goals and objectives are clearly defined and operationalized.  Expectations of students should be written down.
  2. Match your assessment measure to your goals and objectives. Additionally, have the test reviewed by faculty at other schools to obtain feedback from an outside party who is less invested in the instrument.
  3. Get students involved; have the students look over the assessment for troublesome wording, or other difficulties.
  4. If possible, compare your measure with other measures, or data that may be available.


Which test refers to the consistency of scores obtained by an individual on the same test on two occasions?

The reliability of a test refers to the consistency of scores obtained by an individual on the same test on two different occasions.

What is the reliability of a test?

The reliability of test scores is the extent to which they are consistent across different occasions of testing, different editions of the test, or different raters scoring the test taker's responses.