Validity and Reliability of a Test
Introduction
An important part of social science research is the quantification of human behavior; that is, the use of measurement instruments to observe human behavior. The measurement of human behavior belongs to the widely accepted positivist, or empirical-analytic, approach to discerning reality. Because most behavioral research takes place within this paradigm, measurement instruments must be valid and reliable.
Reliability
Reliability is the extent to which measurements are repeatable: when different persons perform the measurements, on different occasions, under different conditions, with supposedly equivalent instruments that measure the same thing. In sum, reliability is consistency of measurement (Bollen, 1989), or stability of measurement over a variety of conditions under which basically the same results should be obtained.
Methods of testing reliability
Because reliability is consistency of measurement over time, or stability of measurement over a variety of conditions, the most commonly used technique to estimate it is a measure of association: the correlation coefficient, often termed the reliability coefficient. The reliability coefficient is the correlation between two or more variables (here, tests, items, or raters) that measure the same thing.
The reliability coefficient of a test is computed using the Pearson product-moment correlation coefficient (r). It expresses the relationship between two repeated measures of the same test given to the same subjects under similar conditions.
Typical methods to estimate test reliability in behavioral research are: test-retest reliability, alternative forms, split-halves, inter-rater reliability, and internal consistency. There are three main concerns in reliability testing: equivalence, stability over time, and internal consistency.
1. Test-retest reliability
Test-retest reliability refers to the temporal stability of a test from one measurement session to another. The procedure is to administer the test to a group of respondents and then administer the same test to the same respondents at a later date. The correlation between scores on the identical tests given at different times operationally defines its test-retest reliability.
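As a minimal sketch of this computation (Python with NumPy is an assumption here, since the text names no tooling, and the scores are hypothetical):

```python
# Test-retest reliability: Pearson r between two administrations of the
# same test to the same ten respondents (hypothetical scores).
import numpy as np

time1 = np.array([12, 15, 11, 18, 14, 16, 13, 17, 10, 15])
time2 = np.array([13, 14, 12, 17, 15, 17, 12, 18, 11, 14])

r = np.corrcoef(time1, time2)[0, 1]  # Pearson product-moment correlation
print(f"test-retest reliability: r = {r:.2f}")
```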
2. Alternative forms reliability
The alternative forms technique to estimate reliability is similar to the test-retest method, except that different measures of a behavior (rather than the same measure) are collected at different times. If the correlation between the alternative forms is low, it could indicate that considerable measurement error is present, because two different scales were used. For example, when testing for general spelling, one of the two independently composed tests might not test general spelling but a more subject-specific type of spelling, such as business vocabulary. This type of measurement error is then attributed to the sampling of items on the test. Several of the limits of the test-retest method also apply to the alternative forms technique.
3. Split-half method
The split-half approach is another method to test reliability, which assumes that a number of items are available to measure a behavior. Half of the items are combined to form one new measure and the other half are combined to form the second new measure. The result is two tests, and two new measures of the same behavior. In contrast to the test-retest and alternative form methods, the two split-half measures are usually collected in the same time period. The correlation between the two half-tests must be corrected to obtain the reliability coefficient for the whole test.
A practical advantage is that split-half data are usually cheaper and more easily obtained than over-time data.
A disadvantage of the split-half method is that the two halves must be parallel measures; the correlation between them will vary slightly depending on how the items are divided.
Nunnally (1978) suggests using the split-half method when measuring variability of behaviors over short periods of time when alternative forms are not available. For example, the even items can first be given as a test and, subsequently, on the second occasion, the odd items as the alternative form. The corrected correlation coefficient between the even and odd item test scores will indicate the relative stability of the behavior over that period of time.
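The split-half procedure and its correction can be sketched as follows (a minimal illustration with hypothetical data; the correction is the standard Spearman-Brown step-up, r_full = 2r / (1 + r)):

```python
# Split-half reliability with the Spearman-Brown correction.
import numpy as np

# Hypothetical data: 30 respondents, 10 dichotomously scored items,
# each reflecting the same underlying ability plus noise.
rng = np.random.default_rng(42)
ability = rng.normal(size=30)
scores = (ability[:, None] + rng.normal(size=(30, 10)) > 0).astype(int)

odd_half = scores[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
even_half = scores[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...

# Correlation between the two half-tests
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction estimates the whole-test reliability
r_full = 2 * r_half / (1 + r_half)
print(f"half-test r = {r_half:.2f}, corrected whole-test r = {r_full:.2f}")
```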
4. Inter-rater reliability
When raters or judges are used to measure behavior, the reliability of their individual judgments, or the combined consistency of their judgments, is assessed.
Suppose two judges rate the same set of behaviors. The correlation between the ratings made by the two judges tells us the reliability of either judge in the specific situation. The composite reliability of both judges, referred to as effective reliability, is calculated using the Spearman-Brown formula.
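A minimal sketch of the effective-reliability calculation (the ratings are hypothetical; the composite formula is the Spearman-Brown form R = n*r / (1 + (n-1)*r)):

```python
import numpy as np

# Hypothetical ratings of 8 essays by two judges on a 1-5 scale
judge1 = np.array([4, 5, 3, 4, 2, 5, 3, 4])
judge2 = np.array([5, 5, 3, 4, 3, 4, 3, 5])

# Reliability of a single judge: correlation between the two judges
r = np.corrcoef(judge1, judge2)[0, 1]

# Effective (composite) reliability of n judges via Spearman-Brown
n = 2
R = n * r / (1 + (n - 1) * r)
print(f"single-judge r = {r:.2f}, effective reliability = {R:.2f}")
```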
5. Internal consistency
Internal consistency concerns the reliability of the test components. It measures consistency within the instrument and asks how well a set of items measures a particular behavior or characteristic within the test. Internal-consistency estimates of test reliability are based on the average intercorrelations among all the single items within a test.
The most popular method of testing for internal consistency in the behavioral sciences is coefficient alpha. Coefficient alpha was popularized by Cronbach (1951), who recognized its general usefulness; as a result, it is often referred to as Cronbach's alpha. Coefficients of internal consistency increase as the number of items goes up, to a certain point. For instance, a 5-item test might correlate 0.4 with true scores, while a 12-item test might correlate 0.8 with true scores.
Consequently, any individual item would be expected to have only a small correlation with true scores. Thus, if coefficient alpha proves to be very low, either the test is too short or the items have very little in common. Coefficient alpha is useful for estimating the reliability of a unidimensional test; that is, it is useful once the existence of a single factor or construct has been determined.
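Coefficient alpha follows directly from the item and total-score variances; a minimal sketch with hypothetical data (Python/NumPy assumed):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an (n_respondents x k_items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score).
    """
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical data: 50 respondents, 6 items tapping one underlying trait
rng = np.random.default_rng(1)
trait = rng.normal(size=50)
items = trait[:, None] + rng.normal(size=(50, 6))
print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")
```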
How to make a test more reliable?
- Writing items clearly
- Making test instructions easily understood
- Training the raters effectively by making the rules for scoring as explicit as possible.
- Identifying candidates by number instead of by name, so that a rater cannot inadvertently award higher scores to candidates he or she knows
- Not allowing the candidate too much freedom (for example, in the choice of questions to answer)
Factors affecting reliability of a test
1. Test length. Generally, the longer a test is, the more reliable it is (see the Spearman-Brown sketch after this list).
2. Speed. When a test is a speed test, reliability can be problematic. It is inappropriate to estimate reliability using internal consistency or split-half methods, because not every student is able to complete all of the items in a speed test. In contrast, a power test is a test in which every student is able to complete all the items.
3. Group homogeneity. In general, the more heterogeneous the group of students who take the test, the more reliable the measure will be.
4. Item difficulty. When there is little variability among test scores, the reliability will be low. Thus, reliability will be low if a test is so easy that every student gets most or all of the items correct or so difficult that every student gets most or all of the items wrong.
5. Objectivity. Objectively scored tests, rather than subjectively scored tests, show a higher reliability.
6. Test-retest interval. The shorter the time interval between two administrations of a test, the less likely that changes will occur and the higher the reliability will be.
7. Variation within the testing situation. Errors in the testing situation (e.g., students misunderstanding or misreading test directions, noise level, distractions, and sickness) can cause test scores to vary.
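Point 1 above (test length) can be made concrete with the Spearman-Brown prophecy formula, which predicts the reliability of a test lengthened by a factor n; a minimal sketch with hypothetical numbers:

```python
def prophecy(r_current: float, n: float) -> float:
    """Spearman-Brown prophecy formula: predicted reliability when a
    test is lengthened by a factor n (n > 1 lengthens, n < 1 shortens)."""
    return n * r_current / (1 + (n - 1) * r_current)

# Hypothetical: a test with reliability 0.60, tripled in length
print(f"{prophecy(0.60, 3):.2f}")  # -> 0.82
```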
Validity
The term validity refers to whether or not the test measures what it claims to measure. On a test with high validity the items will be closely linked to the test's intended focus. For many certification and licensure tests this means that the items will be highly related to a specific job or occupation. If a test has poor validity then it does not measure the job-related content and competencies it ought to. When this is the case, there is no justification for using the test results for their intended purpose.
There are nine types of validity that researchers should consider; they include the following:
Statistical conclusion validity
Statistical conclusion validity pertains to the relationship being tested. It refers to inferences about whether it is reasonable to presume covariation given a specified alpha level and the obtained variances. Major threats to statistical conclusion validity include low statistical power, violation of assumptions, unreliability of measures, unreliability of treatment implementation, random irrelevancies in the experimental setting, and random heterogeneity of respondents.
Construct validity
Construct validity refers to how well you translated or transformed a concept, idea, or behavior (a construct) into a functioning and operating reality: the operationalization.
Translation Validity
Translation validity centers on whether the operationalization reflects the true meaning of the construct. Translation validity attempts to assess the degree to which constructs are accurately translated into the operationalization, using subjective judgment.
Face Validity
Face validity is a subjective judgment on the operationalization of a construct. For instance, one might look at a measure of reading ability, read through the paragraphs, and decide that it seems like a good measure of reading ability. Even though subjective judgment is needed throughout the research process, the aforementioned method of validation is not very convincing to others as a valid judgment. As a result, face validity is often seen as a weak form of validity.
Content validity
Content validity is a qualitative type of validity where the domain of the concept is made clear and the analyst judges whether the measures fully represent the domain. According to Bollen (1989), for most concepts in the social sciences, no consensus exists on theoretical definitions, because the domain of content is ambiguous. Consequently, the burden falls on the researcher not only to provide a theoretical definition of the concept that is accepted by his or her peers but also to select indicators that thoroughly cover its domain and dimensions. Thus, content validity is a qualitative means of ensuring that indicators tap the meaning of a concept as defined by the researcher. For example, if a researcher wants to test a person's knowledge of elementary geography with a paper-and-pencil test, the researcher needs to be assured that the test is representative of the domain of elementary geography. Does the test really assess a person's knowledge of elementary geography (e.g., the location of the major continents) or does it require more advanced knowledge of geography (e.g., the continents' topography and its effect on climate)?
There are basically two ways of assessing content validity:
- ask a number of questions about the instrument
- ask the opinion of expert judges in the field.
Criterion-related validity
Criterion-related validity is the degree of correspondence between a test measure and one or more external referents (criteria), usually measured by their correlation. For example, suppose we survey students in a class and ask them to report their chemistry results. If we had access to their actual results records, we could assess the validity of the survey (results reported by the students) by correlating the two measures. In this case, the students’ records represent an (almost) ideal standard for comparison.
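As with the reliability coefficients above, this correspondence is usually estimated with a Pearson correlation; a minimal sketch of the chemistry-results example (all numbers hypothetical):

```python
import numpy as np

# Hypothetical: self-reported chemistry results vs. official records
reported = np.array([72, 85, 90, 65, 78, 88, 70, 95])
actual = np.array([70, 84, 92, 60, 80, 85, 72, 93])

validity = np.corrcoef(reported, actual)[0, 1]
print(f"criterion-related validity: r = {validity:.2f}")
```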
Concurrent Validity and Predictive Validity
When the criterion exists at the same time as the measure, we talk about concurrent validity: the ability of a test to relate to a criterion assessed in the present. When the criterion occurs in the future, we talk about predictive validity: the ability of a test to predict some event or outcome in the future.
Internal validity
Internal validity speaks to the validity of the research itself. For example, suppose a teacher surveys students about learning satisfaction. Only 50% of the students responded to the survey, and all of them liked their teacher. Does the teacher have a representative sample of students or a biased sample?
There are many threats to the internal validity of a research design. Some of these threats are: history, maturation, testing, instrumentation, selection, mortality, diffusion of treatment, compensatory equalization, compensatory rivalry, and demoralization.
External validity
External validity of a study or relationship implies generalizing to other persons, settings, and times. Generalizing to well-explained target populations should be clearly differentiated from generalizing across populations. Each is relevant to external validity: the former is critical in determining whether research objectives that specified particular populations have been met, and the latter is crucial in determining how far a treatment effect can be generalized across different populations.
For instance, if there is an interaction between an educational treatment and the social class of children, then we cannot infer that the same result holds across social classes.
Factors affecting validity of a test
1. The test itself, that is, the length of the test or the number of items in it.
Long tests do three things to help maintain validity.
- They increase the amount of content that the student must address, ensuring a more accurate picture of student knowledge.
- Long tests counteract the effects of faulty items by providing a greater number of better items.
- Long tests reduce the impact of student guessing. Therefore the length of the test affects its validity.
2. Range of ability. A very limited range of ability among the examinees gives rise to a low validity coefficient for the test.
3. Ambiguous directions. If the instructions/directions of the test are not clear, the test will be interpreted differently by various examinees. Unclear directions also encourage guessing, thus lowering the validity of the test.
4. Socio-cultural differences. A test developed with a particular culture in mind may not be valid when administered to examinees from other cultures. Differences in socio-economic and cultural practices affect test validity. However, if a test is cross-cultural, its validity will not be affected by cultural differences.
5. Testing environment. The testing environment is another variable that affects the validity of tests. If the testing environment is distracting or noisy, or the test-taker is unhealthy, he or she will have a difficult time performing consistently throughout the testing process.
Relationship between validity and reliability of a test
- If a test is unreliable, it cannot be valid
- For a test to be valid, it must be reliable
- However, just because a test is reliable does not mean it is valid
- Reliability is a necessary but not a sufficient condition for validity
Conclusion
Reliability in testing indicates more than just consistency. It also indicates the amount of random error associated with the test score. In other words, reliability refers to the confidence placed in a test score as a correct or true estimate of the trait or construct being tested, such as a student's level of proficiency in English or, in the case of an IQ score, his or her general mental ability. The term validity refers to whether or not the test measures what it claims to measure. On a test with high validity, the items will be closely linked to the test's intended focus. While test reliability is important, a reliable test is of little use if it is invalid. Reliability is necessary but not sufficient for a good test. Therefore, a reliable test must also have high validity in order to be considered psychometrically sound.
References
Bollen, K. A. (1989). Structural Equations with Latent Variables (pp. 179-225). John Wiley & Sons.
Brinberg, D. and McGrath, J. E. (1982). A Network of Validity Concepts within the Research Process. In Brinberg, D. and Kidder, L. H., (Eds), Forms of Validity in Research, pp. 5-23.
Chapman, L.J. and Chapman, J.P. (1969). Illusory correlations as an obstacle to use of valid psychodiagnostic signs. Journal of Abnormal Psychology, 74, 271-280.
Cook, T. D. and Campbell, D. T. (1979). Quasi-Experimentation: Design & Analysis Issues for Field Settings. Boston: Houghton Mifflin Company, pp. 37-94.
Rosenthal, R. and Rosnow, R. L. (1991). Essentials of Behavioral Research: Methods and Data Analysis. Second Edition. McGraw-Hill Publishing Company, pp. 46-65.
Shadish, W. R., Cook, T.D., and Campbell, D. T. (2001). Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston: Houghton Mifflin.
Trochim, W. M. K. (2006). Introduction to Validity. Social Research Methods, retrieved from www.socialresearchmethods.net/kb/introval.php, September 9, 2010.
Williams, L.J., Cote, J.A. and Buckley, M.R. (1989). Lack of method variance in self-reported affect and perceptions at work: Reality or artifact? Journal of Applied Psychology, 74: 462-468.