
June 2 and 3, 2009
Note: Abstracts and contact information for unavailable papers follow the conference schedule
Tuesday, 2 June
8:30 – 9:00 Welcomes: Larry Rudner, GMAC, and Dave Weiss,
9:00 – 10:15 Realities of CAT: Dave Weiss,
Effect of Early Misfit in Computerized Adaptive Testing on the Recovery of θ. Rick Guyer and David J. Weiss,
Quantifying the Impact of Compromised Items in CAT. Fanmin Guo, Graduate Management Admission Council
Guess What? Score Differences With Rapid Replies Versus Omissions on a Computerized Adaptive Test. Eileen Talento-Miller and Fanmin Guo, Graduate Management Admission Council
Termination Criteria in Computerized Adaptive Tests: Variable-Length CATs Are Not Biased. Ben Babcock and David J. Weiss,
10:30 – 12:00 CAT for Classification: Dave Weiss,
Computerized Classification Testing in More Than Two Categories by Using Stochastic Curtailment. Theo J.H.M. Eggen, CITO and University of Twente, The Netherlands; and Jasper T. Wouda, CITO,
Utilizing the Generalized Likelihood Ratio as a Termination Criterion. Nathan A. Thompson, Assessment Systems Corporation
Adaptive Testing Using Decision Theory.
"Black Box" Adaptive Testing by Mutual Information and Multiple Imputations. Anne Thissen-Roe, Kronos
A Comparison of Computerized Adaptive Testing Approaches: Real-data Simulations of
12:30 – 2:00 Posters: CAT Research and Applications Around the World (Concurrent)
A Comparison of Three Methods of Item Selection for Computerized Adaptive Testing. Denise Reis Costa and Camila Akemi Karino, CESPE/University of Brasilia; Fernando A. S. Moura, Federal University of Rio de Janeiro; and Dalton F. Andrade, Federal University of Santa Catarina, Brazil
Adequacy of an Item Pool for Proficiency in English Language From the
Development of an Item Model Taxonomy for Automatic Item Generation in Computerized Adaptive Testing. Hollis Lai, Mark J. Gierl, and Cecilia Alves,
An Approach to Implementing Adaptive Testing Using Item Response Theory in a Paper-Pencil Mode. V. Natarajan, MeritTrac Services Pvt. Ltd,
Assessing the Equivalence of Internet-Based vs. Paper-and-Pencil Psychometric Tests. Naomi Gafni, Keren Roded, and Michal Baumer, National Institute for Testing and Evaluation,
Features of a CAT System and Its Application to J-CAT. Shingo Imai, Y. Akagi,
Adaptive Measurement of Cognitive Ability Based on a Person’s Zone of Nearest Development. Marina Chelyshkova and Victor Zvonnikov, State
Implementing Figural Matrix Items in a Computerized Adaptive Testing System:
Constrained Item Selection Using a Stochastically Curtailed SPRT. Jasper T. Wouda and Theo J. H. M. Eggen, CITO, The
Using Enhanced Effective Response Time to Detect the Extent and Track the Trend of Item Pre-knowledge on a Large-Scale Computer Adaptive Assessment. Jie Li and Xiang Bo Wang, ACT, Inc.,
Computerized Adaptive Testing for the
Criterion-Related Validity of an Innovative CAT-Based Personality Measure. Robert J. Schneider, PDRI, Richard A. McLellan, PreVisor, Inc., Tracy M. Kantrowitz, PreVisor, Inc., Janis S. Houston, PDRI, Walter C. Borman, PDRI, U.S.A.
1:00 – 1:40 CAT in
Computerized Adaptive Testing in
Twenty-Two Years of Applying CAT for Admission to Higher Education in
2:00 – 3:15 Concurrent Sessions
Item Selection: Larry Rudner, GMAC, Chair
Item Selection and Hypothesis Testing for the Adaptive Measurement of Change. Matthew Finkelman, Tufts University School of Dental Medicine, David J. Weiss, University of Minnesota, and Gyenam Kim-Kang, Korea Nazarene University
A Gradual Maximum Information Ratio Approach to Item Selection in Computerized Adaptive Testing. Kyung (Chris) T. Han, Graduate Management Admission Council
Item Selection With Biased-Coin Up-and-Down Designs.
A Burdened CAT: Incorporating Response Burden with Maximum Fisher’s Information for Item Selection. Richard J. Swartz, The University of
Real-Time Analysis: Fanmin Guo, GMAC, Chair
Adaptive Item Calibration: A Simple Process for Estimating Item Parameters Within a Computerized Adaptive Test. G. Gage Kingsbury, Northwest Evaluation Association
On the Fly Item Calibration in Low Stakes CAT Procedures. Sharon Klinkenberg, Department of Psychology, University of Amsterdam, Marthe Straatemeier, Department of Psychology, University of Amsterdam, Gunter Maris, CITO, and Han van der Maas, Department of Psychology, University of Amsterdam
An Automatic Online Calibration Design in Adaptive Testing. Guido Makransky, University of Twente/Master Management International A/S and Cees A. W. Glas, University of Twente
Investigating Cheating Effects on the Conditional Sympson and Hetter Online Procedure with Freeze Control for Testlet-based Items. Ya-Hui Su,
3:25 – 5:30
Department of Defense
The Nine Lives of CAT-ASVAB: Innovations and Revelations. Mary Pommerich, Daniel O. Segall, and Kathleen E. Moreno,
National Institutes of Health
The CAT-DI Project: Development of a Comprehensive CAT-Based Instrument for Measuring Depression. Robert D. Gibbons,
Development of a CAT to Measure Dimensions of Personality Disorder: The CAT-PD Project. Leonard J. Simms,
The MEDPRO Project: An SBIR Project for a Comprehensive IRT and CAT Software System
IRT Software: David Thissen, The
CAT Software: Nathan Thompson, Assessment Systems Corporation
Wednesday, 3 June
8:15 – 9:25 Concurrent Sessions
Item Exposure: Larry Rudner, GMAC, Chair
Reviewing Test Overlap Rate and Item Exposure Rate as Indicators of Test Security in CATs. Juan Ramón Barrada, Universidad Autónoma de Barcelona; and Julio Olea, Vicente Ponsoda, and Francisco J. Abad, Universidad Autónoma de Madrid.
Optimizing Item Exposure Control and Test Termination Algorithm Pairings for Polytomous Computerized Adaptive Tests With Restricted Item Banks. Michael Chajewski and Charles Lewis,
Limiting Item Exposure for
Multidimensional CAT: Nate Thompson, Assessment Systems Corporation, Chair
Comparison of Adaptive Bayesian Estimation and Weighted Bayesian Estimation in Multidimensional Computerized Adaptive Testing. Po-Hsi Chen,
Comparison of Ability Estimation and Item Selection Methods in Multidimensional Computerized Adaptive Testing.
Multidimensional Adaptive Testing: The Application of Kullback-Leibler Information. Chun Wang and Hua-Hua Chang,
Multidimensional Adaptive Personality Assessment: A Real-Data Confirmation. Alan D. Mead, Avi Fleischer, and Jessica D. Sergent, Illinois Institute of Technology
9:35 – 10:45 Item and Pool Development: Larry Rudner, GMAC, Chair
Adaptive Computer-Based Tasks Under an Assessment Engineering Paradigm. Richard M. Luecht, The
Developing Item Variants: An Empirical Study. Anne Wendt, National Council of State Boards of Nursing, Shu-chuan Kao, Pearson VUE, Jerry Gorham, Pearson VUE, and Ada Woo, National Council of State Boards of Nursing
Evaluation of a Hybrid Simulation Procedure for the Development of Computerized Adaptive Tests. Steven W. Nydick and David J. Weiss,
11:00 - 11:55 Diagnostic Testing: Larry Rudner, GMAC, Chair
Computerized Adaptive Testing for Cognitive Diagnosis. Ying Cheng,
Obtaining Reliable Diagnostic Information through Constrained CAT. Hua-Hua Chang, Jeff Douglas and Chun Wang,
Applying the DINA model to GMAT Focus Data. Alan Huebner, Xiang Bo Wang, and Sung Lee, ACT, Inc.
11:55 - 12:30 Wrap-Up and Future Directions: Larry Rudner and Dave Weiss
Rick Guyer and David J. Weiss, University of Minnesota
This study focused on how early person misfit affected the recovery of θ for a computerized adaptive test (CAT) based on the 3-parameter logistic model. Number of misfitting items, generating θ, item selection method, and θ estimation method were independent variables in this study. The number of misfitting initial item responses was varied from k = 0 to 4 items. Ten different generating θ values were used at intervals from -3 to +3. For the five conditions in which θ was less than or equal to 0, the first k responses were fixed to be correct; for the five conditions where θ was greater than or equal to 0, misfit was introduced by fixing the first k item responses to be incorrect. Maximum likelihood, weighted likelihood (WLE), and expected a posteriori (EAP) estimation were used to estimate θ. Both Fisher information and Kullback-Leibler information item selection methods were used. All independent variables were crossed in the simulation design, with 1,000 simulees per cell. Recovery of θ was indexed by bias, standard error, and root-mean-square error at CAT lengths of 15, 25, 35, and 50 items. ANOVA was used to analyze the results and major effects were identified by eta-squared.
It was found that CAT could recover from misfit-as-correct-responses (MCR) for low ability simulees given a sufficient number of items. CAT could not recover from misfit-as-incorrect-responses (MIR) for high ability simulees, even after 50 items. At 50 items, a small amount of bias was observed for 1 misfitting item; as the number of misfitting items increased to 4, the bias increased and was substantial for all positive values of θ. The differences between the Fisher and Kullback-Leibler information-based item selection dissipated after 15 items were administered – with one exception: for the MIR conditions, it was found that WLE functioned differently under the two item selection methods even after 50 items were administered. A follow-up study was performed, and it was found that WLE was highly sensitive to item difficulty early in the CAT. Implications of the results and suggestions for future research will be provided.
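As an illustration only (not the authors' code), the quantities named in this abstract can be sketched in Python: the 3-parameter logistic response function, the Fisher information it implies, and the bias and root-mean-square error used to index recovery of θ. The function names are hypothetical, and the logistic metric without the 1.7 scaling constant is an assumption.

```python
import math

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def fisher_info_3pl(theta, a, b, c):
    """Fisher information of a single 3PL item at theta."""
    p = p_3pl(theta, a, b, c)
    return a**2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c))**2

def recovery_indices(true_theta, estimates):
    """Bias and RMSE of theta estimates around one generating theta."""
    n = len(estimates)
    bias = sum(e - true_theta for e in estimates) / n
    rmse = math.sqrt(sum((e - true_theta)**2 for e in estimates) / n)
    return bias, rmse
```

In a design like the one described, `recovery_indices` would be applied to the 1,000 estimates in each cell at each CAT length.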
For further information: guyerr@assess.com or guyer005@umn.edu
Fanmin Guo, Graduate Management Admission Council
If a few test items should become compromised, their impact on test scores would not be constant across different computerized adaptive test (CAT) programs. For the same number of items compromised, the impact might be more serious in some CAT programs than others because the impact interacts with the complexity of the test specification, size of CAT pools, item exposure control, item selection algorithm, and scoring method employed in CAT programs. As a result, evaluating the impact of compromised items on test scores in a CAT program is not easy. Most of the previous research focused on the impact on a group of examinees through simulations.
In this study, a new method of simulation is introduced that focuses on the impact on individual examinees using the GMAT® CAT as an example. For each simulee, two paths of simulations were run. The first path is the conventional simulation under no compromised item condition. The second path follows the selected items and response patterns in the first path until a “compromised” item is “administered.” Then the answer to this item is reset to a correct answer to simulate a “security breach.” After that, the path branches to selecting new items. All the answers to subsequent “compromised” items are set as correct answers until the end of the test. The purpose of this method is to quantify the impact of compromised items as well as its interaction with the item selection and other CAT operational configurations. Since each simulee will have two scores from the two separate paths, this method allows estimating the range of score gains and the number of compromised items seen by each individual. It allows reports that, if n items from a CAT pool were exposed to m examinees, x examinees would gain y score points due to the impact of compromised items. The method employed in this study applies to any CAT program.
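The two-path idea can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the GMAT algorithm: it assumes a Rasch model, nearest-difficulty item selection, EAP scoring on a fixed grid, and no exposure control or content constraints, and every name is hypothetical. The key device is one fixed random draw per item shared by both paths, so the second path reproduces the first until a compromised item changes a response.

```python
import math
import random

GRID = [-4 + 0.1 * i for i in range(81)]  # quadrature points for EAP (assumed)

def p_rasch(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def eap(responses):
    """EAP estimate under a standard normal prior; responses holds (b, u) pairs."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)
        for b, u in responses:
            p = p_rasch(t, b)
            w *= p if u else 1.0 - p
        weights.append(w)
    return sum(t * w for t, w in zip(GRID, weights)) / sum(weights)

def run_path(true_theta, pool, draws, length, compromised):
    """Administer one CAT; every compromised item is answered correctly."""
    responses, used, seen = [], set(), 0
    theta_hat = 0.0
    for _ in range(length):
        # Rasch maximum information: unused item with difficulty nearest theta_hat.
        j = min((i for i in range(len(pool)) if i not in used),
                key=lambda i: abs(pool[i] - theta_hat))
        used.add(j)
        if j in compromised:
            u, seen = 1, seen + 1
        else:
            u = 1 if draws[j] < p_rasch(true_theta, pool[j]) else 0
        responses.append((pool[j], u))
        theta_hat = eap(responses)
    return theta_hat, seen

def two_path_gain(true_theta, pool, length, compromised, seed=0):
    """Score gain of the breached path over the clean path, and compromised items seen."""
    rng = random.Random(seed)
    draws = [rng.random() for _ in pool]  # one fixed draw per item, shared by both paths
    clean, _ = run_path(true_theta, pool, draws, length, set())
    breached, seen = run_path(true_theta, pool, draws, length, compromised)
    return breached - clean, seen
```

Running `two_path_gain` over many simulees would give the distribution of individual gains and of the number of compromised items seen, which is the kind of report the abstract describes.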
For further information: fguo@gmac.com
Eileen Talento-Miller and Fanmin Guo, Graduate Management Admission Council
Estimation of ability in computerized adaptive testing relies on the assumption that examinees are responding based on their content knowledge and skills. Guessing might have differential consequences on scoring depending on the situation. In the case of time constraints, examinees are faced with a choice of leaving questions blank or randomly responding. The current study provides guidance for examinees based on real data from an operational CAT. Previous research provides an incomplete picture of the effects of choosing a guessing strategy versus omitting items in the scoring of an operational CAT. The study expands on previous research by using operational as opposed to simulated data, comparing results in verbal and quantitative sections of a test, and framing the results to provide guidance for examinees.
In this study, scores from tests with responses classified as random guesses are compared to scores that would be observed if the items had not been reached. Items are classified as guesses by examining the distribution of latency for correct responses to determine a rapid guessing threshold. The threshold is then checked against the proportion of correct responses at that level to find close to chance levels of correctly answering the item. The guessing threshold of 10 seconds for verbal items and 7 seconds for quantitative items was applied to all item positions. Only consecutive rapid guesses at the end of the section were examined. Scores of examinees who guessed are recalculated to reflect ending the test and omitting the remaining items rather than guessing. Although the results tend to favor guessing as a strategy, the degree of difference varied based on section content, number of items involved, and estimated ability of the examinee. In the verbal section of the test, few differences existed between guessing scores and omit scores. In the quantitative section, the benefit of guessing became more pronounced as the number of items increased. The results are particularly intriguing when ability groups are compared. Both the verbal and quantitative sections show a slight preference for the omit strategy in the low ability group. For the high ability group, apparently severe penalties for omissions in the shorter quantitative measure appear to make guessing the unequivocal strategy of choice. Future research could include more definitive methods for determining random guessing and examine guessing at different positions within the test rather than merely at the end. Ultimately, the advice for candidates remains the same for a CAT as it would for other tests: Time management is important to allow ample opportunity to give thought to every question.
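A minimal sketch of the classification step, using the thresholds stated above (10 seconds for verbal, 7 seconds for quantitative) and counting only consecutive rapid responses at the end of a section. The function name and the choice to treat a latency strictly below the threshold as a rapid guess are assumptions, not details from the study.

```python
# Rapid-guessing thresholds in seconds, as reported in the abstract.
THRESHOLDS = {"verbal": 10.0, "quantitative": 7.0}

def trailing_rapid_guesses(latencies, section):
    """Count consecutive end-of-section responses faster than the threshold."""
    threshold = THRESHOLDS[section]
    count = 0
    for latency in reversed(latencies):
        if latency >= threshold:  # assumption: strictly below threshold counts as rapid
            break
        count += 1
    return count
```

The omit score would then be obtained by rescoring the test with the last `count` items treated as not reached.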
For further information: talento-miller@gmac.com
Ben Babcock and David J. Weiss,
This simulation study examined the performance of several CAT termination rules: four basic termination rules (standard error, minimum information, change in θ, and fixed length) and two combinations of standard error and minimum information termination. Four item banks were used: a flat information bank with 500 items, a peaked information bank with 500 items, a flat information bank with 100 items, and a peaked information bank with 100 items. Maximum likelihood scoring was used to estimate θ. For non-mixed response vectors, θ was incremented by 0.5. In addition to examining the performance of these termination criteria, the study was concerned with further examining the conclusion from previous research that variable-length CATs are more biased than fixed-length CATs (Chang & Ansley, 2003; Yi, Wang, & Ban, 2001). First, a number of variable-length CAT conditions were simulated. Then, the mean number of items administered for selected variable-length conditions was determined and fixed-length CATs were simulated with the appropriate number of items in order to properly compare variable- and fixed-length CATs. CAT performance was compared in terms of test length, as well as bias, RMSE, and correlation in the recovery of true θ.
As expected, longer CATs yielded more accurate θ estimation no matter which termination criterion was used, but there were diminishing returns with large numbers of items. It is recommended that CATs should administer a minimum number of 15 to 20 items to ensure stable measurement. The standard error termination rule, also known as the equiprecise measurement rule, performed the best among all the methods if the standard error cutoff was sufficiently low and the item bank contained the amount of Fisher information needed to reach the cutoff. Standard error termination was also quite efficient by administering relatively few items. Change in θ, a newer termination criterion, performed slightly worse than its fixed-length termination counterpart. Hybrid termination rules, such as combining minimum information and standard error termination, functioned the best when the item bank was small but had a peaked information function. The fixed-length CATs did not perform better than their standard error termination counterpart when equated for average test length. Previous findings stating that variable-length CATs are more biased than fixed-length CATs were the result of two procedural artifacts in prior research: (1) variable-length CATs were generally much shorter than the fixed-length CATs; and (2) most previous studies used Bayesian scoring, which biased the shorter variable-length CATs in the previous studies because the prior has more of an effect on θ estimation when there is less psychometric information. Standard error termination actually performed slightly better than fixed-length CATs of comparable mean length in estimating low true θ values.
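A hedged sketch of a hybrid termination check of the kind discussed above: stop when the standard error (the reciprocal square root of accumulated Fisher information) reaches a cutoff, when a maximum length is reached, or when no remaining item offers a minimum amount of information. The cutoff values, the 3PL information function, and the function names are illustrative assumptions, not the settings used in the study.

```python
import math

def info_3pl(theta, a, b, c):
    """Fisher information of one 3PL item at theta."""
    p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
    return a**2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c))**2

def termination_reason(theta_hat, administered, remaining,
                       se_cutoff=0.30, min_info=0.10, max_items=50):
    """Return the rule that stops the CAT, or None to continue.

    administered and remaining are lists of (a, b, c) item parameter tuples.
    """
    total = sum(info_3pl(theta_hat, *item) for item in administered)
    se = 1.0 / math.sqrt(total) if total > 0 else math.inf
    if se <= se_cutoff:
        return "standard error"
    if len(administered) >= max_items or not remaining:
        return "fixed length"
    if max(info_3pl(theta_hat, *item) for item in remaining) < min_info:
        return "minimum information"
    return None
```

A change-in-θ rule would need the previous estimate as an extra argument and is omitted here to keep the sketch short.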
For further information: babco062@umn.edu
Theo J. H. M. Eggen, CITO and Jasper T. Wouda, CITO, The
When classification into a limited number of categories is the main purpose of testing, algorithms based on sequential statistical testing have been shown to perform better than traditional estimation-based computerized adaptive tests (e.g., Reckase, 1983; Eggen & Straetmans, 2000). In these studies, the sequential probability ratio test (SPRT; Wald, 1947) is applied in order to decide whether more observations on items are needed and which classification decision is to be made. When a decision cannot be made with the predetermined decision error rates, in practice the procedure is always truncated at a maximum test length. Recently, Finkelman (2003, 2008) proposed an adaptation of stochastic curtailment with which he created an additional stopping rule for the SPRT. This “stochastically curtailed sequential probability ratio test,” or SCSPRT, generally follows the same rules as the conventionally truncated SPRT, including its stopping rule. However, the SCSPRT adds some rules in order to be able to stop testing in the cases where a change in decision between categories is possible, but unlikely. Finkelman (2003) introduced the method for the case of classifications in two categories and items selected to be most informative at the classification point. In this paper the generalization of the application of the SCSPRT to problems with more than two categories is discussed with a focus on the problems encountered in generalizing to the three-category problem. In general the (optimal) composition of the test cannot be fixed in advance when there is more than one cutting point, which is a requirement of Finkelman’s SCSPRT. The way the application of stochastic curtailment in combinations of SPRTs can be combined with the adaptive item selection in the test is described. The performance of the proposed procedures is illustrated by results of simulation studies.
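For orientation, the conventionally truncated two-category SPRT that the SCSPRT builds on can be sketched as follows. This is not the stochastically curtailed procedure itself. The 3PL model, the half-width `delta` of the indifference region, and the forced decision at the midpoint of the bounds on truncation are assumptions made for the sketch.

```python
import math

def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def sprt_decision(responses, items, cut, delta=0.2,
                  alpha=0.05, beta=0.05, max_items=None):
    """Wald SPRT for one cutting point; returns 'above', 'below', or 'continue'.

    responses: list of 0/1 scores; items: matching list of (a, b, c) tuples.
    """
    theta1, theta2 = cut - delta, cut + delta
    llr = 0.0
    for u, (a, b, c) in zip(responses, items):
        p1, p2 = p_3pl(theta1, a, b, c), p_3pl(theta2, a, b, c)
        llr += math.log(p2 / p1) if u else math.log((1.0 - p2) / (1.0 - p1))
    upper = math.log((1.0 - beta) / alpha)
    lower = math.log(beta / (1.0 - alpha))
    if llr >= upper:
        return "above"
    if llr <= lower:
        return "below"
    if max_items is not None and len(responses) >= max_items:
        # Truncation: force a decision at the midpoint of the bounds (assumed rule).
        return "above" if llr >= (upper + lower) / 2.0 else "below"
    return "continue"
```

With more than one cutting point, one such test would be run per cutting point and the results combined, which is where the item selection problem described in the abstract arises.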
For further information: Theo.Eggen@cito.nl
Nathan A. Thompson, Assessment Systems Corporation
A common application for adaptive testing is to classify examinees into mutually exclusive groups. Currently, the predominant psychometric termination criterion for designing computerized classification tests is the sequential probability ratio test (SPRT; Reckase, 1983) based on item response theory. This operates by formulating a hypothesis test that a given examinee’s ability value θ is equal to a fixed value below (θ1) or above (θ2) the classification cutscore. Recently, it was demonstrated that the SPRT, which only uses fixed values, is less efficient than a generalized form which tests whether a given examinee’s θ is below θ1 or above θ2 (Thompson, 2007). Moreover, this better represents the conceptual purpose of the exam, which is to test whether θ is above or below the cutscore.
The purpose of this study was to explore the specifications of the new generalized likelihood ratio (GLR). As with the SPRT, the efficiency of the procedure depends on the nominal error rates and the distance between θ1 and θ2 (Eggen, 1999). Preliminary results suggest that observed error rates are closest to nominally specified error rates when the values of θ1 and θ2 are approximately 0.1 from the cutscore. The study utilized a Monte Carlo approach, with 10,000 examinees simulated under each condition. Three levels of nominal accuracy were investigated (90%, 95%, and 99%), as well as 25 values of the difference between the cutscore and θ1 or θ2 (0.00 to 0.50 in increments of 0.2). Additionally, another formulation was investigated that forms the likelihood ratio based on an integration of the likelihood function. This was also suggested by Thompson (2007), but was not accurate due to the asymmetry of the likelihood function when the three-parameter model is used; the left-hand end of the likelihood function is substantially higher than the right-hand end because of the c parameter. This artificially biases the ratio in the negative direction. Methods of correcting for this are suggested.
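The contrast between the fixed-point SPRT and the generalized ratio can be sketched as below. The SPRT compares the likelihood at exactly θ2 and θ1, while the generalized form compares the best likelihood anywhere at or above θ2 with the best likelihood anywhere at or below θ1. Maximizing over a coarse grid on [-4, 4] is an assumption made for brevity, as are the function names; it is not presented as the author's implementation.

```python
import math

def log_lik(theta, responses, items):
    """3PL log-likelihood of a 0/1 response vector at theta."""
    total = 0.0
    for u, (a, b, c) in zip(responses, items):
        p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
        total += math.log(p) if u else math.log(1.0 - p)
    return total

def sprt_log_ratio(responses, items, cut, delta):
    """Fixed-point log likelihood ratio: theta2 versus theta1."""
    return (log_lik(cut + delta, responses, items)
            - log_lik(cut - delta, responses, items))

def glr_log_ratio(responses, items, cut, delta, lo=-4.0, hi=4.0, step=0.01):
    """Generalized log likelihood ratio using grid maxima on each side."""
    theta1, theta2 = cut - delta, cut + delta
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)] + [theta1, theta2]
    best_above = max(log_lik(t, responses, items) for t in grid if t >= theta2)
    best_below = max(log_lik(t, responses, items) for t in grid if t <= theta1)
    return best_above - best_below
```

Either ratio would then be compared with the same Wald bounds derived from the nominal error rates.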
For further information: nthompson@assess.com
Anne Thissen-Roe, Kronos
Over the years, most CAT systems have used score estimation procedures from item response theory. IRT models have salutary properties for score estimation, error reporting, and next-item selection. However, some testing purposes favor scoring approaches outside IRT. Where a criterion metric is readily available and more relevant than the assessed construct, for example in the selection of job applicants, a predictive model might be appropriate (Scarborough & Somers, 2006). Neither IRT scoring nor unidimensional assessment structure can be assumed. Yet, the primary benefit of CAT remains desirable: shorter assessments with minimal loss of accuracy due to unasked items. Without IRT, it remains possible to create a CAT system that produces an estimated score from a subset of available items, recognizes differential item information given the emerging item response pattern, and optimizes the accuracy of the score estimated at every successive item. No information is needed about the internal mechanisms of the scoring algorithm, provided it has certain properties: (1) The score must be discrete or able to be made discrete, such as by application of cut scores or reporting of integer scale scores. The score can be a nominal category; and (2) The degree to which the score changes when a particular item response is given must vary based on the responses to other items. If these conditions are met, the scoring algorithm can be treated as a "black box," with adaptation conducted on the outside. The method of multiple imputations (Rubin, 1987) might be used to simulate plausible scores given plausible response patterns to unasked items (Thissen-Roe, 2005). This method is also capable of rendering an estimate of the error introduced by unasked questions. Mutual information might then be calculated in order to select an optimally informative next item (or set of items). 
This is related but not identical to the methods of Weissman (2007) for item selection, and Chambless and Scarborough (2001) for feature selection.
Two neural network-centered scoring algorithms serve as structural examples. In early testing, previously observed response patterns to the complete assessments were resampled according to CAT item selection. The reproduced CAT scores were compared to full-length assessment scores. Approximately 95% accurate assignment of examinees to one of three score categories was achieved with a 70%-80% reduction in median test length. This method of CAT is more computationally demanding than traditional IRT-based approaches, due to the necessity of completely scoring some hundreds or thousands of response patterns per item selected. Factors influencing performance were also examined during early testing. Reducing the number of multiple imputations used is a way of reducing computation time; it appears to impact assignment accuracy less than limiting items presented under a confidence-based stopping rule. Computation time can also be reduced by sacrificing algorithmic simplicity to move repeated computations outside of the "black box;" however, such shortcuts impose a maintenance burden. Mixing "black box" CAT with Internet testing also requires minimizing the data size and frequency of transactions between client and server, for which the simplest algorithm is well suited.
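A minimal sketch of the item selection step, assuming a set of imputed complete response patterns is already available and that the scoring algorithm is an opaque function returning a discrete score category. Mutual information between each unasked item's response and the score is estimated from the imputations, and the item with the largest value is chosen. How the imputations are drawn is outside this sketch, and all names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in nats) of a list of discrete labels."""
    n = len(labels)
    return -sum((k / n) * math.log(k / n) for k in Counter(labels).values())

def mutual_information(imputations, item, score_fn):
    """I(item response; score) estimated over imputed complete patterns."""
    scores = [score_fn(pattern) for pattern in imputations]
    groups = {}
    for pattern, score in zip(imputations, scores):
        groups.setdefault(pattern[item], []).append(score)
    conditional = sum(len(g) / len(scores) * entropy(g) for g in groups.values())
    return entropy(scores) - conditional

def select_next_item(imputations, unasked, score_fn):
    """Pick the unasked item whose response is most informative about the score."""
    return max(unasked, key=lambda i: mutual_information(imputations, i, score_fn))
```

Each call scores every imputed pattern once per candidate item, which reflects the computational cost noted above; caching the scores across candidates is one obvious saving.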
For further information: anne.thissenroe@kronos.com
Monica M Rudick, Wern How Yam, and Leonard Simms
University at Buffalo, State University of New York
A variety of approaches have been implemented to create CAT personality assessments. Recent research has focused on IRT for CAT personality measures, although IRT is computationally complex and requires certain assumptions that do not always hold for personality measures. As a result, non-IRT-based CAT approaches, such as the countdown method, have also successfully been applied to CAT versions of personality measures. In the countdown method, there is some debate regarding whether classification or full-scores-on-elevated-scales (FSES) methods are preferable. In addition, it is unclear how order of item administration might impact item savings and the validity of scores. Both IRT and non-IRT based methods appear to yield numerous advantages for CAT assessments, most notably time and item savings, and ease of administration. However, these two methods have yet to be directly compared. The purpose of the present study was to compare non-IRT and IRT-based approaches utilizing real-data CAT simulations on a large diverse sample (N = 8,690) who completed the Schedule for Nonadaptive and Adaptive Personality (SNAP). The report focuses on the three longest SNAP Scales: Disinhibition (DIS), Negative Temperament (NT) and Positive Temperament (PT). Simulation analyses compared item savings, item and test information, test validity, and fidelity across the IRT- and non-IRT CAT methods. In addition, within the countdown method simulations, the simulations examined whether item presentation order impacted the results. Results will have implications for test developers wishing to apply CAT technology to personality measures.
For further information: mmrudick@buffalo.edu
Denise Reis Costa, Camila Akemi Karino, CESPE/University of Brasilia, Brazil
Fernando A. S. Moura, Federal University of Rio de Janeiro, Brazil
Dalton F. Andrade, Federal University of Santa Catarina, Brazil
One of the most important components of CAT is the set of procedures for item selection. Unlike traditional paper-and-pencil tests, adaptive procedures administer items that fit the examinee's level of proficiency. This selection is based both on the characteristics of the items (e.g., item difficulty or discrimination parameters) and on the estimated proficiency of the examinee. This study is a work-in-progress that aims to evaluate the performance of three different CAT item selection methods: the first is derived from the maximum information criterion, one of the most popular item selection methods in CAT; the second is based on the global information method as defined by Chang and Ying (1996), which uses the Kullback-Leibler measure; and the third is based on the predictive analysis defined by the expected maximum information criterion proposed by van der Linden (1998). To evaluate the three different methods, the answers of ten examinees with different skill levels were simulated for an item pool containing 246 items of the Instrumental English test of the University of Brasilia. The resulting database was fit by a three-parameter logistic model on a scale with mean 0.0 and standard deviation of 1.0, later transformed into a mean of 100 and standard deviation of 25. The examinees' iterative proficiencies were estimated using expected a posteriori (EAP). An initial analysis of bias and mean square error suggested that all methods performed similarly to estimate examinees’ proficiency. However, databank-related characteristics might have influenced those measures, since it is not yet an ideal item pool for CAT implementation. With these results, it can be concluded that there is no apparent statistical difference in relation to the proficiency estimation for the three presented methods for the analyzed item bank.
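Two of the computations mentioned above lend themselves to a short sketch: the Kullback-Leibler measure for a single 3PL item, integrated over an interval around the current estimate to give a global information index, and the linear transformation from the (0, 1) metric to the reporting scale with mean 100 and standard deviation 25. The interval half-width `delta`, the midpoint-rule integration, and the function names are assumptions for illustration.

```python
import math

def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def kl_item(theta0, theta, a, b, c):
    """Kullback-Leibler divergence of the item response at theta from that at theta0."""
    p0, p = p_3pl(theta0, a, b, c), p_3pl(theta, a, b, c)
    return p0 * math.log(p0 / p) + (1.0 - p0) * math.log((1.0 - p0) / (1.0 - p))

def kl_index(theta_hat, a, b, c, delta=1.0, steps=100):
    """Midpoint-rule integral of KL over [theta_hat - delta, theta_hat + delta]."""
    width = 2.0 * delta / steps
    return width * sum(
        kl_item(theta_hat, theta_hat - delta + (i + 0.5) * width, a, b, c)
        for i in range(steps)
    )

def to_reporting_scale(theta, mean=100.0, sd=25.0):
    """Linear transformation from the (0, 1) metric to the reporting scale."""
    return mean + sd * theta
```

Under this criterion the next item would be the unused item with the largest `kl_index` at the current proficiency estimate.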
For further information: denise@cespe.unb.br
Camila Akemi Karino, Denise Reis Costa, and Jacob Arie Laros
CESPE/University of Brasilia, Brazil
The possibility of applying different item sets according to the level of ability of each respondent has stimulated, among other factors, an increasing use of CAT. In spite of the increasing use, this study is one of the first initiatives in this field in Brazil. The item pool used in this study is a database of the proficiency exam in English language that has been in use since 2004 by the University of Brasilia. This exam aims to assess the student’s comprehension of texts in the English language. The exam is a paper-and-pencil test that is composed of 50 multiple-choice items. The psychometric item quality was verified using classical test theory and IRT. The complete item pool consists of 450 items divided into nine test forms. Each test form was answered by 330 students on average. The total number of respondents was 2,969. First, each test was analyzed individually and in a second stage the nine tests were calibrated jointly. Of the 450 items, 37 items were common items between test forms. In the individual analyses, 46 items with biserial correlation less than .20 and 80 items with discrimination parameter in the normal IRT metric less than .50 were eliminated. In the joint analysis, another 58 items with an a parameter less than .50 were eliminated. After the elimination of these items, the joint IRT analysis revealed a mean discrimination parameter of .77 (SD = .20), varying between .49 and 1.67. In relation to the b parameter, the existence of a substantial variation in difficulty level of the items was observed (varying between -3.56 and 3.23); however, the majority (75%) of the items showed a b parameter below .10. The median value of parameter c was .11 (SD = .04) with a range from .03 to .24. After the joint calibration, successive points of the scale were fixed for anchor items and each of these levels was interpreted pedagogically by specialists.
The suitability of the item pool for implementation of a CAT procedure was questioned taking into consideration that 44% of the items needed to be eliminated in order to agree with pre-established psychometric criteria. Nonetheless, both the analysis of the item pool and the scale interpretation permit initial studies for the implementation of a CAT procedure. The item pool as well as the scale could be improved by repeated applications of the English exam using a CAT procedure.
For further information: camilaakarino@gmail.com
Hollis Lai, Mark J. Gierl, and Cecilia Alves,
CAT makes tremendous demands on item banks because CATs require large numbers of test items. CATs require these item volumes for three general reasons. First, as test length increases in fixed-length CATs, requirements for test items increase to ensure that test scores are reliable (Wainer & Eignor, 2000). Second, with the emergence of cognitive adaptive tests (e.g., Zhou, Gierl & Cui, 2008), many more skills are measured at a finer grain size. Thus, more test items are required to measure these large numbers of specific skills. Third, item exposure and security concerns demand that item re-use rates be relatively small. That is, CAT requires a large number of unique test items in operational testing situations. One solution that could be developed to address these three issues is to generate many more items. Automatic item generation is an approach to item development where large numbers of offspring items (also called item instances) are generated from a parent item model. Although automatic item generation can potentially create hundreds and even thousands of items, its effectiveness is reliant on the availability of an efficient framework for creating the parent item models. The components in a parent item model for a multiple-choice item consist of the stem (the component of an item that forms the context of the question the examinee is required to answer), the options (a set of alternatives with one correct option and multiple distracters to answer the question), and any auxiliary information (e.g., pictures, graphs).
To identify possible item model types, Gierl, Zhou, and Alves (2008) developed a taxonomy to categorize and delineate the levels of variation in components of the parent item model. One limitation of the study by Gierl et al., however, was that it focused only on mathematics items. To be applied in diverse testing situations, item models need to be created in many different content areas to allow for automatic item generation. The present study will apply the taxonomy to item models from diverse content areas, including Language Arts, Social Studies, and Science, to generate items for a computer-based testing program. While there might have been other implementations of item generation, few have been documented (Irvine, 2002). Hence, the implication of the present study is to demonstrate a systematic way to generate test items that creates large numbers of items in diverse content areas, thereby lowering the cost of item development while maintaining a high level of quality in the development process.
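As a rough illustration of how offspring items can be generated from a parent item model, the following Python sketch fills a stem template with values drawn from variable ranges and builds the options from a key plus rule-based distracters. The item model, the variable names, and the distracter rules are hypothetical and are not taken from the taxonomy of Gierl, Zhou, and Alves (2008).

```python
import itertools

# Hypothetical parent item model: a stem template with two integer variables.
STEM = "A box holds {a} rows of {b} pencils. How many pencils are in the box?"
VARIABLES = {"a": range(2, 5), "b": range(6, 9)}


def generate_items(stem, variables):
    """Yield one offspring item per combination of variable values."""
    names = list(variables)
    for values in itertools.product(*(variables[n] for n in names)):
        binding = dict(zip(names, values))
        a, b = binding["a"], binding["b"]
        key = a * b
        # Distracters model plausible errors (assumed rules, for illustration only).
        distracters = [a + b, a * b + a, a * (b - 1)]
        options = sorted({key, *distracters})
        yield {"stem": stem.format(**binding), "options": options, "key": key}


items = list(generate_items(STEM, VARIABLES))
print(len(items))        # one item per value combination (3 x 3)
print(items[0]["stem"])
```

Varying only the numeric values keeps the stem and option structure fixed; item models that also vary wording or auxiliary information would need additional template slots.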
For further information: hollis.lai@ualberta.ca
An Approach to Implementing Adaptive Testing Using Item Response Theory in a Paper-Pencil Mode
V. Natarajan, MeritTrac Services Pvt. Ltd, INDIA
In India, most large-scale testing is conducted in paper-pencil (offline) mode, so it is important to arrive at models for implementing IRT in that mode. MeritTrac has experimented with conducting an IRT-based test in paper-pencil mode for the analytical abilities test for engineering graduates. Using item characteristics calculated prior to the test, a 6-item test with increasing item difficulty was created as a test form on paper. Research normally shows that a 6- or 10-item test of this kind can be compared to a test of 25 or more items. The test was then administered to the candidates in offline mode. The examinees' responses were then entered into student tracking software that had been specially coded for this purpose. Its output gives an estimate of the examinee's true score as if he/she had taken the parent 25-item test. Since it is not very feasible to conduct an online test everywhere, especially in a country like India, the importance of adaptive testing in offline mode increases manyfold. In this model, we need only a single computer with student tracking software and pre-published test forms consisting of items whose characteristics have been calculated on the basis of past responses. Thus the offline mode is much more practical and is as accurate as the online mode.
In the analytical abilities test, we looked at 100 items and the responses of 1,000+ examinees to each of these items, which we entered into BILOG to generate item difficulty values. Of these, 93 items were found to be relevant, and a parent test of 93 items eventually emerged. The items were grouped into 6 groups and 10 items were selected (one item each from the very easy and easy groups, and two items each from the below average, average, difficult, and very difficult groups). Several sets of 10-item adaptive tests were selected and administered to the examinees. Their responses to the 10 items were categorized in terms of 9, 8, 7, 6, 5, 4, 3, 2, or 1 correct, and a table was generated from which ability and true scores can be read. In this methodology, the test administrator needs to be very cautious when using the student tracking software so that mistakes are not made in entering the item numbers in the reshuffled version and the examinee's responses.
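A lookup table of this kind can be sketched in Python under the Rasch model, where the number-correct score is a sufficient statistic for ability, so each raw score on a fixed form maps to a single maximum-likelihood ability estimate. The abstract does not state which IRT model or scoring method was used, so the Rasch model, the bisection solver, and all item difficulties below are assumptions for illustration only.

```python
import math


def p_correct(theta, b):
    """Rasch probability of a correct response to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def theta_for_raw_score(raw, difficulties, lo=-6.0, hi=6.0):
    """ML ability for a raw score (0 < raw < n), found by bisection."""
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if sum(p_correct(mid, b) for b in difficulties) < raw:
            lo = mid   # expected score too low, so ability must be higher
        else:
            hi = mid
    return (lo + hi) / 2.0


def true_score(theta, parent_difficulties):
    """Expected number-correct score on the longer parent test."""
    return sum(p_correct(theta, b) for b in parent_difficulties)


# Hypothetical difficulties for a 10-item form and a 25-item parent test.
form = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.0, 0.5, 1.0, 1.5, 2.0]
parent = [-2.5 + 0.2 * i for i in range(25)]

# One table row per raw score: raw score, ability, projected true score.
for raw in range(1, len(form)):
    theta = theta_for_raw_score(raw, form)
    print(raw, round(theta, 2), round(true_score(theta, parent), 1))
```

Raw scores of 0 and 10 have no finite maximum-likelihood estimate, which is consistent with the table in the text covering 1 to 9 correct.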
For further information: madan@merittrac.com
Naomi Gafni, Keren Roded, and Michal Baumer
National Institute for Testing and Evaluation, Israel
Few studies have yielded information regarding the equivalence of high-stakes admissions tests administered via the Internet and paper-and-pencil administrations of those tests (Potosky & Bobko, 2004). Despite the lack of evidence regarding the equivalence of scores obtained in these two modalities, there is increasing demand for Internet-based testing, with the number of recruitment and admissions tests administered via the Internet constantly rising. This is largely due to the convenience and efficiency that the medium offers. The Psychometric Test, which is used for admission to institutions of higher education in Israel, is a high-stakes examination. The test consists of three sections: Verbal Reasoning (60 items), Quantitative Reasoning (60 items), and English as a Foreign Language (54 items). All items are in multiple-choice format. At the present time, most of the examinees take the paper-and-pencil version of the test. It is anticipated that Internet-based administration will be expanded. Given that this process will be gradual, and for a period of time the test will be administered in two parallel modalities, establishing the equivalence of scores is of paramount importance.
The goal of the present study was to compare the achievement of examinees who took the paper-and-pencil version of the Psychometric Test with the achievement of those who took it via the Internet. The question of equivalence arises because there are certain differences between a linear computerized test and a traditional paper-and-pencil test, and also between computerized tests administered via the Internet and those that are not. In the former case, the differences lie in the presentation of the items, the method of answering, how reading comprehension passages and questions with graphic components are presented, and in how time is allotted. Internet-based administration brings other factors into play, for example interruptions to the power supply, non-standard computers in different laboratories, Internet server problems, the impact of heavy traffic on the server, a greater risk of items being compromised and the challenge of handling problems during the administration itself. The relationship between performance on the experimental test and several background variables (based on a feedback questionnaire) was also examined. The participants were 381 examinees who registered for the October 2008 administration of the Psychometric Test. The paper-and-pencil version was given to 192 of these participants, and 189 were tested via the Internet. Assignment to the two groups was random. 370 of the participants in the experiment (185 from each one of the groups) took the actual Psychometric Test a month after the experimental administration.
The following conclusions are based on analysis of the results: (1) No significant difference was found between the scores of the two groups; (2) No significant differences were found between the scores on the Verbal Reasoning and Quantitative Reasoning sections; however, the English scores were significantly higher in the computerized version, across all item types; (3) The correlations between the overall experimental scores and scores on the actual test were 0.93 and 0.94 for the computer-based and paper-and-pencil groups, respectively; (4) The difference between the two groups in improvement in scores (between the experiment and actual test), both overall and for each section, was not significant; (5) The difference in scores between men and women was the same for both groups; and (6) The correlation between frequency of computer use and performance on the test was similar for both groups. Thus, it was found that the modality of administration, Internet-based or paper-and-pencil, did not affect examinee performance on the Psychometric Test. This also holds with respect to item types that we suspected would become more difficult when administered by computer. The results support simultaneous administration in two modalities.
For further information: naomi@nite.org.il
Shingo Imai, Y. Akagi, Yamaguchi University, Japan
K. Kikuchi, Toho University, S. Ito, TUFS, Japan
Y. Nakamura, Tokiwa University, Japan
H. Nakasono, Shimane University, Japan
A. Honda, APU, and T. Hiramura, TIT,
A CAT system called J-CAT (Japanese Computerized Adaptive Test), which is operational on the Internet or over a LAN, has been developed and used as a proficiency test of Japanese at the college level for international students in Japan. We discuss some features of this CAT system, focusing on the viewpoint of test administrators. The features discussed in this presentation include the registration method, item-pool management, and utilization of test results. We illustrate how this system registers examinees and authenticates them. We also discuss how to manage an item pool, including uploading items, setting IRT parameters, and setting answering time limits for each item. The system provides useful information for analyzing the results of a test. We highlight some features of a downloadable CSV file of examinee properties and test results. We show what information is available to an administrator and how an administrator might utilize the information. Examinees also receive feedback on their test results in a report that is automatically produced at the end of a test.
The J-CAT system, which at present contains items for Japanese proficiency, can also be used for tests other than Japanese language if the items are replaced with items from other tests. The system supports Rasch, two-parameter, and three-parameter IRT models.
For further information: imai2002@yamaguchi-u.ac.jp
Marina Chelyshkova and Victor Zvonnikov, State University of Management, Russia
At the present moment the majority
schools and universities of
For further information: mchelyshkova@mail.ru
Poh Hua Tay and
Raymond Fong, Ministry of Education,
Figural matrix items such as Raven’s Standard Progressive Matrices (SPM) are widely used for assessing general intelligence of pupils. Substantial manpower resources are incurred when administering tests on a large scale basis via paper-and-pencil (P&P). A computer-based test (CBT) would offer the advantages of logistical ease during the data collection stage, and administrative ease during the data entry stage; this is especially so for CAT, as it reduces administration time, as well. Unlike P&P and CBT, the most appropriate set of items in a CAT can be adaptively selected for each pupil based on his/her responses to previous items. This permits each pupil to be evaluated on a smaller subset of the total item pool, having better test experience as items are chosen based on his/her ability; and allows the test developer to control the error of measurement to a desired degree of precision.
In this study, an item bank of 195 figural matrix items that are similar to SPM’s was created. The psychometric properties of these items were then established after trialing them on a sample of 6,821 Primary 2 pupils (equivalent to Grade 2 pupils who are about 8 years of age) of varying academic abilities from 20 coeducational schools in Singapore. IRT was used to calibrate all the figural matrix items. From this item bank, a P&P prototype, two CAT prototypes (one starts with an easy item, while the other starts with an average item), and a CBT prototype were generated and administered, via the FastTEST Pro v2.3 platform, to four groups of Primary 2 pupils in Singapore. These groups consisted of a total of 948 Primary 2 pupils of varying academic abilities and were selected from 12 coeducational schools. SPM was also administered to all of them via P&P. This project was designed to study the comparability of the abilities of pupils estimated from the different prototypes (P&P, CATs, CBT) and SPM.
For further information: tay_poh_hua@moe.gov.sg
Jasper T. Wouda
and Theo J. H. M. Eggen, CITO, The
Computerized classification testing (CCT) can be used to increase efficiency in educational measurement. The truncated sequential probability ratio test (TSPRT) has been widely studied as a decision algorithm in CCT for two or more categories (Spray, 1993; Eggen, 1999). Finkelman (2003) added an algorithm to the TSPRT in the form of stochastic curtailment, to classify an examinee in an even earlier stage of testing. This stochastically curtailed SPRT (SCSPRT) halts testing when a change of classification is possible but unlikely. As can be seen in Finkelman (2003, 2008), the SCSPRT is an extension of the SPRT. It adds stochastic curtailment in the form of two extra stopping rules per level. Stochastic curtailment ceases testing and rejects hypothesis H01 if given k observations, the probability that a decision D will accept H01, Pr(D= H01), is not higher than a set value 1-γ. It stops testing and accepts H01 if this probability is at least γ. This method makes use of the sub-optimality of the SPRT as used in truncated tests.
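The SPRT decision rule on which the truncated and stochastically curtailed variants build can be sketched in Python as follows. This is a minimal sketch for two categories under the Rasch model; the half-width of the indifference region (`delta`), the error rates, and the function names are assumptions, and the extra stochastic curtailment stopping rules described above are not implemented here.

```python
import math


def p_correct(theta, b):
    """Rasch probability of a correct response to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def sprt_decision(responses, difficulties, cut, delta=0.5, alpha=0.05, beta=0.05):
    """Return 'pass', 'fail', or 'continue' for a two-category classification."""
    lo, hi = cut - delta, cut + delta            # indifference region around the cut score
    log_lr = 0.0
    for u, b in zip(responses, difficulties):
        p_hi, p_lo = p_correct(hi, b), p_correct(lo, b)
        if u:
            log_lr += math.log(p_hi / p_lo)
        else:
            log_lr += math.log((1.0 - p_hi) / (1.0 - p_lo))
    upper = math.log((1.0 - beta) / alpha)       # Wald's upper boundary
    lower = math.log(beta / (1.0 - alpha))       # Wald's lower boundary
    if log_lr >= upper:
        return "pass"
    if log_lr <= lower:
        return "fail"
    return "continue"


print(sprt_decision([1, 1, 0, 1], [0.0, 0.2, 0.5, -0.1], cut=0.0))
```

In a truncated test, a forced classification is made once the maximum test length is reached while the rule still returns "continue"; stochastic curtailment stops earlier when that eventual classification is unlikely to change.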
In the comparison of performance between the SPRT and SCSPRT (Finkelman, 2003, 2008), results showed a substantial decrease in number of items used per simulee for the SCSPRT, while the percentage of correctly classified simulees remained the same. When using real item parameters and realistic data (Wouda, 2008), this decrease became somewhat smaller, but was still substantial. However, in order to be applied in real-world tests, non-statistical constraints must also be considered. Different constraints include, for example, content balancing, answer key balancing, conflicting items and item exposure control. In this study, different constraint handling methods will be compared, together with different item selection methods. The applied constraints are content balancing and exposure control. The compared item selection methods will be selection of items at the θ estimate and selection of items at the cut-score. The methods for exposure control that will be compared for the SPRT and SCSPRT are the Sympson-Hetter method, the progressive method, and alpha-stratified testing. The methods for content balancing that will be compared are the Kingsbury and Zara (1989, 1991) approach and the weighted deviation method (WDM) by Stocking and Swanson (1993).
For further information: Jasper.Wouda@cito.nl
Jie Li and Xiang Bo Wang, ACT, Inc.
In addition to being highly efficient and accurate in terms of scoring, diagnosis, and reporting, CAT is also known for its global ease and reach of test delivery (Wainer et al., 2000; Meijer & Nering, 1999; Parshall, Spray, Kalohn, & Davey, 2002). However, the latter advantage of CAT also introduces a tenacious problem of potentially exposing items to a large number of examinees due to its high frequency of test administration, which is likely to increase advance knowledge (pre-knowledge) of items and to jeopardize score validity. Of great concern and interest to the entire educational testing industry is the possibility of validly detecting and tracking the extent to which CAT items are exposed.
The purpose of this research was (1) to establish population item response times and associated trends for all items in a large-scale international CAT assessment and (2) to investigate the feasibility of applying “effective response time” (ERT; Meijer & Sotaridona, 2006) to detect the extent and track the trend of item pre-knowledge on suspected compromised items on this assessment. The study was based on both operational and simulated data from a large item pool of a large-scale international CAT assessment. This item pool was selected because (1) it had a substantial number of new items that were pretested several years ago, when little or no item pre-knowledge could be assumed, and (2) these pretest items had a long history of operational use in subsequent years, when item pre-knowledge could have accumulated. ERT indices for both items and examinees, as described by Meijer & Sotaridona (2006), were computed for a large collection of new items at their pretest time, after they passed stringent pretest item quality reviews. The ERT indices from this round were used as null hypothesis benchmarks since no serious item pre-knowledge could be assumed. In addition, simulations were conducted to project the values of these ERT indices if examinees’ response times were reduced by one-half and one-fourth, respectively. Examinees’ ability estimates on the operational items of this item pool were used for ERT modeling. ERT indices were also computed when all the new items were first used operationally, and the results were compared with their pretest counterparts.
For further information: Jie.Li@Act.org
Patricia Rickard, CASAS,
James B. Olsen, Alpine Testing Solutions,
Debalina Ganguli, CASAS,
and Richard Ackermann, Team Code, Inc.
This paper presents and demonstrates innovations in computerized adaptive testing of adult workplace literacy and numeracy skills developed by CASAS and customized for the Singapore Employability Skills System (ESS). The Singapore Workforce Development Agency (WDA) plays a pivotal role in the implementation of the ESS “to enhance the employability and competitiveness of employees and job seekers, thereby building a workforce that meets the changing needs of Singapore’s economy.” CASAS has designed and developed CATs for mathematics, reading, and listening, and computer-delivered tests for writing and speaking, suitable for adults. The CATs are administered in secure proctored locations using local area networks and an electronic access key (dongle). This paper presents an overview of the project, demonstrations of sample test items from the test battery, presentation of the test delivery and administration system, review of test score results and psychometric analyses, and plans for future enhancements and extensions. The Singapore CATs use the following psychometric procedures: selection of initial item from a random proficiency value near the center of proficiency distribution of the selected item bank, Rasch model calibration and proficiency estimation, and a stopping rule based on a minimum standard error or administration of a specified maximum number of items. Results for the mathematics and reading CATs are presented showing scale score population distributions, stopping rule exit criteria, item exposure distributions, and ability estimate and standard error curves across the item administration sequence. The paper presents summary recommendations for enhancements and extensions with the CAT tests and additional CAT research and validity investigations.
The CAT results are based on examinee samples of approximately 12,000 for the reading tests and 9,000 for the numeracy tests.
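The psychometric procedures listed above (a random starting value near the center of the distribution, Rasch proficiency estimation, and a stopping rule based on a minimum standard error or a maximum number of items) can be sketched in Python as follows. This is an illustrative sketch, not the CASAS implementation: the item bank, the starting range, the bounded bisection estimator, the closest-difficulty selection rule, and the thresholds are all assumptions.

```python
import math
import random


def p_correct(theta, b):
    """Rasch probability of a correct response to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def estimate_theta(responses, difficulties, lo=-4.0, hi=4.0):
    """ML estimate by bisection on the score equation, bounded to [lo, hi]."""
    raw = sum(responses)
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if sum(p_correct(mid, b) for b in difficulties) < raw:
            lo = mid   # expected score too low, so ability must be higher
        else:
            hi = mid
    theta = (lo + hi) / 2.0
    info = sum(p_correct(theta, b) * (1.0 - p_correct(theta, b)) for b in difficulties)
    return theta, 1.0 / math.sqrt(info)


def run_cat(bank, answer, rng, se_target=0.4, max_items=30):
    """Administer items until SE <= se_target or max_items is reached."""
    remaining = list(bank)
    theta = rng.uniform(-0.5, 0.5)   # random starting value near the center
    used, responses, se = [], [], float("inf")
    while remaining and len(used) < max_items and se > se_target:
        # Under the Rasch model the most informative item has difficulty closest to theta.
        b = min(remaining, key=lambda d: abs(d - theta))
        remaining.remove(b)
        used.append(b)
        responses.append(answer(b))
        theta, se = estimate_theta(responses, used)
    return theta, se, len(used)


# Simulated examinee with a hypothetical true proficiency of 1.0.
rng = random.Random(0)
bank = [-3.0 + 0.05 * i for i in range(121)]
theta, se, n_items = run_cat(bank, lambda b: int(rng.random() < p_correct(1.0, b)), rng)
print(round(theta, 2), round(se, 2), n_items)
```

Because a Rasch item contributes at most 0.25 units of information, a standard error target of 0.4 cannot be met with fewer than 25 items, which is why both exit criteria matter in practice.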
For further information: rickard@casas.org
Robert J. Schneider, PDRI
Richard A. McLellan
and Tracy M. Kantrowitz, PreVisor, Inc.
Janis S. Houston and
Walter C. Borman, PDRI
This paper blends rigorous and innovative psychometric theory with a practical selection application. We used CAT principles to estimate examinees' personality trait levels through an iterative, IRT-driven, paired-comparison assessment process. The concept has its roots in Thurstone’s (1927) Law of Comparative Judgment. Thurstone conceived of using a paired-comparison procedure to scale stimuli on an interval scale. The idea was that if interval scale personality assessment could be generated with a paired-comparison procedure, then measurement might be made more precise than that yielded by typical Likert-type personality scales, which arguably provide only ordinal level data. Stark and Drasgow (1998) developed an algorithm to implement this process based on Zinnes and Griggs’ (1974) probabilistic unfolding model which, in turn, is based on (and extends) the work of Coombs (1950) and Thurstone (1927). Examinees select which of two statements representing different levels of a personality trait is more descriptive of them, and are then presented with two additional statements, based on their previous selection. Sequences of statement-pairs are selected in a manner that maximizes information in an IRT sense. Statement-pairs are presented for a given personality trait until either (1) a sufficiently low conditional standard error of measurement is reached, or (2) ten statement-pairs have been presented. This methodology has been used successfully in the Navy (Borman et al., 2001; Houston, Borman, Farmer, & Bearden, 2005). To our knowledge, however, our measure represents the first commercial application of CAT to the personality domain. Our test measures thirteen traits selected to represent the broad personality sphere and to be predictive across a wide range of occupations and industries. Our intent was to build in flexibility to create composites of scales relevant to a variety of different work populations to accommodate the differing needs of our clients.
This presentation reports initial validity results. Our CAT personality measure was administered to 1,607 first-line supervisors in eight organizations, each of whom was rated by his/her immediate supervisor. Sample sizes for predictor-criterion pairings ranged from n = 745 to 1,109. To identify a composite of scales relevant to the supervisory position, we conducted a relative weight analysis (Johnson, 2000) to identify the relative importance of each predictor based on its proportionate contribution to R². This procedure controls for multicollinearity among predictors by considering the unique effect of each predictor as well as its effect when combined with the other predictors. Six scales were identified and a weighted sum was computed. The estimated operational validity of the adaptive personality scale composite was .25 against an overall job performance criterion. Graphs showing validity coefficients associated with presentation of different numbers of statement-pairs will also be shown for each scale included in the personality composite, as well as for the composite itself. This information will be very useful in that it will indicate how many statement-pairs must be presented to reach stable (asymptotic) criterion-related validity estimates.
For further information: Robert.Schneider@pdri.com
Francisco J. Abad and David Aguado, Universidad Autónoma de Madrid
Juan Ramón Barrada, Universidad Autónoma de Barcelona
Julio Olea, Vicente Ponsoda, and Francisco J. Abad, Universidad Autónoma de Madrid
eCAT is a CAT developed and applied
in
For further information: fjose.abad@uam.es
Matthew Finkelman, Tufts University School of Dental Medicine
David J. Weiss, University of Minnesota
Gyenam Kim-Kang, Korea Nazarene University
In a paper presented at the 2007 GMAC CAT Conference, Kim-Kang and Weiss (2007, 2008) described a procedure for the adaptive measurement of change (AMC) for an individual examinee. In this procedure, a CAT is administered at Time 1 to an examinee and the final θ estimate from that CAT is used to begin a second CAT at Time 2 (a later point in time). The Time 2 CAT continues until the Time 2 95% confidence interval around its θ estimate does not overlap the Time 1 95% confidence interval; when this occurs “significant change” is said to have occurred for that examinee. Kim-Kang and Weiss compared the performance of the AMC procedure in measuring change with that of change scores from conventional tests based on raw difference scores, residual change scores, and IRT-based difference scores. Their results showed that AMC captured change better than all methods based on conventional tests under a variety of test configurations and levels of true change. They also demonstrated that the AMC procedure was efficient in detecting significant change, requiring an average of 6 to 22 items for different levels of true change.
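The confidence interval overlap criterion described above can be sketched in Python as follows. The sketch assumes symmetric normal-theory intervals of the form estimate ± 1.96 × standard error; the function name and the example values are hypothetical.

```python
def significant_change(theta1, se1, theta2, se2, z=1.96):
    """True when the Time 1 and Time 2 95% confidence intervals do not overlap."""
    lo1, hi1 = theta1 - z * se1, theta1 + z * se1
    lo2, hi2 = theta2 - z * se2, theta2 + z * se2
    return hi1 < lo2 or hi2 < lo1


# Time 1 estimate of 0.0 compared with two possible Time 2 estimates.
print(significant_change(0.0, 0.3, 1.5, 0.3))
print(significant_change(0.0, 0.3, 1.0, 0.3))
```

In a variable-length Time 2 CAT, this check would be evaluated after each item, with the Time 2 standard error shrinking as items are administered.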
The present study focused on the detection of change. Two new methods for testing the hypothesis of significant change for a single person were developed and compared to the confidence interval overlap approach. These methods were a likelihood ratio test approach and a Z-test approach. The power and alpha level of these two hypothesis testing methods were evaluated in the context of two CAT item selection methods—Fisher information and a variation of Kullback-Leibler information designed to select items in the context of AMC. The new methods were evaluated under subsets of conditions examined by Kim-Kang and Weiss. Results demonstrated that both the likelihood ratio and the Z-test method had better control of alpha error and had better power to detect smaller amounts of change than the confidence interval overlap method. Item selection method had minimal effect on either alpha or power, with a slight difference in favor of AMC-modified Kullback-Leibler information. The combination of Kullback-Leibler information and the Z-test provided slightly better results than other combinations. When used with variable-length CATs, the latter combination resulted in substantial reductions in test length at Time 2 while maintaining alpha levels and power comparable to Time 2 fixed-length CATs. Recommendations are made for the further development of the AMC procedure.
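The abstract does not give the form of the Z-test statistic. One common form, assumed here purely for illustration, treats the Time 1 and Time 2 estimates as independent and approximately normal, so that the difference is divided by the square root of the sum of the squared standard errors. The Python sketch below shows this assumed form; the function name and example values are hypothetical.

```python
import math


def z_test_change(theta1, se1, theta2, se2, critical=1.96):
    """Return (z, significant) for the difference between two trait estimates.

    Assumes independent, approximately normal estimates (an assumption of
    this sketch, not a detail taken from the abstract).
    """
    z = (theta2 - theta1) / math.sqrt(se1 ** 2 + se2 ** 2)
    return z, abs(z) > critical


print(z_test_change(0.0, 0.3, 1.0, 0.3))
```

Under this assumed form and equal standard errors, a difference is flagged once it exceeds about 2.77 standard errors (1.96 × √2), whereas non-overlapping 95% intervals require a difference of more than 3.92 standard errors. That arithmetic is in line with the reported advantage in power for smaller amounts of change.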
For further information: mattstat2000@yahoo.com
Yanyan Sheng, Southern Illinois University at Carbondale
A basic ingredient in computerized adaptive testing (CAT) is the item selection procedure that sequentially selects and administers items based on a person's responses to the previously administered items. For decades, maximum information (MI; Lord, 1977; Thissen & Mislevy, 2000) has been widely used as the conventional algorithm for item selection in CAT. However, this criterion based on Fisher’s information only targets the middle difficulty level, where a person has a probability of about 0.5 of answering an item correctly, and hence is not applicable in situations where a different percentile is desired. In addition, MI relies heavily on an accurate estimation procedure that works well in all testing situations. Nonetheless, studies have shown that such a procedure is not readily available.
The biased-coin up-and-down design (BCD; Durham & Flournoy, 1994) has been widely used in bioassay for sequential dosage level selection because it can target any arbitrary percentile in addition to being efficient (Bortet & Giovagnoli, 2005). As the problem in bioassay shares many similarities with CAT, it is reasonable to believe that the item selection algorithm based on the BCD, which does not rely on an accurate trait estimate in every step of CAT administrations, provides an efficient alternative to, while being more flexible than, the conventional method. The development of this selection algorithm is essential as schools, professional organizations, and private companies seek to make CAT flexible enough to be implemented in wider testing applications.
The purpose of this study was to illustrate the use of the BCD in CAT and further evaluate its utility by comparing it with the conventional MI algorithm. For ease of comparison, this study focused on the 1-parameter item response function. To investigate the utility of the BCD in CAT, two Monte Carlo simulation studies were conducted in which either a fixed or a random stopping rule was employed. With the fixed stopping rule, the number of items administered was manipulated (k = 5, 10, 30, 100) and the item pool was fixed to have 100 different difficulty levels, whereas with the random stopping rule, the number of different difficulty levels in the item pool was manipulated (n = 10, 30, 50, 100). In either case, CAT responses were simulated for persons whose actual trait levels were 0 (average), -1 (1 standard deviation below the average), and -2 (2 standard deviations below the average), and the target difficulty level was at the 20th, 50th, or 80th percentile. Each adaptive testing simulation began the trait estimation with an initial value of 0 and proceeded with the maximum likelihood method. The results suggested that item selection with the BCD is more flexible in targeting any arbitrary percentile of the difficulty levels. With respect to the accuracy of the trait estimation, MI performs slightly better with the fixed stopping rule, whereas with the random stopping rule the BCD is considerably better for tests with a small number of different difficulty levels or for persons whose trait levels are not at the extremes.
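The biased-coin up-and-down rule of Durham and Flournoy (1994) can be sketched for item selection in Python as follows. The mapping from bioassay to CAT is an assumption of this sketch: an incorrect response plays the role of a toxic outcome and a harder item plays the role of a higher dose, so the walk settles near the difficulty level where the probability of an incorrect response equals `gamma`. Only the case `gamma <= 0.5` is shown (higher targets use the mirrored rule), and the difficulty grid and function names are hypothetical.

```python
import math
import random


def p_correct(theta, b):
    """1-parameter (Rasch) probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def bcd_next_level(level, correct, gamma, n_levels, rng):
    """Biased-coin up-and-down step targeting P(incorrect) = gamma <= 0.5.

    Levels index a difficulty list sorted from easiest to hardest.
    """
    if not correct:
        return max(level - 1, 0)                  # incorrect: move to an easier item
    if rng.random() < gamma / (1.0 - gamma):      # biased coin after a correct response
        return min(level + 1, n_levels - 1)       # heads: move to a harder item
    return level                                  # tails: stay at the same level


def simulate(theta, difficulties, gamma, n_items, rng):
    """Return the sequence of difficulty levels visited by one simulated person."""
    level = len(difficulties) // 2
    visited = []
    for _ in range(n_items):
        visited.append(level)
        correct = rng.random() < p_correct(theta, difficulties[level])
        level = bcd_next_level(level, correct, gamma, len(difficulties), rng)
    return visited


rng = random.Random(0)
difficulties = [-3.0 + 0.25 * i for i in range(25)]
path = simulate(theta=-1.0, difficulties=difficulties, gamma=0.2, n_items=30, rng=rng)
print(path)
```

Note that the rule needs only the last response and a coin flip, not a trait estimate, which is the property the abstract highlights.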
For further information: ysheng@siu.edu
Richard J. Swartz, The University of
Seung W. Choi, Northwestern University Feinberg School of Medicine
Widely used in various educational and vocational assessment applications, CAT has recently begun to infiltrate the patient-reported outcomes (PRO) arena. Several differences exist between PRO-CAT and “achievement CAT.” Polytomous, rather than binary, items are more appropriate for PROs; constructs are often quasi-traits with skewed distributions; informative items cannot always be generated along the important range of the trait; and in many patient populations conditions exist so that patients cannot tolerate longer tests. Reducing this response burden has been one of the main reasons for consideration of CAT in the PRO arena. Although successful in reducing burden, many of the current CAT algorithms do not formally consider patient or examinee burden as part of the item selection process. In the PRO setting, many CAT applications simply limit the maximum number of items to be administered. This study uses a loss function approach motivated by decision theory to develop an item selection method that incorporates burden into the Maximum Fisher’s Information (MFI) item selection method.
We compared several different loss functions representing varying degrees of burden, including a no-burden condition as a baseline. An item bank of 62 polytomous items measuring depressive symptoms was used to compare the different methods. The items were calibrated with the graded response model using 730 patients and caregivers from the M. D. Anderson Cancer Center. For each condition, we used two different response datasets to simulate CAT instruments. One dataset consisted of the real responses from the 730 patients and caregivers who answered all the items. The second dataset consisted of simulated responses to all the items based on a grid of θ values with replicates at each grid point. The MFI-burden algorithm for item selection results in tests that are on average shorter (depending on the degree of burden) than those obtained using MFI alone, but without severely affecting the standard error of measurement. In particular, the loss function incorporating burden protects respondents from receiving longer tests when their estimated trait score falls in a location where there are few informative items. This is very useful in PRO assessment where burden to the patient is a concern.
For further information: rswartz@mdanderson.org
G. Gage Kingsbury, Northwest Evaluation Association
The characteristics of CAT change the characteristics of the field testing that is necessary to add items to an existing measurement scale. The process used to add field test items to a CAT might lead to scale drift (van der Linden & Glass, 2000; Ban et al., 2001). In addition to this measurement concern, adding randomly chosen field test items to a test might disrupt the performance of an examinee by administering items of inappropriate difficulty. The current study makes use of the transitivity of examinee and item in IRT to describe a process for adaptive item calibration. In this process an item is successively administered to examinees whose ability levels match the performance of a given field test item. By treating the item as if it were taking an adaptive test, examinees can be selected who provide the most information about the item at its momentary difficulty level. Throughout the calibration process, the momentary difficulty estimate is updated and used in the process of item selection for all examinees. The item calibration can be completed when a fixed number of examinees have seen the item of interest, or when the momentary difficulty level for the item stabilizes to a predetermined variability. This approach should provide a more efficient procedure for estimating item parameters. While the procedure is not specifically designed to create an optimal calibration sample in the manner described by Holman and Berger (2001), it should result in the item being administered to a set of individuals that more closely approximates optimality.
The process is described in detail within the context of the one-parameter logistic IRT model. The process is then simulated using 10 replications of the calibration of 100 items to identify whether it produces more accurate and efficient item parameter estimates than random presentation of field test items to examinees. Results indicate that adaptive item calibration is more accurate for small sample sizes. With additional research, adaptive item calibration might provide a viable approach to expanding item pools in settings with small sample sizes or settings with a need for large numbers of items.
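The calibration loop described above can be sketched in Python. This is only an illustration under assumptions that are not in the abstract: examinee abilities are treated as known, the difficulty estimate is updated with a simple stochastic-approximation step, and calibration stops after a fixed number of administrations. The function names, the step-size schedule, and the defaults are all hypothetical.

```python
import math
import random

def p_correct(theta, b):
    """Rasch (1PL) probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def calibrate_adaptively(true_b, abilities, n_admin=200, start_b=0.0, seed=1):
    """Estimate one item's difficulty by repeatedly giving it to the unused
    examinee whose ability is closest to the momentary difficulty estimate.

    n_admin must not exceed len(abilities).
    """
    rng = random.Random(seed)
    pool = list(abilities)
    b_hat = start_b
    for n in range(1, n_admin + 1):
        # Under the Rasch model an examinee is most informative about b
        # when theta equals b, so pick the closest remaining examinee.
        theta = min(pool, key=lambda t: abs(t - b_hat))
        pool.remove(theta)
        # Simulate the response from the (normally unknown) true difficulty.
        x = 1 if rng.random() < p_correct(theta, true_b) else 0
        # Assumed update rule: raise b_hat after an unexpected failure,
        # lower it after an unexpected success, with a shrinking step.
        step = 4.0 / (n + 3)
        b_hat += step * (p_correct(theta, b_hat) - x)
    return b_hat

if __name__ == "__main__":
    sample_rng = random.Random(0)
    examinees = [sample_rng.gauss(0.0, 1.0) for _ in range(2000)]
    print(calibrate_adaptively(true_b=1.0, abilities=examinees))
```

A variability-based stopping rule, the other termination option the abstract mentions, could replace the fixed `n_admin` by monitoring how much `b_hat` moves over recent administrations.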
For further information: gage.kingsbury@nwea.org
Sharon Klinkenberg, Marthe Straatemeier, and Han van der Maas,
We present a new model for computerized adaptive progress-monitoring. This model is used in the Math Garden, a web-based monitoring system, which includes a challenging web environment for children to practice arithmetic skills. The Math Garden is a CAT web application that tracks both accuracy and response time. Using a new model (Maris, in preparation) based on the Elo (1978) rating system and an explicit scoring rule, estimates of ability level and item difficulty are updated every trial. Items are sampled with a mean success probability of .75, making the tasks challenging yet not too difficult. By integrating the response time in the scoring rule, we try to compensate for the loss of information associated with the high success rates (van der Maas and Wagenmakers, 2005). In a period of eight months, our sample of 1,053 children completed over 850,000 arithmetic problems. The children completed about 25% of these problems outside their school hours. Results show good validity and reliability, high pupil satisfaction measured in playing frequency, and good diagnostic properties. The ability scores correlated highly with the Dutch norm-referenced general math ability scale of the pupil monitoring systems of CITO. Also, test-retest reliability analysis showed high correlations. In view of the satisfactory validity and reliability of the person ability estimators, our method opens the door to on-the-fly item calibration in low-stakes testing.
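To illustrate the kind of trial-by-trial updating an Elo-style system performs, here is a minimal Python sketch. It is not the Math Garden model: it scores accuracy only, whereas the abstract's scoring rule also uses response time, and the logistic expected score, the constant `k`, and the function names are assumptions made for illustration.

```python
import math

def expected_score(theta, beta):
    """Assumed logistic probability of a correct answer, given ability
    theta and item difficulty beta."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

def elo_update(theta, beta, correct, k=0.3):
    """Update the ability and difficulty estimates after one trial.

    The ability moves toward the observed outcome and the difficulty
    moves the opposite way by the same amount.
    """
    surprise = (1.0 if correct else 0.0) - expected_score(theta, beta)
    return theta + k * surprise, beta - k * surprise

def target_difficulty(theta, p_success=0.75):
    """Difficulty at which the expected success probability is p_success."""
    return theta - math.log(p_success / (1.0 - p_success))

if __name__ == "__main__":
    ability, difficulty = 0.0, target_difficulty(0.0)
    ability, difficulty = elo_update(ability, difficulty, correct=True)
    print(ability, difficulty)
```

Because both estimates change on every trial, items need no separate calibration phase, which is the sense in which such a system supports on-the-fly calibration.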
For further information: S.Klinkenberg@uva.nl
Ya-Hui Su, University of California, Berkeley
In CAT, if a group of examinees purposefully memorize items and distribute them to other prospective examinees, the fairness and accuracy of the CAT are undermined. Steffen and Mills (1999) investigated this effect and found that the more items were compromised and the more effective the cheating, the more severe the overestimation for the recipients, especially for those with low ability levels. Su, Chen, and Wang (2004) pointed out that the overestimation for the recipients was more severe when the sources had diverse ability levels, because more items were compromised. Su and Wang (2007) proposed an item exposure control procedure called the conditional Sympson and Hetter (Sympson & Hetter, 1985) online procedure with freeze control (denoted SHCOF). Results showed it to be superior to many other conventional procedures in terms of measurement and operational efficiency. To assess the cheating effect, Su and Wang (2008) used the SHCOF procedure in a CAT and found that it could obtain precise estimates for persons in real time, without requiring simulations to generate item exposure, under a unidimensional context. Little research has investigated cheating effects within a testlet context. Hence, it is of great value to ascertain whether SHCOF is also less affected by cheating between examinees under a testlet context when compared to a popular procedure such as the conditional multinomial method (SLC; Stocking & Lewis, 1998). The goal of this study was to use simulations to investigate how these two item exposure control procedures would perform under various cheating conditions. It was hypothesized that SHCOF would be less affected by cheating than SLC. Four independent variables were manipulated: (1) ability level of sources, (2) ability distribution of recipients, (3) cheating condition (no cheating, inefficient cheating, efficient cheating, and perfect cheating), and (4) item exposure control procedure (SHCOF and SLC). The root mean squared error (RMSE) was computed to describe the cheating effects; the more serious the cheating effect, the larger the RMSE. Under the no-cheating condition, there was no significant difference in RMSE between SHCOF and SLC. SLC showed more serious inflation of RMSE than SHCOF under the perfect cheating condition. As the cheating condition became more severe, the overestimation for the recipients became more severe when SLC was used. In addition, the more diverse the ability of the sources, the larger the RMSE and the mean positive bias. More importantly, SHCOF had smaller RMSE than SLC. This was because only SHCOF could simultaneously monitor item exposure and test overlap rates online. SHCOF could obtain precise estimates for persons without requiring simulations to generate item exposure before use in an operational CAT. If test items are memorized by sources and shared with recipients, CAT becomes unfair because the ability levels of the recipients will be overestimated. In this study, SHCOF was less affected by cheating than SLC. Hence, the SHCOF procedure can be safely implemented in operational CAT.
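The two outcome measures named in the abstract, RMSE and mean bias, can be computed as in the following Python sketch. The function names and the toy numbers are hypothetical; the study's own data and code are not available here.

```python
import math

def rmse(estimates, truths):
    """Root mean squared error between estimated and true ability values."""
    squared = [(e - t) ** 2 for e, t in zip(estimates, truths)]
    return math.sqrt(sum(squared) / len(squared))

def mean_bias(estimates, truths):
    """Mean signed error; a positive value indicates overestimation."""
    errors = [e - t for e, t in zip(estimates, truths)]
    return sum(errors) / len(errors)

if __name__ == "__main__":
    # Toy values only: recipients whose abilities are overestimated.
    true_theta = [-1.0, 0.0, 1.0]
    estimated_theta = [-0.4, 0.5, 1.2]
    print(rmse(estimated_theta, true_theta), mean_bias(estimated_theta, true_theta))
```

RMSE grows with errors in either direction, while mean bias isolates the systematic overestimation that item preknowledge is expected to produce, so the two are reported together.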
For further information: yahuisu@berkeley.edu
The Nine Lives of CAT-ASVAB: Innovations and Revelations
Mary Pommerich, Daniel O. Segall, and Kathleen E. Moreno, Defense Manpower Data Center
The Armed Services Vocational Aptitude Battery (ASVAB) is administered annually to more than one million military applicants and high school students. ASVAB scores are used to determine enlistment eligibility, assign applicants to military occupational specialties, and aid students in career exploration. The ASVAB is administered as both a paper-and-pencil (P&P) test and a CAT. CAT-ASVAB holds the distinction of being the first large-scale adaptive test battery to be administered in a high-stakes setting. Approximately two-thirds of military applicants currently take CAT-ASVAB; long-term plans are to replace P&P-ASVAB with CAT-ASVAB at all test sites. Given CAT-ASVAB’s pedigree (approximately 20 years in development and 20 years in operational administration), much can be learned from revisiting some of the major highlights of CAT-ASVAB history. This paper traces the progression of CAT-ASVAB through nine major phases of development, including research and development of the CAT-ASVAB prototype, the initial development of psychometric procedures and item pools, initial and full-scale operational implementation, the introduction of new item pools, the introduction of Windows administration, the introduction of Internet administration, and research and development of the next-generation CAT-ASVAB. Background and history are provided for each phase, including discussions of major research and operational issues, innovative approaches and practices, and lessons learned.
For further information: mary.pommerich@osd.pentagon.mil
The CAT-DI Project: Development of a Comprehensive CAT-Based Instrument for Measuring Depression
Robert D. Gibbons, University of Illinois at Chicago
The combination of IRT and CAT has proven invaluable in educational measurement. More recently, enormous reductions in patient and physician burden have been demonstrated using IRT-based CAT in the area of mental health measurement (Gibbons et al., 2008). CAT administration of a 626-item mood and anxiety spectrum disorder inventory revealed that an average of 24 items per examinee was required to provide impairment estimates with a correlation of 0.93 with the original complete scale. Furthermore, the CAT-based scores revealed twice the effect size of the total scale score in terms of differentiating patients with bipolar disorder based on the mood disorder subscale, despite an 83% reduction in the average number of items administered. These preliminary findings led to further interest and funding by the National Institute of Mental Health to develop a CAT-based instrument for the screening of major depressive disorder (the CAT Depression Inventory, or CAT-DI) that can be used for routine screening of depression in general medical practice settings as well as specialty mental health clinics. A recent supplement to the parent CAT-DI grant extends our work on CAT for mental health measurement to CAT for diagnostic assessment of depression and other psychiatric disorders. The CAT Major Depressive Disorder (CAT-MDD) project will explore four different statistical/psychometric models for estimating the probability of an underlying discrete major depressive disorder based on self-administered symptom ratings that are adaptively administered. The ultimate objective of this program of research is to reduce patient and physician burden in terms of screening and diagnosing depression in general practice settings. Potential benefits include reduction in health care costs produced by high rates of service utilization among patients with an undiagnosed depressive illness, increased detection of depressive disorders, and increased access to quality mental health care for patients in need of such services.
For further information: rdgib@uic.edu
Development of a CAT to Measure Dimensions of Personality Disorder: The CAT-PD Project
Leonard J. Simms,
This presentation describes the CAT-PD project, a funded, multi-year study designed to develop an integrative and comprehensive model and measure of personality disorder trait dimensions. Our general study aims are to (1) identify a comprehensive and integrative set of dimensions relevant to personality pathology, and (2) develop an efficient CAT method (the CAT-PD) to measure these dimensions. To accomplish our general goals, we plan a five-phase project to develop and validate the model and measure. The presentation describes the project generally, the results of Phase I (which is focused on content domains and initial item bank development), and our plans for IRT/CAT with these item banks. In particular, I will focus on how the item banks will be used, the possible IRT models we are considering for item bank calibration, the CAT algorithms we are planning to test, and our methods for deciding on a final set of procedures for the completed CAT-PD measure. Finally, I will discuss the CAT and IRT challenges that we anticipate facing in the future.
For further information: ljsimms@buffalo.edu
The MEDPRO Project: An SBIR Project for a Comprehensive IRT and CAT Software System
The IRT Software
David Thissen, The
and Scientific Software International
The IRTPRO (Item Response Theory for Patient-Reported Outcomes) component of the MEDPRO Project is an entirely new application for item calibration and test scoring using IRT. Release of this software is anticipated in fall 2009; this presentation briefly describes its features, user interface, and output. IRTPRO provides maximum likelihood calibration of items fitted with the 1PL, 2PL, 3PL, Graded, Generalized Partial Credit, and Nominal IRT models in any combination, using one of three estimation algorithms: (1) Bock-Aitkin EM, (2) adaptive quadrature, or (3) Metropolis-Hastings Robbins-Monro (MHRM). Unidimensional or multidimensional IRT models can be used; among multidimensional models, the implementation performs full-information estimation for exploratory and confirmatory models, including the special-case treatment appropriate for bifactor models. Analysis of differential item functioning (DIF) is also provided, using the Wald test, with accurate item parameter error variance-covariance matrices computed using the Supplemented EM (SEM) algorithm. Several goodness-of-fit and diagnostic statistics are reported. Standard maximum a posteriori (MAP) and expected a posteriori (EAP) estimates of the latent variable(s) for item response patterns can be computed, as well as (weighted) summed-score to scale-score translation tables.
For further information: dthissen@email.unc.edu
The CAT Software
Nathan A. Thompson, Assessment Systems Corporation
The CAT software for MEDPRO is designed to provide a comprehensive environment for the design and delivery of CATs. It consists of two main components, CATSIM and FastCAT, in a package called CATPRO (Computerized Adaptive Testing for Patient-Reported Outcomes), which will be designed to interface with IRTPRO. CATSIM will be a major expansion of Assessment Systems Corporation’s (ASC) POSTSIM software. CATSIM will implement post-hoc simulations, Monte Carlo simulations, and hybrid simulations of CATs. New features in CATSIM will include CAT for polytomous IRT models, item selection constraints (content balancing, item exposure controls, and “enemy” items), and an expanded set of termination options. FastCAT will be an expansion of ASC’s FastTEST Professional Testing System that includes all the options in CATSIM applied to the delivery of live CATs in a Windows environment. Output from both CATSIM and FastCAT will optionally be available in formats directly importable into IRTPRO for analysis, and the parameter output from IRTPRO will be directly importable into both CATSIM and FastCAT.
For further information: nthompson@assess.com
Juan Ramón Barrada and Julio Olea, Universidad Autónoma de Barcelona, Spain
Vicente Ponsoda, Universidad Autónoma de Madrid, Spain
Francisco J. Abad, Universidad Autónoma de Madrid, Spain
Test security is a major concern in CAT because of the possibility of item sharing between examinees. A CAT is considered more secure the lower the overestimation of the examinee’s trait level due to item preknowledge. The common measures of test security have been the overlap rate between examinees and the distribution of item exposure rates. Usually, these indicators of test security have been evaluated when no item disclosure is present. We argue that lower overlap rates or less skewed distributions of item usage might not lead to safer CATs. The main ways of increasing security are to reduce (1) the probability of item preknowledge of the first items administered, and (2) the overlap rate for high trait levels. Under these conditions, there would be many different routes to a high trait level estimate, and it would be difficult for an examinee with item preknowledge to follow one of these routes. Progressive and proportional methods offer these characteristics. We show that these two methods are safer than the alpha-stratified method, a method with a much lower overlap rate. In fact, when the alpha-stratified method is applied, there is a “golden source of information”: an examinee with a high trait level sharing item content is the best option for increasing trait estimates. When the progressive or proportional methods are applied, there is no source of information that suits all possible recipients. With these two methods, recipients and sources must have similar trait levels for sharing to produce an important increase in the trait estimate.
For further information: juanramon.barrada@uab.es
Michael Chajewski and Charles Lewis,
Much of the IRT and item exposure control literature regarding CAT has focused on the impact of exposure control algorithms on frequency of item use, estimation precision, test bias, and overlap, as well as item pool utilization and observed root mean square error rates. However, most inquiries into these issues have been limited to fairly large educational assessment item banks, which are less common in other areas into which CAT has been expanding. This paper discusses the results of a simulation study that focused on the pairing of item exposure control algorithms and test termination criteria within the specific framework of polytomous CATs using restricted item banks. Based on prior comparative and exploratory research by Chang and Twu (1998), Revuelta and Ponsoda (1998), Pastor, Dodd, and Chang (2002), French and Thompson (2003), Davis (2002; 2004), Davis and Dodd (2005), Barrada, Mazuela, and Olea (2006), Georgiadou, Triantafillou, and Economides (2007), and Barrada, Olea, and Abad (2008), six item exposure control algorithms and four test termination criteria were selected. Item exposure controls included the progressive-restricted maximum information method, Stocking and Lewis conditioning on estimated ability, target exposure control (TEC), Sympson-Hetter conditional strategy (SHC), 0-1 α-stratified strategy (0-1STR), and the combined α-stratified Sympson-Hetter method (STR-SH). The impact of these six algorithms was evaluated in their optimization of small item bank adaptive instruments using fixed-length or fixed standard error (or Fisher target information) test termination criteria. Like educational assessments with large item banks, restricted item bank CATs face issues regarding test security. Item exposure control algorithms are used to keep any given item from being delivered too many times. Non-cognitive assessments, which might also be high stakes, face an even greater need for test security since there are fewer items available. Alternatively, non-high-stakes instruments might need to use item exposure control algorithms for content validity purposes. Results are discussed in the framework of restricted item bank CATs such as non-cognitive psychological assessments and consumer survey evaluations.
For further information: chajewski@fordham.edu
Xin Li, Kirk A. Becker, and Jerry L. Gorham, Pearson VUE
Ada Woo, NCSBN
Item exposure control has become a critical and practical issue since CAT was widely implemented in test administration. Strategies for controlling item exposure have been developed to prevent overexposure of items while maintaining measurement precision. Randomization and conditional selection are two major types of exposure control techniques (Way, 1998). Randomization procedures introduce a random component to control item exposure. Kingsbury and Zara (1989) proposed the “randomesque” method, which randomly selects one item out of a prespecified number of the most informative items throughout the test. Another method, designed by Lunz and Stahl (1998), randomly selects from all items within a logit range of the optimal item difficulty. Alternatively, conditional selection strategies impose an exposure control parameter for each item given that it is selected. The Sympson-Hetter method (Sympson & Hetter, 1985) and modifications of this procedure are reviewed in Georgiadou, Triantafillou, and Economides (2007). The most recent, presented by Barrada, Veldkamp, and Olea (2009), is the multiple maximum exposure rate (rmax) method, which defines as many values of rmax as there are items. Chang and Ying (1999) also proposed an a-stratified CAT to limit the exposure of items with high discrimination by restricting their selection until θ estimates have stabilized. While adaptive tests using the Rasch model do not have exposure issues due to the item discrimination parameter, there can be problems with exposure for certain ranges of item difficulty. A Rasch analog of b-stratified adaptive testing to control exposure in a key difficulty range was investigated in this paper.
Numerous studies have been conducted to evaluate the effectiveness of a variety of algorithms that modify the CAT selection process to control item exposure. Their strengths and weaknesses have been discussed for different models using dichotomous scoring, polytomous scoring, and testlet-based CATs. However, no studies have focused on exposure of items within a particular range, especially those items with difficulty near the cut score on variable-length adaptive tests. The CAT algorithm tends to administer these items too often under maximum item information selection. Overexposure of items might affect item parameter estimates and potentially the integrity of the test. This research investigated multiple methods for limiting exposure of items near the cut score and evaluated the results for measurement precision. Response data from a large-scale live CAT licensure exam were used to obtain the known item parameters for simulation. θ values for simulees were distributed according to the population distribution of final θ estimates on the live test. Four procedures were employed for controlling exposure of items near the cut score in a CAT: the Kingsbury-Zara method, the “within-.10-logits” method, the rmax method, and a stratified-b method. They were compared to a baseline condition with no exposure control. The performance of these procedures was evaluated first for measurement precision by the standard error of measurement. Other variables associated with test security include exposure rates, utilization of the item pool, and item overlap across test administrations.
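Two of the randomization procedures mentioned above can be sketched in Python for a Rasch item bank. This is only an illustration: the function names, the default group size `k`, and the choice to return `None` when no item lies within the logit window are assumptions, not details taken from the study.

```python
import math
import random

def rasch_information(theta, b):
    """Fisher information of a Rasch item at ability theta: P(1 - P)."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def randomesque_select(theta, difficulties, administered, k=5, rng=random):
    """Pick one item at random from the k most informative unused items."""
    candidates = [i for i in range(len(difficulties)) if i not in administered]
    candidates.sort(key=lambda i: rasch_information(theta, difficulties[i]),
                    reverse=True)
    return rng.choice(candidates[:k])

def within_logits_select(theta, difficulties, administered, width=0.10,
                         rng=random):
    """Pick at random among unused items whose difficulty lies within
    `width` logits of theta (the optimal difficulty under the Rasch model).
    Returns None if no such item exists (an assumed fallback)."""
    near = [i for i, b in enumerate(difficulties)
            if i not in administered and abs(b - theta) <= width]
    return rng.choice(near) if near else None

if __name__ == "__main__":
    bank = [-2.0, -1.0, -0.1, 0.0, 0.2, 1.0, 2.0]
    print(randomesque_select(0.0, bank, administered={3}, k=2))
    print(within_logits_select(0.0, bank, administered={3}))
```

Both functions trade a little information per item for a flatter exposure distribution; a larger `k` or `width` spreads exposure further at a greater cost in precision.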
For further information: Xin.Li@Pearson.com
Po-Hsi Chen, Taiwan Normal University
The goal of the research was to compare two new Bayesian estimation methods, adaptive Bayesian estimation and weighted Bayesian estimation, in multidimensional computerized adaptive testing (MCAT). Monte Carlo simulation and a multidimensional item response model, the multidimensional random coefficients multinomial logit model (Wang, Wilson, & Adams, 1997), were used in this research. Two-dimensional CATs of 10 to 60 items were used with adaptive Bayesian, weighted Bayesian, and traditional Bayesian estimation. The dependent variables were conditional bias and the root mean square error (RMSE). Results indicated that these two new Bayesian approaches resulted in less regression bias than traditional Bayesian estimation; however, weighted Bayesian estimation was more stable than adaptive Bayesian estimation. The applications and suggestions for use of weighted Bayesian estimation are addressed.
For further information: chenph@ntnu.edu.tw
Qi Diao and Mark Reckase, Michigan State University
The impetus for this research is the lack of guidelines for designing multidimensional computerized adaptive tests (MCATs). There has been some research on the properties of ability estimation and item selection methods in unidimensional CAT (e.g., Weiss & McBride, 1984; van der Linden & Pashley, 2000). However, in the literature on MCAT, most studies use a single ability estimation and item selection method because they focus on other aspects of adaptive testing (e.g., Li, Ip, & Fuh, 2008). The only study comparing different ability estimation and item selection methods for MCAT is Tam (1992), but that was before most currently used methods (e.g., Segall, 1996; Veldkamp & van der Linden, 2002) were developed. Also, most of the research has used two-dimensional cases, but we believe at least three dimensions are needed. In the proposed study, three ability estimation methods were compared. The first is the general maximum likelihood method (Segall, 1996). A problem with maximum likelihood is that estimates of location are not finite when the number of test items is small. One solution offered in Reckase (2009) is fixed-step-size maximum likelihood, which updates the estimates of ability location by a fixed increment when infinite estimates are encountered. The third method is Bayesian estimation (Segall, 1996).
In the proposed study, four item selection methods were compared. The first is maximizing the determinant of the Fisher information matrix (Segall, 1996). The second is minimizing the trace of the inverse of the Fisher information matrix (Mulder & van der Linden, 2008). The third is maximizing the decrement in the volume of the Bayesian credibility ellipsoid (Segall, 1996). The last is maximizing the Kullback-Leibler information (Veldkamp & van der Linden, 2002). The ability estimation and item selection methods were compared under different priors and test lengths. The item pool was simulated based on data from the Michigan Educational Assessment Program (MEAP) mathematics test for 7th graders. Mean bias and mean squared error (MSE) were used as measures of estimation precision. Tests of length 20 and 50 were generated and the results were compared. To test the impact of priors on the Bayesian method, a multivariate normal distribution with mean 0 and an identity variance-covariance matrix as in the real MEAP 2005 data were used, and final ability estimates were compared. The maximum likelihood estimation method did not perform well for the test length of 20. When the test length was 50, the estimates were much better. The fixed-step-size maximum likelihood method fixed the problem of estimates not converging, and the results were comparable to those of the Bayesian method. Bayesian estimates were regressed toward 0 because Bayesian estimates tend to be statistically biased toward the mean of the prior. Their standard errors were smaller than those of the maximum likelihood method. Maximizing the determinant of the Fisher information matrix and minimizing the trace of the inverse of the Fisher information matrix performed comparably. When Bayesian ability estimation was used, Kullback-Leibler information performed slightly better than the Bayesian item selection method at a test length of 20. Those two methods were comparable at a test length of 50.
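The first two item selection rules above (maximizing the determinant of the Fisher information matrix, and minimizing the trace of its inverse) can be sketched in Python for the two-dimensional case. The sketch assumes a compensatory two-dimensional 2PL item with discrimination vector `a` and intercept `d`, and it assumes the accumulated information matrix is already nonsingular, for example because a prior precision matrix has been added. All names are hypothetical, and the study itself argues for at least three dimensions.

```python
import math

def item_info(theta, a, d):
    """2x2 Fisher information of a two-dimensional 2PL item: P(1-P) * a a^T."""
    z = a[0] * theta[0] + a[1] * theta[1] + d
    p = 1.0 / (1.0 + math.exp(-z))
    w = p * (1.0 - p)
    return [[w * a[0] * a[0], w * a[0] * a[1]],
            [w * a[1] * a[0], w * a[1] * a[1]]]

def add(m, n):
    """Element-wise sum of two 2x2 matrices."""
    return [[m[r][c] + n[r][c] for c in range(2)] for r in range(2)]

def det(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def trace_inverse(m):
    """Trace of the inverse; for a 2x2 matrix this is trace(M) / det(M)."""
    return (m[0][0] + m[1][1]) / det(m)

def select_item(theta, items, current_info, rule="D"):
    """Return the index of the candidate that maximizes the determinant
    (rule "D") or minimizes the trace of the inverse (rule "A") of the
    information matrix after adding that candidate."""
    def score(i):
        total = add(current_info, item_info(theta, *items[i]))
        return det(total) if rule == "D" else -trace_inverse(total)
    return max(range(len(items)), key=score)

if __name__ == "__main__":
    # Information so far is weak on the second dimension.
    info_so_far = [[1.0, 0.0], [0.0, 0.2]]
    pool = [((1.5, 0.0), 0.0), ((0.0, 1.5), 0.0)]
    print(select_item((0.0, 0.0), pool, info_so_far, rule="D"))
    print(select_item((0.0, 0.0), pool, info_so_far, rule="A"))
```

In this toy case both rules prefer the item that loads on the weakly measured dimension, which is consistent with the abstract's finding that the two criteria performed comparably.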
For further information: diaoqi@msu.edu
Chun Wang and Hua-Hua Chang, University of Illinois at Urbana-Champaign
In adaptive testing, items are selected sequentially to match the updated ability of the examinee. Numerous item selection algorithms for item pools calibrated under unidimensional IRT models have been well developed. However, the assumption of unidimensionality can be easily violated, especially when the test covers broad content areas. In the presence of multidimensionality, instead of obtaining m separate unidimensional ability estimates, multidimensional IRT (MIRT), which provides an m-dimensional vector estimate, might be a better choice. Previous researchers have shown that this kind of simultaneous estimation of abilities from different dimensions yields more accurate estimates, since it takes into account the correlational structure of those abilities. Built on MIRT, multidimensional adaptive testing (MAT) can, in principle, provide a promising choice in ensuring efficient estimation of each ability dimension. Currently, two item selection procedures have been developed for MAT, one based on Fisher information embedded within a Bayesian framework, and the other using Kullback-Leibler information. Since Fisher information extends to a matrix, instead of a single value in multidimensional ability space, item and test information are no longer independent of each other. Therefore, the nice additive property of Fisher information does not apply to MAT. Alternatively, Kullback-Leibler information remains a single value and thus keeps its additive property.
It is well known that in unidimensional IRT, the second derivative of K-L information (also termed “global information”) is Fisher information evaluated at the true ability level. This paper first generalizes the relationship between these two types of information in two ways: the analytical result is given, as well as a graphical representation to enhance interpretation and understanding. It is shown that the complete Fisher information matrix can be easily recovered from K-L information, and that the diagonal elements of the matrix equal the curvature of the K-L information curve, evaluated with respect to each dimension separately. Second, a K-L information index is constructed in MAT, which represents the integration of K-L information over all of the ability dimensions. Geometrically, this index is analogous to the volume under the information surface when only two dimensions are considered. This paper further discusses how this index correlates with the item discrimination parameters. In the two-dimensional case, an analytical derivation shows that the size of the K-L information index depends largely on the sum of the squared item discrimination parameters, which is also termed “multidimensional discrimination.” The results lay a foundation for future development of item selection methods in MAT that can help equalize item exposure rates. Finally, a simulation study will be conducted to verify the above results. The connection between the item parameters, item K-L information, and item exposure rate is demonstrated for an empirical MAT delivered from an item pool calibrated under two-dimensional IRT.
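The unidimensional relationship the abstract starts from can be checked numerically. The Python sketch below assumes a 2PL item and uses a central finite difference to approximate the second derivative of K-L information with respect to the candidate ability, evaluated at the reference ability; the function names and parameter values are hypothetical, and this is not the paper's multidimensional derivation.

```python
import math

def p2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def kl_information(theta0, theta, a, b):
    """K-L divergence between the response distributions at theta0 and theta."""
    p0, p = p2pl(theta0, a, b), p2pl(theta, a, b)
    return (p0 * math.log(p0 / p)
            + (1.0 - p0) * math.log((1.0 - p0) / (1.0 - p)))

def fisher_information(theta, a, b):
    """Fisher information of a 2PL item: a^2 * P * (1 - P)."""
    p = p2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def kl_curvature(theta0, a, b, h=1e-3):
    """Central-difference second derivative of K-L information in theta,
    evaluated at theta = theta0."""
    def f(t):
        return kl_information(theta0, t, a, b)
    return (f(theta0 + h) - 2.0 * f(theta0) + f(theta0 - h)) / (h * h)

if __name__ == "__main__":
    print(kl_curvature(0.5, 1.2, 0.0), fisher_information(0.5, 1.2, 0.0))
```

The two printed values should agree closely. In the multidimensional case the same curvature taken along each ability dimension gives the diagonal of the Fisher information matrix, as the abstract states.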
For further information: cwang49@illinois.edu
Alan D. Mead, Avi Fleischer, and Jessica D. Sergent, Illinois Institute of Technology
Although CAT was developed in the context of ability tests (Weiss, 1982), studies have since demonstrated the effectiveness of CAT for measuring attitudes and personality. For example, Koch, Dodd, and Fitzpatrick (1990) applied the rating scale model to a Likert-scale attitudinal questionnaire. The rating scale model (an extension of the one-parameter logistic model for polytomous data) was found to fit the data very well, and, although the authors noted item pool issues, the CAT measured attitudes effectively. Other studies have found similar results for personality assessments, suggesting that perhaps half the items of an assessment are needed to achieve comparable reliabilities (Waller & Reise, 1989; Reise & Henson, 2000). However, one issue that has not been extensively treated in prior literature is the multidimensional nature of most personality assessments. Prior research has generally applied unidimensional CAT to individual scales. Segall (1996) presented a multidimensional CAT (MCAT) methodology where correlations between the factors could be leveraged to administer and score items even more efficiently. Mead, Segall, Williams, and Levine (1997) described a Monte Carlo simulation of the adaptive administration of the 16PF Questionnaire (Cattell, Cattell, & Cattell, 1993; Conn & Rieke, 1994) using Segall’s MCAT method. As in Segall’s simulation, the MCAT method was effective in allowing additional reductions in assessment length, beyond those typically encountered with unidimensional CAT. For example, overall assessment length could easily be cut in half with small decrements in scale reliabilities.
The purpose of the current study was to extend the results of the Monte Carlo simulation (Mead et al., 1997) to real data. This study is important for two reasons. First, it is always important to show that simulated results generalize to actual use. Even more importantly, recent research on personality (research that specifically included the 16PF; Chernyshenko, Stark, Chan, Drasgow, & Williams, 2001) has suggested that traditional IRT models do not fit personality data well and might not be the most appropriate models (Stark, Chernyshenko, Drasgow, & Williams, 2006). If the IRT model is a poor fit to 16PF data, the Monte Carlo results will not hold for real data. On the other hand, if the real-data results replicate the simulation results, then we might assume that traditional IRT models fit 16PF data sufficiently well. We obtained archival data from the administration of the 16PF Questionnaire to approximately 5,000 individuals, and the two-parameter logistic model was fit to the items using BILOG-MG 3.0. Segall’s (1996) software was adapted to read the actual responses of the individuals for a real-data simulation. Results generally supported the use of MCAT with 16PF items. Correlations between actual 16PF scores and MCAT trait estimates were high (averaging .91 to .82) for MCAT tests shortened by up to 40–50%, while shorter MCAT tests had moderate correlations (averaging .72 to .58). The presentation will also discuss results for pool usage (about a third of the pool had exposures greater than 90%), efficiency for individuals with extreme scores, and practical considerations for adaptive personality assessment.
For further information: jsergent@iit.edu
Adaptive Computer-Based Tasks Under an Assessment Engineering Paradigm
Richard M. Luecht, The
Assessment engineering (AE; Luecht, 2007, 2008a, 2008b; Luecht, Gierl, Tan, and Huff, 2006) is a highly structured way of designing constructs and building instruments and associated scales that measure those constructs. By using construct maps, evidence models, task models and templates, AE makes it possible to generate extremely large numbers of test forms with prescribed psychometric characteristics (e.g., targeted measurement precision). This paper presents an extension of AE to include computerized-adaptive performance tasks (CAPTs). In a traditional CAT, each item is selected to maximize the measurement precision relative to a provisional estimate of some latent trait. CAT requires every item to be calibrated using an appropriate IRT model so that estimates of item difficulty (location) and other characteristics can be used in the item selection process. Under AE, task models and templates can generate large classes of items. In turn, individual items inherit the estimated psychometric characteristics of the task models and/or templates. A hierarchical Bayesian framework is used for calibration and to quantify uncertainty associated with the class of items sharing estimated item parameters (cf. Glas and van der Linden, 2003). With CAPTs, features or components of the task models and/or templates are altered in real time to vary the task difficulty in a systematic way. By applying a maximum information criterion to an item generation algorithm scripted as part of an AE template, the task features can be selected to create highly variable computer-based performance tasks (i.e., items) that effectively adapt themselves to the proficiency of the examinee. In this sense, the ensuing performance tasks or items become semi-intelligent measurement agents. The theoretical foundations for CAPTs will be presented in the context of several measurement scenarios.
This paper will also present the hierarchical Bayes calibration framework and algorithms for item generation.
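As a rough illustration of the maximum-information selection rule used in a traditional CAT, the following Python sketch picks the unadministered item with the largest Fisher information at the provisional trait estimate. It assumes a two-parameter logistic model; the function names and the item parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def info_2pl(theta, a, b):
    """Fisher information of 2PL items at trait level theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def select_item(theta_hat, a, b, administered):
    """Return the index of the most informative item not yet administered."""
    info = info_2pl(theta_hat, a, b)
    for j in administered:
        info[j] = -np.inf  # exclude items already given
    return int(np.argmax(info))

# Hypothetical discrimination (a) and difficulty (b) parameters.
a = np.array([1.0, 1.5, 0.8, 1.5])
b = np.array([-1.0, 0.0, 1.0, 2.0])

first = select_item(0.0, a, b, administered=set())
second = select_item(0.0, a, b, administered={first})
print(first, second)
```

Under AE, the same criterion would presumably be applied to task-model features rather than to a fixed bank of calibrated items; that step is not sketched here.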
For further information: rmluecht@uncg.edu
Steven W. Nydick and David J. Weiss, University of Minnesota
The ideal CAT has a large item bank with a wide range of item difficulties; furthermore, in order for the test to provide equiprecise measurements, there must be items that provide sufficient information across the full range of θ (Weiss, 1982). Post-hoc simulations have been proposed as a means of fine-tuning a CAT for live administration; indeed, Gibbons, Weiss, et al. (2008) demonstrated that the results of post-hoc simulations predict the outcomes of a live CAT well. However, before examining CAT test characteristics (e.g., SEM) with a post-hoc CAT simulation, each examinee must have provided a response to each item in a bank. But if the item bank is very large (e.g., 1,000), it might not be reasonable to expect any examinee to respond to all the items without factors external to the trait (e.g., fatigue) affecting his/her score. Frequently, because they tend to be large, CAT item banks are calibrated using concurrent calibration methods, which estimate IRT parameters from an incomplete data matrix including a set of linking items (e.g., Kim & Cohen, 1998). This paper proposes and evaluates the performance of a hybrid simulation procedure for use in developing CATs that employs these sparse, concurrent-linking matrices. The hybrid procedure estimates θ for each examinee with the item parameters estimated from the sparse linking matrix in conjunction with the set of item responses for each examinee. Then, the θ estimate for each examinee is used with Monte Carlo simulation methods to impute the examinee’s missing data, resulting in a complete response vector for each examinee that is part real item responses and part imputed simulated data. A post-hoc simulation is then implemented with the hybrid response matrix.
Two IRT models were used: the two- and three-parameter logistic. From a simulated data matrix of 620 items and 1,000 examinees, either two, four, five, or ten item/examinee blocks were selected, with 20 anchor items, and the remainder of the items and simulees divided randomly into groups. Then, responses to items not belonging to a simulee’s group were deleted, resulting in data matrices with from 49% to 87% missing data. Parameters were estimated for both the matrix of full responses and the matrix of partial responses and θ was estimated for each simulee. The new estimates of θ and the estimated IRT parameters were then used to simulate new responses. POSTSIM (Assessment Systems Corporation, 2007) performed a fixed termination (40 items) and a variable termination (SEM ≤ .20) post-hoc CAT on each matrix. For both the fixed and variable termination criteria, the hybrid CAT with parameters estimated from the full matrix of responses (HFP) had accuracy close to that of the hybrid CAT with parameters estimated from the partial matrix of responses (HPP), yet it also had efficiency close to that of a CAT performed on the full matrix of responses (FFP). The HPP had correlations with the FPP full-test θ well into the .90s; HPP and FPP performed poorly only near the limits of estimating the 3PL (80 items per group). These results suggest that meaningful hybrid simulations can be performed with sparse data matrices involving up to almost 80% missing/imputed data. The simulation results were replicated with a real data set.
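The imputation step of the hybrid procedure can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: it uses the two-parameter logistic model, takes the θ estimates and item parameters as given (the calibration and θ estimation steps are not shown), and all names and numbers are hypothetical rather than the authors' implementation.

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def impute_hybrid(responses, theta_hat, a, b, rng):
    """Fill missing responses (np.nan) with Monte Carlo draws.

    responses: (n_examinees, n_items) array of 0/1 with np.nan where missing.
    theta_hat: (n_examinees,) trait estimates from the sparse data.
    a, b:      (n_items,) estimated item parameters.
    """
    p = p_2pl(theta_hat[:, None], a[None, :], b[None, :])
    simulated = (rng.random(p.shape) < p).astype(float)
    # Keep every real response; use simulated values only where data are missing.
    return np.where(np.isnan(responses), simulated, responses)

rng = np.random.default_rng(0)
a = np.array([1.0, 1.2, 0.8])          # hypothetical item parameters
b = np.array([-0.5, 0.0, 0.5])
theta_hat = np.array([0.3, -1.0])      # hypothetical theta estimates
sparse = np.array([[1.0, np.nan, 0.0],
                   [np.nan, 0.0, np.nan]])

hybrid = impute_hybrid(sparse, theta_hat, a, b, rng)
print(hybrid)
```

The resulting complete matrix is what a post-hoc CAT simulation would then be run on.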
For further information: nydic001@umn.edu
Ying Cheng, University of Notre Dame
CAT is a new mode of testing that enables more efficient and accurate recovery of latent traits. Traditionally, CAT is built upon IRT models that assume unidimensionality. With the advances of latent class models (LCM) and an increasing number of applications of them in testing and measurement, an interesting question that arises is how to build a CAT based on an LCM. Tatsuoka (2002) and Tatsuoka and Ferguson (2003) established a general theorem on the asymptotically optimal sequential selection of experiments to classify finite, partially ordered sets. Xu, Chang and Douglas (2003) proposed two heuristics on the basis of Tatsuoka's theoretical work in the context of CAT, one using Kullback-Leibler information (the KL algorithm) and the other using Shannon entropy (the SHE algorithm).
This paper presents an application of the optimal sequential selection method, i.e., selecting items sequentially for examinees during CAT, which is built upon a class of partially ordered LCMs (i.e., the cognitive diagnostic models). Two new algorithms are proposed: (1) the posterior-weighted KL information (PWKL) method, and (2) a hybrid algorithm (HKL) which considers not only the posterior but also the distance between latent classes. Two simulation studies, one using simulated item parameters, the other with parameter estimates from real data, show that the PWKL and HKL algorithms outperformed the KL and SHE algorithms uniformly. Finally, we built the link among the algorithms by establishing equivalence between the Kullback-Leibler-information-based approaches and the Shannon-entropy-based approach, and connecting the algorithms for LCM with algorithms built upon IRT models.
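The abstract does not give the formula for the PWKL index. The sketch below assumes one common formulation: for each item, the KL divergence between the response distribution under the current class estimate and under each candidate latent class, weighted by the current posterior probability of that class. The probability matrix, the posterior, and the function names are hypothetical, and the HKL distance weighting is not shown.

```python
import numpy as np

def pwkl_index(p, posterior, c_hat):
    """Posterior-weighted KL index for every item (assumed formulation).

    p:         (C, J) matrix with p[c, j] = P(X_j = 1 | class c), strictly in (0, 1).
    posterior: (C,) current posterior over latent classes.
    c_hat:     index of the current latent class estimate.
    """
    p_hat = p[c_hat]
    # KL divergence between Bernoulli(p_hat[j]) and Bernoulli(p[c, j]).
    kl = (p_hat * np.log(p_hat / p)
          + (1.0 - p_hat) * np.log((1.0 - p_hat) / (1.0 - p)))
    return posterior @ kl

def update_posterior(posterior, p, item, x):
    """Bayes update of the class posterior after response x (0 or 1) to item."""
    like = p[:, item] if x == 1 else 1.0 - p[:, item]
    post = posterior * like
    return post / post.sum()

# Hypothetical example: four latent classes, three items.
p = np.array([[0.2, 0.2, 0.2],
              [0.2, 0.9, 0.2],
              [0.9, 0.2, 0.2],
              [0.9, 0.9, 0.9]])
posterior = np.array([0.1, 0.4, 0.4, 0.1])
c_hat = 1

index = pwkl_index(p, posterior, c_hat)
next_item = int(np.argmax(index))
new_posterior = update_posterior(posterior, p, next_item, x=1)
print(next_item, new_posterior)
```

In this toy case the selected item is the one that best separates the two classes that currently share most of the posterior mass, which is the intuition behind weighting the KL information by the posterior.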
For further information: ycheng4@nd.edu
Jeff Douglas, Hua-Hua Chang, and Chun Wang,
We consider how constraint weighted a-stratification can be used in CAT to guarantee that sufficient diagnostic information is obtained on a set of binary latent attributes, when estimation of a unidimensional IRT ability parameter is also desired. Such applications are useful when a single score is needed, but a more fine-grained assessment of the particular skills of an examinee is also desired. Accomplishing these dual aims requires carefully constructing how a single underlying model might simultaneously contain information about a continuous latent trait and a set of binary latent attributes of a cognitive diagnosis model. Such a model is discussed and results are given illustrating how these competing models can both be thought of as valid for an exam. Implementation of constraint weighted a-stratification involves identifying a priority function that combines IRT with cognitive diagnosis. Several priority functions are proposed, some based on formal measures of information, and others only utilizing knowledge of which items measure which attributes. A simulation study and results are reported, showing how utilization of information-based methods yields higher classification rates for cognitive diagnosis while achieving accurate ability estimation. Item exposure rates are also considered for all competing methods. Several new directions for future research are proposed, both for item selection and for considering when multiple latent variable models for a single dataset can be simultaneously used to extract useful information.
For further information: jeffdoug@illinois.edu
Applying the DINA Model to GMAT Focus Data
Alan Huebner, Xiang Bo Wang, and Sung Lee, ACT, Inc.
Recent years have seen growing interest in the area of cognitive diagnostic modeling. These relatively new psychometric models seek to classify examinees as having mastered or not mastered a set of discretely defined skills, as opposed to traditional IRT models that assign examinees a continuous score measuring a broadly defined latent trait. The literature in this field contains few examples of applications of cognitive diagnostic models to real assessment data, and many of these applications use simple datasets as a means of introducing a new estimation algorithm. We attempt to fit the Deterministic Input, Noisy-And (DINA) model to assessment data for an existing test, the GMAT Focus. We discuss whether useful diagnostic information can be gleaned by applying the model to the data.
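For reference, the DINA model is usually written as follows. The notation is the conventional one and is an assumption here, since the abstract does not state its parameterization: $\alpha_{ik}$ indicates whether examinee $i$ has mastered skill $k$, $q_{jk}$ indicates whether item $j$ requires skill $k$, and $s_j$ and $g_j$ are the slip and guessing parameters of item $j$.

```latex
\eta_{ij} = \prod_{k=1}^{K} \alpha_{ik}^{\,q_{jk}},
\qquad
P(X_{ij} = 1 \mid \boldsymbol{\alpha}_i) = (1 - s_j)^{\eta_{ij}} \, g_j^{\,1 - \eta_{ij}}
```

An examinee who has mastered every skill an item requires ($\eta_{ij} = 1$) answers correctly unless he or she slips, with probability $1 - s_j$; any other examinee answers correctly only by guessing, with probability $g_j$.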
For further information: Alan.Huebner@act.org