# Random or fixed testlet effects: a comparison of two multilevel testlet models

## Abstract

This simulation study compared the performance of two multilevel measurement testlet (MMMT) models: Beretvas and Walker’s (2008) two-level MMMT model and Jiao, Wang, and Kamata’s (2005) three-level model. Several conditions were manipulated (including testlet length, sample size, and the pattern of the testlet effects) to assess the impact on the estimation of fixed and random effect parameters. While testlets, in which items share the same stimulus, are common in educational tests, testlet item scores violate the assumption of local item independence (LID) underlying item response theory (IRT). Modeling LID has been widely discussed in previous studies (for example, Bradlow, Wainer, and Wang, 1999; Wang, Bradlow, and Wainer, 2002; Wang, Cheng, and Wilson, 2005). More recently, Jiao et al. (2005) proposed a three-level MMMT (MMMT-3r) in which items are modeled as nested within testlets (level two) and testlets, in turn, as nested within persons (level three). Testlet effects are typically modeled as random in previous studies involving LID. However, item effects (difficulties) are commonly modeled as fixed under IRT models: that is, persons with the same ability level are assumed to have the same probability of answering an item correctly. Therefore, it is also important that a testlet effects model permit modeling of item effects as fixed. Moreover, modeling testlet effects as random implies that testlets are sampled from a larger population of testlets. However, as with item effects, researchers are typically more interested in the particular set of items or testlets used in an assessment. Given the interests of the researcher or psychometrician using a testlet response model, it may be more appropriate to use a testlet response model that permits modeling testlet effects as fixed. An alternative MMMT that permits modeling testlet effects as fixed and/or randomly varying has been proposed (Beretvas and Walker, 2008).
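To make the data structure concrete, the following is a minimal sketch of how item responses with person-specific random testlet effects might be generated under a Rasch-type testlet model, where the log-odds of a correct response is ability minus item difficulty plus a person-by-testlet effect. All sample sizes, variances, and variable names here are illustrative assumptions, not the study's actual simulation design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not the study's actual conditions)
n_persons, n_testlets, items_per_testlet = 500, 4, 5
n_items = n_testlets * items_per_testlet

theta = rng.normal(0.0, 1.0, size=n_persons)   # person abilities
b = rng.normal(0.0, 1.0, size=n_items)         # fixed item difficulties
sigma2 = np.full(n_testlets, 0.5)              # testlet-effect variances (equal pattern)

# Person-specific random testlet effects: gamma_{pd} ~ N(0, sigma2_d)
gamma = rng.normal(0.0, np.sqrt(sigma2), size=(n_persons, n_testlets))

# Map each item to its testlet
testlet_of_item = np.repeat(np.arange(n_testlets), items_per_testlet)

# logit P(y_pi = 1) = theta_p - b_i + gamma_{p, d(i)}
logits = theta[:, None] - b[None, :] + gamma[:, testlet_of_item]
prob = 1.0 / (1.0 + np.exp(-logits))
y = (rng.random((n_persons, n_items)) < prob).astype(int)
```

A fixed-testlet-effects variant would replace the person-by-testlet draws `gamma` with a single constant per testlet, shared across all persons, which is the kind of specification the two-level models discussed below allow.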
The MMMT-2f and MMMT-2r models treat testlet effects as item-set-specific but not person-specific. However, no simulation had been conducted to assess how these proposed models perform. The current study compared the performance of the MMMT-2f and MMMT-2r with that of the MMMT-3r. Results of the present simulation study showed that the MMMT-2r yielded the least biased estimates of fixed item effects, fixed testlet effects, and random testlet effects under conditions with a nonzero, equal pattern of random testlet effect variances, even when the MMMT-2r was not the generating model. However, random effects estimation did not perform well when unequal random testlet effect variances were generated. Fit indices did not perform well either, consistent with the findings of other studies. It should be emphasized that the differences between models were of very little practical significance. From a modeling perspective, the MMMT-2r does allow the greatest flexibility in terms of modeling testlet effects as fixed, random, or both.