In this study, we show that the lack of a universal effect size measure for differential item functioning (DIF) makes the power of DIF detection difficult to interpret, and we develop a set of criteria for a desirable DIF effect size measure.
Using a DIF effect size measure alongside a DIF assessment method has been shown to reduce inflated Type I error rates (Hidalgo, Gómez-Benito & Zumbo, 2014). Some commonly used effect size measures are:
- log-odds based: the Delta scale in the Mantel-Haenszel method (Holland & Thayer, 1988),
- variance based: the Delta R squared in the logistic regression approach (Zumbo & Thomas, 1997),
- and probability based: the difference in item difficulty parameters between groups in an IRT framework (e.g., Finch, 2005; Woods, 2009; Wang & Shih, 2010).
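As a concrete illustration of the log-odds based measure, the following is a minimal sketch of the Mantel-Haenszel common odds ratio and its Delta-scale transform, -2.35 ln(alpha). The function name `mh_d_dif` and the counts are hypothetical and chosen only for illustration; they are not data from this study.

```python
import math


def mh_d_dif(strata):
    """Return (alpha_MH, delta_MH) for one item.

    strata: iterable of 4-tuples, one per matched score level:
        (ref_correct, ref_incorrect, focal_correct, focal_incorrect)
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score levels
        num += a * d / n  # reference correct x focal incorrect
        den += b * c / n  # reference incorrect x focal correct
    alpha = num / den  # Mantel-Haenszel common odds ratio
    delta = -2.35 * math.log(alpha)  # Delta-scale transform
    return alpha, delta


# Hypothetical counts for three matched score levels (illustration only).
strata = [
    (20, 30, 10, 40),
    (30, 20, 20, 30),
    (40, 10, 30, 20),
]
alpha_mh, delta_mh = mh_d_dif(strata)
print(f"alpha_MH = {alpha_mh:.3f}, delta_MH = {delta_mh:.3f}")
```

With this sign convention, a negative Delta indicates that the item favors the reference group.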
Currently, there is no established equivalence between the different types of effect size measures for DIF detection (DeMars, 2011). Furthermore, power in DIF studies has no universal interpretation, since it depends on the DIF effect size, among other variables.
Effect size and power may also depend on item discrimination and difficulty. In this study, we show that the lack of a universal DIF effect size measure makes the power of DIF detection difficult to interpret, and we investigate the relationship between the DIF effect size measure and the selection of an anchor set for DIF detection.
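One way to see the dependence on item discrimination is a small idealized calculation, sketched here under stated assumptions rather than as the method used in this study. It assumes a 2PL model in the logistic metric (no 1.7 scaling constant) and conditions on the latent trait theta rather than on the observed total score that the Mantel-Haenszel method uses. Under those assumptions, the log odds ratio between groups equals a(b_F - b_R), so the same difficulty difference maps to different Delta-scale values as the discrimination a changes. All function names and parameter values are hypothetical.

```python
import math


def p_2pl(theta, a, b):
    """Probability of a correct response under a 2PL model (logistic metric)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_odds(p):
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))


def conditional_delta(theta, a, b_ref, b_focal):
    """Delta-scale value of the reference-vs-focal log odds ratio at theta.

    Idealized: conditions on the latent trait, not on an observed score.
    """
    lor = log_odds(p_2pl(theta, a, b_ref)) - log_odds(p_2pl(theta, a, b_focal))
    return -2.35 * lor


# Same hypothetical difficulty difference, three hypothetical discriminations.
shift = 0.5
for a in (0.5, 1.0, 2.0):
    delta = conditional_delta(theta=0.0, a=a, b_ref=0.0, b_focal=shift)
    print(f"a = {a:.1f}: difficulty difference = {shift}, Delta = {delta:.4f}")
```

In this sketch the Delta value scales linearly with a, which illustrates why a difficulty difference of a given size is not equivalent to a fixed value on a log-odds based scale.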
Work presented at the 2018 International Meeting of the Psychometric Society (IMPS) as a conference talk. Please contact the author for further details.