This report summarizes the psychometric properties and validation evidence for FranklinCovey's Leader Diagnostic. The Diagnostic assesses a comprehensive set of skills associated with leadership, and each skill is assessed with items capturing observable behaviors. At the time of this report (October 2023), the Diagnostic is deployed as a 360-rater assessment before and after participation in FranklinCovey courses. A 180-rater version is being developed.
Download the full Leader (360) Diagnostic Technical Report
There is a process for validating assessments, sometimes called psychometrics.
The validation process can be standardized, depending on the industry. Often, though, it's more accurate to say there are several validation criteria. Some are more important than others, but generally, the more validation criteria an assessment meets, the better.
Many organizations that sell assessments do some kind of validation. Organizations that focus primarily on assessments may validate every assessment they produce. Other organizations that have assessments as a segment of their business, like KornFerry and Gallup, also publish extensive validation/technical reports for their most popular assessments, periodically updating those reports with new data or to document changes to the assessment.
Understandably, many of our clients have asked to see what validation work we’ve done on our FranklinCovey assessments.
This is a ~15-page report where we focused on two areas of psychometrics:
- The Diagnostic's reliability: for instance, how consistent are people's scores when they retake the assessment a week later.
- The Diagnostic's validity: for instance, how well scores predict workplace engagement and job satisfaction. This is typically the most impactful part of the validation process. An assessment can be reliable, but if it doesn't predict much, it isn't worth much. A Buzzfeed quiz that tells you which Disney character you are could theoretically meet the other criteria, but it will probably never be good at predicting anything other than preferences for Disney characters.
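To make those two checks concrete, here is a minimal sketch in Python using simulated data and made-up variable names (it is illustrative only, not the analysis code behind the report): a test-retest correlation for reliability and a score-outcome correlation for validity.

```python
# Illustrative sketch with simulated data; variable names are hypothetical.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Hypothetical respondents: Diagnostic scores at time 1, the same people a week
# later, and an outcome (engagement) we would like the scores to predict.
time1 = rng.normal(5.0, 1.0, size=200)
time2 = time1 + rng.normal(0.0, 0.5, size=200)
engagement = 0.6 * time1 + rng.normal(0.0, 1.0, size=200)

# Reliability: test-retest correlation (how stable are scores over a week?)
r_retest, _ = pearsonr(time1, time2)

# Validity: how strongly do Diagnostic scores relate to the outcome?
r_criterion, _ = pearsonr(time1, engagement)

print(f"test-retest r = {r_retest:.2f}, criterion r = {r_criterion:.2f}")
```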
In total, we analyzed about 20,000 responses from the FY ‘23 version of the Diagnostic, and then had over a thousand leaders rate themselves, and more than 500 direct reports rate their leaders on the FY ‘24 version of the Diagnostic.
The Diagnostic meets the generally accepted standards on several validation criteria — normally distributed responses, multiple forms of reliability, and multiple forms of validity — and in some instances performs really well on these criteria.
The headline finding is that the relationships between rating your leader higher on the Diagnostic and a host of great outcomes, like engagement, job satisfaction, and intent to stay at your organization, are really strong, rivaling the relationships reported for some of the top academic assessments of leadership.
Yes. The report covers the following validation criteria:
- Internal consistency
- Inter-rater reliability
  - Between different rater types (e.g., self-, manager-, and direct report-raters)
  - Within rater types (e.g., do direct reports rating the same manager on the 360 show any consistency with each other in how they rate that manager)
- Test-retest reliability
- Factor structure (i.e., is the Diagnostic a multidimensional measure)
- Convergent validity (i.e., does the Diagnostic relate to other validated measures of leadership)
  - Based on self-rater data
  - Based on direct report-rater data
- Criterion/concurrent validity (i.e., does the Diagnostic relate to outcomes we intend it to predict, like engagement, job satisfaction, and perceptions of manager effectiveness)
- Differences in Diagnostic scores based on respondent demographics (e.g., age, gender identity), and team and organizational variables (e.g., org size, remote work status)
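To give a sense of what sits behind one of these criteria, internal consistency is commonly summarized with Cronbach's alpha. The sketch below uses simulated item responses (not data from the report) to show the standard calculation:

```python
# Cronbach's alpha on a simulated respondents-by-items matrix of 1-7 ratings.
import numpy as np

rng = np.random.default_rng(1)
true_score = rng.normal(5.0, 1.0, size=(300, 1))  # each respondent's "true" level
items = np.clip(true_score + rng.normal(0.0, 0.8, size=(300, 6)), 1, 7)  # 6 related items

def cronbach_alpha(x: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)."""
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1)
    total_var = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")
```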
The FY '24 version of the Diagnostic is shorter, and we updated the questions that didn't have normally distributed responses or that correlated too highly with other questions (suggesting they were redundant). These incremental improvements result in a more reliable and valid Diagnostic.
There are basically two types of testing we relied on to determine the new questions.
We first start with many possible questions that could be used as replacements or additions. We then run many small tests with respondents to find the questions that perform better on a few criteria. We're looking for questions with a good range of responses, preferably a normal distribution of scores, and for questions that correlate strongly with the concepts we want them to correlate with, such as other measures of leader effectiveness. This process got us almost all the way to our final set of questions.
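As a rough illustration of that screening step, the sketch below checks simulated candidate items for a reasonably symmetric response distribution and a meaningful correlation with an external measure of leader effectiveness. The item names, simulated data, and cutoff values are assumptions for the example, not the actual thresholds we used.

```python
# Illustrative item screen: distribution shape plus correlation with an external criterion.
import numpy as np
import pandas as pd
from scipy.stats import skew, pearsonr

rng = np.random.default_rng(2)
n = 400
effectiveness = rng.normal(0, 1, n)  # external measure (hypothetical)
candidates = pd.DataFrame({
    f"item_{i}": np.clip(np.round(4 + w * effectiveness + rng.normal(0, 1.2, n)), 1, 7)
    for i, w in enumerate([0.9, 0.6, 0.2, 0.8], start=1)
})

screen = pd.DataFrame({
    "skewness": candidates.apply(lambda col: skew(col)),
    "r_with_effectiveness": candidates.apply(lambda col: pearsonr(col, effectiveness)[0]),
})

# Keep items that are roughly symmetric and clearly related to the criterion
# (the 1.0 and 0.30 cutoffs are placeholders for this example).
keep = screen[(screen["skewness"].abs() < 1.0) & (screen["r_with_effectiveness"] > 0.30)]
print(keep)
```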
The second step in the process is even more involved. And that’s the validation effort that is detailed in the technical report.
With the FY '24 version of the Diagnostic, we changed the labels on our response scale from expertise (1-Novice, 4-Proficient, 7-Expert) to frequency (1-Never, 7-Always). Note that the scale is still a 7-point (1-7) scale.
There are several reasons for this change:
- A frequency scale is more in line with what is standard and expected in assessments intended to measure behaviors and behavior change.
- Several respondents and clients have commented that the expertise scale is difficult to comprehend and define (even with the definitions of novice, proficient, and expert we provide). A frequency scale will be interpreted more consistently across raters and will be more easily understood by respondents and clients.
- We've conducted a few tests and experiments where we vary the response scale labels that respondents see (expertise vs. frequency) and found that the scale labels affect rates of missing data and selection of "Unable to Evaluate." Specifically, raters are much more likely to skip a rating or select "Unable to Evaluate" when rating on the expertise scale than when rating on the frequency scale. Therefore, we expect the new frequency scale will lead to more complete data from raters of the 360.
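As an illustration of how such a comparison can be checked, the sketch below runs a chi-square test on invented counts of skipped or "Unable to Evaluate" ratings under the two scale-label conditions (these numbers are placeholders, not our experimental results):

```python
# Compare skip / "Unable to Evaluate" rates between two scale-label conditions.
import numpy as np
from scipy.stats import chi2_contingency

# rows: expertise-scale raters, frequency-scale raters (counts are invented)
# cols: [skipped or "Unable to Evaluate", completed ratings]
counts = np.array([
    [220, 1780],
    [120, 1880],
])

chi2, p, _, _ = chi2_contingency(counts)
skip_rates = counts[:, 0] / counts.sum(axis=1)
print(f"skip rate: expertise = {skip_rates[0]:.1%}, frequency = {skip_rates[1]:.1%}")
print(f"chi-square = {chi2:.1f}, p = {p:.4f}")
```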
There are extra steps to validating a 360. One of those is looking at the consistency/reliability across types of raters. This is essentially a measure of whether self-ratings match manager-ratings, peer-ratings, and so on. Workplace 360s generally don't show much overlap across types of raters, but we do find overlap between rater types that is in line with what is expected within the industry. The best way to interpret this finding is that our 360 offers plenty of opportunities for learners to see how other raters rate them differently.
Another step toward validating a 360 assessment is examining the consistency/reliability within a type of rater (e.g., are direct reports rating the same manager sufficiently consistent in their ratings?). We examined the existing FC 360 data from both direct report- and manager-raters. Here we found consistency within rater types that is similar to some of the most popular leadership assessments.
The details of both of these forms of reliability are in the technical report under the section, Inter-rater reliabilities.
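For readers curious about the mechanics, agreement within a rater type is often summarized with an intraclass correlation (ICC). Here is a minimal sketch, assuming a balanced design (every leader rated by the same number of direct reports) and simulated ratings rather than report data:

```python
# One-way ICC(1,1) for agreement among direct reports rating the same leader.
import numpy as np

rng = np.random.default_rng(3)
n_leaders, k = 50, 4  # 50 leaders, 4 direct reports each (hypothetical)
leader_effect = rng.normal(5.0, 0.7, size=(n_leaders, 1))
ratings = leader_effect + rng.normal(0.0, 0.8, size=(n_leaders, k))

# One-way ANOVA mean squares: between leaders and within leaders.
group_means = ratings.mean(axis=1)
grand_mean = ratings.mean()
ms_between = k * ((group_means - grand_mean) ** 2).sum() / (n_leaders - 1)
ms_within = ((ratings - group_means[:, None]) ** 2).sum() / (n_leaders * (k - 1))

icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
print(f"ICC(1,1) = {icc1:.2f}")
```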
While the technical report focuses on the Leader Diagnostic, some of the reliability and validity criteria are relevant for the Individual Contributor version. In particular, the internal consistency and test-retest metrics for the Individual Effectiveness and Winning Culture categories (displayed in Tables 2 & 4 of the technical report) suggest these sections, which make up the IC version, are reliable. Additionally, these two categories showed statistically significant relationships with all of the convergent and criterion variables detailed within this report. Though those statistics are not in the technical report, they are available upon request.
Our validation testing was done only in the US, and in English. It’s almost always the case that validation starts within one country and language, and then if cross-cultural validation is required, it is done later through additional studies.
We also note there's an important distinction between whether an assessment is valid across cultures and whether there are simply differences across cultures. For instance, we do find some demographic differences in our US-based data, some based on race/ethnicity and some based on remote work status. Those are differences between groups. But regardless of the group, Diagnostic scores still predict outcomes like engagement and job satisfaction, and that is what speaks to the validity of the assessment.
All that said, if we were to do a cross-cultural study of the Diagnostic, it would come in the future and could be added to a revised version of the technical report.
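To illustrate the distinction above, the sketch below uses simulated data (with an invented remote/onsite grouping) to show how a mean difference between groups can coexist with a similar score-outcome relationship within each group, which is the pattern that speaks to validity:

```python
# Group mean differences vs. within-group validity, on simulated data.
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(4)
n = 600
group = rng.choice(["remote", "onsite"], size=n)  # invented grouping variable
offset = np.where(group == "remote", 0.3, 0.0)    # a built-in mean difference
diagnostic = 5.0 + offset + rng.normal(0, 1, n)
engagement = 0.5 * (diagnostic - 5.0) + rng.normal(0, 1, n)

df = pd.DataFrame({"group": group, "diagnostic": diagnostic, "engagement": engagement})
for name, sub in df.groupby("group"):
    r, _ = pearsonr(sub["diagnostic"], sub["engagement"])
    print(f"{name}: mean score = {sub['diagnostic'].mean():.2f}, r with engagement = {r:.2f}")
```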
Our validation work was led by Alex O'Connor on the Product team. He has a PhD in research psychology, was trained in psychometrics, and has previously published validated assessments in academic journals.
Our validation work was supported by an external expert, Joshua Eng, PhD. He is faculty at the Indiana University School of Medicine. He is responsible for validating the assessments that measure learning and well-being outcomes for surgical residents across the country and has decades of experience as a psychometrician.