[All of the posts in this series are indexed at the "Graham v MOA" tab under the blog header above. Or you can just
click here.]
The details are complicated so here's an overview of some key problems; details below.
- Overall, oral exams (including interviews) are poor predictors of success on the job - low in both validity and reliability - despite how widely they are used.
- Imagine a question about how to do something fairly complex like, "tell me what you know about brakes." Where do you start? You have four minutes. Go.
- The graders may or may not understand the answer, but they have four or five pages taken from a textbook that they can scan through, with nothing to indicate what's important and what isn't.
- The grading scale has uneven intervals - 5, 4.5, 3.5, 2.5, 1.5, 0 - with 3.5 being the lowest possible pass. This odd scale is not what people are used to and skews the grades downward.
- You have to answer each question, they have to assess your answer, all in about four minutes.
- Lack of Accountability. There's no recording, so there's no way to check what a candidate actually said.
That should give you an idea of the kinds of problems that exist with the oral technical exam that Jeff Graham took.
The Details
As I mentioned in an earlier post, firefighters first have to pass a written exam on technical issues. Questions come from a national bank of questions that have been shown to be relevant to being an engineer (the next step above firefighter).
Then comes the practical exam. Here the firefighters have to get onto trucks and use equipment based on 'evolutions' (AFD's term), which are more or less scenarios that the firefighters have to role-play. This is the most hands-on part of the exam process.
If you pass these two tests, then you can go on to the oral exams - made up of the ten questions of the technical oral exam and then the five questions of the 'peer review' oral exam. That's fifteen questions in an hour, or four minutes per question.
Jeff Graham passed the written and the practical exams. His dispute was over the oral exams, which he complained were subjective and easy to manipulate. So in this post, let's look at the problems with the technical part of the oral exams.
All Oral Exams, Including Interviews, Are Suspect
Before we even start on the oral technical exam, I need to reiterate this point I made in an earlier post.
Despite the popularity of job interviews, experts agree that they are among the most biased and result in the least accurate predictions of candidate job performance.
You can read the whole article or search the internet on this topic and find I'm not exaggerating or cherry picking. It's pretty much only people who conduct interviews who think they know what they're doing.
" interviewers typically form strong but unwarranted impressions about interviewees, often revealing more about themselves than the candidates."
"Research that my colleagues and I have conducted shows that the problem with interviews is worse than irrelevance: They can be harmful, undercutting the impact of other, more valuable information about interviewees."
Or see this link. I offer a few quotes.
"Consider the job interview: it’s not only a tiny sample, it’s not even a sample of job behaviour but of something else entirely. Extroverts in general do better in interviews than introverts, but for many if not most jobs, extroversion is not what we’re looking for. Psychological theory and data show that we are incapable of treating the interview data as little more than unreliable gossip. It’s just too compelling that we’ve learned a lot from those 30 minutes."
Some of these comments are made about unstructured interviews. The AFD engineer exam was a structured interview, which researchers find to be a little better - but still not that good.
From one of the best-known academics writing on human resources management, Edward E. Lawler III:
"Years of research on job interviews has shown that they are poor predictors of who will be a good employee. There are many reasons for this, but perhaps the key explanation is that individuals simply don’t gather the right information and don’t understand what the best predictors of job success are. A careful analysis of the background of individuals and their work history and work samples are more accurate predictors of success on the job than are the judgments which result from interviews."
The graders in the AFD oral exam did not have background information on the candidates' performance; they didn't review the performance evaluations done by the candidates' supervisors.
Excuse me for reiterating this point. It's one that runs counter to people's perceptions and to common practice. But it's important to make it loud and clear from the beginning.
My Problems With the Exams
On the positive side, the AFD oral technical and peer review exams were structured. But there were numerous problems with the structure.
The Questions
There were ten questions on technical topics. Under the conditions I agreed to in order to review the exam questions themselves, I can't talk about the specific questions publicly - only about things that were discussed publicly in court.
The MOA provided no evidence that the exam was validated, even though Graham's lawyer asked the MOA for documentation of how the exams were validated - or even that they were validated at all. That is, we have no evidence that the ten questions predict a candidate's likely success in the position of engineer. The content is related to the job, but AFD has produced no evidence showing that, for example, a candidate with a score of 90 will be a better engineer than a candidate with a score of 65. It may very well be that AFD is only testing who has the best oral test-taking skills.
Two of the ten were about how to prepare for the exam itself. The test creator said these were intended to be softball questions to relax candidates. Graham was scored low on them. The other questions were about things like how equipment worked and about various AFD procedures.
The Answers
For tests to be reliable, graders need to be able to compare the candidate's answer to an ideal answer: the key points should be listed, along with the value of each point. Instead, the graders were given a package of answers. Some questions had no real answers attached. Most looked like they were cut and pasted together from textbooks or AFD policy and procedure manuals. For one question, I timed myself reading the four pages of answer that were provided. It took me 11 minutes and 30 seconds just to read the answer. But there are only about four minutes available per question. How can a grader reliably a) listen to the candidate's answer, b) take notes, and c) read through the answer sheet to match what the candidate said to the expected correct answer? He can't. Graham appears to have done better on the few questions where the provided answers had bullet points rather than several pages of raw material.
Assuming that the question itself is valid, the answer sheets the graders had should have listed the key points a candidate should mention for each question. The best such answer sheets would identify the most important points expected in the response, with a score for each. There was nothing like that. Instead, the graders had to weigh the answer the candidate gave against the pages of 'answer' copied out of a text or manual, and then give the candidate a score on a completely different score sheet.
The Grading Scale
Rubric For Oral Technical Exam
5 - Outstanding, exemplary performance far exceeds expectations
4.5 - Above average performance exceeds level of basic skills/abilities
3.5 - Adequately demonstrated the basic abilities/skills
2.5 - Needs improvement, falls short of what is expected
1.5 - Unsatisfactory, performance is substandard
0 - Unacceptable, does not demonstrate comprehension and/or application of required skill sets
Note: There's a half-point difference between "Outstanding" and "Above average". Then it drops, not .5 like from 5 to 4.5, but a full point to 3.5, "Adequate". So a grader could think, ok, this is good enough, it's adequate. But 3.5 isn't really 'adequate', it's 'just barely adequate' - the lowest possible passing score. The top two scores are very high marks; the next one down is barely passing. And at the bottom of the scale, the jump from 0 to 1.5 - two nearly equivalent failing grades - is the same 1.5-point interval as the jump from 5 all the way down to the barely-passing 3.5.
1. This scale has uneven intervals. That is, the distances between the points on the scale are not the same. Look closely: the top two scores are both strong passing scores, and the bottom two scores are both very poor failing scores. But the top two are separated by only .5 points, while the bottom two are separated by 1.5 points - the same interval as from 5 to 3.5.
If 1.5 is 'unsatisfactory', then 3.5 should be 'satisfactory', but it's only 'adequate' - and more accurately 'barely adequate', because 3.5 is the lowest score you can get and still pass. A 3.4 is a failing grade. (It's less than the 70% (3.5*20) needed to pass.)
The scale skews the scores down. The points graders can mark go from 100% (5) to 90% (4.5) to 70% (3.5), which is the lowest possible passing score. Why isn't there an 80% (4)? That's what a normal scale with equal intervals would have next.
2. The passing score on this test is 3.5. It took me a bit to figure that out: since 70% is needed to pass, each of these numbers has to be multiplied by 20 to get the actual percentage, and 3.5*20 is exactly 70%, the minimum passing score. Let's compare that to other scoring forms you know, say the standard Likert scale on most surveys:
"How do you feel about the Principal's performance this year?
5 – strongly approve
4 – somewhat approve
3 – neutral/no opinion
2 – somewhat disapprove
1 – strongly disapprove" (from Brighthub Education though it had the scale reversed)
Note that here a score of 3 is in the middle and is neutral, whereas in the AFD rubric, 3.5 is the lowest passing score. On a scale like this, the neutral 3 would cover the range from 2.5 to 3.5, so a 3.5 would be at the high end of neutral - not the bare minimum, as it is on the AFD scale.
I chose this 5-point scale because the AFD calls its scale a five-point scale. But if you look at it, it's really a six-point scale, since it has six possible scores: 0, 1.5, 2.5, 3.5, 4.5, and 5. They didn't even realize they had a six-point scale! In most six-point Likert scales, there are three positive and three negative options. On such a scale (5, 4, 3, 2, 1, 0), a 3.5 would be a strong passing score, not barely passing.
Let's look at it again from a different angle. In the common academic scoring system, 90-100 is an A, 80-89 is a B, 70-79 is a C, 60-69 is a D, and below 60 is an F.
But in the AFD system, below 70 is an F. On the AFD scale the first two numbers (a ten percent interval) are both very good scores. The next score (20 percent less) is barely passing. All the rest are failing scores.
Why does this matter? First, it matters because the intervals are not equal. It's not common to have uneven intervals, where the distance between one score and the next can vary from 10% (5 to 4.5) to 20% (4.5 to 3.5) to 30% (1.5 to 0).
It matters because the scale isn't like any scale we normally see and so the graders don't have a good sense of how skewed it is toward failing candidates. The 3.5 isn't 'good', it's barely passing. Yet most people are used to the academic scoring system and would assume that 70% would be something like a low C. There is no 'good' on this scale. There's 'walks on water,' 'really outstanding,' then 'barely passing'. It took me a while to understand why the scale seemed so off. I don't imagine an AFD grader would figure this out in the time they have to grade. There isn't enough time to even figure out what the right score should be, let alone analyze the scale to see how skewed it is.
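To make the skew concrete, here's a minimal sketch - my own illustration in Python, not anything AFD uses - that converts each rubric score to its percentage equivalent (multiplying by 20, since 3.5 is 70%) and shows the gap between adjacent scores:

```python
# A minimal sketch of the AFD rubric's percentage equivalents - my own
# illustration, not an AFD tool. Each score converts to a percentage by
# multiplying by 20 (3.5 * 20 = 70%, the lowest passing score).

rubric = [5.0, 4.5, 3.5, 2.5, 1.5, 0.0]   # the six possible scores, high to low
labels = ["Outstanding", "Above average", "Adequate (lowest pass)",
          "Needs improvement", "Unsatisfactory", "Unacceptable"]

for score, label in zip(rubric, labels):
    print(f"{score:>4}  ->  {score * 20:>5.0f}%   {label}")

print("\nGaps between adjacent scores, in percentage points:")
for higher, lower in zip(rubric, rubric[1:]):
    print(f"  {higher} down to {lower}: {(higher - lower) * 20:.0f}")
```

Run it and the gaps come out as 10, 20, 20, 20, and 30 percentage points - nothing like the equal steps graders are used to, and with no 80% (4) anywhere on the scale.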
Reliability
Tests have to be both valid and reliable. Valid means a high score predicts success (in this case, 'will perform the job well') and a low score predicts failure ('will perform the job poorly'). Even if the test itself is valid (and AFD never had anyone test for validity), it also has to be reliable to be an accurate predictor. Without reliability, the scores aren't accurate, and so the test is no longer a good predictor (no longer valid).
Reliable means that the same candidate taking the test under different conditions (different time, different place, different graders) would essentially get the same score.
Given the lack of clear answer keys for the graders to compare the candidates' answers against, and given the lack of a clear rating system, it's not likely that this is a reliable test.
And in fact, when I looked at the scores individuals got on the same question from different graders, there were some big differences. Some were a point apart, some were 2 points or even 2.5 points apart, and a few were 3 points apart.
That may not sound like much, but on this scale it's a lot. Even though there were six possible scores, the scale ran from 0 to 5, and 3.5 was equal to 70% - the lowest passing grade - so each point was worth 20%. That means a two-point difference equals a 40% difference in score, 2.5 points equals 50%, and 3 points equals 60%. If this test were reliable, the scores from different graders shouldn't vary by more than 10% or so - half a point on any question.
On this exam, graders gave scores that were up to 60% (3 points) different for the same candidate on the same question! That's the difference - in an academic setting - between a 95% (A) and 35% (low F). But even a one point difference is a 20% difference - that could be a 90% and a 70% or the difference between an A- and a C-. That's a huge spread, and it's there because the answer sheets don't tell the graders what a good answer looks like, and the grading scale easily tricks graders into giving lower ratings than they realize they are giving.
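To put those grader-to-grader gaps in percentage terms, here's another small sketch - again my own illustration, using the point differences described above rather than anything pulled from actual score sheets:

```python
# My own illustration - the gaps listed are the kinds of point differences
# described in the post, not data from AFD score sheets.
# Since 3.5 = 70%, each rubric point is worth 20 percentage points.

POINT_VALUE = 20  # percent per rubric point

gaps_between_graders = [0.5, 1.0, 2.0, 2.5, 3.0]  # points apart on the same question

for gap in gaps_between_graders:
    verdict = ("about the most a reliable test should show"
               if gap <= 0.5 else "far too much variation")
    print(f"{gap} point(s) apart = {gap * POINT_VALUE:.0f}% apart  ({verdict})")
```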
The two people in charge of setting up the tests (and the overall training director at the time) were actually not qualified to prepare the tests. They all had an Alaska Fire Safety Training Instructor Certificate, Level I. That made them eligible to administer training and tests that had been prepared by someone else - someone with a Level II certificate.
Here's the earlier post on the certifications.
Accountability
The exam is an hour long. Some of the graders wrote brief notes on their score sheets, but there isn't much room. Those notes are the only record of what actually happened in the oral exam. There's no way for the candidate or the graders to check what was actually said. Even when they review the pages that contain the answers, there's no tape to replay to see whether the candidate covered something or not. And years later? At trial? Well, I know that five minutes later my wife and I have totally different recollections of what each of us said. And at trial, there were a number of occasions when Jeff Jarvi, Jeff Graham's attorney, asked the graders about something they had said at their depositions or at the State Human Rights Commission hearing. They'd answer. Then Jarvi would have them read the transcripts, and the transcripts were not at all what they remembered.
Video or audio recordings of the oral exam probably could have averted this whole trial. The tapes would have demonstrated whether Jeff Graham had covered the points as he said he did, or whether he hadn't, as some of the graders said.
Not having recordings means no one can check how well the graders did their jobs, or whether people with low scores really did answer less accurately than people with high scores. And it means that candidates can't get feedback on how they did so they can better prepare for the next exam. (I'd note that, as a professor, the most convincing way to demonstrate to a student that his paper wasn't very good was to have him read the best papers. It worked every time.)
Unequal Conditions
One other twist here. This takes us ahead to the oral peer review, but it affects how people do on the technical exam as well. Prior to the oral peer review, candidates are 'pre-scored' by the graders based on their personal knowledge of the candidates. Imagine random co-workers - some who might work closely with you and others who don't - evaluating you based on, well, whatever they want. No one asks what the basis is. It's not based on written materials about your work history, because there is the option of N/O - not observed. If there were written documentation, they wouldn't need the N/O option.
When the candidate walks into the room, he (it's almost always a he) is given his pre-scores. If you pass (get 70% or better), you can choose to skip the peer-review questions altogether and spend the extra time - about 20 minutes - on the technical questions.
Imagine the emotional impact of being told, before you even start the oral technical exam, that the graders had failed you on the pre-scoring of the peer-review. Or that they passed you with a high score. In Jeff's case, he was stunned to learn he'd been failed by the graders in the pre-scoring of the peer-review. It set him back as the technical oral began.
You've Made It To The End
If you made it this far, congratulations. Going through details like this is necessary to truly get a sense of how badly these exams were designed and implemented. But it does take work. Thanks for hanging in there. These are the kinds of details the jury had to sit through over the three-plus-week trial.
This post covered the problems with the technical oral exam; the next post will cover the oral 'peer review' exam that was part of the same one-hour session as the technical oral exam.