
Friday, May 25, 2018

Graham v MOA #12: Fire Chief LeBlanc Retires

First paragraphs from an ADN article:
"Anchorage Fire Chief Denis LeBlanc retired Friday after nearly three years on the job, officials said.
A successor to LeBlanc will be named in a few weeks, said Kristin DeSmith, spokeswoman for Mayor Ethan Berkowitz. Jodie Hettrick, the fire department's deputy chief of operations, is serving as interim chief.
LeBlanc, who is 70, said he told the administration of his decision in early May. He said he has loved the job, but he's been working on and off for 53 years, including a few decades in the oil industry. . ."
I'm including this here because LeBlanc was fire chief while the Graham suit went to court.  He was never a firefighter (usually a requirement for the chief's job).  He came from the oil industry, was City Manager for Begich, then went to CH2M Hill (the company that bought VECO).

He was asked to get the AFD budget under control.  In his deposition he expressed no interest in reaching out to the community to increase the number of women or minorities in the AFD.  My hope is that the Diversity Mayor will be able to find a new chief for whom increasing the number of women (about 2% now) and people of color (about 12% now) in the fire department will be a major priority.

I'd note that LeBlanc was on the public administration advisory board when I was chair of the department at UAA, and he is a very affable person who was always supportive of the program.

Monday, May 21, 2018

Graham v MOA #11 - Oral Exams 4 - Jeff Graham Passes New Promotion Exam

Although Jeff Graham won his lawsuit over discrimination in his 2012 promotion exam with the Anchorage Fire Department, the Municipality told him he still had to take the current promotion exam before he would be promoted.

After 2012 he stopped taking the exam because he figured it was futile - they weren't going to promote him.  He'd heard rumors, but they weren't confirmed until he learned that another firefighter had heard the head of promotion training and testing in 2012 (and later) tell that firefighter and a couple of others at the promotion academy that "As long as I'm in charge of promotion, Jeff Graham will never promote."  Graham only got the name of someone who had heard it directly during the second week of the trial.  But that person agreed to testify.

In any case, Jeff passed the written, practical, and oral portions of the exam this time and ended up number five on the promotion list, which is much longer than the list when he last took the exam.  But five is reasonably high and if the past is a good predictor, there should be at least five openings in the next two years.

So, he hasn't been promoted, but based on the past, I'm reasonably optimistic.

I'm still writing about the 2012 exam, and there have been some changes in the exam recently, but from what I've seen, there are still egregious violations of how things should be done.  I'm not sure whether Jeff passed because the whole trial experience gave him a better sense of how to prepare for the exam, because there were different people grading the exam, or because the word went out to make sure he passed - particularly on the oral part of the exam.  And I'm sure there are other possible explanations and that none of these are mutually exclusive.

It's a step forward for Jeff and his career.  But the oral exam is still overly subjective, the scoring sheets are still bizarre, and the training materials say things that really are in conflict with merit principles and with evaluating someone based on job-related criteria only.

Tuesday, March 20, 2018

Graham v MOA #10: The Exams #3: The Oral Technical Exam

[All of the posts in this series are indexed at the "Graham v MOA" tab under the blog header above.  Or you can just click here.]

The details are complicated so here's an overview of some key problems; details below.
  • Overall, oral exams (including interviews) are unreliable and largely invalid predictors of success on the job, despite being widely used.
  • Imagine a question about how to do something fairly complex, like "tell me what you know about brakes."  Where do you start?  You have four minutes.  Go.
  • The graders may or may not understand the answer, but they have four or five pages taken from a textbook that they can scan through, with nothing to indicate what's important and what isn't.
  • The grading scale has uneven intervals - 5, 4.5, 3.5, 2.5, 1.5, 0 - with 3.5 being the lowest possible pass.  This odd scale is not what people are used to and skews the grades downward.
  • You have to answer each question, they have to assess your answer, all in about four minutes.
  • Lack of Accountability.  There's no recording, no way to check to see what a candidate actually said.  
That should give you an idea of the kinds of problems that exist with the oral technical exam that Jeff Graham took.

The Details

As I mentioned in an earlier post, firefighters first have to pass a written exam on technical issues.  Questions come from a national bank of questions that have been shown to be relevant to being an engineer (the next step above fire fighter).

Then comes the practical exam.  Here the firefighters have to get onto trucks and use equipment based on 'evolutions' (AFD's term), which are more or less scenarios that the firefighters have to role-play.  This is the most hands-on part of the exam process.

If you pass these two tests, then you can go on to the oral exams - ten questions on the technical oral exam and then five questions on the 'peer review' oral exam.  That's fifteen questions in an hour, or about four minutes per question.

Jeff Graham passed the written and the practical exams.  His dispute was over the oral exams, which he complained were subjective and easy to manipulate.  So in this post, let's look at the problems with the technical part of the oral exams.

All Oral Exams, Including Interviews, Are Suspect

Before we even start on the oral technical exam,  I need to reiterate this point I made in an earlier post.

Despite the popularity of job interviews, experts agree that they are among the most biased and result in the least accurate predictions of candidate job performance.

 You can read the whole article or search the internet on this topic and find I'm not exaggerating or cherry picking.  It's pretty much only people who conduct interviews who think they know what they're doing.
" interviewers typically form strong but unwarranted impressions about interviewees, often revealing more about themselves than the candidates."
"Research that my colleagues and I have conducted shows that the problem with interviews is worse than irrelevance: They can be harmful, undercutting the impact of other, more valuable information about interviewees."

Or see this link.  I offer a few quotes.
"Consider the job interview: it’s not only a tiny sample, it’s not even a sample of job behaviour but of something else entirely. Extroverts in general do better in interviews than introverts, but for many if not most jobs, extroversion is not what we’re looking for. Psychological theory and data show that we are incapable of treating the interview data as little more than unreliable gossip. It’s just too compelling that we’ve learned a lot from those 30 minutes."
Some of these comments are made about unstructured interviews.  The AFD engineer exam was a structured interview, which researchers find to be a little better.  But still not that good.  From one of the most well-known academics writing on human resources management, Edward E. Lawler III:
"Years of research on job interviews has shown that they are poor predictors of who will be a good employee. There are many reasons for this, but perhaps the key explanation is that individuals simply don’t gather the right information and don’t understand what the best predictors of job success are. A careful analysis of the background of individuals and their work history and work samples are more accurate predictors of success on the job than are the judgments which result from interviews."
The graders in the AFD oral exam did not have background information on the candidates' performance, and they didn't review the performance evaluations done by the candidates' supervisors.

Excuse me for reiterating this point.  It runs counter to common perception and to practice.  But it's important to make it loud and clear from the beginning.

My Problems With the Exams

On the positive side, the AFD oral technical and peer review exams were structured.  But there were numerous problems with the structure.

The Questions

There were ten questions on technical topics.  My understanding of the conditions under which I reviewed the exam questions themselves is that I can't talk about the specific questions publicly - only about things that were discussed publicly in court.

The MOA provided no evidence that the exam was validated, even though Graham's lawyer asked the MOA for documentation of how - or even whether - the exams were validated.  That is, we have no evidence to show that the ten questions were predictors of a candidate's likely success in the position of engineer.  There is content related to the job, but AFD has produced no evidence showing that, for example, a candidate with a score of 90 will be a better engineer than a candidate with a score of 65.  It may very well be that AFD is only testing who has the best oral test-taking skills.

Two of the ten were about how to prepare for the exam itself.  The test creator said these were intended to be softball questions to relax candidates.  Graham was scored low on them.  The other questions were about things like how equipment worked and about AFD procedures for different things.

The Answers

For tests to be reliable, the graders need to be able to compare the candidate's answer to the ideal answer.  The key points should be listed with the value of each point.  The graders were given a package of answers.  Some questions had no real answers attached.  Most looked like they were cut and pasted together from textbooks or AFD policy and procedure manuals.  For one question, I timed myself reading the four pages of answer that were provided.  It took me 11 minutes and 30 seconds just to read the answer.  But there are only about four minutes available per question.  How can a grader reliably a) listen to the candidate's answer, b) take notes, and c) read through the answer sheet to match what the candidate said to the expected correct answer?  He can't.  Graham appears to have done better on the few questions that had bullet points on the provided answers rather than several pages of raw material.

Assuming that the question itself is valid, the answer sheets the graders had should have listed the key points the candidate was expected to mention for each question.  The best such answer sheets would identify the most important points that should be in the response, with a score for each.  There was nothing like that.  Instead the graders had to weigh the candidate's answer against the pages of 'answer' copied out of a textbook or manual, and then record a score on a completely separate score sheet.

The Grading Scale

Rubric For Oral Technical Exam

5  - Outstanding, exemplary performance far exceeds expectations

4.5 - Above average performance exceeds level of basic skills/abilities

3.5  - Adequately demonstrated the basic abilities/skills

2.5 - Needs improvement, falls short of what is expected

1.5 - Unsatisfactory, performance is substandard

0 - Unacceptable, does not demonstrate comprehensive and/or application of required skills sets

Note:  There's a half-point difference between "Outstanding" (5) and "Above average" (4.5).  Then it drops, not another half point, but a full point, to 3.5, "Adequate."  So a grader could think, OK, this is good enough, it's adequate.  But 3.5 isn't 'adequate,' it's really 'just barely adequate' - the lowest possible passing score.  The top two scores are very high marks; the next one down is barely passing.

At the bottom of the scale, it goes from 0 to 1.5, both of which are essentially equivalent failing grades.  Yet that 1.5-point gap is the same interval as the gap from 5 down to 3.5.



1.  This scale has uneven intervals. That is, the distances between the points on the scale are not the same.  Look closely.  The top two scores are both strong passing scores and the bottom two scores are both very poor failing scores.  But the top two are separated by .5 points, while the bottom two are separated by 1.5 points - the same interval as from 5 to 3.5.

If 1.5 is 'unsatisfactory,' then 3.5 should be 'satisfactory,' but it's labeled only 'adequate' - and more accurately 'barely adequate,' because 3.5 is the lowest score you can get and still pass.  A 3.4 is a failing grade.  (It's less than the 70% (3.5 × 20) needed to pass.)

The scale skews the scores down.  The points graders can mark go from 100% (5) to 90% (4.5) to 70% (3.5) which is the lowest possible passing score.  Why isn't there an 80% (4)?  That's what normal scales with equal intervals would have next.
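To make the skew concrete, here's a minimal sketch (my own illustration, not anything AFD uses) that converts each rubric score to its percentage equivalent and lists the gaps between adjacent scores:

```python
# My own illustration, not anything AFD uses: convert each rubric score to a
# percentage and look at the gaps between adjacent scores on the scale.
AFD_SCALE = [5, 4.5, 3.5, 2.5, 1.5, 0]

for score in AFD_SCALE:
    print(f"rubric {score:>3} -> {score * 20:.0f}%")

gaps = [upper - lower for upper, lower in zip(AFD_SCALE, AFD_SCALE[1:])]
print("gaps between adjacent scores:", gaps)  # [0.5, 1.0, 1.0, 1.0, 1.5]
# An equal-interval six-point scale (5, 4, 3, 2, 1, 0) would have a gap of 1.0
# everywhere and would offer an 80% (4) between "above average" and "adequate."
```

The top of the scale is finely graded, while everything below 'adequate' falls away in big jumps - which is exactly the downward skew described above.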

2.  The passing score on this test is 3.5.  It took me a bit to figure that out: 70% is the passing threshold, and each of these numbers needs to be multiplied by 20 to get a percentage, so 70% (3.5 × 20) is the minimum passing score.  Let's compare that to other scoring forms you know.  Say the standard Likert scale on most surveys:
"How do you feel about the Principal's performance this year?
5 – strongly approve
4 – somewhat approve
3 – neutral/no opinion
2 – somewhat disapprove
1 – strongly disapprove" (from Brighthub Education though it had the scale reversed)
Note that here a score of 3 is in the middle and is neutral, whereas in the AFD rubric, 3.5 is the lowest passing score.  The 3, or neutral, would cover a range from 2.5 to 3.5.  So a 3.5 would be on the high end, not the low end as it is on the AFD scale.

I chose this 5-point scale because the AFD calls its scale a five-point scale.  But if you look at it, it's really a six-point scale, since it has six possible scores: 0, 1.5, 2.5, 3.5, 4.5, and 5.  They didn't even realize they had a six-point scale!  In most six-point Likert scales, there are three positive and three negative options.  On such a scale (5, 4, 3, 2, 1, 0) a 3.5 would be a strong passing score, not barely passing.

Let's look at it again from a different angle.  In the common academic scoring system, 90-100 is an A, 80-89 is a B, 70-79 is a C, 60-69 is a D, and below 60 is an F.

But in the AFD system, below 70 is an F.   On the AFD scale the first two numbers (a ten percent interval) are both very good scores.  The next score (20 percent less) is barely passing.  All the rest are failing scores.

Why does this matter?  First, it matters because the intervals are not equal.  It's not common to have uneven intervals  so that the distance between one score and another can vary from 10% (5 to 4.5) to 20% (4.5 to 3.5) to 30% (1.5 to 0).

It matters because the scale isn't like any scale we normally see and so the graders don't have a good sense of how skewed it is toward failing candidates.  The 3.5 isn't 'good', it's barely passing.  Yet most people are used to the academic scoring system and would assume that 70% would be something like a low C.  There is no 'good' on this scale.  There's 'walks on water,' 'really outstanding,' then 'barely passing'.  It took me a while to understand why the scale seemed so off.  I don't imagine an AFD grader would figure this out in the time they have to grade.  There isn't enough time to even figure out what the right score should be, let alone analyze the scale to see how skewed it is.

Reliability

Tests have to be both valid and reliable.  Valid means a high score predicts success (in this case, 'will perform the job well') and a low score predicts (in this case) 'will perform the job poorly.'  Even if the test itself is valid (and the AFD never did have anyone test for validity), it also has to be reliable to be an accurate predictor.  Without reliability, the scores aren't accurate, and so the test is no longer a good predictor (no longer valid).
Reliable means that the same candidate taking the test under different conditions (different time, different place, different graders) would essentially get the same score.

Given the lack of clear answers for the graders to use to compare the candidate's answer to, and given the lack of a clear rating system, it's not likely that this is a reliable test.

And in fact, when I looked at the scores individuals got on the same question by different graders, there were some big differences in score.  Some were a point off, some were 2 points off and even 2.5 points off.  There were even a few that were 3 points off.

That may not sound like much, but on a six point scale, it's a lot.  Even though they actually used a six point scale, they said that 3.5 was equal to 70% - the lowest passing grade.  So each point was worth 20%.  Therefore, two points equals a 40% difference.  2.5 equals a 50% difference on the score.  3 equals 60%.  If this test were reliable, the scores by different graders shouldn't vary by more than 10% or so.  That would be half a point on any question.

On this exam,  graders gave scores that were up to 60% (3 points) different for the same candidate on the same question!  That's the difference - in an academic setting - between a 95% (A) and 35% (low F).  But even a one point difference is a 20% difference - that could be a 90% and a 70% or the difference between an A- and a C-.   That's a huge spread, and it's there because the answer sheets don't tell the graders what a good answer looks like, and the grading scale easily tricks graders into giving lower ratings than they realize they are giving.
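To put those spreads into percentage terms, here's a small sketch with made-up numbers (hypothetical scores, not actual exam data) showing how a disagreement between graders on one question translates into a percentage gap:

```python
# Hypothetical scores only -- not actual exam data.  Each rubric point is
# worth 20%, so the spread between graders on the same answer translates
# directly into a percentage gap on that question.
POINT_VALUE = 20  # percent per rubric point

def grader_spread(scores):
    """Return (max - min) for one question, in rubric points and in percent."""
    spread = max(scores) - min(scores)
    return spread, spread * POINT_VALUE

# Made-up examples: three graders scoring the same candidate on one question.
for scores in [(4.5, 4.5, 3.5), (4.5, 2.5, 3.5), (4.5, 1.5, 3.5)]:
    points, percent = grader_spread(scores)
    print(f"scores {scores}: spread {points} points = {percent:.0f}%")
# A reliable exam should keep that spread to roughly half a point (10%) or so.
```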

The two people in charge of setting up the tests (and the overall training director at the time) were actually not qualified to prepare the tests.  They all had Alaska Fire Safety Training Instructor Certificate Level 1.  They were eligible to administer training and tests that had been prepared by someone else, someone with a Level II certificate.  Here's the earlier post on the certifications.

Accountability

The exam is an hour long.  Some of the graders wrote brief notes on their score sheets, but there isn't much room.  Those notes are the only record of what actually happened in the oral exam.  There's no way for the candidate or the graders to check what was actually said.  Even as they review the pages that contain the answer, they can't replay a tape to see whether the candidate covered something or not.  And years later?  At trial?  Well, I know that five minutes later my wife and I have totally different recollections of what each of us said.  And at trial, there were a number of occasions when Jeff Jarvi, Jeff Graham's attorney, asked the graders about something they had said at their depositions or at the State Human Rights Commission hearing.  They'd answer.  Then Jarvi would have them read the transcripts, and the transcripts were not at all what they remembered.

Video or audio recordings of the oral exam probably could have averted this whole trial.  The recordings would have demonstrated whether Jeff Graham had covered the points as he said he did, or hadn't, as some of the graders said.

Not having recordings means no one can check how well the graders did their jobs, or whether people with low scores really did answer less accurately than people with high scores.  And it means that candidates can't get feedback on how they did so they can better prepare for the next exam.  (I'd note that, as a professor, the most convincing way to demonstrate to a student that his paper wasn't very good was to have him read the best papers.  It worked every time.)

Unequal Conditions

One other twist here.  This takes us ahead to the oral peer review, but it affects how people do on the technical exam as well.  Prior to the oral peer review, candidates are 'pre-scored' by the graders based on their personal knowledge of the candidates.  Imagine random co-workers, some of whom might work closely with you and others who don't, evaluating you based on, well, whatever they want.  No one asks the basis.  It's not based on written materials they have about your work history, because there is the option of N/O - not observed.  If there were some written documentation, they wouldn't need the N/O option.

When the candidate walks into the room, he (it's almost always a he) is given his pre-scores.  If you pass (score 70% or better) you can choose to spend more time on the technical questions and skip the peer-review questions altogether.  You can gain 20 minutes.

Imagine the emotional impact of being told, before you even start the oral technical exam, that the graders had failed you on the pre-scoring of the peer-review.  Or that they passed you with a high score.  In Jeff's case, he was stunned to learn he'd been failed by the graders in the pre-scoring of the peer-review.  It set him back as the technical oral began.


You've Made It To The End

If you made it this far, congratulations.  Going through details like this is necessary to truly get a sense of how badly these exams were designed and implemented.  But it does take work.  Thanks for hanging in there.  These are the kinds of details the jury had to sit through over the three-week-plus trial.

This post covered the technical oral exam problems and the next post will cover the oral 'peer review' exam that was part of the one hour along with the technical oral exam.  

Monday, February 19, 2018

Graham v MOA #9: Exams 2 - Can You Explain These Terms: Merit Principles, Validity, And Reliability?

The Municipality of Anchorage (MOA) Charter [the city's constitution] at Section 5.06(c) mandates the Anchorage Assembly to adopt
“Personnel policy and rules preserving the merit principle of employment.”   AMC 3.30.041 and 3.30.044 explain examination types, content, and procedures consistent with these merit principles.  
As defined in the Anchorage Municipal Code Personnel Policies and Rules,
“Examination means objective evaluation of skills, experience, education and other characteristics demonstrating the ability of a person to perform the duties required of a class or position.” (AMC 3.30.005)
[OK, before I lose most of my readers, let me just say, this is important stuff to know to understand why the next posts will look so closely at the engineer test that Jeff Graham did not pass.  But it's also important to understand one of the fundamental principles underlying government in the United States (and other nations.)  And I'd add that the concepts behind merit principles are applied in most large private organizations to some extent, though they may have different names.

Jeff Graham's attorney made me boil this down to the most basic points to improve the likelihood I wouldn't put the jury to sleep.  So bear with me and keep reading.

And, you can see an annotated index of all the posts at the Graham v MOA tab above or just link here.]  


Basic Parts of Government In The United States

Governments can be broken down into several parts.
  • The elected politicians who pass the laws and set the broad policy directions (legislature)
  • The elected executive who carries out the laws.
  • The administration is led by the elected executive - the president, the governor at the state level, and the mayor at the city level.
  • Civil Service refers to the career government workers who actually carry out the policies.  There are also appointed officials at the highest levels who are exempt from some or all of the civil service rules.

Merit principles are the guidelines for how the career civil servants are governed.  

So What Are Merit Principles?

Probably the most basic, as related to this case, are:
  • Employees are chosen solely based on their skills, knowledge, and abilities (SKAs) that are directly related to their performance of the job. 
  • The purpose of this is to make government as effective and efficient as possible by hiring people based on their job-related qualities and nothing else.  
  • That also means other factors - political affiliation, race, color, nationality, marital status, age, and disability - should not be considered in hiring or promotion.  It also means that arbitrary actions and personal favoritism should not be involved.
  • Selection and promotion criteria should be as objective as possible.   


So Steve, you're saying... this sounds obvious.  What else could there be?

Before the merit system there was the Spoils System.  Before merit principles were imposed on government organizations, jobs (the spoils) were given to the victors (winning politicians and their supporters).  The intent of the Merit System is to hire the most qualified candidates.

In 1881, President Garfield was assassinated by a disgruntled job seeker, which spurred Congress to set up the first version of the federal civil service system - The Pendleton Act.

Only a small number of federal positions were covered by this new civil service act, but over the years more and more positions were covered and the procedures improved with improvements in the technology of testing.  The merit system, like any system can be abused, but it's far better than the spoils system.  Objective testing is a big part of applying merit principles.


What does 'objective criteria' mean? 

Objectivity has a couple common and overlapping meanings:
  • Grounded on facts.  Grounding your understanding or belief on something concrete, tangible.  Something measurable that different people could 'see' and agree on.
  • Unbiased.  A second, implied meaning from the first, is that you make decisions neutrally, as free as you can be from bias, preconceived ideas.  That’s not easy for most people to do, but there are ways to do it better. 


What Ways Can Make  Tests More Objective And Free Of Bias?

I think of objectivity as being on one end of a continuum and subjectivity being on the other end.  No decision is completely objective or subjective, nor should it be.  But generally, the more towards the objective side, the harder it is to introduce personal biases.* 

objective ...............................................................................................subjective



First Let's Define "Test"

In selection and promotion, we have tests.  A test is defined as anything used to weed out candidates or rank candidates from poor to good.  So even an application form can be a test if it could lead to someone being cut out of the candidate pool.  Say candidates are required to have a college degree and someone doesn't list one on an application.  They would be eliminated right there.

Again,  how do you make tests more objective?

There are two key terms we need to know:  validity and reliability.

What’s Validity?

Validity means that if a person scores higher on a test, we can expect that person to perform better on the specific job.  
Or saying it another way, the test has to truly test for what is necessary for the job.  So, if candidates without a college degree can do the job as well as candidates with a degree, then using college degree to screen out candidates is NOT valid.  

And what is reliability?

Reliability means that if  a person takes the same test at different times or different places, or with different graders, the person should get a very similar result.  Each test situation needs to have the same conditions, whether you take the test on Monday or on Wednesday, in LA or Anchorage, with Mr. X or Miss Y administering and/or grading the test.  

How Validity and Reliability Relate To Each Other

To be valid, the selection or promotion test must be a good predictor of success on the job. People who score high on the exam, should perform the job better than those who score low.  And people who score low should perform worse on the job than people who score high.

BUT, even if the test is intrinsically valid, the way it is administered could invalidate it.  If the test is not also reliable (testing and grading is consistent enough that different test takers will get a very similar score regardless of when or where they take the test and regardless of who scores the test) the test will no longer be valid.  This is because the scores will no longer be good predictors of who will do well on the job.
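A toy simulation (entirely made up, no AFD data involved) shows why: the noisier the grading, the weaker the link between the score a candidate gets and how good that candidate actually is, so the test stops predicting anything.

```python
# Toy simulation, no AFD data involved: add grading "noise" to a score that
# starts out perfectly tied to true job ability, and watch the correlation
# between score and ability fall as the grading gets less reliable.
import random
import statistics

random.seed(1)

def correlation(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (len(xs) * statistics.pstdev(xs) * statistics.pstdev(ys))

ability = [random.uniform(50, 100) for _ in range(500)]  # "true" job ability

for noise in (2, 10, 25):  # grading noise, in percentage points
    scores = [a + random.gauss(0, noise) for a in ability]
    r = correlation(ability, scores)
    print(f"grading noise of about +/-{noise} points: correlation = {r:.2f}")
# Small noise: the score still tracks ability closely.  Large noise: it tracks
# it far more weakly, and a "valid" test has quietly become a poor predictor.
```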

How do you go about testing for validity and reliability?
This can get complicated, especially for  factors that are not easy to measure.  I didn't go into this during the trial.  I wanted to point out some pages in a national Fire Safety Instructor Training Manual used by the Municipality of Anchorage, but I was not allowed to mention it.  It talks about different levels of validity and how to test for them.  It also says that for 'high stakes' tests, like promotion tests, experts should be hired to validate the test.  The jury didn't get to hear about this. But it's relevant because as I wrote in an earlier post, the people in charge of testing, and specifically in charge of the engineer exam, only had Level I certification, which allows them to administer training and testing designed by someone with Level II certification.  It's at Level II that validity and reliability are covered.  

There really wasn't a need to get detailed in the trial, because the oral exam was so egregiously invalid and unreliable that you could just look at it and see the problems.  And we'll do that in the next posts.

That should be enough, but for people who want to know more about this, I'll give a bit more below.

-----------------------------------------------------------------------
Extra Credit

*"the harder it is to introduce bias"  There are always was that bias can be introduced, from unconscious bias to intentionally thwarting the system.   When civil service was introduced in the United States, there was 'common understanding' that women were not qualified for most jobs.  That was a form of bias.  Blacks were also assumed to be unqualified for most jobs.  Over the years these many of these sorts of cultural barriers have taken down.  But people have found other ways to surreptitiously obstruct barriers.  

Merit Principles

If you want to know more about merit principles I'd refer you to the Merit Systems Protection Board, which was set up as part of the Civil Service Reform Act of 1978.

A little more about reliability problems (because these are important to understand about the engineer promotion exam)

In the main part of this post I wrote that all the important (could affect the score) conditions of the test need to be the same no matter where or when or with whom a candidate takes the test.  Here are some more details
  • Location - If one location is less comfortable - temperature, noise, furniture, lighting, whatever - it could skew the scores of test takers there.
  • Time -  could be a problem in different ways.  
    • All candidates must have the same amount of time to take the test.  
  • Instructions - all instructions have to be identical
  • Security of the test questions - if some applicants know the questions in advance and others do not, the test is not reliable.

The scoring, too, has to be consistent from grader to grader for each applicant.

And there are numerous ways that scoring a test can go wrong.
  • Grader bias  - conscious and unconscious.   Raters who know the candidates may rate them differently than people who don’t know them at all. 
    • The Halo Effect means that if you have a positive view of the candidate, you're likely to give him or her more slack.  You think, "I know they know this."  
    • The Horn or Devil Effect is the opposite - if you already have a negative opinion about a candidate, you consciously or unconsciously give that candidate less credit.  These are well-documented biases.
    • Testing order bias affects graders and candidates.  
      • After three poor candidates, a mediocre candidate may look good to graders.  
  • Grading Standards - Is the grading scale clear and of a kind that the graders are familiar with?
    • Are the expected answers and how to score them clear to the graders?
    • Do the graders have enough time to calculate the scores consistently?
  • Grader Training -
    •  If graders aren't well trained, it can take them a while to figure out how to apply the scoring criteria, so they score differently at the end than at the beginning. 

How Do You Overcome the Biases In More Subjective Tests Like Essays, Interviews, and Oral Exams?

Despite the popularity of job interviews, experts agree that they are among the most biased and result in the least accurate predictions of candidate job performance.  Or see this link.

You have to construct standardized, objective rubrics and grading scales - this is critical, particularly for essay and oral exams.

On November 9, 2016, when the presidential election results were tallied, everyone saw the same facts, the same results.  But half the country thought the numbers were good and half thought they were bad.

When evaluating the facts about a job or promotion candidate, the organization has to agree, beforehand, on what 'good' facts look like and what 'bad' facts look like.  Good ones are valid ones - they are accurate predictors of who is more likely to be successful in the position.  Good and bad are determined by the test maker, not by the graders.  The graders merely test whether the performance matches the pre-determined standard of a good performance.



What’s a rubric?

It’s where you describe in as much detail as possible what a good answer looks like.  If you’re looking at content, you identify the key ideas in the answer, and possibly how many points a candidate should get if they mention each of those ideas.  It has to be as objective as possible. The Fire Safety Instructor Training Manual has some examples, but even those aren't as strong as they could be.

Good rubrics take a lot of thought - but it's thought that helps you clarify and communicate what a good answer means so that different graders give the same answer the same score.

Here are some examples:
UC Berkeley Graduate Student Instructors Training
Society For Human Resource Management - This example doesn't explicitly tell graders what the scores (1,2, 3, 4, 5) look like, as the previous one does.
BARS - Behaviorally Anchored Rating Scales - This is an article on using BARS to grade Structured Interviews.  Look particularly at Appendices A & B.
How Olympic Ice Skating is Scored - I couldn't find an actual scoring sheet, but this gives an overall explanation of the process.

My experience is that good rubrics force graders to ground their scores on something concrete, but graders can also miss interesting and unexpected things.  It's useful for graders to score each candidate independently and then discuss why they gave the scores they did - particularly the graders whose scores vary from most of the others.  Individual graders may know more about the topic, which gives their scores more value.  Or they may not have paid close attention.  Ultimately, it comes down to an individual making a judgment.  Otherwise we could just let machines grade.  But the more precise the scoring rubric, the easier it is to detect bias in the graders.


Accountability

Q:  What if a candidate thinks she got the answer right on a question, but it was scored wrong?

Everything in the test has to be documented.  Candidates should be able to see what questions they missed and how they were scored.  If the test key had an error, they should be able to challenge it.

Q:  Are you saying everything needs to be documented?

If there is going to be any accountability each candidate’s test and each grader’s score sheets must be maintained so that if there are questions about whether a test was graded correctly and consistently from candidate to candidate, it can be checked.

In the case of an oral exam or interview, at least an audio (if not video) record should be kept so that reviewers can see what was actually said at the time by the candidate and the graders.

Q:  Have you strayed a bit from the Merit Principles?

Not at all. This all goes back to the key Merit Principle - selecting and promoting the most qualified candidates for the job.  There won’t be 100% accuracy. But in general, if the test is valid,  a high score will correlate with a high job performance.  But unless the test is also reliable, it won’t be valid. The more reliable the test, the more consistent the scores will be under different conditions and graders.  The best way to make tests more reliable is to make them as objective as possible.


Tuesday, February 06, 2018

Graham v MOA #8: The Exams 1: How The Process Works

[The Graham v MOA tab above lists all the posts in this series and gives some overview of the case and why I think it's important.]

The Exams - How The Process Worked In 2012

The exams firefighter Jeff Graham sued the Municipality of Anchorage over were to determine who would be promoted from the entry level Anchorage Fire Department (AFD) position - firefighter - to the next level - engineer.  A firefighter has to promote to engineer to move up in the AFD.  If you passed the exams, you would go on a list based on your score.  Then, as there were openings for engineers, names would be taken from the top of the list.  If the list was used up before the next scheduled exam - in two years - an interim exam could be held.  If you were on the list, but not called in two years, then you had to take the exams all over again.  

In 2012, the exams had three parts:  1) a written exam, 2)  a practical exam, and 3)  an oral exam.  It was the first year an oral exam was part of the engineer promotion exams.   It was the oral exam that Graham objected to.  You had to pass ALL THREE parts to get on the engineer list.  The first two exams were relatively objective.  The oral exam was problematic in many ways, giving graders lots of leeway to pass or fail candidates without much accountability.   Let’s look at them all before going into the details of the oral exam in the next posts.  

The Written Exam 

This was based on a standard set of questions from a national bank of engineer test questions about technical knowledge.  It's multiple choice.  Test makers can choose from many questions, which allows them to modify the test to be appropriate to local conditions.  Overall, the bank of questions has been validated nationally - the questions are related to what a fire engineer should know, and the national association behind the bank determines the correct answers.  
Graham passed this exam with a score of 85.  He needed 70 to pass.  

The Practical Exam

The practical exam is made up of a series of 'evolutions,' as they call them, that test the candidates' abilities to handle the trucks and equipment as needed on duty.  The evolutions (think of them as scenarios) involve actually driving vehicles, hooking up hoses, responding to different types of fires, etc.  This exam was designed by local test makers.  Casey Johnson was in charge of this in 2012, and he followed the basic model that had been used in previous exams, but created his own specific scenarios. 
The different evolutions on the exam are supposed to be unknown to the test takers until they take the test.   But the exams are given outside on consecutive days and people taking it on the second or third day can learn from others what events they will be asked to respond to.
A related issue that came up has to do with training outside the Academy.  Senior AFD officers often assist firefighters in their stations when there are no emergency calls.  So different candidates will get different coaching on different possible evolutions at their stations.  In one case, it was argued that one of the people who helped prepare the practical exam gave his subordinates, at the station, training on a new process that hadn't been used at AFD yet but was on the exam.  Questions were raised about whether they had gotten advance information to prepare for that event.  The suggestions were denied.  

This exam was not validated professionally.  Jeff Graham has not challenged the events on this exam - they are related to what people have to do as engineers, but whether successfully completing the events on this test is the best, or even a good, predictor of success as engineer is not known.  

Graham did have some questions about the reliability of the exam.  Scores on the first day of the exam were low and the fail rate was very high.  By the third day, the fail rate dropped significantly.  Why might this be?
The exams are done out in the open where they can be seen by anyone.  The first people to take the exam do not know what they will be asked to do.  By the third day, people have been able to see what events were used, plus people who took the test can talk to friends who haven’t taken the test, so the later test takers can be better prepared.  
There was also some unconfirmed discussion at trial about whether the grading standards were loosened by the third day because the success rate was so low - which would raise questions about how the grading criteria were established.  
A reliable test is one where a test taker’s score should not vary regardless of the conditions of the test - which includes what day they took the test.  All test takers must face the same test conditions for the test to be reliable.

Jeff Graham passed the practical test comfortably.

The Oral Exam

The oral exam was created especially for the 2012 exam by Casey Johnson.  Oral interviews had been held for higher-level positions, like captain, but not for the technical job of engineer.  The exam consisted of two parts:  1)  a technical part and 2)  a "peer review."  This is the part that Jeff Graham failed, by one point.  It is the part Graham complained about before the exams even began, again after he was told he had failed, later to the Alaska Human Rights Commission, and finally in court.  

The technical part consists of ten questions, supposedly about technical issues, though the 2012 exam had two questions about how to prepare for the test and some that were more AFD policy and administration questions.  
The ‘peer review’ consists of five questions that seem to be designed to determine whether someone’s character is good enough to become an engineer.  

There are five testers for the oral exam.  Engineers are asked to volunteer to assist with various parts of the Academy (the training program designed to prepare people for the exam).  The Academy administrators, in this case Chad Richardson and his assistant Casey Johnson, decide who will perform what duties at the Academy and in the exams.  They can also encourage people to apply, which at least a couple of the testers said happened to them.  

Before the exam takes place, the testers pre-grade the peer review part of the exam.  That means they give each test taker a score based on their knowledge of that person.  There was mention of reviewing the application for promotion, but graders gave different responses about whether these were considered.  If they have no knowledge of that person, they can leave that part blank.  So, even though they theoretically had access to someone's application, they could skip the pre-score, which suggests that either the application wasn't important, they didn't look at the applications, or prior personal knowledge of the candidate was the key factor.  There was conflicting testimony about when this pre-grading was done.  Graders were asked to come in early and do things like score candidates on the pre-test.  But testimony showed that didn't necessarily happen.  Pre-grade scoring could be done in the morning before the test takers came in, before anyone was tested, or just before each individual came in to be tested.  

The Peer Review test process

The candidate walks into the room.  He’s given his pre-scores on the peer review.  He then has an hour to answer 15 questions - the ten technical questions and the five character questions.  That gives someone about four minutes per question.  The questions are projected on a screen and the candidate begins answering them.  

If the candidate got a passing score on the peer review pre-score, he can elect to skip any or all of the peer review questions and spend more time on the technical questions.  This, obviously, gives an advantage to people who were pre-scored well.  

Jeff Graham’s pre-score grades were below the needed 70. He got 69.  He was surprised by this.  


Overall Test Scoring

To pass the engineer exam, candidates have to pass all three parts of the exam.  That means that if they fail any of the three parts of the exam (get less than 70%), they fail the whole exam - even if their overall average on all the exams was 71% or 80% or 89%.  

Since you have to get 70% or better on ALL three exams, if someone gets a 69 on the first exam (the written exam), they do not go on to take the practical exam.  Those who pass the written and then  pass the practical exam can go on to take the oral exam.  
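The gating rule is simple enough to sketch.  This is just the rule as described in these posts (pass every part at 70 or better, in order), not the MOA's actual process or software:

```python
# A sketch of the pass/fail gating as described in these posts -- not the
# MOA's actual process or software.  Each part must be passed at 70 or better,
# in order; failing any part ends the process regardless of the overall average.
PASS_MARK = 70

def engineer_exam_result(written, practical, oral):
    parts = [("written", written), ("practical", practical), ("oral", oral)]
    for name, score in parts:
        if score < PASS_MARK:
            return f"out at the {name} exam ({score}); does not go on the list"
    return "passed all three parts; goes on the promotion list by score"

print(engineer_exam_result(85, 90, 69))  # fails on the oral despite a high average
print(engineer_exam_result(85, 90, 70))  # passes
```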

I don’t have the exact numbers available, but a large number took the written exams and fewer took each succeeding exam.  At trial, Deputy Chief Hettrick said people who made it to the orals had good scores - around 85 or more - on the written exams.   

From Exam To Promotion

Passing the exam doesn’t mean a firefighter will be promoted to engineer.  Those who pass go on a list based on their scores.  The higher the score, the higher they are on the list.  When there is an opening for an engineer, the top person on the list is promoted.  The list is good for two years.  If all the people on the list are promoted before the two years are up and they have new openings, they can have an interim Academy and test.  
Anyone left on the list after two years is no longer eligible and has to take the whole exam again. 

The cost of Academies is quite high in money and in time.  It is the full-time job of at least two people (in 2012, Chad Richardson and Casey Johnson) for a period of time, plus the time of all the volunteers and all the candidates.  Then there is the equipment and other materials used, and the fuel consumed practicing on the various rigs.  One figure I heard was about $60,000, but I don’t have confirmation of that.  

The Meaning of the High Fail Rate

A lot of people go to the Academy and a relatively small number make it onto the promotion list each time.  We can’t be sure why so many fail, but there are several possible explanations that come to mind.  
  1. People take the Academy to learn more about the promotion process and they might take the written test just to see how they do.  Sort of a  trial to gauge how difficult it is and how much they’ll have to study when they take it seriously.  People mentioned this was the case for some.
  2. The quality of the firefighters is low.  Only a high school diploma or a GED is required.  They may not be particularly good at studying and/or test taking.
  3. The training at the academy is inadequate to prepare most people to pass the exams.  
  4. The tests are necessarily rigorous to make sure only the most technically competent are promoted from firefighter to engineer.
  5. The tests are unnecessarily difficult or harshly graded. 
  6. Fewer engineers means more overtime for those who are engineers. 

I suspect there is an aspect of all six reasons (and perhaps some I haven’t thought of.)  

Let me explain number six a little more.  Because of the 24-hour shifts several days a week, AFD line employees have a lot of time away from work.  Many use this time to run other businesses.  But for many others this is an opportunity to work overtime.  Not only does the overtime give them time-and-a-half pay, it also raises the annual pay that their retirement benefits will be based on.  Some have argued that by making the testing for engineer difficult, the pool of engineers is kept small, and those who become engineers can work more overtime.  At trial, the MOA’s expert witness, hired to calculate possible compensation for Graham in case he won, testified that Graham had very little overtime compared to many who had 1000-2000 hours of overtime.  Firefighters work three 24-hour shifts per week.  At 72 hours a week, 1,500 hours of overtime is the equivalent of roughly 20 extra weeks of regular shifts.  That’s a lot of overtime and a lot of pay at time and a half.  One has to ask whether hiring more employees would make overtime less necessary and save the MOA money.    

I realize this bit on overtime goes beyond just an overview of the exams, but I’m not sure where else it will fit in, and it gives context to questions that come up about the exams overall.  I’m raising this issue because it came up. I don't know how significant it is.  I haven’t studied it, but it seems like something that should be followed up on.  

The Devil is in the Details

The next posts on Graham v MOA will focus on the details of the oral exams.  

Crimes of violence tend to be very tangible, very graphic.  We can imagine someone with a gun threatening a mugging victim or a bank teller.  We can imagine a stabbing.  We can see it vividly.  These are crimes that involve people on people violations.  

But administrative crimes are much harder to imagine.  They are structural crimes that are less visible and easier to hide.  They are tied up in details, rows of numbers, pages of text. Easier to conceal.  

How a scoring sheet for a test is designed can make the difference in whether someone passes the exam or not.  This invisibility is why we hear stories of people who have embezzled money for years and years before they were caught.  A petty thief can face much stiffer legal consequences than a white-collar criminal who has bilked people of millions.  The latter crime seems less problematic because it's so abstract and harder to visualize.  


The details can be tedious.  One reason I’m slow in getting these posts out is that I’m trying to make them as easy to read and understand as possible.  'Interesting' is a goal, but that's more elusive.  I keep revising and revising and eventually I say, ok, enough, just post it already.  Even though I’m sure it’s still hard for the average person with a busy schedule to read, let alone digest.   

Saturday, January 20, 2018

Graham v MOA #7: "you cannot allow the bad guy to go to jail and you leave the structure intact."

Below is an NPR interview with ESPN's Howard Bryant about the current sexual abuse trial of former USA Gymnastics team doctor Larry Nassar.  Bryant captures very clearly the point of my series of posts on the Graham trial.

We punish the bad guy, then leave the system that enables bad guys to operate intact.

In Graham's case, 'the bad guy' got demoted two ranks and everyone else involved is now in a higher position than they had been five years ago.  Except Graham, who is still at the entry-level firefighter position.

My background is public administration - how the system is designed, what are the rewards and punishments - intended and unintended?  What informal systems work against the formal systems?

When I look at this situation I think:  how did the system let this go on?  Just as Bryant asks in the audio.
But it seems like when the lawyers look at it, they think, OK, the case is closed, move on to the next case.  It's about individuals, not about the system.  That's horribly wrong.

That's why I'm spending so much time on this case.  To show what went wrong and to ask why the existing system never did anything about it.  If Jeff Graham hadn't been stubborn, hadn't risked his financial security to hire an attorney, hadn't broken the code of the fire department that you go along to get along, none of this would have come out.

It's just like the other systems Bryant mentions, systems that allow abusers and abuses to continue - like sexual assault, like concussions in football, like the church scandals.





  The part I'm highlighting starts at 1:46

How did it go on for so long?  We're still even asking the question of whether there were problems with the structure.  Of course there were problems with the structure.
Q:  Structure of?
2:00 USA Gymnastics, Michigan State, . . . the adults were supposed to take care of these athletes, supposed to protect them, no different from any other scandal, whether church, concussions, you cannot allow the bad guy to go to jail and you leave the structure intact.
2:45 Q:  Why did they wait so long?  Why did they wait for 20 years?  Larry Nassar has been under scrutiny for some time now.
2:53 This is a very American thing we do.  We find the bad guys, we take the bad guys, and we punish the bad guys.  Then we leave every mechanism that allows the bad guys to exist and enables the bad guys, we leave those things alone . . .
This is something we have to deal with as a culture because we don't deal with it very well.  And especially when you're dealing with young people.