What Do I Know?: test


Tuesday, November 16, 2021

Negative

My wife and I got our boosters about two weeks before getting on a plane. Our 8 year old granddaughter got her first shot about the same time. There's a longer interval between shots for her - don't know if that's for all kids or just here. The island is about 82% fully vaccinated for folks over 12. And most people are wearing masks, even walking outside.

But our daughter decided that we should take tests four days after we got here (long enough for the virus to show from our airplane trip). So yesterday, while the granddaughter got tested at school, we used a home testing kit. My daughter got tested and her stepson also got tested. Everyone negative.

Waiting

Just one red line means negative. Two would be positive.

So last night for the first time since we got here Thursday, we all had dinner together. Maskless.

Two tests were $45. And I'm not happy that the rest of the world can't get the vaccines available to us. So let your Congressional representatives know you want all of Africa to have access as well.

Yes, I'm stalling. Still trying to distill the key points on the truncation and cycle allocation post.

Friday, March 20, 2020

"The state [of Alaska] Public Health Laboratories had more than 1600 kits as of Tuesday."

Since I haven't seen this number anywhere, I thought I'd post it now. It came in an email from Jill Lewis, Deputy Director of the Alaska Division of Public Health, in response to my direct question about how many tests we had. Last week she'd mentioned 500 tests. So this time I asked:

"Do you have any updates on the number of tests Alaska has? (The state website has now reported that over 500 tests have been given.)"

So, once again:

"The state Public Health Laboratories had more than 1600 kits as of Tuesday [March 17, 2020]."

She didn't mention commercial labs and whether they have additional tests.

Tuesday, February 25, 2020

Exposing Trolls - Bot Sentinel: Easy Way To Check For Twitter Bots

There's a lot of fake news out there. Even misinformation campaigns. Knowing what is true or false is getting harder.

So we must be ever vigilant about any bit of news - on the mainstream media, on Facebook, Twitter, Instagram, or in real life. Here are some questions to embed in your brain for filtering out the crap.

Is it believable?

Does it support what I would like to believe? (Then I need to be especially careful)
Is it too strange to believe?
Is it so believable I accept it as true without thinking?

Is it true?

Use fact checkers like Politifact or FactCheck.org (UC Berkeley has a good list.)

Who said it? (What bias do they have? What's their record for lying?)

OpenSecrets.org shows who funds organizations and their biases. You can also just google the organization (Natl XYZ reputation) to find sites like media bias fact check to give you additional information about the media or organization

How can you verify it?

Google the basic idea and see if others are reporting it. Are they all of a certain bias?
Are there links to verify what they say? Go to the links and see if they are reputable.

So that's general advice. But to check specifically on Twitter accounts, I recommend Bot Sentinel. This link takes you to the Bot Sentinel page below. Then go to the green link in the upper right hand corner.

Then you get a popup window which lets you insert a Twitter account. (They use light grays which aren't showing up well in these screenshots, sorry.) So, you put the Twitter handle you want to analyze in the box and hit submit.

Very quickly you get a response, like this one:

OK, so how do they figure this out? They tell us that they aren't necessarily looking for actual bots. They are looking for Twitter users who post like bots. From their About Us page:

"We trained Bot Sentinel to identify specific types of trollbot accounts using thousands of accounts and millions of tweets for our machine learning model. The system can correctly identify trollbot accounts with an accuracy of 95%. Unlike other machine learning tools designed to detect “bots,” we are focusing on specific activities deemed inappropriate by Twitter rules. We analyze hundreds of Tweets per each Twitter account to determine if an account exhibit irregular tweet activity, engaging in harassment, or troll-like behavior."

For them, 'troll-like behavior' means behavior proscribed by Twitter.

"Researchers rarely agree on what someone considers a troll or what constitutes harmful bot activity, so we took a different approach when training our machine learning model. Instead of creating a model based on our interpretation of a troll or bot, we used Twitter rules as a guide when selecting Twitter accounts to train our model. We searched for accounts that were repeatedly violating Twitter rules and we trained our model to identify accounts similar to the accounts we identified as “trollbots.” Note: Ideology, political affiliation, religious beliefs, geographic location, or frequency of tweets are not factors when determining the classification of a Twitter account."

What do the scores mean?

"We rate accounts based on a score from 0% to 100%, the higher the score the more likely the account is a trollbot. We analyze several hundred tweets per account, and the more someone engages in behavior that is troll-like, the higher their trollbot rating is."

The benefit of this is:

"We feel since trollbot accounts are likely violating Twitter rules, most Twitter users would want to report and avoid these accounts because they offer little value to meaningful public discourse."

So that leads us to ask: What are Twitter Policies here?

Twitter policies are complicated. I couldn't find a simple list. Here's a link to their General Guidelines and Policy page. It's just a set of links to other pages which give more specific rules for what you shouldn't do on Twitter. I'm trying to bring what seem like some of the more important ones together here.

1. Violent threats policy
What is in violation of this policy?
Under this policy, you can’t state an intention to inflict violence on a specific person or group of people. We define intent to include statements like “I will”, “I’m going to”, or “I plan to”, as well as conditional statements like “If you do X, I will”. Violations of this policy include, but are not limited to:

threatening to kill someone;

threatening to sexually assault someone;

threatening to seriously hurt someone and/or commit a other violent act that could lead to someone’s death or serious physical injury; and

asking for or offering a financial reward in exchange for inflicting violence on a specific person or group of people.

Probably they should add "encourage other people to do any of these things." There's a lot more nuance on the page, but this is the gist of the Violent Threats Policy.

Next has to do with the content of your Twitter name and profile.

2. Abusive profile information
Twitter Rules: You may not use your username, display name, or profile bio to engage in abusive behavior, such as targeted harassment or expressing hate towards a person, group, or protected category.
Rationale
While we want people to feel free to express their individuality in their profile names and descriptions, we have found that accounts with abusive profile information usually indicate abusive intent and strongly correlate with abusive behavior. The high visibility of profile names and descriptions also means that people might involuntarily find themselves exposed to threatening or abusive content when visiting a profile page.
When this applies
We will review and take enforcement action against accounts that target an individual, group of people, or a protected category with any of the following behavior in their profile information, i.e., usernames, display names, or profile bios:

Violent threats

Abusive slurs, epithets, racist, or sexist tropes

Abusive content that reduces someone to less than human

Content that incites fear

3. Glorification of violence policy (You can see that the bullet points here seem to be a collection of ideas from different people, not carefully edited. And I hope using the term bullet point isn't considered a glorification of violence.)

You may not threaten violence against an individual or a group of people. We also prohibit the glorification of violence.
Glorifying violent acts could inspire others to take part in similar acts of violence. Additionally, glorifying violent events where people were targeted on the basis of their protected characteristics (including: race, ethnicity, national origin, sexual orientation, gender, gender identity, religious affiliation, age, disability, or serious disease) could incite or lead to further violence motivated by hatred and intolerance. For these reasons, we have a policy against content that glorifies acts of violence in a way that may inspire others to replicate those violent acts and cause real offline harm, or events where members of a protected group were the primary targets or victims.
What is in violation of this policy?

Under this policy, you can’t glorify, celebrate, praise or condone violent crimes, violent events where people were targeted because of their membership in a protected group, or the perpetrators of such acts. We define glorification to include praising, celebrating, or condoning statements, such as “I’m glad this happened”, “This person is my hero”, “I wish more people did things like this”, or “I hope this inspires others to act”.
Violations of this policy include, but are not limited to, glorifying, praising, condoning, or celebrating:

violent acts committed by civilians that resulted in death or serious physical injury, e.g., murders, mass shootings;
attacks carried out by terrorist organizations or violent extremist groups (as defined by our terrorism and violent extremism policy); and
violent events that targeted protected groups, e.g., the Holocaust, Rwandan genocide.

4. How Twitter limits what content shows up in your searches. Actions Twitter doesn’t like from accounts

(Current Twitter limits: These are not about what you say, but about how often you do things.)

"Please do not:

Repeatedly post duplicate or near-duplicate content (links or Tweets).

Abuse trending topics or hashtags (topic words with a # sign).

Send automated Tweets or replies.

Use bots or applications to post similar messages based on keywords.

Post similar messages over multiple accounts.

Aggressively follow and unfollow people.

Current Twitter limits
The current technical limits for accounts are:

Direct Messages (daily): The limit is 1,000 messages sent per day.

Tweets: 2,400 per day. The daily update limit is further broken down into smaller limits for semi-hourly intervals. Retweets are counted as Tweets.

Changes to account email: 4 per hour.

Following (daily): The technical follow limit is 400 per day. Please note that this is a technical account limit only, and there are additional rules prohibiting aggressive following behavior.

Following (account-based): Once an account is following 5,000 other accounts, additional follow attempts are limited by account-specific ratios."

For non-Twitter users, direct messages (DMs) are where you send a non-public message to another Twitter account. I think they have to be following you to do that. 1,000 a day seems like a pretty high number for a human.

And 2400 Tweets a day also seems way too high a limit for a human. That's 100 Tweets an hour - assuming you never sleep. Most people can only do this pace if they have programmed their computer to automatically retweet other Tweets, I imagine. As I tried to find the thoughts of others on this, it appears much of this is about using Twitter as a marketing tool. Or propaganda tool.
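The per-hour figure is easy to check. Here is a minimal Python sketch of that arithmetic. The two daily limits come from the Twitter list quoted above; the function name and the 16 waking hours are illustrative assumptions, not anything Twitter publishes.

```python
# Daily limits quoted in Twitter's "Current Twitter limits" list above.
TWEETS_PER_DAY = 2400
DMS_PER_DAY = 1000


def per_hour(daily_limit: int, active_hours: float = 24) -> float:
    """Average hourly rate needed to reach a daily limit.

    active_hours is an illustrative parameter: 24 means never sleeping.
    """
    return daily_limit / active_hours


print(per_hour(TWEETS_PER_DAY))      # tweeting around the clock
print(per_hour(TWEETS_PER_DAY, 16))  # assuming 8 hours of sleep (hypothetical)
print(per_hour(DMS_PER_DAY))         # direct messages around the clock
```

With no sleep the Tweet limit works out to 100 an hour, and any allowance for sleep pushes the required pace higher still.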

In any case, those are the behaviors that Bot Sentinel says it's more-or-less trying to track to determine its scores.

There are A LOT more rules and guidelines. This link will get you to something like a Table of Contents of Twitter Rules.

Oh, one more thing. I checked on Donald J. Trump's Twitter feed. This raises questions about how well Bot Sentinel works. Or maybe they just give the President a lot more leeway.

Tuesday, March 20, 2018

Graham v MOA #10: The Exams #3: The Oral Technical Exam

[All of the posts in this series are indexed at the "Graham v MOA" tab under the blog header above. Or you can just click here.]

The details are complicated so here's an overview of some key problems; details below.

Overall, oral exams (including interviews) are highly invalid and unreliable predictors of success on the job, despite the fact that they are used a lot.
Imagine a question about how to do something fairly complex like, "tell me what you know about brakes." Where do you start? You have four minutes. Go.
The graders may or may not understand the answer, but they have four or five pages taken from a textbook that they can scan through, with nothing to indicate what's important and what isn't.
The grading scale has uneven intervals - 5, 4.5, 3.5, 2.5, 1.5, 0 - with 3.5 being the lowest possible pass. This odd scale is not what people are used to and skews the grades downward.
You have to answer each question, they have to assess your answer, all in about four minutes.
Lack of Accountability. There's no recording, no way to check to see what a candidate actually said.

That should give you an idea of the kinds of problems that exist with the oral technical exam that Jeff Graham took.

The Details

As I mentioned in an earlier post, firefighters first have to pass a written exam on technical issues. Questions come from a national bank of questions that have been shown to be relevant to being an engineer (the next step above fire fighter).

Then comes the practical exam. Here the firefighters have to get onto trucks and use equipment based on 'evolutions' (AFD's term) which are more or less scenarios that the firefighters have to role-play. This is the most hands-on part of the exam process.

If you pass these two tests, then you can go on to the oral exams - made up of ten questions of the technical oral exam and then of five questions of the 'peer review' oral exam. That's fifteen questions in an hour, or four minutes per question.

Jeff Graham passed the written and the practical exams. His dispute was over the oral exams which he complained about as being subjective and easy to manipulate. So let's, in this post, look at the problems with the technical part of the oral exams.

All Oral Exams, Including Interviews, Are Suspect

Before we even start on the oral technical exam, I need to reiterate this point I made in an earlier post.

Despite the popularity of job interviews, experts agree that they are among the most biased and result in the least accurate predictions of candidate job performance.

You can read the whole article or search the internet on this topic and find I'm not exaggerating or cherry picking. It's pretty much only people who conduct interviews who think they know what they're doing.

"interviewers typically form strong but unwarranted impressions about interviewees, often revealing more about themselves than the candidates."
"Research that my colleagues and I have conducted shows that the problem with interviews is worse than irrelevance: They can be harmful, undercutting the impact of other, more valuable information about interviewees."

Or see this link. I offer a few quotes.

"Consider the job interview: it’s not only a tiny sample, it’s not even a sample of job behaviour but of something else entirely. Extroverts in general do better in interviews than introverts, but for many if not most jobs, extroversion is not what we’re looking for. Psychological theory and data show that we are incapable of treating the interview data as little more than unreliable gossip. It’s just too compelling that we’ve learned a lot from those 30 minutes."

Some of these comments are made about unstructured interviews. The AFD engineer exam was a structured interview which researchers find to be a little better. But still not that good. From one of the most well known academics writing on human resources management, Edward E. Lawler III:

"Years of research on job interviews has shown that they are poor predictors of who will be a good employee. There are many reasons for this, but perhaps the key explanation is that individuals simply don’t gather the right information and don’t understand what the best predictors of job success are. A careful analysis of the background of individuals and their work history and work samples are more accurate predictors of success on the job than are the judgments which result from interviews."

The graders in the AFD oral exam did not have background information on the candidates' performance, and they didn't review the performance reviews done by the candidates' supervisors.

Excuse me for reiterating this point. It's one that is counter to people's perception and to practice. But it's important to make this point loud and clear from the beginning.

My Problems With the Exams

On the positive side, the AFD oral technical and peer review exams were structured. But there were numerous problems with the structure.

The Questions

There were ten questions on technical topics. My understanding of the conditions under which I reviewed the exam questions is that I can't talk about the specific questions publicly, only about things that were discussed publicly in court.

The MOA provided no evidence that the exam was validated, though Graham's lawyer asked the MOA for documentation of how the exams were validated, or even that they were validated. That is, we have no evidence to show that the ten questions were predictors of a candidate's likely success in the position of engineer. There is content related to the job, but AFD has produced no evidence showing that, for example, a candidate with a score of 90 will be a better engineer than a candidate with a score of 65. It may very well be that AFD is only testing who has the best oral test taking skills.

Two of the ten were about how to prepare for the exam itself. The test creator said these were intended to be softball questions to relax candidates. Graham was scored low on them. The other questions were about things like how equipment worked and about AFD procedures for different things.

The Answers

For tests to be reliable, the graders need to be able to compare the candidate's answer to the ideal answer. The key points should be listed with the value of each point. The graders were given a package of answers. Some questions had no real answers attached. Most looked like they were cut and pasted together from text books or AFD policy and procedure manuals. For one question, I timed myself reading the four pages of answer that were provided. It took me 11 minutes and 30 seconds just to read the answer. But there are only about four minutes available per question. How can a grader reliably a) listen to the candidate's answer, b) take notes, and c) read through the answer sheet to match what the candidate said to the expected correct answer? He can't. Graham appears to have done better on the few questions that had bullet points on the provided answers rather than several pages of raw material.
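To put the time problem in numbers, here is a small Python sketch. The exam length, the question counts, and the 11 minutes 30 seconds reading time are the figures given in this post; the variable names are just illustrative.

```python
# Figures from the post: a one-hour oral exam with ten technical
# questions and five peer review questions.
EXAM_MINUTES = 60
TECHNICAL_QUESTIONS = 10
PEER_REVIEW_QUESTIONS = 5

minutes_per_question = EXAM_MINUTES / (TECHNICAL_QUESTIONS + PEER_REVIEW_QUESTIONS)

# Time it took to read one four-page answer packet: 11 minutes 30 seconds.
reading_minutes = 11 + 30 / 60

# How many times over the per-question budget the reading alone runs.
overrun_factor = reading_minutes / minutes_per_question

print(f"Budget per question: {minutes_per_question} minutes")
print(f"Reading the answer packet: {reading_minutes} minutes")
print(f"Reading alone takes {overrun_factor:.1f} times the budget")
```

And that is before the grader has listened to the candidate or written a single note.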

Assuming that the question itself is valid, the answer sheets graders had should have had a list of key points that the candidate should mention for each question. The best such answer sheets would identify the most important points that should be in the response with a score for each. There was nothing like that. Instead the graders had to balance the answer the candidate gave, the pages of 'answer' copied out of the text or manual, and give the candidate a score on a completely different score sheet.

The Grading Scale

Rubric For Oral Technical Exam

5 - Outstanding, exemplary performance far exceeds expectations

4.5 - Above average performance exceeds level of basic skills/abilities

3.5 - Adequately demonstrated the basic abilities/skills

2.5 - Needs improvement, falls short of what is expected

1.5 - Unsatisfactory, performance is substandard

0 - Unacceptable, does not demonstrate comprehensive and/or application of required skills sets

Note: There's a half point difference between "Outstanding" and "Above average". Then it drops, not .5 like from 5 to 4.5, but 1 point to 3.5 "Adequate". So a grader could think, ok, this is good enough, it's adequate. But 3.5 isn't 'adequate', it's really 'just barely adequate'. It's the lowest possible passing score. The top two scores are very high marks. The next one is barely passing.

At the bottom of the scale, it goes from 0 to 1.5, both of which are almost equivalent failing grades. Yet 1.5 is "unsatisfactory" while 3.5 (the same interval from 5 as 1.5 is from zero) is only "adequate".

1. This scale has uneven intervals. That is, the distances between the points on the scale are not the same. Look closely. The top two scores are both strong passing scores and the bottom two scores are both very poor failing scores. But the top two are separated by .5 points, while the bottom two are separated by 1.5 points. That's the same interval as from 5 to 3.5.

If 1.5 is 'unsatisfactory' then 3.5 should be 'satisfactory', but it's only 'adequate' and more accurately 'barely adequate' because 3.5 is the lowest you can get and still pass. 3.4 is a failing grade. (It's less than the 70% (20*3.5) needed to pass.)

The scale skews the scores down. The points graders can mark go from 100% (5) to 90% (4.5) to 70% (3.5) which is the lowest possible passing score. Why isn't there an 80% (4)? That's what normal scales with equal intervals would have next.
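The conversion from rubric points to percentages can be sketched in a few lines of Python. The rubric values and the 70% passing mark are from the exam; the function and variable names are illustrative assumptions.

```python
# Rubric points from the AFD oral technical exam, highest to lowest.
RUBRIC_POINTS = [5, 4.5, 3.5, 2.5, 1.5, 0]
PASSING_PERCENT = 70  # 3.5 * 20, the lowest passing score


def to_percent(points: float) -> float:
    """Convert a rubric score to a percentage (each point is worth 20%)."""
    return points * 20


for points in RUBRIC_POINTS:
    percent = to_percent(points)
    verdict = "pass" if percent >= PASSING_PERCENT else "fail"
    print(f"{points:>3} -> {percent:5.1f}%  {verdict}")
```

Only three of the six possible marks pass, and there is nothing a grader can mark between 90% and 70%.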

2. Passing on this test is 3.5. It took me a bit to figure that out, but 70% is the passing score, so each of these numbers need to be multiplied by 20 to get the actual score. 70% (3.5*20) is the minimal passing score. Let's compare that to other scoring forms you know. Say the standard Likert scale on most surveys:

"How do you feel about the Principal's performance this year?
5 – strongly approve
4 – somewhat approve
3 – neutral/no opinion
2 – somewhat disapprove
1 – strongly disapprove" (from Brighthub Education though it had the scale reversed)

Note that here a score of 3 is in the middle and is neutral, whereas in the AFD rubric, 3.5 is the lowest passing score. The 3 or neutral would have a range from 3.5 to 2.5. So a 3.5 would be on the high end, not the low end as it is on the AFD scale.

I chose this 5 point scale, because the AFD calls its scale a five point scale. But if you look at it, it's really a six point scale, since it has six possible scores: 0, 1.5, 2.5, 3.5, 4.5, and 5. They didn't even realize they had a six point scale! In most six-point Likert scales, there are three positive and three negative options. On such a scale (5, 4, 3, 2, 1, 0) a 3.5 would be a strong passing score, not barely passing.

Let's look at it again from a different angle. In the common academic scoring system, 90-100 is an A, 80-89 is a B, 70-79 is a C, 60-69 is a D, and below 60 is an F.

But in the AFD system, below 70 is an F. On the AFD scale the first two numbers (a ten percent interval) are both very good scores. The next score (20 percent less) is barely passing. All the rest are failing scores.

Why does this matter? First, it matters because the intervals are not equal. It's not common to have uneven intervals so that the distance between one score and another can vary from 10% (5 to 4.5) to 20% (4.5 to 3.5) to 30% (1.5 to 0).
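Here is a short Python sketch that measures the gap between each pair of neighboring scores, in percent, and compares the AFD scale with an ordinary equal-interval six point scale. The function name is an illustrative assumption; the scale values are the ones discussed above.

```python
AFD_SCALE = [5, 4.5, 3.5, 2.5, 1.5, 0]
EQUAL_SCALE = [5, 4, 3, 2, 1, 0]  # a normal six point scale, for comparison


def gaps_in_percent(scale):
    """Percent distance between each score and the next one down (1 point = 20%)."""
    return [(higher - lower) * 20 for higher, lower in zip(scale, scale[1:])]


print("AFD scale gaps:  ", gaps_in_percent(AFD_SCALE))
print("Equal scale gaps:", gaps_in_percent(EQUAL_SCALE))
```

The equal-interval scale drops by the same 20% at every step, while the AFD scale's steps range from 10% at the top to 30% at the bottom.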

It matters because the scale isn't like any scale we normally see and so the graders don't have a good sense of how skewed it is toward failing candidates. The 3.5 isn't 'good', it's barely passing. Yet most people are used to the academic scoring system and would assume that 70% would be something like a low C. There is no 'good' on this scale. There's 'walks on water,' 'really outstanding,' then 'barely passing'. It took me a while to understand why the scale seemed so off. I don't imagine an AFD grader would figure this out in the time they have to grade. There isn't enough time to even figure out what the right score should be, let alone analyze the scale to see how skewed it is.

Reliability

Tests have to be both valid and reliable. Valid means a high score predicts success (in this case, 'will perform the job well') and a low score predicts (in this case) 'will perform the job poorly'. Even if the test itself is valid (and the AFD never did have anyone test for validity), it also has to be reliable to be an accurate predictor. Without reliability, the scores aren't accurate, and so the test is no longer a good predictor (no longer valid).
Reliable means that the same candidate taking the test under different conditions (different time, different place, different graders) would essentially get the same score.

Given the lack of clear answers for the graders to use to compare the candidate's answer to, and given the lack of a clear rating system, it's not likely that this is a reliable test.

And in fact, when I looked at the scores individuals got on the same question by different graders, there were some big differences in score. Some were a point off, some were 2 points off and even 2.5 points off. There were even a few that were 3 points off.

That may not sound like much, but on a six point scale, it's a lot. Even though they actually used a six point scale, they said that 3.5 was equal to 70% - the lowest passing grade. So each point was worth 20%. Therefore, two points equals a 40% difference. 2.5 equals a 50% difference on the score. 3 equals 60%. If this test were reliable, the scores by different graders shouldn't vary by more than 10% or so. That would be half a point on any question.

On this exam, graders gave scores that were up to 60% (3 points) different for the same candidate on the same question! That's the difference - in an academic setting - between a 95% (A) and 35% (low F). But even a one point difference is a 20% difference - that could be a 90% and a 70% or the difference between an A- and a C-. That's a huge spread, and it's there because the answer sheets don't tell the graders what a good answer looks like, and the grading scale easily tricks graders into giving lower ratings than they realize they are giving.
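A check like this is simple to automate. The Python sketch below flags any question where the graders' scores for one candidate differ by more than half a point (10%). The scores in it are hypothetical, made up purely for illustration, and are NOT the actual exam data; the names are also illustrative.

```python
POINT_VALUE_PERCENT = 20  # each rubric point is worth 20%
TOLERANCE_POINTS = 0.5    # the "10% or so" a reliable test should stay within


def spread(scores):
    """Largest disagreement, in rubric points, among graders on one question."""
    return max(scores) - min(scores)


# Hypothetical scores from three graders for one candidate (not real data).
example_scores = {
    "Question 1": [3.5, 3.5, 4.5],
    "Question 2": [1.5, 4.5, 3.5],
    "Question 3": [2.5, 2.5, 2.5],
}

for question, scores in example_scores.items():
    gap = spread(scores)
    status = "too far apart" if gap > TOLERANCE_POINTS else "consistent"
    print(f"{question}: {gap} points = {gap * POINT_VALUE_PERCENT:.0f}% ({status})")
```

In this made-up example, Question 2 shows the kind of 3 point (60%) disagreement described above, while Question 3 shows what agreement among graders would look like.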

The two people in charge of setting up the tests (and the overall training director at the time) were actually not qualified to prepare the tests. They all had Alaska Fire Safety Training Instructor Certificate Level I. They were eligible to administer training and tests that had been prepared by someone else, someone with a Level II certificate. Here's the earlier post on the certifications.

Accountability

The exam is an hour long. Some of the graders wrote brief notes on their score sheets, but there isn't much room. Those notes are the only record of what actually happened in the oral exam. There's no way for the candidate or the graders to check what was actually said. Even as they review the pages that contain the answer, they can't replay the tape to see if the candidate covered something or not. And years later? At trial? Well I know that five minutes later my wife and I have totally different recollections of what each of us said. And at trial, there were a number of occasions when Jeff Jarvi, Jeff Graham's attorney, asked the graders about something they had said at their depositions or at the State Human Rights Commission hearing. They'd answer. Then Jarvi would have them read the transcripts, and the transcripts were not at all what they remembered.

Video or audio recordings of the oral exam probably could have averted this whole trial. The tapes would have demonstrated whether Jeff Graham had covered the points as he said he did, or that he didn't as some of the graders said.

Not having recordings means no one can check how well the graders did their jobs. Or whether people with low scores really did answer less accurately than people with high scores. And it means that candidates can't get feedback on how they did so they can better prepare for the next exam. (I'd note that, as a professor, the most convincing way to demonstrate to students that their papers weren't very good was to have them read the best papers. It worked every time.)

Unequal Conditions

One other twist here. This takes us ahead to the oral peer review, but it affects how people do on the technical as well. Prior to the oral peer review, candidates are 'pre-scored' by the graders based on their personal knowledge of the candidates. Imagine random co-workers, some who might work closely with you and others who don't, evaluating you based on, well, whatever they want. No one asks the basis. It's not based on written materials they have about your work history, because there is the option of N/O - not observed. If there were some written documentation, they wouldn't need the N/O option.

When the candidate walks into the room, he (it's almost always a he) is given his pre-scores. If you pass (get over 70%) you can choose to spend more time on the technical questions and skip the peer-review questions altogether. You can gain 20 minutes.

Imagine the emotional impact of being told, before you even start the oral technical exam, that the graders had failed you on the pre-scoring of the peer-review. Or that they passed you with a high score. In Jeff's case, he was stunned to learn he'd been failed by the graders in the pre-scoring of the peer-review. It set him back as the technical oral began.

You've Made It To The End

If you made it this far, congratulations. Going through details like this is necessary to truly get a sense of how badly these exams were designed and implemented. But it does take work. Thanks for hanging in there. These are the kinds of details that the jury had to sit through over the three-week-plus trial.

This post covered the technical oral exam problems and the next post will cover the oral 'peer review' exam that was part of the one hour along with the technical oral exam.

Tuesday, February 06, 2018

Graham v MOA #8: The Exams 1: How The Process Works

[The Graham v MOA tab above lists all the posts in this series and gives some overview of the case and why I think it's important.]

The Exams - How The Process Worked In 2012

The exams firefighter Jeff Graham sued the Municipality of Anchorage over were to determine who would be promoted from the entry level Anchorage Fire Department (AFD) position - firefighter - to the next level - engineer. A firefighter has to promote to engineer to move up in the AFD. If you passed the exams, you would go on a list based on your score. Then, as there were openings for engineers, names would be taken from the top of the list. If the list was used up before the next scheduled exam - in two years - an interim exam could be held. If you were on the list, but not called in two years, then you had to take the exams all over again.

In 2012, the exams had three parts: 1) a written exam, 2) a practical exam, and 3) an oral exam. It was the first year an oral exam was part of the engineer promotion exams. It was the oral exam that Graham objected to. You had to pass ALL THREE parts to get on the engineer list. The first two exams were relatively objective. The oral exam was problematic in many ways, giving graders lots of leeway to pass or fail candidates without much accountability. Let’s look at them all before going into the details of the oral exam in the next posts.

The Written Exam

This was based on a standard set of questions from a national bank of engineer test questions about technical knowledge. It’s multiple choice. Test makers can choose from many questions. This allows them to modify the test to be appropriate to local conditions. Overall, the bank of questions has been validated nationally - the questions are related to what a fire engineer should know and this national association determines the correct answers.

Graham passed this exam with a score of 85. He needed 70 to pass.

The Practical Exam

The practical exam is made up of a series of 'evolutions,' as they call them, that test the candidates' abilities to handle the trucks and equipment as needed on duty. The evolutions (think of them as scenarios) involve actually driving vehicles, hooking up hoses, responding to different types of fires, etc. This exam was designed by local test makers. Casey Johnson was in charge of this in 2012 and he followed the basic model that had been used in previous exams, but created his own specific scenarios.

The different evolutions on the exam are supposed to be unknown to the test takers until they take the test. But the exams are given outside on consecutive days and people taking it on the second or third day can learn from others what events they will be asked to respond to.

A related issue that came up has to do with training outside the Academy. Senior AFD officers often assist firefighters in their stations when there are no emergency calls. So different candidates will get different coaching on different possible evolutions at their stations. In one case, it was argued that one of the people who helped prepare the practical exam gave his subordinates, at the station, training on a new process that hadn’t been used at AFD yet, but was on the exam. Questions were raised whether they had gotten advance information to prepare for that event. The suggestions were denied.

This exam was not validated professionally. Jeff Graham has not challenged the events on this exam - they are related to what people have to do as engineers. But whether successfully completing the events on this test is the best, or even a good, predictor of success as an engineer is not known.

Graham did have some questions about the reliability of the exam. Scores on the first day of the exam were low and the fail rate was very high. By the third day, the fail rate dropped significantly. Why might this be?

The exams are done out in the open where they can be seen by anyone. The first people to take the exam do not know what they will be asked to do. By the third day, people have been able to see what events were used, plus people who took the test can talk to friends who haven’t taken the test, so the later test takers can be better prepared.

There was also some unconfirmed discussion at trial about whether the grading standards were loosened by the third day because the success rate was so low. If so, that would raise questions about how the grading criteria were established.

A reliable test is one where a test taker’s score does not depend on the conditions of the test - which includes what day they took the test. All test takers must face the same test conditions for the test to be reliable.

Jeff Graham passed the practical test comfortably.

The Oral Exam

The oral exam was created especially for the 2012 exam, by Casey Johnson. Oral interviews had been held for higher level positions, like captain, but not for the technical job of engineer. The exam consisted of two parts: 1) a technical part and 2) a “peer review.” The oral exam is the part that Jeff Graham failed, by one point. It was also the part that Graham complained about before the exams even began, again after he was told he failed the exam, later to the Alaska Human Rights Commission, and finally in court.

The technical part consists of ten questions, supposedly about technical issues, though the 2012 exam had two questions about how to prepare for the test and some that were more AFD policy and administration questions.

The ‘peer review’ consists of five questions that seem to be designed to determine whether someone’s character is good enough to become an engineer.

There are five testers for the oral exam. Engineers are asked to volunteer to assist with various parts of the Academy (the training program designed to prepare people for the exam). The Academy administrators, in this case Chad Richardson and his assistant Casey Johnson, decide who will perform what duties at the Academy and in the exams. They can also encourage people to apply, which at least a couple of the testers said happened to them.

Before the exam takes place, the testers pre-grade the peer review part of the exam. That means, they give each test taker a score based on their knowledge of that person. There was mention of reviewing the application for promotion, but graders had different responses about whether these were considered. If they have no knowledge of that person, they can leave that part blank. So, even though they, theoretically, had access to someone’s application, they could skip the pre-score, which suggests that either the application wasn’t important, they didn’t look at the applications, or prior personal knowledge of the candidate was the key factor. There was conflicting testimony about when this pre-grading was done. Graders were asked to come in early and do things like score candidates on the pretest. But testimony showed that didn’t necessarily happen. Pre-grade scoring could be done in the morning before the testers came in, before anyone was tested, or before each individual came in to be tested.

The Peer Review test process

The candidate walks into the room. He’s given his pre-scores on the peer review. He then has an hour to answer 15 questions - the ten technical questions and the five character questions. That gives someone about four minutes per question. The questions are projected on a screen and the candidate begins answering them.

If the candidate got a passing score on the peer review pre-score, he can elect to skip any or all of the peer review questions and spend more time on the technical questions. This, obviously, gives an advantage to people who were pre-scored well.
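To make the time advantage concrete, here is a minimal sketch of the arithmetic. It assumes the hour is split evenly across whatever questions a candidate chooses to answer, which is my simplification and not something the exam rules stated. The function name `minutes_per_question` is just for illustration.

```python
TOTAL_MINUTES = 60          # one hour for the whole oral exam
TECHNICAL_QUESTIONS = 10    # everyone answers these
PEER_REVIEW_QUESTIONS = 5   # optional only if the pre-score was passing


def minutes_per_question(peer_review_answered: int) -> float:
    """Average minutes per question, assuming an even split of the hour."""
    if not 0 <= peer_review_answered <= PEER_REVIEW_QUESTIONS:
        raise ValueError("peer_review_answered must be between 0 and 5")
    return TOTAL_MINUTES / (TECHNICAL_QUESTIONS + peer_review_answered)


# A candidate who must answer all 15 questions
print(minutes_per_question(5))  # 4.0
# A candidate who was pre-scored well and skips all five peer review questions
print(minutes_per_question(0))  # 6.0
```

Under that assumption, a well pre-scored candidate who skips the peer review questions gets about six minutes per technical question instead of four.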

Jeff Graham’s pre-score grades were below the needed 70. He got 69. He was surprised by this.

Overall Test Scoring

To pass the engineer exam, candidates have to pass all three parts of the exam. That means that if they fail any of the three parts of the exam (get less than 70%), they fail the whole exam - even if their overall average on all the exams was 71% or 80% or 89%.

Since you have to get 70% or better on ALL three exams, if someone gets a 69 on the first exam (the written exam), they do not go on to take the practical exam. Those who pass the written and then pass the practical exam can go on to take the oral exam.
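The pass rule described above can be sketched in a few lines. This is only an illustration of the rule as I understand it, not the AFD's actual scoring procedure, and the candidate scores in the example are made up.

```python
from typing import Optional, Tuple

PASS_MARK = 70  # percent needed on EACH exam


def engineer_exam_result(written: float, practical: float, oral: float) -> Tuple[bool, Optional[str]]:
    """Return (passed, exam_failed).

    The exams are checked in order, because a candidate who fails one
    does not go on to the next. The average is never consulted.
    """
    for name, score in (("written", written), ("practical", practical), ("oral", oral)):
        if score < PASS_MARK:
            return False, name
    return True, None


# Hypothetical candidate: strong on two exams, one point short on the third.
scores = (85, 90, 69)
print(sum(scores) / len(scores) > PASS_MARK)  # True: the average is above 70
print(engineer_exam_result(*scores))          # (False, 'oral')
```

The point of the sketch is that a single 69 fails the whole exam no matter how high the other two scores are.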

I don’t have the exact numbers available, but a large number took the written exams and fewer took each succeeding exam. At trial, Deputy Chief Hetrick said people who made it to the orals had good scores - around 85 or more - on the written exams.

From Exam To Promotion

Passing the exam doesn’t mean a firefighter will be promoted to engineer. Those who pass go on a list based on their scores. The higher the score, the higher they are on the list. When there is an opening for an engineer, the top person on the list is promoted. The list is good for two years. If all the people on the list are promoted before the two years are up and they have new openings, they can have an interim Academy and test.

Anyone left on the list after two years is no longer eligible and has to take the whole exam again.
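Here is a rough sketch of how such a list works. It assumes each candidate ends up with a single overall score used for ranking, which is my assumption since I don't know exactly how the three exam scores were combined, and the names and scores are invented. It leaves out the two year expiration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    score: float  # assumed single overall score used for ranking


def build_promotion_list(passed: List[Candidate]) -> List[Candidate]:
    """Rank everyone who passed, highest score first."""
    return sorted(passed, key=lambda c: c.score, reverse=True)


def promote_next(promotion_list: List[Candidate]) -> Optional[Candidate]:
    """Take the top name off the list when an engineer position opens.

    Returns None if the list is used up, which is when an interim
    Academy and test could be held.
    """
    return promotion_list.pop(0) if promotion_list else None


# Invented candidates, for illustration only
ranked = build_promotion_list([
    Candidate("A", 78), Candidate("B", 91), Candidate("C", 84),
])
print([c.name for c in ranked])   # ['B', 'C', 'A']
print(promote_next(ranked).name)  # B
```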

The cost of Academies is quite high in money and in time. It is the full time job for at least two people (in 2012 Chad Richardson and Casey Johnson) for a period of time, plus the time of all the volunteers and all the candidates. Then there is the equipment and other things used, including the practicing on various rigs and the gas that takes. One figure I heard was about $60,000 but I don’t have confirmation of that.

The Meaning of the High Fail Rate

A lot of people go to the Academy and a relatively small number make it onto the promotion list each time. We can’t be sure why so many fail, but there are several possible explanations that come to mind.

1) People take the Academy to learn more about the promotion process and they might take the written test just to see how they do. Sort of a trial to gauge how difficult it is and how much they’ll have to study when they take it seriously. People mentioned this was the case for some.
2) The quality of the firefighters is low. Only a high school degree or a GED is required. They may not be particularly good at studying and/or test taking.
3) The training at the academy is inadequate to prepare most people to pass the exams.
4) The tests are necessarily rigorous to make sure only the most technically competent are promoted from firefighter to engineer.
5) The tests are unnecessarily difficult or harshly graded.
6) Fewer engineers means more overtime for those who are engineers.

I suspect there is an aspect of all six reasons (and perhaps some I haven’t thought of.)

Let me explain number six a little more. Because of the 24 hour shifts several days a week, AFD line employees have a lot of time away from work. Many use this time to run other businesses. But for many others this is an opportunity to work overtime. Not only does the overtime give them time-and-a-half pay, it also raises the annual pay that their retirement benefits will be based on. Some have argued that by making the testing for engineer difficult, the pool of engineers is kept small, and those who become engineers can work more overtime. At trial, the MOA’s expert witness, hired to calculate possible compensation for Graham in the event he won, testified that Graham had very little overtime compared to many who had 1000-2000 hours of overtime. Firefighters work three days of 24 hour shifts per week, or 72 hours. At that rate, 1500 hours of overtime would be the equivalent of about 21 weeks of regular shifts. That’s a lot of overtime and a lot of pay at time and a half. One has to ask whether hiring more employees would make overtime less necessary and save the MOA money.

I realize this bit on overtime goes beyond just an overview of the exams, but I’m not sure where else it will fit in, and it gives context to questions that come up about the exams overall. I’m raising this issue because it came up. I don't know how significant it is. I haven’t studied it, but it seems like something that should be followed up on.

The Devil is in the Details

The next posts on Graham v MOA will focus on the details of the oral exams.

Crimes of violence tend to be very tangible, very graphic. We can imagine someone with a gun threatening a mugging victim or a bank teller. We can imagine a stabbing. We can see it vividly. These are crimes that involve people on people violations.

But administrative crimes are much harder to imagine. They are structural crimes that are less visible and easier to hide. They are tied up in details, rows of numbers, pages of text. Easier to conceal.

How a scoring sheet for a test is designed can make the difference between whether someone passes the exam or not. This low visibility is why we hear stories of people who have embezzled money for years and years before they were caught. A petty thief can face much stiffer legal obstacles than a white collar criminal who has bilked people of millions. The latter crime seems less problematic because it's so abstract and harder to visualize.

The details can be tedious. One reason I’m slow in getting these posts out is that I’m trying to make them as easy to read and understand as possible. 'Interesting' is a goal, but that's more elusive. I keep revising and revising and eventually I say, ok, enough, just post it already. Even though I’m sure it’s still hard for the average person with a busy schedule to read, let alone digest.
