Monday, February 19, 2018

Graham v MOA #9: Exams 2 - Can You Explain These Terms: Merit Principles, Validity, And Reliability?

The Municipality of Anchorage (MOA) Charter [the city's constitution] at Section 5.06(c) mandates the Anchorage Assembly to adopt
“Personnel policy and rules preserving the merit principle of employment.”   AMC 3.30.041 and 3.30.044 explain examination types, content, and procedures consistent with these merit principles.  
As defined in the Anchorage Municipal Code Personnel Policies and Rules,
“Examination means objective evaluation of skills, experience, education and other characteristics demonstrating the ability of a person to perform the duties required of a class or position.” (AMC 3.30.005)
[OK, before I lose most of my readers, let me just say, this is important stuff to know to understand why the next posts will look so closely at the engineer test that Jeff Graham did not pass.  But it's also important to understand one of the fundamental principles underlying government in the United States (and other nations.)  And I'd add that the concepts behind merit principles are applied in most large private organizations to some extent, though they may have different names.

Jeff Graham's attorney made me boil this down to the most basic points to improve the likelihood I wouldn't put the jury to sleep.  So bear with me and keep reading.

And, you can see an annotated index of all the posts at the Graham v MOA tab above or just link here.]  

Basic Parts of Government In The United States

Governments can be broken down into several parts.
  • The elected politicians who pass the laws and set the broad policy directions (legislature)
  • The elected executive who carries out the laws.
  • The administration is led by the elected executive - the president at the federal level, the governor at the state level, and the mayor at the city level.
  • Civil Service refers to the career government workers who actually carry out the policies.  There are also appointed officials at the highest levels who are exempt from some or all of the civil service rules.

Merit principles are the guidelines for how the career civil servants are governed.  

So What Are Merit Principles?

Probably the most basic, as related to this case, are:
  • Employees are chosen solely based on their skills, knowledge, and abilities (SKAs) that are directly related to their performance of the job. 
  • The purpose of this is to make government as effective and efficient as possible by hiring people based on their job related qualities and nothing else.  
  • That also means other factors - political affiliation, race, color, nationality, marital status, age, and disability - should not be considered in hiring or promotion.  It also means that arbitrary actions and personal favoritism should not be involved.
  • Selection and promotion criteria should be as objective as possible.   

So Steve, what you're saying sounds obvious.  What else could there be?

Before the merit system was the Spoils System.  Before merit principles were imposed on government organizations, jobs (the spoils) were given to the victors (winning politicians and their supporters).  The intent of the Merit System is to hire the most qualified candidates.

In 1881, President Garfield was assassinated by a disgruntled job seeker, which spurred Congress to set up the first version of the federal civil service system - The Pendleton Act.

Only a small number of federal positions were covered by this new civil service act, but over the years more and more positions were covered and the procedures improved with improvements in the technology of testing.  The merit system, like any system can be abused, but it's far better than the spoils system.  Objective testing is a big part of applying merit principles.

What does 'objective criteria' mean? 

Objectivity has a couple of common and overlapping meanings:
  • Grounded on facts.  Grounding your understanding or belief on something concrete, tangible.  Something measurable that different people could 'see' and agree on.
  • Unbiased.  A second, implied meaning from the first, is that you make decisions neutrally, as free as you can be from bias, preconceived ideas.  That’s not easy for most people to do, but there are ways to do it better. 

What Ways Can Make  Tests More Objective And Free Of Bias?

I think of objectivity as being on one end of a continuum and subjectivity being on the other end.  No decision is completely objective or subjective, nor should it be.  But generally, the more towards the objective side, the harder it is to introduce personal biases.* 

objective ...............................................................................................subjective

First Let's Define "Test"

In selection and promotion, we have tests. A test is defined as anything used to weed out candidates, or rank candidates from poor to good.  So even an application form can be a test if it would lead to someone being cut out of the candidate pool.  Say candidates are required to have a college degree and someone doesn’t list one on an application.  They would be eliminated already.  

Again,  how do you make tests more objective?

There are two key terms we need to know:  validity and reliability.

What’s Validity?

Validity means that if a person scores higher on a test, we can expect that person to perform better on the specific job.  
Or saying it another way, the test has to truly test for what is necessary for the job.  So, if candidates without a college degree can do the job as well as candidates with a degree, then using college degree to screen out candidates is NOT valid.  
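For readers who like to see an idea in numbers, here is a minimal sketch of one common way to put a figure on validity: correlate test scores with later job performance. The data, the names `exam_scores` and `job_ratings`, and the function `validity_coefficient` are all made up for illustration. Nothing here comes from the Graham case or from any actual MOA exam.

```python
import math

# Hypothetical data: exam scores for six hires and their later
# supervisor ratings on a 1-5 scale. Invented for illustration only.
exam_scores = [62, 70, 75, 81, 88, 94]
job_ratings = [2.9, 3.1, 3.4, 3.3, 4.0, 4.4]


def validity_coefficient(scores, performance):
    """Pearson correlation between test scores and job performance.

    Close to +1: higher scorers tend to perform better (evidence of validity).
    Close to 0: the score tells you little about performance.
    """
    n = len(scores)
    mean_s = sum(scores) / n
    mean_p = sum(performance) / n
    cov = sum((s - mean_s) * (p - mean_p) for s, p in zip(scores, performance))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_p = sum((p - mean_p) ** 2 for p in performance)
    return cov / math.sqrt(var_s * var_p)


r = validity_coefficient(exam_scores, job_ratings)
print(f"validity coefficient: {r:.2f}")
```

A real validation study needs far more than six people and a defensible measure of job performance, which is why experts are usually brought in for high stakes tests.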

And what is reliability?

Reliability means that if  a person takes the same test at different times or different places, or with different graders, the person should get a very similar result.  Each test situation needs to have the same conditions, whether you take the test on Monday or on Wednesday, in LA or Anchorage, with Mr. X or Miss Y administering and/or grading the test.  
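A rough way to picture a reliability check, as a sketch only: take the scores one candidate received for the same performance under different conditions (different days, places, or graders) and look at how far apart they are. The candidate names, the scores, and the 5-point `tolerance` are assumptions I made up for the example, not a standard from any testing manual.

```python
# Hypothetical scores: the same performance by each candidate, scored
# under three different conditions (day, location, or grader).
scores_by_candidate = {
    "candidate_a": [82, 84, 83],
    "candidate_b": [70, 91, 64],
}


def score_spread(scores):
    """Gap between the highest and lowest score for one candidate."""
    return max(scores) - min(scores)


def inconsistent_results(scores_by_candidate, tolerance=5):
    """Names of candidates whose scores differ by more than `tolerance`.

    `tolerance` is an illustrative cutoff, not an official threshold.
    """
    return [
        name
        for name, scores in scores_by_candidate.items()
        if score_spread(scores) > tolerance
    ]


print(inconsistent_results(scores_by_candidate))
```

A large spread for the same performance suggests the conditions or the graders, not the candidate, are driving the score.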

How Validity and Reliability Relate To Each Other

To be valid, the selection or promotion test must be a good predictor of success on the job. People who score high on the exam should perform the job better than those who score low.  And people who score low should perform worse on the job than people who score high.

BUT, even if the test is intrinsically valid, the way it is administered could invalidate it.  If the test is not also reliable (testing and grading is consistent enough that different test takers will get a very similar score regardless of when or where they take the test and regardless of who scores the test) the test will no longer be valid.  This is because the scores will no longer be good predictors of who will do well on the job.
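The point that an unreliable test stops being a good predictor can be illustrated with a small simulation. This is a sketch under invented assumptions: each simulated candidate has a true `ability`, job performance tracks that ability with some noise, and the same test is scored either consistently (a little noise) or inconsistently (a lot of noise). The noise levels and the sample size are arbitrary numbers chosen to make the effect visible.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 2000                # number of simulated candidates (arbitrary)

ability = [rng.gauss(0, 1) for _ in range(n)]
job_performance = [a + rng.gauss(0, 0.5) for a in ability]

# Same underlying test, two ways of administering and grading it.
consistent_scores = [a + rng.gauss(0, 0.2) for a in ability]    # reliable
inconsistent_scores = [a + rng.gauss(0, 2.0) for a in ability]  # unreliable

r_consistent = pearson(consistent_scores, job_performance)
r_inconsistent = pearson(inconsistent_scores, job_performance)

print(f"consistent grading:   r = {r_consistent:.2f}")
print(f"inconsistent grading: r = {r_inconsistent:.2f}")
```

The test content is identical in both cases. Only the consistency of administration and grading changes, and that alone is enough to weaken the link between score and job performance.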

How do you go about testing for validity and reliability?
This can get complicated, especially for factors that are not easy to measure.  I didn't go into this during the trial.  I wanted to point out some pages in a national Fire Safety Instructor Training Manual used by the Municipality of Anchorage, but I was not allowed to mention it.  It talks about different levels of validity and how to test for them.  It also says that for 'high stakes' tests, like promotion tests, experts should be hired to validate the test.  The jury didn't get to hear about this. But it's relevant because, as I wrote in an earlier post, the people in charge of testing, and specifically in charge of the engineer exam, only had Level I certification, which allows them to administer training and testing designed by someone with Level II certification.  It's at Level II that validity and reliability are covered.  

There really wasn't a need to get detailed in the trial, because the oral exam was so egregiously invalid and unreliable that you could just look at it and see the problems.  And we'll do that in the next posts.

That should be enough but for people who want to know more about this, I'll give a bit more below.

Extra Credit

*"the harder it is to introduce bias"  There are always ways that bias can be introduced, from unconscious bias to intentionally thwarting the system.   When civil service was introduced in the United States, there was a 'common understanding' that women were not qualified for most jobs.  That was a form of bias.  Blacks were also assumed to be unqualified for most jobs.  Over the years many of these sorts of cultural barriers have been taken down.  But people have found other ways to surreptitiously put up barriers.  

Merit Principles

If you want to know more about merit principles I'd refer you to the Merit Systems Protection Board, which was set up as part of the Civil Service Reform Act of 1978.  

A little more about reliability problems (because these are important to understand about the engineer promotion exam)

In the main part of this post I wrote that all the important conditions of the test (those that could affect the score) need to be the same no matter where or when or with whom a candidate takes the test.  Here are some more details:
  • Location - If one location is less comfortable - temperature, noise, furniture, lighting, whatever - it could skew the scores of test takers there.
  • Time -  could be a problem in different ways.  
    • All candidates must have the same amount of time to take the test.  
  • Instructions - all instructions have to be identical
  • Security of the test questions - if some applicants know the questions in advance and others do not, the test is not reliable.

The scoring, too, has to be consistent from grader to grader for each applicant.

And there are numerous ways that scoring a test can go wrong.
  • Grader bias  - conscious and unconscious.   Raters who know the candidates may rate them differently than people who don’t know them at all. 
    • The Halo Effect means if you have a positive view of the candidate, you’re likely to give him or her more slack.  You think, "I know they know this."  
    • The Horn or Devil Effect is the opposite - If you already have a negative opinion about a candidate, you consciously or unconsciously give that candidate less credit.  These are well documented biases.
    • Testing order bias affects graders and candidates.  
      • After three poor candidates, a mediocre candidate may look good to graders.  
  • Grading Standards - Is the grading scale clear and of a kind that the graders are familiar with?
    • Are the expected answers and how to score them clear to the graders?
    • Do the graders have enough time to calculate the scores consistently?
  • Grader Training -
    •  If graders aren't well trained, it could take them a while to figure out how to use the scoring techniques, so they score differently at the end than at the beginning. 

How Do You Overcome the Biases In More Subjective Tests Like Essays, Interviews, and Oral Exams?

Despite the popularity of job interviews, experts agree that they are among the most biased and result in the least accurate predictions of candidate job performance.  Or see this link.

You have to construct standardized, objective rubrics and grading scales - this is critical, particularly for essay and oral exams.

On November 9, 2016, when the electoral college vote totals were tallied, everyone saw the same facts, the same results.  But half the country thought the numbers were good and half thought they were bad.

When evaluating the facts of a job or promotion candidate, the organization has to agree, beforehand, what ‘good’ facts look like and what ‘bad’ facts look like. Good ones are valid ones - they are accurate predictors of who is more likely to be successful in the position.   Good and bad are determined by the test maker, not by the graders.  The graders merely test whether the performance matches the pre-determined standard of a good performance.

What’s a rubric?

It’s where you describe in as much detail as possible what a good answer looks like.  If you’re looking at content, you identify the key ideas in the answer, and possibly how many points a candidate should get if they mention each of those ideas.  It has to be as objective as possible. The Fire Safety Instructor Training Manual has some examples, but even those aren't as strong as they could be.

Good rubrics take a lot of thought - but it's thought that helps you clarify and communicate what a good answer means so that different graders give the same answer the same score.

Here are some examples:
UC Berkeley Graduate Student Instructors Training
Society For Human Resource Management - This example doesn't explicitly tell graders what the scores (1,2, 3, 4, 5) look like, as the previous one does.
BARS - Behaviorally Anchored Rating Scales - This is an article on using BARS to grade Structured Interviews.  Look particularly at Appendices A & B.
How Olympic Ice Skating is Scored - I couldn't find an actual scoring sheet, but this gives an overall explanation of the process.
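To make the idea of a content rubric concrete, here is a minimal sketch of the "key ideas and points" approach described above. The ideas and point values in `RUBRIC` are placeholders I invented. They are not taken from the Fire Safety Instructor Training Manual or from any real exam. The point is only that the test maker fixes the standard in advance, and graders record which ideas they heard rather than deciding what counts as good.

```python
# Hypothetical rubric for one question: key ideas and the points for each.
# Set by the test maker before anyone is tested.
RUBRIC = {
    "identifies_the_hazard": 3,
    "states_the_correct_procedure": 4,
    "explains_why_the_procedure_matters": 2,
    "mentions_the_follow_up_check": 1,
}


def score_answer(ideas_mentioned, rubric=RUBRIC):
    """Add up the points for each rubric idea the candidate mentioned.

    Graders only record which ideas they heard. Anything not in the
    rubric is rejected, so a grader cannot invent new criteria.
    """
    unknown = set(ideas_mentioned) - set(rubric)
    if unknown:
        raise ValueError(f"not in rubric: {sorted(unknown)}")
    # set() so an idea mentioned twice is only counted once
    return sum(rubric[idea] for idea in set(ideas_mentioned))


# Two graders who heard the same ideas get the same score,
# no matter what order they wrote them down in.
grader_x = ["identifies_the_hazard", "states_the_correct_procedure"]
grader_y = ["states_the_correct_procedure", "identifies_the_hazard"]
print(score_answer(grader_x), score_answer(grader_y))
```

A rubric this mechanical can miss a good answer nobody anticipated, which is one reason graders still need to compare notes afterwards.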

My experience is that good rubrics force graders to ground their scores on something concrete, but they can also miss interesting and unexpected things.  It's useful for graders to score each candidate independently, and then discuss why they gave the scores they did - particularly those whose scores vary from most of the scores.  Individual graders may know more about the topic, which gives their scores more value, or they may not have paid close attention.   Ultimately, it comes down to an individual making a judgment.  Otherwise we could just let machines grade.  But the more precise the scoring rubric, the easier it is to detect bias in the graders.
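As a sketch of how a panel might decide which scores to talk about, the snippet below compares each grader's independent score for one candidate to the panel median and lists the graders who are far from it. The grader names, the scores, and the 1-point `tolerance` are invented for illustration. Being on the list only means "discuss this one", not "this grader is wrong".

```python
from statistics import median

# Hypothetical independent scores (1-5 scale) from four graders
# for the same candidate on the same question.
panel_scores = {
    "grader_1": 4.0,
    "grader_2": 4.0,
    "grader_3": 1.5,
    "grader_4": 3.5,
}


def graders_to_discuss(scores, tolerance=1.0):
    """Graders whose score is more than `tolerance` from the panel median.

    `tolerance` is an illustrative cutoff chosen for this example.
    """
    middle = median(scores.values())
    return sorted(
        name for name, score in scores.items() if abs(score - middle) > tolerance
    )


print(graders_to_discuss(panel_scores))
```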


Q:  What if a candidate thinks she got the answer right on a question, but it was scored wrong?

Everything in the test has to be documented.  Candidates should be able to see what questions they missed and how they were scored.  If the test key had an error, they should be able to challenge it.

Q:  Are you saying everything needs to be documented?

If there is going to be any accountability each candidate’s test and each grader’s score sheets must be maintained so that if there are questions about whether a test was graded correctly and consistently from candidate to candidate, it can be checked.

In the case of an oral exam or interview, at least an audio (if not video) record should be kept so that reviewers can see what was actually said at the time by the candidate and the graders.

Q:  Have you strayed a bit from the Merit Principles?

Not at all. This all goes back to the key Merit Principle - selecting and promoting the most qualified candidates for the job.  There won’t be 100% accuracy. But in general, if the test is valid,  a high score will correlate with a high job performance.  But unless the test is also reliable, it won’t be valid. The more reliable the test, the more consistent the scores will be under different conditions and graders.  The best way to make tests more reliable is to make them as objective as possible.
