Outcome measures play a crucial role in everyday clinical practice. No matter your setting or geographic location, we’re betting you use outcome measures multiple times a day.
Why are these measures so fundamental to therapy practice? Because they give us valuable insight into a client’s current condition, help us demonstrate the effectiveness of a particular treatment, aid in determining eligibility for services, justify our reimbursement, inform discharge decisions, and assist in goal writing and intervention planning—just to name a few!
But, not all outcome measures are created equal—and before administering any measure that might inform our clinical decisions, it’s important to understand the quality and accuracy of its results. That’s where psychometrics come into play.
Psychometrics verify that an outcome measure accurately assesses the target factors, thus enabling clinicians to trust the results.

This blog post answers the following questions in more detail:
- What are psychometrics?
- Why are psychometrics important?
- What are the barriers to using psychometrically sound outcome measures?
- How can therapists apply their knowledge of psychometrics when reviewing research to support their practice?
Full disclosure: Psychometrics can be pretty confusing. In fact, this is one of the most technically complex topics we’ve tackled here at OT Potential. So, if you find yourself re-reading certain sections of this guide several times, you’re not alone! We’re going to do our best to walk you through the ins and outs of psychometrics clearly and succinctly—but if you have any lingering questions, please pop them into the comment section at the end of this post, and we will get you an answer!
What are Psychometrics?
Before we get too far ahead of ourselves, let’s clarify which psychometrics this post discusses. There are two main categories of psychometrics:
- The original type focuses on measuring personality traits, and
- The modernized adaptation is used to ensure credible outcome measures.
This post focuses on the second category (i.e., the modern version of psychometrics).
The original psychometrics were developed in the 19th century to mathematically measure personal traits and ensure fair results on instruments such as intellectual and personality tests. In the 1920s, scientists adapted psychometrics to standardize outcome measures, establishing the modern use of psychometrics outside of the psychology space.
To reiterate, outcome measure psychometrics are a calculated evaluation of an assessment’s accuracy and trustworthiness. Psychometric data are obtained through a careful analysis of assessment components and outcomes. From this analysis, researchers can:
- Determine whether an assessment successfully measures the intended client factor, and
- Identify the clinical conditions necessary to ensure the assessment’s accuracy.
For example, psychometric studies can verify a measure’s accuracy when used with different age groups, injuries/diseases, and environmental populations. This psychometric information reflects the academic rigor of assessments and how they can inform treatment eligibility and treatment outcomes.
Which Psychometric Categories are Important for Clinicians?
There are 4 main types of psychometric data commonly seen in research discussing clinical outcome measures:
- Minimal clinically important difference (MCID)
- Minimal detectable change (MDC)
- Validity
- Reliability
Let’s take a look at each one.
Minimal Clinically Important Difference (MCID)
The minimal clinically important difference (MCID) is the amount of improvement or decline in an assessment score that is necessary for clients to notice a change or for the change to be clinically significant.
This psychometric is useful for goal writing and supporting reimbursement, as scores that meet the MCID threshold indicate clinical changes that are significant enough to cause a noticeable difference in the client’s participation. They also offer quantifiable support for the client’s subjective report. From a payer and reimbursement perspective, these scores help justify an individual’s need for therapeutic services (e.g., when the most recent score indicates a functional decline based on previous assessment results).
MCID is one of the best psychometric measures for clinicians to use—especially in their documentation—because it demonstrates improvements supported by research.
It implies that the outcome measure is psychometrically sound, as other psychometric data must be collected before the MCID can be calculated. Unfortunately, that means MCID isn’t always readily available.
Minimal Detectable Change (MDC)
The minimal detectable change (MDC) is the difference in assessment score required to demonstrate an improvement or decline that cannot be attributed to the associated error of measurement. This psychometric category offers an evidence-based way to demonstrate that a smaller improvement is a true change. This can help support a clinician’s documentation of progress towards goals, even if the patient does not experience a noticeable difference.
MDC can be valuable when used in conjunction with the MCID. It can also be substituted when the MCID is not available. Similar to the MCID, the MDC is one of the last steps in assessing an outcome measure’s psychometric properties—meaning it’s often unavailable or difficult to obtain.
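To make the MDC/MCID distinction concrete, here is a minimal Python sketch of how a change score might be sorted into the situations described above. The function name `interpret_change` and its three labels are hypothetical, chosen only for illustration. The thresholds in the example are the 6-Minute Walk Test values for chronic stroke cited later in this post (MDC 36.6 meters, MCID 50 meters), and the client scores are made up.

```python
def interpret_change(baseline, follow_up, mdc, mcid=None):
    """Sort a change score into one of three hypothetical categories.

    Both improvement and decline count, so the size of the change
    (its absolute value) is compared with the thresholds.
    """
    magnitude = abs(follow_up - baseline)
    if magnitude < mdc:
        # Smaller than the MDC: could be measurement error alone.
        return "within measurement error"
    if mcid is not None and magnitude >= mcid:
        # Large enough to be noticeable or clinically significant.
        return "clinically important change"
    # Exceeds the MDC, but the MCID is unavailable or not yet reached.
    return "true change"


# 6-Minute Walk Test distances in meters (client scores are made up).
print(interpret_change(300, 320, mdc=36.6, mcid=50))  # within measurement error
print(interpret_change(300, 340, mdc=36.6, mcid=50))  # true change
print(interpret_change(300, 355, mdc=36.6, mcid=50))  # clinically important change
```

Passing `mcid=None` mirrors the situation where the MCID has not been published and the MDC is used on its own.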
Validity
Validity is the extent to which an assessment measures what it is intended to measure. This is the most basic psychometric; it must be established before the MCID and MDC can be determined. Take the Geriatric Depression Scale, for example. This measure’s validity ensures that the information gathered measures a person’s depressive symptoms rather than their anxiety symptoms.
There are several subcategories of validity, including:
- Construct validity, or the relationship between the assessment and the area it assesses, which is important for understanding how well an assessment identifies impairments in a client.
- Criterion validity, which is the comparison of an outcome measure to an external, established system for determining real-world outcomes. Criterion validity has three subtypes: concurrent validity, predictive validity, and longitudinal validity.
  - Concurrent validity compares the accuracy of an assessment to another already-established outcome measure. This is useful when there is a known “gold standard” assessment with excellent psychometrics to which newer (and possibly shorter or simpler) assessments can be compared to increase productivity and effectiveness.
  - Predictive validity is the extent to which an assessment can predict an outcome. For example, when assessing an individual’s memory, the predictive validity of a measure can help you understand how well its results translate to actually remembering things in daily life.
  - Longitudinal validity, also known as responsiveness, demonstrates that changes in the measure will translate to changes in another, similar measure.
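Criterion validity is usually reported as a correlation between the measure being studied and an established one. As a rough sketch of what that calculation involves, the Python below computes Spearman’s rho (one of the statistics listed for criterion validity later in this post) by ranking both sets of scores and then correlating the ranks. The function names and client scores are hypothetical, and a real study would normally rely on a statistics package rather than hand-written code.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # positions i..j are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the ranked scores."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for six clients on a new short test and a "gold standard".
new_test = [12, 15, 9, 20, 17, 11]
gold_standard = [30, 34, 25, 41, 33, 29]
print(round(spearman_rho(new_test, gold_standard), 2))  # 0.94
```

A negative result would simply mean the two measures move in opposite directions (for example, when a higher score is better on one and worse on the other).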
Reliability
Reliability reflects the consistency of an assessment’s results, which is a large part of what makes it trustworthy. This is another basic psychometric; it is studied after validity and before MCID and MDC. Reliability ensures the assessment can provide dependable results in various situations.
There are several subcategories of reliability, including:
- Internal consistency, which is the extent to which the assessment components measure the target characteristic or construct. This ensures that the measure’s results are related to the intended characteristic rather than something else.
- Test-retest reliability, which refers to the continued trustworthiness of an assessment after it has been used multiple times with the same client. This helps validate that changes over time are due to clinical outcomes rather than an individual “learning the assessment.” This is helpful if a clinician plans to use an assessment at the initial evaluation and for reevaluation.
- Inter-rater reliability, which is the consistent accuracy of a measurement when administered by different clinicians. This can be crucial information if multiple clinicians or team members will conduct evaluations for a client. If the inter-rater reliability is low, the results may vary greatly across encounters.
- Intra-rater reliability, which is the accuracy of a measurement when administered by the same clinician over repeated trials. This ensures that the assessor will rate the same level of ability with the same score day-to-day, confirming that results are not skewed by ambiguous scoring guidelines.
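Internal consistency is commonly summarized with Cronbach’s alpha, which compares the spread of each item’s scores with the spread of clients’ total scores. The sketch below shows the standard formula in Python; the function name and the five clients’ item scores are hypothetical, and it assumes every client has a score for every item.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of rows (one row per client, one column per item)."""
    k = len(item_scores[0])                      # number of items
    columns = list(zip(*item_scores))            # regroup the scores by item
    item_variance_sum = sum(variance(col) for col in columns)
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Made-up scores for five clients on a four-item measure.
scores = [
    [2, 3, 3, 2],
    [4, 4, 5, 4],
    [1, 2, 1, 1],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
]
print(round(cronbach_alpha(scores), 2))  # 0.97
```

When the items rise and fall together across clients, as they do in this made-up data, alpha lands close to 1.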
Why are Psychometrics Important?
When practitioners understand the psychometric properties of the outcome measures they administer, they can confidently choose the best evidence-based assessment for any given clinical situation. This, in turn, may increase assessment accuracy, treatment efficiency, and evaluation effectiveness.
Using psychometrics is an important part of being an evidence-based clinician. Evidence-based practice (EBP) is a tenet in all clinical professions and is considered essential for effective service delivery. To learn more about evidence-based practice, check out our guide to evidence-based practice in OT.
Standardized assessments are key to EBP, as assessments with good psychometric properties are supported by research demonstrating that they can successfully and consistently measure target factors—meaning they offer strong reinforcement of a practitioner’s efficacy.
Furthermore, evidence-based assessments add objective support for a therapist’s knowledge and clinical reasoning. This is especially relevant for assessments that characterize a client’s current status, as strong psychometrics add further validation to assessment data that corroborates the clinician’s judgment—which can be particularly helpful for new therapists and students who have not yet fully developed their diagnostic reasoning abilities.
The use of evidence-based assessments is also important for obtaining reimbursement from insurance companies, as payers are more likely to trust evaluations that incorporate psychometrically supported outcome measures (as opposed to less-credible assessments or clinical judgment alone).
Clearly, psychometrics are a vital component of any evidence-based measure. And yet, many clinicians underutilize psychometrics as a tool for selecting the best outcome measure in a given clinical scenario.
What Barriers Prevent Clinicians from Using Psychometrics?
While clinicians generally value evidence-based, client-centered practice and evaluation, they often struggle to bring it into their daily practice due to time and resource constraints.
Instead, clinicians often choose assessments based on:
- Availability (the most commonly cited reason for using a particular assessment tool)
- Time to administer
- Ease of administration
- Familiarity
In fact, we found only one article that cited level of evidence as one of the top 5 reasons for choosing an assessment.
This approach to assessment selection is prevalent worldwide and impacts the quality of treatment as well as the reputation of clinical practice itself—but the blame does not lie solely with the therapists choosing the assessments.
Here are some of the most commonly cited barriers to evidence-based assessment selection:
Limited Access
In terms of assessment information, psychometric data are the least accessible to clinicians. This is a common problem, given that few therapists have access to academic journals and other resources (often due to the associated financial cost). This implies that many do not use evidence when choosing assessment tools, leading clinicians to knowingly or unknowingly use less-than-optimal measures.
Limited Education
Despite the myriad of research articles analyzing psychometrics, it is difficult for clinicians to identify an appropriate assessment tool based on this information alone.
In order to effectively use the research available to them, clinicians must be able to understand and interpret these studies. However, clinicians generally lack the knowledge necessary to understand and interpret psychometrics.
So, it’s not surprising that therapists tend to rely on familiar assessments (possibly ones with poor psychometric data). This prevents them from using assessments with superior trustworthiness. To improve overall clinician effectiveness, it’s crucial to provide them with proper education on psychometric properties and how they can use research to identify appropriate, evidence-based assessments.
To help OTs find the best assessments, we created a comprehensive assessment library. Check it out here!
Guide to Interpreting Psychometrics and Understanding Research Results
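Psychometrics are reported in the results section of research studies, with the discussion section often offering additional explanation. In the results section, psychometric data are typically presented as decimals ranging from 0.00 to 1.00, with numbers closer to “0” representing poorer scores and numbers closer to “1” signifying better scores. Correlation coefficients can also be negative (down to -1.00) when two measures move in opposite directions; in that case, the size of the number, not its sign, indicates the strength of the relationship.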
These numbers, or coefficients, have several names based on the specific statistical test used to calculate them, but they all follow the same general rule that the closer they are to “1.00,” the higher their psychometric value.
The following table outlines each of the psychometric properties discussed above, along with their statistical equation, the symbol used to denote them in research, and the most commonly accepted rating scale for interpreting their results.
| Psychometric Property | Subtype | Equation Utilized | Symbol of Equation | Rating Scale of Psychometric Property |
|---|---|---|---|---|
| Validity | Construct validity | Factor loading/analysis | 𝓧 | Excellent = 0.70 or higher; Good = 0.55-0.69; Acceptable = 0.40-0.45; Poor = below 0.40 |
| Validity | Criterion validity: concurrent | Spearman’s rho (yes/no) | 𝝆 | Excellent = 0.80-1.00; Good = 0.79-0.60; Adequate = 0.59-0.40; Poor = 0.39-0.20; Very Poor = less than 0.19 |
| Validity | Criterion validity: concurrent | Spearman’s Correlation Coefficient | r | Excellent = 0.80-1.00; Good = 0.79-0.60; Adequate = 0.59-0.40; Poor = 0.39-0.20; Very Poor = less than 0.19 |
| Validity | Criterion validity: predictive | Spearman’s Correlation Coefficient | r | Excellent = 0.80-1.00; Good = 0.79-0.60; Adequate = 0.59-0.40; Poor = 0.39-0.20; Very Poor = less than 0.19 |
| Validity | Criterion validity: longitudinal | Spearman’s Correlation Coefficient | r | Excellent = 0.70-1.00; Good = 0.69-0.60; Adequate = 0.59-0.50; Poor = 0.39-0.20; Very Poor = less than 0.19 |
| Reliability | Internal consistency | Cronbach’s alpha | 𝛂 | Excellent = 0.90-1.00; Good = 0.80-0.90; Acceptable = 0.70-0.80; Questionable = 0.60-0.70; Poor = 0.50-0.60; Unacceptable = below 0.50 |
| Reliability | Test-retest reliability | Intraclass Correlation Coefficient | ICC(m, k), where m = model and k = unit (both are represented by a number) | Excellent = 0.90-1.00; Good = 0.75-0.90; Adequate = 0.50-0.70; Poor = less than 0.50 |
| Reliability | Inter-rater reliability | Intraclass Correlation Coefficient | ICC(m, k), where m = model and k = unit (both are represented by a number) | Excellent = 0.90-1.00; Good = 0.75-0.90; Adequate = 0.50-0.70; Poor = less than 0.50 |
| Reliability | Inter-rater reliability | Cohen’s Kappa (used for yes/no outcome measures) | 𝓚 | Excellent = 0.81-1.00; Good = 0.61-0.80; Adequate = 0.41-0.60; Fair/Poor = 0.21-0.40 |
| Reliability | Intra-rater reliability | Intraclass Correlation Coefficient | ICC(m, k), where m = model and k = unit (both are represented by a number) | Excellent = 0.90-1.00; Good = 0.75-0.90; Adequate = 0.50-0.70; Poor = less than 0.50 |
As shown in the chart, the coefficient falls within a rating scale that uses categories like “Excellent,” “Good,” “Adequate/Acceptable/Moderate,” and/or “Poor.”
This rating reflects the strength of a psychometric feature:
- “Excellent” is reserved for high-quality outcome measures and is necessary for a tool to be considered “diagnostic.”
- “Good” indicates that the assessment’s psychometrics are satisfactory for use in research or clinical settings, but they are not considered a “gold standard.”
- “Acceptable” is considered useful for assessments in the early stages of research or those covering a brand-new topic. These assessments should be used with caution, as they are not psychometrically supported for trustworthiness and accuracy.
- “Poor” denotes assessments that do not meet psychometric standards in any capacity. Clinicians should avoid administering these outcome measures unless they have strong clinical reasoning for doing so.
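If you find yourself looking up these cutoffs often, the lookup can be sketched in a few lines of Python. The snippet below encodes two of the scales from the table above (the criterion validity scale and the Cronbach’s alpha scale). The names `rate`, `CRITERION_VALIDITY_SCALE`, and `CRONBACH_ALPHA_SCALE` are hypothetical. It also makes two assumptions the table does not spell out: a value that sits exactly on a boundary goes to the higher category, and the sign of a correlation is ignored.

```python
# Lower cutoffs for each label, highest first (taken from the table above).
CRITERION_VALIDITY_SCALE = [
    (0.80, "Excellent"),
    (0.60, "Good"),
    (0.40, "Adequate"),
    (0.20, "Poor"),
]
CRONBACH_ALPHA_SCALE = [
    (0.90, "Excellent"),
    (0.80, "Good"),
    (0.70, "Acceptable"),
    (0.60, "Questionable"),
    (0.50, "Poor"),
]


def rate(coefficient, scale, floor_label):
    """Return the first label whose cutoff the coefficient's magnitude reaches."""
    magnitude = abs(coefficient)  # assumption: the sign shows direction, not strength
    for cutoff, label in scale:
        if magnitude >= cutoff:
            return label
    return floor_label  # anything below the lowest cutoff


print(rate(0.67, CRITERION_VALIDITY_SCALE, "Very Poor"))   # Good
print(rate(-0.85, CRITERION_VALIDITY_SCALE, "Very Poor"))  # Excellent
print(rate(0.83, CRONBACH_ALPHA_SCALE, "Unacceptable"))    # Good
```

Other scales from the table can be added the same way, as long as the matching cutoffs are used for each statistic.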
Example and Explanation
Here is an example table explaining what each psychometric property would look like in a research article as well as its clinical meaning in the context of a specific outcome measure.
Barthel Index (BI) in Older Adults

| Psychometric Property | As Presented in Research | Equation Utilized | Rating Scale of Psychometric Property | Clinical Meaning |
|---|---|---|---|---|
| Validity: construct validity | Acceptable to Excellent correlation with TUG at baseline, 3 months, and 12 months post-op; r = -0.490, -0.743, -0.843 (Unnanuntana et al., 2018) | Factor loading/analysis or Spearman’s Correlation Coefficient | Baseline: Acceptable; 3 months and 12 months post-op: Excellent | The construct validity, or the relation of this measure to the items it is attempting to measure, when compared to the TUG assessment, demonstrates that the Barthel Index (BI) can provide an “acceptable” demonstration of functional mobility/walking ability when the population is first evaluated post-operation, but it becomes a better (Excellent) indicator as they progress through the healing process, with its highest correlation at 12 months post-op. Clinically, this means that the BI is not the best measure for demonstrating functional mobility/walking ability immediately post-operation; consider using a different outcome measure alongside the BI. However, it can provide an accurate picture at 3 months and 12 months post-op, which allows for the use of the BI independently to reduce the number of assessments needed to gain a complete picture of the client’s abilities. |
| Validity: criterion validity (concurrent) | Not available for this measure | Spearman’s rho (yes/no) or Spearman’s Correlation Coefficient | None | This means there is no study looking at the relationship between the accuracy of the assessment and another, already-established “gold standard” outcome measure for the general older adult population. |
| Validity: criterion validity (predictive) | Correlation between BI score and mortality; r = .67 (Gonzalez et al., 2018) | Spearman’s Correlation Coefficient | Good | The BI score can be a clinically good indicator of an individual’s mortality, but it is not closely related enough to be considered an excellent or diagnostic-level criterion. Clinically, this provides an evidence-based way to support the need for services based on a BI score and its relation to mortality. |
| Validity: criterion validity (longitudinal) | “Adequate longitudinal validity” (Macleod & Counsell) | Spearman’s Correlation Coefficient | Adequate | According to the article, changes in scores on the BI translate to changes in other, similar measures. Clinically, this means that this measure shows valid changes in client factors similar to what other measures are capable of. However, this article reports longitudinal validity without a demonstration of a statistical analysis or explanation of procedures for obtaining the data, so this is an example where an appraisal of the article is necessary to ensure the results themselves are valid. |
| Reliability: internal consistency | 𝛂 = .83 (Bouwstra et al., 2018) | Cronbach’s alpha | Good | This measure shows a good ability to measure its intended client factors in the clinical setting, but it is not reliable enough to use for diagnostic purposes and may not be the best option for use in research. Clinically, this means the Barthel is a good tool for measuring ADL performance. |
| Reliability: test-retest reliability | ICC = .936 (Hormozi et al., 2019) | Intraclass Correlation Coefficient | Excellent | The assessment can be administered multiple times and still yield accurate results. There is no “learning” in the assessment to cheat the outcomes. In a clinical situation, the BI can be used multiple times in the continuum of care without client bias introduced, and will yield accurate and trustworthy results. |
| Reliability: inter-rater reliability | ICC = 0.96 (Bouwstra et al., 2018) | Intraclass Correlation Coefficient | Excellent | The BI will yield the same (or close to the same) results when different team members administer the evaluation. Clinically, this means that multiple team members can administer the BI at different points in the continuum of care, and the results will remain trustworthy and accurate. |
| Reliability: inter-rater reliability | | Cohen’s Kappa (used for yes/no outcome measures) | | Not applicable, as this measure uses a rating scale instead of a yes/no answer. |
| Reliability: intra-rater reliability | ICC = 0.96 (Bouwstra et al., 2018) | Intraclass Correlation Coefficient | Excellent | The BI will yield the same (or close to the same) results when the same clinician administers the outcome measure. Clinically, this means that this assessment can be used at several points in the continuum of care, without any influence of clinician bias, and the results will remain trustworthy and accurate. |
Other Commonly Seen Statistics
There are other measurements that work alongside this psychometric information to support an assessment’s overall clinical value. While these measurements are not considered psychometrics, they are frequently mentioned in research on clinical outcome measures and help inform the use of assessments in different ways. Clinicians will likely encounter these terms when researching outcome measures, so it is important to have a basic understanding of their meaning:
- The p-value is the probability of obtaining results at least as extreme as those observed if there were truly no effect or relationship. A small p-value (commonly below 0.05) suggests the result is unlikely to be due to chance alone; it does not certify that the result is true or clinically meaningful. P-values accompany many statistical tests, including those used to calculate psychometric properties.
- The standard error of measurement (SEM) estimates how much an individual’s observed score is likely to vary because of measurement error (for example, variation between administrators or sessions) rather than real change. Knowing this number helps clinicians determine whether a change observed during an evaluation is true or whether it may reflect measurement error. The SEM is also used to calculate the MDC.
- Sensitivity and specificity are used in diagnostic testing and “special tests” to describe how often a test produces false negatives and false positives.
  - Sensitivity is the likelihood that a test/tool will yield a positive result in a person who has the condition (a true positive). In simpler terms, high sensitivity means that a “special test,” such as Phalen’s test for carpal tunnel syndrome, will detect carpal tunnel syndrome in most people who have it.
  - Specificity is the likelihood that a test will yield a negative result in a person who does not have the condition (a true negative). Continuing with the example above, high specificity means that when Phalen’s test is used on healthy individuals, it usually yields a negative result, accurately identifying that they do not have carpal tunnel syndrome.
- A norm-referenced outcome measure is one where researchers test the measure with a large number of healthy participants, or participants with a shared diagnosis, to determine what is considered “normal.” This gives clinicians a reference for comparing their clients’ results to those of others.
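A few of these statistics are simple enough to sketch in code. The Python below uses two commonly cited formulas that this post does not spell out, so treat them as assumptions: SEM = SD × √(1 − reliability), and MDC at the 95% confidence level = 1.96 × √2 × SEM. It also shows sensitivity and specificity calculated from counts of true/false positives and negatives. All function names and example numbers are hypothetical.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM from a sample standard deviation and a reliability coefficient (e.g., ICC)."""
    return sd * math.sqrt(1 - reliability)


def mdc95(sem):
    """Minimal detectable change at the 95% confidence level."""
    return 1.96 * math.sqrt(2) * sem


def sensitivity(true_positives, false_negatives):
    """Share of people WITH the condition whom the test correctly flags."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Share of people WITHOUT the condition whom the test correctly clears."""
    return true_negatives / (true_negatives + false_positives)


# Made-up example: scores with SD = 10 points and test-retest ICC = 0.91.
sem = standard_error_of_measurement(sd=10, reliability=0.91)
print(round(sem, 2))         # 3.0
print(round(mdc95(sem), 2))  # 8.32

# Made-up example: 50 people with the condition and 50 without.
print(sensitivity(true_positives=40, false_negatives=10))  # 0.8
print(specificity(true_negatives=45, false_positives=5))   # 0.9
```

In the first example, a client’s score would need to change by a little more than 8 points before the change could be confidently separated from measurement error.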
Appraising Research
While psychometric information is a critical component of assessment research, therapists must appraise the research itself to ensure the trustworthiness of the psychometric report. It is common for multiple psychometric studies to report on the same outcome measure but yield different results. When this happens, it is up to the clinician to determine which study is more rigorous.
Here are a few things to look out for when analyzing a study’s credibility:
- Date: More current research is often conducted to challenge previous results, aiming to improve upon previous research methods. While this is not always the case, all research must consider existing, related papers, thus improving the context of the most current research paper.
- Authors: The number of authors can signal research credibility, as the more individuals involved in the research, the greater the thoroughness, extensiveness, and consideration of ethical values. The authors’ credentials should also be considered to ensure they have the training necessary to accurately conduct the research.
- Journal: The journal in which a paper is published shows the rigor of the research; journals with stronger reputations tend to be more selective of the research they accept, while those with lesser standing tend to be less selective.
- Citations: Articles with long lists of citations demonstrate careful consideration of previous research and use that to support the need for their research.
- Clarity: If a research article omits key information, such as specific data or procedural details, or if it is vague about any conflicts of interest, it may not be trustworthy.
- Sample: The number of participants in a study and the populations included in the study both demonstrate credibility. A larger number of participants makes the results more generalizable to the population being studied. The makeup of participants is also important. For example, a study researching the symptoms of Parkinson’s disease that includes only healthy individuals will not yield accurate results.
- Language: Language in a research article should be factual and professional. The presence of biased and emotional language indicates that the results and interpretation may also be biased.
Do All Outcome Measures Need to be Psychometrically Supported?
While the use of psychometrically sound outcome measures is important for gathering quantitative data during the evaluation process, it’s also important to recognize that standardized assessments are only one piece of the evaluation puzzle.
Assessment measures without psychometrics must be used with caution, but there is still value in using under-researched or subjective measures. Subjective measures, for example, can provide valuable insight into details that are not easily quantified. Assessments such as The Occupational Profile, caregiver burden scales, and food lists provide essential information for intervention planning and can reveal problem areas that standardized assessments may miss.
Likewise, clinical observation cannot be captured by an outcome measure, but it is often the most important component of the evaluation process.
Ultimately, while it’s important to use sound assessments that have been vetted for their psychometrics, these assessments should not overtake the importance of clinical observation and subjective information-gathering during the evaluation process.
What Does this All Mean for Clinicians?
Okay, we just threw a LOT of information at you. It might take some time to digest all the details, but here are a few quick takeaways to immediately bring into your daily OT practice:
1.) Whenever possible, look up the MCID and MDC for your assessment—and ideally, use these in your documentation.
2.) Try to prioritize outcome measures with:
- Studied psychometrics
- Strong psychometrics
- Relevant research for your specific population
This is currently NOT EASY. Information is hard to access, and this may be a project for you to undertake with your department. Locating systematic reviews of psychometrics can decrease the time needed to search for articles and can assist therapists with the appraisal of other research articles.
3.) Watch for (and ask for) more assistance as technology advances.
4.) Use all the resources available to assist with research access and use! These include:
- Evidence-Based Practice & Knowledge Translation | AOTA
- AOTA Knowledge Translation Toolkit
- EBP Resources – Test and Measures | APTA
- Evidence Maps | ASHA
- Evidence-Based Practices Resource Center | SAMHSA
- Practice Guidelines and Evidence-Based Clinical Resources | AOTA
Examples of Assessment Psychometrics
It is important to note that there are often several sets of psychometric data for a given assessment. Initial psychometric testing is typically conducted with the population the assessment was designed for, and subsequent studies examine its use with other specific populations. The chart below relays the psychometrics of just one population for each assessment.
| Validity | Reliability | MCID/MDC | Predictive Value | |
|---|---|---|---|---|
| 6-Minute Walk Test for Chronic Stroke | Criterion Validity Excellent correlation in 6MWT with VO2 max and 10MWT (Laswati et al., 2020) | Test-retest Reliability Excellent ICC = 0.99 (Flansbjer et al., 2005) | MCID 50 meters (Perera et al., 2006) MDC 36.6 meters/ 120 ft (Flansbjer et al., 2005) | A baseline distance of less than 290 meters (317 yards) results in a two–fold increase in all-cause mortality (Yazdanyar, 2014). |
| Inter-rater Reliability Excellent ICC = 0.78 (Kosak & Smith, 2005) | ||||
| 9-Hole Peg Test for Healthy Adults | Criterion Validity Adequate correlation with Purdue Pegboard test p = -0.74 to -0.75(Wang et al., 2011) | Test-retest Reliability Excellent Right Hand ICC = 0.95 Left Hand ICC = 0.92 (Wang et al., 2011) | Not available | Limited predictive studies are available |
| Interrater Reliability Excellent Right Hand r = 0.984 Left Hand r = 0.993 (Grice et al., 2003) | ||||
| 10-Meter Walk Test for Spinal Cord Injury | Construct Validity Excellent correlation between TUG and 10MWT r =0.89 (vanHedel et al, 2005) Excellent correlation between 10MWT and 6MWT (ρ = -0.95)(vanHedel et al, 2005) | Test-Retest Reliability Excellent ICC = 0.97(Bowden & Behrman, 2007) | MCID 0.06 m/s (Musselman et al., 2009) MDC 0.13 m/s (Lam et al., 2008) | Limited predictive indicators studied |
| Convergent Validity Excellent correlation with Berg Balance Scale r = 0.79 (Lemay & Nadeau, 2010) | Interrater Reliability Excellent r = 0.974 (vanHedel et al, 2005) | |||
| 30- Second Chair Stand for Hip Osteoarthritis | Construct Validity Excellent correlation to 50 ft. walk test: ICC = -0.64(Gill et al., 2012) | Test-Retest Reliability Excellent ICC (1,1) =0.97- 0.98 (Gill et al., 2008) | MCID 2-2.6 repetitions (Wright et al., 2011) | A score of less than average for the age group indicates a high risk for falls (Wallace & Shelkey, 2008) |
| Adequate correlation to SF-36 Physical Function (SF-36 PF): ICC = 0.39 (Gill et al., 2012) | Inter-Rater Reliability Excellent ICC (1,1) =0.93 – 0.98(Gill et al., 2008) | |||
| Barthel Index (BI) for Mixed Conditions in the Hospitals | Construct Validity Excellent correlation between the Barthel Index and Perme ICU Mobility Score ρ= 0.85 Excellent correlation between the Barthel Index and Functional Status Score for the ICUρ= 0.88 | Interrater Reliability Excellent ICC = 0.98(Dos Reis et al., 2022) | MCID 35 points (Castiglia et al., 2017) MDC 20.01 points (Dos Reis et al., 2022) | Score or 80.5 indicates readiness for discharge (Castiglia et al., 2017) Score less than <85 indicates risk of mortality 6 months post fracture sensitivity: 63.64%, specificity: 64.77% (Gonzalez et al, 2017) |
| Internal Consistency Good 𝞪= 0.81*Cronbach’s alpha if item is excluded…Feeding: 0.79Grooming: 0.81Toilet use: 0.81Bathing: 0.82Bowels: 0.81Bladder: 0.81Dressing: 0.81Transfers: 0.78Stairs: 0.76Ambulation: 0.76(Dos Reis et al., 2022) | ||||
| Adequate correlation between the Barthel Index and Hand grip dynamometry ρ = 0.57 (Dos Reis et al., 2022) | ||||
| Berg Balance Scale for Pulmonary Diseases | Convergent Validity Excellent correlation with Activities-specific Balance Confidence (ABC) Scale r = 0.75 (Jácome, 2016) | Interrater Reliability Excellent ICC = 0.94 (Jácome, 2016) | MDC 5.9 points (Jácome, 2016) | A score less than 52.5 points indicates a risk for falls in this population; sensitivity 73%, specificity 77% (Jácome, 2016) |
| Dynamometer/Grip Strength in Young Healthy Adults | Concurrent Validity Excellent correlation between the Rolyan dynamometer and known weights r = 0.9994; Excellent correlation between the Jamar dynamometer and known weights r = 0.9998 (Mathiowetz, 2002) | Test-Retest Reliability Excellent ICC = 0.81-0.99 in men; Excellent ICC = 0.83-1.0 in women (Reddon et al., 1985) | MCID 5.0 to 6.5 kg (Bohannon, 2019) | Low grip strength in 20-year-olds predicts all-cause mortality in later life (Chai et al., 2024). Low grip strength is a predictor of: type 2 diabetes, metabolic syndrome, cardiovascular diseases, dyslipidaemia, hypertension, cancers, liver disease, chronic kidney disease, chronic respiratory diseases, cognitive dysfunction, impaired mental health, musculoskeletal problems, all-cause mortality, nutritional status, institutional admissions, longer hospital stay, reduced quality of life, and functional disability (Vaishya et al., 2024) |
| Interrater Reliability Excellent ICC = 0.98 (Peolsson et al., 2001) | ||||
| Katz ADL Index in Cancer Patients | Construct Validity Excellent correlation between Katz ADL and the Instrumental Activities of Daily Living (IADL) scale in men, r = 0.756 | Test-Retest Reliability Excellent r = 0.944 (Mystakidou et al., 2013) | Scores predict the necessary level of care. The score of functional dependence in the acute setting highly predicts the need for later nursing home placement (Wallace & Shelkey, 2008). | |
| Adequate in women r = 0.572 (Mystakidou et al., 2013) | Internal Consistency Good α = 0.88 (Mystakidou et al., 2013) | |||
| Pinch Strength in Musculoskeletal Disorders | Construct Validity Good to Excellent correlations with grip strength r = 0.72-0.92 | Internal Consistency Excellent ICC = 0.90 (Szekeres et al., 2025) | MCID 0.23 kg (0.5 lb) for tip pinch; 0.30 kg (0.6 lb) for tripod pinch (Villafañe et al., 2017) | Poor lateral pinch strength, as compared to other pinches, can be indicative of ulnar nerve impairment. Tip pinch strength can indicate poor median nerve function. Lower peak pinch strength is associated with a higher risk for cognitive impairment (Ren et al., 2025). |
| Adequate to Good correlations with assessments of dexterity r = 0.78-0.80 | ||||
| Poor correlation with patient-reported outcome measures r = 0.03-0.50 (Szekeres et al., 2025) | ||||
| Pittsburgh Sleep Quality Index in Adolescents | Construct Validity Adequate correlation between PSQI and the CES-D r = 0.58 (Raniti et al., 2018) | Test-Retest Reliability Adequate ICC = 0.65 (Passos et al., 2016) | MDC 3.10 points (Passos et al., 2016) | Poorer sleep scores are associated with higher rates of depression and lower rates of optic functioning (Yun et al., 2025) |
| Construct Validity Excellent r = 0.96 (Raniti et al., 2018) | Internal Consistency Adequate α = 0.73 (Raniti et al., 2018) | |||
| Timed Up and Go Test in Stroke Patients | Criterion Validity Excellent correlation between the TUG and FGS r = -0.91; Excellent correlation between the TUG and 6MW r = -0.92 (Flansbjer et al., 2005) | Test-Retest Reliability Excellent ICC = 0.96 (Flansbjer et al., 2005) | MDC 2.9 seconds (Flansbjer et al., 2005) | Score over 14 seconds is an indicator that a stroke patient is at risk of falls |
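
If you like to see ideas spelled out step by step, here is a minimal Python sketch of how the MDC and MCID columns in the table might be applied to a client's change score. It is an illustration only, not a clinical tool: the `ChangeThresholds` class, the `interpret_change` function, and the labels it returns are hypothetical names we made up for this example. The threshold values are copied from the table rows for the Timed Up and Go (stroke) and the Barthel Index (hospital, mixed conditions), and they only apply to those populations.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeThresholds:
    """Hypothetical container for one measure in one population."""
    measure: str
    unit: str
    mdc: Optional[float] = None   # minimal detectable change, if reported
    mcid: Optional[float] = None  # minimal clinically important difference, if reported


def interpret_change(baseline: float, follow_up: float, t: ChangeThresholds) -> str:
    """Label the size of a score change.

    Only the magnitude is judged here. Whether the change is an improvement
    depends on the measure (lower is better on the TUG, higher on the Barthel).
    """
    change = abs(follow_up - baseline)
    if t.mdc is None and t.mcid is None:
        return "no thresholds available"
    if t.mdc is not None and change < t.mdc:
        return "within measurement error"
    if t.mcid is not None and change >= t.mcid:
        return "clinically important"
    if t.mdc is not None:
        return "beyond measurement error"
    return "below MCID"


# Values taken from the table above; they are population-specific.
tug_stroke = ChangeThresholds("Timed Up and Go", "seconds", mdc=2.9)
barthel_hospital = ChangeThresholds("Barthel Index", "points", mdc=20.01, mcid=35)

print(interpret_change(21.0, 19.5, tug_stroke))        # 1.5 s change
print(interpret_change(21.0, 17.5, tug_stroke))        # 3.5 s change
print(interpret_change(40, 65, barthel_hospital))      # 25-point change
print(interpret_change(40, 80, barthel_hospital))      # 40-point change
```

In this sketch, a change smaller than the MDC is treated as possible measurement error, and a change at or above the MCID is treated as meaningful to the client. That ordering is one reasonable reading of the two thresholds, so adjust it to match the source study for the measure you are using.
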
Conclusion
Outcome measures are woven into the fabric of everyday OT practice. They guide our goals, shape our interventions, and help us demonstrate the impact of our work. But only if we can rely on the results they produce—and as you’ve (hopefully) learned here, an assessment’s true clinical value has a lot to do with its psychometric strength.
Psychometrics offer a concrete way for us to determine whether a tool is reliable and valid: whether it actually measures what it claims to measure, and whether changes in score reflect real change. Yes, psychometrics can feel technical and overwhelming. Barriers like time constraints, paywalls, and dense research language are real. But even a foundational understanding empowers us to ask better questions and make more informed clinical decisions.
Outcome measures are more than a documentation box to check—they also influence eligibility, reimbursement, discharge planning, and the direction of care. When we select assessment tools grounded in sound psychometrics, we strengthen our practice and better serve our patients—and ultimately, that’s what evidence-based practice is all about.
Still have questions about psychometrics? We’re all ears. Drop us a line in the comment section below.
Trying to find the perfect assessment? Check out our comprehensive OT assessment library here.
References
- Alotaibi, N. M., Reed, K., & Nadar, M. N. (2009). Assessments used in occupational therapy practice: An exploratory study. Occupational Therapy in Health Care, 23(4), 302-318. https://doi.org/10.3109/07380570903222583
- Asaba, E., Nakamura, M., Asaba, A., & Kottorp, A. (2017). Integrating occupational therapy specific assessments in practice: Exploring practitioner experiences. Occupational Therapy International, 2017. https://doi.org/10.1155/2017/7602805
- Bohannon, R. W. (2019). Minimal clinically important difference for grip strength: A systematic review. Journal of Physical Therapy Science, 31(1), 75–78. https://doi.org/10.1589/jpts.31.75
- Bouwstra, H., Smit, E. B., Wattel, E. M., van der Wouden, J. C., Hertogh, C. M. P. M., Terluin, B., & Terwee, C. B. (2019). Measurement properties of the Barthel Index in geriatric rehabilitation. Journal of the American Medical Directors Association, 20(4), 420–425.e1. https://doi.org/10.1016/j.jamda.2018.09.033
- Bowden, M. G., & Behrman, A. L. (2007). Step Activity Monitor: accuracy and test-retest reliability in persons with incomplete spinal cord injury. Journal of Rehabilitation Research and Development, 44(3), 355–362. https://doi.org/10.1682/jrrd.2006.03.0033
- Carpenter, J. S., & Andrykowski, M. A. (1998). Psychometric evaluation of the Pittsburgh Sleep Quality Index. Journal of Psychosomatic Research, 45(1), 5–13. https://doi.org/10.1016/s0022-3999(97)00298-5
- Castiglia, S. F., Galeoto, G., Lauta, A., Palumbo, A., Tirinelli, F., Viselli, F., Santilli, V., & Sacchetti, M. L. (2017). The culturally adapted Italian version of the Barthel Index (IcaBI): Assessment of structural validity, inter-rater reliability and responsiveness to clinically relevant improvements in patients admitted to inpatient rehabilitation centers. Functional Neurology, 32(4), 221–228. https://doi.org/10.11138/fneur/2017.32.4.221
- Chai, L., Zhang, D., & Fan, J. (2024). Comparison of grip strength measurements for predicting all-cause mortality among adults aged 20+ years from the NHANES 2011–2014. Scientific Reports, 14, 29245. https://doi.org/10.1038/s41598-024-80487-y
- Demetrius, R. A., Freeman, L. M., Holmes, L. B., Hough, H. E., Tran, M., Steadman, A. R., Sylvia, S. S., Williams, J. T., & D’Amico, M. (2017). A review of occupation and impairment-based assessments used in occupational therapy. Occupation: A Medium of Inquiry about Health through Occupation, 2(1), 31-67. https://nsuworks.nova.edu/occupation/vol2/iss1/4
- Dos Reis, N. F., Figueiredo, F. C. X. S., Biscaro, R. R. M., Lunardelli, E. B., & Maurici, R. (2022). Psychometric properties of the Barthel Index used at intensive care unit discharge. American Journal of Critical Care, 31(1), 65-72. https://doi.org/10.4037/ajcc2022732
- Douglas, A., Liu, L., Warren, S., & Hopper, T. (2007). Cognitive assessments for older adults: Which ones are used by Canadian therapists and why. Canadian Journal of Occupational Therapy, 74(5), 370-381. https://doi.org/10.2182/cjot.07.010
- Flansbjer, U. B., Holmbäck, A. M., Downham, D., Patten, C., & Lexell, J. (2005). Reliability of gait performance tests in men and women with hemiparesis after stroke. Journal of Rehabilitation Medicine, 37(2), 75–82. https://doi.org/10.1080/16501970410017215
- Gill, S., de Morton, N. A., & McBurney, H. (2012). An investigation of the validity of six measures of physical function in people awaiting joint replacement surgery of the hip or knee. Clinical Rehabilitation, 26(10), 945–951. https://doi.org/10.1177/0269215511434993
- Gill, S., & McBurney, H. (2008). Reliability of performance-based measures in people awaiting joint replacement surgery of the hip or knee. Physiotherapy Research International, 13(3), 141–152. https://doi.org/10.1002/pri.411
- Gonzalez, N., Bilbao, A., Forjaz, M.J., Ayala, A., Orive, M., Garcia-Gutierrez, S., Las Hayas, C., & Quintana, J.M. (2018). Psychometric characteristics of the Spanish version of the Barthel Index. Aging Clinical and Experimental Research, 30(5), 489-497. https://doi.org/10.1007/s40520-017-0809-5
- Grice, K. O., Vogel, K. A., et al. (2003). Adult norms for a commercially available Nine-Hole Peg Test for finger dexterity. The American Journal of Occupational Therapy, 57(5), 570-573. https://doi.org/10.5014/ajot.57.5.570
- Hormozi, S., Alizadeh-Khoei, M., Sharifi, F., Taati, F., Aminalroaya, R., Fadaee, S., Angooti-Oshnari, L., & Saghebi, H. (2019). Iranian version of the Barthel Index: Validity and reliability in outpatients’ elderly. International Journal of Preventive Medicine, 10(130), 1-5. https://doi.org/10.4103/ijpvm.IJPVM_579_18
- Kosak, M., & Smith, T. (2005). Comparison of the 2-, 6-, and 12-minute walk tests in patients with stroke. Journal of Rehabilitation Research and Development, 42(1), 103–107. https://doi.org/10.1682/jrrd.2003.11.0171
- Lam, T., Noonan, V. K., Eng, J. J., & SCIRE Research Team (2008). A systematic review of functional ambulation outcome measures in spinal cord injury. Spinal Cord, 46(4), 246–254. https://doi.org/10.1038/sj.sc.3102134
- Laswati, H., Eka Putri, C. J., & Kurniawati, P. M. (2020). Relation between cardiorespiratory fitness measured with six-minute walk test and walking speed measured with 1-meter walk test in patients of post-subacute and chronic ischemic stroke. Indian Journal of Forensic Medicine & Toxicology, 14(2), 2265–2270.
- Lemay, J. F., & Nadeau, S. (2010). Standing balance assessment in ASIA D paraplegic and tetraplegic participants: Concurrent validity of the Berg Balance Scale. Spinal Cord, 48(3), 245–250. https://doi.org/10.1038/sc.2009.119
- Macleod, A. D., & Counsell, C. E. (2016). Predictors of functional dependency in Parkinson’s disease. Movement Disorders, 31(10), 1482–1488. https://doi.org/10.1002/mds.26751
- Manee, F. S., Nadar, M. S., Alotaibi, N. M., & Rassafiani, M. (2020). Cognitive assessments used in occupational therapy practice: A global perspective. Occupational Therapy International, 2020(1), 8914372. https://doi.org/10.1155/2020/8914372
- Mathiowetz, V. (2002). Comparison of Rolyan and Jamar dynamometers for measuring grip strength. Occupational Therapy International, 9(3), 201–209. https://doi.org/10.1002/oti.165
- Moruno-Miralles, P., Reyes-Torres, A., Talavera-Valverde, M.-Á., Souto-Gómez, A.-I., & Márquez-Álvarez, L.-J. (2020). Learning and development of diagnostic reasoning in occupational therapy undergraduate students. Occupational Therapy International, 2020. https://doi.org/10.1155/2020/6934579
- Mulligan, S., White, B. P., & Arthanat, S. (2014). An examination of occupation-based, client-centered, evidence-based occupational therapy practices in New Hampshire. OTJR: Occupation, Participation and Health, 34(2), 106-116. https://doi.org/10.3928/15394492-20140226-01
- Musselman, K. E., Fouad, K., Misiaszek, J. E., & Yang, J. F. (2009). Training of walking skills overground and on the treadmill: Case series on individuals with incomplete spinal cord injury. Physical Therapy, 89(6), 601–611. https://doi.org/10.2522/ptj.20080257
- Mystakidou, K., Tsilika, E., Parpa, E., Mitropoulou, E., Panagiotou, I., Galanos, A., & Gouliamos, A. (2013). Activities of daily living in Greek cancer patients treated in a palliative care unit. Supportive Care in Cancer, 21(1), 97-105. https://doi.org/10.1007/s00520-012-1497-5
- Peolsson, A., Hedlund, R., & Oberg, B. (2001). Intra- and inter-tester reliability and reference values for hand strength. Journal of Rehabilitation Medicine, 33(1), 36–41. https://doi.org/10.1080/165019701300006524
- Perera, S., Mody, S. H., Woodman, R. C., & Studenski, S. A. (2006). Meaningful change and responsiveness in common physical performance measures in older adults. Journal of the American Geriatrics Society, 54(5), 743–749. https://doi.org/10.1111/j.1532-5415.2006.00701.x
- Piernik-Yoder, B., & Beck, A. (2012). The use of standardized assessments in occupational therapy in the United States. Occupational Therapy in Health Care, 26(2-3), 97-108. https://doi.org/10.3109/07380577.2012.695103
- Reddon, J. R., Stefanyk, W. O., Gill, D. M., & Renney, C. (1985). Hand dynamometer: Effects of trials and sessions. Perceptual and Motor Skills, 61(3), 1195–1198. https://doi.org/10.2466/pms.1985.61.3f.1195
- Ren, J., Chen, X., Jiang, W., Chen, Q., & Yang, M. (2025). Pinch strength as an independent correlate of cognitive impairment beyond grip strength in older adults. Archives of Gerontology and Geriatrics, 137, 105938. https://doi.org/10.1016/j.archger.2025.105938
- Romli, M. H., Wan Yunus, F., & Mackenzie, L. (2019). Overview of reviews of standardised occupation‐based instruments for use in occupational therapy practice. Australian Occupational Therapy Journal, 66(4), 428-445. https://doi.org/10.1111/1440-1630.12572
- Rust, J. (2008). Psychometrics 1889. The Psychometric Center: Cambridge Judge Business School. https://www.psychometrics.cam.ac.uk/about-us/our-history/first-psychometric-laboratory
- Szekeres, M., Aspinall, D., Kulick, J., Sajid, A., Dabbagh, A., & MacDermid, J. (2025). Reliability, validity, and responsiveness of pinch strength assessment: A systematic review. Disability and Rehabilitation, 47(7), 1631–1643. https://doi.org/10.1080/09638288.2024.2382907
- Unnanuntana, A., Jarusriwanna, A., & Nepal, S. (2018). Validity and responsiveness of Barthel Index for measuring functional recovery after hemiarthroplasty for femoral neck fracture. Archives of Orthopaedic and Trauma Surgery, 138, 1671-1677. https://doi.org/10.1007/s00402-018-3020-z
- Vaishya, R., Misra, A., Vaish, A., Ursino, N., & D’Ambrosi, R. (2024). Hand grip strength as a proposed new vital sign of health: A narrative review of evidences. Journal of Health, Population, and Nutrition, 43(1), 7. https://doi.org/10.1186/s41043-024-00500-y
- van Hedel, H. J., Wirz, M., & Dietz, V. (2008). Standardized assessment of walking capacity after spinal cord injury: The European network approach. Neurological Research, 30(1), 61–73. https://doi.org/10.1179/016164107X230775
- Villafañe, J. H., Valdes, K., Bertozzi, L., & Negrini, S. (2017). Minimal clinically important difference of grip and pinch strength in women with thumb carpometacarpal osteoarthritis when compared to healthy subjects. Rehabilitation Nursing, 42(3), 139–145. https://doi.org/10.1002/rnj.196
- Wallace, M., & Shelkey, M. (2008). Monitoring functional status in hospitalized older adults. The American Journal of Nursing, 108(4), 64–72. https://doi.org/10.1097/01.NAJ.0000314811.46029.3d
- Wang, Y. C., Magasi, S. R., Bohannon, R. W., Reuben, D. B., McCreath, H. E., Bubela, D. J., Gershon, R. C., & Rymer, W. Z. (2011). Assessing dexterity function: a comparison of two alternatives for the NIH Toolbox. Journal of Hand Therapy, 24(4), 313–321. https://doi.org/10.1016/j.jht.2011.05.001
- Wright, A. A., Cook, C. E., Baxter, G. D., Dockerty, J. D., & Abbott, J. H. (2011). A comparison of 3 methodological approaches to defining major clinically important improvement of 4 performance measures in patients with hip osteoarthritis. The Journal of Orthopaedic and Sports Physical Therapy, 41(5), 319–327.
- Yazdanyar, A., Aziz, M. M., Enright, P. L., Edmundowicz, D., Boudreau, R., Sutton-Tyrell, K., Kuller, L., & Newman, A. B. (2014). Association between 6-Minute Walk Test and all-cause mortality, coronary heart disease-specific mortality, and incident coronary heart disease. Journal of Aging and Health, 26(4), 583–599. https://doi.org/10.1177/0898264314525665
- Yun, H., Seo, D., Jin, Y., Jang, I. H., Choi, L., Kim, J. H., Shin, W., Lee, H. M., Jung, H. J., Kim, H., Lim, Y. M., & Lee, E. J. (2025). Longitudinal changes in sleep quality, and their predictors in patients with multiple sclerosis. Scientific Reports, 15(1), 33153. https://doi.org/10.1038/s41598-025-18693-5
