How to Handle Variables That Occur in Multiple Files in SEER-CAHPS
In cancer research, it is essential to use variables that capture key population groups as accurately as possible and for researchers to understand the history, characteristics, and limitations of variables within linked data. SEER-CAHPS includes SEER cancer registry data, the Medicare CAHPS care experience surveys, and Medicare enrollment data. As a result, there are some demographic variables that exist in more than one data source within the linked data resource.
Below we highlight two examples within SEER-CAHPS, race/ethnicity and gender/sex, and offer considerations, guidance, and recommendations for your analyses.
Race/ethnicity values from the Medicare enrollment data are based on race/ethnicity data from the Social Security Administration (SSA), which used limited categories (White/Black/“Other”) prior to 1980. The collection of race/ethnicity data was further complicated by the mix of coding schemes before and after 1980, the paucity of racial identification response options, and increasing numbers of persons declining to complete the race question on the Social Security form.1 (See Race/ethnicity variables in the Medicare Beneficiary Summary File (MBSF) for further details.) This uncertainty can lead to quality issues when attempting a nuanced analysis of race/ethnicity. In contrast, we consider the CAHPS care experience surveys to contain the most accurate “gold standard” information, as these data consist of self-reported race/ethnicity. However, in the case of proxy responses and instances when the race/ethnicity items are left blank, it could be useful to confirm or pull that information from another linked source.
Linking CAHPS data to SEER Program cancer registry and Medicare enrollment data enables researchers to address cases of missing data and confirm proxy responses. In doing so, researchers can better identify the treatment, utilization, and outcomes of specific groups and understand care experiences by race/ethnicity.
Description of Race/Ethnicity Variables in SEER-CAHPS Data
Race/ethnicity variables in Medicare-CAHPS
The Medicare CAHPS data include a categorical variable, sc_race (“Constructed: SEER-CAHPS Race, Hispanic and Race responses”), with the following eight mutually exclusive categories, among them Hispanic, Non-Hispanic (NH) North American Native, Mixed, and Unknown.
0 = Unknown
1 = NH White
2 = NH Black
3 = NH Other
4 = NH Asian
5 = Hispanic
6 = NH N. American Native
7 = NH Mixed
The CAHPS data also include the six indicator variables for race and ethnicity. There are no indicator variables for Mixed and Unknown race categories.
Variables for Race and Ethnicity and their codes
race_hispanic Hispanic or Latino origin/descent (any race)
race_white Race: White
race_black Race: Black/African American
race_asian Race: Asian
race_pacific Race: Native Hawaiian/Other Pacific Islander
race_natamer Race: American Indian/Alaska Native
Race/ethnicity variables in SEER data
The SEER Cancer Registry data in SEER-CAHPS include seven racial identification variables.2 The information in this section was adapted from the SEER documentation.
The multicategorical variable race_ethnicity identifies 30 categories. According to the field description in the SEER-CAHPS Cancer File Data Dictionary Cases,3 it gives “priority to non-white races for persons of mixed races” and is independent of Hispanic ethnicity.
Not all codes were in effect for all years. Starting with data through 2005 (November 2007 submission), the race/ethnicity variable used to create the race recodes was revised. Also, in the SEER file are three recoded race variables and two linkage variables, described below.
According to the SEER-CAHPS Data Dictionary, SEER Registry Participants San Francisco, San Jose-Monterey, and Los Angeles are permitted to use codes 14 and 20-97 for cases diagnosed after January 1, 1987. Greater California is permitted to use codes 14 and 20-97 for cases diagnosed after January 1, 1988. Other SEER participants may choose to recode cases diagnosed prior to 1991 using 14 and 20-97 if all cases in the following race codes are reviewed: 96 Other Asian; 97 Pacific Islander, NOS; 98 Other; and 99 Unknown.
SEER race_ethnicity codes
03=American Indian, Aleutian, Alaskan Native or Eskimo
(incl all indigenous populations of the Western hemisphere)
08=Korean (Effective with 1/1/1988 dx)
10=Vietnamese (Effective with 1/1/1988 dx)
11=Laotian (Effective with 1/1/1988 dx)
12=Hmong (Effective with 1/1/1988 dx)
13=Kampuchean (including Khmer and Cambodian) (Effective with 1/1/1998 dx)
14=Thai (Effective with 1/1/1994 dx)
15=Asian Indian or Pakistani, NOS (Effective with 1/1/1988 dx)
16=Asian Indian (Effective with 1/1/2010 dx)
17=Pakistani (Effective with 1/1/2010 dx)
20=Micronesian, NOS (Effective with 1/1/1991 dx)
21=Chamorran (Effective with 1/1/1991 dx)
22=Guamanian, NOS (Effective with 1/1/1991 dx)
25=Polynesian, NOS (Effective with 1/1/1991 dx)
26=Tahitian (Effective with 1/1/1991 dx)
27=Samoan (Effective with 1/1/1991 dx)
28=Tongan (Effective with 1/1/1991 dx)
30=Melanesian, NOS (Effective with 1/1/1991 dx)
31=Fiji Islander (Effective with 1/1/1991 dx)
32=New Guinean (Effective with 1/1/1991 dx)
96=Other Asian, including Asian, NOS and Oriental, NOS (Effective with 1/1/1991 dx)
97=Pacific Islander, NOS (Effective with 1/1/1991 dx)
Note: dx = diagnosis
Race_recode_White_Black_Other is based on the race variables and the American Indian/Native American Indian Health Service (IHS) link variable. This variable may be used to improve classification of Native Americans. Caution should be exercised when using this variable (see note below).
3=Other (American Indian/AK Native, Asian/Pacific Islander)
7=Other unspecified (1991+)
Race_recode_W_B_AI_API includes White, Black, American Indian/Alaska Native, Asian or Pacific Islander, Other unspecified (1991 and on), and Unknown. Caution should be exercised when using this variable (see note below).
3=American Indian/Alaska Native
4=Asian or Pacific Islander
7=Other unspecified (1991+)
OriginrecodeNHIAHispanicNonHisp includes Non-Spanish-Hispanic-Latino and Spanish-Hispanic-Latino. Caution should be exercised when using this variable (see note below).
IHS_Link identifies records that were sent for linkage with an IHS match or no IHS match, and records that are blank. Incidence files are periodically linked with IHS files to identify Native Americans. The race recode variable uses information from this field and the race variable to improve precision around whether a person is Native American or not.
0=Record sent for linkage, no IHS match
1=Record sent for linkage, IHS match
Please note: Caution should be exercised when using the recoded race variables. Available race codes for fields in the underlying incidence and mortality data have changed over the years. Both SEER incidence and National Center for Health Statistics (NCHS) mortality data have had a code for “all other races,” even when every race was represented and the “all other races” code was not needed. Cases and deaths were nevertheless coded to this category. Starting with the 2010 data, these incidence cases are now coded as “unknown” race.
Starting in 2005, the “race/ethnicity” variable used to create the race recodes in the SEER incidence data was revised. In addition to changes to the recodes, researchers should bear in mind that people of Hispanic ethnicity may be of any race, including White, Black, Asian/Pacific Islander, or American Indian/Alaska Native. Further information can be found on the SEER website.
Race/ethnicity variables in the Medicare Beneficiary Summary File (MBSF)
Race/ethnicity values from the Medicare enrollment data are populated primarily with data provided to the Social Security Administration (SSA) by beneficiaries at the time they apply for a Social Security number (SSN), when they apply or re-apply for benefits, or when they apply for a replacement Social Security card. Reporting race is voluntary, and the response is not checked or verified by the Social Security staff receiving the application.4
Medicare has historically relied on the race and ethnicity data individuals provided when they applied for an SSN. Before 1980, the SSN application form limited respondents to choosing Black, White, or Other. “Unknown” was used to classify persons who did not report any race.
In 1980, responding to an Office of Management and Budget (OMB) Directive, SSA decided on a single question combining the ethnic and racial topics, with permitted responses of: 1) White (non-Hispanic); 2) Black (non-Hispanic); 3) Hispanic; 4) Asian, Asian American, or Pacific Islander; and 5) American Indian or Alaska Native.5
Since 1989, parents of newborns have been asked if they would like the birth certificate data transmitted to SSA so that an account number can be issued. However, race/ethnicity information is not included in the information sent to SSA, because the birth certificate data are reserved for medical and health use only.
Historically, the Medicare Enrollment Database (EDB) has been the source for enrollment and demographic information in the MBSF; eventually CMS transitioned the source from the EDB to the Common Medicare Environment (CME) database.6
Medicare bene_race_cd codes
6=North American Native
The RTI Race Code
The RTI race code (RTI_RACE_CD) is contained in the MBSF and has been used in research and public reporting of minority health disparities.7,8 The RTI race code was created by taking the beneficiary race code historically used by the Social Security Administration (SSA) (and in turn used in CMS’s enrollment database) and applying an algorithm designed to identify more beneficiaries as Hispanic or Asian using last names and residence. The algorithm, developed by RTI International, classifies beneficiaries as Hispanic or Asian if their SSA race code equals 4 (Asian) or 5 (Hispanic), or if they have a first or last name determined to be likely Hispanic or Asian in origin.9 Researchers imputed race based on U.S. Census surname lists from 1990 and 2000 combined with residence in Puerto Rico or Hawaii.8
The RTI race code contains 6 categories: non-Hispanic White, non-Hispanic Black, Hispanic/Latino, Asian American/Pacific Islander, American Indian/Alaskan Native, and Other/Unknown. Although the RTI race code has its own limitations, it is considered to be more helpful for identifying Hispanics and more accurate than the Medicare Enrollment Database (EDB) race variable for Non-Hispanic whites and blacks compared to self-reported data collected during home visits.9
The EDB race variable and the RTI race code do not include subcategories for Black, Hispanic, or Asian American/Pacific Islander groups, nor do they identify racial/ethnic identities beyond the category “Other.”8 Researchers might consider using this variable if the focus of their study is people identifying as Hispanic, Asian, Black, or White, but this variable may not be the best choice if the focus of the study is on people who identify as American Indian/Alaska Native, mixed or multiple races, or some other race.
Medicare RTI_RACE_CD codes
2=Black (or African American)
6=American Indian/Alaska Native
Concordance Between Race/Ethnicity Values from Different Sources
We examined the concordance of race and ethnicity variables across the different files of SEER-CAHPS using an analytic file consisting of variables from CAHPS, SEER Cancer Registry data, and the Medicare Beneficiary Summary File (MBSF), linked by a unique, encrypted beneficiary identification code. Beneficiaries included in our sample were in the SEER cancer registry data, were continuously enrolled in Medicare (Fee-For-Service or Medicare Advantage) in the 12 months before and the 12 months after their cancer diagnosis, and completed at least one Medicare CAHPS survey in the period 2007-2019.
We compared the Medicare and SEER race/ethnicity variables to the CAHPS race/ethnicity variables using cross-tabulations and Cohen’s kappa coefficient. We found substantial agreement between CAHPS and Medicare (89.6%, κ=.652) race/ethnicity and moderate agreement between CAHPS and SEER (79.3%, κ=.457) race/ethnicity.
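For readers who want to run a similar comparison on their own extract, the following is a minimal sketch of percent agreement and Cohen's kappa computed from two paired lists of category codes. The function name and the toy data are illustrative assumptions; this is not the code or the data behind the figures reported above, and it assumes both sources have already been recoded to a common set of categories.

```python
import math
from collections import Counter

def agreement_and_kappa(ratings_a, ratings_b):
    """Return (observed agreement, Cohen's kappa) for two paired code lists."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: share of records where both sources give the same code.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement expected from each source's marginal distribution.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        # Both sources use one identical category; kappa is undefined.
        return p_o, math.nan
    return p_o, (p_o - p_e) / (1 - p_e)

# Toy, made-up codes (1 = NH White, 2 = NH Black, 5 = Hispanic).
cahps = [1, 1, 2, 5, 1, 2, 5, 1]
other = [1, 1, 2, 1, 1, 2, 5, 2]
p_o, kappa = agreement_and_kappa(cahps, other)
print(f"agreement={p_o:.3f}, kappa={kappa:.3f}")
```

Kappa discounts the agreement that would be expected by chance given how often each source uses each category, which is why it can be noticeably lower than the raw percent agreement when one category dominates.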
Considerations for Researchers
Self-Reported Versus Non-Self-Reported Data
Whenever possible, researchers should use the CAHPS self-reported race/ethnicity variables. If CAHPS race/ethnicity data are missing, researchers should use SEER. If both CAHPS and SEER are missing, use the MBSF. In instances where there is disagreement on race/ethnicity between CAHPS and the MBSF or SEER, researchers should use the race/ethnicity variable from the CAHPS survey data.
In the event of unintended missing racial self-identification—such as when a respondent skipped that question or entered “don’t know” (which may be the case when a proxy completes the survey)—analysts could consider using the following procedure:
- if the designation from the SEER data (race_ethnicity) matches the designation from the MBSF (bene_race_code), use that designation;
- if the race designation in the SEER data (race_ethnicity) and in the MBSF (bene_race_code) do not match, use the SEER race variable (race_ethnicity).
Where respondents declined to self-identify and other sources for race/ethnicity are missing or conflicting, analysts may elect to create a separate category for “missing/unknown”.
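The order of preference described above can be sketched as a small function. This is an illustration under stated assumptions, not an official SEER-CAHPS routine: the function and label names are hypothetical, the three inputs are assumed to have been recoded beforehand to one shared set of category labels, and `None` or `"Unknown"` is assumed to mark a missing value.

```python
UNKNOWN = "Unknown"  # assumed label for an unknown/missing designation

def is_missing(value):
    """Treat None and the assumed 'Unknown' label as missing."""
    return value is None or value == UNKNOWN

def resolve_race(cahps, seer, mbsf):
    """Return (designation, source) following the CAHPS > SEER > MBSF order."""
    if not is_missing(cahps):
        return cahps, "CAHPS"            # self-report always takes precedence
    if not is_missing(seer):
        if not is_missing(mbsf) and seer == mbsf:
            return seer, "SEER+MBSF"     # SEER and MBSF agree
        return seer, "SEER"              # mismatch or no MBSF value: use SEER
    if not is_missing(mbsf):
        return mbsf, "MBSF"
    return "Missing/Unknown", "none"     # separate category for unresolved cases

print(resolve_race(None, "NH Asian", "NH White"))
```

Returning the source alongside the designation makes it straightforward to report how many records relied on something other than self-report, which supports the detailed methods description recommended for such cases.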
When using race/ethnicity variables other than the self-reported CAHPS identifications, analysts should consider appropriate survey methods to address limitations (such as recoded and missing data) and describe in detail the methods applied. Some researchers consider imputation of race/ethnicity (when a person has not provided this information) to be unethical.10, 11 In addition, imputation may affect study results and conclusions. Analysts are advised to read the information on race recode changes and the guidance document on missing data (PDF) available on the SEER-CAHPS website.
This section provides information about the concordance between gender and sex variables in the SEER-CAHPS data linkage. We refer to “gender” when discussing the CAHPS variables, which are gender and sa_gender; and “sex” when discussing the MBSF variable sex_ident_cd and the SEER variable sex. Typically, sex refers to biological sex or sex assigned at birth. Gender refers to gender identity, which may or may not correspond to sex assigned at birth.
Description of Gender/Sex Variables in SEER-CAHPS Data
The CAHPS datafile includes the following two variables for gender. The variable gender is missing (denoted in the datafile by a single period) on 107,952 (38%) out of 282,747 CAHPS surveys examined from 2007-2019. According to the SEER-CAHPS Guidance Document “Missing Data in SEER-CAHPS,” a single period indicates “Question Not on Survey.”
The constructed variable sa_gender has 32 missing values during the same period. Variables with the prefix “sa_” are constructed by survey contractors and are related to sample characteristics. This variable is drawn from Medicare administrative data originally from the SSA.
1 = Male
2 = Female
sa_gender (constructed by survey contractors based on administrative data)
1 = Male
2 = Female
The SEER Cancer Registry data contains the following variable for sex.13
1 = Male
2 = Female
Medicare Beneficiary Summary File (MBSF) Datafile
In the MBSF enrollment data, there are variables for sex for each year (2007-2019) with the following nomenclature: SEX_IDENT_CD07 : SEX_IDENT_CD19. The most recent available variable for each beneficiary was used for this analysis. We identified only 3 instances where the value of the variable changed across time.
SEX_IDENT_CD07 : SEX_IDENT_CD19
0 = Unknown
1 = Male
2 = Female
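Selecting the most recent available value across the yearly MBSF variables could look like the sketch below. It assumes each beneficiary is held as a dictionary keyed by the SEX_IDENT_CD07 through SEX_IDENT_CD19 names, with values stored as strings; the function names are hypothetical, and skipping 0 (Unknown) as "not available" is a choice made here for illustration, not something the text above specifies.

```python
YEARS = range(2007, 2020)
SEX_COLUMNS = [f"SEX_IDENT_CD{year % 100:02d}" for year in YEARS]

def most_recent_sex(record, skip_unknown=True):
    """Return (column, value) for the latest year with a usable value, else None."""
    for column in reversed(SEX_COLUMNS):
        value = record.get(column)
        if value in (None, ""):
            continue                      # no value recorded for that year
        if skip_unknown and str(value) == "0":
            continue                      # assumption: 0 = Unknown is skipped
        return column, str(value)
    return None

def value_changed(record):
    """True if both Male (1) and Female (2) appear across the yearly variables."""
    values = {str(record.get(column)) for column in SEX_COLUMNS}
    return len(values & {"1", "2"}) > 1

example = {"SEX_IDENT_CD17": "2", "SEX_IDENT_CD18": "2", "SEX_IDENT_CD19": None}
print(most_recent_sex(example), value_changed(example))
```

A check like `value_changed` is one way to flag the rare records whose value differs across years so they can be reviewed individually.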
Concordance Between Gender/Sex Values from Different Sources
We examined gender and sex variables using the same analytic file described above. We compared the Medicare and SEER sex variables against the two CAHPS gender variables using cross-tabulations and Cohen’s kappa coefficient. There was almost perfect agreement between the CAHPS self-reported gender and Medicare (98.67%, κ=.9734), and between CAHPS and SEER (98.67%, κ=.9734). There was almost perfect agreement between the CAHPS variable sa_gender and Medicare (99.98%, κ=.9996) and between the CAHPS variable sa_gender and SEER (99.98%, κ=.9996).
Considerations for Researchers
Self-Reported Data Versus Data Collected by Other Means
The CAHPS variable gender is self-reported and has a large number of missing values, while the CAHPS variable sa_gender is based on Social Security Administration (SSA) data and has relatively few missing values.
Based on our examination, we recommend that researchers use the self-reported CAHPS variable gender, supplemented by the SEER variable sex when the self-reported information is missing. The MBSF sex variable reflects sex assigned at birth, while the SEER data are recorded at the time of diagnosis and are closer to self-report because they come from health records.
When working with proxy responses to the CAHPS survey, if the gender/sex variables disagree, we recommend using the SEER variable, because some proxies answer the demographic questions about themselves rather than the beneficiary.
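The two recommendations above can be combined into one small rule, sketched here under stated assumptions: values are stored as the strings "1" and "2", a single period marks a missing CAHPS value, and `is_proxy` is a hypothetical flag indicating that a proxy completed the survey. The function name is illustrative only.

```python
MISSING = (None, "", ".")  # "." is the missing-value marker noted for CAHPS

def resolve_gender_sex(cahps_gender, seer_sex, is_proxy=False):
    """Prefer self-reported CAHPS gender; fall back to or defer to SEER sex."""
    seer_available = seer_sex not in MISSING
    if cahps_gender in MISSING:
        # Self-report is missing: supplement with SEER when it is available.
        return seer_sex if seer_available else None
    if is_proxy and seer_available and cahps_gender != seer_sex:
        # A proxy may have answered about themselves: use SEER on disagreement.
        return seer_sex
    return cahps_gender

print(resolve_gender_sex(".", "2"))                 # missing self-report
print(resolve_gender_sex("1", "2", is_proxy=True))  # proxy disagreement
```

For non-proxy responses the self-reported value is kept even when it differs from SEER, consistent with treating self-report as the preferred source.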
- Scott CG. Identifying the Race or Ethnicity of SSI Recipients. Soc Sec Bull. 1999; 62(4): 9-20. PMID: 10769868
- The National Cancer Institute. Surveillance, Epidemiology, and End Results (SEER) Program (seer.cancer.gov). Submission: November 2019. SEER-CAHPS Cancer File Data Dictionary Cases. Diagnosis Years 1975-2017. Retrieved from https://healthcaredelivery.cancer.gov/seer-cahps/aboutdata/documentation.html
- SEER-CAHPS Data Dictionary
- Waldo DR. Accuracy and Bias of Race/Ethnicity Codes in the Medicare Enrollment Database. Health Care Financ Rev. 2004 Winter; 26(2): 61–72.
- Scott CG. Identifying the Race or Ethnicity of SSI Recipients. Soc Sec Bull. 1999; 62(4): 9-20. PMID: 10769868.
- Centers for Medicare & Medicaid Services. Department of Health and Human Services. May 2017. Version 1.0. Master Beneficiary Summary File (MBSF): Impact of Enrollment Source Data Conversion from EDB to CME. [CCW White Paper].
- Jarrín OF, Nyandege AN, Grafova IB, Dong X, Lin H. Validity of Race and Ethnicity Codes in Medicare Administrative Data Compared With Gold-standard Self-reported Race Collected During Routine Home Health Care Visits. Med Care. 2020 Jan;58(1):e1-e8. doi: 10.1097/MLR.0000000000001216. PMID: 31688554; PMCID: PMC6904433.
- Grafova IB, Jarrín OF. Beyond Black and White: Mapping Misclassification of Medicare Beneficiaries Race and Ethnicity. Med Care Res Rev. 2021 Oct;78(5):616-626. doi: 10.1177/1077558720935733. Epub 2020 Jul 7. PMID: 32633665; PMCID: PMC8602956.
- Research Triangle Institute (RTI) Race Code
- Randall Megan, Stern Alena, and Yipeng Su. March 2021. Five Ethical Risks to Consider before Filling Missing Race and Ethnicity Data. Workshop Findings on the Ethics of Data Imputation and Related Methods. Washington DC: Urban Institute.
- Lines Lisa and Humphrey Jamie. January 2022. Imputing Race & Ethnicity: Part 1. Medical Care. Medical Care Section of the American Public Health Association.