After the U.S. Census Bureau announced that it was changing how it protects the identities of individuals for the 2020 Census, a Penn State-led research team began to evaluate how these changes may affect census data integrity.
The Census Bureau is proposing to use differential privacy, a new method that attempts to protect the identities of individuals when publishing public data. Census data are used to distribute federal funding to communities and to determine congressional representation.
Alexis Santos, assistant professor of human development and family studies at Penn State, along with researchers Jeffrey Howard, assistant professor at the University of Texas at San Antonio, and Ashton Verdery, assistant professor of sociology, demography, and social data analytics at Penn State, examined mortality rates in 2010. The researchers compared the two methods of privacy protection and examined the implications of the change for understanding health disparities in the United States. The work was published recently in Proceedings of the National Academy of Sciences.
The research team discovered that when the differential privacy method was applied to census data, it produced dramatic changes in population counts for racial and ethnic minorities compared to the traditional methods.
“We focused on mortality rate estimates because they are an essential population-level metric for which data are collected and disseminated at the national level and because mortality rates are a critical indicator of population health,” said Santos.
The research team then explored the changes in mortality rates resulting from the two disclosure avoidance systems by metropolitan classifications.
“We discovered that by using differential privacy, there were both instances of under- and over-counting of the population. In rural areas, there was undercounting of racial and ethnic minorities, while in urban areas there was an overcounting of these populations,” Santos said.
The researchers found that some discrepancies between the two methods exceeded 10%.
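To see why a shift in population counts matters for mortality rates, consider a minimal sketch of the arithmetic. The figures below are hypothetical and are not taken from the study; the function names `mortality_rate` and `percent_difference` are illustrative only. The sketch assumes a crude rate (deaths divided by population), so the death count stays fixed and only the population denominator changes between the two methods.

```python
def mortality_rate(deaths, population, per=100_000):
    """Crude mortality rate: deaths per `per` residents."""
    return deaths / population * per


def percent_difference(baseline, alternative):
    """Relative change of `alternative` with respect to `baseline`, in percent."""
    return (alternative - baseline) / baseline * 100


# Hypothetical county: the death count is the same under both methods,
# but the published population count differs (all numbers are made up).
deaths = 50
pop_traditional = 5_000   # count under the traditional method (assumed)
pop_dp = 4_400            # count under differential privacy (assumed undercount)

rate_traditional = mortality_rate(deaths, pop_traditional)
rate_dp = mortality_rate(deaths, pop_dp)
diff = percent_difference(rate_traditional, rate_dp)

print(f"Traditional: {rate_traditional:.1f} deaths per 100,000")
print(f"Differential privacy: {rate_dp:.1f} deaths per 100,000")
print(f"Difference: {diff:.1f}%")
```

In this made-up case, an undercount of 600 residents raises the computed rate by more than 10% even though the number of deaths has not changed. An overcount would push the rate in the opposite direction.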
“This is very concerning because it could impact how much funding programs receive for a specific geographic area,” said Santos. “These discrepancies could result in understating health risks in some areas while overstating them in others where there isn’t a great need.”
According to Santos, the findings highlight the consequences of implementing differential privacy and demonstrate the challenges in using the data products derived from this method.
“The Census Bureau has been very receptive to our research, and demonstrated concern about the accuracy of the data,” Santos said. “We plan to move forward with additional research to determine how differential privacy may affect population growth estimates and population changes from census year to census year. We still have time to fine-tune the differential privacy algorithm, and our research will help pinpoint areas of improvement.”
Santos, who is also a co-funded faculty member of the Social Science Research Institute, and the research team were supported by the Population Research Institute and the Administrative Data Accelerator at Penn State. The work was also supported by the Center for Community Based and Applied Health Research at the University of Texas at San Antonio.