A large chunk of a government's budget can be traced back to a small number of frequently used, expensive programs. These can include the costs of adult and juvenile incarceration, foster care for endangered children, or safety net services such as treatment for mental health or substance abuse for poor individuals. These programs don't operate in isolation; many individuals or families in one of the above programs will also be in at least one more at some point in their lives. Finding these social service "hotspots" could allow governments to more effectively distribute resources, reducing costs without sacrificing services at a time when budgets are especially tight.
But the data from each of these programs are walled off in different departments, such as the Departments of Corrections or Children and Family Services, with limited to no sharing across bureaucratic lines. In his Sept. 27 talk at the Computation Institute, Robert Goerge, a CI senior fellow and senior research fellow at the University of Chicago's Chapin Hall, described how integrating these silos of public sector data can inform more efficient government spending, and how computation can help.
Public sector data is not the same as public data, Goerge said; governments often guard it fiercely to protect privacy on sensitive matters such as child abuse and mental illness. But Goerge and his collaborators worked with the state government of Illinois to obtain rare access to databases from various departments in order to figure out how to build canals between those pools of data and unleash their potential.
"The Governor's office thought there were a very small number of families in Illinois that use a large portion of the state's services or resources," Goerge said. "If the state could better understand where these families are and what services they're using, the state could do a better job of serving them. The problem was we simply didn't know how many people use multiple programs, what they cost the state, and where they were."
But even once you have the data, merging the different data sets into one master database is no simple matter. Typically, each state department builds and manages its own custom database, which has evolved in isolation under its own unique conditions. "There was nobody thinking about a database that cuts across these systems," Goerge said, leading to a variety of formats that aren't easy to merge.
In order to find specific individuals or families who were present in multiple databases, Goerge and his collaborators needed to perform record linkage, matching the records in different databases that belong to the same person. Simply matching names is not enough, as data entry errors, misspellings, or different naming conventions can complicate the process. A single person may be John, Jon, or Jonathan in different databases, or may have a name (like Goerge's own, he pointed out) that is often misspelled. Social security numbers aren't a perfect match either -- some people don't have one, some numbers are used by multiple people (such as a parent and a child), and data entry errors can cause snags again. With no national identity number, there's no perfect solution for a simple one-to-one match.
"There's no such thing as a unique identifier, that's the golden rule of matching," Goerge said. "We always have to do record linkage to make sure we are doing a good job of connecting people that really are the same person."
This is where computation helps. Linkage algorithms perform "probabilistic" matching, estimating the probability that two records in separate datasets belong to the same person by comparing the various fields of information. For example, if a John Smith and a Jon Smythe are both listed at the same address in different data sets, the algorithm would assign a high probability that the two records represent the same person. Not all comparisons are so easy, so the algorithm looks at all the data and assigns a probability figure to each comparison; the researcher then decides what the probability cutoff should be to declare a match. The method, while not perfect, makes the best of data that can occasionally be spotty or incomplete.
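For readers who want to see the idea in miniature, here is a rough Python sketch of probabilistic matching. It is not the researchers' actual method, which the talk did not detail. It borrows the common approach of giving each field an agreement weight and adding those weights up as log-odds (often called Fellegi-Sunter-style weighting). The field names, the per-field probabilities, the similarity threshold, and the prior are all made-up values for illustration.

```python
from difflib import SequenceMatcher
from math import exp, log

# Hypothetical per-field parameters (illustrative only, not from the study):
#   m = P(field agrees | records are the same person)
#   u = P(field agrees | records are different people)
FIELD_PARAMS = {
    "first_name": (0.90, 0.05),
    "last_name": (0.90, 0.02),
    "address": (0.80, 0.01),
    "birth_date": (0.95, 0.003),
}


def agrees(a, b, threshold=0.7):
    """Return True/False for fuzzy agreement, or None if either value is missing."""
    if not a or not b:
        return None
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def match_probability(rec_a, rec_b, prior=0.001):
    """Combine field-level evidence into a probability that two records match.

    `prior` is an assumed chance that a random pair of records is a true match.
    """
    log_odds = log(prior / (1 - prior))
    for field, (m, u) in FIELD_PARAMS.items():
        result = agrees(rec_a.get(field), rec_b.get(field))
        if result is None:
            continue  # a missing field adds no evidence either way
        if result:
            log_odds += log(m / u)  # agreement raises the odds
        else:
            log_odds += log((1 - m) / (1 - u))  # disagreement lowers them
    return 1 / (1 + exp(-log_odds))


# Toy records echoing the example in the text; no birth date on file for either.
a = {"first_name": "John", "last_name": "Smith", "address": "12 Main St"}
b = {"first_name": "Jon", "last_name": "Smythe", "address": "12 Main St"}

CUTOFF = 0.9  # the researcher's chosen threshold (an assumption here)
p = match_probability(a, b)
print(f"match probability: {p:.3f}, declared match: {p >= CUTOFF}")
```

Skipping missing fields rather than counting them as disagreements is one simple way to keep spotty records usable. Moving the cutoff trades missed matches against false ones, which is the judgment call left to the researcher.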
"The reliability of data fields is important, but you can still get a lot of information out of 'bad' data," Goerge said.
When the matching was completed, the researchers could place their individuals into families and identify those who used multiple services. Their hypothesis was confirmed, as 23 percent of families were found in multiple state programs, and these families accounted for 86 percent of state costs -- around $3 billion. The data could also be organized to show where these families lived in the state, highlighting areas that utilize more resources and thus may be good places to concentrate interventions. Counties in the southern part of the state and poorer areas of Chicago showed the highest incidence of these families, with sometimes striking specificity -- Goerge pointed out a neighborhood on the North Side of the city where an area with a high rate of resource use was literally across the street from an area with a low rate.
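Once records are linked and grouped into families, the headline figures come down to simple arithmetic: count the families that appear in more than one program, then add up their share of total cost. The Python sketch below shows that calculation on toy data. The family IDs, program names, and dollar amounts are invented for illustration and are not the study's data, and the `(family_id, program, cost)` layout is an assumed format.

```python
from collections import defaultdict


def multi_program_summary(usage):
    """Summarize multi-program families from (family_id, program, cost) rows.

    Assumes `usage` is non-empty and total cost is greater than zero.
    """
    programs = defaultdict(set)
    costs = defaultdict(float)
    for family_id, program, cost in usage:
        programs[family_id].add(program)
        costs[family_id] += cost

    # Families that show up in more than one distinct program
    multi = {fam for fam, progs in programs.items() if len(progs) > 1}
    total_cost = sum(costs.values())
    return {
        "share_of_families": len(multi) / len(programs),
        "share_of_costs": sum(costs[fam] for fam in multi) / total_cost,
    }


# Toy rows (hypothetical): only family F1 uses more than one program.
toy_usage = [
    ("F1", "foster_care", 100.0),
    ("F1", "mental_health", 300.0),
    ("F2", "corrections", 50.0),
    ("F3", "mental_health", 30.0),
    ("F4", "foster_care", 20.0),
]

summary = multi_program_summary(toy_usage)
print(summary)
```

Grouping the same rows by county or neighborhood instead of by family would give the kind of geographic breakdown described above.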
But using the data to identify these hotspots is only the first step. Due to privacy concerns, the researchers cannot contact the families themselves to collect more data; they can only hand their findings to the state, which decides what to do with them. Goerge pointed out that, unfortunately, the governor who initially commissioned the study currently resides in federal prison, but he said that the current Illinois administration has shown interest in both using the researchers' data and expanding the research. Goerge said he's interested in focusing on which factors drive these high costs, or what tips a family with zero or only one problem requiring state resources into using multiple resources.
"We know where these families are and their problems, and so then the next question is what do we do about this?" Goerge said. "Is there something you can grab onto in these families, that you can build on to improve their future outcomes?"