Finding the Needle in the Haystack: CMU Students Develop AI Tool to Improve the Usability of Government Reports

By Jennifer Monahan

Combing through countless PDF reports for hours in search of a piece of relevant information is no one’s idea of an interesting day at work. Tedious, overwhelming, soul-crushing, maybe. Engaging? Not so much.

Dedicated public servants – and lots of other people – do it anyway, often, in service to some larger goal: to make the case for a new policy, to advocate for funding, or to explain a position. Recently, a team of graduate students from Carnegie Mellon University’s Heinz College of Information Systems and Public Policy and Integrated Innovation Institute in Professor Chris Goranson’s Policy Innovation Lab: Public Interest Technology course came up with a generative AI application that helps researchers find the information they seek in a matter of seconds, not hours.

Their tool, GovScan, provides government workers the ability to locate the proverbial needle in a haystack.

Team members Davis Craig (MSPPM ’24), Aakash Dolas (MIIPS ’24), Tyler Faris (MSPPM ’24), and Eashwari Samant (MIIPS ’24) spent seven weeks creating a tool that would improve the usability of government reports.

GovScan team members in Hamburg Hall, Left to Right: Tyler Faris, Eashwari Samant, Aakash Dolas, Davis Craig

Maya Mechenbier, a project lead in the federal government, shared a problem she has faced in her work, which gave the GovScan team a real-life challenge to solve. For this scenario, students connected with government workers tasked with reviewing reports for child care funding from all 50 states; each report might contain hundreds of pages. Policy analysts needed to find particular data points within those reports to analyze and compare the effectiveness of programs.

“Whether it’s for Medicaid or the Child Care Development Fund (CCDF) subsidy dollars, states’ plans are typically stored and made public in a PDF form,” explained Mechenbier. “Fifty states might do 50 different things with their programs.” The magnitude and variation can make it hard for a policy analyst to absorb such large quantities of data, determine who might be addressing certain rules in certain ways, or understand trends emerging across the country.

The student team created a working model that sifts through thousands of pages of those reports to answer analysts’ questions. For example, an analyst might ask GovScan, “Which states provide child care funding for low-income, single-parent households?” The tool scans all the PDF reports in its database and provides a list of results, complete with source citations.

“GovScan is like the ‘Control F’ search function on steroids,” explained Craig.

Why It’s a Game-Changer

The tool has two main benefits. The first is efficiency.

Policy analysts told the team that they typically spend three to four hours looking for data points within these reports. The GovScan platform gives an answer within about 30 seconds.

“It’s not efficiency for efficiency’s sake,” said Faris. “It’s efficiency for better decision-making and better management.”

Another challenge for analysts is knowing whether the haystack even contains a needle.

“People we interviewed were frustrated with the inherent uncertainty. It’s one thing to know that what you’re looking for is in a particular report and it’s just taking time to find it,” Samant said. Spending hours in search of information that doesn’t exist, by contrast, feels like a waste of time. GovScan helps analysts use their time more effectively by identifying which reports contain the information they need.

The GovScan application was designed not to replace humans, but to serve as a tool to help them work more efficiently and effectively.

“It reduces the cognitive load for researchers,” explained Dolas. “The saved time and effort free up humans to spend their time and attention on analyzing and understanding the results.”

The application is distinct from other search tools in a couple of important ways.

Platforms like Google or Bing search the Internet for information. Large language models (LLMs) such as ChatGPT or Bard also rely on the Internet as a data source.

In contrast, GovScan searches within a single, secure database of PDF files provided by an organization. The distinction is important because it keeps false or unvetted information from the open Internet out of the data source.

GovScan has another key difference from LLMs like ChatGPT. GovScan’s results are linked to the source material. When the user receives an answer to a prompt, they can click on the link for each fact and find the exact location of the source of the information within the original report.

How It Works

Craig uses the analogy of a library to explain Retrieval Augmented Generation (RAG), the technology behind GovScan.

“Imagine if you went to a library and there’s a big pile of books on the ground. It would be really hard to find the specific information you want,” Craig said. “That’s the issue with unstructured data, with all those PDF reports. So what we do is basically what librarians do – take all the books and index them so that they’re organized neatly.”

The next step is “semantic search.” Natural language processing engineers, in this case Craig, use a technique called vector embeddings to capture the semantic meaning of the question, then scan the indexed reports to find which ones are most relevant and which data points within them are most applicable to the user’s query. The application functions like a librarian helping someone use the card catalog to locate a particular book with a particular piece of information in it.
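The indexing and semantic search steps described above can be illustrated with a minimal Python sketch. It is not GovScan’s code: the sample passages, the function names, and the word-count “embedding” are all assumptions made for illustration. A production system would get its vectors from a trained embedding model, which captures meaning rather than just shared words; ranking passages by cosine similarity to the question is the same idea either way.

```python
import math
import re
from collections import Counter

# Hypothetical sample data: (state, page, passage). Not taken from real reports.
PASSAGES = [
    ("State A", 12, "Child care subsidies are available to single-parent households with low income."),
    ("State B", 47, "Provider reimbursement rates are updated every two years."),
    ("State C", 8, "Low-income families with a single parent receive priority for child care funding."),
]

def embed(text):
    """Toy stand-in for a learned embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(passages):
    """The 'librarian' step: store each passage with its location and vector."""
    return [(state, page, text, embed(text)) for state, page, text in passages]

def search(index, question, top_k=2):
    """Rank indexed passages by similarity to the question; keep the best top_k."""
    q = embed(question)
    scored = [(cosine(q, vec), state, page, text) for state, page, text, vec in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]

index = build_index(PASSAGES)
results = search(
    index,
    "Which states provide child care funding for low-income, single-parent households?",
)
for score, state, page, text in results:
    print(f"{score:.2f}  {state}, p. {page}: {text}")
```

Because each indexed passage keeps its state and page number, every result can be traced back to its place in the original report, and the unrelated State B passage drops out of the top results.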

Then the application puts the results together, gives them to an LLM, and the LLM is instructed to handle the information in a way that meets the user’s specific use case. With GovScan, the model is told to summarize the results, provide citations for the information, and link to the information source.
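As a rough sketch of this last step, the Python below assembles retrieved passages into a prompt that instructs the model to summarize and cite. The prompt wording, the `build_prompt` helper, and the sample passages are assumptions for illustration, not GovScan’s actual prompt. The call to the LLM itself is left out because the article does not say which model or API the team used.

```python
def build_prompt(question, passages):
    """Assemble an LLM prompt from retrieved passages, numbering each source."""
    sources = "\n".join(
        f"[{i}] {p['report']}, p. {p['page']}: {p['text']}"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the sources below.\n"
        "Summarize the findings and cite each fact with its source number, e.g. [1].\n"
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\n"
    )

# Hypothetical retrieved passages; report names and page numbers are made up.
retrieved = [
    {"report": "State C plan", "page": 8,
     "text": "Low-income families with a single parent receive priority for child care funding."},
    {"report": "State A plan", "page": 12,
     "text": "Child care subsidies are available to single-parent households with low income."},
]

prompt = build_prompt(
    "Which states provide child care funding for low-income, single-parent households?",
    retrieved,
)
print(prompt)
```

Numbering the sources in the prompt is one simple way to let the application map each citation in the model’s answer back to a report and page, which is what makes a clickable link to the source possible.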

What Happens Next

Craig, Dolas, Faris, and Samant have made their work available via a GitHub repository under an MIT open-source license, including the code they created for the query engine and data pipeline that enable GovScan’s operation. They are exploring options for further developing the tool.

The student team is careful to note that the application needs additional testing, but they are optimistic that GovScan is a workable tool that can help research officers and policy analysts do their jobs better.

GovScan Demo

Watch this video to see GovScan in action. Development is ongoing; contact Eashwari Samant to see a live demonstration of the current prototype.

“The tool might not seem all that flashy, but the utility of it against the sheer volume of data is significant,” said Goranson. “The team took the time to really understand the challenges facing their partner and then created something that directly addressed the problem.”

Mechenbier said the tool could be useful across many disciplines and for any federal agency that must process and analyze large quantities of data from PDF files.

“Their tool is something that could really improve the lives of policymakers in a tangible way, allowing these creative, smart people to do the analysis and writing they really want to be doing,” Mechenbier said.
