Dr. Peter Organisciak and Dr. Krystyna Matusiak, faculty in MCE’s Library and Information Science program, have been awarded a $277,000 grant from the Institute of Museum and Library Services (IMLS). The two-year grant will support a content-based study of text duplication and similarity in massive digital library collections.
The emergence of massive digital collections presents an opportunity to pose novel, collection-wide questions of published history, offering new ways to access and use library materials.
As Organisciak explains, access in libraries is usually driven by information describing materials, such as time, location, and subject matter. Digital libraries allow a new form of access: peering inside the books. At large scales, such information can yield fascinating insights, such as what types of books were being published in different parts of the country, how issues of the day were being addressed, and even which terms were most popular at key points in time.
The problem with searching and analyzing these huge libraries is that, at present, these digital archives contain an unknown number of duplicate copies of publications. In a physical library, that’s a good thing. Multiple users can check out and read multiple copies of the same book. When you’re looking for trends across culture or history, however, duplicated or repeating text can lead to a misleading understanding of reality.
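To make the idea of duplicate detection concrete, here is a minimal sketch of one common way to score how much two texts overlap: Jaccard similarity over sets of consecutive word sequences ("shingles"). This is an illustration only, not the method the project uses; the function names and the shingle length of three words are assumptions chosen for the example.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences in text, lowercased."""
    words = text.lower().split()
    if len(words) < k:
        # Texts shorter than k words become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(text_a, text_b, k=3):
    """Score two texts from 0.0 (no shared shingles) to 1.0 (identical shingle sets)."""
    return jaccard(shingles(text_a, k), shingles(text_b, k))
```

Two scans of the same edition would score close to 1.0, while unrelated books would score close to 0.0. A collection-wide study would also need a far more scalable approach than comparing every pair of books directly.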
Organisciak explains, “Massive digitization projects are perhaps best exemplified by the HathiTrust Digital Library, which contains roughly 16 million books collected from a broad consortium of university collaborations. There is much potential to learn from so much of the published record, and the purpose of this project’s deduplication efforts is to make those insights easier to observe. Eventually, we hope to extend our methods to make better content recommendations.”
Such a similarity algorithm could be used by libraries to make book recommendations to readers based on their themes, complementing existing approaches such as reader advisories. If a reader is interested in books like The Da Vinci Code, the algorithm could suggest books that share contextual similarities.
“Think of it like Spotify for books,” says Organisciak.
Could this new study mean the end of the aimlessness readers often experience upon finishing the last book by their favorite author?
In time, perhaps. But for now, it means that Organisciak, Matusiak, and Benjamin Schmidt, their research partner from Northeastern University, will be hard at work digitally combing through more than 16 million books to help researchers analyze publications more accurately and to help readers find the next book they’re most likely to fall in love with.