Project Goal
The goal of this project is
to develop a foundation for a large-scale,
open source software for handwriting recognition for historical
documents by training a neural network to recognize handwriting of
19th century scribes. We aim to reach this goal
by using a combination of machine learning,
artificial intelligence techniques and software platforms, specifically utilizing Graphical Processing Unites (GPUs),
to transform simple screenshots of characters
from handwritten documents by slightly skewing each captured
image a number of times to create a larger training set.
Project Timeline
The first stage of our work was completed in the summer of 2019 by an undergraduate computer science student, who created a training set of over 16,000+ images of 22 different characters (or classes), averaging about 260 per class from 4 of volumes of the John Adams Papers. Through a partnership with the Massachusetts Historical Society, we have access to over 200 pages of handwritten material from John Quincy Adams from the early
19th century. Obtained through the
team’s previous engagement with this extensive collection of key historical documents, these pages have served as
the initial data source for our work. We chose this data set because of the regularity of the handwriting in these documents and the fact that they are already transcribed, enabling us to compare our automated work with the manually created transcripts.
In the Summer of 2020, the project was awarded a LYRASIS Catalyst grant to continue developing our model. We determined the best next step would be to further push the model and scale up to 70+ classes of characters to increase the data set by 3-5x and perform more tests. To create a larger data set, more work needed to be done reviewing the remaining 3 volumes of handwritten documents from the John Adams Papers to identify and capture screenshots of characters. The goal was to capture more images of the 22-character classes from the original work done in 2019 and identify new classes to further build up the training set for our model.
Progress and Findings
Materials and Links
Project GitHub (under construction)
Initial GitHub - https://github.com/mattlm0831/OCR-Handwriting
Image Capture Tool Code- https://github.com/bechardj/lc-tool
Image Capture Tool Site - https://lct.jbec.us/
Team
Greg Colati
Director, Digital Preservation Repository Program
UConn Library
Joseph Johnson
Associate Professor In Residence, Associate Director of Undergraduate Programs in Computing
UConn
Mike Kemezis
Repository Manager
UConn Library
Sara Sikes
Associate Director, Greenhouse Studios
UConn Library