EE Seminar: Word-Spotting applications for historical documents

Speaker: Adi Silberpfennig

M.Sc. student under the supervision of Prof. Lior Wolf

 

Sunday, February 5th, 2017 at 15:30

Room 011, Kitot Bldg., Faculty of Engineering

 

Word-Spotting applications for historical documents

 

Historical documents have been undergoing large-scale digitization in recent years, making massive image collections available online. Optical character recognition (OCR) quality for historical manuscripts and for documents printed in old typefaces is still lacking. As an alternative, or in addition, one can perform an image-based search.

In this talk we will show a simple and efficient pipeline for word spotting in historical documents and how it is used in several applications.

We propose an effective unsupervised pipeline for improving OCR. It employs a baseline OCR engine as a black box, plus a dataset of unlabeled document images. Given a new document to be analyzed, the black-box recognition engine is applied first. For each recognition result, word spotting is carried out within the dataset, and the spotting results are then used to improve the OCR output.

We also present an image-based approach for retrieving related articles in a newspaper. Given a dictionary, a synthetic image is generated for every word in it, and each of these words is treated as a query. A set of unlabeled documents is first fed into the word-spotting engine. Then, based on the spotting results, a normalized TF-IDF vector representation is computed for every document, and article retrieval is performed by a nearest-neighbor search.
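The retrieval step described above can be illustrated with a minimal sketch. It assumes the word-spotting engine has already produced, for each document, a list of spotted dictionary words. The function names, the log-based IDF weighting, the L2 normalization and the cosine-similarity search are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocabulary):
    """Return one L2-normalized TF-IDF vector per document.

    docs: list of lists of spotted words (hypothetical spotting output).
    vocabulary: list of dictionary words; fixes the vector dimensions.
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # Document frequency: in how many documents each word was spotted.
    df = {w: sum(1 for c in counts if c[w] > 0) for w in vocabulary}
    # Assumed weighting: idf = log(N / df); words never spotted get 0.
    idf = {w: math.log(n_docs / df[w]) if df[w] else 0.0 for w in vocabulary}

    vectors = []
    for c in counts:
        total = sum(c[w] for w in vocabulary) or 1
        vec = [(c[w] / total) * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def nearest_neighbor(query_index, vectors):
    """Return the index of the most similar other document.

    The vectors are unit length, so the dot product is the cosine similarity.
    """
    query = vectors[query_index]
    best_index, best_similarity = None, -1.0
    for i, vec in enumerate(vectors):
        if i == query_index:
            continue
        similarity = sum(a * b for a, b in zip(query, vec))
        if similarity > best_similarity:
            best_index, best_similarity = i, similarity
    return best_index
```

With three toy articles, two about shipping and one about grain prices, the nearest neighbor of either shipping article is the other one, because only they share spotted words. A real collection would call for sparse vectors and an indexed search rather than this linear scan.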

Another application shown here is an operational word-spotting engine. In collaboration with the Friedberg Genizah Project, we developed a real-time word-spotting engine that is incorporated into a large-scale collection of historical manuscripts, the Cairo Genizah.
