EE Seminar: Higher Criticism for discriminating frequency-tables and testing authorship

14 January 2020, 15:00
Room 011, Kitot Building 

(The talk will be given in English)

 

Speaker:     Dr. Alon Kipnis
                    Department of Statistics, Stanford University

 

TUESDAY, January 14th, 2020
15:00 - 16:00

Room 011, Kitot Bldg., Faculty of Engineering

 

Higher Criticism for discriminating frequency-tables and testing authorship

Abstract

The Higher Criticism (HC) test is a useful tool for detecting the presence of a signal spread across a vast number of features, especially in the sparse setting, where only a few features are useful while the rest contain only noise. We adapt the HC test to the two-sample setting of detecting changes between two frequency tables. We apply this adaptation to authorship attribution challenges, where the goal is to identify the author of a document using other documents whose authorship is known. The method is simple yet performs well without handcrafting or tuning. Furthermore, as an inherent side effect, the HC calculation identifies a subset of discriminating words, which allows additional interpretation of the results. Our examples include authorship in the Federalist Papers, machine-generated texts, and the identity of the creator of Bitcoin.

We take two approaches to analyze the success of our method. First, we show that, in practice, the discriminating words identified by the test have low variance across documents belonging to a corpus of homogeneous authorship. We conclude that in testing a new document against the corpus of an author, HC is mostly affected by words characteristic of that author and is relatively unaffected by topic structure. Second, we analyze the power of the test in discriminating between two multinomial distributions under a sparse and weak perturbation model. We show that our test has maximal power over a wide range of the model parameters, even though these parameters are unknown to the user.
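For readers unfamiliar with HC, the following is a minimal illustrative sketch of the general idea described in the abstract, not the speaker's implementation. It assumes a two-proportion z-test with a normal approximation for the per-word p-values (the test used in the talk may differ) and the classical form of the HC statistic, maximized over the smallest half of the sorted p-values. The function names, the `gamma` parameter, and the toy word counts are hypothetical.

```python
import math
from collections import Counter


def word_pvalues(counts1, counts2):
    """Two-sided p-value per word from a two-proportion z-test.

    Assumption: a normal approximation stands in for whatever
    per-word test is used in the talk.
    """
    n1, n2 = sum(counts1.values()), sum(counts2.values())
    pvals = {}
    for word in set(counts1) | set(counts2):
        x1, x2 = counts1.get(word, 0), counts2.get(word, 0)
        pooled = (x1 + x2) / (n1 + n2)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        if se == 0:
            pvals[word] = 1.0  # degenerate table: no evidence either way
            continue
        z = (x1 / n1 - x2 / n2) / se
        pvals[word] = math.erfc(abs(z) / math.sqrt(2))
    return pvals


def higher_criticism(pvals, gamma=0.5):
    """Return (HC score, words whose p-values fall at or below the HC maximizer)."""
    items = sorted(pvals.items(), key=lambda kv: kv[1])
    n = len(items)
    if n < 2:
        raise ValueError("need at least two features")
    best, best_i = -math.inf, 0
    # Scan only the smallest gamma-fraction of the sorted p-values.
    for i, (_, p) in enumerate(items[: max(1, int(gamma * n))], start=1):
        frac = i / n
        if frac >= 1:
            break  # the denominator vanishes at i == n
        score = math.sqrt(n) * (frac - p) / math.sqrt(frac * (1 - frac))
        if score > best:
            best, best_i = score, i
    return best, [word for word, _ in items[:best_i]]


# Toy frequency tables (made-up counts, for illustration only).
table_a = Counter({"the": 500, "of": 300, "upon": 40, "while": 2, "and": 400, "to": 350})
table_b = Counter({"the": 510, "of": 290, "upon": 1, "while": 30, "and": 395, "to": 360})

hc, words = higher_criticism(word_pvalues(table_a, table_b))
print(hc, sorted(words))
```

In this toy example the two words whose relative frequencies differ sharply between the tables are the ones selected, which mirrors the abstract's point that the HC calculation yields a set of discriminating words as a by-product.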

Short Bio
Alon Kipnis is a postdoctoral scholar in the Department of Statistics at Stanford University. He received his B.Sc. degree in mathematics (summa cum laude) and his B.Sc. degree in electrical engineering (summa cum laude), both in 2010, and his M.Sc. degree in mathematics in 2012, all from Ben-Gurion University of the Negev. He received his Ph.D. degree in electrical engineering from Stanford University. His research combines data compression and dimensionality reduction techniques with classical methods in signal processing and machine learning.
