Dr. Murat Dundar
Associate Professor
Computer & Information Science Dept.

 

Authorship Attribution: Machine learning to predict the authors of text passages

Background:

Authorship attribution is the problem of finding the original author of a written text, given other texts written by the potential authors. In this project we will use over one thousand publicly available books, authored by a total of fifty American and British novelists and short story writers who lived between the late 18th and early 20th centuries, to explore the feasibility of a variety of machine learning techniques for solving the authorship attribution problem.

The literature on this problem is quite rich, with a large body of prior work approaching it from a variety of domains and directions; see the recommended readings below.

Data Preprocessing:

All books are obtained from the GDELT database (http://www.gdeltproject.org/). The database is queried by author name, and one text file is created containing all available books of each queried author.

These text files are processed using standard NLP tools to remove all numbers, special characters, and stop words. The first five thousand and the last two thousand characters of each book are also removed to discard potentially author-identifying preface and postface text. No stemming is applied. The initial vocabulary contains more than 1.6 million unique words, most of which appear to be artifacts of the OCR applied to the scanned books. We apply the following filters to reduce the vocabulary size.
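As a concrete illustration, the Python sketch below applies these cleaning steps to one raw book. It assumes plain-text input and the NLTK English stop word list; the specific NLP toolkit actually used is not stated above.

    import re
    from nltk.corpus import stopwords   # requires: nltk.download('stopwords')

    STOP_WORDS = set(stopwords.words('english'))

    def preprocess_book(raw_text):
        """Trim front/back matter, strip numbers and special characters, drop stop words."""
        # Drop the first 5000 and last 2000 characters (potential preface/postface).
        body = raw_text[5000:-2000]
        # Keep alphabetic tokens only, which removes numbers and special characters.
        tokens = re.findall(r"[a-z]+", body.lower())
        # Remove stop words; no stemming is applied.
        return [t for t in tokens if t not in STOP_WORDS]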

Only words used by at least ten of the fifty authors are included, to eliminate domain- and author-specific words. Only words used at least twenty times in total across all books are included, to eliminate very rare words. These two filters reduce the vocabulary to fewer than fifty thousand words. We then compute tf-idf scores for the individual books of all fifty authors and choose the ten thousand words with the lowest mean tf-idf score across all authors as our final vocabulary.
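The sketch below illustrates the two frequency filters and the tf-idf selection, assuming each preprocessed book is a list of tokens and that books_by_author maps an author id to that author's books (all names here are illustrative, not part of the released data):

    from collections import Counter
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    def build_vocabulary(books_by_author, min_authors=10, min_count=20, vocab_size=10000):
        # Count how many authors use each word and each word's total frequency.
        author_usage, total_counts = Counter(), Counter()
        for books in books_by_author.values():
            seen = set()
            for tokens in books:
                total_counts.update(tokens)
                seen.update(tokens)
            author_usage.update(seen)

        # Filter 1: used by at least min_authors authors; filter 2: used min_count times overall.
        candidates = sorted(w for w in total_counts
                            if author_usage[w] >= min_authors and total_counts[w] >= min_count)

        # tf-idf over individual books, restricted to the candidate words.
        docs, doc_authors = [], []
        for author, books in books_by_author.items():
            for tokens in books:
                docs.append(' '.join(tokens))
                doc_authors.append(author)
        tfidf = TfidfVectorizer(vocabulary=candidates)
        scores = tfidf.fit_transform(docs).toarray()

        # Mean tf-idf per author, averaged across authors; keep the lowest-scoring words.
        doc_authors = np.array(doc_authors)
        per_author = np.vstack([scores[doc_authors == a].mean(axis=0)
                                for a in books_by_author])
        keep = np.argsort(per_author.mean(axis=0))[:vocab_size]
        terms = tfidf.get_feature_names_out()
        return [terms[i] for i in keep]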

We then assign a numeric id to each word and replace all words in the preprocessed text with their corresponding ids, blinding the competition to the actual content of the text. Each book is then split into text passages of one thousand words each. These passages are divided into training and test sets: all books written by five of the authors are placed in the test set only. This yields a non-exhaustive training set; in other words, the training set contains books written by only forty-five of the fifty authors represented in the test set.
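A minimal sketch of the id mapping and passage splitting, assuming vocab is the final ten-thousand-word list and tokens is one preprocessed book (illustrative names only):

    def encode_and_split(tokens, vocab, passage_len=1000):
        """Replace words with numeric ids and split one book into fixed-length passages."""
        word_to_id = {w: i for i, w in enumerate(vocab)}
        # Keep only in-vocabulary words, replacing each by its numeric id.
        ids = [word_to_id[t] for t in tokens if t in word_to_id]
        # Consecutive, non-overlapping passages of passage_len words (remainder dropped).
        return [ids[i:i + passage_len]
                for i in range(0, len(ids) - passage_len + 1, passage_len)]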


Challenge and Submissions:

Challenge participants are asked to build a classifier using the released training data. The trained classifiers will be evaluated based on their performance in the following two tasks:

  • Classifying text passages of the forty-five authors represented in the training set. For this task, use the same author ids as in the training set when attributing a text passage to one of these authors.
  • Identifying text passages of the five authors not represented in the training set. For this task, you may use a new numeric id (one not previously used by any author in the training set) when attributing a passage to any of these five unseen authors. One possible baseline for handling both tasks is sketched after this list.
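One simple baseline, offered only as a sketch and not as the required approach, is to train a closed-set classifier on the forty-five known authors and attribute a passage to a single new "unseen author" id whenever the classifier's confidence is low. The Python sketch below assumes integer author ids and bag-of-words feature matrices X_train and X_test (illustrative names):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    UNSEEN_ID = 9999  # illustrative id not used by any training author

    def predict_with_rejection(X_train, y_train, X_test, threshold=0.5):
        """Attribute each test passage to a known author id, or to UNSEEN_ID
        when the classifier's maximum predicted probability is below threshold."""
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_train, y_train)
        proba = clf.predict_proba(X_test)
        preds = clf.classes_[proba.argmax(axis=1)]
        return np.where(proba.max(axis=1) < threshold, UNSEEN_ID, preds)

The threshold controls the trade-off between the two tasks: a higher value rejects more passages as unseen authors at the cost of misrouting some known-author passages.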

Data Set:

  • Data Set: Download
  • Vocabulary
  • Data Set with stop words


Variable          Size            Description
train_txt         17153 x 10000   Training text passages represented by word ids
train_book        17153 x 1       Book ids corresponding to the training text passages
train_author      17153 x 1       Author ids corresponding to the training text passages
test_txt          12728 x 10000   Test text passages
test_book         12728 x 1       Book ids corresponding to the test text passages
shortened_vocab   10000 x 1       The vocabulary list, in the same order as the words appear in train_txt and test_txt

These are limited access data sets only available to challenge participants.
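The release format of these variables is not specified above. As one illustrative possibility, the minimal Python sketch below assumes they are shipped as a MATLAB-style .mat file named authorship_data.mat (a hypothetical file name):

    from scipy.io import loadmat

    data = loadmat('authorship_data.mat')          # hypothetical file name
    train_txt = data['train_txt']                  # 17153 x 10000 passages of word ids
    train_book = data['train_book'].ravel()        # book id of each training passage
    train_author = data['train_author'].ravel()    # author id of each training passage
    test_txt = data['test_txt']                    # 12728 x 10000 test passages
    test_book = data['test_book'].ravel()          # book id of each test passage
    print(train_txt.shape, test_txt.shape)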

 


Evaluation Criteria:

Starting with the week of March 27, each student will be allowed three submissions per week. Each submission will be evaluated by the instructor using the mean F1 measure, and the leaderboard will be updated with these scores at the end of each week.
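For reference, the sketch below shows one way such a score could be computed, assuming "mean F1" refers to the macro-averaged F1 over author ids (the exact averaging used for grading is not specified here):

    from sklearn.metrics import f1_score

    def mean_f1(y_true, y_pred):
        """Macro-averaged F1: F1 is computed per author id, then averaged."""
        return f1_score(y_true, y_pred, average='macro')

    # Example: score = mean_f1(test_labels, predicted_labels)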


Recommended Readings:

  • Baayen, Harald, Hans Van Halteren, and Fiona Tweedie. "Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution." Literary and Linguistic Computing 11.3 (1996): 121-132.

  • Juola, Patrick. "Authorship attribution." Foundations and Trends® in Information Retrieval 1.3 (2008): 233-334.

  • Stamatatos, Efstathios. "A survey of modern authorship attribution methods." Journal of the American Society for Information Science and Technology 60.3 (2009): 538-556.

  • Seroussi, Yanir, Fabian Bohnert, and Ingrid Zukerman. "Authorship attribution with author-aware topic models." Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2. Association for Computational Linguistics, 2012.

  • Bagnall, Douglas. "Author identification using multi-headed recurrent neural networks." arXiv preprint arXiv:1506.04891 (2015).

  • Macke, Stephen, and Jason Hirshman. "Deep Sentence-Level Authorship Attribution." https://cs224d.stanford.edu/reports/MackeStephen.pdf

  • Argamon, Shlomo, and Shlomo Levitan. "Measuring the usefulness of function words for authorship attribution." Proceedings of the ACH/ALLC Conference, Victoria, BC, Canada, June 2005.

  • Litvinova, Tatiana, et al. "Profiling a Set of Personality Traits of a Text’s Author: A Corpus-Based Approach." International Conference on Speech and Computer. Springer International Publishing, 2016.

Other Resources:

 

Proposal Guidelines


Timeline:

  • February 21: Training data set released
  • March 21, 2016: Proposals due
  • March 27 - May 5: Leaderboard open to 3 submissions per week.
  • April 25 and 27: Class presentations
  • May 5: Competition ends, reports are due.

Template for Reports:

  • Download the LaTeX template for project reports. See the generated PDF.
  • Download MiKTeX (the LaTeX compiler)
  • Download TeXnicCenter (a LaTeX editor)

When configuring TeXnicCenter, you need to choose MiKTeX as the compiler.


Leaderboard:

Week 1 (3/27 - 4/2)

Alias  Submission 1  Submission 2  Submission 3
unknown 0.5932 0.6182 0.4377
aragorn 0.5752 0.5751 0.5321
avi 0.5277 0.5327 0.6174

Week 2 (4/3 - 4/7)

Alias  Submission 1  Submission 2  Submission 3
avi 0.4503 0.6127 0.6668
at 0.6102 0.6052 0.6197
bs 0.5854 0.5857  
anonymous 0.5857    
aragorn 0.5321 0.3485 0.5430
unknown 0.5001 0.5236 0.4811

Week 3 (4/10 - 4/14)

Alias  Submission 1  Submission 2  Submission 3
avi 0.5832 0.5941 0.6216
bs 0.6306 0.6306  
mike 0.6306 0.6158 0.5779
at 0.0093 0.0096 0.0059
aragorn 0.6056 0.5421 0.6138
anonymous 0.0120 0.5746 0.5140
unknown 0.5779 0.5767 0.5855

Week 4 (4/17 - 4/21)

Alias  Submission 1  Submission 2  Submission 3
avi 0.6529 0.6691 0.4537
bs 0.6252    
mike 0.6109 0.6237 0.4797
at 0.0185 0.0 0.0119
aragorn 0.6076 0.5421 0.6138
anonymous 0.3679 0.3220  
unknown 0.5917 0.5954 0.5901

Week 5 (4/24 - 4/28)

Alias  Submission 1  Submission 2  Submission 3
avi 0.5482 0.6673 0.6839
bs 0.6185 0.6228 0.6215
mike 0.6180 0.4292 0.5497
at 0.5677 0.6146 0.0119
aragorn 0.6177 0.6040 0.5556
anonymous 0.6209 0.6336  
unknown 0.5144 0.5698 0.0159

Week 6 (5/1 - 5/5)

Alias  Submission 1  Submission 2  Submission 3
avi 0.5675    
bs 0.6767 0.7098 0.6937
mike 0.6832 0.6842 0.6637
at 0.6487 0.5982 0.6630
aragorn 0.6604 0.4174  
anonymous 0.6645 0.6703 0.6763
unknown 0.6692 0.6703 0.7387

Final Leaderboard

Alias  Best score (mean F1)
unknown 0.7387
bs 0.7098
mike 0.6842
avi 0.6839
anonymous 0.6763
at 0.6630
aragorn 0.6604