NLP Reading Group

Next meeting

Wednesday 2 May, 2-3pm in G30

The topic of discussion is "Document cluster Labelling"
Analysis of structural relationships for hierarchical cluster labeling (SIGIR 2010)
Markus Muhr, Roman Kern, Michael Granitzer

Moderator: Nikos Aletras

Future meetings

Papers from the topic "Document cluster Labelling"
Sequence Clustering and Labeling for Unsupervised Query Intent Discovery (WSDM 2012)
Jackie Chi Kit Cheung, Xiao Li

About

This year, the NLP reading group will focus on debating topics in NLP and on exploring work from related fields. We aim to spend 3-4 weeks on each topic, and we will usually read and discuss one paper each week.

The target audience is all members of the NLP group and anyone else who is interested.

Meetings take place weekly for one hour in G30, usually on Wednesdays from 3-4pm.

Meetings are informal. No preparation is required, except that the moderator should read the current paper and everyone else should have at least a brief overview of it.

Topics

At the start of the year, we will choose the topics that will be debated for the rest of that year.

Once voting is complete and the topics have been selected based on the votes, we will make a tentative schedule. When a topic starts, the person who suggested it should aim to moderate the first meeting; the other meetings are moderated on a rotational or voluntary basis. The voting results (sorted by number of votes) are:

  • 9 Applications of Web Scale N-gram models
  • 9 Topic model and topic Evaluation
  • 8 Regression with textual features
  • 8 Document cluster Labelling
  • 8 Topic models with temporal aspect
  • 6 Stock market prediction using Social Media
  • 5 LBSNs
  • 5 Authorship Attribution in the wild
  • 5 Graph-based methods for word semantic similarity
  • 5 Domain adaptation
  • 3 Human Mobility Patterns
  • 2 Accent and diacritics restoration

Between topics, we can have a week of demonstrating different programming languages/software/etc. So far, the proposals are:

LBSNs
Location Based Social Networks (e.g. Foursquare) are social networks in which users share their current location with their friends in a venue-oriented way. Quantitative studies and data mining have been performed using data from these systems, finding interesting associations and possible applications. The networks also have very rich textual information which has not been analysed so far.
Motivating paper: Exploring Millions of Footprints in Location Sharing Services - Z. Cheng, J. Caverlee, K. Lee (2011)

Human Mobility Patterns
The study of human mobility by different methods (e.g. GPS tracking, banknote movement) discovers laws that underlie our movement and that can predict our future behaviour.
Motivating paper: Understanding individual human mobility patterns - M. C. González, C. A. Hidalgo, A.-L. Barabási (2008)

Authorship Attribution in the wild
The standard authorship attribution task in settings where the set of candidate authors cannot be reduced or where the text is short or noisy (e.g. blogs or social networks).
Motivating paper: Authorship Attribution in the wild - M. Koppel, J. Schler and S. Argamon (2011)

Applications of Web Scale N-gram models
Various NLP applications (e.g. paraphrase acceptability, word sense disambiguation) use the Google N-gram corpus in different ways.
Motivating paper: Web-scale N-gram models for lexical disambiguation - S. Bergsma, D. Lin, R. Goebel (2009)

List of relevant papers:

Domain adaptation
A domain adaptation technique with many NLP applications (e.g. sentiment classification, POS tagging) proposed for situations when there are no labelled target domain instances available, but plentiful unlabelled data in both source and target domains.
Motivating paper: Domain Adaptation with Structural Correspondence Learning - John Blitzer, Ryan McDonald, and Fernando Pereira (2006)

Topic model and topic Evaluation

List of relevant papers:

Document cluster Labelling

Graph-based methods for word semantic similarity

Accent and diacritics restoration
Most languages have additional special characters in their script that are derived from the standard Latin characters, but in most online documents these are written as the plain Latin characters. Different techniques for disambiguating between these letters have been proposed so far, but the problem has not yet been solved.
Motivating paper: Letter Level Learning for Language Independent Diacritics Restoration - Rada Mihalcea and Vivi Nastase (2002)
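
As a rough illustration of the task, here is a minimal Python sketch of a word-level baseline: it learns, for each plain form, the most frequent diacritised form seen in training. This is an assumption made for illustration and is not the letter-level method of the motivating paper. The function names and the toy Romanian word list are hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(text):
    """Remove combining marks, leaving the base Latin letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def train(words):
    """Map each plain form to its most frequent form in the training words."""
    counts = defaultdict(Counter)
    for word in words:
        counts[strip_diacritics(word)][word] += 1
    return {plain: forms.most_common(1)[0][0] for plain, forms in counts.items()}


def restore(text, table):
    """Replace each whitespace-separated word by its learned form, if any."""
    return " ".join(table.get(word, word) for word in text.split())


# Hypothetical toy training data; a real system would need a large corpus.
training_words = ["și", "și", "si", "în", "țară", "țară"]
table = train(training_words)
print(restore("si in tara", table))
```

A baseline like this cannot handle words that were never seen in training, which is one reason letter-level approaches are of interest.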

Stock market prediction using Social Media
Social Media reflects the immediate public mood. Stock markets are influenced by this public mood and events and respond according to it. Can stock market evolution be accurately predicted from social media text?
Motivating paper: Twitter mood predicts the stock market - J Bollen, H Mao (2011)

Regression with textual features
Regression is performed when you want to predict the outcome of a (continuous) response variable from some features. Here, textual features are considered and we'll try to predict responses based on text.
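
To make the idea concrete, here is a minimal Python sketch that represents each text as bag-of-words counts and fits a linear model by gradient descent on squared error. The toy documents, scores, vocabulary and function names are all hypothetical, and the papers in this topic use more elaborate features and regularised models.

```python
from collections import Counter


def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def fit_linear(X, y, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on mean squared error."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(text, vocab, w, b):
    """Predict a continuous response for a new text."""
    x = bag_of_words(text, vocab)
    return sum(wj * xj for wj, xj in zip(w, x)) + b


# Hypothetical toy data: texts paired with a continuous score.
vocab = ["good", "bad", "film"]
docs = ["good good film", "bad film", "good film", "bad bad film"]
scores = [2.0, -1.0, 1.0, -2.0]

X = [bag_of_words(d, vocab) for d in docs]
w, b = fit_linear(X, scores)
print(round(predict("good good good film", vocab, w, b), 2))
```

With real text the vocabulary is far larger than the number of documents, so some form of regularisation (e.g. the Lasso) is usually needed.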

List of relevant papers:

Topic models with temporal aspect
Topic models, of which the most popular is LDA (Latent Dirichlet Allocation), are probabilistic models for uncovering the underlying semantic structure of document collections and have been successfully applied to many NLP tasks. Models based on these that also take into consideration the time element in document collections are considered here.
Motivating paper: Topics over time: a non-Markov continuous-time model of topical trends - Xuerui Wang and Andrew McCallum (2006)

Feedback

For topic suggestions or questions about organisation, contact Daniel Preotiuc.

Other Reading Groups

NLP Reading Group - Johns Hopkins University

Machine Learning Reading Group - Johns Hopkins University

Machine Learning Tea - Johns Hopkins University

NLP Reading Group - University of Southern California

NLP Reading Group - Stanford University

Machine Learning Tea - Berkeley

NLP Reading Group 2012
Past Meetings

Wednesday 18 April, 3-4pm in G22

The topic of discussion is "Document cluster Labelling"
Enhancing cluster labeling using wikipedia (SIGIR 2009)
David Carmel, Haggai Roitman, Naama Zwerdling

Moderator: Nikos Aletras

Wednesday 4 April, 3-4pm in G22

The topic of discussion is "Regression using textual features"
Predicting a Scientific Community's Response to an Article (EMNLP 2011)
Dani Yogatama, Michael Heilman, Brendan O'Connor, Chris Dyer, Bryan Routledge, Noah Smith

Related reading:
Predicting Risk from Financial Reports with Regression (2009)
Movie Reviews and Revenues: An Experiment in Text Regression (2010)
Moderator: Trevor Cohn

Wednesday 28 March, 3-4pm in G30

The topic of discussion is "Regression using textual features"
Tracking the flu epidemic by monitoring the Social Web
Vasileios Lampos, Nello Cristianini (2010)
Interactive demos:
Flu Detector
Mood of the Nation
Further reading:
A simple explanation of the Lasso
Towards detecting influenza epidemics by analyzing Twitter messages (2010)
Flu detector - Tracking epidemics on Twitter (2010)
Nowcasting Events from the Social Web with Statistical Learning (2011)
Effects of the Recession on Public Mood in the UK (2012)
Moderator: Daniel Preotiuc

Wednesday 21 March, 12-1pm in G30

The topic of discussion is "Regression using textual features"
From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series
B.O'Connor, R.Balasubramanyan, B.Routledge, N.Smith (ICWSM 2010)
Moderator: Jing Li

Wednesday 14 March, 3-4pm in G30

The topic of discussion is "Topic model evaluation"
Improving Topic Evaluation Using Conceptual Knowledge
C.C.Musat, J.Velcin, S.Trausan-Matu, M-A Rizoiu (IJCAI 2011)
Starting discussion on Optimizing semantic coherence in topic models
D.Mimno, H.Wallach, E.Talley, M.Leenders, A.McCallum (EMNLP 2011)
Moderator: Nikos Aletras [Slides]

Wednesday 7 March, 3-4pm in G30

The topic of discussion is "Topic model evaluation"
Automatic evaluation of topic coherence
D. Newman, J.H. Lau, K. Grieser, T. Baldwin (NAACL 2010)
Moderator: Elisabeth Cano [Slides]

Wednesday 22 February, 3-4pm in G30

The topic of discussion is "Topic model evaluation"
Reading tea leaves: How humans interpret topic models
J. Chang, J. Boyd-Graber, S. Gerrish, C. Wang, D. Blei (2009)
Discuss papers for the next meetings from this list
Background reading:
Probabilistic Topic Models (Review of topic modeling and applications by D. Blei)
LDA
Finding Scientific Topics
pLSA
Tutorial on LDA
Tutorial on pLSA
Topic Modeling Bibliography by D. Mimno
Moderator: Nikos Aletras [Slides]

Thursday 16 February, 1-2pm in G30
First training meeting on a software/toolkit/etc. and planning for the next topic (Topic model and topic Evaluation)
MALLET - MAchine Learning for LanguagE Toolkit [Tutorial Slides]
Moderator: Andrea Varga

Wednesday 8 February, 11-12 in G30

The topic of discussion is "Applications of Web Scale N-gram models"

Creating Robust Supervised Classifiers via Web-Scale N-gram Data - S. Bergsma (2010) [Slides]
Moderator: Douwe Gelling

Wednesday 1 February, 3-4pm in G30

The topic of discussion is "Applications of Web Scale N-gram models"

New Tools for Web-Scale N-grams (2010)
An Overview of Microsoft Web N-gram Corpus and Applications (2010)
Moderator: Mark Hepple

Wednesday 25 January, 2-3pm in G30

The topic of discussion is "Applications of Web Scale N-gram models"

Web-Scale N-gram Models for Lexical Disambiguation - S.Bergsma (2009) [Slides]
Moderator: Daniel Preotiuc

Wednesday, 18 January 2012

  1. introduction to the Google N-gram corpora
    Web 1T 5-gram Version 1
    Web 1T 5-gram 10 European Languages Version 1
    - you can get the corpora from /share/nlp/data/corpora/Web_1T_5-gram and /share/nlp/data/corpora/Web_1T_5-gram_Euro on the network
    - an interactive tool: http://books.google.com/ngrams/
    - Microsoft Web N-gram Services
    - Yahoo N-grams (L1)
  2. schedule for the next meetings (if you have paper preferences, please bring them with you)
  3. discussion on Managing the Google Web 1T 5-gram Data Set - A. Islam, D. Inkpen (2009)

NLP Reading Group 2010-2011

Tuesday, 3 May 2011, 13:00, room G30
Hierarchical Bayesian Domain Adaptation
J.R. Finkel, C. Manning
NAACL 2009
A useful paper to read for reference and background: Frustratingly Easy Domain Adaptation.
Slides are also available to download here.

12 April 2011
Bayesian Multitask Learning with Latent Hierarchies
H. Daume III
UAI 2009
A useful paper for understanding the coalescent: Bayesian agglomerative clustering with coalescents

1 March 2011
Nonparametric Word Segmentation for Machine Translation
T. Nguyen, S. Vogel, N. Smith
COLING 2010

8 February 2011
Modeling Information Diffusion in Implicit Networks
J. Yang, J. Leskovec
ICDM 2010 Best application paper award

25 January 2011
Coreference Resolution in a Modular, Entity-Centered Model
Aria Haghighi and Dan Klein
NAACL 2010

30 November 2010
Extracting Social Networks from Literary Fiction
David Elson, Nicholas Dames and Kathleen McKeown
ACL 2010

16 November 2010
Beyond NomBank: A Study of Implicit Arguments for Nominal Predicates
Matthew Gerber and Joyce Chai
ACL 2010