Refreshments 3:20 p.m.
We are facing enormous growth in the amount of information available from various data
resources. This growth is even more notable for text data: the number of pages on the
internet, for example, is expected to double every five years, and billions of multilingual
webpages are already available.
In order to make use of this textual data in natural language understanding systems, we need
to rely on text analysis that structures this information. Natural language parsing, a
fundamental problem in NLP, is one such example: it provides basic structure to text by
representing its syntax computationally. This structure is used in most NLP applications that
analyze language to understand meaning.
I will discuss three important facets of modeling syntax: (a) accuracy of learning; (b)
efficiency of parsing unseen sentences; and (c) selection of data to learn from. The common
theme tying these three facets together is learning from incomplete data.
To model syntax more effectively, I will first describe a model called latent-variable
probabilistic context-free grammars (L-PCFGs) which, because of the hardness of learning from
incomplete data, has until recently been learned only in tandem with many heuristics and
approximations. I will present a more principled and statistically consistent approach to
learning L-PCFGs using spectral algorithms, and will also show how L-PCFGs can parse unseen
sentences much more efficiently through the use of tensor decomposition.
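To give a flavor of why tensor decomposition speeds up parsing, the sketch below shows the core contraction in an L-PCFG inside computation. This is an illustrative example only, not the talk's actual algorithm: the sizes, factor matrices, and rank are made up, and a real L-PCFG would use estimated parameters rather than random ones.

```python
import numpy as np

# Hypothetical illustration: a binary rule A -> B C in an L-PCFG carries a
# tensor T[h, h1, h2] over latent states. The inside step combines the
# children's inside vectors b1, b2 into a parent vector:
#   parent[h] = sum over h1, h2 of T[h, h1, h2] * b1[h1] * b2[h2]
rng = np.random.default_rng(0)
m, r = 8, 3  # number of latent states and decomposition rank (made-up sizes)

# A rank-r CP decomposition: T = sum_k outer(U[:,k], V[:,k], W[:,k])
U, V, W = rng.random((m, r)), rng.random((m, r)), rng.random((m, r))
T = np.einsum('ak,bk,ck->abc', U, V, W)

b1, b2 = rng.random(m), rng.random(m)

# Direct contraction with the full tensor: O(m^3) work per rule application.
parent_direct = np.einsum('abc,b,c->a', T, b1, b2)

# Using the factors instead: contract each child with its factor matrix
# first, then combine -- O(m * r) work, same result.
parent_fast = U @ ((V.T @ b1) * (W.T @ b2))

assert np.allclose(parent_direct, parent_fast)
```

The assertion confirms that the factored computation reproduces the direct one exactly; the saving comes from never materializing the full m-by-m-by-m tensor during parsing.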
In addition, I will touch on work on unsupervised language learning, one of the holy grails
of NLP, in the Bayesian setting. In this setting, priors are used to guide the learner,
compensating for the lack of labeled data. I will survey novel priors that were developed for
this setting, and mention how they can be used monolingually and multilingually.
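As a minimal sketch of how a prior compensates for missing labels, the example below uses a simple symmetric Dirichlet prior over the expansion probabilities of one nonterminal. The rule names, counts, and hyperparameter are invented for illustration and are not the novel priors discussed in the talk.

```python
import numpy as np

# Hypothetical example: expected counts for three expansions of NP,
# gathered from unlabeled data. One rule was never observed.
rules = ['NP -> Det N', 'NP -> N', 'NP -> NP PP']
counts = np.array([3.0, 1.0, 0.0])
alpha = 0.5  # symmetric Dirichlet hyperparameter (illustrative value)

# Maximum likelihood assigns zero probability to the unseen rule...
mle = counts / counts.sum()

# ...while the posterior mean under a Dirichlet(alpha) prior keeps every
# rule alive, smoothing the sparse estimates toward uniform.
posterior_mean = (counts + alpha) / (counts.sum() + alpha * len(counts))

assert mle[2] == 0.0 and posterior_mean[2] > 0.0
```

The unseen rule gets zero probability under maximum likelihood but a small positive probability under the posterior mean, which is the basic sense in which a prior "guides the learner" when labeled data is absent.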
Shay Cohen is a postdoctoral research scientist in the Department of Computer Science at
Columbia University. He holds a CRA Computing Innovation Fellowship. He received his B.Sc.
and M.Sc. from Tel Aviv University in 2000 and 2004, and his Ph.D. from Carnegie Mellon
University in 2011. His research interests span a range of topics in natural language
processing and machine learning, with a focus on structured prediction. He is especially
interested in developing efficient and scalable parsing algorithms as well as learning
algorithms for probabilistic grammars.