Project Showcase
-
Discourse Analysis and Parsing
Distantly- and Self-Supervised Approaches to Infer Discourse
In our recent line of distantly- and self-supervised approaches for RST-style discourse parsing, we aim to generate robust silver-standard discourse trees informed by related downstream tasks.
Discourse from Sentiment Analysis
MEGA-DT Discourse Annotation Pipeline
In our EMNLP 2019 and MEGA-DT (published at EMNLP 2020) papers, we propose combining a deep multiple-instance learning model (MILNet) with the traditional CKY algorithm to generate nuclearity-attributed discourse structures for large-scale sentiment-annotated corpora. We show that while the resulting silver-standard discourse trees cannot outperform in-domain supervised discourse parsers, they capture highly robust structures that generalize well across domains, reaching the best inter-domain discourse parsing performance to date. Our generated silver-standard discourse treebank, containing over 250,000 complete discourse trees in the review domain, can be downloaded here.
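As a rough illustration of the tree-building step, the sketch below runs a CKY-style dynamic program that assembles the highest-scoring binary tree over a document's EDUs. The scoring function `span_score` is a hypothetical stand-in for the MILNet-derived sentiment signals; this is a minimal sketch of the general technique, not the actual MEGA-DT pipeline.

```python
from typing import Callable

def cky_discourse_tree(n_edus: int, span_score: Callable[[int, int, int], float]):
    """Build the highest-scoring binary tree over EDUs 0..n_edus-1.
    span_score(i, k, j) scores merging subtrees over [i, k] and [k+1, j];
    in MEGA-DT this signal would come from MILNet (an assumption here)."""
    best, back = {}, {}
    for i in range(n_edus):
        best[(i, i)] = 0.0
    for width in range(2, n_edus + 1):
        for i in range(n_edus - width + 1):
            j = i + width - 1
            k, score = max(
                ((k, best[(i, k)] + best[(k + 1, j)] + span_score(i, k, j))
                 for k in range(i, j)),
                key=lambda t: t[1],
            )
            best[(i, j)], back[(i, j)] = score, k

    def build(i, j):
        if i == j:
            return ("EDU", i)
        k = back[(i, j)]
        # Nuclearity could be attached at each merge, e.g. based on the
        # relative importance of the two subtrees (an assumption, not
        # MEGA-DT's exact heuristic).
        return (build(i, k), build(k + 1, j))

    return build(0, n_edus - 1)

# Toy usage: uniform scores yield one of the equally scored binary trees.
print(cky_discourse_tree(4, lambda i, k, j: 1.0))
```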
Discourse from Topic Segmentation
Topic Segmentation to Infer High-Level Discourse Structures
Improving on our work that uses sentiment-augmented data to infer discourse structures, we target high-level (above-sentence) discourse structures in our AAAI 2022 work on Predicting Above-Sentence Discourse Structure using Distant Supervision from Topic Segmentation. We exploit our top-performing neural topic segmentation model, presented in this paper, to greedily segment documents. The generated high-level (binary) discourse structures align well with gold-standard discourse annotations, an important factor for many downstream tasks that implicitly or explicitly convert constituency trees into dependency representations.
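To make the greedy procedure concrete, the sketch below recursively splits a span of sentences at the position a segmenter considers the strongest topical boundary, yielding a binary above-sentence structure. The `boundary_prob` array of per-gap boundary probabilities is a hypothetical input; this illustrates the general idea, not the paper's exact procedure.

```python
def greedy_tree(start: int, end: int, boundary_prob: list[float]):
    """Recursively split sentences [start, end] at the most probable
    topic boundary. boundary_prob[i] is a (hypothetical) probability
    of a boundary between sentence i and i+1."""
    if start == end:                  # a single sentence is a leaf
        return ("SENT", start)
    split = max(range(start, end), key=lambda i: boundary_prob[i])
    return (greedy_tree(start, split, boundary_prob),
            greedy_tree(split + 1, end, boundary_prob))

# Example: five sentences with a clear topic break after sentence 2.
print(greedy_tree(0, 4, [0.1, 0.2, 0.9, 0.3]))
```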
Discourse from Summarization
Discourse Inference from Transformer Self-Attention Matrices
Extending our work on distantly supervised discourse parsing, we explore the auxiliary task of summarization, focusing especially on the nuclearity attribute, which has previously been shown to carry important information for summarization-related tasks. In our NAACL 2021 paper, we show that discourse (dependency) structures can be reasonably inferred by applying the CKY and Eisner algorithms to transformer self-attention matrices, marking an important first step in examining state-of-the-art NLP models for their alignment with discourse information.
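To make the dependency side concrete, the sketch below runs the standard Eisner (1996) dynamic program over an aggregated attention matrix, treating attention weights as proxy arc scores. How per-head, per-layer attention is aggregated into `scores` is an assumption here; the paper's actual extraction and aggregation choices differ.

```python
import numpy as np

def eisner(scores: np.ndarray) -> list[int]:
    """Projective dependency decoding over an (n+1)x(n+1) arc-score
    matrix, where index 0 is an artificial root and scores[h, m] is
    the score of the arc head h -> modifier m (here assumed to come
    from aggregated self-attention)."""
    n = scores.shape[0] - 1
    C = np.zeros((n + 1, n + 1, 2))  # complete spans (0: head right, 1: head left)
    I = np.zeros((n + 1, n + 1, 2))  # incomplete spans
    bc = np.zeros((n + 1, n + 1, 2), dtype=int)  # backpointers
    bi = np.zeros((n + 1, n + 1, 2), dtype=int)
    for w in range(1, n + 1):
        for s in range(n + 1 - w):
            t = s + w
            # Attach an arc between s and t on top of two complete halves.
            r = max(range(s, t), key=lambda r: C[s, r, 1] + C[r + 1, t, 0])
            val = C[s, r, 1] + C[r + 1, t, 0]
            I[s, t, 0], bi[s, t, 0] = val + scores[t, s], r  # arc t -> s
            I[s, t, 1], bi[s, t, 1] = val + scores[s, t], r  # arc s -> t
            # Extend incomplete spans into complete ones.
            r0 = max(range(s, t), key=lambda r: C[s, r, 0] + I[r, t, 0])
            C[s, t, 0], bc[s, t, 0] = C[s, r0, 0] + I[r0, t, 0], r0
            r1 = max(range(s + 1, t + 1), key=lambda r: I[s, r, 1] + C[r, t, 1])
            C[s, t, 1], bc[s, t, 1] = I[s, r1, 1] + C[r1, t, 1], r1

    heads = [0] * (n + 1)

    def backtrack(s, t, d, complete):
        if s == t:
            return
        if complete:
            r = bc[s, t, d]
            if d == 0:
                backtrack(s, r, 0, True); backtrack(r, t, 0, False)
            else:
                backtrack(s, r, 1, False); backtrack(r, t, 1, True)
        else:
            heads[s if d == 0 else t] = t if d == 0 else s
            r = bi[s, t, d]
            backtrack(s, r, 1, True); backtrack(r + 1, t, 0, True)

    backtrack(0, n, 1, True)
    return heads[1:]  # predicted head of each unit (0 = root)
```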
Discourse from Tree-style Autoencoders
Unsupervised Tree Auto-Encoder
In our AAAI 2021 paper on Unsupervised Learning of Discourse Structures using a Tree Autoencoder, we aim to generate discourse structures (without nuclearity and relation labels) from the task of tree-style language modelling. In contrast to many modern approaches that treat language modelling as a sequential problem, we explicitly generate discrete tree structures during training and inference. We show that tree structures learned purely from existing large-scale datasets can reasonably align with discourse and also support important downstream tasks (here: sentiment analysis). While the performance is nowhere near supervised (or distantly supervised) models, we offer first insights into the value of generating tree-enabled structures for language modelling, potentially valuable for future research.
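At its core, a tree autoencoder composes leaf embeddings bottom-up into a single root vector and then decomposes that vector top-down to reconstruct the leaves. The sketch below shows this encode/decode cycle for a fixed binary tree; all module names and the MSE objective are illustrative assumptions, and the structure-induction step that makes the model interesting is omitted.

```python
import torch
import torch.nn as nn

class TinyTreeAutoEncoder(nn.Module):
    """Illustrative compose/decompose cycle over a *given* binary tree.
    The actual model additionally induces the tree structure itself."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.compose = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())
        self.decompose = nn.Sequential(nn.Linear(dim, 2 * dim), nn.Tanh())
        self.dim = dim

    def encode(self, tree, leaves):
        if isinstance(tree, int):             # leaf: index into leaf embeddings
            return leaves[tree]
        left, right = (self.encode(t, leaves) for t in tree)
        return self.compose(torch.cat([left, right], dim=-1))

    def decode(self, tree, vec, out):
        if isinstance(tree, int):
            out[tree] = vec                   # reconstructed leaf embedding
            return
        halves = self.decompose(vec)
        self.decode(tree[0], halves[..., :self.dim], out)
        self.decode(tree[1], halves[..., self.dim:], out)

    def forward(self, tree, leaves):
        out = [None] * len(leaves)
        self.decode(tree, self.encode(tree, leaves), out)
        return torch.stack(out)

# Reconstruction loss on a toy tree ((0, 1), 2) over three leaf vectors.
model = TinyTreeAutoEncoder()
leaves = torch.randn(3, 64)
loss = nn.functional.mse_loss(model(((0, 1), 2), leaves), leaves)
```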
W-RST: A Weighted Extension of Discourse Theories
The W-RST Framework Bridging the Gap between Linguistics and NLP
In a first attempt to bridge the ever-growing gap between (Computational) Linguistics and Natural Language Processing, we propose the Weighted-RST (W-RST) framework at ACL 2021. In this line of work, we explore the use of readily available real-valued scores in distantly supervised discourse models, namely MEGA-DT and our NAACL 2021 paper, to generate more fine-grained importance scores between sibling sub-trees (i.e., the RST nuclearity attribute). In our experiments, we show that the weighted RST trees are superior to discourse structures with binary nuclearity attributes for most thresholds, and further align well with human annotations.
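The relationship between real-valued weights and classic nuclearity can be pictured as a simple thresholding rule: a sibling weight near 0.5 reads as multi-nuclear, while skewed weights designate the nucleus. The rule and the 0.1 band below are illustrative assumptions, not the framework's definition.

```python
# Illustrative mapping from a W-RST sibling weight to classic RST
# nuclearity. `w` is the importance weight of the LEFT subtree
# (the right subtree gets 1 - w); the 0.1 band is an assumed threshold.
def nuclearity(w: float, band: float = 0.1) -> str:
    if abs(w - 0.5) <= band:
        return "Nucleus-Nucleus"     # roughly equal importance
    return "Nucleus-Satellite" if w > 0.5 else "Satellite-Nucleus"

assert nuclearity(0.55) == "Nucleus-Nucleus"
assert nuclearity(0.9) == "Nucleus-Satellite"
```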
Supervised Discourse Parsers
Our lab has further contributed some of the top-performing, fully supervised discourse parsers to date. The CODRA discourse parser reached state-of-the-art performance at the time using an optimal parsing algorithm with two Conditional Random Fields for intra-sentential and multi-sentential parsing.
More recently, our neural discourse parser presented at CODI 2020, based on the shift-reduce paradigm and using SpanBERT together with an auxiliary coreference module, reached state-of-the-art performance for RST-style discourse parsing on RST-DT.
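The shift-reduce paradigm itself is easy to sketch: the parser maintains a stack of partial subtrees and a queue of EDUs, repeatedly choosing to shift the next EDU or reduce the top two subtrees. The `choose_action` function below is a stub; the actual parser scores actions with SpanBERT features and a coreference signal.

```python
# Shift-reduce skeleton for discourse parsing. `choose_action` is a stub
# standing in for the neural classifier (SpanBERT + coreference features
# in the actual parser).
def choose_action(stack, queue) -> str:
    if len(stack) < 2:
        return "SHIFT"
    if not queue:
        return "REDUCE"
    return "SHIFT"  # stub policy: a trained model decides here

def shift_reduce_parse(edus: list[str]):
    stack, queue = [], list(edus)
    while queue or len(stack) > 1:
        if choose_action(stack, queue) == "SHIFT" and queue:
            stack.append(queue.pop(0))          # shift the next EDU
        else:
            right, left = stack.pop(), stack.pop()
            stack.append((left, right))         # reduce the top two subtrees
    return stack[0]                             # full binary discourse tree

print(shift_reduce_parse(["e1", "e2", "e3"]))
```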
-
ConVis
Visual Text Analytics of Conversations
System Overview
We have developed visual text analytics systems that tightly integrate interactive visualization with novel text mining and summarization techniques to fulfill the information needs of users exploring conversations (e.g., ConVis, MultiConVis, ConVIScope). In this context, we have also investigated techniques for interactive (human-in-the-loop) topic modeling. Check out our latest paper.
-
Extractive Summarization of Long Documents by Combining Global and Local Context
Check out our work on extractive summarization for long documents!
System Overview
We propose a novel neural single-document extractive summarization model for long documents, incorporating both the global context of the whole document and the local context within the current topic.
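A minimal sketch of the idea: each sentence is scored from its own representation combined with a global document vector and a local topic-segment vector. The encoder and the simple mean-pooling and concatenation choices below are assumptions for illustration; the actual model is described in the paper.

```python
import torch
import torch.nn as nn

class GlobalLocalScorer(nn.Module):
    """Illustrative sentence scorer combining sentence, document (global),
    and topic-segment (local) representations; pooling choices are assumed."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Linear(3 * dim, 1)

    def forward(self, sent_vecs: torch.Tensor, topic_ids: torch.Tensor):
        # sent_vecs: (n_sents, dim); topic_ids: (n_sents,) segment index per sentence
        doc = sent_vecs.mean(dim=0, keepdim=True).expand_as(sent_vecs)   # global context
        local = torch.stack([
            sent_vecs[topic_ids == topic_ids[i]].mean(dim=0)             # local context
            for i in range(sent_vecs.size(0))
        ])
        return self.score(torch.cat([sent_vecs, doc, local], dim=-1)).squeeze(-1)

# Score 5 sentences split into two topic segments; top-scored ones form the summary.
scores = GlobalLocalScorer()(torch.randn(5, 128), torch.tensor([0, 0, 0, 1, 1]))
```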