Project Showcase

  • Discourse Analysis and Parsing
    Distantly- and Self-Supervised Approaches to Infer Discourse

    In our recent line of distantly- and self-supervised approaches for RST-style discourse parsing, we aim to generate robust silver-standard discourse trees informed by related downstream tasks.

    Discourse from Sentiment Analysis
    MEGA-DT Discourse Annotation Pipeline

    In our EMNLP 2019 and MEGA-DT (published at EMNLP 2020) papers, we propose combining a deep multiple-instance learning model (MILNet) with the traditional CKY algorithm to generate nuclearity-attributed discourse structures for large-scale sentiment-annotated corpora. We show that while these silver-standard discourse trees cannot outperform in-domain supervised discourse parsers, they do capture highly robust structures that generalize well between domains, reaching the best inter-domain discourse parsing performance to date. Our generated silver-standard discourse treebank, containing over 250,000 complete discourse trees in the review domain, can be downloaded here.
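    The core of this pipeline can be sketched as a CKY-style dynamic program over EDU-level importance scores. Note that the scores, the aggregation function, and the nuclearity heuristic below are simplified stand-ins for the actual MILNet outputs and beam-search procedure of MEGA-DT:

    ```python
    def cky_discourse_tree(edu_scores):
        """Build a binary, nuclearity-attributed tree over EDUs 0..n-1.

        edu_scores: per-EDU importance scores (stand-in for MILNet output).
        Returns (score, tree), where internal nodes are tuples
        (left, right, nuclearity) and leaves are EDU indices.
        """
        n = len(edu_scores)
        # best[(i, j)] holds the best (score, tree) for the span [i, j)
        best = {(i, i + 1): (edu_scores[i], i) for i in range(n)}
        for length in range(2, n + 1):
            for i in range(n - length + 1):
                j = i + length
                candidates = []
                for k in range(i + 1, j):  # try every split point
                    ls, lt = best[(i, k)]
                    rs, rt = best[(k, j)]
                    # Assumed aggregation: the nucleus dominates, the
                    # satellite contributes with a discount.
                    score = max(ls, rs) + 0.5 * min(ls, rs)
                    nuclearity = "NS" if ls >= rs else "SN"
                    candidates.append((score, (lt, rt, nuclearity)))
                best[(i, j)] = max(candidates, key=lambda c: c[0])
        return best[(0, n)]
    ```

    For example, `cky_discourse_tree([0.9, 0.1, 0.4, 0.6])` builds a left-branching tree in which the dominant first EDU stays the nucleus at every level.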

    Discourse from Topic Segmentation
    Topic Segmentation to Infer High Level Discourse Structures

    Improving on our work using sentiment-augmented data to infer discourse structures, we target high-level (above-sentence) discourse structures in our AAAI 2022 work on Predicting Above-Sentence Discourse Structure using Distant Supervision from Topic Segmentation. We exploit our top-performing neural topic segmentation model presented in this paper to greedily segment documents, showing that the generated high-level (binary) discourse structures align well with gold-standard discourse annotations. This is an important factor for many downstream tasks that implicitly or explicitly convert constituency trees into dependency representations.
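    The greedy construction of above-sentence structures can be sketched as recursive splitting at the strongest predicted topic boundary. The boundary scores below are a stand-in for the output of the neural topic segmenter:

    ```python
    def greedy_high_level_tree(boundary_scores, lo=0, hi=None):
        """Recursively split the sentence span [lo, hi) at the
        strongest topic boundary.

        boundary_scores[i] is an (assumed) topic-shift score between
        sentence i and i+1. Leaves are sentence indices; internal
        nodes are (left, right) pairs, yielding a binary tree.
        """
        if hi is None:
            hi = len(boundary_scores) + 1  # n-1 boundaries -> n sentences
        if hi - lo == 1:
            return lo
        # Split where the predicted topic shift is strongest.
        k = max(range(lo, hi - 1), key=lambda i: boundary_scores[i]) + 1
        return (greedy_high_level_tree(boundary_scores, lo, k),
                greedy_high_level_tree(boundary_scores, k, hi))
    ```

    With scores `[0.2, 0.9, 0.1]` over four sentences, the strongest boundary (between sentences 1 and 2) becomes the top split, producing `((0, 1), (2, 3))`.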

    Discourse from Summarization
    Discourse Inference from Transformer Self-Attention Matrices

    Extending our work on distantly-supervised discourse parsing, we explore the auxiliary task of summarization, especially focusing on the nuclearity attribute, which has previously been shown to contain important information for summarization-related tasks. In our NAACL 2021 paper, we show that discourse (dependency) structures can be reasonably inferred by applying the CKY and Eisner algorithms to transformer self-attention matrices, marking an important first step in examining state-of-the-art NLP models for their alignment with discourse information.
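    As a rough illustration, a constituency-style tree can be read off a self-attention matrix with a CKY-style dynamic program. The mean-attention split score below is a simplified stand-in for the scoring used in the paper, and the matrix is assumed to be already averaged over heads and layers:

    ```python
    def tree_from_attention(attn):
        """Extract a binary tree over discourse units from an
        attention matrix via CKY-style dynamic programming.

        attn[i][j]: assumed head-averaged attention between units
        i and j. Returns (score, tree) with leaves as unit indices.
        """
        n = len(attn)
        best = {(i, i + 1): (0.0, i) for i in range(n)}
        for length in range(2, n + 1):
            for i in range(n - length + 1):
                j = i + length
                def split_score(k):
                    # Mean symmetric attention between the two children
                    # being merged (a simplified coherence signal).
                    left, right = range(i, k), range(k, j)
                    link = sum(attn[a][b] + attn[b][a]
                               for a in left for b in right)
                    return link / (len(left) * len(right))
                candidates = []
                for k in range(i + 1, j):
                    s = best[(i, k)][0] + best[(k, j)][0] + split_score(k)
                    candidates.append((s, (best[(i, k)][1], best[(k, j)][1])))
                best[(i, j)] = max(candidates, key=lambda c: c[0])
        return best[(0, n)]
    ```

    On a toy matrix where units 0/1 and 2/3 attend strongly to each other, the program recovers the intuitive grouping `((0, 1), (2, 3))`.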

    Discourse from Tree-style Autoencoders
    Unsupervised Tree Auto-Encoder

    In our AAAI 2021 paper on Unsupervised Learning of Discourse Structures using a Tree Autoencoder, we aim to generate discourse structures (without nuclearity and relation labels) from the task of tree-style language modelling. In contrast to many modern approaches interpreting language modelling as a sequential problem, we explicitly generate discrete tree structures during training and inference. We show that tree structures learned purely from existing large-scale datasets can reasonably align with discourse and also support important downstream tasks (here: sentiment analysis). While the performance is nowhere close to supervised (or distantly-supervised) models, we provide first insights into the value of generating tree-structured representations for language modelling, potentially valuable for future research.

    W-RST: A Weighted Extension of Discourse Theories
    The W-RST framework bridging the gap between Linguistics and NLP

    In a first attempt to bridge the ever-growing gap between (Computational) Linguistics and Natural Language Processing, we propose the Weighted-RST (W-RST) framework at ACL 2021. In this line of work, we explore the use of readily available real-valued scores in distantly supervised discourse models, namely those of MEGA-DT and our NAACL 2021 paper, to generate more fine-grained importance scores between sibling sub-trees (i.e., the RST nuclearity attribute). In our experiments, we show that the weighted RST trees are superior to discourse structures with binary nuclearity attributes for most thresholds, and further align well with human annotations.

    Supervised Discourse Parsers

    Our lab has further contributed some of the top-performing, fully supervised discourse parsers to date. The CODRA discourse parser reached state-of-the-art performance at the time, using an optimal parsing algorithm with two Conditional Random Fields for intra-sentential and multi-sentential parsing.

    More recently, our neural discourse parser presented at CODI 2020, based on the shift-reduce paradigm and combining SpanBERT with an auxiliary coreference module, reached state-of-the-art performance for RST-style discourse parsing on RST-DT.
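    The shift-reduce control loop at the heart of such a parser can be sketched as follows. Here, `predict_action` is a caller-supplied stand-in for the learned (SpanBERT-based) classifier, and nuclearity/relation labeling is omitted:

    ```python
    def shift_reduce_parse(edus, predict_action):
        """Skeleton of shift-reduce discourse parsing.

        edus: list of EDU identifiers.
        predict_action: callable (stack, queue) -> "shift" | "reduce",
        standing in for a trained action classifier.
        Returns a binary tree of nested (left, right) tuples.
        """
        stack, queue = [], list(edus)
        while queue or len(stack) > 1:
            if len(stack) < 2:      # must shift: nothing to reduce
                action = "shift"
            elif not queue:         # must reduce: nothing left to shift
                action = "reduce"
            else:
                action = predict_action(stack, queue)
            if action == "shift":
                stack.append(queue.pop(0))
            else:
                right = stack.pop()
                left = stack.pop()
                stack.append((left, right))
        return stack[0]
    ```

    With an oracle that always shifts, four EDUs collapse into the right-branching tree `(0, (1, (2, 3)))`; a trained classifier instead interleaves shifts and reduces to produce linguistically motivated structures.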

  • ConVis
    Visual Text Analytics of Conversations
    System Overview

    We have developed visual text analytic systems that tightly integrate interactive visualization with novel text mining and summarization techniques to fulfill the information needs of users exploring conversations (e.g., ConVis, MultiConVis, ConViscope). In this context, we have investigated techniques for interactive (human-in-the-loop) topic modeling. Check out our latest paper.

  • Extractive Summarization of Long Documents by Combining Global and Local Context
    Check out our work on extractive summarization for long documents!
    System Overview

    We propose a novel neural single document extractive summarization model for long documents, incorporating both the global context of the whole document and the local context within the current topic.
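    A minimal sketch of the global-plus-local idea: each sentence is scored against both a document-level and a topic-level context vector. The mean-pooled vectors, dot-product scorer, and interpolation weight below are assumed stand-ins for the model's learned representations and extractor:

    ```python
    def mean_vec(vecs):
        """Element-wise mean of a list of equal-length vectors."""
        n = len(vecs)
        return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

    def extract(sent_vecs, topic_ids, k, alpha=0.5):
        """Select k sentences by combining global and local context.

        sent_vecs: one vector per sentence (assumed encoder output).
        topic_ids: topic segment id per sentence.
        alpha: interpolation between global and local similarity.
        Returns the selected sentence indices in document order.
        """
        doc_vec = mean_vec(sent_vecs)  # global (whole-document) context
        topic_vecs = {t: mean_vec([v for v, tt in zip(sent_vecs, topic_ids)
                                   if tt == t])
                      for t in set(topic_ids)}  # local (topic) contexts
        def dot(a, b):
            return sum(x * y for x, y in zip(a, b))
        scores = [alpha * dot(v, doc_vec)
                  + (1 - alpha) * dot(v, topic_vecs[t])
                  for v, t in zip(sent_vecs, topic_ids)]
        ranked = sorted(range(len(sent_vecs)), key=lambda i: -scores[i])
        return sorted(ranked[:k])
    ```

    On a toy document with two topic segments, the scorer picks the sentence most central to each topic while still weighing document-level salience.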

    Check out our paper and code.