Karthik Yearning Deep Learning

Text Summarization Architectures


Text summarization is an important NLP task. It comprises two categories: abstractive summarization and extractive summarization. However, real-world application of text summarization remains challenging because of the human readability and quality expected of summaries. I will try to explain the different research approaches and architectures employed, giving complete end-to-end information about text summarization. This blog is a summarized version of the paper

Neural Abstractive Text Summarization with Sequence-to-Sequence Models: A Survey.


Introduction

Neural abstractive text summarization with sequence-to-sequence models has gained a lot of popularity. The many approaches differ in network architecture, parameter inference, and decoding. Along with this, there are some versions which focus on efficiency and parallelism when training a model.

Extractive summarization builds the summary from words, phrases, and sentences taken directly from the source article. These methods are relatively simple: since they select sentences from the source article, they produce grammatically correct sentences.

Abstractive summarization extracts a structural representation from the source article and then generates the summary with a language generation model. These methods have strong potential to produce high-quality summaries and can also easily incorporate external knowledge.




Seq2seq Models and Pointer Generator Network


Summarization based on text extraction is inherently limited, but generation-style abstractive methods have proven challenging to build. In this work, we propose a fully data-driven approach to abstractive sentence summarization. Our method utilizes a local attention-based model that generates each word of the summary conditioned on the input sentence. While the model is structurally simple, it can easily be trained end-to-end and scales to a large amount of training data. The model shows significant performance gains on the DUC-2004 shared task compared with several strong baselines.

\[\log p(y \mid x; \theta) = \sum_{i=0}^{N-1} \log p(y_{i+1} \mid x, y_c; \theta)\]

where \(y = (y_1, y_2, \ldots, y_N)\) is the output sequence, \(y_c\) denotes the window of the previous \(c\) output words (\(c\) is the context length), and \(\theta = (E, U, V, W)\): \(E\) is the word embedding matrix and \(U\), \(V\), \(W\) are weight matrices.
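For reference, the paper whose abstract is quoted above parameterizes each of these conditionals with a feed-forward neural network language model plus an attention-based encoder term; roughly, in the same notation:

\[p(y_{i+1} \mid y_c, x; \theta) \propto \exp\big(V h + W\, \mathrm{enc}(x, y_c)\big), \qquad \tilde{y}_c = [E y_{i-c+1}; \ldots; E y_i], \qquad h = \tanh(U \tilde{y}_c)\]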






Abstractive Sentence Summarization generates a shorter version of a given sentence while attempting to preserve its meaning. We introduce a conditional recurrent neural network (RNN) which generates a summary of an input sentence. The conditioning is provided by a novel convolutional attention-based encoder which ensures that the decoder focuses on the appropriate input words at each step of generation. Our model relies only on learned features and is easy to train in an end-to-end fashion on large data sets. Our experiments show that the model significantly outperforms the recently proposed state-of-the-art method on the Gigaword corpus while performing competitively on the DUC-2004 shared task.

An RNN encoder-decoder architecture is used to model the conditional probability. Training involves finding the optimal value of θ that maximizes the conditional probability of sentence-summary pairs in the training corpus.

If the model is trained to generate the next word of the summary given the previous words, then the above conditional can be factorized into a product of individual conditional probabilities:

\[P(y \mid x; \theta) = \prod_{t=1}^{N} p(y_t \mid \{ y_1, \ldots, y_{t-1}\}, x; \theta)\]
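A minimal sketch of this training objective, assuming a generic PyTorch encoder-decoder `model` that returns per-step vocabulary logits (the function and argument names here are illustrative, not from the paper):

```python
import torch.nn.functional as F

def summary_nll(model, source_ids, target_ids, pad_id=0):
    """Negative log-likelihood of a summary under a seq2seq model.

    Implements the factorization above: the decoder is fed the
    ground-truth prefix (teacher forcing) and predicts the next word
    at every step.
    """
    # Shift the target: input is y_1..y_{t-1}, label is y_2..y_t.
    decoder_input = target_ids[:, :-1]
    labels = target_ids[:, 1:]

    # logits: (batch, target_len - 1, vocab_size); `model` is assumed to be
    # any encoder-decoder that returns per-step vocabulary scores.
    logits = model(source_ids, decoder_input)

    # Sum of per-step -log p(y_t | y_<t, x); padding positions are ignored.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=pad_id,
        reduction="sum",
    )
```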




Handling OOV words

When there are OOV words, the switching decoder/pointer architecture points to their locations in the source document. The decoder is equipped with a 'switch' that decides between the generator and the pointer at every time-step. If the switch is turned on, the decoder produces a word from its target vocabulary. If the switch is turned off, the decoder instead generates a pointer to one of the word positions in the source. The word at the pointer location is then copied into the summary. The switch is modeled with a sigmoid activation function.
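A rough sketch of such a switching decoder step is given below; the exact inputs to the switch and the tensor shapes are assumptions for illustration, not the paper's precise formulation:

```python
import torch

def switch_decode_step(decoder_state, context, attention, W_s, b_s, vocab_proj, source_ids):
    """One decoding step of a switching generator/pointer decoder (sketch).

    decoder_state: (batch, hidden)    current decoder hidden state
    context:       (batch, hidden)    attention-weighted encoder context
    attention:     (batch, src_len)   attention over source positions
    source_ids:    (batch, src_len)   token ids of the source document
    """
    # Switch probability: sigmoid over decoder state and context
    # (which features feed the switch is a modeling choice here).
    switch = torch.sigmoid(torch.cat([decoder_state, context], dim=-1) @ W_s + b_s)  # (batch, 1)

    # Generator path: most likely next word from the target vocabulary.
    vocab_word = vocab_proj(torch.cat([decoder_state, context], dim=-1)).argmax(-1)

    # Pointer path: copy the source word at the most-attended position.
    pointer_pos = attention.argmax(-1)
    copied_word = source_ids.gather(1, pointer_pos.unsqueeze(1)).squeeze(1)

    # Hard switch at test time: generate if switch > 0.5, otherwise copy.
    return torch.where(switch.squeeze(1) > 0.5, vocab_word, copied_word)
```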






Learning Hierarchical Document Structure with Hierarchical Attention

Identifying the keywords and the dominant representation in lengthy source documents is a challenging task in summarization. The model aims to capture these key sentences using two bi-directional RNNs on the source side, one at the word level and the other at the sentence level. The attention mechanism operates at both levels simultaneously.

The word level attention is further re-weighted by the corresponding sentence level attention and re-normalized.

The re-scaled attention is then used to compute the attention-weighted context vector that goes as input to the hidden state of the decoder. Further, additional positional embeddings are concatenated to the hidden state of the sentence-level RNN to model the positional importance of sentences in the document. This architecture therefore jointly models key sentences as well as keywords within those sentences.
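A small sketch of this re-scaling step, with illustrative tensor names: word-level attention is multiplied by the attention of the sentence each word belongs to, renormalized, and then used to weight the word-level encoder states into a context vector.

```python
import torch

def hierarchical_context(word_attn, sent_attn, word_to_sent, word_states):
    """Rescale word attention by sentence attention and build the context.

    word_attn:    (batch, n_words)          word-level attention
    sent_attn:    (batch, n_sents)          sentence-level attention
    word_to_sent: (batch, n_words) long     index of the sentence containing each word
    word_states:  (batch, n_words, hidden)  word-level encoder states
    """
    # Attention of the sentence that each word belongs to.
    sent_attn_per_word = sent_attn.gather(1, word_to_sent)           # (batch, n_words)

    # Re-weight word attention by its sentence's attention, then renormalize.
    rescaled = word_attn * sent_attn_per_word
    rescaled = rescaled / rescaled.sum(dim=1, keepdim=True)

    # Attention-weighted context vector fed to the decoder.
    return torch.bmm(rescaled.unsqueeze(1), word_states).squeeze(1)  # (batch, hidden)
```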






This model combines abstraction with extraction. The pointer-generator architecture can copy words from the source text via a pointer and generate novel words from a vocabulary via a generator. With the pointing/copying mechanism, factual information can be reproduced accurately and OOV words can also be taken care of in the summaries.
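A hedged sketch of how the final word distribution is typically assembled in a pointer-generator network: a generation probability p_gen mixes the vocabulary distribution with a copy distribution obtained by scattering attention mass onto the source token ids (shapes and names are assumptions, not the paper's exact notation).

```python
import torch

def final_distribution(p_gen, vocab_dist, attention, source_ids):
    """Mix the generator and pointer distributions (sketch).

    p_gen:      (batch, 1)          probability of generating from the vocabulary
    vocab_dist: (batch, vocab_size) softmax over the target vocabulary
    attention:  (batch, src_len)    attention over source positions
    source_ids: (batch, src_len)    source token ids, assumed to lie within the
                                    (possibly extended) vocabulary that vocab_dist covers
    """
    # Generator part: scaled vocabulary distribution.
    mixed = p_gen * vocab_dist

    # Pointer part: scatter the remaining attention mass onto source token ids,
    # which is what lets source words (including OOVs in an extended vocab)
    # receive probability.
    copy_dist = torch.zeros_like(vocab_dist)
    copy_dist = copy_dist.scatter_add(1, source_ids, (1.0 - p_gen) * attention)

    return mixed + copy_dist
```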




Coverage Mechanism and Coverage Loss

The coverage mechanism keeps the attention mechanism from repeatedly attending to the same locations and thus avoids generating repetitive text. A coverage loss is used to penalize repeated attention to the same location.

The coverage mechanism addresses word- and sentence-level repetition, which would otherwise produce unnatural summaries.
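A minimal sketch of the coverage idea, assuming access to the attention distributions of every decoder step: coverage at step t is the sum of attention from earlier steps, and the loss penalizes overlap between the current attention and that accumulated coverage.

```python
import torch

def coverage_loss(attentions):
    """Coverage loss over a decoded sequence (sketch).

    attentions: (batch, dec_steps, src_len) attention at every decoder step.
    Returns the penalty  sum_t sum_i min(a_{t,i}, c_{t,i}),
    where the coverage c_t is the sum of attention from steps < t.
    """
    batch, dec_steps, src_len = attentions.shape
    coverage = torch.zeros(batch, src_len, device=attentions.device)
    loss = torch.zeros(batch, device=attentions.device)

    for t in range(dec_steps):
        attn_t = attentions[:, t, :]
        # Penalize attention mass that lands where coverage is already high.
        loss = loss + torch.minimum(attn_t, coverage).sum(dim=1)
        coverage = coverage + attn_t

    return loss.mean()
```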




Attentional, RNN-based encoder-decoder models for abstractive summarization have achieved good performance on short input and output sequences. For longer documents and summaries however these models often include repetitive and incoherent phrases. We introduce a neural network model with a novel intra-attention that attends over the input and continuously generated output separately, and a new training method that combines standard supervised word prediction and reinforcement learning (RL). Models trained only with supervised learning often exhibit “exposure bias” – they assume ground truth is provided at each step during training. However, when standard word prediction is combined with the global sequence prediction training of RL the resulting summaries become more readable. We evaluate this model on the CNN/Daily Mail and New York Times datasets. Our model obtains a 41.16 ROUGE-1 score on the CNN/Daily Mail dataset, an improvement over previous state-of-the-art models. Human evaluation also shows that our model produces higher quality summaries.



This paper follows a heuristic approach to avoid repetition, one that depends on the dataset used.

Another way to avoid repetitions comes from our observation that in both the CNN/Daily Mail and NYT datasets, ground-truth summaries almost never contain the same trigram twice. Based on this observation, we force our decoder to never output the same trigram more than once during testing. We do this by setting p(y_t) = 0 during beam search, when outputting y_t would create a trigram that already exists in the previously decoded sequence of the current beam.
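A sketch of this trigram-blocking heuristic, independent of any particular beam-search implementation: before choosing the next word for a beam, any candidate that would complete a trigram already present in that beam's hypothesis gets its score set to -inf, i.e. p(y_t) = 0.

```python
def block_repeated_trigrams(hypothesis, scores):
    """Trigram blocking during beam search (sketch).

    hypothesis: list[int]    token ids decoded so far in this beam
    scores:     list[float]  score of every candidate next token

    Any token that would form a trigram already present in `hypothesis`
    is given a score of -inf, so the finished beam can never contain
    the same trigram twice.
    """
    if len(hypothesis) < 2:
        return scores

    seen_trigrams = {
        tuple(hypothesis[i:i + 3]) for i in range(len(hypothesis) - 2)
    }
    prefix = tuple(hypothesis[-2:])

    blocked = list(scores)
    for first, second, third in seen_trigrams:
        # Only trigrams whose first two tokens match the current tail matter.
        if (first, second) == prefix:
            blocked[third] = float("-inf")
    return blocked
```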





Current models for document summarization disregard user preferences such as the desired length, style, the entities that the user might be interested in, or how much of the document the user has already read. We present a neural summarization model with a simple but effective mechanism to enable users to specify these high level attributes in order to control the shape of the final summaries to better suit their needs. With user input, our system can produce high quality summaries that follow user preferences. Without user input, we set the control variables automatically – on the full text CNN-Dailymail dataset, we outperform state of the art abstractive systems (both in terms of F1-ROUGE1 40.38 vs. 39.53 F1-ROUGE and human evaluation).


Below are the summarization controls presented in this paper.

Controlling summary length lets readers with different amounts of time choose how much to read. Length-range bins are created so that each contains an equal number of training documents, and special marker words indicating the desired length bin are added to the vocabulary. At training time, the marker of the ground-truth summary length is prepended to the input. At test time, the length is controlled by prepending the desired length marker token.

To enable the user to control length, we first quantize summary length into discrete bins, each representing a size range. Length bins are chosen so that they each contain roughly an equal number of training documents. We then expand the input vocabulary with special word types to indicate the length bin of the desired summary, which allows generation to be conditioned upon this discrete length variable. For training, we prepend the input of our summarizer with a marker that indicates the length of the ground-truth summary. At test time, we control the length of generated text by prepending a particular length marker token.
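A rough sketch of this length-control preprocessing; the bin count and the `@lenN` marker strings are made up for illustration:

```python
import numpy as np

def make_length_bins(train_summary_lengths, n_bins=10):
    """Choose bin edges so each bin holds roughly the same number of
    training summaries (equal-frequency binning)."""
    quantiles = np.linspace(0, 100, n_bins + 1)[1:-1]
    return np.percentile(train_summary_lengths, quantiles)

def length_marker(summary_length, bin_edges):
    """Map a summary length to its special marker token, e.g. '@len3'."""
    bin_id = int(np.searchsorted(bin_edges, summary_length))
    return f"@len{bin_id}"

# Training: prepend the marker of the ground-truth summary to the source input.
#   source = length_marker(len(gold_summary.split()), bin_edges) + " " + source
# Test: prepend whichever length marker the user asks for.
```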


To enable entity-centric summaries, we first anonymize entities by replacing all occurrences of a given entity in a document by the same token. For a (document, summary) pair, each entity is replaced with a token from the set (@entity0, . . . , @entityN). This abstracts away the surface form, allowing our approach to scale to many entities and generalize to unseen ones. We then express that an entity should be present in the generated summary by prepending the entity token to the input — prepending @entity3 expresses that the model should generate a summary where @entity3 is present. In effect, this instructs the model to focus on sentences that mention the marked entities.
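A toy sketch of the entity anonymization and prepending step; the `@entityN` tokens follow the description above, while the helper names and example strings are illustrative:

```python
def anonymize_entities(document, summary, entities):
    """Replace each entity with a shared @entityN token in both the
    document and its summary (sketch)."""
    for i, entity in enumerate(entities):
        token = f"@entity{i}"
        document = document.replace(entity, token)
        summary = summary.replace(entity, token)
    return document, summary

def request_entity(document, entity_token):
    """Ask the model for a summary that mentions `entity_token`
    by prepending it to the input."""
    return f"{entity_token} {document}"

# Example: anonymize, then request a summary centered on @entity0.
doc, summ = anonymize_entities(
    "Alice met Bob in Paris.", "Alice traveled to Paris.", ["Alice", "Bob"]
)
controlled_input = request_entity(doc, "@entity0")
```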

Higher accuracy is achieved when:

  1. the constrained entities come from the first few sentences of a document;

  2. the markers are taken from the reference summary;

  3. more entities are provided, which improves summarization quality.


To control source style, we introduce special marker tokens (@genSource0, . . . , @genSourceN) to express source desiderata. For training, we prepend the input with the marker corresponding to the ground-truth source. At inference, we control the style of the generated summary by prepending different markers.




In this blog, I have covered some of the major text summarization architectures. In the part 2 blog, I will list the challenges and the solution strategies employed in producing a high-quality text summary.




Resources:

  1. Neural Abstractive Text Summarization with Sequence-to-Sequence Models: A Survey
  2. Controllable Abstractive Summarization - GitHub
  3. Get To The Point: Summarization with Pointer-Generator Networks - GitHub