Saturday, January 2, 2021

How to perform text summarization with sequence-to-sequence RNNs

Recurrent Neural Networks (RNNs) are a special kind of neural network that operates on sequences rather than on the individual symbols that make up a sequence. The technique does not need to know what the parts of the sequence represent, whether they are words or video frames; it infers the meaning of those symbols from the sequence itself. When raw data is shredded into sequences, the RNN keeps state information per sequence that it infers from that sequence. This state is the essence of the sequence. Using this state, the RNN can translate input sequences (text) into output sequences (a summary). It can also be used to interpret the input sequence and generate a response (as in a chatbot). An RNN encoder-decoder model with attention was proposed by Bahdanau et al. in 2014, and it can be paired with any kind of decoder that generates custom output. Text summarization essentially reuses this machine-translation framework, restricting what the decoder is asked to produce.
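As a concrete reference point, here is a minimal sketch of such an encoder-decoder in Keras, assuming integer-encoded article and summary token sequences; the vocabulary size, embedding size, and hidden size below are illustrative placeholders rather than values taken from the papers discussed here.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size, embed_dim, hidden_dim = 50000, 100, 200  # illustrative sizes

# Encoder: reads the source article and compresses it into a state vector.
enc_inputs = layers.Input(shape=(None,), name="article_tokens")
enc_embed = layers.Embedding(vocab_size, embed_dim)(enc_inputs)
_, enc_h, enc_c = layers.LSTM(hidden_dim, return_state=True)(enc_embed)

# Decoder: generates the summary one token at a time, initialized with the
# encoder state (teacher forcing during training).
dec_inputs = layers.Input(shape=(None,), name="summary_tokens")
dec_embed = layers.Embedding(vocab_size, embed_dim)(dec_inputs)
dec_outputs, _, _ = layers.LSTM(hidden_dim, return_sequences=True,
                                return_state=True)(dec_embed,
                                                   initial_state=[enc_h, enc_c])
dec_probs = layers.Dense(vocab_size, activation="softmax")(dec_outputs)

model = Model([enc_inputs, dec_inputs], dec_probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

During training, the decoder input is the summary shifted by one position and the target is the summary itself; at inference time the decoder is run step by step, feeding each predicted word back in.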

There are a few differences between machine translation and summarization when both are framed as sequence-to-sequence problems. Summarization is a lossy conversion in which only the key concepts are retained. Machine translation, by contrast, is lossless and places no restriction on output size. Summarization restricts the size of the output regardless of the size of the input. Rush et al., in 2015, proposed a convolutional model that encodes the source and uses a context-sensitive, attention-based feed-forward neural network to generate the summary.
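The sketch below captures the general shape of that idea rather than Rush et al.'s exact architecture: a convolutional encoder pools the source into a context vector, and a feed-forward network conditioned on that context plus a small window of previously generated summary words scores the next word. The window size C and the layer sizes are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size, embed_dim, C = 50000, 100, 5  # C = window of previous words

src = layers.Input(shape=(None,), name="source_tokens")
ctx_words = layers.Input(shape=(C,), name="previous_summary_words")

# Convolutional encoder: embeds the source and pools it into one context vector.
src_embed = layers.Embedding(vocab_size, embed_dim)(src)
conv = layers.Conv1D(embed_dim, kernel_size=3, padding="same",
                     activation="relu")(src_embed)
src_context = layers.GlobalAveragePooling1D()(conv)

# Feed-forward decoder: concatenates the encoder context with the embedded
# window of previous summary words and predicts the next word.
ctx_embed = layers.Flatten()(layers.Embedding(vocab_size, embed_dim)(ctx_words))
hidden = layers.Dense(256, activation="tanh")(
    layers.Concatenate()([src_context, ctx_embed]))
next_word = layers.Dense(vocab_size, activation="softmax")(hidden)

model = Model([src, ctx_words], next_word)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```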

The annotated Gigaword corpus has been popular for training the models used in both the 2014 and 2015 work. Mikolov's 2013 word2vec model was trained on a different dataset to create its word-embedding matrix, but these 100-dimensional word embeddings can still be updated further by training them on the Gigaword corpus. This was the approach taken by Nallapati et al. in 2016 for a similar task, with the deviation that the input is not restricted to one or two sentences from each sample. The RNN itself uses a 200-dimensional hidden state, with the encoder being bidirectional and the decoder uni-directional. The vocabularies for the source and target sequences can be kept separate, although the words from the source, along with some frequent words, may re-appear in the target vocabulary. Reusing the words from the source shrinks the decoder vocabulary and cuts down on the number of training epochs considerably.
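A sketch of that embedding setup is shown below. It assumes pretrained 100-dimensional word2vec vectors in a local file (word2vec_100d.bin is a hypothetical path) and a toy vocabulary; the embedding layer is kept trainable so the vectors continue to be updated on the summarization corpus, and the encoder/decoder RNNs follow the bidirectional/uni-directional split described above.

```python
import numpy as np
from gensim.models import KeyedVectors
from tensorflow.keras import layers

# Toy vocabulary standing in for the real source/target vocabularies.
word_to_index = {"<pad>": 0, "<unk>": 1, "police": 2, "arrest": 3}
vocab_size, embed_dim = len(word_to_index), 100

# Load pretrained 100-dimensional word2vec vectors (hypothetical file path).
w2v = KeyedVectors.load_word2vec_format("word2vec_100d.bin", binary=True)

# Seed the embedding matrix with word2vec wherever a vector exists.
embedding_matrix = np.random.normal(scale=0.1, size=(vocab_size, embed_dim))
for word, idx in word_to_index.items():
    if word in w2v:
        embedding_matrix[idx] = w2v[word]

# trainable=True lets the embeddings keep training on the Gigaword data.
embedding = layers.Embedding(vocab_size, embed_dim,
                             weights=[embedding_matrix], trainable=True)

# Bidirectional encoder and uni-directional decoder with 200-dimensional
# hidden states (200 per direction for the encoder).
encoder_rnn = layers.Bidirectional(
    layers.GRU(200, return_sequences=True, return_state=True))
decoder_rnn = layers.GRU(200, return_sequences=True, return_state=True)
```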

The summary size is usually capped at about 30 words, while the input size may vary. The encoder itself can be hierarchical, with a second bi-directional RNN layer running at the sentence level. The use of pseudo-words and of sentences as sequences is left outside the scope of this article.
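A hierarchical encoder along those lines could be sketched as follows, assuming the article has already been reshaped into a fixed grid of sentences and words per sentence; all sizes are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size, embed_dim = 50000, 100
num_sentences, words_per_sentence = 20, 30  # illustrative article shape

article = layers.Input(shape=(num_sentences, words_per_sentence))
embedded = layers.Embedding(vocab_size, embed_dim)(article)

# First level: a bidirectional word-level RNN run over each sentence,
# producing one vector per sentence.
sentence_vectors = layers.TimeDistributed(
    layers.Bidirectional(layers.GRU(100)))(embedded)

# Second level: a bidirectional sentence-level RNN over those vectors.
sentence_states = layers.Bidirectional(
    layers.GRU(100, return_sequences=True))(sentence_vectors)

hierarchical_encoder = Model(article, sentence_states)
```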

The sequence length for this model is recommended to be in the 10-20 range. Since the timesteps are per word, it is best to sample the input from only a few sentences.
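One simple way to honor that constraint is to take the first couple of sentences and cap the number of word timesteps, as in the sketch below; the naive sentence split, toy vocabulary, and max_len value are assumptions for illustration.

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

def sample_input(article_text, word_to_index, max_len=20, num_sentences=2):
    # Keep only the first few sentences, then cap the word timesteps.
    sentences = article_text.split(". ")[:num_sentences]
    tokens = " ".join(sentences).lower().split()
    ids = [word_to_index.get(tok, word_to_index["<unk>"]) for tok in tokens]
    return pad_sequences([ids], maxlen=max_len, padding="post",
                         truncating="post")[0]

word_to_index = {"<pad>": 0, "<unk>": 1, "the": 2, "police": 3}  # toy vocab
print(sample_input("The police arrested a suspect. The case is closed.",
                   word_to_index))
```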
