Wednesday, March 17, 2021

Limited dimensions and sequences in the discovery of semantic embeddings (continued)

Dimensions are not the only aspect affected by limitations; sequences are affected too. A sequence is a series of words. For example, a series of words spanning sentences might reappear in another context. These sequences can even be compared with the identifiers of any sortable data maintained in a separate table. Sequences became popular with Recurrent Neural Networks (RNNs), a special kind of neural network that processes a sequence as a whole rather than the individual symbols that constitute it. This technique does not need to know what the parts of the sequence represent, whether they are words or video frames; it can infer the meaning of those symbols. When raw data is shredded into sequences, the network keeps state information per sequence that it infers from that sequence. This state is the essence of the sequence. Using this state, it can translate input sequences (text) to output sequences (a summary), or interpret the input sequence to generate a response (as in a chatbot). The attention-based RNN encoder-decoder described by Bahdanau et al. in 2014 can be paired with any kind of decoder that generates custom output.
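As a minimal sketch of this idea (assuming PyTorch; the vocabulary, embedding, and hidden sizes are illustrative and the attention mechanism of Bahdanau et al. is omitted for brevity), an encoder RNN compresses the input sequence into a state vector and a decoder RNN unrolls that state into an output sequence:

# Minimal RNN encoder-decoder sketch, assuming PyTorch.
# Sizes below are illustrative, not taken from this post.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Encoder reads the input sequence and compresses it into a state vector.
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        # Decoder unrolls from that state to produce the output sequence.
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # src_ids, tgt_ids: (batch, seq_len) tensors of token ids
        _, state = self.encoder(self.embed(src_ids))   # state: the "essence" of the input
        dec_out, _ = self.decoder(self.embed(tgt_ids), state)
        return self.out(dec_out)                       # (batch, tgt_len, vocab_size) logits

model = Seq2Seq()
src = torch.randint(0, 5000, (2, 12))   # two input sequences of 12 token ids
tgt = torch.randint(0, 5000, (2, 8))    # two target sequences of 8 token ids
logits = model(src, tgt)                # scores for each output position

The same encoder state can feed different decoders, which is what makes the architecture reusable for translation, summarization, or chatbot-style generation.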

 

The sequence length for this technique is typically about 10 to 20 words, and each constituent of the sequence is processed in a single timestep, so it is best to sample sequences from a few sentences. This reduction of the sample size per sequence can be considered as restrictive as the curse of limited dimensions, but it is similarly overcome by using more batches and a collection of texts rather than any single selection. Sometimes it even helps to pick out only the first few sentences from each sample text in a corpus.
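A small sketch of this sampling step, assuming simple whitespace tokenization and an illustrative maximum length of 15 tokens (the helper name and tiny corpus below are hypothetical):

# Clip each document to its first few sentences and fix the sequence length.
def make_sequences(texts, num_sentences=2, max_len=15, pad="<pad>"):
    sequences = []
    for text in texts:
        # Keep only the first few sentences of each sample text.
        sentences = text.split(".")[:num_sentences]
        tokens = " ".join(sentences).split()
        # Truncate to the sequence length, then pad so every sequence
        # occupies the same number of timesteps.
        tokens = tokens[:max_len]
        tokens += [pad] * (max_len - len(tokens))
        sequences.append(tokens)
    return sequences

corpus = [
    "Sequences are series of words. They reappear across contexts. More text follows.",
    "Recurrent networks keep a state per sequence. The state is its essence.",
]
for seq in make_sequences(corpus):
    print(seq)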

 

The restrictions on dimensions and sequences are not necessarily introduced by resources. Cloud computing has made it easy to store large amounts of data, even per iteration. Rather, the algorithms that process them have an inherent limitation of being quite compute-intensive even for a size of a hundred. One technique called memoization, which saves the results of some computations so they can be reused, has already provided benefit by removing some of the repetition in these computations.
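A sketch of memoization in Python using functools.lru_cache; the cosine-similarity function here is an illustrative stand-in for whichever repeated computation is being cached:

# Cache the result of an expensive computation so repeated calls with the
# same input are answered from the cache instead of being recomputed.
from functools import lru_cache
import math

@lru_cache(maxsize=None)
def expensive_similarity(pair):
    a, b = pair          # pair is a tuple of two word vectors (as tuples, so it is hashable)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

v1, v2 = (1.0, 2.0, 3.0), (2.0, 0.5, 1.0)
expensive_similarity((v1, v2))   # computed once
expensive_similarity((v1, v2))   # served from the cache on the repeat call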

 

Sequence size behaves differently from feature dimensions. While the addition of dimensions improves the fidelity of the word vectors, the elongation of a sequence changes its nature, often making it more nebulous than before. We must look at how the sequences are used in order to appreciate this difference.
