Thursday, March 18, 2021

Limited dimensions and sequences in the discovery of semantic embeddings (continued)

We discuss only the RNN encoder-decoder technique, which is widely used for text mining and is easy to apply with a library such as TensorFlow. Using the library does not by itself standardize the technique, but it has become common practice.

This technique requires some text preprocessing and other upstream and downstream activities that are typical of any technique that works with text. In this case, we start with two dictionaries: index2word and word2index. The index2word dictionary is an ordered list of the most frequent words from the vocabulary (here, sorted alphabetically), each assigned an index; word2index is its inverse, mapping each word back to its index. The frequency distribution of the vocabulary is never discarded. It shows an interesting trend that is typically seen in any association or market-basket analysis: the frequently occurring words fall within the top 10% of the overall vocabulary. We mention association only because we will need a vocabulary along both the x and the y directions when building the encoder-decoder model.
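As a minimal sketch of this preprocessing step (the vocabulary size and the special token names below are illustrative assumptions, not details from the post), the two dictionaries can be built from the frequency distribution like this:

from collections import Counter

def build_vocab(tokenized_sentences, vocab_size=10000):
    # Count word frequencies across the corpus; the distribution is kept,
    # not discarded, so it can be inspected later.
    freq_dist = Counter(word for sent in tokenized_sentences for word in sent)

    # Keep only the most frequent words, then order them (alphabetically here)
    # so that every run assigns the same index to the same word.
    top_words = sorted(word for word, _ in freq_dist.most_common(vocab_size))

    # Reserve a few special tokens (assumed names) for padding, unknown
    # words, and the start of a decoded sequence.
    index2word = ['_PAD_', '_UNK_', '_GO_'] + top_words

    # word2index is simply the inverse of the index2word ordering.
    word2index = {word: idx for idx, word in enumerate(index2word)}

    return index2word, word2index, freq_dist

The same helper can be run once over the source (x) text and once over the target (y) text, giving the two vocabularies the encoder-decoder model needs.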

The encoder and decoder sequences, as well as the labels representing the true outputs, are of fixed size and format, so their data structures can be initialized up front and populated as the model is trained and tested. Each cell in the RNN can be a Long Short-Term Memory (LSTM) unit or a Gated Recurrent Unit (GRU); it remembers useful information, discards unnecessary information, and emits relevant information at each timestep. The cells are stacked together into an n-layered model. TensorFlow has a high-level function called embedding_rnn_seq2seq that creates such a model and performs the word embeddings internally using the constructs just described.

During training, the input to each timestep of the decoder is taken from the label sequence. During testing, another instance of the model is initialized in which the decoder takes the output of the previous timestep as the input to the current timestep. Training is defined as a method that runs the train operation of the library's model and minimizes a loss function. It runs for a given number of epochs or iterations; if training is interrupted, it can resume from the last saved checkpoint, and the model is periodically evaluated on a validation set. The predictor performs a forward step and emits the most probable word returned by the model at each timestep.
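A rough sketch of how these pieces fit together, assuming TensorFlow 1.x and its legacy seq2seq module (the sequence length, vocabulary size, hyperparameters, and optimizer choice below are assumptions for illustration, not values from the post):

import tensorflow as tf  # assumes TensorFlow 1.x

seq_len, vocab_size, emb_dim, num_layers = 20, 10000, 128, 3

# Fixed-size encoder inputs, decoder inputs, and labels, declared up front.
encoder_inputs = [tf.placeholder(tf.int32, [None], name='enc_%d' % t)
                  for t in range(seq_len)]
labels = [tf.placeholder(tf.int32, [None], name='label_%d' % t)
          for t in range(seq_len)]
# During training the decoder is fed a GO token followed by the labels
# shifted right (teacher forcing).
decoder_inputs = [tf.zeros_like(labels[0], name='GO')] + labels[:-1]

# Stack LSTM cells into an n-layered model.
cell = tf.contrib.rnn.MultiRNNCell(
    [tf.contrib.rnn.BasicLSTMCell(emb_dim) for _ in range(num_layers)])

with tf.variable_scope('seq2seq') as scope:
    # Training graph: decoder inputs come from the label sequence.
    train_outputs, _ = tf.contrib.legacy_seq2seq.embedding_rnn_seq2seq(
        encoder_inputs, decoder_inputs, cell,
        num_encoder_symbols=vocab_size, num_decoder_symbols=vocab_size,
        embedding_size=emb_dim, feed_previous=False)

    # Inference graph: same weights, but each timestep's output is fed back
    # in as the next timestep's input.
    scope.reuse_variables()
    infer_outputs, _ = tf.contrib.legacy_seq2seq.embedding_rnn_seq2seq(
        encoder_inputs, decoder_inputs, cell,
        num_encoder_symbols=vocab_size, num_decoder_symbols=vocab_size,
        embedding_size=emb_dim, feed_previous=True)

# Loss over all timesteps, minimized by the train operation.
loss_weights = [tf.ones_like(l, dtype=tf.float32) for l in labels]
loss = tf.contrib.legacy_seq2seq.sequence_loss(train_outputs, labels, loss_weights)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

# A Saver lets an interrupted training run resume from the last checkpoint.
saver = tf.train.Saver()

The predictor would run infer_outputs in a session and take an argmax over the vocabulary dimension at each timestep to emit the most probable word.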

The cells are stacked in layers, and each layer refines the interpretation to some extent. With several layers, the information encoded in a sequence can be more comprehensive, but the data structures grow much larger and the operations take more time per epoch.
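To give a feel for that growth, here is a small back-of-the-envelope calculation (a sketch using the standard LSTM parameter count, with an assumed hidden size equal to the embedding dimension):

def lstm_layer_params(input_dim, hidden_dim):
    # An LSTM cell has 4 gates, each with a weight matrix over the
    # concatenated [input, previous state] and a bias vector.
    return 4 * ((input_dim + hidden_dim) * hidden_dim + hidden_dim)

def stacked_lstm_params(emb_dim, num_layers):
    # With equal input and hidden sizes, every layer contributes the same count.
    return num_layers * lstm_layer_params(emb_dim, emb_dim)

for layers in (1, 2, 3):
    print(layers, stacked_lstm_params(128, layers))
    # 1 -> 131,584   2 -> 263,168   3 -> 394,752 parameters

The parameter count, and with it the per-epoch cost, scales roughly linearly with the number of stacked layers.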


  
