A conjecture about Artificial Intelligence:
ChatGPT is increasingly in the news for its ability to mimic
a human conversationalist and for its versatility. With the GPT-3 family, it brings
a significant improvement over sequential encoding in the form of parallelizable
learning. This article asks whether the encoded state can be considered to be in
summation form, so that as the text continues to be encoded, the overall state
is continuously accumulated in a streaming manner. But first, a general
introduction to the subject.
ChatGPT is based on a type of neural network called a
transformer. These are models that can translate text, write poems and op-eds,
and even generate code. Newer natural language processing (NLP) models like
BERT, GPT-3 and T5 are all based on transformers. Transformers have been incredibly
impactful compared to their predecessors, which were also based on neural
networks. To recap, neural networks are models for analyzing complicated
data: they learn hidden layers that capture the latent semantics of the data,
and those layers sit between an input layer of data and an output layer of
encodings. Neural networks can handle a variety of data including images, video,
audio and text, and there are different types of neural networks optimized for
different types of data. If we are analyzing images, we would typically use a
convolutional neural network (CNN), named for the convolution filters it slides
across the image. Such a network often begins by embedding the original data into
a common shared space before it undergoes synthesis, training and testing. The
embedding space is constructed from an initial collection of representative data;
for example, it could be a collection of 3D shapes drawn from real-world images.
The space organizes latent objects so that images and shapes map to complete
representations of the same objects. Used this way, a CNN focuses on the salient,
invariant embedded objects rather than the noise.
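As an illustration only, here is a minimal sketch of a small CNN image classifier, assuming PyTorch is available; the layer sizes, input resolution and class count are arbitrary placeholders rather than anything a production model uses.

    # Minimal CNN sketch (assumes PyTorch); sizes are illustrative only.
    import torch
    import torch.nn as nn

    class TinyCNN(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            # Convolution filters slide over the image and extract local features.
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 3-channel RGB input
                nn.ReLU(),
                nn.MaxPool2d(2),                             # downsample 2x
                nn.Conv2d(16, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
            # The flattened feature map acts as the learned embedding of the image.
            self.classifier = nn.Linear(32 * 8 * 8, num_classes)

        def forward(self, x):
            x = self.features(x)          # (batch, 32, 8, 8) for 32x32 inputs
            return self.classifier(x.flatten(1))

    # Usage: a batch of four 32x32 RGB images.
    logits = TinyCNN()(torch.randn(4, 3, 32, 32))
    print(logits.shape)  # torch.Size([4, 10])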
CNNs worked great for detecting objects in images but did not do as well for
language tasks such as summarizing or generating text. The recurrent neural
network (RNN) was introduced to address this for language tasks such as
translation, where an RNN takes a sentence in the source language and translates
it into the destination language one word at a time, generating the translation
sequentially. Sequential processing is important because word order matters for
the meaning of a sentence, as in “The dog chased the cat” versus “The cat chased
the dog”. RNNs had a few problems. They could not handle large sequences of text,
like long paragraphs. They were also not fast enough to train on large data
because they were sequential. The ability to train on large data is considered a
competitive advantage when it comes to NLP tasks because the models become better
tuned.
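To make the sequential bottleneck concrete, here is a minimal sketch, assuming PyTorch, of an RNN consuming a sentence one token at a time; the vocabulary, token ids and layer sizes are made up for illustration.

    # Minimal RNN sketch (assumes PyTorch); the hidden state is updated one token at a time.
    import torch
    import torch.nn as nn

    vocab_size, embed_dim, hidden_dim = 1000, 32, 64
    embed = nn.Embedding(vocab_size, embed_dim)
    cell = nn.RNNCell(embed_dim, hidden_dim)

    tokens = torch.tensor([4, 27, 311, 9])   # a toy encoded sentence
    state = torch.zeros(1, hidden_dim)       # initial hidden state

    # The loop is inherently sequential: step t cannot start until step t-1 finishes,
    # which is why RNNs are slow to train on long texts and hard to parallelize.
    for tok in tokens:
        state = cell(embed(tok).unsqueeze(0), state)

    print(state.shape)  # torch.Size([1, 64]) -- a summary of the whole sentence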
Transformers changed that and, in fact, were originally developed for
translation. Unlike RNNs, they can be parallelized, which means they can be
trained on very large data sets. GPT-3, which writes poetry and code and carries
on conversations, was trained on roughly 45 terabytes of text data, including a
large crawl of the world wide web. The architecture scales well with huge data
sets.
Transformers work very well because of three components: 1.
positional encoding, 2. attention and 3. self-attention. Positional encoding is
about enhancing the data with positional information rather than encoding word
order in the structure of the network. As we train the network on lots of text
data, the transformer learns to interpret those positional encodings. This made
transformers easier to train than RNNs. Attention refers to a concept that
originated in the paper aptly titled “Attention Is All You Need”. It is a
mechanism that allows a text model to look at every single word in the original
sentence when deciding how to translate a word in the output; a heat map of the
attention weights shows which source words each output word aligns with. While
attention is about the alignment between words, self-attention is about
understanding the underlying meaning of a word so it can be disambiguated from
other usages. This often involves an internal representation of the word, also
referred to as its state. When attention is directed over the input text, the
model can distinguish between, say, “Server, can I have the check?” and “I
crashed the server”, interpreting the reference as a human waiter versus a
machine. The context of the surrounding words informs this state. A small sketch
of positional encoding and self-attention follows.
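The sketch below, assuming NumPy, shows the standard sinusoidal positional encoding formula and a deliberately simplified self-attention step; it omits the learned query, key and value projections and the multiple heads of a real transformer.

    # Sketch of sinusoidal positional encoding and scaled dot-product self-attention
    # (assumes NumPy; shapes and values are illustrative, not a full transformer).
    import numpy as np

    def positional_encoding(seq_len, d_model):
        # Each position gets a unique pattern of sines and cosines that the
        # model can learn to interpret as word order.
        pos = np.arange(seq_len)[:, None]
        i = np.arange(d_model)[None, :]
        angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles[:, 0::2])
        pe[:, 1::2] = np.cos(angles[:, 1::2])
        return pe

    def self_attention(x):
        # Every token attends to every other token in one matrix multiply,
        # which is what makes transformers parallelizable. Queries, keys and
        # values are all taken to be the raw inputs here for brevity.
        d = x.shape[-1]
        scores = x @ x.T / np.sqrt(d)                  # token-to-token similarity
        scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
        weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
        return weights @ x                             # context-aware states

    seq_len, d_model = 6, 16                           # e.g. a 6-token sentence
    x = np.random.randn(seq_len, d_model) + positional_encoding(seq_len, d_model)
    print(self_attention(x).shape)                     # (6, 16)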
BERT, an NLP model, makes use of attention and can be used
for a variety of purposes such as text summarization, question answering,
classification and finding similar sentences. BERT also helps power Google
Search and Google Cloud AutoML Natural Language. Google has made BERT available
for download via TensorFlow, while the Hugging Face company has made
transformers available through its Python library.
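As a small usage example, assuming the Hugging Face transformers package is installed and the pretrained model can be downloaded, the following fills in a masked word with BERT; the sentence is only illustrative.

    # Small usage sketch of BERT via the Hugging Face transformers library
    # (assumes `pip install transformers` and network access for the model).
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model="bert-base-uncased")
    for prediction in unmasker("I crashed the [MASK] in the data center."):
        print(prediction["token_str"], round(prediction["score"], 3))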
The basis for the conjecture, namely to use stream processing for state
encoding rather than parallel batches, is derived from the online processing of
Big Data as discussed in: http://1drv.ms/1OM29ee
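To make the conjecture concrete, here is a minimal sketch, in Python with NumPy, of a state kept in summation form: each incoming token's encoding is added into a running state so the overall state accumulates in a streaming manner. The encoding function, dimensions and damping factor are placeholders chosen for illustration, not a claim about how ChatGPT actually maintains state.

    # Sketch of summation-form state accumulation over a token stream (assumes NumPy).
    import numpy as np

    d_model = 16
    rng = np.random.default_rng(0)
    token_vectors = {}                       # stand-in for a learned embedding table

    def encode(token):
        # Hypothetical per-token encoding; a real system would use learned embeddings.
        if token not in token_vectors:
            token_vectors[token] = rng.standard_normal(d_model)
        return token_vectors[token]

    state = np.zeros(d_model)                # overall state starts empty
    for position, token in enumerate(["the", "dog", "chased", "the", "cat"]):
        # Summation form: the new state is the old state plus the encoded contribution,
        # so text can keep streaming in without revisiting earlier tokens.
        state = state + encode(token) / (position + 1)   # damping is an arbitrary choice

    print(state[:4])                         # a continuously accumulated summary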