A conjecture about Artificial Intelligence:
ChatGPT is increasingly in the news for its ability to mimic
a human conversationalist and for its versatility. With the GPT-3 family, it brings
a significant improvement over sequential encoding in the form of parallelizable
learning. This article asks whether the encoded state can be considered to be in
summation form, so that as the text continues to be encoded, the overall state
is continuously accumulated in a streaming manner. But first, a general
introduction to the subject.
ChatGPT is based on a type of neural network called a
transformer. These are models that can translate text, write poems and op-eds,
and even generate code. Newer natural language processing (NLP) models like
BERT, GPT-3 and T5 are all based on transformers. Transformers have been incredibly
impactful compared to their predecessors, which were also based on neural
networks. To recap, neural networks are models for analyzing complicated
data: they learn hidden layers that capture the latent semantics of the data,
and those layers sit between an input layer of data and an output layer of
encodings. Neural networks can handle a variety of data including images, video,
audio and text, and there are different types of neural networks optimized for
different types of data. If we are analyzing images, we would typically use a
convolutional neural network (CNN), named for the convolution filters it slides
across the image. Such a network often begins by embedding the original data into
a common shared space before it undergoes synthesis, training and testing. The
embedding space is constructed from an initial collection of representative data;
for example, it could be a collection of 3D shapes drawn from real-world images.
The space organizes latent objects so that images and shapes map to complete
representations of the same objects. Used this way, a CNN focuses on the salient,
invariant embedded objects rather than the noise.
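As an illustration only, here is a minimal sketch of a small CNN image classifier, assuming PyTorch is available; the layer sizes, input resolution and class count are arbitrary placeholders rather than anything a production model uses.

    # Minimal CNN sketch (assumes PyTorch); sizes are illustrative only.
    import torch
    import torch.nn as nn

    class TinyCNN(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            # Convolution filters slide over the image and extract local features.
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 3-channel RGB input
                nn.ReLU(),
                nn.MaxPool2d(2),                             # downsample 2x
                nn.Conv2d(16, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
            # The flattened feature map acts as the learned embedding of the image.
            self.classifier = nn.Linear(32 * 8 * 8, num_classes)

        def forward(self, x):
            x = self.features(x)          # (batch, 32, 8, 8) for 32x32 inputs
            return self.classifier(x.flatten(1))

    # Usage: a batch of four 32x32 RGB images.
    logits = TinyCNN()(torch.randn(4, 3, 32, 32))
    print(logits.shape)  # torch.Size([4, 10])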
CNNs worked great for detecting objects in images but did not do as well for
language tasks such as summarizing or generating text. The recurrent neural
network (RNN) was introduced to address this for language tasks such as
translation, where an RNN takes a sentence in the source language and translates
it into the destination language one word at a time, generating the translation
sequentially. Sequential processing is important because word order matters for
the meaning of a sentence, as in “The dog chased the cat” versus “The cat chased
the dog”. RNNs had a few problems. They could not handle large sequences of text,
like long paragraphs. They were also not fast enough to train on large data
because they were sequential. The ability to train on large data is considered a
competitive advantage when it comes to NLP tasks because the models become better
tuned.
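To make the sequential bottleneck concrete, here is a minimal sketch, assuming PyTorch, of an RNN consuming a sentence one token at a time; the vocabulary, token ids and layer sizes are made up for illustration.

    # Minimal RNN sketch (assumes PyTorch); the hidden state is updated one token at a time.
    import torch
    import torch.nn as nn

    vocab_size, embed_dim, hidden_dim = 1000, 32, 64
    embed = nn.Embedding(vocab_size, embed_dim)
    cell = nn.RNNCell(embed_dim, hidden_dim)

    tokens = torch.tensor([4, 27, 311, 9])   # a toy encoded sentence
    state = torch.zeros(1, hidden_dim)       # initial hidden state

    # The loop is inherently sequential: step t cannot start until step t-1 finishes,
    # which is why RNNs are slow to train on long texts and hard to parallelize.
    for tok in tokens:
        state = cell(embed(tok).unsqueeze(0), state)

    print(state.shape)  # torch.Size([1, 64]) -- a summary of the whole sentence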
Transformers changed that and, in fact, were originally developed for
translation. Unlike RNNs, they can be parallelized, which means they can be
trained on very large data sets. GPT-3, which writes poetry and code and carries
on conversations, was trained on roughly 45 terabytes of text data, including a
large crawl of the world wide web. The architecture scales well with huge data
sets.
Transformers work very well because of three components: 1.
positional encoding, 2. attention and 3. self-attention. Positional encoding is
about enhancing the data with positional information rather than encoding word
order in the structure of the network. As we train the network on lots of text
data, the transformer learns to interpret those positional encodings. This made
transformers easier to train than RNNs. Attention refers to a concept that
originated in the paper aptly titled “Attention Is All You Need”. It is a
mechanism that allows a text model to look at every single word in the original
sentence when deciding how to translate a word in the output; a heat map of the
attention weights shows which source words each output word aligns with. While
attention is about the alignment between words, self-attention is about
understanding the underlying meaning of a word so it can be disambiguated from
other usages. This often involves an internal representation of the word, also
referred to as its state. When attention is directed over the input text, the
model can distinguish between, say, “Server, can I have the check?” and “I
crashed the server”, interpreting the reference as a human waiter versus a
machine. The context of the surrounding words informs this state. A small sketch
of positional encoding and self-attention follows.
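The sketch below, assuming NumPy, shows the standard sinusoidal positional encoding formula and a deliberately simplified self-attention step; it omits the learned query, key and value projections and the multiple heads of a real transformer.

    # Sketch of sinusoidal positional encoding and scaled dot-product self-attention
    # (assumes NumPy; shapes and values are illustrative, not a full transformer).
    import numpy as np

    def positional_encoding(seq_len, d_model):
        # Each position gets a unique pattern of sines and cosines that the
        # model can learn to interpret as word order.
        pos = np.arange(seq_len)[:, None]
        i = np.arange(d_model)[None, :]
        angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles[:, 0::2])
        pe[:, 1::2] = np.cos(angles[:, 1::2])
        return pe

    def self_attention(x):
        # Every token attends to every other token in one matrix multiply,
        # which is what makes transformers parallelizable. Queries, keys and
        # values are all taken to be the raw inputs here for brevity.
        d = x.shape[-1]
        scores = x @ x.T / np.sqrt(d)                  # token-to-token similarity
        scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
        weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
        return weights @ x                             # context-aware states

    seq_len, d_model = 6, 16                           # e.g. a 6-token sentence
    x = np.random.randn(seq_len, d_model) + positional_encoding(seq_len, d_model)
    print(self_attention(x).shape)                     # (6, 16)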
BERT, an NLP model, makes use of attention and can be used
for a variety of purposes such as text summarization, question answering,
classification and finding similar sentences. BERT also helps power Google
Search and Google Cloud AutoML Natural Language. Google has made BERT available
for download via TensorFlow, while the Hugging Face company has made
transformers available through its Python library.
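As a small usage example, assuming the Hugging Face transformers package is installed and the pretrained model can be downloaded, the following fills in a masked word with BERT; the sentence is only illustrative.

    # Small usage sketch of BERT via the Hugging Face transformers library
    # (assumes `pip install transformers` and network access for the model).
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model="bert-base-uncased")
    for prediction in unmasker("I crashed the [MASK] in the data center."):
        print(prediction["token_str"], round(prediction["score"], 3))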
The basis for the conjecture, namely to use stream processing for state
encoding rather than parallel batches, is derived from the online processing of
Big Data as discussed in: http://1drv.ms/1OM29ee
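To make the conjecture concrete, here is a minimal sketch, in Python with NumPy, of a state kept in summation form: each incoming token's encoding is added into a running state so the overall state accumulates in a streaming manner. The encoding function, dimensions and damping factor are placeholders chosen for illustration, not a claim about how ChatGPT actually maintains state.

    # Sketch of summation-form state accumulation over a token stream (assumes NumPy).
    import numpy as np

    d_model = 16
    rng = np.random.default_rng(0)
    token_vectors = {}                       # stand-in for a learned embedding table

    def encode(token):
        # Hypothetical per-token encoding; a real system would use learned embeddings.
        if token not in token_vectors:
            token_vectors[token] = rng.standard_normal(d_model)
        return token_vectors[token]

    state = np.zeros(d_model)                # overall state starts empty
    for position, token in enumerate(["the", "dog", "chased", "the", "cat"]):
        # Summation form: the new state is the old state plus the encoded contribution,
        # so text can keep streaming in without revisiting earlier tokens.
        state = state + encode(token) / (position + 1)   # damping is an arbitrary choice

    print(state[:4])                         # a continuously accumulated summary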