Using FastText word embeddings to improve text summarization
Word vectors are commonly used to represent the associations of a word with other words. The vector form is helpful for classification and regression tasks. Two popular word vector models are FastText and Word2Vec. FastText treats each word as composed of character n-grams, while Word2Vec treats the text as a bag of words. A character n-gram is a contiguous sequence of n characters from a given word. For example, the trigrams (n=3) of the word "where" are <wh, whe, her, ere, re>, where < and > mark the word boundaries. FastText includes the character n-grams as well as the word itself, so the input becomes <wh, whe, her, ere, re> plus the special sequence <where>.
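To make the decomposition concrete, here is a minimal sketch of how FastText-style character n-grams can be generated. The `char_ngrams` helper is hypothetical, written for illustration rather than taken from any library:

```python
def char_ngrams(word, n=3):
    """Generate FastText-style character n-grams for a single word."""
    # FastText wraps each word in boundary markers before slicing.
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

char_ngrams("where")
# ['<wh', 'whe', 'her', 'ere', 're>']
```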
Since the objective of text summarization is to surface the salient topic in a text, FastText is better suited than Word2Vec for summarizing an entire input down to as little as a single topic word. As with most machine learning models, training takes far more compute than running the trained model for prediction. A trained FastText model is lightweight enough to be hosted on a variety of devices.
Let us take an example of this extreme summarization with FastText, using the nessvec library.
```python
>>> from nessvec.indexers import Index
>>> index = Index(num_vecs=200_000)  # the default is 100_000
>>> index.extreme_summarize('hello and goodbye')
array(['hello'], dtype=object)
>>> index.query_series(index[0])
,      1.920930e-07
and    3.196178e-01
(      3.924445e-01
)      4.218287e-01
23     4.463376e-01
22     4.471740e-01
18     4.490819e-01
19     4.515444e-01
21     4.544248e-01
but    4.546938e-01
dtype: float64
```
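As a rough mental model of what an extreme summarizer can do (an assumption for illustration, not nessvec's documented algorithm), one way to pick a single topic word is to choose the input word whose vector lies closest to the centroid of all the input's word vectors. The `pick_topic_word` helper below is hypothetical and assumes `index[w]` returns a 1-D numpy vector:

```python
import numpy as np

def pick_topic_word(words, index):
    """Hypothetical sketch: return the word whose vector is nearest
    (by cosine similarity) to the centroid of the input vectors."""
    vecs = np.array([index[w] for w in words])        # one row per word
    centroid = vecs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)              # unit-normalize centroid
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = unit @ centroid                            # cosine similarity per word
    return words[int(np.argmax(sims))]

pick_topic_word(['hello', 'and', 'goodbye'], index)
```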
If we were to take the average of index['hello'] and index['goodbye'], the result would be closer to 'goodbye'. The same holds if we normalize the average:
```python
>>> import numpy as np
>>> avg = (index['hello'] + index['goodbye']) / 2
>>> index.query_series(avg / np.linalg.norm(avg))  # nearest word: 'goodbye'
```
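One way to check this claim directly is to compare the cosine similarity of the average against each input vector. This is a minimal sketch, assuming `index[...]` returns a 1-D numpy array; the `cos_sim` helper is hypothetical:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

avg = (index['hello'] + index['goodbye']) / 2
# Cosine similarity is scale-invariant, so normalizing avg does not
# change these two numbers. Normalization only affects the neighbor
# ranking if the index uses a scale-sensitive measure such as
# Euclidean distance (an assumption about nessvec's internals).
cos_sim(avg, index['hello']), cos_sim(avg, index['goodbye'])
```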
This suggests that numerical rounding and weighting can change the outcome of the extreme summarization. It is better not to impose arithmetic on the vectors and to use them merely for their latent semantics. A sample application is at https://booksonsoftware.com/text