Sunday, November 12, 2023

 

Using FastText word embeddings to improve text summarization.

Word vectors are commonly used to represent how words relate to other words. The vector form is convenient for classification and regression tasks. Two popular word-vector models are FastText and Word2Vec. FastText treats each word as a bag of character n-grams, while Word2Vec treats each word as a single atomic unit. A character n-gram is a contiguous sequence of n characters drawn from a word. For example, the trigrams (n=3) of the word "where" are <wh, whe, her, ere, re>, where < and > mark the word boundaries. FastText includes the character n-grams as well as the whole word itself, so the input for "where" is <wh, whe, her, ere, re> plus the special sequence <where>.
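To make this concrete, here is a minimal sketch (plain Python, not part of any FastText API) of how such boundary-marked character n-grams can be generated:

```python
def char_ngrams(word, n=3):
    """Generate FastText-style character n-grams for a word.

    The word is wrapped in boundary markers '<' and '>' so that
    prefixes and suffixes get their own distinct n-grams.
    """
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>']
```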

Since the objective of text summarization is to surface the salient topic of the text, FastText is better suited than Word2Vec to extreme summarization: reducing an entire input text to as little as a single topic word. As with most machine learning models, training takes far more compute than running the trained model for prediction. The FastText model is lightweight enough to be hosted on a wide variety of devices.

Let us walk through an example of this extreme summarization with FastText, using the nessvec library.

```python
from nessvec.indexers import Index

index = Index(num_vecs=200_000)  # the default is 100_000

index.extreme_summarize('hello and goodbye')
>>> array(['hello'], dtype=object)

index.query_series(index[0])
,      1.92093e-07
and    3.196178e-01
(      3.924445e-01
)      4.218287e-01
23     4.463376e-01
22     4.471740e-01
18     4.490819e-01
19     4.515444e-01
21     4.544248e-01
but    4.546938e-01
dtype: float64
```
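The `query_series` call returns a pandas Series of the vocabulary words nearest to a query vector, with their distances. The nessvec internals may differ, but the idea can be sketched directly with NumPy and pandas (the function name `query_nearest` here is hypothetical):

```python
import numpy as np
import pandas as pd

def query_nearest(query_vec, vocab, matrix, k=10):
    """Return the k vocabulary words nearest to query_vec, by cosine distance.

    vocab  : list of words, aligned row-for-row with matrix
    matrix : 2-D array of word vectors, one row per word
    """
    # Normalize each row and the query so dot products equal cosine similarity
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_query = query_vec / np.linalg.norm(query_vec)
    distances = 1.0 - unit_rows @ unit_query  # cosine distance, 0 = identical
    nearest = np.argsort(distances)[:k]
    return pd.Series(distances[nearest], index=np.array(vocab)[nearest])
```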

If we were to take the average of index['hello'] and index['goodbye'], the result would be closer to 'goodbye'.
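A quick check of that claim, assuming the `index` object from above (the exact neighbor list depends on the vocabulary loaded):

```python
# Average the two word vectors without rescaling
avg = (index['hello'] + index['goodbye']) / 2

index.query_series(avg)
>>> # the top neighbors skew toward 'goodbye'
```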

If we were to normalize the averaged vector, say with NumPy:

```python
import numpy as np

# Average the two word vectors, then rescale to unit length
avg = (index['hello'] + index['goodbye']) / 2

index.query_series(avg / np.linalg.norm(avg))
>>> # this would be closer to 'goodbye'
```

This suggests that numerical rounding and weighting can change the outcome of extreme summarization. It is better not to impose arithmetic on the vectors and to use them purely for their latent semantics. A sample application is available at https://booksonsoftware.com/text
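To see why, here is a self-contained sketch that compares the averaged vector against each word using cosine similarity (reusing `index['hello']` and `index['goodbye']` from above):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: dot product of the two unit-length vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

hello, goodbye = index['hello'], index['goodbye']
avg = (hello + goodbye) / 2

# Cosine similarity is scale-invariant, so rescaling avg does not change
# these two numbers in exact arithmetic.
print(cosine_similarity(avg, hello))
print(cosine_similarity(avg, goodbye))
```

If the index ranks neighbors by cosine distance, rescaling the average should not change the ranking in exact arithmetic; any observed shift comes from floating-point rounding, which is exactly the caution raised above.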
