Latent Semantic Analysis – Part 1

Objective

To build a Latent Semantic Analysis (LSA) model to find statistical synonyms of a word from a huge corpus.

The preliminary objective of this post is to build a Term Document Matrix (TDM), since it is a prerequisite for building the LSI model; so let's first see how to construct a TDM.

Overview:

In order to build the LSA model, also known as the Latent Semantic Indexing (LSI) model, we need to create a Term Document Matrix (TDM) from the corpus, which is then passed to Singular Value Decomposition (SVD) to obtain the U, S and V matrices.

The following steps have to be followed to create a TDM from a huge corpus.

1. Remove stop words (such as is, of, be, to, the etc.) from the corpus.

2. Apply a stemmer to each token in the corpus to get rid of inflections.

3. Construct a count matrix.

4. Modify the count matrix with TFIDF (Term Frequency – Inverse Document Frequency).

The resulting modified matrix is your TDM.

 

Initially I took 50 documents as my corpus. I placed all the documents in a single directory, which lets me use a simple loop in the program to read all the files in sequence.
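For illustration, a minimal Java sketch of this directory loop might look like the following (the directory path and the class/method names here are my own, not the exact code behind the link at the end of this post):

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class CorpusReader {
        // Read every file in the corpus directory, in a fixed order.
        public static List<String> readCorpus(String dirPath) throws IOException {
            File[] files = new File(dirPath).listFiles();
            Arrays.sort(files); // deterministic sequential order
            List<String> documents = new ArrayList<>();
            for (File f : files) {
                documents.add(new String(Files.readAllBytes(f.toPath())));
            }
            return documents;
        }
    }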

I broke all the files into tokens and removed the stop words from those tokens. To accomplish this task I manually collected a list of stop words and put it into an ArrayList. I then compared each token with the ArrayList; if the token matches any element of the ArrayList it is a stop word, so I ignore that particular token. Click here to see the list of stop words that I used.
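A minimal sketch of this ArrayList comparison (the stop-word list below is abbreviated for illustration; the full list is behind the link above):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class StopWordFilter {
        // Abbreviated stop-word list; the full list is linked above.
        private static final List<String> STOP_WORDS =
                new ArrayList<>(Arrays.asList("is", "of", "be", "to", "the", "a", "an", "and"));

        // Keep only tokens that do not match any stop word.
        public static List<String> removeStopWords(List<String> tokens) {
            List<String> kept = new ArrayList<>();
            for (String token : tokens) {
                if (!STOP_WORDS.contains(token.toLowerCase())) {
                    kept.add(token);
                }
            }
            return kept;
        }
    }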

After removing stop words from the corpus I applied a stemmer to the remaining tokens (those that are not stop words) to get their root words. To accomplish this task I implemented a simple stemmer by defining 22 + 24 different rules to get rid of inflections. I passed my tokens into the stemmer to get the root word of each token. Click here to see my simple stemmer implementation.
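To show the general shape of such a rule-based stemmer, here is a sketch with a handful of illustrative suffix rules; it is not my actual 22 + 24-rule implementation, which is behind the link above:

    public class SimpleStemmer {
        // A few illustrative suffix rules; the real stemmer has 22 + 24 rules.
        private static final String[][] RULES = {
                {"ies", "y"},  // e.g. "studies" -> "study"
                {"ing", ""},   // e.g. "walking" -> "walk"
                {"ed", ""},    // e.g. "jumped"  -> "jump"
                {"s", ""}      // e.g. "cats"    -> "cat"
        };

        public static String stem(String token) {
            for (String[] rule : RULES) {
                String suffix = rule[0];
                // Only strip when enough of a stem remains.
                if (token.endsWith(suffix) && token.length() > suffix.length() + 2) {
                    return token.substring(0, token.length() - suffix.length()) + rule[1];
                }
            }
            return token; // no rule matched; treat the token as a root word
        }
    }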

After stemming the tokens, it's time to construct the count matrix from the tokens and documents.

Assumption

I assume a count matrix whose rows correspond to tokens and whose columns correspond to documents. In programming terms, I am just going to create a 2D array in which each row is a token and each column is a document.

 

 

I planned to use Java to construct the 2D array, but a 2D array in Java does not support string indices. This means tokens and document names cannot be used directly as array indices:

         D1   D2   D3   …   Dn
    T1
    T2
    T3
    T4
    T5
    …
    Tn

This problem can be solved if we annotate the string values, i.e. convert them to unique numerical values. To accomplish this task I used the collections concept in Java. Before constructing the count matrix (2D array), immediately after the stemming process, I pushed all the tokens into the ArrayCollection; since the ArrayCollection does not allow duplicates, we get a unique token list, and each token's integer index acts as its annotated row value in the count matrix (a sketch follows the illustration below).

After annotation, the same matrix is indexed with integers:

          1    2    3    …    n
    1
    2
    3
    4
    5
    …
    n

As I said earlier, I placed all the documents in a single directory so that I can use a loop in the program to read the files in sequential order. The loop count therefore gives us the total number of documents.

The total number of documents determines the number of columns of the count matrix. Once we have the length of the ArrayCollection and the total number of documents, we can determine the size of the count matrix:

countMatrix[length of ArrayCollection][total number of documents]

After specifying the size I initialized the countMatrix to 0 (i.e. all values in countMatrix are set to 0).
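In Java the explicit zero pass is effectively free, because a newly allocated numeric array already holds zeros; a sketch (the class and method names are my own):

    public class MatrixAllocator {
        // Allocate the count matrix. Java zero-initializes numeric arrays,
        // so every cell already starts at 0.
        public static int[][] allocate(int numTokens, int numDocuments) {
            return new int[numTokens][numDocuments];
        }
    }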

          1    2    3    …    n
    1     0    0    0    0    0
    2     0    0    0    0    0
    3     0    0    0    0    0
    4     0    0    0    0    0
    5     0    0    0    0    0
    …     0    0    0    0    0
    n     0    0    0    0    0

Then I wrote simple Java code to compute the value of each cell. Each cell contains the number of times that word occurs in that document. For example, the word “market” (whose annotated row value is 5) appears 4 times in a particular document (whose annotated column value is 2). A sketch of this computation follows the example matrix below.

          1    2    3    …    n
    1     1    1    2    0    0
    2     1    1    1    0    0
    3     1    1    1    0    0
    4     3    0    0    1    1
    5     0    4    0    0    0
    …     1    0    0    0    0
    n     1    0    0    3    5
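A minimal sketch of that computation, assuming the tokenIndex list from the annotation step and one token list per document (variable names are illustrative):

    import java.util.List;

    public class CountMatrixBuilder {
        // countMatrix[i][j] = number of times token i occurs in document j.
        public static int[][] build(List<String> tokenIndex,
                                    List<List<String>> tokenizedDocs) {
            int[][] countMatrix = new int[tokenIndex.size()][tokenizedDocs.size()];
            for (int j = 0; j < tokenizedDocs.size(); j++) {
                for (String token : tokenizedDocs.get(j)) {
                    int i = tokenIndex.indexOf(token); // the token's annotated row value
                    if (i >= 0) {
                        countMatrix[i][j]++;
                    }
                }
            }
            return countMatrix;
        }
    }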

After constructing the countMatrix we need to apply a weight to every token in it. For this I used TFIDF (Term Frequency – Inverse Document Frequency):

TFIDF(i,j) = ( N(i,j) / N(*,j) ) * log( D / D(i) ), where

  • N(i,j) = the number of times word i appears in document j (the original cell count).
  • N(*,j) = the total number of words in document j (just add the counts in column j).
  • D = the number of documents (the number of columns).
  • D(i) = the number of documents in which word i appears (the number of non-zero cells in row i).
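A direct translation of this formula into Java, as a sketch (it assumes the integer count matrix built above and returns a new weighted matrix):

    public class Tfidf {
        // Apply TFIDF(i,j) = ( N(i,j) / N(*,j) ) * log( D / D(i) ) to every cell.
        public static double[][] apply(int[][] counts) {
            int rows = counts.length;     // number of tokens
            int cols = counts[0].length;  // number of documents, D
            double[][] tdm = new double[rows][cols];

            int[] colSums = new int[cols]; // N(*,j): total words in document j
            int[] docFreq = new int[rows]; // D(i): documents containing word i
            for (int i = 0; i < rows; i++) {
                for (int j = 0; j < cols; j++) {
                    colSums[j] += counts[i][j];
                    if (counts[i][j] > 0) docFreq[i]++;
                }
            }

            for (int i = 0; i < rows; i++) {
                for (int j = 0; j < cols; j++) {
                    if (counts[i][j] > 0) {
                        tdm[i][j] = ((double) counts[i][j] / colSums[j])
                                * Math.log((double) cols / docFreq[i]);
                    }
                }
            }
            return tdm;
        }
    }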

After applying the formula to every cell of the count matrix, the resulting matrix is the TERM DOCUMENT MATRIX.

Result

It has been observed that the number of keywords stays small when the corpus belongs to a single domain; if the corpus spans different domains, the list of keywords grows tremendously.

It has also been observed that nearly 2/3 of the corpus consists of stop words.


[Image: LSI1]

[Image: LSI2]

[Image: lsi3]

Resultant of TDM:

[Image: lsi4]

 

Click here to see the program code for constructing the TDM.

Thus we have finally constructed the TDM; in the next post let's see how this TDM acts as input to SVD to find the statistical synonyms of a word from a huge corpus.

Continue with : Latent Semantic Analysis – Part 2


7 Comments

  1. ganesh wrote
    at 5:50 PM - 31st January 2011 Permalink

    awesome dude——-
    ur work is very much helpful for me…

  2. sudheer wrote
    at 4:45 PM - 5th March 2011 Permalink

    hi,
    can u plz make clear ur assumption and declaration of row -column and vice versa , are different..

    countMatrix [length of ArrayCollection][ total number of documents]

    Also i have posted a question to ur mail did u check.. reply me..

  3. shakthydoss wrote
    at 11:55 AM - 6th March 2011 Permalink

    CountMatrix [ No.of.row ] [ No.of.column]

    No.of.row is nothing but no of words in all documents
    No.of.column is nothing but no of documents in corpus

  4. sudheer wrote
    at 9:12 AM - 15th March 2011 Permalink

    One more question i have to you is:

    Is that Number of words in all documents (in previous post ) include:

    1) terms remained after Stopword removal stemming (index words)?? or full document(s) tokens.

    2)the above index words resulted , are they assumed duplicated in index file u have?? or removed repeated words

    Waiting 4 reply
    Regards

  5. shakthydoss wrote
    at 10:47 AM - 15th March 2011 Permalink

    It is terms remaining after stop-words removing and stemming process.

    Index will not have duplications.

  6. TV TUAN wrote
    at 2:23 PM - 21st September 2011 Permalink

    You use tdidf (not tfifd), so it might work with word similarity, since you can somehow define the tdidf of the query Q for each document by couting occurences of all words in Q in that document ? (T,F ?), however, if you want to solve doc-similarity then tdidf will not fix the case. I think we need to nomalize both term frequency, not only doc frequency, how do we define the tfidf for document query Q?

  7. huangzy wrote
    at 5:32 PM - 22nd February 2012 Permalink

    I hava a question:

    In this article, you wrote the words followed:
    For example, the word “market” (whose annotated value is 2 in row) appears 4 times in a particular document (whose annotated value is 5) in column.

    4 should not be in the 2nd row, 5th column in the count matrix (the 4 in blue color is 5th row, 2nd column in your picture)? Or am I wrong?
