Plagiarism Detector

Objective:

The objective is to build a simple application that detects plagiarism among the documents in a corpus.


Assumption:

The cosine relation

$$\cos A = \frac{V \cdot W}{\|V\| \, \|W\|}$$

between the document vector coordinates V and W will help in finding the similarity values between the documents.

Observation and legends:

To build the model, create a Term Document Matrix (TDM) from the corpus and pass it to Singular Value Decomposition (SVD) to obtain the U, S, and V matrices.

The following steps create the TDM from a large corpus.

1. Remove stop words (such as is, of, be, to, the, etc.) from the corpus.

2. Apply a stemmer to each token in the corpus to get rid of inflections.

3. Construct a count matrix.

4. Modify the count matrix with TF-IDF (Term Frequency – Inverse Document Frequency) weights.

The resulting modified matrix is the TDM.
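Steps 3 and 4 can be sketched roughly as below. This is only an illustration, not the code behind the tool: the class and method names are hypothetical, rows are assumed to be terms and columns documents, and the weighting is assumed to be tf = count / document length and idf = ln(N / df), which is one common TF-IDF variant among several.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Sketch only: names and the exact TF-IDF variant are assumptions. */
final class TermDocumentMatrix {

    /** Sorted vocabulary over all documents; its order fixes the matrix rows. */
    static List<String> vocabulary(List<List<String>> docs) {
        TreeSet<String> terms = new TreeSet<>();
        for (List<String> doc : docs) {
            terms.addAll(doc);
        }
        return new ArrayList<>(terms);
    }

    /** counts[i][j] = number of times term i occurs in document j. */
    static double[][] countMatrix(List<String> vocab, List<List<String>> docs) {
        Map<String, Integer> rowOf = new HashMap<>();
        for (int i = 0; i < vocab.size(); i++) {
            rowOf.put(vocab.get(i), i);
        }
        double[][] counts = new double[vocab.size()][docs.size()];
        for (int j = 0; j < docs.size(); j++) {
            for (String token : docs.get(j)) {
                Integer i = rowOf.get(token);
                if (i != null) {
                    counts[i][j] += 1.0;
                }
            }
        }
        return counts;
    }

    /** Assumed weighting: (count / tokens in document) * ln(N / document frequency). */
    static double[][] tfIdf(double[][] counts) {
        int terms = counts.length;
        int docs = terms == 0 ? 0 : counts[0].length;
        double[] docLength = new double[docs];
        for (double[] row : counts) {
            for (int j = 0; j < docs; j++) {
                docLength[j] += row[j];
            }
        }
        double[][] weighted = new double[terms][docs];
        for (int i = 0; i < terms; i++) {
            int df = 0;
            for (int j = 0; j < docs; j++) {
                if (counts[i][j] > 0) {
                    df++;
                }
            }
            if (df == 0) {
                continue; // term never occurs; leave its row at zero
            }
            double idf = Math.log((double) docs / df);
            for (int j = 0; j < docs; j++) {
                if (docLength[j] > 0) {
                    weighted[i][j] = counts[i][j] / docLength[j] * idf;
                }
            }
        }
        return weighted;
    }
}
```

With this idf, a term that occurs in every document gets weight zero, so only terms that distinguish documents contribute to the later similarity computation.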

Initially I took 10 documents as my corpus. I placed all the documents in a single directory so that the program can read every file in sequence with a loop.

I broke all the files into tokens and removed the stop words from those tokens. To accomplish this, I manually collected a list of stop words and put it into an ArrayList. Then I compared each token with the ArrayList; if the token matches any element of the ArrayList, it is a stop word, so I ignore that token. Click here to see the list of stop words that I used.
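The tokenizing and filtering described above might look roughly like this. It is a sketch under assumptions: the class name is hypothetical, the stop-word list is a tiny illustrative subset rather than the full list used in the tool, tokens are assumed to be lowercase runs of letters and digits, and a HashSet is swapped in for the ArrayList because membership checks are faster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Sketch only: the stop-word list here is an illustrative subset. */
final class StopWordFilter {

    // Assumption: a HashSet replaces the ArrayList from the post for O(1) lookups.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("is", "of", "be", "to", "the"));

    /** Lowercases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Keeps only the tokens that are not in the stop-word set. */
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }
}
```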

After removing stop words from the corpus, I applied a stemmer to the remaining tokens (those that are not stop words) to get their root words. To accomplish this, I started by implementing a simple stemmer, defining 22 + 24 different rules to get rid of inflections. I passed my tokens to this stemmer to get the root word of each token. Click here to see my simple stemmer implementation.

After stemming the tokens, it's time to construct the count matrix from the tokens and documents.
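To give a feel for what a rule-based stemmer does, here is a very small suffix-stripping sketch. The five suffix rules and the minimum stem length are hypothetical and chosen only for illustration; they are not the 22 + 24 rules of the stemmer mentioned above.

```java
/** Sketch only: these suffix rules are hypothetical, not the author's rule set. */
final class SimpleStemmer {

    // Longer suffixes are listed before shorter ones so they are tried first.
    private static final String[] SUFFIXES = {"ing", "ed", "ly", "es", "s"};

    // Assumption: never strip a suffix if fewer than 3 characters would remain.
    private static final int MIN_STEM_LENGTH = 3;

    /** Strips the first matching suffix, or returns the token unchanged. */
    static String stem(String token) {
        for (String suffix : SUFFIXES) {
            int stemLength = token.length() - suffix.length();
            if (token.endsWith(suffix) && stemLength >= MIN_STEM_LENGTH) {
                return token.substring(0, stemLength);
            }
        }
        return token;
    }
}
```

For example, `SimpleStemmer.stem("walking")` and `SimpleStemmer.stem("walked")` both give `walk`, so the two inflections end up as the same keyword in the count matrix.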

I have already discussed constructing the count matrix and applying weights to it in Latent Semantic Analysis – Part 1 (take a look if needed).

Once we have built the TDM, we call upon a powerful but little-known technique called Singular Value Decomposition (SVD) to analyze the matrix for us. Passing the TDM to SVD gives us three matrices named U, S, and Vt.
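In symbols, writing A for the TDM (the symbol A, and the assumption that terms are rows and documents are columns, are mine and not from the tool), the decomposition and one useful consequence of it are:

$$A = U S V^{T}$$

$$A^{T} A = V S\, U^{T} U\, S V^{T} = (S V^{T})^{T} (S V^{T})$$

The second line uses the fact that U has orthonormal columns, so $U^{T} U = I$. It says that dot products between document columns of A equal dot products between the corresponding columns of $S V^{T}$, so documents can be compared in the smaller space spanned by the singular values.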

The Vt matrix corresponds to the document vector coordinates. For document-to-document similarity we compute the matrix S*Vt; the resulting matrix will help in finding plagiarism.
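Because S is diagonal, the product S*Vt just scales each row of Vt by the matching singular value. A minimal sketch, assuming the SVD has already been computed by some library and handed over as plain arrays (the class and method names are hypothetical, and column j is assumed to hold document j):

```java
/** Sketch only: assumes S is given as its diagonal and Vt as a row-major array. */
final class DocumentCoordinates {

    /** Computes S * Vt, where S is the diagonal matrix of the given singular values. */
    static double[][] scaleBySingularValues(double[] singularValues, double[][] vt) {
        if (singularValues.length != vt.length) {
            throw new IllegalArgumentException("need one singular value per row of Vt");
        }
        double[][] scaled = new double[vt.length][];
        for (int i = 0; i < vt.length; i++) {
            scaled[i] = new double[vt[i].length];
            for (int j = 0; j < vt[i].length; j++) {
                scaled[i][j] = singularValues[i] * vt[i][j];
            }
        }
        return scaled;
    }

    /** Reads column doc of S * Vt, taken here as the coordinates of one document. */
    static double[] documentVector(double[][] sTimesVt, int doc) {
        double[] v = new double[sTimesVt.length];
        for (int i = 0; i < v.length; i++) {
            v[i] = sTimesVt[i][doc];
        }
        return v;
    }
}
```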


The similarity is computed using the cosine relation:

$$\cos A = \frac{V \cdot W}{\|V\| \, \|W\|}$$


If the angle is between 0 and 90 degrees, then there exists a relation (some similarity) between the two document vectors; the smaller the angle, the more similar the documents.
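The cosine relation and the resulting angle can be computed along these lines. The class and method names are hypothetical; returning a cosine of 0 for an all-zero vector is my assumption (it treats such a document as unrelated to everything), and the clamp before `Math.acos` only guards against rounding pushing the value slightly outside [-1, 1].

```java
/** Sketch only: cosine of the angle between two document vectors, and the angle itself. */
final class CosineSimilarity {

    /** cos A = (V . W) / (||V|| ||W||). */
    static double cosine(double[] v, double[] w) {
        if (v.length != w.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
        double dot = 0.0;
        double vv = 0.0;
        double ww = 0.0;
        for (int i = 0; i < v.length; i++) {
            dot += v[i] * w[i];
            vv += v[i] * v[i];
            ww += w[i] * w[i];
        }
        if (vv == 0.0 || ww == 0.0) {
            return 0.0; // assumption: a zero vector is treated as unrelated
        }
        return dot / (Math.sqrt(vv) * Math.sqrt(ww));
    }

    /** Angle A in degrees; 0 means the vectors point in the same direction. */
    static double angleDegrees(double[] v, double[] w) {
        double c = Math.max(-1.0, Math.min(1.0, cosine(v, w)));
        return Math.toDegrees(Math.acos(c));
    }
}
```

Two vectors that differ only in length, such as (1, 2) and (2, 4), give an angle of essentially 0, so a document and a copy of it repeated twice would still be flagged as the same content.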

Illustration of my working model

From the specified directory, the model reads all files one by one, removes stop words, and then applies stemming to the remaining tokens. The stemmed tokens are taken as keywords. The count matrix is built from these keywords and then weighted. Finally, the resulting matrix is passed to SVD to obtain the Vt and S matrices, and the cosine relation is computed to find the similarity between the documents.

Step 1: Enter the project directory

Step 2: Click start

Step 3: If you want to detect only the documents under a threshold limit, give the limit as an angle (e.g. 0.0 or 10.0). Then click start.
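The threshold in Step 3 could be applied to the pairwise angles roughly as follows. This is a sketch, not the tool's code: the class name and the `angles` matrix (angle in degrees between document i and document j) are hypothetical, and the comparison is assumed to be inclusive so that a limit of 0.0 still reports exact copies.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: selects document pairs whose angle is within the limit. */
final class ThresholdFilter {

    /** Returns index pairs {i, j} with i < j whose angle is at most thresholdDegrees. */
    static List<int[]> pairsWithinThreshold(double[][] angles, double thresholdDegrees) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < angles.length; i++) {
            for (int j = i + 1; j < angles.length; j++) {
                // Assumption: inclusive comparison, so 0.0 matches identical documents.
                if (angles[i][j] <= thresholdDegrees) {
                    pairs.add(new int[] {i, j});
                }
            }
        }
        return pairs;
    }
}
```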


Download : Shakthydoss’s – Plagiarism Detector

Result

I am really happy with the accuracy. I tried the experiment with different sets of documents that have partially identical content. As expected, the angle between them was small but not zero. The angle was 0 for documents that have exactly the same content.



5 Comments

  1. Sudarsun Santhiappan wrote
    at 6:25 AM - 3rd November 2010 Permalink

    Good attempt. Keep experimenting more Shakthy to learn about LSI and vector space models.

  2. Shashank Yadav wrote
    at 9:06 AM - 14th September 2011 Permalink

    good going …

  3. coodapeagma wrote
    at 6:12 AM - 2nd October 2012 Permalink

    Good post.

  4. Paula wrote
    at 11:01 PM - 2nd November 2012 Permalink

    This really wowed..

  5. dinesh shan wrote
    at 6:05 AM - 10th September 2015 Permalink

    Good article Shakthy. Have you tried out the following scenario? Two students who are good at memorizing write exactly what is given in the textbook (to be precise, mugging up). In this case, I believe that the two documents would be detected as plagiarized. Is there any approach to weed out this type of behavior?
