Finding Document Similarity Using LSA

Laxmi K Soni


Finding document similarity is useful for many purposes, such as spam filtering. In academia, we usually assess by hand how strongly a course outcome relates to the program outcomes of its program. This post uses latent semantic indexing (LSI), also known as latent semantic analysis (LSA), to estimate how strongly a given course outcome relates to each of a set of program outcomes.

  • TF-IDF Model:

    TF-IDF extracts a weighted vector from each document using two quantities: term frequency (tf), which measures how often a term appears in a document, and inverse document frequency (idf), which down-weights terms that appear in many documents across the whole collection. The result measures how important each token is to a document. gensim's TfidfModel transforms a word-document co-occurrence matrix into a locally/globally weighted TF-IDF matrix.


The analysis proceeds in these steps:

  • Initialize the set of program outcomes
  • Normalize the sentences and words
  • Count the frequency of each word in the program outcomes
  • Build a dictionary that maps each normalized word to an integer id
  • Define the course outcome to be checked against the program outcomes for similarity
  • Create the bag-of-words model from the words in the program outcomes
  • Create the LSI model from the bag-of-words corpus and the dictionary
  • Create the similarity index between the course outcome and the program outcomes
  • Output the similarities as a pandas DataFrame

Import libraries

For this we will need the following imports:

import logging
from pprint import pprint
import pandas as pd
from gensim import corpora, models, similarities
import nltk
from nltk.corpus import stopwords
from collections import defaultdict
import numpy as np
programoutcomes = ["Apply knowledge of Computer Science, Mathematics and Physics to identify, analyse problems and to provide effective solutions.","Ability to design, develop algorithms and provide software solutions to cater the industrial needs","Inculcate skills to excel in the fields of Information Technology and its Enabled services, Government and Private sectors, Teaching and Research","Instil ethical responsibilities, human and professional values and make their contribution to the society","Engaged in lifelong learning to equip them to the changing environment and be prepared to take-up mastering programmes","Provides a systematic understanding of the concepts and theories of mathematics and computing their application in the software world.","Graduates will have necessary critical and analytical skills to resolve problem","They will attain eligibility to successfully pursue their career objectives in advanced education, scientific career in government or industry.","Understand the impact of scientific solutions in societal and environmental conpotexts, and demonstrate the knowledge of, and need for sustainable development."]
pprint( len( programoutcomes ))
## 9
# A small hand-picked stop list; named stoplist so it does not shadow the nltk stopwords import
stoplist = set( 'for of a the and to in'.split() )
powordlist = [[word for word in po.lower().split() if word not in stoplist] for po in programoutcomes]
## [['apply', 'knowledge', 'computer', 'science,', 'mathematics', 'physics', 'identify,', 'analyse', 'problems', 'provide', 'effective', 'solutions.'], ['ability', 'design,', 'develop', 'algorithms', 'provide', 'software', 'solutions', 'cater', 'industrial', 'needs'], ['inculcate', 'skills', 'excel', 'fields', 'information', 'technology', 'its', 'enabled', 'services,', 'government', 'private', 'sectors,', 'teaching', 'research'], ['instil', 'ethical', 'responsibilities,', 'human', 'professional', 'values', 'make', 'their', 'contribution', 'society'], ['engaged', 'lifelong', 'learning', 'equip', 'them', 'changing', 'environment', 'be', 'prepared', 'take-up', 'mastering', 'programmes'], ['provides', 'systematic', 'understanding', 'concepts', 'theories', 'mathematics', 'computing', 'their', 'application', 'software', 'world.'], ['graduates', 'will', 'have', 'necessary', 'critical', 'analytical', 'skills', 'resolve', 'problem'], ['they', 'will', 'attain', 'eligibility', 'successfully', 'pursue', 'their', 'career', 'objectives', 'advanced', 'education,', 'scientific', 'career', 'government', 'or', 'industry.'], ['understand', 'impact', 'scientific', 'solutions', 'societal', 'environmental', 'conpotexts,', 'demonstrate', 'knowledge', 'of,', 'need', 'sustainable', 'development.']]
powordfrequency = defaultdict( int )
for powordline in powordlist:
    for poword in powordline:
        powordfrequency[ poword ] += 1
powordlist1 =  [ [poword for poword in powordline if powordfrequency[poword] > 1] for powordline in powordlist]
## [['knowledge', 'mathematics', 'provide'], ['provide', 'software', 'solutions'], ['skills', 'government'], ['their'], [], ['mathematics', 'their', 'software'], ['will', 'skills'], ['will', 'their', 'career', 'scientific', 'career', 'government'], ['scientific', 'solutions', 'knowledge']]
corpdictionary = corpora.Dictionary( powordlist1 )
## {'knowledge': 0, 'mathematics': 1, 'provide': 2, 'software': 3, 'solutions': 4, 'government': 5, 'skills': 6, 'their': 7, 'will': 8, 'career': 9, 'scientific': 10}

Dictionary encapsulates the mapping between normalized words and their integer ids.
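
As a rough illustration of what the dictionary is used for, the sketch below mimics the bag-of-words conversion in plain Python. The helper name `doc2bow_sketch` is hypothetical and is not part of gensim; the `token2id` mapping is copied from the dictionary output in this post. Tokens that are missing from the dictionary are simply dropped, which is why most words of a new sentence can disappear.

```python
from collections import Counter

# Word-to-id mapping copied from the dictionary output in this post
token2id = {'knowledge': 0, 'mathematics': 1, 'provide': 2, 'software': 3,
            'solutions': 4, 'government': 5, 'skills': 6, 'their': 7,
            'will': 8, 'career': 9, 'scientific': 10}

def doc2bow_sketch(tokens, token2id):
    """Hypothetical stand-in for a bag-of-words conversion:
    count known tokens and return sorted (id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Filtered tokens of one program outcome (index 7 in this post)
po_tokens = ['will', 'their', 'career', 'scientific', 'career', 'government']
po_bow = doc2bow_sketch(po_tokens, token2id)

# A new sentence: only words already in the dictionary survive
co_tokens = 'Students will have have understanding on how build python development environment.'.lower().split()
co_bow = doc2bow_sketch(co_tokens, token2id)

print(po_bow)
print(co_bow)
```

Only the word "will" from the course outcome is in the dictionary, so its bag-of-words vector has a single entry.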

courseoutcome = 'Students will have have understanding on how build python development environment.'
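# Normalise the course outcome the same way as the program outcomes (lower-case, whitespace split)
covec = corpdictionary.doc2bow(courseoutcome.lower().split())
# doc2bow ignores words that are not in the dictionary, so the filtered lists give the same corpus
mycorpus = [ corpdictionary.doc2bow( powordline ) for powordline in powordlist1 ]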
##corpora.MmCorpus.serialize( './', mycorpus )
##corpora.SvmLightCorpus.serialize('./corpus1.svmlight', mycorpus)
tfidf = models.TfidfModel( mycorpus )
corp_tfidf = tfidf[ mycorpus ]
for d in corp_tfidf:
    print( d )
## [(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
## [(2, 0.5773502691896257), (3, 0.5773502691896257), (4, 0.5773502691896257)]
## [(5, 0.7071067811865476), (6, 0.7071067811865476)]
## [(7, 1.0)]
## []
## [(1, 0.6282580468670046), (3, 0.6282580468670046), (7, 0.45889394536615247)]
## [(6, 0.7071067811865476), (8, 0.7071067811865476)]
## [(5, 0.2878392791181426), (7, 0.21024434638691575), (8, 0.2878392791181426), (9, 0.840977385547663), (10, 0.2878392791181426)]
## [(0, 0.5773502691896257), (4, 0.5773502691896257), (10, 0.5773502691896257)]
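
The weights printed above can be reproduced by hand, which helps show what the TF-IDF transformation does. The sketch below is plain Python and assumes the weighting is the raw term count multiplied by log2(N / df), followed by scaling each document vector to unit length. That assumption reproduces the numbers above, but `tfidf_weights` is an illustrative helper, not the library's implementation. The token lists are the filtered lists shown earlier in this post.

```python
import math
from collections import Counter

# Filtered token lists, copied from the output earlier in this post
docs = [
    ['knowledge', 'mathematics', 'provide'],
    ['provide', 'software', 'solutions'],
    ['skills', 'government'],
    ['their'],
    [],
    ['mathematics', 'their', 'software'],
    ['will', 'skills'],
    ['will', 'their', 'career', 'scientific', 'career', 'government'],
    ['scientific', 'solutions', 'knowledge'],
]

def tfidf_weights(doc, docs):
    """Illustrative TF-IDF: count * log2(N / df), scaled to unit length."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for d in docs for term in set(d))
    raw = {term: count * math.log2(n_docs / doc_freq[term])
           for term, count in Counter(doc).items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()} if norm else {}

for doc in docs:
    print({term: round(w, 4) for term, w in tfidf_weights(doc, docs).items()})
```

In the eighth document, "career" occurs twice and appears in no other document, so it receives by far the largest weight. "their" occurs in three documents and is weighted lowest.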

Latent semantic analysis

Latent semantic analysis (LSA) is a technique in natural language processing, specifically distributional semantics, that analyses the relationships between a set of documents and the terms they contain by producing a set of concepts related to both. LSA assumes that words with similar meanings occur in similar pieces of text.
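
To make the idea concrete, here is a minimal pure-Python sketch of LSA on a toy corpus. Everything in it is an assumption made for illustration: the four-term vocabulary, the three tiny documents, and the helper names. Power iteration with deflation stands in for the SVD routine a real library would use. The point is that a query containing only "python" has no word in common with the document "programming software", yet the two end up close once both are projected onto the two strongest latent topics.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0

def top_eigenvectors(matrix, k, iterations=500):
    """Leading k eigenvectors of a symmetric matrix (power iteration with deflation)."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    vectors = []
    for _ in range(k):
        v = normalize([1.0] * n)
        for _ in range(iterations):
            v = normalize([dot(row, v) for row in work])
        eigenvalue = dot(v, [dot(row, v) for row in work])
        vectors.append(v)
        # Deflate: subtract the component just found so the next pass finds the next one
        work = [[work[i][j] - eigenvalue * v[i] * v[j] for j in range(n)]
                for i in range(n)]
    return vectors

# Toy term-count vectors (hypothetical). Columns: python, programming, software, ethics
docs = [
    [1, 1, 0, 0],  # "python programming"
    [0, 1, 1, 0],  # "programming software"
    [0, 0, 0, 2],  # "ethics ethics"
]
query = [1, 0, 0, 0]  # "python"

n_terms = len(docs[0])
# Term-term co-occurrence matrix; its eigenvectors are the left singular
# vectors of the term-document matrix, i.e. the latent topics.
cooccurrence = [[sum(d[i] * d[j] for d in docs) for j in range(n_terms)]
                for i in range(n_terms)]
topics = top_eigenvectors(cooccurrence, k=2)

def project(vec):
    """Coordinates of a term-count vector in the two-dimensional topic space."""
    return [dot(topic, vec) for topic in topics]

raw_similarity = cosine(query, docs[1])
lsa_similarity = cosine(project(query), project(docs[1]))
print(round(raw_similarity, 3), round(lsa_similarity, 3))
```

The raw cosine similarity is zero because the query and the document share no term. In topic space the similarity is close to one, because "python" and "software" both co-occur with "programming".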

Some applications of LSA

  • Information retrieval: find documents from a free-text or whole-document query, based on meaning rather than literal words
  • Text assessment: compare a document to documents of known quality or content
  • Automatic summarization: determine the subset of text (key words or best sentences) that best conveys the same meaning
  • Categorization / classification: place text into appropriate categories or taxonomies
  • Knowledge mapping: discover relationships between texts
lsi = models.LsiModel( mycorpus, id2word=corpdictionary, num_topics=2)
index = similarities.MatrixSimilarity( lsi[ mycorpus ] )
lsivec = lsi[ covec ]
sims = index[ lsivec ]
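copomap = []
for i, sim in enumerate( sims ):
    # One record per program outcome: the course outcome, the labelled program outcome, and their similarity
    copomap.append({
        'co' : courseoutcome,
        'po' : "po{0:01}".format(i) + ":" + programoutcomes[i],
        'similarity' : sim
    })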
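
The similarity, expressed as a percentage, is then mapped to a three-level rating:

  • 0: Indicates low similarity or low significance of the course outcome to the program outcome (0% or below)
  • 2: Indicates medium similarity or medium significance of the course outcome to the program outcome (between 0% and 50%)
  • 3: Indicates high similarity or high significance of the course outcome to the program outcome (between 50% and 100%)
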
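
The rating scale can also be written as a small pure function, which makes the thresholds explicit. This is only a sketch: the name `rate_similarity` is hypothetical, and treating exactly 50% (and 100%) as "high" is an assumption, since the scale does not say where those boundary values belong. The percentages are the integer similarities reported in this post.

```python
def rate_similarity(percent):
    """Map a similarity percentage to the 0 / 2 / 3 rating."""
    if percent <= 0:
        return 0  # low
    if percent < 50:
        return 2  # medium
    return 3      # high (assumption: 50 and above, including 100)

# Integer similarity percentages reported in this post, one per program outcome
percentages = [-19, -19, 99, 83, 0, 21, 99, 98, 17]
ratings = [rate_similarity(p) for p in percentages]
print(ratings)
```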
pd.set_option('display.max_columns', None)
cdf = pd.DataFrame(copomap)
cdf['similarity'] = cdf['similarity']*100
# Show the similarities as whole-number percentages
print( cdf['similarity'].astype(int) )
## 0   -19
## 1   -19
## 2    99
## 3    83
## 4     0
## 5    21
## 6    99
## 7    98
## 8    17
## Name: similarity, dtype: int32
# Map the percentage to the 0/2/3 rating; >= 50 is used so that values of exactly 50 or 100 are rated as well
cdf['similarity'] = np.where(cdf['similarity'] < 0 , 0, cdf['similarity'].astype(int) )
cdf['similarity'] = np.where( ((cdf['similarity'] > 0) & (cdf['similarity'] < 50)) , 2 , cdf['similarity'] )
cdf['similarity'] = np.where( cdf['similarity'] >= 50 , 3 , cdf['similarity'] )
print('Similarity of' , courseoutcome, 'with given  outcomes is')
print( cdf[ ['po', 'similarity'] ] )
## Similarity of Students will have have understanding on how build python development environment. with given  outcomes is
##                                                   po  similarity
## 0  po0:Apply knowledge of Computer Science, Mathe...           0
## 1  po1:Ability to design, develop algorithms and ...           0
## 2  po2:Inculcate skills to excel in the fields of...           3
## 3  po3:Instil ethical responsibilities, human and...           3
## 4  po4:Engaged in lifelong learning to equip them...           0
## 5  po5:Provides a systematic understanding of the...           2
## 6  po6:Graduates will have necessary critical and...           3
## 7  po7:They will attain eligibility to successful...           3
## 8  po8:Understand the impact of scientific soluti...           2
