You will modify the text analysis code so that it:
- Processes the words for better analysis
- Provides recommendations based on two literary works that you select
- Uses the proper title when referring to the literary works
Text Processing
Modify the script so that it does at least two of the following for better comparisons:
- Strip punctuation from each word
- Remove words with little meaning (e.g. “the”, “a”)
- Make all words lower case
- Employ a stemmer
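The four options above can be sketched in one helper. This is only an illustration: `clean_words`, `naive_stem`, and the tiny `STOP_WORDS` set are hypothetical names, not part of the provided script, and the stemmer is a crude suffix stripper rather than a real stemming algorithm such as Porter's.

```python
import string

# Assumed, deliberately tiny stop-word list; a real one would be longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

# ASCII punctuation plus curly quotes, which often appear in e-texts.
STRIP_CHARS = string.punctuation + "\u201c\u201d\u2018\u2019"

def naive_stem(word):
    """Crude suffix stripper for illustration only (not the Porter algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean_words(text):
    """Strip surrounding punctuation, lower-case, drop stop words, then stem."""
    cleaned = []
    for raw in text.split():
        word = raw.strip(STRIP_CHARS).lower()
        if word and word not in STOP_WORDS:
            cleaned.append(naive_stem(word))
    return cleaned

print(clean_words("The Whale, jumping!"))
```

In the script, a helper like this could replace the plain `data.split()` call so that the word counts are built from the cleaned words.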
Recommendation
Choose two literary works (e.g. Pride and Prejudice, Jane Eyre) and use them to recommend similar works (“If you like Pride and Prejudice, you will also like these works…”). Your program should go through the remaining works and list each one under whichever of the two choices it has the higher similarity index with. For example, if Dracula is more similar to Jane Eyre than to Pride and Prejudice, Dracula should be listed under Jane Eyre.
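One possible shape for the grouping step is sketched below. The names `recommend`, `demo_scores`, and `demo_sim` are hypothetical, the scores are made up purely to exercise the function, and the tie-breaking rule is an assumption because the assignment does not specify one.

```python
def recommend(sim, choice_a, choice_b, works):
    """Group each remaining work under whichever choice it is more similar to.

    sim is any function taking two work names and returning a similarity score.
    """
    groups = {choice_a: [], choice_b: []}
    for work in works:
        if work in (choice_a, choice_b):
            continue  # the two chosen works are not recommended to themselves
        # Assumption: ties go to choice_a.
        if sim(work, choice_a) >= sim(work, choice_b):
            groups[choice_a].append(work)
        else:
            groups[choice_b].append(work)
    return groups

# Made-up similarity scores, for demonstration only.
demo_scores = {
    ("dracula.txt", "jane_eyre.txt"): 0.93,
    ("dracula.txt", "pride_and_prejudice.txt"): 0.90,
    ("udolpho.txt", "jane_eyre.txt"): 0.88,
    ("udolpho.txt", "pride_and_prejudice.txt"): 0.91,
}

def demo_sim(work, choice):
    return demo_scores[(work, choice)]

demo_works = ["dracula.txt", "jane_eyre.txt",
              "pride_and_prejudice.txt", "udolpho.txt"]
demo_groups = recommend(demo_sim, "pride_and_prejudice.txt",
                        "jane_eyre.txt", demo_works)
for choice, similar in demo_groups.items():
    print("If you like {}, you will also like: {}".format(choice, similar))
```

With the provided script, `sim` could instead wrap the existing `similarity` function, for example by looking up both names in `doc_table` and passing `word_set`.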
Report by Titles
Modify your recommendation program so that it reports the titles of the works rather than their file names. To do this, write code that reads the titles.txt file and builds a dictionary mapping each file name to its title. Use this dictionary to report the works by title instead of by file name.
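A minimal sketch of the lookup dictionary follows. It assumes titles.txt alternates lines (a file name, then that work's title) and is UTF-8 encoded; `load_titles` is a hypothetical name, not part of the provided script.

```python
def load_titles(path="titles.txt"):
    """Return a dict mapping file name -> title.

    Assumes the file alternates lines: a file name, then that work's title.
    """
    with open(path, "r", encoding="utf8") as tfd:
        # Drop blank lines and trailing newlines before pairing.
        lines = [line.strip() for line in tfd if line.strip()]
    titles = {}
    # Walk the lines two at a time: (file name, title).
    for i in range(0, len(lines) - 1, 2):
        titles[lines[i]] = lines[i + 1]
    return titles
```

When printing, `titles.get(fname, fname)` falls back to the file name if a work has no entry in the dictionary.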
Code:
import os
import math

def count_word(table, word):
    'for the word entry in the table, increment its count or init to 1'
    if word in table:
        table[word] += 1
    else:
        # initialize count of word to 1
        table[word] = 1

def analyze():
    '''read all texts from the docs folder, report similarity comparisons
    among all pairs'''
    doc_table = dict()
    word_set = set()
    os.chdir('docs')
    fileList = os.listdir()
    for fname in fileList:
        print("Opening " + fname)
        fd = open(fname, "r", encoding="utf8")
        doc_table[fname] = dict()
        data = fd.read()
        print("splitting")
        dataList = data.split()
        print("{} has {} words".format(fname, len(dataList)))
        for word in dataList:
            word_set.add(word)
            count_word(doc_table[fname], word)
        fd.close()
    os.chdir('..')  # return to parent directory
    for fname in fileList:
        for fname2 in fileList:
            sim = similarity(doc_table[fname], doc_table[fname2], word_set)
            print("{:.2f} : {} vs. {}".format(sim, fname, fname2))

def build_title_file():
    "creates titles.txt based on works in the docs folder"
    # write UTF-8 so non-ASCII characters in titles (e.g. curly apostrophes) survive
    tfd = open("titles.txt", "w", encoding="utf8")
    os.chdir('docs')
    fileList = os.listdir()
    for fname in fileList:
        print("Opening " + fname)
        fd = open(fname, "r", encoding="utf8")
        for line in fd:
            if "Title: " in line:
                # one line for the file name, the next line for its title
                tfd.write(fname + "\n")
                tfd.write(line[7:])  # skip the leading "Title: " prefix
                break
        fd.close()
    os.chdir("..")  # return to parent directory
    tfd.close()

def similarity(tableA, tableB, words):
    'return cosine similarity between tableA and tableB over all words'
    ab = 0
    a2 = 0
    b2 = 0
    for w in words:
        ab += tableA.get(w, 0) * tableB.get(w, 0)
        a2 += tableA.get(w, 0) * tableA.get(w, 0)
        b2 += tableB.get(w, 0) * tableB.get(w, 0)
    return ab / (math.sqrt(a2) * math.sqrt(b2))

titles.txt file:
alice_in_wonderland.txt
Alice’s Adventures in Wonderland
dracula.txt
Dracula
frankenstein.txt
Frankenstein
jane_eyre.txt
Jane Eyre
moby_dick.txt
Moby Dick; or The Whale
pride_and_prejudice.txt
Pride and Prejudice
tale_of_two_cities.txt
A Tale of Two Cities
udolpho.txt
The Mysteries of Udolpho
wizard_of_oz.txt
The Wonderful Wizard of Oz