Project_TFIDF

Python

Use Python to complete this assignment. Please submit a PDF with your report. We recommend that you type your homework instead of writing by hand.
This project aims to help you apply what you have learned about text parsing, regex, and
tfidf to a variety of books. You are not allowed to use text processing packages
and should calculate tfidf with pandas/numpy. You will use Project Gutenberg for
this purpose. Project Gutenberg (https://www.gutenberg.org/) is a collection of more than 70,000 free ebooks that are
available in different formats. Multiple file types are provided for each book, but you will use
the plaintext .txt file for this project. However, you can review the HTML5 files to get
an idea of the content of the books.
In this project you will parse a number of documents to extract terms, and then use those
terms to calculate tfidf. You will use the tfidf scores to explore the documents, analyze
them, compare them, and extract new information about them. You are free to choose what
kind of analysis you want to do on the documents using tfidf. We recommend that you use
your creativity and choose the path that interests you the most. Here are two examples of the
kind of analysis that you can do:
• Compare books with different subjects. E.g. select 2 books on biology and 2 books
on law, and compare their key words
• Compare chapters within a long book to show the progression of subject. E.g. select
some biography books and separate the chapters, then compare the content of the
chapters
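For the chapter-based option, one way to separate chapters is to split the text on heading lines. The sketch below assumes headings of the form "CHAPTER I" or "CHAPTER 12" at the start of a line; the function name `split_chapters` and the pattern are illustrative and will need adjusting for each book (a table of contents that repeats the headings would also be split, so remove it first).

```python
import re

# Assumed heading format: "CHAPTER" followed by a Roman or Arabic numeral.
# Real books vary, so treat this pattern as a starting point only.
CHAPTER_HEADING = re.compile(r"^CHAPTER[ \t]+[IVXLC\d]+\b.*$", flags=re.MULTILINE)

def split_chapters(text):
    """Return the text of each chapter, without the heading lines."""
    parts = CHAPTER_HEADING.split(text)
    # parts[0] is whatever precedes the first heading (front matter).
    return [part.strip() for part in parts[1:] if part.strip()]

sample = "Preface\nCHAPTER I\nEarly life.\nCHAPTER II\nLater years.\n"
chapters = split_chapters(sample)
print(chapters)
```

Each element of `chapters` can then be treated as a separate document.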
Note that the overall size of the files you use must be at least 250 KB. Please
check in with the TAs if the books you are working on are too long or too short.
Based on your results and analysis, you will write a technical report. The language of
the report is important because it is intended to be read by someone who is familiar with the
data but lacks a deep understanding of it. Think of it as a report that you hand to your
boss or CEO at a Data Science job. You should therefore include a precise summary of the main
points of the report at its start.
For this project, you are required to perform a set of required general tasks. You are
encouraged to exceed these requirements and experiment with different ideas.

1 Parsing
Go through the text file of each book and extract a raw version of the document in which things
such as the bibliographic information, table of contents, licenses, dividers, etc. are deleted.
Lowercase all the words and delete all the punctuation marks using regex commands. Split
the words and store them in a list. This process is called tokenization. Optional: You can
research stemming and lemmatization, and use them too if you are interested. If you
need a package for stemming or lemmatization, you are free to use one, but tfidf itself
must be implemented with pandas.
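The cleaning and tokenization steps above can be sketched as follows. This is a minimal example, not a complete cleaner: it assumes the file contains the usual "*** START OF THE PROJECT GUTENBERG EBOOK ... ***" and "*** END OF THE PROJECT GUTENBERG EBOOK ..." marker lines, and the names `strip_gutenberg` and `tokenize` are illustrative. Tables of contents and dividers inside the body still have to be handled per book.

```python
import re

# Assumed marker lines; older files say "THIS" instead of "THE".
START = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)
END = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK", re.IGNORECASE
)

def strip_gutenberg(text):
    """Keep only the text between the start and end markers, if present."""
    start = START.search(text)
    end = END.search(text)
    begin = start.end() if start else 0
    stop = end.start() if end else len(text)
    return text[begin:stop]

def tokenize(text):
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    # Note: "don't" becomes ["don", "t"] and "well-known" becomes two tokens.
    cleaned = re.sub(r"[^\w\s]|_", " ", text.lower())
    return cleaned.split()

raw = (
    "Title: X\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK X ***\n"
    "Hello, World! Hello again.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK X ***\n"
    "License text"
)
body = strip_gutenberg(raw)
tokens = tokenize(body)
print(tokens)
```

Replacing punctuation with a space rather than deleting it is a design choice; deleting it outright would merge hyphenated words into one token instead.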
Using the tokenized list, create a word-document table in the form of a pandas dataframe.
Keep in mind that if you are analysing chapters, each chapter will be a separate document.
You will use this table and pandas functions to complete the next requirements.
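One possible layout for the word-document table is terms as rows and documents as columns, with raw counts as values. The sketch below uses two tiny made-up token lists; the names `docs` and `counts` are illustrative.

```python
import pandas as pd

# Toy token lists standing in for the tokenized books or chapters.
docs = {
    "doc_a": ["the", "cell", "divides", "the", "cell"],
    "doc_b": ["the", "court", "ruled"],
}

# One column of term counts per document; terms absent from a document get 0.
counts = pd.DataFrame(
    {name: pd.Series(tokens).value_counts() for name, tokens in docs.items()}
)
counts = counts.fillna(0).astype(int).sort_index()
print(counts)
```

The transposed layout (documents as rows) works equally well, as long as the later steps use the matching axis.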
Discuss any issues you faced for this task and how you solved them. Was the formatting
of the book challenging?
2 Vectorization
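For each document, create a word frequency vector. In other words, calculate
$\mathrm{tf}(t, d) = \frac{f(t, d)}{\sum_{t' \in d} f(t', d)}$, where $f(t, d)$ is the number of times term $t$ occurs in document $d$, for each term $t$ and document $d \in D$. Try to sort these values and explore your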
findings. Are you able to extract any information from these values?
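With a term-by-document count table, tf reduces to dividing each column by its total. A minimal sketch, assuming a small made-up `counts` table with terms as rows:

```python
import pandas as pd

# Made-up counts: rows are terms, columns are documents.
counts = pd.DataFrame(
    {"doc_a": [2, 2, 1, 0, 0], "doc_b": [1, 0, 0, 1, 1]},
    index=["the", "cell", "divides", "court", "ruled"],
)

# tf(t, d) = f(t, d) / sum over t' of f(t', d): divide each column by its sum.
tf = counts / counts.sum(axis=0)

# Most frequent terms in one document.
top_a = tf["doc_a"].sort_values(ascending=False).head(3)
print(top_a)
```

On real books the top of this ranking is usually dominated by function words such as "the", which is part of what the idf step is meant to correct.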
For each term, calculate the inverse document frequency $\mathrm{idf}(t, D) = \log\left(\frac{N}{1 + n_t}\right)$.
Sort these values and explore your findings. Are you able to extract any information from
these values?
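The idf formula can be computed from the same kind of count table. This sketch assumes the natural logarithm (the assignment does not state a base) and uses made-up counts.

```python
import numpy as np
import pandas as pd

# Made-up counts: rows are terms, columns are documents.
counts = pd.DataFrame(
    {"doc_a": [2, 2, 1, 0, 0], "doc_b": [1, 0, 0, 1, 1]},
    index=["the", "cell", "divides", "court", "ruled"],
)

n_docs = counts.shape[1]              # N: number of documents
doc_freq = (counts > 0).sum(axis=1)   # n_t: documents containing each term
idf = np.log(n_docs / (1 + doc_freq)) # natural log is an assumption
print(idf.sort_values())
```

Because of the 1 in the denominator, a term that occurs in every document gets a negative idf, and with only a few documents even rarer terms can land at zero. That behaviour is worth keeping in mind when interpreting the sorted values.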
3 TF-IDF
Calculate the value of $\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$ for each term and
document pair, where $N$ is the number of documents and $n_t$ is the number of documents that
contain the term $t$. Look at the highest values for each document. What kind of conclusion
can you derive from these values? Explore your findings. Try to explain the results and
use visualizations and tables as you see fit.
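Putting the two pieces together, tfidf is the tf table scaled row by row with the idf vector. The sketch below uses made-up counts for three documents and assumes the natural logarithm.

```python
import numpy as np
import pandas as pd

# Made-up counts: rows are terms, columns are documents.
counts = pd.DataFrame(
    {"doc_a": [3, 2, 0, 0], "doc_b": [2, 0, 2, 0], "doc_c": [1, 0, 0, 1]},
    index=["the", "cell", "court", "whale"],
)

tf = counts / counts.sum(axis=0)
idf = np.log(counts.shape[1] / (1 + (counts > 0).sum(axis=1)))

# Multiply every row of tf by the idf of its term.
tfidf = tf.mul(idf, axis=0)

# Highest-scoring term in each document.
top_terms = tfidf.idxmax(axis=0)
print(top_terms)
```

For a table or plot of the top k terms per document, `tfidf[col].nlargest(k)` gives the ranking for a single column.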
4 Exploration (open ended)
What are some other uses of tfidf? How can you build on your results to extract more
information? You can answer some of these questions, or propose your own:
Look at some other chapters and books from Project Gutenberg that you think might
relate to your selected books. Try to compare them with your initial documents. Can
you use tfidf to calculate how similar different documents are? Can you use this to place
documents into different groups? Try researching and using n-gram tfidf methods. How
are bigram or trigram results different from 1-gram tfidf?
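One common way to measure document similarity from tfidf vectors is cosine similarity, and n-grams can be built from the token list before counting. The sketch below is illustrative only: the tfidf values are made up, and the helper `ngrams` and the document names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up tfidf scores: rows are terms, columns are documents.
tfidf = pd.DataFrame(
    {"bio_1": [0.5, 0.2, 0.0], "bio_2": [0.4, 0.3, 0.0], "law_1": [0.0, 0.1, 0.6]},
    index=["cell", "species", "court"],
)

# Cosine similarity: normalise each document vector, then take dot products.
vectors = tfidf.to_numpy().T
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero documents
unit = vectors / norms
similarity = pd.DataFrame(unit @ unit.T, index=tfidf.columns, columns=tfidf.columns)
print(similarity.round(3))

def ngrams(tokens, n):
    """Join each run of n consecutive tokens into a single term."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(["the", "supreme", "court", "ruled"], 2)
print(bigrams)
```

The similarity matrix can feed a simple grouping (for example, pairing each document with its nearest neighbour), and the n-gram lists can go through the same counting, tf, and idf steps as single words.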

Tair W

Budget: 3000 UAH Deadline: 3 days

Hello
My name is Tair
I am a Python developer
I am a machine learning engineer
I have done lots of tasks on another platform
I am ready to start

Nataliia Muzhitska

Budget: 5000 UAH Deadline: 5 days

Hello.
Thanks for your proposal.
I am happy to help you and to provide my solution for your project.
If you are ready, we can discuss the details.

Rayan Goodwill
Chicago, United States
