Project_TFIDF
This project aims to help you apply what you have learned about text parsing, regex, and
tfidf to a variety of books. You are not allowed to use text processing packages
and should calculate tfidf with pandas/numpy. You will use Project Gutenberg for
this purpose. Project Gutenberg (https://www.gutenberg.org/) is a collection of more than 70,000 free ebooks that are
available in different formats. Multiple file types are provided for each book, but you will use
the plaintext .txt file for this project. However, you can review the HTML5 files to get
an idea of the content of the books.
In this project you will parse a number of documents to extract terms, and then use those
terms to calculate tfidf. You will use the tfidf scores to explore the documents, analyze
them, compare them, and extract new information about them. You are free to choose what
kind of analysis you want to do on the documents using tfidf. We recommend that you use
your creativity and choose a path that interests you the most. Here are two examples of the
kind of analysis that you can do:
• Compare books with different subjects. E.g. select 2 books on biology and 2 books
on law, and compare their key words
• Compare chapters within a long book to show the progression of subject. E.g. select
some biography books and separate the chapters, then compare the content of the
chapters
Please note that the overall size of the files you use must not be less than 250 KB.
Check in with the TAs if the books you are working on are too long or too short.
Based on your results and analysis, you will write a technical report. The language of
the report is important because it is intended to be read by someone who is familiar with the
data but lacks a deep understanding of it. Think of it as a report that you hand to your
boss or CEO at a Data Science job. You should therefore include a precise summary of the main
points of the report at its start.
For this project, you are required to perform a set of general tasks. You are encouraged
to exceed these requirements and experiment with different ideas.
Go through the text file of each book and extract a raw version of the document in which things
such as the bibliographic information, table of contents, licenses, dividers, etc. are deleted.
Lowercase all the words and delete all the punctuation marks using regex commands. Split
the words and store them in a list. This process is called tokenization. Optional: You can
research stemming and lemmatization, and use them too if you are interested. If you
need a package for stemming or lemmatization, you are free to use one, but the implementation
of tfidf must be with pandas.
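The cleaning and tokenization steps above can be sketched with the standard `re` module only. This is a minimal sketch, not a complete parser: it assumes the file uses the common `*** START OF ... ***` and `*** END OF ... ***` marker lines (some Project Gutenberg files word them differently), and the helper names `strip_gutenberg_boilerplate` and `tokenize` are hypothetical. Tables of contents and dividers inside the book body would still need book-specific handling.

```python
import re

# Assumption: the file contains "*** START OF ... ***" and "*** END OF ... ***"
# marker lines. The exact wording varies between Project Gutenberg files.
START_RE = re.compile(r"\*\*\*\s*START OF.*?\*\*\*", re.IGNORECASE | re.DOTALL)
END_RE = re.compile(r"\*\*\*\s*END OF.*?\*\*\*", re.IGNORECASE | re.DOTALL)


def strip_gutenberg_boilerplate(text):
    """Keep only the text between the start and end markers, if both are found."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    if start and end and start.end() < end.start():
        return text[start.end():end.start()]
    return text  # markers not found: return the text unchanged


def tokenize(text):
    """Lowercase, replace every non-letter character with a space, then split."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()


# Tiny made-up sample standing in for a downloaded .txt file.
sample = (
    "Title: Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "It was the best of times, it was the worst of times!\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text...\n"
)
tokens = tokenize(strip_gutenberg_boilerplate(sample))
print(tokens)
```

Note that the pattern `[^a-z\s]` also removes digits and accented letters and splits contractions such as "don't" into two tokens; whether that is acceptable depends on the books you choose.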
Using the tokenized list, create a word-document table in the form of a pandas dataframe.
Keep in mind that if you are analysing chapters, each chapter will be a separate document.
You will use this table and pandas functions to complete the next requirements.
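One possible layout for the word-document table is one row per term and one column per document, with raw counts as values. The sketch below assumes that layout; the document names and token lists are made up for illustration.

```python
from collections import Counter

import pandas as pd

# Hypothetical tokenized documents; in the project these come from the parsed books
# (or from individual chapters, if each chapter is treated as a document).
docs = {
    "doc_a": ["the", "cell", "divides", "the", "cell"],
    "doc_b": ["the", "court", "ruled"],
}

# Count the terms of each document, then align the counts on the union of all terms.
counts = pd.DataFrame({name: pd.Series(Counter(tokens)) for name, tokens in docs.items()})

# Terms missing from a document show up as NaN after alignment; they are zero counts.
counts = counts.fillna(0).astype(int).sort_index()
print(counts)
```

Keeping terms on the rows makes per-document operations (column sums, column sorting) and per-term operations (row-wise document counts) straightforward in the later steps.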
Discuss any issues you faced for this task and how you solved them. Was the formatting
of the book challenging?
2 Vectorization
For each document, create a word frequency vector. In other words, calculate
$\mathrm{tf}(t, d) = \frac{f(t, d)}{\sum_{t' \in d} f(t', d)}$ for each term $t$ and document $d \in D$. Try to sort these values and explore your
findings. Are you able to extract any information from these values?
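As a small illustration of the term frequency formula, the sketch below divides each document's raw counts by that document's total count. It assumes a term-by-document count table with documents as columns; the counts are made up.

```python
import pandas as pd

# Hypothetical term-by-document count table (rows: terms, columns: documents).
counts = pd.DataFrame(
    {"doc_a": [2, 1, 2, 0, 0], "doc_b": [0, 0, 1, 1, 1]},
    index=["cell", "divides", "the", "court", "ruled"],
)

# tf(t, d) = f(t, d) / sum over t' of f(t', d):
# divide every column by its own total, so each column sums to 1.
tf = counts / counts.sum(axis=0)

# Highest term frequencies in one document.
print(tf["doc_a"].sort_values(ascending=False).head(3))
```

On real books the top of this ranking is usually dominated by very common words, which is one reason the idf step follows.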
For each term, calculate the inverse document frequency $\mathrm{idf}(t, D) = \log\left(\frac{N}{1 + n_t}\right)$.
Sort these values and explore your findings. Are you able to extract any information from
these values?
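The idf formula can be sketched as follows, reading it as log(N / (1 + n_t)), with N the number of documents and n_t the number of documents containing the term. The natural logarithm is an assumption here, since the base is not specified, and the counts are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical term-by-document count table with three documents.
counts = pd.DataFrame(
    {
        "doc_a": [2, 2, 0, 0, 0],
        "doc_b": [1, 0, 1, 1, 0],
        "doc_c": [3, 0, 1, 0, 2],
    },
    index=["the", "cell", "court", "ruled", "whale"],
)

N = counts.shape[1]             # number of documents
n_t = (counts > 0).sum(axis=1)  # number of documents containing each term

# idf(t, D) = log(N / (1 + n_t)); natural log is an assumption.
idf = np.log(N / (1 + n_t))
print(idf.sort_values())
```

With this variant of the formula, a term that appears in every document gets a negative idf (here log(3/4) for "the"), and a term that appears in all but one document gets exactly zero. That is worth keeping in mind when you interpret the sorted values.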
3 TF-IDF
Calculate the value of $\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$ for each term and
document pair, where $N$ is the number of documents and $n_t$ is the number of documents
that contain the term $t$. Look at the highest values for each document. What kind of conclusions
can you draw from these values? Explore your findings. Try to explain the results and
use visualizations and tables as you see fit.
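Putting the two pieces together, tf-idf is the row-wise product of the tf table and the idf vector. The sketch below uses made-up counts and assumes the natural logarithm for idf; it then picks the highest-scoring term of each document.

```python
import numpy as np
import pandas as pd

# Hypothetical term-by-document count table with three documents.
counts = pd.DataFrame(
    {
        "doc_a": [2, 2, 0, 0, 0],
        "doc_b": [1, 0, 1, 1, 0],
        "doc_c": [3, 0, 1, 0, 2],
    },
    index=["the", "cell", "court", "ruled", "whale"],
)

tf = counts / counts.sum(axis=0)                                   # term frequency per document
idf = np.log(counts.shape[1] / (1 + (counts > 0).sum(axis=1)))     # one idf value per term

# Multiply every row of tf by the idf of that row's term.
tfidf = tf.mul(idf, axis=0)

# Highest-scoring term of each document.
top_terms = tfidf.idxmax()
print(top_terms)

# Top 3 terms of a single document, e.g. for a table in the report.
print(tfidf["doc_c"].nlargest(3))
```

In this toy example the common word "the" scores below zero in every document, while the terms unique to one document ("cell", "ruled", "whale") come out on top.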
4 Exploration (open ended)
What are some other uses of tfidf? How can you build on your results to extract more
information? You can answer some of these questions, or propose your own:
Look at some other chapters and books from Project Gutenberg that you think might
relate to your selected books. Try to compare them with your initial documents. Can
you use tfidf to calculate how similar different documents are? Can you use this to place
documents into different groups? Try researching and using n-gram tfidf methods. How
are bigram or trigram results different from 1-gram tfidf?
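Two of these ideas can be sketched briefly. Cosine similarity between tf-idf vectors is one common way to measure how similar two documents are (it is a suggestion here, not something the brief prescribes), and n-grams can be built from a token list before counting. The function names and the tf-idf values below are hypothetical.

```python
import numpy as np
import pandas as pd


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cosine_similarity_matrix(tfidf):
    """Cosine similarity between the document columns of a term-by-document table."""
    X = tfidf.to_numpy(dtype=float)
    norms = np.linalg.norm(X, axis=0)  # one Euclidean norm per document column
    norms[norms == 0] = 1.0            # avoid division by zero for all-zero documents
    X = X / norms
    return pd.DataFrame(X.T @ X, index=tfidf.columns, columns=tfidf.columns)


# Bigrams from a made-up token list.
print(ngrams(["the", "court", "ruled", "today"], 2))
# ['the court', 'court ruled', 'ruled today']

# Made-up tf-idf values: doc_a and doc_b point in the same direction, doc_c does not.
tfidf = pd.DataFrame(
    {"doc_a": [0.2, 0.0, 0.1], "doc_b": [0.1, 0.0, 0.05], "doc_c": [0.0, 0.3, 0.0]},
    index=["cell", "court", "whale"],
)
sim = cosine_similarity_matrix(tfidf)
print(sim.round(3))
```

The n-gram strings can be counted exactly like single words, so the same tf, idf, and tf-idf steps apply unchanged. The similarity matrix can then be used to group documents, for example by looking at which pairs score highest.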