4. Why are human languages complicated for a computer to understand? Explain.
Answer: Communication between machines is very basic and simple, whereas human communication is complex. There are multiple characteristics of human language that are easy for a human to understand but extremely difficult for a computer to process. Let us take a look at some of them here:
Arrangement of the words and meaning – Human languages follow rules: there are nouns, verbs, adverbs and adjectives, and the same word can act as a noun in one sentence and as an adjective in another (for example, "light" in "turn on the light" versus "a light bag"). This creates difficulty for computers while processing text.
Analogy with programming languages – Different syntax, same semantics: 2 + 3 = 3 + 2. The two statements are written differently, but their meaning is the same: both evaluate to 5. Same syntax, different semantics: 3/2 (Python 2.7) ≠ 3/2 (Python 3). The two statements have the same syntax, but their meanings differ: in Python 2.7 the / operator performs integer division, so 3/2 gives 1, while in Python 3 it performs true division and gives 1.5.
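This behaviour is easy to reproduce. A minimal check in Python 3, where the // operator mimics Python 2.7's integer division:

```python
# Different syntax, same semantics: both expressions evaluate to 5.
print(2 + 3, 3 + 2)   # 5 5

# Same syntax, different semantics across language versions:
print(3 / 2)          # 1.5 -> true division in Python 3
print(3 // 2)         # 1   -> floor division, what 3/2 meant in Python 2.7
```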
Multiple meanings of a word – In natural language, a word can have multiple meanings, and the intended meaning must be picked according to the context of the statement. For example, "mouse" may refer to the animal or to the pointing device depending on the sentence around it.
Perfect syntax, no meaning – Sometimes a statement can have perfectly correct syntax yet mean nothing; the classic example is "Colorless green ideas sleep furiously", which is grammatical but meaningless. In human language, a proper balance of syntax and semantics is important for understanding.
These are some of the challenges we might have to face if we try to teach computers how to understand and interact in human language.
5. What are the steps of text Normalization? Explain them in brief.
Answer: In text normalization, we undergo several steps to normalize the text to a lower level:
Sentence Segmentation – Under sentence segmentation, the whole corpus is divided into sentences. Each sentence is treated as a separate piece of data, so the corpus is reduced to a list of sentences.
Tokenisation – After segmenting the sentences, each sentence is further divided into tokens. A token is any word, number or special character occurring in a sentence. Under tokenisation, every word, number and special character is considered separately, and each of them becomes a separate token.
Removing Stop Words, Special Characters and Numbers – In this step, the tokens which are not necessary are removed from the token list. Stop words are words that occur very frequently but add little meaning, such as "and", "the" and "is".
Converting text to a common case – After stop word removal, we convert the whole text into a single case, preferably lower case. This ensures that the machine's case sensitivity does not treat the same word as two different words merely because of different cases.
Stemming – In this step, the remaining words are reduced to their root words. In other words, stemming is the process in which the affixes of words are removed and the words are converted to their base form. The stem may not be a meaningful word; for example, stripping the affix from "studies" gives "studi".
Lemmatization – Lemmatization also removes affixes, but it makes sure that the word we get after affix removal (known as the lemma) is a meaningful one; for example, "studies" is reduced to "study".
With this we have normalized our text down to tokens, the simplest form of words present in the corpus. Now it is time to convert the tokens into numbers; for this, we would use the Bag of Words algorithm. A minimal sketch of the whole pipeline follows below.
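The steps above can be strung together with any standard toolkit. The sketch below uses the NLTK library; the sample corpus, the English stop-word list and the choice of Porter stemmer are illustrative assumptions, not part of the original answer.

```python
# A minimal text-normalization pipeline using NLTK.
# Requires: pip install nltk, then nltk.download('punkt'),
# nltk.download('stopwords') and nltk.download('wordnet') once.
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

corpus = "We are going to Mumbai. Mumbai is a famous place."

# 1. Sentence segmentation: corpus -> sentences
sentences = sent_tokenize(corpus)

# 2. Tokenisation: sentences -> tokens
tokens = [tok for sent in sentences for tok in word_tokenize(sent)]

# 3. Remove stop words, special characters and numbers
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]

# 4. Convert to a common (lower) case
tokens = [t.lower() for t in tokens]

# 5. Stemming: strip affixes to get the root form
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# 6. Lemmatization: reduce to a meaningful lemma instead
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]

print(stems)
print(lemmas)
```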
6. Through a step-by-step process, calculate TFIDF for the given corpus and mention the word(s) having highest value.
- Document 1: We are going to Mumbai
- Document 2: Mumbai is a famous place.
- Document 3: We are going to a famous place.
- Document 4: I am famous in Mumbai.
Answer: Term Frequency
Term frequency is the frequency of a word in one document. It can easily be read from the document vector table, since that table records the frequency of each vocabulary word in each document.
|            | We | are | going | to | Mumbai | is | a | famous | place | I | am | in |
|------------|----|-----|-------|----|--------|----|---|--------|-------|---|----|----|
| Document 1 | 1  | 1   | 1     | 1  | 1      | 0  | 0 | 0      | 0     | 0 | 0  | 0  |
| Document 2 | 0  | 0   | 0     | 0  | 1      | 1  | 1 | 1      | 1     | 0 | 0  | 0  |
| Document 3 | 1  | 1   | 1     | 1  | 0      | 0  | 1 | 1      | 1     | 0 | 0  | 0  |
| Document 4 | 0  | 0   | 0     | 0  | 1      | 0  | 0 | 1      | 0     | 1 | 1  | 1  |
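For reference, the document vector table can also be built programmatically. This sketch uses scikit-learn's CountVectorizer (an assumption; the answer builds the table by hand), with the token pattern widened so one-letter words like "a" and "I" are kept:

```python
# Reproducing the document vector table with scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "We are going to Mumbai",
    "Mumbai is a famous place",
    "We are going to a famous place",
    "I am famous in Mumbai",
]

# The default token_pattern drops one-letter tokens, so widen it to keep "a" and "I".
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # column order is alphabetical
print(matrix.toarray())                    # one row per document, as in the table above
```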
Inverse Document Frequency
The other half of TFIDF is Inverse Document Frequency. For this, let us first understand what document frequency means: document frequency is the number of documents in which a word occurs, irrespective of how many times it occurs within those documents. The document frequency for the exemplar vocabulary is:
|                    | We | are | going | to | Mumbai | is | a | famous | place | I | am | in |
|--------------------|----|-----|-------|----|--------|----|---|--------|-------|---|----|----|
| Document frequency | 2  | 2   | 2     | 2  | 3      | 1  | 2 | 3      | 2     | 1 | 1  | 1  |
For inverse document frequency, we put the document frequency in the denominator and the total number of documents in the numerator. Here, the total number of documents is 4, hence the inverse document frequency becomes:
|                            | We  | are | going | to  | Mumbai | is  | a   | famous | place | I   | am  | in  |
|----------------------------|-----|-----|-------|-----|--------|-----|-----|--------|-------|-----|-----|-----|
| Inverse document frequency | 4/2 | 4/2 | 4/2   | 4/2 | 4/3    | 4/1 | 4/2 | 4/3    | 4/2   | 4/1 | 4/1 | 4/1 |
The formula of TFIDF for any word W is:
TFIDF(W) = TF(W) * log(IDF(W))
where IDF(W) is the ratio computed above, i.e. (total number of documents) ÷ (document frequency of W).
Since every word present in a document has TF = 1 here, the TFIDF value is driven entirely by the IDF term. The words occurring in only one document get the highest value, 1 × log(4/1) ≈ 0.602, so the words having the highest TFIDF value are: is, I, am, in. (Mumbai and famous occur in the most documents, so they get the lowest non-zero value, log(4/3) ≈ 0.125.)
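The whole calculation can be checked in a few lines of plain Python. The use of log base 10 is an assumption; any base preserves the ranking:

```python
import math

documents = [
    "We are going to Mumbai",
    "Mumbai is a famous place",
    "We are going to a famous place",
    "I am famous in Mumbai",
]
docs = [d.lower().split() for d in documents]
N = len(docs)  # total number of documents = 4

# Document frequency: how many documents contain each word.
vocab = sorted({w for d in docs for w in d})
df = {w: sum(1 for d in docs if w in d) for w in vocab}

# TFIDF(W) = TF(W) * log(N / DF(W)), computed per document.
for i, d in enumerate(docs, start=1):
    scores = {w: d.count(w) * math.log10(N / df[w]) for w in set(d)}
    print(f"Document {i}:", {w: round(s, 3) for w, s in scores.items()})
```

Running it shows is, I, am and in scoring ≈ 0.602 in their documents, while Mumbai and famous score only ≈ 0.125.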
7. Normalize the given text and comment on the vocabulary before and after the normalization:
Raj and Vijay are best friends. They play together with other friends. Raj likes to play football but Vijay prefers to play online games. Raj wants to be a footballer. Vijay wants to become an online gamer.
Answer: Normalization of the given text:
Sentence Segmentation:
- 1. Raj and Vijay are best friends.
- 2. They play together with other friends.
- 3. Raj likes to play football but Vijay prefers to play online games.
- 4. Raj wants to be a footballer.
- 5. Vijay wants to become an online gamer.
Tokenization:
| Sentence | Tokens |
|----------|--------|
| Raj and Vijay are best friends. | Raj, and, Vijay, are, best, friends, . |
| They play together with other friends. | They, play, together, with, other, friends, . |
The same will be done for all the sentences.
Removing Stop Words, Special Characters and Numbers:
In this step, the tokens which are not necessary are removed from the token list. So the words and, are, to, an and the punctuation marks (.) will be removed.
Converting text to a common case:
After stop word removal, we convert the whole text into a common case, preferably lower case.
Here the same word never appears in two different cases, so this step does not change the given text.
Stemming:
In this step, the remaining words are reduced to their root words. In other words, stemming is the process in which the affixes of words are removed and the words are converted to their base form.
| Word    | Affixes | Stem   |
|---------|---------|--------|
| likes   | -s      | like   |
| prefers | -s      | prefer |
| wants   | -s      | want   |
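These stems can be cross-checked with a stock stemmer. A minimal check using NLTK's Porter stemmer (the choice of stemmer is an assumption; the answer itself only requires stripping the -s affix):

```python
# Verifying the stems in the table above with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["likes", "prefers", "wants"]:
    print(word, "->", stemmer.stem(word))  # likes -> like, prefers -> prefer, wants -> want
```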
In the given text, lemmatization is not required, since the stems obtained above are already meaningful words.
Given Text
Raj and Vijay are best friends. They play together with other friends. Raj likes to play football but Vijay prefers to play online games. Raj wants to be a footballer. Vijay wants to become an online gamer.
Normalized Text
Raj Vijay best friends They play together with other friends Raj like play football but Vijay prefer play online games Raj want be a footballer Vijay want become online gamer
Comment on the vocabulary: before normalization the text contains 25 distinct words plus punctuation; after removing the stop words (and, are, to, an) and the punctuation, and stemming likes, prefers and wants, the vocabulary shrinks to 21 distinct words.
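A quick sanity check of those vocabulary counts (a throwaway sketch; stripping the trailing "." is a simplification of full tokenization):

```python
# Counting distinct words before and after normalization.
given = ("Raj and Vijay are best friends. They play together with other friends. "
         "Raj likes to play football but Vijay prefers to play online games. "
         "Raj wants to be a footballer. Vijay wants to become an online gamer.")
normalized = ("Raj Vijay best friends They play together with other friends "
              "Raj like play football but Vijay prefer play online games "
              "Raj want be a footballer Vijay want become online gamer")

before = {w.strip(".") for w in given.split()}
after = set(normalized.split())
print(len(before), len(after))  # 25 21
```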