Explaining one-hot encoding in a simple and easy-to-understand way
Written by
Ala GARBAA
Full Stack AI Developer & Software Engineer
One-hot encoding maps each unique word to a vector of length V, where V is the vocabulary size (the number of unique words in your text). Each vector contains a single 1, with every other entry 0. Turning words into vectors like this is called vectorization.
Each word is represented by a long list of zeros.
The list is as long as the vocabulary.
You put a '1' in the spot that matches the word's position in the vocabulary.
For example, if your vocabulary is [cat, dog, fish] and the word is "dog," its one-hot vector would be [0, 1, 0].
Let’s take a look at how this works in code:
```python
import numpy as np

def one_hot_encoding(sentence):
    # Tokenize: lowercase the sentence and split on whitespace
    words = sentence.lower().split()
    # The vocabulary is the sorted set of unique words
    vocabulary = sorted(set(words))
    # Map each word to its column index in the matrix
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    # One row per word in the sentence, one column per vocabulary entry
    one_hot_matrix = np.zeros((len(words), len(vocabulary)), dtype=int)
    for i, word in enumerate(words):
        one_hot_matrix[i, word_to_index[word]] = 1
    return one_hot_matrix, vocabulary
```
Let’s look at a specific example:
```python
sentence = "Should we prioritize front-end or back-end development?"
one_hot_matrix, vocabulary = one_hot_encoding(sentence)
print("Vocabulary:", vocabulary)
print("One-Hot Encoding Matrix:\n", one_hot_matrix)
```
Output:
```
Vocabulary: ['back-end', 'development?', 'front-end', 'or', 'prioritize', 'should', 'we']
One-Hot Encoding Matrix:
 [[0 0 0 0 0 1 0]
 [0 0 0 0 0 0 1]
 [0 0 0 0 1 0 0]
 [0 0 1 0 0 0 0]
 [0 0 0 1 0 0 0]
 [1 0 0 0 0 0 0]
 [0 1 0 0 0 0 0]]
```
But one-hot encoding has problems:

- It doesn't capture meaning. The vectors for "dog" and "cat" are completely different, even though the words are related, so there is no way to measure similarity between words.
- The vectors are very long and mostly empty. A large vocabulary means each word's vector is huge, which takes up a lot of memory.
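To see why similarity can't be measured, here is a minimal sketch (the vocabulary [cat, dog, fish] is the same toy example used above): the dot product between any two distinct one-hot vectors is always zero, so cosine similarity can never reflect that "cat" and "dog" are related.

```python
import numpy as np

# One-hot vectors over the vocabulary [cat, dog, fish]
cat = np.array([1, 0, 0])
dog = np.array([0, 1, 0])

# Distinct one-hot vectors are orthogonal: their dot product is 0
print(np.dot(cat, dog))  # 0

# A vector is only ever similar to itself
print(np.dot(cat, cat))  # 1
```

Every pair of different words scores 0, regardless of how related they are in meaning.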
This also means different sentences produce different vocabularies and matrices, and the matrix size grows with the number of unique words.
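A quick sketch of this growth, using the same lowercase-and-split tokenization as the one_hot_encoding function above (the two example sentences here are made up for illustration):

```python
# Two different sentences produce different vocabularies,
# so their one-hot matrices have different numbers of columns.
s1 = "the cat sat"
s2 = "the dog chased the cat around the house"

vocab1 = sorted(set(s1.lower().split()))
vocab2 = sorted(set(s2.lower().split()))

print(len(vocab1))  # 3 columns
print(len(vocab2))  # 6 columns
```

The matrices built from these two sentences have incompatible shapes, so vectors from one cannot be compared directly with vectors from the other.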