Explaining one-hot encoding in a simple and easy-to-understand way
Written by
Ala GARBAA
Full Stack AI Developer & Software Engineer
One-hot encoding maps each unique word to a vector of length V, where V is the vocabulary size (the number of unique words in your text). Each vector contains a single 1, with every other entry 0. Turning words into vectors like this is called vectorization.
Each word is represented by a long list of zeros.
The list is as long as the vocabulary.
You put a '1' in the spot that matches the word's position in the vocabulary.
For example, if your vocabulary is [cat, dog, fish] and the word is "dog," its one-hot vector would be [0, 1, 0].
Let’s take a look at how this works in code:
```python
import numpy as np

def one_hot_encoding(sentence):
    # Tokenize: lowercase the sentence and split on whitespace
    words = sentence.lower().split()
    # The vocabulary is the sorted set of unique words
    vocabulary = sorted(set(words))
    # Map each word to its column index in the matrix
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    # One row per word in the sentence, one column per vocabulary entry
    one_hot_matrix = np.zeros((len(words), len(vocabulary)), dtype=int)
    for i, word in enumerate(words):
        one_hot_matrix[i, word_to_index[word]] = 1
    return one_hot_matrix, vocabulary
```
Let’s look at a specific example:
```python
sentence = "Should we prioritize front-end or back-end development?"
one_hot_matrix, vocabulary = one_hot_encoding(sentence)
print("Vocabulary:", vocabulary)
print("One-Hot Encoding Matrix:\n", one_hot_matrix)
```
Output:
```
Vocabulary: ['back-end', 'development?', 'front-end', 'or', 'prioritize', 'should', 'we']
One-Hot Encoding Matrix:
 [[0 0 0 0 0 1 0]
 [0 0 0 0 0 0 1]
 [0 0 0 0 1 0 0]
 [0 0 1 0 0 0 0]
 [0 0 0 1 0 0 0]
 [1 0 0 0 0 0 0]
 [0 1 0 0 0 0 0]]
```
But one-hot encoding has problems:

- It doesn't capture meaning. The vectors for "dog" and "cat" are completely different, even though the words are related, so there is no way to measure similarity between words.
- The vectors are very long and mostly empty. A large vocabulary means each word's vector is huge, which takes up a lot of memory.
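To see why similarity can't be measured, here is a minimal sketch (the vocabulary [cat, dog, fish] is the same toy example used above): the dot product between any two distinct one-hot vectors is always zero, so cosine similarity can never reflect that "cat" and "dog" are related.

```python
import numpy as np

# One-hot vectors over the vocabulary [cat, dog, fish]
cat = np.array([1, 0, 0])
dog = np.array([0, 1, 0])

# Distinct one-hot vectors are orthogonal: their dot product is 0
print(np.dot(cat, dog))  # 0

# A vector is only ever similar to itself
print(np.dot(cat, cat))  # 1
```

Every pair of different words scores 0, regardless of how related they are in meaning.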
This also means different sentences produce different vocabularies and matrices, and the matrix size grows with the number of unique words.
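A quick sketch of this growth, using the same lowercase-and-split tokenization as the one_hot_encoding function above (the two example sentences here are made up for illustration):

```python
# Two different sentences produce different vocabularies,
# so their one-hot matrices have different numbers of columns.
s1 = "the cat sat"
s2 = "the dog chased the cat around the house"

vocab1 = sorted(set(s1.lower().split()))
vocab2 = sorted(set(s2.lower().split()))

print(len(vocab1))  # 3 columns
print(len(vocab2))  # 6 columns
```

The matrices built from these two sentences have incompatible shapes, so vectors from one cannot be compared directly with vectors from the other.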