Artificial intelligence

Bag-of-Words (BoW): From Text to Frequency Vectors (Simple Word Counts)


Written by Ala GARBAA, Full Stack AI Developer & Software Engineer

Bag-of-words (BoW) turns each document into a vector of word counts.

That means we represent a document as a list of numbers.

It works by counting how many times each word from a vocabulary appears in that document.

It's called a "bag" because it ignores word order and sentence structure, just focusing on the word counts.

In other words, it treats text like a bag of loose items. It builds a vocabulary from the corpus; in the resulting matrix, each row is a document and each column is a word.

Because the representation keeps frequency information, similar documents tend to have similar count patterns: documents that share many words likely mean similar things.
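To make "similar frequency patterns" concrete, here is a minimal sketch (the example sentences and the `cosine_similarity` helper are illustrative, not from the article's code) comparing word-count vectors with cosine similarity:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    # Each Counter acts as a sparse word-count vector
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

d1 = Counter("python is easy to learn".split())
d2 = Counter("python is fun to use".split())
d3 = Counter("debugging code can be frustrating".split())

print(cosine_similarity(d1, d2))  # 0.6 -> three shared words
print(cosine_similarity(d1, d3))  # 0.0 -> no overlap at all
```

Documents sharing words score high; documents with no overlap score zero.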

Example: the sentence "Python is easy to learn and fun to use" counts "to" 2 times, and "python", "is", "and", "easy", "learn", "fun", "use" 1 time each. No order is preserved.
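Those counts drop straight out of Python's standard library; a quick check with collections.Counter:

```python
from collections import Counter

sentence = "Python is easy to learn and fun to use"
# Lowercase and split on whitespace, then count each token
counts = Counter(sentence.lower().split())

print(counts["to"])      # 2
print(counts["python"])  # 1
print(len(counts))       # 8 distinct words
```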

We can code this in two steps. First, the bag_of_words function:

import numpy as np

def bag_of_words(sentences):
    # Tokenize: lowercase each sentence and split on whitespace
    tokenized_sentences = [sentence.lower().split() for sentence in sentences]
    # Build a sorted vocabulary of unique words across the corpus
    flat_words = [word for sublist in tokenized_sentences for word in sublist]
    vocabulary = sorted(set(flat_words))
    word_to_index = {word: i for i, word in enumerate(vocabulary)}

    # One row per document, one column per vocabulary word
    bow_matrix = np.zeros((len(sentences), len(vocabulary)), dtype=int)
    for i, sentence in enumerate(tokenized_sentences):
        for word in sentence:
            if word in word_to_index:
                bow_matrix[i, word_to_index[word]] += 1

    return vocabulary, bow_matrix

Then we apply it to a small corpus:

corpus = [
    "Python is easy to learn and fun to use",
    "I do not like bugs in my code",
    "Learning programming takes practice and patience",
    "Debugging code can be frustrating but rewarding",
    "Writing clean code makes projects easier to maintain"
]
vocabulary, bow_matrix = bag_of_words(corpus)
print("Vocabulary:", vocabulary)
print("Bag of Words Matrix:\n", bow_matrix)

But this has problems: vectors become huge and sparse (mostly zeros) as the vocabulary grows (e.g., a 200k-word vocabulary means 200k columns per document).

This eats memory, slows computation, and runs into the curse of dimensionality: as the number of features grows, distances between points lose meaning (more features = less meaningful comparisons).

It also raises the risk of overfitting: the model memorizes the training data and fails to generalize and make accurate predictions on new data.


Fixes:
Skip punctuation, stem words, and drop common stop words like "the" or "and."
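A minimal sketch of that cleanup, assuming a small hand-picked stop-word list (real pipelines use larger lists, e.g. NLTK's, plus a proper stemmer):

```python
import string

# Hand-picked stop words for illustration only
STOP_WORDS = {"the", "and", "is", "to", "a", "of", "in"}

def preprocess(sentence):
    # Strip punctuation, lowercase, then drop stop words
    cleaned = sentence.translate(str.maketrans("", "", string.punctuation))
    return [w for w in cleaned.lower().split() if w not in STOP_WORDS]

print(preprocess("Python is easy to learn, and fun to use!"))
# ['python', 'easy', 'learn', 'fun', 'use']
```

Feeding these cleaned tokens into bag_of_words shrinks the vocabulary and the resulting vectors.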

Process:

  • Split (Tokenize) documents into words
  • Build vocabulary of unique words (word-to-index map)
  • Create matrix: rows = documents, columns = words, values = counts

Important limitation:
It loses context, and context matters. The word "awesome" could be positive, neutral, or negative depending on the sentence; frequency alone doesn't tell you sentiment. BoW is an improvement on one-hot encoding because it tracks word frequency, but it still fails to capture the meaning and order of words.
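The one-hot comparison can be made concrete: a presence/absence vector (one-hot style) cannot distinguish a word used once from a word used repeatedly, while BoW counts can. A small sketch, restricting the vocabulary to the article's first sentence for brevity:

```python
vocab = ["and", "easy", "fun", "is", "learn", "python", "to", "use"]
doc = "python is easy to learn and fun to use".split()

one_hot = [1 if w in doc else 0 for w in vocab]  # presence only
bow = [doc.count(w) for w in vocab]              # actual frequencies

print(one_hot)  # [1, 1, 1, 1, 1, 1, 1, 1] -> "to" looks like every other word
print(bow)      # [1, 1, 1, 1, 1, 1, 2, 1] -> "to" appears twice
```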

Step-by-Step Execution

1. Tokenization & Lowercasing

Each sentence → split into lowercase words:

[
  ['python', 'is', 'easy', 'to', 'learn', 'and', 'fun', 'to', 'use'],
  ['i', 'do', 'not', 'like', 'bugs', 'in', 'my', 'code'],
  ['learning', 'programming', 'takes', 'practice', 'and', 'patience'],
  ['debugging', 'code', 'can', 'be', 'frustrating', 'but', 'rewarding'],
  ['writing', 'clean', 'code', 'makes', 'projects', 'easier', 'to', 'maintain']
]

2. Build Vocabulary

Flatten all words → unique → sort:

vocabulary = [
    'and', 'be', 'bugs', 'but', 'can', 'clean', 'code', 'debugging',
    'do', 'easier', 'easy', 'frustrating', 'fun', 'i', 'in', 'is',
    'learn', 'learning', 'like', 'makes', 'maintain', 'my', 'not',
    'patience', 'practice', 'programming', 'projects', 'python',
    'rewarding', 'takes', 'to', 'use', 'writing'
]

33 unique words

3. Build BoW Matrix (5 docs × 33 words)

Now count occurrences per document.

Final Output:

Vocabulary: ['and', 'be', 'bugs', 'but', 'can', 'clean', 'code', 'debugging', 'do', 'easier', 'easy', 'frustrating', 'fun', 'i', 'in', 'is', 'learn', 'learning', 'like', 'makes', 'maintain', 'my', 'not', 'patience', 'practice', 'programming', 'projects', 'python', 'rewarding', 'takes', 'to', 'use', 'writing']
Bag of Words Matrix:
 [[1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 2 1 0]
  [0 0 1 0 0 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0]
  [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0]
  [0 1 0 1 1 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
  [0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1]]



Matrix Explained (Row = Document, Column = Word)

Index | Word  | Doc 1 | Doc 2 | Doc 3 | Doc 4 | Doc 5
------|-------|-------|-------|-------|-------|------
0     | and   | 1     | 0     | 1     | 0     | 0
10    | easy  | 1     | 0     | 0     | 0     | 0
16    | learn | 1     | 0     | 0     | 0     | 0
30    | to    | 2     | 0     | 0     | 0     | 1
2     | bugs  | 0     | 1     | 0     | 0     | 0
6     | code  | 0     | 1     | 0     | 1     | 1
...   | ...   | ...   | ...   | ...   | ...   | ...
Example: First row (Doc 1):
  • "to" appears 2 times → column 30 = 2
  • "python", "is", "easy", "learn", "and", "fun", "use" → 1 each

Key Observations

  • Sparse matrix: Most entries are 0 → typical of BoW
  • No stemming or punctuation handling: related words like "learn" and "learning" are counted as unrelated
  • Order ignored: "fun to use" → same as "use to fun" in representation
  • High dimensionality: 33 words from just 5 short sentences → scales poorly
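The sparsity claim is easy to quantify; a compact recomputation of the same matrix (using list.count for brevity rather than the indexed loop above):

```python
import numpy as np

corpus = [
    "Python is easy to learn and fun to use",
    "I do not like bugs in my code",
    "Learning programming takes practice and patience",
    "Debugging code can be frustrating but rewarding",
    "Writing clean code makes projects easier to maintain",
]
docs = [s.lower().split() for s in corpus]
vocab = sorted({w for d in docs for w in d})
# Count each vocabulary word in each document
M = np.array([[d.count(w) for w in vocab] for d in docs])

print(M.shape)                            # (5, 33)
print(np.count_nonzero(M == 0) / M.size)  # ~0.78: roughly 78% of entries are 0
```

Even on five short sentences, more than three quarters of the matrix is zeros; on a real corpus the fraction is far higher.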





