Bag-of-words representation of text: measure of document similarity

Returning to the bag-of-words example, we can use the notion of angle to measure how close two documents are to each other.

Given two documents and a predefined list of words appearing in them (the dictionary), we can compute the vectors x, y of frequencies of the dictionary words in each document. The angle between the two vectors is a widely used measure of closeness (similarity) between documents: the smaller the angle, the more similar the documents.
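
As an illustration, the computation can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names frequency_vector and angle_between, the sample dictionary and documents, and the tokenization (lowercasing and splitting on whitespace) are all hypothetical choices made for the example. The angle is obtained from the standard formula cos(theta) = (x . y) / (||x|| ||y||), where ||.|| denotes the Euclidean norm.

```python
import math
from collections import Counter


def frequency_vector(text, dictionary):
    """Return the relative frequencies of the dictionary words in text.

    Assumption: words are separated by whitespace and compared
    case-insensitively; real text would need better tokenization.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[word] for word in dictionary)
    if total == 0:
        # None of the dictionary words occur in the text.
        return [0.0] * len(dictionary)
    return [counts[word] / total for word in dictionary]


def angle_between(x, y):
    """Return the angle (in radians) between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("the angle is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(cosine)


# Hypothetical dictionary and documents, for illustration only.
dictionary = ["cat", "dog", "fish"]
doc_a = "cat dog cat"
doc_b = "dog cat cat"
doc_c = "fish fish fish"

x = frequency_vector(doc_a, dictionary)
y = frequency_vector(doc_b, dictionary)
z = frequency_vector(doc_c, dictionary)

angle_ab = angle_between(x, y)  # same word frequencies, different order
angle_ac = angle_between(x, z)  # no dictionary words in common
```

Because word frequencies are nonnegative, the angle always lies between 0 and 90 degrees. In this sketch, doc_a and doc_b contain the same words with the same frequencies, so their angle is essentially zero, while doc_a and doc_c share no dictionary words, so their vectors are orthogonal.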
