Word Embeddings: an in-depth explanation

Word embeddings are an important part of natural language processing, and they play an important role in my project. I thought it would be useful to write a post that goes into a little more detail about them and their features.

Note: This post follows on from my last post, Solving a Problem. However, you don't need to have read that post to understand this one. (Some information may be repeated.)

What are word embeddings?

Simply put, word embeddings are a way of representing words as high-dimensional vectors. (For the non-technical: words can be represented by a long string of numbers.) The starting point for this representation is the "one-hot" representation.

The one-hot representation

The one-hot representation is where each word is represented as a binary string with only one bit set to 1. A simple example is:

cat = [0 0 0 1]
dog = [0 0 1 0]

In this example the vectors have n=4 dimensions, which is extremely low for word embeddings; usually n≈300–500 is used. One of the reasons for this high dimensionality in the one-hot representation is that it allows for Gray coding of the word vectors. In the example above the word vectors have a Hamming distance of 2, as do any two distinct one-hot vectors.
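As a rough illustration of the one-hot example above, here is a minimal Python sketch. The four-word vocabulary and the helper names one_hot and hamming_distance are my own assumptions for illustration; only cat and dog come from the example.

```python
def one_hot(index, size):
    """Return a list of `size` bits with only position `index` set to 1."""
    vec = [0] * size
    vec[index] = 1
    return vec


def hamming_distance(a, b):
    """Count the positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical vocabulary; the positions for cat and dog match the example above.
vocab = {"cat": 3, "dog": 2, "car": 1, "tree": 0}
vectors = {word: one_hot(i, len(vocab)) for word, i in vocab.items()}

print(vectors["cat"])  # [0, 0, 0, 1]
print(vectors["dog"])  # [0, 0, 1, 0]
print(hamming_distance(vectors["cat"], vectors["dog"]))  # 2
```

Swapping in any other pair of words from the vocabulary gives the same distance of 2, because two different one-hot vectors always differ in exactly two positions.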

However, this representation leaves similar words just as far apart as unrelated ones, making it a poor representation of the natural language feature space. Word embeddings are the solution to this.

How word embeddings solve the problem

Rather than using a simple binary string, word embeddings turn each word vector into a vector of floats. To initially generate the word embeddings, the one-hot representations of the words are taken and multiplied by random floating-point numbers.
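One way to picture this initial step is as a one-hot vector multiplied by a matrix of random floats. The sketch below is my own illustration under that assumption, not necessarily how any particular system does it; the names weights and embed, the range of the random values and the embedding size of 3 are all made up for the example.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

vocab_size = 4     # same toy vocabulary size as the one-hot example
embedding_dim = 3  # assumed; real embeddings use hundreds of dimensions

# Random matrix with one row per word and one column per embedding dimension.
weights = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]


def embed(one_hot_vec, weights):
    """Multiply a one-hot row vector by the weight matrix."""
    dim = len(weights[0])
    return [
        sum(bit * weights[row][col] for row, bit in enumerate(one_hot_vec))
        for col in range(dim)
    ]


cat = [0, 0, 0, 1]
cat_embedding = embed(cat, weights)

# Only one bit is set, so the product is simply row 3 of the matrix.
print(cat_embedding == weights[3])  # True
```

Because the one-hot vector just selects a row, the multiplication amounts to looking up that word's row of random floats.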

Training the word embeddings

After being generated, the word embeddings still suffer from the same issues as the one-hot representation. To solve this, the vectors are "trained" using a recurrent neural network. This training sets float values for the rest of the vectors. Clustering is applied to the vectors to optimise the vector space.

Similarities with image processing

In image processing, an M x N image can be represented as an MN-dimensional vector. This representation is similar to the vectors used in word embeddings.
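To make the image case concrete, here is a small Python sketch that flattens an M x N image into a single vector with MN entries. The 2 x 3 pixel values and the flatten helper are invented for illustration.

```python
def flatten(image):
    """Concatenate the rows of an M x N image into one list of length M*N."""
    return [pixel for row in image for pixel in row]


# Hypothetical 2 x 3 greyscale image (M=2 rows, N=3 columns).
image = [
    [0, 128, 255],
    [64, 32, 16],
]

vector = flatten(image)
print(len(vector))  # 6
print(vector)  # [0, 128, 255, 64, 32, 16]
```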

Recently, [2] has shown that, by using a neural network to learn these vectors alongside training word embeddings, a system can be developed that generates captions for images based on the objects recognised within the image.


[1] T. Mikolov, W.-t. Yih and G. Zweig, "Linguistic regularities in continuous space word representations," in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT-2013), Association for Computational Linguistics, May 2013.

[2] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions."