Solving a Problem

In my last post I talked about the data sets that can be used to train a system to find the subject of a sentence. Since then, I have been researching how to represent the words that will be input to the system.

My project so far

So far my project has been based on using dictionary definitions, followed by disambiguation, to get each word's sense. This word sense is then used to split the sentence into its key phrases (e.g. noun phrase, verb phrase).

A problem arose

By approaching the problem in the way I have, I had no way of encoding the words for use with a neural network: a word cannot simply be fed in as input; it must first be converted into a standardised numerical representation.

After doing some research, I found that the standard approach is to use word embeddings.

Word embeddings

The standard practice in Natural Language Processing (NLP) is to encode words as high-dimensional vectors known as word embeddings. These vectors are then optimised during training so that words with similar meanings end up close together in the vector space.
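
As a minimal illustration (a toy sketch, not my actual system), an embedding layer is really just a look-up table: a matrix with one row of learnable numbers per vocabulary word. In Python:

    import numpy as np

    # Toy vocabulary mapping each word to a row index.
    vocab = {"the": 0, "cat": 1, "sat": 2}
    embedding_dim = 4  # real systems typically use 50-300 dimensions

    # Randomly initialised here; during training these values are
    # optimised so similar words end up with similar vectors.
    embeddings = np.random.randn(len(vocab), embedding_dim)

    def embed(word):
        """Look up the vector for a word."""
        return embeddings[vocab[word]]

    print(embed("cat"))  # a 4-dimensional vector representing "cat"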

This representation is great at encoding similarities: the vector space captures relationships between words, which can be demonstrated in a simple way using vector subtraction. For example, [1] shows that:

x_apple − x_apples ≈ x_car − x_cars ≈ x_family − x_families

Similarly:

x_shirt − x_clothing ≈ x_chair − x_furniture

This has been found to be of great benefit, allowing systems to keep functioning when words are substituted with others of similar meaning. It also helps when learning categories: words the system was never trained on can still be understood, because their vectors will lie close to those of words that have already been classified.
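
To make this concrete, here is a small example of how such an analogy could be checked using the gensim library and a file of pre-trained vectors (the file name here is an assumption; any word2vec-format vectors would do):

    from gensim.models import KeyedVectors

    # Load pre-trained word vectors from disk.
    kv = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    # If x_apple - x_apples ≈ x_car - x_cars, then
    # x_apples - x_apple + x_car should land near x_cars.
    print(kv.most_similar(positive=["apples", "car"],
                          negative=["apple"], topn=3))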

Because so much information about each word is already encoded in its vector, interesting behaviours can be learnt without having to supply additional data to the learning system.

A potential solution

In [2] a tool called word2vec is discussed. This open source toolkit generates word embeddings optimised to capture the meanings of words. Using this tool, a look-up table of word vectors can be generated and then utilised in my project.
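
word2vec itself is a command-line toolkit, but the same algorithm is also available through libraries such as gensim. As a rough sketch of how the look-up table could be built (a toy corpus is used here; a real table would be trained on a large corpus, or loaded from pre-trained vectors):

    from gensim.models import Word2Vec

    # Tiny toy corpus of pre-tokenised sentences.
    sentences = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "rug"],
    ]

    # Train the embeddings (parameter names follow gensim 4.x).
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)

    # model.wv now acts as the look-up table of word vectors.
    vector = model.wv["cat"]                 # 50-dimensional vector
    print(model.wv.most_similar("cat", topn=2))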

After analysing a sentence and breaking it down into its key phrases, the sentence (along with its noun phrases) can now be converted into word vectors. This will hopefully allow a recurrent neural network (RNN) to be trained to identify the subject of the sentence.
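
As a sketch of the plan (hypothetical code, not a working system), the word vectors for a sentence could be stacked into a matrix and fed through an RNN that scores each token as part of the subject or not, e.g. in PyTorch:

    import torch
    import torch.nn as nn

    class SubjectTagger(nn.Module):
        def __init__(self, embedding_dim=50, hidden_dim=32):
            super().__init__()
            self.rnn = nn.RNN(embedding_dim, hidden_dim,
                              batch_first=True)
            self.classifier = nn.Linear(hidden_dim, 1)

        def forward(self, word_vectors):
            # word_vectors: (batch, sentence_length, embedding_dim)
            hidden_states, _ = self.rnn(word_vectors)
            # one score per token: how likely it belongs to the subject
            return self.classifier(hidden_states).squeeze(-1)

    # e.g. one sentence of 6 words, each a 50-dimensional word vector
    sentence = torch.randn(1, 6, 50)
    scores = SubjectTagger()(sentence)
    print(scores.shape)  # torch.Size([1, 6])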

This means that my project so far remains as valuable as it was before, and I can carry on, having successfully recovered from the mistakes that were made.

References

[1] T. Mikolov, W.-t. Yih, and G. Zweig, “Linguistic regularities in continuous space word representations,” in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013), Association for Computational Linguistics, May 2013.

[2] T. Mikolov, I. Sutskever, and Q. Le, “Learning the meaning behind words,” August 2013.
