Data Sets for Training A.I.

To train any learning algorithm, a suitable data set is required. In fact, three data sets are required: a training set, a validation set and an unseen set. A suitable data set will contain a large number of randomly assorted sentences. Fortunately, I found an open source, multilingual database of example sentences called Tatoeba.

From here I was able to download all the English sentences as a csv file. Despite the name, the fields are separated by tabs rather than commas, and each line has the following structure:

Sentence ID [tab] Lang [tab] Text
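
For illustration, a file in this layout could be read and counted in C++ roughly as follows. This is only a sketch, not the original program: the names Sentence, parseLine and countSentences are made up for the example, and it assumes C++17 and one sentence per line.

```cpp
#include <cstddef>
#include <istream>
#include <optional>
#include <sstream>
#include <stdexcept>
#include <string>

// One record of the export: "Sentence ID <tab> Lang <tab> Text".
// (Hypothetical type, named here for illustration only.)
struct Sentence {
    long id;
    std::string lang;
    std::string text;
};

// Split one line on its first two tabs. Returns std::nullopt when the line
// has fewer than three fields or the ID field is not a whole number.
// Any further tabs are kept as part of the text.
std::optional<Sentence> parseLine(const std::string& line) {
    const std::size_t firstTab = line.find('\t');
    if (firstTab == std::string::npos) return std::nullopt;
    const std::size_t secondTab = line.find('\t', firstTab + 1);
    if (secondTab == std::string::npos) return std::nullopt;

    const std::string idField = line.substr(0, firstTab);
    Sentence s;
    try {
        std::size_t used = 0;
        s.id = std::stol(idField, &used);
        if (used != idField.size()) return std::nullopt;  // trailing junk
    } catch (const std::logic_error&) {
        // std::stol throws invalid_argument or out_of_range on bad input.
        return std::nullopt;
    }
    s.lang = line.substr(firstTab + 1, secondTab - firstTab - 1);
    s.text = line.substr(secondTab + 1);
    return s;
}

// Count the well-formed lines whose Lang field matches `lang`.
std::size_t countSentences(std::istream& in, const std::string& lang = "eng") {
    std::size_t count = 0;
    std::string line;
    while (std::getline(in, line)) {
        // Tolerate Windows line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        const std::optional<Sentence> s = parseLine(line);
        if (s && s->lang == lang) ++count;
    }
    return count;
}
```

Passing a std::ifstream opened on the downloaded file to countSentences would give the number of English sentences. Malformed lines are skipped rather than treated as errors, which is a design choice for the sketch and not something the file format requires.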

As all the sentences were English, the Lang field was "eng" for every entry. By writing a simple program in C++, I found that there are 477,652 sentences in the file. This is enough to create the training sets. The data set was broken up into the following smaller sets:

  • Initial training set - 100 sentences
  • Training set - 238,726 sentences
  • Validation set - 238,726 sentences
  • Unseen set - 100 sentences
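
The sizes above add up to the full file: 100 + 238,726 + 238,726 + 100 = 477,652. A split of this shape could be sketched in C++ as below. The names DataSets and splitSentences are hypothetical, and shuffling with a fixed seed before cutting is my assumption for the example; the sets could just as well be cut in file order.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <random>
#include <stdexcept>
#include <string>
#include <vector>

// The four sets described above. (Hypothetical type for illustration.)
struct DataSets {
    std::vector<std::string> initialTraining;
    std::vector<std::string> training;
    std::vector<std::string> validation;
    std::vector<std::string> unseen;
};

// Shuffle the sentences, then cut them into four consecutive slices:
// a small initial training set, two halves of the remainder, and a
// small unseen set. The seed is fixed so the split is reproducible.
DataSets splitSentences(std::vector<std::string> sentences,
                        std::size_t smallSetSize = 100,
                        unsigned seed = 42) {
    if (sentences.size() < 2 * smallSetSize) {
        throw std::invalid_argument("not enough sentences for the small sets");
    }

    std::mt19937 rng(seed);
    std::shuffle(sentences.begin(), sentences.end(), rng);

    const std::size_t remaining = sentences.size() - 2 * smallSetSize;
    const std::size_t trainingSize = remaining / 2;

    // Move the next n sentences out of the shuffled list.
    auto next = sentences.begin();
    auto take = [&next](std::size_t n) {
        const auto last = next + static_cast<std::ptrdiff_t>(n);
        std::vector<std::string> slice(std::make_move_iterator(next),
                                       std::make_move_iterator(last));
        next = last;
        return slice;
    };

    DataSets sets;
    sets.initialTraining = take(smallSetSize);
    sets.training = take(trainingSize);
    // If the remainder is odd, the validation set gets the extra sentence.
    sets.validation = take(remaining - trainingSize);
    sets.unseen = take(smallSetSize);
    return sets;
}
```

With 477,652 sentences and the default small set size of 100, this yields the four sizes listed above.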

The initial training set will be used to train a neural network that will give a "fast pass" identification to the training and validation sets. These identifications will then be manually checked and corrected where they are wrong. A new network will then be trained on the corrected data.

What's next

To develop a system that is able to detect the subject of a sentence, a suitable input to the learning system must be found. The next step is to research the best way to use words as an input for neural networks.