Erick Saenz
He/him
Education: Majoring in computer science
McNair Project: Using Trigrams for Recipe Extraction (2018)
Mentor: William Hsu, Ph.D.
Here we discuss the techniques, processes, and potential of text mining with the Naïve Bayes classifier, a machine learning algorithm. The text in question is the ingredients listed in recipes within chemistry documents. We also demonstrate that the quality of the data matters more than its quantity when training the classifier for text analytics. First, we converted the text of the documents into parts of speech using the NLTK part-of-speech (POS) tagger. Next, we combined four consecutive tags/tokens into quadruple-grams and identified consistent patterns for ingredients. Then we created a training set for the classifier made up of those patterns. Lastly, we imported and trained the Naïve Bayes classifier and annotated chemistry papers by hand so that the accuracy of the classifier's results could be checked against them. The training set was modified multiple times throughout the research, and these revisions led us to conclude that a training set must be carefully crafted for quality rather than simply made as large as possible. The point of this research is to improve scientific communication and increase the efficiency of absorbing knowledge. As more and more publications become available worldwide, a tool like this may save each reader dozens of hours by automatically extracting structured information. The time saved, together with the added precision, would leave more room to collaborate on more complex ideas.
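The steps described above (POS tags, four-tag n-grams, a labeled training set, a Naïve Bayes classifier) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: the POS tags are written by hand rather than produced by the NLTK tagger, the training patterns and the labels "ingredient" and "other" are hypothetical, and the small `TinyNaiveBayes` class is a stand-in written with the standard library instead of an imported classifier.

```python
import math
from collections import Counter, defaultdict


def pos_ngrams(tags, n=4):
    """Return every contiguous n-gram of a POS-tag sequence as a tuple."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


class TinyNaiveBayes:
    """Minimal Naive Bayes with add-one smoothing (illustrative stand-in).

    Each feature is a (position, tag) pair taken from an n-gram.
    """

    def __init__(self):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # label -> feature counts
        self.vocab = set()

    def train(self, examples):
        for gram, label in examples:
            self.label_counts[label] += 1
            for feature in enumerate(gram):
                self.feature_counts[label][feature] += 1
                self.vocab.add(feature)

    def classify(self, gram):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for feature in enumerate(gram):
                score += math.log((self.feature_counts[label][feature] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, hand-picked training patterns (Penn Treebank style tags).
# A small curated set like this mirrors the "quality over quantity" idea.
training_set = [
    (("CD", "NN", "IN", "NN"), "ingredient"),   # e.g. "5 g of salt"
    (("CD", "NNS", "IN", "NN"), "ingredient"),  # e.g. "2 grams of sugar"
    (("CD", "NN", "JJ", "NN"), "ingredient"),   # e.g. "10 mL distilled water"
    (("DT", "NN", "VBD", "VBN"), "other"),      # e.g. "the mixture was stirred"
    (("PRP", "VBD", "DT", "NN"), "other"),      # e.g. "we heated the flask"
    (("IN", "DT", "JJ", "NN"), "other"),        # e.g. "in a clean beaker"
]

clf = TinyNaiveBayes()
clf.train(training_set)

# Hand-written tags for "The mixture was stirred with 5 g of salt".
sentence_tags = ["DT", "NN", "VBD", "VBN", "IN", "CD", "NN", "IN", "NN"]
for gram in pos_ngrams(sentence_tags):
    print(gram, "->", clf.classify(gram))
```

In a fuller version the hand-written tags would presumably come from the tagger, and the predictions would be compared against the hand-annotated papers to estimate accuracy.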