Text mining is a machine learning technique that I employ in my research and non-research projects. I analyze, model, and visualize text in R with numerous packages and functions. Text must be cleaned before the analysis, modeling, and visualization stages, and the cleaning process involves several steps. R can read in raw text data in different formats (e.g., .txt, .pdf), and the text data must be transformed into an acceptable format for text mining. I am going to share a text mining project that I recently executed in R, in which I analyzed the text of research interest keywords and research statements in a database. I created .txt files and read them in with the read.delim function. I converted the .txt files into tibbles with the tidyverse R package and began the text cleaning process after converting the tibble columns from factors to characters with the as.character function. I employed the NLP (Natural Language Processing), tm (Text Mining), rJava (set the Java environment before loading), RWeka (rJava should be loaded first), tidytext, RColorBrewer, plotrix, graphics, Hmisc, lattice, survival, Formula, and ggplot2 R packages in this project.
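The read-in and conversion steps above can be sketched as follows (a minimal sketch; the file name and column name are assumptions for illustration, not the actual project files):

```r
library(tidyverse)

# Read the raw .txt file (file name assumed for illustration);
# read.delim() returns a data frame, historically with factor columns
interests_raw <- read.delim("research_interests.txt", header = FALSE,
                            stringsAsFactors = TRUE)

# Convert to a tibble, rename the default V1 column, and
# coerce the factor column to character with as.character
interests <- as_tibble(interests_raw) %>%
  rename(text = V1) %>%
  mutate(text = as.character(text))
```

Note that since R 4.0, read.delim defaults to stringsAsFactors = FALSE, so the as.character step is only needed on older R versions or when factors are requested explicitly.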
I employed the tidytext R package with the research interest tibbles and the tm R package with the research statement tibbles. I removed the punctuation and converted words to lowercase in the research interest tibbles with the unnest_tokens function. Stop words (e.g., to, the, at) were removed with the stop_words dataset of tidytext. The original word count was 1,015, and 61 stop words were removed from the research interest tibble. The top words in research interests were “cell”, “computational”, “energy”, “science”, “cancer”, “chemistry”, “data”, “education”, “sensing”, “computing”, “design”, “drug”, “engineering”, “learning”, “materials”, “modeling”, “health”, “development”, and “environmental”. I depicted the top words in research interests with ggplot2.
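The tidytext tokenizing, stop-word removal, and plotting steps could look like this (a sketch; `interests` is assumed to be a one-column tibble with a `text` column, and the plot styling is illustrative only):

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# unnest_tokens() lowercases words and strips punctuation in one step
tidy_interests <- interests %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")   # drops stop words such as "to", "the", "at"

# Count word frequencies and depict the top words with ggplot2
tidy_interests %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 20) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency")
```

The anti_join against the stop_words dataset keeps only the rows whose word does not appear in the stop-word list.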
The figure below does not show the word “development”, as it does not indicate a clear research interest. I kept the word “learning” because the data included “machine learning” and “deep learning”, which would come to mind as research interests. I removed the word “development” from the .txt file, repeated the process described in the first paragraph, and performed a second cleaning.
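As an alternative to editing the .txt file by hand, a custom word such as “development” can also be dropped from the tokenized tibble directly (a sketch; `tidy_interests` is an assumed name for the tokenized tibble):

```r
library(dplyr)

# Drop the ambiguous term before counting and plotting
tidy_interests <- tidy_interests %>%
  filter(word != "development")
```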
I read in the research statements directory with the DirSource function. I created corpora for the research statements with the VCorpus function in R. More steps were required for the research statements than for the research interest keywords. The original word count for the research statements was 1,943, and the count decreased during the text cleaning process. I converted words to lowercase with the content_transformer function. I removed punctuation with the removePunctuation function, as punctuation marks are otherwise counted as words; the word count decreased to 1,888. Stop words were removed with the removeWords and stopwords functions, and the word count decreased to 1,275. You can stem words as part of the text cleaning process, but I intentionally did not perform this step. I depicted the top words in research statements with ggplot2.
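The tm pipeline described above can be sketched as follows (the directory name is an assumption for illustration):

```r
library(tm)

# Point DirSource at the directory of research statement .txt files
statements_source <- DirSource("research_statements")
statements_corpus <- VCorpus(statements_source)

# Lowercase via content_transformer (tolower is not a tm transformation
# on its own, so it must be wrapped)
statements_corpus <- tm_map(statements_corpus, content_transformer(tolower))

# Remove punctuation, which would otherwise be counted as part of words
statements_corpus <- tm_map(statements_corpus, removePunctuation)

# Remove English stop words with removeWords + stopwords
statements_corpus <- tm_map(statements_corpus, removeWords, stopwords("en"))
```

Stemming would be one more tm_map call (tm_map(statements_corpus, stemDocument)), which the post deliberately skips.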
The figure below does not show “research”, “using”, “work”, “development”, and “well”, as compared to the figure above. I removed these words from the .txt file, repeated the process described in the previous paragraph, and performed a second cleaning.
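Within tm, the same custom words can be removed from the corpus directly by passing a character vector to removeWords (a sketch; `statements_corpus` is an assumed name for the research statement corpus):

```r
library(tm)

# Words that are frequent but uninformative in this corpus
custom_words <- c("research", "using", "work", "development", "well")

# removeWords accepts any character vector, not just stopwords()
statements_corpus <- tm_map(statements_corpus, removeWords, custom_words)
```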
Bigrams are pairs of adjacent words in a text sequence; I created them with the NGramTokenizer and Weka_control functions to identify high-frequency word pairs in the research statements. The top bigrams were “data analysis”, “data science”, “heat transfer”, “machine learning”, and “transition state”.
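The bigram counting could be sketched like this (an illustrative sketch; `statements_corpus` is an assumed VCorpus name, and rJava must load successfully, with the Java environment set, before RWeka):

```r
library(RWeka)   # requires rJava; set the Java environment before loading
library(tm)

# Weka_control(min = 2, max = 2) restricts the tokenizer to bigrams only
bigram_tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}

# Custom tokenizers are honored by VCorpus-backed term-document matrices
tdm_bigrams <- TermDocumentMatrix(
  statements_corpus,
  control = list(tokenize = bigram_tokenizer)
)

# Sum frequencies across documents and inspect the most frequent bigrams
bigram_freq <- sort(rowSums(as.matrix(tdm_bigrams)), decreasing = TRUE)
head(bigram_freq)
```

One caveat worth noting: the custom tokenizer is respected by VCorpus, whereas the simpler Corpus/SimpleCorpus backend ignores it, so VCorpus (as used in this project) is the right choice here.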
Please note that the R scripts and outputs were preceded by many steps that are not depicted above.