Student Research

Graduate Programs in Software faculty members and graduate students collaborate with organizations on joint research projects using data science.

The Center provides real-world case studies, projects, and datasets for classwork, with the possibility of mixed public/private research, consulting, and engagement. Students benefit from working on timely, relevant issues affecting industry today and gain exposure to potential placement opportunities.

The Center also furthers the St. Thomas mission of applied learning: organizations that need data science acumen gain a way to experiment with the university and its student body, and in turn the university gains a firsthand view of employers' emerging needs, informing how it adapts and augments its curriculum.

Student and Faculty Research Projects

Graduate Programs in Software faculty members and graduate students collaborate on joint research projects using data science.

Recent Presentations:

B. Zhang, A. Kazemzadeh, and B. Reese, "Shallow Parsing for Nepal Bhasa Complement Clauses", presented at ComputEL Workshop of Association of Computational Linguistics (ACL) Conference, Dublin, Ireland, May 23-27, 2022.

Abstract: Accelerating the process of data collection, annotation, and analysis is an urgent need for linguistic fieldwork and documentation of endangered languages (Bird, 2009). Our experiments describe how we maximize the quality of the Nepal Bhasa syntactic complement structure chunking model. Native speaker language consultants were trained to annotate a minimally selected raw data set (Suárez et al., 2019). The embedded clauses, matrix verbs, and embedded verbs are annotated. We apply both statistical training algorithms and transfer learning in our training, including Naive Bayes, MaxEnt, and fine-tuning the pre-trained mBERT model (Devlin et al., 2018). We show that with limited annotated data, the model is already sufficient for the task. The modeling resources we used are largely available for many other endangered languages, so this practice can easily be duplicated to train shallow parsers for endangered languages in general.
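The statistical baselines named above can be framed as per-token chunk tagging. The snippet below is a toy illustration using invented English-like data and simple contextual features, standing in for the annotated Nepal Bhasa corpus and the richer models in the paper; it is a sketch, not the authors' implementation.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy English-like data standing in for the annotated corpus:
# each token is labelled B-COMP / I-COMP (complement clause) or O.
sentences = [
    (["she", "said", "that", "he", "left"], ["O", "O", "B-COMP", "I-COMP", "I-COMP"]),
    (["they", "think", "that", "we", "won"], ["O", "O", "B-COMP", "I-COMP", "I-COMP"]),
]

def token_features(tokens, i):
    # Simple contextual features; the paper's models use far richer cues.
    return {
        "word": tokens[i],
        "prev": tokens[i - 1] if i > 0 else "<s>",
        "next": tokens[i + 1] if i < len(tokens) - 1 else "</s>",
    }

X, y = [], []
for tokens, tags in sentences:
    for i in range(len(tokens)):
        X.append(token_features(tokens, i))
        y.append(tags[i])

vec = DictVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(X), y)

test_tokens = ["he", "said", "that", "she", "left"]
feats = vec.transform([token_features(test_tokens, i) for i in range(len(test_tokens))])
preds = list(clf.predict(feats))
print(preds)
```

Even with two training sentences, the complementizer position is recovered, which mirrors the paper's point that a small, carefully annotated data set can already support the chunking task.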

A. Kazemzadeh, "BERT-Assisted Semantic Annotation Correction for Emotion-Related Questions", presented at ARDUOUS Workshop of IEEE Percom Conference, Mar. 21-25, 2022.

Abstract: Annotated data have traditionally been used to provide the input for training a supervised machine learning (ML) model. However, current pre-trained ML models for natural language processing (NLP) contain embedded linguistic information that can be used to inform the annotation process. We use the BERT neural language model to feed information back into an annotation task that involves semantic labelling of dialog behavior in a question-asking game called Emotion Twenty Questions (EMO20Q). First we describe the background of BERT, the EMO20Q data, and assisted annotation tasks. Then we describe the methods for fine-tuning BERT for the purpose of checking the annotated labels. To do this, we use the paraphrase task as a way to check that all utterances with the same annotation label are classified as paraphrases of each other. We show this method to be an effective way to assess and revise annotations of textual user data with complex, utterance-level semantic labels.
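The label-checking idea above reduces to: pair up utterances that share an annotation label, then ask a paraphrase model whether each pair holds. The sketch below uses invented EMO20Q-style utterances and a crude token-overlap score as a placeholder for the fine-tuned BERT paraphrase classifier.

```python
from itertools import combinations

# Invented EMO20Q-style utterances with semantic annotation labels.
annotated = [
    ("is it a positive feeling?", "valence-question"),
    ("does it feel good?", "valence-question"),
    ("is it anger?", "emotion-guess"),
    ("are you thinking of anger?", "emotion-guess"),
]

def same_label_pairs(examples):
    # Pair every two utterances that received the same annotation label.
    by_label = {}
    for utterance, label in examples:
        by_label.setdefault(label, []).append(utterance)
    return [(label, a, b)
            for label, utts in by_label.items()
            for a, b in combinations(utts, 2)]

def paraphrase_score(a, b):
    # Placeholder for the fine-tuned BERT paraphrase model:
    # token-overlap (Jaccard) similarity as a crude stand-in.
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

pairs = same_label_pairs(annotated)
for label, a, b in pairs:
    flag = "REVIEW" if paraphrase_score(a, b) < 0.2 else "ok"
    print(f"{label}: ({a!r}, {b!r}) -> {flag}")
```

Pairs flagged REVIEW would go back to an annotator, which is the feedback loop the abstract describes.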

R. Panda, P. Sivaprakasam, and M. Rege, "A computational analysis of the impact of population and number of police officers on assault against Law Enforcement Officers", in Proceedings of 60th Annual IACIS International Conference, Oct 7-10, 2020.

Abstract: In the United States, law enforcement professionals are feloniously killed, accidentally killed, and assaulted in the line of duty. The algorithm proposed in this paper helps predict the assault count based on the population size of a geographic area and the number of police personnel. This paper focuses on analyzing trends in the numbers of assaults using Jupyter Notebook and SASPy in SAS® University Edition and SAS® Enterprise Miner. A variety of methods, such as Gradient Boosting, Decision Trees, Linear Regression, and Neural Networks, are used to determine the best model.
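Three of the model families named above can be compared in a few lines. The snippet below uses synthetic data rather than the paper's SAS-based analysis of real law-enforcement records; the relationship between officers and assaults is invented purely for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
population = rng.integers(1_000, 1_000_000, size=500)
officers = (population * rng.uniform(0.001, 0.003, size=500)).astype(int)
# Assumed synthetic relationship, for illustration only.
assaults = (officers * 0.05 + rng.normal(0, 5, size=500)).clip(0)

X = np.column_stack([population, officers])
X_tr, X_te, y_tr, y_te = train_test_split(X, assaults, random_state=0)

models = {
    "gradient boosting": GradientBoostingRegressor(random_state=0),
    "decision tree": DecisionTreeRegressor(random_state=0),
    "linear regression": LinearRegression(),
}
scores = {name: mean_absolute_error(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
for name, mae in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: MAE = {mae:.1f}")
```

Ranking candidate models on a held-out test set, as here, is the usual way such a "best model" comparison is carried out.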

L. Butler and M. Rege, "Building an Analytical Model to Predict Workforce Outcomes in Medical Education", in Proceedings of 60th Annual IACIS International Conference, Oct 7-10, 2020.

Abstract: Across the United States, there is a shortage of physicians providing care in rural areas. This shortage means patients living in rural communities must travel further and have fewer care options. The purpose of this study is ultimately to fill the gap in rural workforce outcomes by identifying students who are likely to practice in rural areas once they complete medical school and residency/fellowship programs. These students may be identified through predictive analytics techniques; once identified, they can be provided with informational material and optional programs to further foster interest in rural care. Through techniques such as feature extraction, resampling, and data imputation, we prepare data for various machine learning classifiers. These models allow us to identify features common to urban providers and rural providers. Seventy percent of rural providers were correctly identified as practicing in rural areas, while 25% of their urban counterparts were classified as rural. One characteristic difference between the groups is that rural providers have higher average scores in their medical school courses, while urban providers have higher standardized test scores.
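The preprocessing steps named above, imputation and resampling before classification, can be sketched as follows. The data are synthetic and imbalanced rather than real medical-education records; the 15% minority rate and the separating signal are assumptions made for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 400
X = rng.normal(size=(n, 3))
y = (rng.random(n) < 0.15).astype(int)   # ~15% "rural" providers (assumed)
X[y == 1] += 1.0                         # invented separating signal
X[rng.random(X.shape) < 0.1] = np.nan    # ~10% missing entries

# Impute missing values with the column mean.
X = SimpleImputer(strategy="mean").fit_transform(X)

# Naive random oversampling of the minority class to balance training data.
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=(y == 0).sum() - minority.size)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

clf = LogisticRegression().fit(X_bal, y_bal)
rural_recall = clf.predict(X[y == 1]).mean()
print(f"recall on rural class: {rural_recall:.2f}")
```

Without the resampling step, a classifier on data this imbalanced tends to predict the majority (urban) class, which is why recall on the rural class is the metric the study emphasizes.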

T. White and M. Rege, "Sentiment Analysis on Google Cloud Platform", in Proceedings of 60th Annual IACIS International Conference, Oct 7-10, 2020.

Abstract: This project explores two services available on the Google Cloud Platform (GCP) for performing sentiment analysis: Natural Language API and AutoML Natural Language. The former provides powerful prebuilt models that can be invoked as a service allowing developers to perform sentiment analysis, along with other features like entity analysis, content classification and syntax analysis. The latter allows developers to train more domain specific models whenever the pre-trained models offered by the Natural Language API are not sufficient. Experiments have been conducted with both of these offerings and the results are presented herein.

E. Friedman, R. Kolakaluri, and M. Rege, "Benford's Law Applied to Precinct Level Election Data", in Proceedings of 60th Annual IACIS International Conference, Oct 7-10, 2020.

Abstract: This paper attempts to determine whether precinct-level election data conform with Benford's Law. To evaluate this, we constructed a two-part test. The first test assesses whether the election data correlate with Benford's Law. If the election data under study are found to correlate with Benford's Law, we then subject them to the Kolmogorov-Smirnov test, which evaluates conformance with Benford's Law more rigorously and aids in forensic analysis. We conclude that the frequency pattern of the first digits of the precinct-level elections under study correlates strongly with the pattern predicted by Benford's Law of first digits. We also conclude, however, that the correlation is not strong enough for definitive forensic analysis.
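The two-part test described above can be sketched in a few lines: compare observed first-digit frequencies with the Benford expectation by correlation, then by a KS-style maximum gap between cumulative distributions. The vote counts below are synthetic log-uniform data, which are known to follow Benford's Law, standing in for real precinct returns.

```python
import numpy as np

def benford_expected():
    # Benford's Law: P(first digit = d) = log10(1 + 1/d), d = 1..9.
    d = np.arange(1, 10)
    return np.log10(1 + 1 / d)

def first_digit_freqs(counts):
    digits = np.array([int(str(c)[0]) for c in counts if c > 0])
    return np.bincount(digits, minlength=10)[1:] / digits.size

rng = np.random.default_rng(42)
# Log-uniform synthetic "vote counts" spanning several orders of magnitude.
votes = np.floor(10 ** rng.uniform(1, 5, size=5000)).astype(int)

observed = first_digit_freqs(votes)
expected = benford_expected()

# Part 1: correlation between observed and expected digit frequencies.
r = np.corrcoef(observed, expected)[0, 1]
# Part 2: KS-style statistic, the max gap between cumulative distributions.
ks = np.abs(np.cumsum(observed) - np.cumsum(expected)).max()
print(f"correlation = {r:.3f}, KS statistic = {ks:.4f}")
```

Data that pass the correlation screen but show a large KS gap are exactly the case the abstract describes: consistent with Benford's Law in broad shape, but not conforming tightly enough for forensic conclusions.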

P. Tumu, V. Manchenasetty, and M. Rege, "Context Based Sentiment Analysis Approach Using N-Gram and Word Vectorization Methods" in Proceedings of 60th Annual IACIS International Conference, Oct 7-10, 2020.

Abstract: Consumer reviews are key indicators of product credibility and central to almost all product manufacturing companies aligning and altering their products to the needs of customers. Using a sentiment analysis approach, these reviews can be analyzed for positive, negative, and neutral feedback. Many techniques have been designed in the past to perform sentiment analysis and opinion mining on drug reviews to study their effectiveness and side effects on people. In this paper, an approach is presented that combines context-based sentiment analysis using N-grams with the tf-idf word vectorization method to find the sentiment class (positive, negative, or neutral) and uses this sentiment class in Naïve Bayes and Random classifiers to predict user review emotion. Our validation process involved measuring model performance using quality metrics. The results showed that the proposed solution outperformed conventional sentiment analysis techniques with an overall accuracy of 89%.
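The feature pipeline described above, word n-grams fed through tf-idf vectorization into a Naive Bayes classifier, looks roughly like this. The reviews are invented toy examples, not the drug-review corpus analyzed in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy reviews with three sentiment classes.
reviews = [
    "works great and no side effects", "very effective, highly recommend",
    "made me dizzy and nauseous", "terrible, it did not work at all",
    "no change either way", "neither better nor worse",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # unigrams and bigrams
    MultinomialNB(),
)
model.fit(reviews, labels)

preds = model.predict(["highly effective with no side effects"])
print(preds)
```

The bigram features are what give the model its "context": "no side" carries a very different signal than the unigram "no" alone.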

J. Noble and H. Gamit, "Unsupervised Contextual Clustering of Abstracts", Winner of the 2020 SAS Global Forum Student Symposium.

Watch Unsupervised Contextual Clustering of Abstracts presentation on YouTube

Abstract: This study utilizes publicly available data from the National Science Foundation (NSF) Web Application Programming Interface (API). In this paper, various machine learning techniques are demonstrated to explore, analyze and recommend similar proposal abstracts to aid the NSF or Awardee with the Merit Review Process. These techniques extract textual context and group it with similar context. The goal of the analysis was to utilize the Doc2Vec unsupervised learning algorithms to embed NSF funding proposal abstract text into vector space. Once vectorized, the abstracts were grouped together using K-means clustering. Together, these techniques proved successful at grouping similar proposals and could be used to find proposals similar to newly submitted NSF funding proposals. To perform text analysis, SAS® University Edition is used, which supports SASPy, SAS® Studio and Python JupyterLab. Gensim Doc2Vec is used to generate document vectors for proposal abstracts. Afterwards, the document vectors were used to cluster similar abstracts using the SAS® Studio KMeans Clustering Module. For visualization, the abstract embeddings were reduced to two dimensions using Principal Component Analysis (PCA) within SAS® Studio. This was then compared to the t-Distributed Stochastic Neighbor Embedding (t-SNE) dimensionality reduction technique from the Scikit-learn machine learning toolkit for Python. In conclusion, NSF proposal abstract text analysis can help an awardee read and improve their proposal by identifying similar proposal abstracts from the last 24 years. It could also help NSF evaluators identify similar existing proposals, indirectly providing insight into whether a new proposal is likely to be fruitful.
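The clustering-and-projection stages above can be sketched entirely in scikit-learn as a stand-in: TfidfVectorizer replaces the gensim Doc2Vec embeddings, and scikit-learn's KMeans and PCA replace the SAS® Studio modules. The abstracts are toy placeholders, not NSF data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Toy placeholder "abstracts" covering two obvious topics.
abstracts = [
    "neural network model for image classification",
    "deep neural network for image recognition",
    "survey of coral reef fish biodiversity",
    "coral reef biodiversity in marine habitats",
]

# TfidfVectorizer stands in here for Doc2Vec document embeddings.
vectors = TfidfVectorizer().fit_transform(abstracts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
# Reduce the embeddings to two dimensions for plotting, as in the paper.
coords = PCA(n_components=2).fit_transform(vectors.toarray())

for text, lab, (px, py) in zip(abstracts, labels, coords):
    print(f"cluster {lab} ({px:+.2f}, {py:+.2f}): {text}")
```

A new proposal would be embedded with the same vectorizer and assigned to its nearest cluster, which is how "similar existing proposals" would be surfaced.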

T. Le, T. Tran, and M. Rege, "Dynamic image for micro-expression recognition on region-based framework", in Proceedings of IEEE 21st International Conference on Information Reuse and Integration for Data Science, Aug 11-13, 2020.

Illustration of the four main components of the proposed ME recognition system. MEs in an input video sequence are first magnified using Eulerian video magnification, which enlarges facial movements for the succeeding feature-learning phase. The output frames then go through feature extraction based on rank pooling, called the dynamic image technique; the outcome of this stage is RGB dynamic images representing the features of each frame in the sequence. Next, certain localized facial regions (forehead, eyebrows, nose, and mouth) are extracted from the elicited dynamic images to highlight the dominant facial motions in the frames. As a final stage, a CNN model is run on the pre-processed data for emotion categorization.

Abstract: Facial micro-expressions are involuntary facial expressions of low intensity and short duration in which hidden emotions can be revealed. Micro-expression analysis has increasingly received attention and become more advanced in the field of computer vision. However, studying micro-expressions remains very challenging and resource-intensive. Most recent works have attempted to improve spontaneous facial micro-expression recognition with sophisticated, hand-crafted feature extraction techniques; deep neural networks have also been adopted to leverage this task. In this paper, we present a compact framework in which a rank pooling concept called dynamic image is employed as a descriptor to extract informative features from certain regions of interest, along with a convolutional neural network (CNN) deployed on the elicited dynamic images to recognize the micro-expressions therein. In particular, a facial motion magnification technique is applied to the input sequences to enhance the magnitude of facial movements in the data. Subsequently, rank pooling is implemented to obtain dynamic images. Only a fixed number of localized facial areas are extracted from the dynamic images, based on observed dominant muscular changes. CNN models are fit to the final feature representation for the emotion classification task. The framework is simple compared with those of other works, yet its effectiveness is justified by the experimental results we achieved throughout the study. The experiment is evaluated on three state-of-the-art databases: CASME II, SMIC, and SAMM.
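The dynamic-image step can be sketched with closed-form approximate rank pooling coefficients. Assuming the common approximation from the dynamic-image literature (the paper may use a different variant), a clip collapses into one frame-shaped image whose pixels encode temporal dynamics; the magnification and CNN stages are outside this snippet.

```python
import numpy as np

def approximate_rank_pooling(frames):
    # frames: (T, H, W) or (T, H, W, C) array of a video sequence.
    T = frames.shape[0]
    t = np.arange(1, T + 1)
    # Closed-form coefficients approximating the rank-SVM solution:
    # later frames get positive weight, earlier frames negative.
    alpha = 2 * t - T - 1
    return np.tensordot(alpha, frames.astype(float), axes=(0, 0))

# Toy 8-frame grayscale clip with a brightening patch (a "facial movement").
clip = np.zeros((8, 32, 32))
for i in range(8):
    clip[i, 10:14, 10:14] = i / 7.0

dyn = approximate_rank_pooling(clip)
print(dyn.shape)
```

Static background pixels cancel to zero under these coefficients, while pixels that change over time survive, which is why the dynamic image isolates motion for the downstream CNN.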

S. Kong and A. Kazemzadeh, "Emotion Twenty Questions in Chinese", presented at Congreso Internacional de Ingeniería de Sistemas (CIIS 2019), Lima, Peru, Sept 5-6, 2019.

Abstract: Our study introduces the game of Emotion Twenty Questions (EMO20Q) as an experiment into the cognition and expression of emotion among ordinary people who speak Chinese. The preliminary results show that such a game is felicitous and that the questions generated to describe emotions have commonalities with earlier studies conducted in English.

B. Jackson and M. Rege, "Machine Learning for Classification of Economic Recessions", in Proceedings of IEEE 20th International Conference on Information Reuse and Integration for Data Science, Los Angeles, California, July 30th through August 1, 2019.

Abstract: The ability to quickly and accurately classify economic activity into periods of recession and expansion is of great interest to economists and policy makers. Machine Learning methods can potentially be applied to the classification of business cycles. This paper describes two machine learning methods, K-Nearest Neighbor and Neural Networks, and compares them to a Dynamic Factor Markov Switching model for determining business cycle turning points. We conclude that machine learning techniques can offer more accurate classifiers that are worthy of additional study.
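The K-Nearest Neighbor classifier mentioned above can be sketched on synthetic indicator data. The labeling rule used here (recessions pair negative growth with rising unemployment) is invented for illustration, and the neural network and Dynamic Factor Markov Switching comparisons are omitted.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n = 300
# Two synthetic indicators standing in for real macroeconomic series.
growth = rng.normal(2.0, 2.0, n)          # GDP growth, percent
unemp_change = rng.normal(0.0, 0.5, n)    # change in unemployment rate
# Invented labeling rule, for illustration only.
recession = ((growth < 0) & (unemp_change > 0)).astype(int)

X = np.column_stack([growth, unemp_change])
knn = KNeighborsClassifier(n_neighbors=5)
acc = cross_val_score(knn, X, recession, cv=5).mean()
print(f"mean cross-validated accuracy: {acc:.2f}")
```

Because recessions are rare, cross-validated accuracy alone flatters any classifier here; a real evaluation would also track recall on the recession class.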

K. Wu and M. Rege, "Hibiki: A Graph Visualization of Asian Music", in Proceedings of IEEE 20th International Conference on Information Reuse and Integration for Data Science, Los Angeles, California, July 30th through August 1, 2019.

Hibiki's graph visualization is an interactive tool for exploring music albums and artists. Its interface includes (a) a main panel that renders nodes and relationships, (b) a taxonomy panel that gives node counts, (c) an information panel that details more information on selected nodes, and (d) a toolbar for extra functions.

Abstract: Creating a visualization for a specific subdomain is an arduous task, since most commercial visualization tools are written in a way that allows them to be applied to multiple subject domains. These tools are certainly powerful, but because they were not written with a specific subject domain in mind, they can be a weak fit for any one domain. Thus, many researchers may want to create their own visualization. The goal of this project is to create a Neo4j database and an interactive web interface for a dataset that covers the intricacies of the East Asian music scene, primarily focused on Japanese music. This paper serves as documentation to help other authors understand the processes involved in designing and creating similar tools. We break the project down into three separate components. First, we introduce the fundamentals of a Neo4j graph database and data-mapping design decisions. Next, we explore what an ETL process looks like and how to implement it using Ruby libraries. Finally, we look at the design of the graph visualization software, its components, and key design decisions. We end the discussion with some analysis of the visualization's effectiveness at providing information and of how to improve its computational efficiency.

R. Mbah, M. Rege, and B. Misra, "Using Spark and Scala for Discovering Latent Trends in Job Markets", in Proceedings of the 2019 3rd International Conference on Compute and Data Analysis, Maui, Hawaii, March 14-17, 2019.

Abstract: Job markets are experiencing exponential growth in data alongside the recent explosion of big data in various domains including health, security and finance. Staying current with job market trends entails collecting, processing and analyzing huge amounts of data. A typical challenge with analyzing job listings is that they vary drastically in verbiage; for instance, a given job title or skill can be referred to using different words or industry jargon. As a result, it becomes incumbent to go beyond the words present in job listings and carry out analysis aimed at discovering latent structures and trends. In this paper, we present a systematic approach to uncovering latent trends in job markets using big data technologies (Apache Spark and Scala) and distributed semantic techniques such as latent semantic analysis (LSA). We show how LSA can uncover patterns, relationships, and trends that would otherwise remain hidden when using traditional text mining techniques that rely only on word frequencies in documents.
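The LSA step can be sketched at small scale with scikit-learn's TruncatedSVD standing in for the paper's Spark/Scala implementation, on invented toy job listings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Invented toy job listings; similar roles phrased with different words.
listings = [
    "software engineer java spring microservices",
    "java developer spring boot backend",
    "data scientist python machine learning",
    "machine learning engineer python statistics",
    "registered nurse patient care hospital",
]

X = TfidfVectorizer().fit_transform(listings)
# Project listings into a low-dimensional latent semantic space (LSA).
lsa = TruncatedSVD(n_components=3, random_state=0)
Z = lsa.fit_transform(X)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Same-role listings land near each other in the latent space even with
# few shared words, which raw term frequencies alone would miss.
print(f"java vs java:  {cos(Z[0], Z[1]):.2f}")
print(f"java vs nurse: {cos(Z[0], Z[4]):.2f}")
```

This is the sense in which LSA surfaces "latent" trends: the two Java listings score as similar despite different job titles, while the unrelated nursing listing stays distant.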

R. Mbah, M. Rege, B. Misra, “Discovering Job Market Trends with Text Analytics”, in Proceedings of the International Conference on Information Technology, Bhubaneswar, India, Dec 21-23, 2017.

Abstract: Due to the dynamic and competitive nature of job markets, especially the IT job market, it has become incumbent on organizations and businesses to stay informed about current job market trends. Staying current with trends entails collecting and analyzing huge amounts of data, which in the past has always involved a great deal of manual work. In this paper, we present our work on collecting, analyzing, and visualizing local job data using text mining techniques. We also discuss the technologies used: cron jobs for automation; Java for API data collection and web scraping; Elasticsearch for data subsetting and keyword analysis; and R for data analysis and visualization. We expect this work to be of relevance to a diverse range of job seekers as well as employers and educational institutions.