Data Documentation for "Analyzing Syntactic Constructs of Java Programs with Machine Learning"

General Information:

This data contains all the results, source code and the datasets created in the research article "F. Ortin, G. Facundo, M. Garcia. Analyzing syntactic constructs of Java programs with machine learning. Expert Systems with Applications (215), pp. 119398-119414, 2023. https://doi.org/10.1016/j.eswa.2022.119398"

Name of dataset: Data from the article "F. Ortin, G. Facundo, M. Garcia. Analyzing syntactic constructs of Java programs with machine learning. Expert Systems with Applications (215), pp. 119398-119414, 2023. https://doi.org/10.1016/j.eswa.2022.119398"

Name of data files in the data set:
    classification_rules.zip
    clusters.zip
    datasets.7z
    frequency_analysis.zip
    logistic_regression.zip
    source_code.zip
    visualization.zip

Dataset language: English

Date the data set was last modified: 25 June 2022

Funder: This work has been partially funded by the Spanish Department of Science, Innovation and Universities: project RTI2018-099235-B-I00. The authors have also received funds from the University of Oviedo, Spain through its support of official research groups (GR-2011-0040).

How to cite data: Data from the article "F. Ortin, G. Facundo, M. Garcia. Analyzing syntactic constructs of Java programs with machine learning. Expert Systems with Applications (215), pp. 119398-119414, 2023. https://doi.org/10.1016/j.eswa.2022.119398"

Methodology for data collection: Detailed in "F. Ortin, G. Facundo, M. Garcia. Analyzing syntactic constructs of Java programs with machine learning. Expert Systems with Applications (215), pp. 119398-119414, 2023. https://doi.org/10.1016/j.eswa.2022.119398"

Data collector(s): Francisco Ortin Soler, ortin@uniovi.es

Date of data collection: 25 June 2022

Person to contact with questions: Francisco Ortin Soler, ortin@uniovi.es, https://reflection.uniovi.es

Data entry: 16 January 2024

Software (including version #) used to prepare data set: Detailed in "F. Ortin, G. Facundo, M. Garcia. Analyzing syntactic constructs of Java programs with machine learning. Expert Systems with Applications (215), pp. 119398-119414, 2023. https://doi.org/10.1016/j.eswa.2022.119398"

Data processing that was performed: Detailed in "F. Ortin, G. Facundo, M. Garcia. Analyzing syntactic constructs of Java programs with machine learning. Expert Systems with Applications (215), pp. 119398-119414, 2023. https://doi.org/10.1016/j.eswa.2022.119398"

Variables: Detailed in "F. Ortin, G. Facundo, M. Garcia. Analyzing syntactic constructs of Java programs with machine learning. Expert Systems with Applications (215), pp. 119398-119414, 2023. https://doi.org/10.1016/j.eswa.2022.119398"

File Overview:

classification_rules.zip: Classification rules obtained with the IREP and RIPPERk classification rule induction algorithms (after the filtering process described in the article).

clusters.zip: All the information about the clusters found running the k-means algorithm.

datasets.7z: 7-Zip file (compression level option: 9 - Ultra) with the 12 datasets used to train and test the models.

frequency_analysis.zip: Frequency analysis of the most common syntactic constructs used by Java programmers.

logistic_regression.zip: β coefficients of the logistic regression models created to classify AST nodes regarding the programmer's expertise.

source_code.zip: Source code of the OpenJDK plugin for the Java compiler (Java code), and the Python code to analyze all the data with different machine learning algorithms.

visualization.zip: Visualization of the 12 datasets after running the 4 dimensionality reduction algorithms described in the paper.
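
As a convenience, the ZIP archives listed in the File Overview can be inspected before extraction. The following is a minimal sketch, not part of the published source code, that uses only the Python standard library. It assumes the archives have been downloaded into a single directory; the function names and the data_dir parameter are hypothetical. The internal layout of each archive is not documented here, so the sketch only counts entries. datasets.7z is left out because the standard library cannot read 7-Zip files; it needs a separate 7-Zip tool.

```python
import zipfile
from pathlib import Path

# Archive names as listed in the File Overview. datasets.7z is excluded
# because the standard library has no 7-Zip reader.
ZIP_ARCHIVES = [
    "classification_rules.zip",
    "clusters.zip",
    "frequency_analysis.zip",
    "logistic_regression.zip",
    "source_code.zip",
    "visualization.zip",
]


def list_zip_members(archive_path):
    """Return the member names of one ZIP archive without extracting it."""
    with zipfile.ZipFile(archive_path) as archive:
        return archive.namelist()


def summarize(data_dir="."):
    """Map each archive found in data_dir (assumed location) to its entry count."""
    summary = {}
    for name in ZIP_ARCHIVES:
        path = Path(data_dir) / name
        if path.is_file():  # skip archives that have not been downloaded
            summary[name] = len(list_zip_members(path))
    return summary


if __name__ == "__main__":
    for name, count in summarize().items():
        print(f"{name}: {count} entries")
```

Run from the directory that holds the downloaded archives, the script prints one line per archive it finds; archives that are missing are skipped silently.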