RUO: Institutional Repository of the University of Oviedo (Repositorio Institucional de la Universidad de Oviedo)

Data from "Analyzing syntactic constructs of Java programs with machine learning"

Author(s) and others:
Ortín Soler, Francisco; Facundo Colunga, Guillermo; García Rodríguez, Miguel
Keyword(s):
Abstract syntax tree; Programming language; Data mining; Feature engineering; Programming idiom; Heterogeneous dataset

Publication date:
2022-06-25
Abstract:

The number of open-source projects in public repositories has increased notably in recent years. Such repositories hold valuable information that can be mined for different purposes, such as documenting recurrent syntactic constructs, analyzing the particular constructs used by experts and beginners, using them to teach programming and to detect bad programming practices, and building programming tools such as decompilers, Integrated Development Environments or Intelligent Tutoring Systems. An inherent problem of source code is that its syntactic information is represented with tree structures, while traditional machine learning algorithms use n-dimensional datasets. Therefore, we present a feature engineering process to translate tree structures into homogeneous and heterogeneous n-dimensional datasets to be mined. Then, we run different interpretable (supervised and unsupervised) machine learning algorithms to mine the syntactic information of more than 17 million syntactic constructs in Java code. The results reveal interesting information such as the Java constructs that are barely (and widely) used (e.g., bitwise operators, union types and static blocks), different language features and patterns mostly (and barely) used by beginners (and experts), the discovery of particular types of source code (e.g., helper or utility classes, data transfer objects and overly complex abstractions), and how complexity is an inherent characteristic in some clusters of syntactic constructs.

Description:

Data from the article "F. Ortin, G. Facundo, M. Garcia. Analyzing syntactic constructs of Java programs with machine learning. Expert Systems with Applications (215), pp. 119398-119414, 2023. https://doi.org/10.1016/j.eswa.2022.119398"

URI:
https://hdl.handle.net/10651/70847
DOI:
10.17811/ruo_datasets.70847
Link to related resource:
http://hdl.handle.net/10651/67302
Sponsored by:

This work has been partially funded by the Spanish Department of Science, Innovation and Universities: project RTI2018-099235-B-I00. The authors have also received funds from the University of Oviedo, Spain through its support of official research groups (GR-2011-0040).

Collections
  • Research data (Datos de investigación) [70]
Files in this item
Dataset (474.1 MB)
Readme.txt (3.617 KB)
Unless otherwise indicated, the content of the Repository is protected by a Creative Commons license: Attribution-NonCommercial-NoDerivatives 4.0 International