Encoding Version History Context for Better Code Representation (MSR 2024 - Technical Papers) - MSR 2024

Mon 15 - Tue 16 April 2024 Lisbon, Portugal

co-located with ICSE 2024

Who

Huy Nguyen, Christoph Treude, Patanamon Thongtanunam

Track

MSR 2024 Technical Papers

Time Zone

The program is currently displayed in (GMT+01:00) Lisbon.

Use conference time zone: (GMT+01:00) LisbonSelect other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

By setting a time band, the program will dim events that are outside this time window. This is useful for (virtual) conferences with a continuous program (with repeated sessions).
The time band will also limit the events that are included in the personal iCalendar subscription service.

Display full programSpecify a time band

When

Tue 16 Apr 2024 12:00 - 12:06 at Grande Auditório - Software Evolution & Analysis Chair(s): Vladimir Kovalenko

Abstract

With the recent exponential growth of AI tools that generate source code, understanding software has become increasingly crucial. When developers comprehend a program, they might refer to additional contexts to look for information, e.g., program documentation or historical code versions. Therefore, we argue that encoding this additional contextual information could also benefit code representation for deep learning. Recent papers successfully incorporate contextual data (e.g., call hierarchy) into vector representation to address program comprehension problems. This motivates further studies to explore additional contexts, such as version history, to enhance models’ understanding of programs. That is, insights from version history enable recognition of patterns in code evolution over time, recurring issues, and the effectiveness of past solutions. Our paper presents preliminary evidence of the potential benefit of encoding contextual information from the version history to predict code clones and perform code classification. We experiment with two representative deep learning models, ASTNN and CodeBERT, to investigate whether combining one or multiple additional contexts with different aggregations might benefit downstream activities. The experimental result affirms the positive impact of combining version history into source code representation in all scenarios; however, to ensure the technique performs consistently, we need to conduct a holistic investigation on a larger code base using different combinations of contexts, aggregation, and models. Therefore, we propose a research agenda aimed at exploring various aspects of encoding additional context to improve code representation and its optimal utilization in specific situations.

Link to Preprint

https://arxiv.org/abs/2402.03773

Huy Nguyen

The University of Melbourne

Australia

Christoph Treude

Singapore Management University

Singapore

Patanamon Thongtanunam

University of Melbourne

Australia

Time Zone

The program is currently displayed in (GMT+01:00) Lisbon.

Use conference time zone: (GMT+01:00) LisbonSelect other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

By setting a time band, the program will dim events that are outside this time window. This is useful for (virtual) conferences with a continuous program (with repeated sessions).
The time band will also limit the events that are included in the personal iCalendar subscription service.

Display full programSpecify a time band

Session Program

Tue 16 Apr
Displayed time zone: Lisbon change

	11:00 - 12:30	Software Evolution & AnalysisTechnical Papers / Data and Tool Showcase Track / Industry Track at Grande Auditório Chair(s): Vladimir Kovalenko JetBrains Research

	11:00 12m Talk		Unveiling ChatGPT's Usage in Open Source Projects: A Mining-based Study Technical Papers Rosalia Tufano Università della Svizzera Italiana, Antonio Mastropaolo Università della Svizzera italiana, Federica Pepe University of Sannio, Ozren Dabic Software Institute, Università della Svizzera italiana (USI), Switzerland, Massimiliano Di Penta University of Sannio, Italy, Gabriele Bavota Software Institute @ Università della Svizzera Italiana
	11:12 12m Talk		DRMiner: A Tool For Identifying And Analyzing Refactorings In Dockerfile Technical Papers Emna Ksontini University of Michigan - Dearborn, Aycha Abid Oakland University, Rania Khalsi University of Michigan - Flint, Marouane Kessentini University of Michigan - Flint
	11:24 12m Talk		A Large-Scale Empirical Study of Open Source License Usage: Practices and Challenges Technical Papers Jiaqi Wu Zhejiang University, Lingfeng Bao Zhejiang University, Xiaohu Yang Zhejiang University, Xin Xia Huawei Technologies, Xing Hu Zhejiang University
	11:36 12m Talk		Analyzing the Evolution and Maintenance of ML Models on Hugging Face Technical Papers Joel Castaño Fernández Universitat Politècnica de Catalunya, Silverio Martínez-Fernández UPC-BarcelonaTech, Xavier Franch Universitat Politècnica de Catalunya, Justus Bogner Vrije Universiteit Amsterdam Link to publication Pre-print
	11:48 12m Talk		On the Anatomy of Real-World R Code for Static Analysis Technical Papers Florian Sihler Ulm University, Lukas Pietzschmann Ulm University, Raphael Straub Ulm University, Matthias Tichy Ulm University, Germany, Andor Diera Ulm University, Abdelhalim Dahou GESIS Leibniz Institute for the Social Sciences Pre-print File Attached
	12:00 6m Talk		Encoding Version History Context for Better Code Representation Technical Papers Huy Nguyen The University of Melbourne, Christoph Treude Singapore Management University, Patanamon Thongtanunam University of Melbourne Pre-print
	12:06 4m Talk		CodeLL: A Lifelong Learning Dataset to Support the Co-Evolution of Data and Language Models of Code Data and Tool Showcase Track Martin Weyssow DIRO, Université de Montréal, Claudio Di Sipio University of L'Aquila, Davide Di Ruscio University of L'Aquila, Houari Sahraoui DIRO, Université de Montréal
	12:10 4m Talk		Bidirectional Paper-Repository Tracing in Software Engineering Data and Tool Showcase Track Daniel Garijo , Miguel Arroyo Universidad Politécnica de Madrid, Esteban González Guardia Universidad Politécnica de Madrid, Christoph Treude Singapore Management University, Nicola Tarocco CERN
	12:14 4m Talk		DistilKaggle: A Distilled Dataset of Kaggle Jupyter Notebooks Data and Tool Showcase Track Mojtaba Mostafavi Department of Computer Engineering of Sharif University of Technology, Arash Asgari Department of Computer Engineering of Sharif University of Technology, Mohammad Abolnejadian Department of Computer Engineering of Sharif University of Technology, Abbas Heydarnoori Bowling Green State University
	12:18 5m Talk		Estimating Usage of Open Source Projects Industry Track Sophia Vargas Google LLC, Georg Link Bitergia, JaYoung Lee Google