DistilKaggle: A Distilled Dataset of Kaggle Jupyter Notebooks (MSR 2024 - Data and Tool Showcase Track)

Who

Mojtaba Mostafavi, Arash Asgari, Mohammad Abolnejadian, Abbas Heydarnoori

Track

MSR 2024 Data and Tool Showcase Track

Time Zone

The program is currently displayed in (GMT+01:00) Lisbon.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Tue 16 Apr 2024 12:14 - 12:18 at Grande Auditório - Software Evolution & Analysis Chair(s): Vladimir Kovalenko

Abstract

Jupyter notebooks have become indispensable tools for data analysis and processing in various domains. However, despite their widespread use, there is a notable research gap in understanding and analyzing the contents and code metrics of these notebooks. This gap is primarily attributed to the absence of datasets that encompass both Jupyter notebooks and their extracted code metrics. To address this limitation, we introduce DistilKaggle, a dataset curated specifically to facilitate research on code metrics in Jupyter notebooks, using the Kaggle repository as its source. Through an extensive study, we identify thirty-four code metrics that significantly impact Jupyter notebook code quality. These features, such as lines of code cell, mean number of words in markdown cells, and performance tier of developer, are crucial for understanding and improving the overall effectiveness of computational notebooks. The DistilKaggle dataset is derived from a vast collection of notebooks, and we present two distinct datasets: (i) the Code Cells and Markdown Cells Dataset, which is presented in two CSV files, allowing for easy integration into researchers’ workflows as dataframes. It provides a granular view of the content structure within 542,051 Jupyter notebooks, enabling detailed analysis of code and markdown cells; and (ii) the Notebook Code Metrics Dataset, which focuses on the identified code metrics of notebooks. Researchers can leverage this dataset to access Jupyter notebooks with specific code quality characteristics, surpassing the limitations of the filters available on the Kaggle website. Furthermore, the reproducibility of the notebooks in our dataset is ensured through the code cells and markdown cells datasets, offering a reliable foundation for researchers to build upon.
Given the substantial size of our datasets (less than 5 GB), they are an invaluable resource for the research community, surpassing what individual Kaggle users could collect on their own. For accessibility and transparency, both the datasets (https://doi.org/10.5281/zenodo.10317389) and the code used to build them (https://github.com/theablemo/DistilKaggle) are publicly available.
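
The kind of metric-based selection described in the abstract might look roughly like the following sketch. It is an illustration only: the inline sample data, the column names (kernel_id, loc, mean_words_in_markdown, performance_tier), and the helper functions load_metrics and select_notebooks are all hypothetical and are not taken from the published CSV files, whose actual schema should be checked in the Zenodo record. The sketch uses only the Python standard library; with pandas, the same idea would typically be a read_csv call followed by a boolean filter on the dataframe.

```python
import csv
import io

# Hypothetical excerpt standing in for a notebook-metrics CSV file.
# Column names and values are illustrative assumptions, not the real schema.
SAMPLE_METRICS_CSV = """kernel_id,loc,mean_words_in_markdown,performance_tier
101,120,14.5,2
102,35,3.0,0
103,410,22.1,3
"""


def load_metrics(handle):
    """Read metric rows from a CSV file object and convert numeric fields."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "kernel_id": int(row["kernel_id"]),
            "loc": int(row["loc"]),
            "mean_words_in_markdown": float(row["mean_words_in_markdown"]),
            "performance_tier": int(row["performance_tier"]),
        })
    return rows


def select_notebooks(rows, min_markdown_words, min_tier):
    """Return ids of notebooks that meet both metric thresholds."""
    return [
        r["kernel_id"]
        for r in rows
        if r["mean_words_in_markdown"] >= min_markdown_words
        and r["performance_tier"] >= min_tier
    ]


rows = load_metrics(io.StringIO(SAMPLE_METRICS_CSV))
selected = select_notebooks(rows, min_markdown_words=10.0, min_tier=2)
print(selected)
```

With a real file, io.StringIO(SAMPLE_METRICS_CSV) would be replaced by an open file handle, and the selected ids could then be used to look up the matching rows in the code cell and markdown cell CSV files.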

Mojtaba Mostafavi

Department of Computer Engineering of Sharif University of Technology

Arash Asgari

Department of Computer Engineering of Sharif University of Technology

Mohammad Abolnejadian

Department of Computer Engineering of Sharif University of Technology

Abbas Heydarnoori

Bowling Green State University

United States

Session Program

Tue 16 Apr
Displayed time zone: Lisbon

11:00 - 12:30	Software Evolution & Analysis (Technical Papers / Data and Tool Showcase Track / Industry Track) at Grande Auditório. Chair(s): Vladimir Kovalenko (JetBrains Research)

11:00 12m Talk		Unveiling ChatGPT's Usage in Open Source Projects: A Mining-based Study | Technical Papers | Rosalia Tufano (Università della Svizzera Italiana); Antonio Mastropaolo (Università della Svizzera italiana); Federica Pepe (University of Sannio); Ozren Dabic (Software Institute, Università della Svizzera italiana (USI), Switzerland); Massimiliano Di Penta (University of Sannio, Italy); Gabriele Bavota (Software Institute @ Università della Svizzera Italiana)
11:12 12m Talk		DRMiner: A Tool For Identifying And Analyzing Refactorings In Dockerfile | Technical Papers | Emna Ksontini (University of Michigan - Dearborn); Aycha Abid (Oakland University); Rania Khalsi (University of Michigan - Flint); Marouane Kessentini (University of Michigan - Flint)
11:24 12m Talk		A Large-Scale Empirical Study of Open Source License Usage: Practices and Challenges | Technical Papers | Jiaqi Wu (Zhejiang University); Lingfeng Bao (Zhejiang University); Xiaohu Yang (Zhejiang University); Xin Xia (Huawei Technologies); Xing Hu (Zhejiang University)
11:36 12m Talk		Analyzing the Evolution and Maintenance of ML Models on Hugging Face | Technical Papers | Joel Castaño Fernández (Universitat Politècnica de Catalunya); Silverio Martínez-Fernández (UPC-BarcelonaTech); Xavier Franch (Universitat Politècnica de Catalunya); Justus Bogner (Vrije Universiteit Amsterdam)
11:48 12m Talk		On the Anatomy of Real-World R Code for Static Analysis | Technical Papers | Florian Sihler (Ulm University); Lukas Pietzschmann (Ulm University); Raphael Straub (Ulm University); Matthias Tichy (Ulm University, Germany); Andor Diera (Ulm University); Abdelhalim Dahou (GESIS Leibniz Institute for the Social Sciences)
12:00 6m Talk		Encoding Version History Context for Better Code Representation | Technical Papers | Huy Nguyen (The University of Melbourne); Christoph Treude (Singapore Management University); Patanamon Thongtanunam (University of Melbourne)
12:06 4m Talk		CodeLL: A Lifelong Learning Dataset to Support the Co-Evolution of Data and Language Models of Code | Data and Tool Showcase Track | Martin Weyssow (DIRO, Université de Montréal); Claudio Di Sipio (University of L'Aquila); Davide Di Ruscio (University of L'Aquila); Houari Sahraoui (DIRO, Université de Montréal)
12:10 4m Talk		Bidirectional Paper-Repository Tracing in Software Engineering | Data and Tool Showcase Track | Daniel Garijo; Miguel Arroyo (Universidad Politécnica de Madrid); Esteban González Guardia (Universidad Politécnica de Madrid); Christoph Treude (Singapore Management University); Nicola Tarocco (CERN)
12:14 4m Talk		DistilKaggle: A Distilled Dataset of Kaggle Jupyter Notebooks | Data and Tool Showcase Track | Mojtaba Mostafavi (Department of Computer Engineering of Sharif University of Technology); Arash Asgari (Department of Computer Engineering of Sharif University of Technology); Mohammad Abolnejadian (Department of Computer Engineering of Sharif University of Technology); Abbas Heydarnoori (Bowling Green State University)
12:18 5m Talk		Estimating Usage of Open Source Projects | Industry Track | Sophia Vargas (Google LLC); Georg Link (Bitergia); JaYoung Lee (Google)