RABBIT: A tool for identifying bot accounts based on their recent GitHub event history
Collaborative software development through GitHub repositories frequently relies on bot accounts that automate many repetitive and error-prone tasks. Several bot identification tools and techniques have been proposed in the past, but they tend to rely on a substantial amount of historical data, or they limit themselves to a reduced subset of activity types. To overcome these limitations, we developed RABBIT, an open-source command-line tool that queries the GitHub Events API to retrieve the recent events of a given GitHub account and predicts whether the account is a human or a bot. The prediction is based on an XGBoost classification model that relies on six features related to account activities and that is obtained through grid-search 10-fold cross-validation. Based on a newly created ground-truth dataset of GitHub accounts containing 644 bots and 691 humans, we trained the model on 60% of the data and achieved a very good performance, with an AUC of 0.97 and a weighted F1 score, precision and recall of 0.94 each. After integrating this model into RABBIT, we tested the tool on the remaining 40% of unseen data and obtained 0.93 for each of AUC, weighted F1 score, precision and recall. Taking into account the imposed hourly GitHub API limit, RABBIT can classify thousands of accounts per hour using at most 3 queries per account.
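The retrieval step described in the abstract can be sketched roughly as follows. This is an illustrative sketch, not RABBIT's actual code: the function names (`fetch_recent_events`, `count_event_types`, `accounts_per_hour`) are hypothetical, the event-type count is only a stand-in summary (the abstract does not name the six features), and the page size of 100 and the hourly limit of 5,000 requests are assumptions taken from GitHub's documented defaults for authenticated REST requests rather than from the abstract.

```python
import json
import urllib.request
from typing import Callable, Dict, List, Optional

API_ROOT = "https://api.github.com"
PER_PAGE = 100  # assumed page size (the largest the GitHub REST API accepts)
MAX_PAGES = 3   # "at most 3 queries per account", as stated in the abstract


def http_get_json(url: str, token: Optional[str] = None) -> List[dict]:
    """Perform one GET request and decode the JSON body (needs network access)."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def fetch_recent_events(
    login: str,
    fetch: Callable[[str, Optional[str]], List[dict]] = http_get_json,
    token: Optional[str] = None,
) -> List[dict]:
    """Collect the recent events of one account in at most MAX_PAGES queries."""
    events: List[dict] = []
    for page in range(1, MAX_PAGES + 1):
        url = f"{API_ROOT}/users/{login}/events?per_page={PER_PAGE}&page={page}"
        batch = fetch(url, token)
        events.extend(batch)
        if len(batch) < PER_PAGE:
            break  # a short page means there are no further events to fetch
    return events


def count_event_types(events: List[dict]) -> Dict[str, int]:
    """Hypothetical activity summary: number of events per event type."""
    counts: Dict[str, int] = {}
    for event in events:
        counts[event["type"]] = counts.get(event["type"], 0) + 1
    return counts


def accounts_per_hour(hourly_request_limit: int,
                      queries_per_account: int = MAX_PAGES) -> int:
    """Worst-case number of accounts that fit in one hour of API quota."""
    return hourly_request_limit // queries_per_account
```

Under these assumptions, a quota of 5,000 requests per hour allows at least `accounts_per_hour(5000)` accounts in the worst case of three queries each, and more when accounts have fewer recent events. The `fetch` parameter is injectable so that the pagination logic can be exercised without network access.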
Tue 16 Apr (displayed time zone: Lisbon)
14:00 - 15:30 | Process automation & DevOps II | Technical Papers / Data and Tool Showcase Track at Almada Negreiros | Chair(s): Shane McIntosh (University of Waterloo)
14:00 | 12m Talk | Options Matter: Documenting and Fixing Non-Reproducible Builds in Highly-Configurable Systems | Technical Papers | Georges Aaron RANDRIANAINA (Université de Rennes 1, IRISA), Djamel Eddine Khelladi (CNRS, IRISA, University of Rennes), Olivier Zendra (Inria), Mathieu Acher (University of Rennes, France / Inria, France / CNRS, France / IRISA, France)
14:12 | 12m Talk | How do Machine Learning Projects use Continuous Integration Practices? An Empirical Study on GitHub Actions | Technical Papers | João Helis Bernardo (Federal Institute of Education, Science and Technology of Rio Grande do Norte), Daniel Alencar Da Costa (University of Otago), Sergio Queiroz de Medeiros (Universidade Federal do Rio Grande do Norte), Uirá Kulesza (Federal University of Rio Grande do Norte)
14:24 | 4m Talk | A dataset of GitHub Actions workflow histories | Data and Tool Showcase Track | Guillaume Cardoen (University of Mons), Tom Mens (University of Mons), Alexandre Decan (University of Mons; F.R.S.-FNRS)
14:28 | 4m Talk | gawd: A Differencing Tool for GitHub Actions Workflows | Data and Tool Showcase Track | Pooya Rostami Mazrae (University of Mons), Alexandre Decan (University of Mons; F.R.S.-FNRS), Tom Mens (University of Mons)
14:32 | 4m Talk | RABBIT: A tool for identifying bot accounts based on their recent GitHub event history | Data and Tool Showcase Track | Natarajan Chidambaram (University of Mons), Tom Mens (University of Mons), Alexandre Decan (University of Mons; F.R.S.-FNRS)
14:36 | 12m Talk | An Investigation of Patch Porting Practices of the Linux Kernel Ecosystem | Technical Papers | Xingyu Li (UC Riverside), Zheng Zhang (UC Riverside), Zhiyun Qian (University of California at Riverside, USA), Trent Jaeger (UC Riverside), Chengyu Song (University of California at Riverside, USA)
14:48 | 4m Talk | BugsPHP: A dataset for Automated Program Repair in PHP | Data and Tool Showcase Track | K.D. Pramod (University of Moratuwa, Sri Lanka), W.T.N. De Silva (University of Moratuwa, Sri Lanka), W.U.K. Thabrew (University of Moratuwa, Sri Lanka), Ridwan Salihin Shariffdeen (National University of Singapore), Sandareka Wickramanayake (University of Moratuwa, Sri Lanka)