Publications
Publications in reversed chronological order. My Google Scholar is more likely to be up to date.
2024
- TMLRMultitask Learning Can Improve Worst-Group OutcomesAtharva Kulkarni, Lucio M. Dery, Amrith Setlur, Aditi Raghunathan, Ameet Talwalkar, and Graham NeubigTransactions on Machine Learning Research, 2024
In order to create machine learning systems that serve a variety of users well, it is vital to not only achieve high average performance but also ensure equitable outcomes across diverse groups. However, most machine learning methods are designed to improve a model’s average performance on a chosen end task without consideration for their impact on worst group error. Multitask learning (MTL) is one such widely used technique. In this paper, we seek not only to understand the impact of MTL on worst-group accuracy but also to explore its potential as a tool to address the challenge of group-wise fairness. We primarily consider the standard setting of fine-tuning a pre-trained model, where, following recent work (Gururangan et al., 2020; Dery et al., 2023), we multitask the end task with the pre-training objective constructed from the end task data itself. In settings with few or no group annotations, we find that multitasking often, but not consistently, achieves better worst-group accuracy than Just-Train-Twice (JTT; Liu et al. (2021)) – a representative distributionally robust optimization (DRO) method. Leveraging insights from synthetic data experiments, we propose to modify standard MTL by regularizing the joint multitask representation space. We run a large number of fine-tuning experiments across computer vision and natural language processing datasets and find that our regularized MTL approach consistently outperforms JTT on both average and worst-group outcomes. Our official code can be found here: https://github.com/atharvajk98/MTL-group-robustness.
- EACLSynthDST: Synthetic Data is All You Need for Few-Shot Dialog State TrackingAtharva Kulkarni, Bo-Hsiang Tseng, Joel Moniz, Dhivya Piraviperumal, Hong Yu, and Shruti BhargavaIn Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Mar 2024
In-context learning with Large Language Models (LLMs) has emerged as a promising avenue of research in Dialog State Tracking (DST). However, the best-performing in-context learning methods involve retrieving and adding similar examples to the prompt, requiring access to labeled training data. Procuring such training data for a wide range of domains and applications is time-consuming, expensive, and, at times, infeasible. While zero-shot learning requires no training data, it significantly lags behind the few-shot setup. Thus, ‘\textitCan we efficiently generate synthetic data for any dialogue schema to enable few-shot prompting?’ Addressing this question, we propose , a data generation framework tailored for DST, utilizing LLMs. Our approach only requires the dialogue schema and a few hand-crafted dialogue templates to synthesize natural, coherent, and free-flowing dialogues with DST annotations. Few-shot learning using data from results in 4-5% improvement in Joint Goal Accuracy over the zero-shot baseline on MultiWOZ 2.1 and 2.4. Remarkably, our few-shot learning approach recovers nearly 98% of the performance compared to the few-shot setup using human-annotated training data.
2023
- EMNLP (Findings)The student becomes the master: Outperforming GPT3 on Scientific Factual Error CorrectionDhananjay Ashok, Atharva Kulkarni, Hai Pham, and Barnabas PoczosIn Findings of the Association for Computational Linguistics: EMNLP 2023 & NeurIPS 2023 Workshop on Synthetic Data Generation with Generative AI (SyntheticData4ML), Dec 2023
Due to the prohibitively high cost of creating error correction datasets, most Factual Claim Correction methods rely on a powerful verification model to guide the correction process. This leads to a significant drop in performance in domains like Scientific Claim Correction, where good verification models do not always exist. In this work we introduce SciFix, a claim correction system that does not require a verifier but is able to outperform existing methods by a considerable margin — achieving correction accuracy of 84% on the SciFact dataset, 77% on SciFact-Open and 72.75% on the CovidFact dataset, compared to next best accuracies of 7.6%, 5% and 15% on the same datasets respectively. Our method leverages the power of prompting with LLMs during training to create a richly annotated dataset that can be used for fully supervised training and regularization. We additionally use a claim-aware decoding procedure to improve the quality of corrected claims. Our method outperforms the very LLM that was used to generate the annotated dataset — with FewShot Prompting on GPT3.5 achieving 58%, 61% and 64% on the respective datasets, a consistently lower correction accuracy, despite using nearly 800 times as many parameters as our model.
- EMNLPCounting the Bugs in ChatGPT’s Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language ModelLeonie Weissweiler, Valentin Hofmann, Anjali Kantharuban, Anna Cai, Ritam Dutt, Amey Hengle, Anubha Kabra, Atharva Kulkarni, Abhishek Vijayakumar, Haofei Yu, and 3 more authorsIn Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Dec 2023
Large language models (LLMs) have recently reached an impressive level of linguistic capability, prompting comparisons with human language skills. However, there have been relatively few systematic inquiries into the linguistic capabilities of the latest generation of LLMs, and those studies that do exist (i) ignore the remarkable ability of humans to generalize, (ii) focus only on English, and (iii) investigate syntax or semantics and overlook other capabilities that lie at the heart of human language, like morphology. Here, we close these gaps by conducting the first rigorous analysis of the morphological capabilities of ChatGPT in four typologically varied languages (specifically, English, German, Tamil, and Turkish). We apply a version of Berko’s (1958) wug test to ChatGPT, using novel, uncontaminated datasets for the four examined languages. We find that ChatGPT massively underperforms purpose-built systems, particularly in English. Overall, our results—through the lens of morphology—cast a new light on the linguistic capabilities of ChatGPT, suggesting that claims of human-like language skills are premature and misleading.
- SIGKDDRevisiting Hate Speech Benchmarks: From Data Curation to System DeploymentAtharva Kulkarni*, Sarah Masud*, Vikram Goyal, and Tanmoy ChakrabortyIn Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Dec 2023
Social media is awash with hateful content, much of which is often veiled with linguistic and topical diversity. The benchmark datasets used for hate speech detection do not account for such divagation as they are predominantly compiled using hate lexicons. However, capturing hate signals becomes challenging in neutrally-seeded malicious content. Thus, designing models and datasets that mimic the real-world variability of hate warrants further investigation.To this end, we present GOTHate, a large-scale code-mixed crowdsourced dataset of around 51k posts for hate speech detection from Twitter. GOTHate is neutrally seeded, encompassing different languages and topics. We conduct detailed comparisons of GOTHate with the existing hate speech datasets, highlighting its novelty. We benchmark it with 10 recent baselines. Our extensive empirical and benchmarking experiments suggest that GOTHate is hard to classify in a text-only setup. Thus, we investigate how adding endogenous signals enhances the hate speech detection task. We augment GOTHate with the user’s timeline information and ego network, bringing the overall data source closer to the real-world setup for understanding hateful content. Our proposed solution HEN-mBERT is a modular, multilingual, mixture-of-experts model that enriches the linguistic subspace with latent endogenous signals from history, topology, and exemplars. HEN-mBERT transcends the best baseline by 2.5% and 5% in overall macro-F1 and hate class F1, respectively. Inspired by our experiments, in partnership with Wipro AI, we are developing a semi-automated pipeline to detect hateful content as a part of their mission to tackle online harm.
- IJCAILearning and Reasoning Multifaceted and Longitudinal Data for Poverty Estimates and Livelihood Capabilities of Lagged Regions in Rural IndiaAtharva Kulkarni, Raya Das, Ravi S. Srivastava, and Tanmoy ChakrabortyIn Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, Aug 2023AI for Good - Projects
Poverty is a multifaceted phenomenon linked to the lack of capabilities of households to earn a sustainable livelihood, increasingly being assessed using multidimensional indicators. Its spatial pattern depends on social, economic, political, and regional variables. Artificial intelligence has shown immense scope in analyzing the complexities and nuances of poverty. The proposed project aims to examine the poverty situation of rural India for the period of 1990-2022 based on the quality of life and livelihood indicators. The districts will be classified into ‘advanced’, ‘catching up’, ‘falling behind’, and ‘lagged’ regions. The project proposes to integrate multiple data sources, including conventional national-level large sample household surveys, census surveys, and proxy variables like daytime, and nighttime data from satellite images, and communication networks, to name a few, to provide a comprehensive view of poverty at the district level. The project also intends to examine causation and longitudinal analysis to examine the reasons for poverty. Poverty and inequality could be widening in developing countries due to demographic and growth-agglomerating policies. Therefore, targeting the lagging regions and the vulnerable population is essential to eradicate poverty and improve the quality of life to achieve the goal of ‘zero poverty’. Thus, the study also focuses on the districts with a higher share of the marginal section of the population compared to the national average to trace the performance of development indicators and their association with poverty in these regions.
- EACLCharacterizing the Entities in Harmful Memes: Who is the Hero, the Villain, the Victim?Atharva Kulkarni*, Shivam Sharma*, Tharun Suresh, Himanshi Mathur, Preslav Nakov, Md. Shad Akhtar, and Tanmoy ChakrabortyIn Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, May 2023
Memes can sway people’s opinions over social media as they combine visual and textual information in an easy-to-consume manner. Since memes instantly turn viral, it becomes crucial to infer their intent and potentially associated harmfulness to take timely measures as needed. A common problem associated with meme comprehension lies in detecting the entities referenced and characterizing the role of each of these entities. Here, we aim to understand whether the meme glorifies, vilifies, or victimizes each entity it refers to. To this end, we address the task of role identification of entities in harmful memes, i.e., detecting who is the ‘hero’, the ‘villain’, and the ‘victim’ in the meme, if any. We utilize HVVMemes – a memes dataset on US Politics and Covid-19 memes, released recently as part of the CONSTRAINT@ACL-2022 shared-task. It contains memes, entities referenced, and their associated roles: hero, villain, victim, and other. We further design VECTOR (Visual-semantic role dEteCToR), a robust multi-modal framework for the task, which integrates entity-based contextual information in the multi-modal representation and compare it to several standard unimodal (text-only or image-only) or multi-modal (image+text) models. Our experimental results show that our proposed model achieves an improvement of 4% over the best baseline and 1% over the best competing stand-alone submission from the shared-task. Besides divulging an extensive experimental setup with comparative analyses, we finally highlight the challenges encountered in addressing the complex task of semantic role labeling within memes.
2022
- EMNLPEmpowering the Fact-checkers! Automatic Identification of Claim Spans on TwitterAtharva Kulkarni*, Megha Sundriyal*, Vaibhav Pulastya, Md. Shad Akhtar, and Tanmoy ChakrabortyIn Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Dec 2022
The widespread diffusion of medical and political claims in the wake of COVID-19 has led to a voluminous rise in misinformation and fake news. The current vogue is to employ manual fact-checkers to efficiently classify and verify such data to combat this avalanche of claim-ridden misinformation. However, the rate of information dissemination is such that it vastly outpaces the fact-checkers’ strength. Therefore, to aid manual fact-checkers in eliminating the superfluous content, it becomes imperative to automatically identify and extract the snippets of claim-worthy (mis)information present in a post. In this work, we introduce the novel task of Claim Span Identification (CSI). We propose CURT, a large-scale Twitter corpus with token-level claim spans on more than 7.5k tweets. Furthermore, along with the standard token classification baselines, we benchmark our dataset with DABERTa, an adapter-based variation of RoBERTa. The experimental results attest that DABERTa outperforms the baseline systems across several evaluation metrics, improving by about 1.5 points. We also report detailed error analysis to validate the model’s performance along with the ablation studies. Lastly, we release our comprehensive span annotation guidelines for public use.
- ACLWhen did you become so smart, oh wise one?! Sarcasm Explanation in Multi-modal Multi-party DialoguesAtharva Kulkarni*, Shivani Kumar*, Md Shad Akhtar, and Tanmoy ChakrabortyIn Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022
Indirect speech such as sarcasm achieves a constellation of discourse goals in human communication. While the indirectness of figurative language warrants speakers to achieve certain pragmatic goals, it is challenging for AI agents to comprehend such idiosyncrasies of human communication. Though sarcasm identification has been a well-explored topic in dialogue analysis, for conversational systems to truly grasp a conversation’s innate meaning and generate appropriate responses, simply detecting sarcasm is not enough; it is vital to explain its underlying sarcastic connotation to capture its true essence. In this work, we study the discourse structure of sarcastic conversations and propose a novel task – Sarcasm Explanation in Dialogue (SED). Set in a multimodal and code-mixed setting, the task aims to generate natural language explanations of satirical conversations. To this end, we curate WITS, a new dataset to support our task. We propose MAF (Modality Aware Fusion), a multimodal context-aware attention and global information fusion module to capture multimodality and use it to benchmark WITS. The proposed attention module surpasses the traditional multimodal fusion baselines and reports the best performance on almost all metrics. Lastly, we carry out detailed analysis both quantitatively and qualitatively.
- CONSTRAINT (ACL)Findings of the CONSTRAINT 2022 Shared Task on Detecting the Hero, the Villain, and the Victim in MemesShivam Sharma, Tharun Suresh, Atharva Kulkarni, Himanshi Mathur, Preslav Nakov, Md. Shad Akhtar, and Tanmoy ChakrabortyIn Proceedings of the Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situations, May 2022
We present the findings of the shared task at the CONSTRAINT 2022 Workshop: Hero, Villain, and Victim: Dissecting harmful memes for Semantic role labeling of entities. The task aims to delve deeper into the domain of meme comprehension by deciphering the connotations behind the entities present in a meme. In more nuanced terms, the shared task focuses on determining the victimizing, glorifying, and vilifying intentions embedded in meme entities to explicate their connotations. To this end, we curate HVVMemes, a novel meme dataset of about 7000 memes spanning the domains of COVID-19 and US Politics, each containing entities and their associated roles: hero, villain, victim, or none. The shared task attracted 105 participants, but eventually only 6 submissions were made. Most of the successful submissions relied on fine-tuning pre-trained language and multimodal models along with ensembles. The best submission achieved an F1-score of 58.67.
2021
- LOUHI (EACL)Cluster Analysis of Online Mental Health Discourse using Topic-Infused Deep Contextualized RepresentationsAtharva Kulkarni, Amey Hengle, Pradnya Kulkarni, and Manisha MaratheIn Proceedings of the 12th International Workshop on Health Text Mining and Information Analysis, Apr 2021
With mental health as a problem domain in NLP, the bulk of contemporary literature revolves around building better mental illness prediction models. The research focusing on the identification of discussion clusters in online mental health communities has been relatively limited. Moreover, as the underlying methodologies used in these studies mainly conform to the traditional machine learning models and statistical methods, the scope for introducing contextualized word representations for topic and theme extraction from online mental health communities remains open. Thus, in this research, we propose topic-infused deep contextualized representations, a novel data representation technique that uses autoencoders to combine deep contextual embeddings with topical information, generating robust representations for text clustering. Investigating the Reddit discourse on Post-Traumatic Stress Disorder (PTSD) and Complex Post-Traumatic Stress Disorder (C-PTSD), we elicit the thematic clusters representing the latent topics and themes discussed in the r/ptsd and r/CPTSD subreddits. Furthermore, we also present a qualitative analysis and characterization of each cluster, unraveling the prevalent discourse themes.
- WASSA (EACL)PVG at WASSA 2021: A Multi-Input, Multi-Task, Transformer-Based Architecture for Empathy and Distress PredictionAtharva Kulkarni, Sunanda Somwase, Shivam Rajput, and Manisha MaratheIn Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Apr 2021
Active research pertaining to the affective phenomenon of empathy and distress is invaluable for improving human-machine interaction. Predicting intensities of such complex emotions from textual data is difficult, as these constructs are deeply rooted in the psychological theory. Consequently, for better prediction, it becomes imperative to take into account ancillary factors such as the psychological test scores, demographic features, underlying latent primitive emotions, along with the text’s undertone and its psychological complexity. This paper proffers team PVG’s solution to the WASSA 2021 Shared Task on Predicting Empathy and Emotion in Reaction to News Stories. Leveraging the textual data, demographic features, psychological test score, and the intrinsic interdependencies of primitive emotions and empathy, we propose a multi-input, multi-task framework for the task of empathy score prediction. Here, the empathy score prediction is considered the primary task, while emotion and empathy classification are considered secondary auxiliary tasks. For the distress score prediction task, the system is further boosted by the addition of lexical features. Our submission ranked 1st based on the average correlation (0.545) as well as the distress correlation (0.574), and 2nd for the empathy Pearson correlation (0.517).
2020
- ICONAn Attention Ensemble Approach for Efficient Text Classification of Indian LanguagesAtharva Kulkarni, Amey Hengle, and Rutuja UdyawarIn Proceedings of the 17th International Conference on Natural Language Processing (ICON): TechDOfication 2020 Shared Task, Dec 2020
The recent surge of complex attention-based deep learning architectures has led to extraordinary results in various downstream NLP tasks in the English language. However, such research for resource-constrained and morphologically rich Indian vernacular languages has been relatively limited. This paper proffers a solution for the TechDOfication 2020 subtask-1f: which focuses on the coarse-grained technical domain identification of short text documents in Marathi, a Devanagari script-based Indian language. Availing the large dataset at hand, a hybrid CNN-BiLSTM attention ensemble model is proposed that competently combines the intermediate sentence representations generated by the convolutional neural network and the bidirectional long short-term memory, leading to efficient text classification. Experimental results show that the proposed model outperforms various baseline machine learning and deep learning models in the given task, giving the best validation accuracy of 89.57% and f1-score of 0.8875. Furthermore, the solution resulted in the best system submission for this subtask, giving a test accuracy of 64.26% and f1-score of 0.6157, transcending the performances of other teams as well as the baseline system given by the organizers of the shared task.
- IEEE ICSSITSmart Cap: A Deep Learning and IoT Based Assistant for the Visually ImpairedAmey Hengle, Atharva Kulkarni, Nachiket Bavadekar, Niraj Kulkarni, and Rutuja UdyawarIn 2020 Third International Conference on Smart Systems and Inventive Technology (ICSSIT), Dec 2020