I am a Machine Learning Engineer at Amazon Alexa AI where, as a member of the Intelligent Decisions group, I build scalable ML systems that predict user intents and route them to the appropriate Alexa actions.
Prior to Amazon, I was a Research Software Engineer at IBM Research in the Collaborative AI group, where I worked on resource-efficient modeling of human behaviours (nested beliefs) in multi-agent repeated social dilemma settings. Concurrently, I designed the machine-learning-based requisition and candidate matching subsystem of IBM Watson Recruitment.
I have a Master's degree in Robotics from Carnegie Mellon University, where I conducted research on visual assessment methods for non-disruptive object extraction in human spaces at the HARP and TBD labs. I also studied the effects of anticipatory robot motion on humans, to understand whether robots can unintentionally (or intentionally) manipulate human decision-making by proactively predicting human intent.
While pursuing my Bachelor's degree in Computer Science from IIIT-Delhi, I got the chance to work on a breadth of projects, ranging from designing perception systems for autonomous vehicles at Swarath to developing methods for visual summarization of large social media datasets at PreCog. In my senior year, I was also a founding member of a digital governance start-up, Meri Awaaz, working towards enabling open, accessible, and accountable governance.
I pursue applied machine learning to enhance human interactions with robots and devices that operate ubiquitously in our personal spaces. My professional goal is to make these interactions more delightful and aligned with our well-being and intentions. My more personal pursuit, though, is to figure out what human well-being, in the context of modern technology (and beyond), even means.
Examining the Effects of Anticipatory Robot Assistance on Human Decision Making
Benjamin A. Newman*, Abhijat Biswas*, Sarthak Ahuja, Siddharth Girdhar, Kris K. Kitani, Henny Admoni
International Conference on Social Robotics (ICSR) 2020
In this work, we investigate whether a robot's anticipatory assistance can drive people to make choices different from those they would otherwise make. Such a study requires measuring intent, yet the act of measurement can itself modify intent, resulting in an observer paradox; we carefully designed our experiment to avoid this effect. We conducted a user study (N=99) in which participants completed a collaborative object retrieval task: users selected an object and a robot arm retrieved it for them. The robot predicted the user's object selection from eye gaze in advance of their explicit selection, and then provided either collaborative anticipation (moving toward the predicted object), adversarial anticipation (moving away from the predicted object), or no anticipation (no movement, the control condition). We found trends and participant comments suggesting that people's decision making changes in the presence of robot anticipatory motion, and that this change differs depending on the robot's anticipation strategy.
Visual Assessment for Non-Disruptive Object Extraction
Robots operating in human environments need to perform a variety of dexterous manipulation tasks on object arrangements that have complex physical support relationships, e.g., procuring utensils from a large pile of dishes, grabbing a bottle from a stuffed fridge, or fetching a book from a loaded shelf. The cost of a misjudged extraction in these situations can be very high (e.g., other objects falling), and therefore robots must be careful not to disturb other objects when executing manipulation skills. This requires robots to reason about the effect of their manipulation choices by accounting for the support relationships among objects in the scene. Humans do this in part by visually assessing the scene and using physics intuition to infer how likely it is that a particular object can be safely moved. Inspired by this human capability, we explore how robots can emulate similar vision-based physics intuition using data-driven deep learning models.
Learning Vision-Based Physics Intuition Models for Non-Disruptive Object Extraction
Sarthak Ahuja, Henny Admoni, Aaron Steinfeld
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2020
Robots operating in human environments must be careful, when executing their manipulation skills, not to disturb nearby objects. This requires robots to reason about the effect of their manipulation choices by accounting for the support relationships among objects in the scene. Humans do this in part by visually assessing their surroundings and using physics intuition for how likely it is that a particular object can be safely manipulated (i.e., cause no disruption in the rest of the scene). Existing work has shown that deep convolutional neural networks can learn intuitive physics over images generated in simulation and determine the stability of a scene in the real world. In this paper, we extend these physics intuition models to the task of assessing safe object extraction by conditioning the visual images on specific objects in the scene. Our results, in both simulation and real-world settings, show that with our proposed method, physics intuition models can be used to inform a robot of which objects can be safely extracted and from which direction to extract them.
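A hedged sketch of the input conditioning described above: the scene image is augmented with a binary mask channel that singles out the candidate object, so a single physics-intuition network can score any object in the scene for safe extraction. Shapes, names, and the exact conditioning scheme here are illustrative assumptions; the paper's actual architecture may differ.

```python
import numpy as np

def make_conditioned_input(scene_rgb, object_mask):
    """Stack an HxWx3 scene image with an HxW object mask into an HxWx4 input."""
    assert scene_rgb.shape[:2] == object_mask.shape
    # Cast the boolean mask to the image dtype and add a channel axis.
    mask = object_mask.astype(scene_rgb.dtype)[..., None]
    return np.concatenate([scene_rgb, mask], axis=-1)

scene = np.zeros((64, 64, 3), dtype=np.float32)
mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 25:35] = True  # the object the robot considers extracting
x = make_conditioned_input(scene, mask)
print(x.shape)  # (64, 64, 4)
```

The same scene can then be scored once per object simply by swapping the mask channel, which is what makes the per-object "can this be safely extracted, and from which direction" query cheap.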
Dynamic Particle Allocation to Solve Interactive POMDP Models for Social Decision Making
Rohith D Vallam, Sarthak Ahuja, Surya Shravan Kumar Sajja, Ritwik Chaudhuri, Rakesh R Pimplikar, Kushal Mukherjee, Gyana Parija, Ramasuri Narayanam
International Conference on Autonomous Agents and Multiagent Systems (AAMAS) 2019
In repeated social dilemma settings, such as repeated Public Goods Games (PGG), humans often face a dilemma of whether or not to contribute, based on the past contributions of others. In such settings, the decision taken by an agent/human depends not only on the beliefs the agent has about other agents and the environment, but also on their beliefs about others' beliefs. To factor in these aspects, we propose a novel formulation of computational theory of mind (ToM) to model human behavior in a repeated PGG using interactive partially observable Markov decision processes (I-POMDPs). We also propose a dynamic particle allocation algorithm that assigns particles to different agents based on how well they predict. Our results suggest that dynamic-particle-allocation-based interactive particle filtering (IPF) for I-POMDPs is effective at modelling human behaviours in repeated social dilemma settings while using computational resources efficiently.
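A minimal sketch of the allocation idea: a fixed particle budget is redistributed across agents' particle filters in proportion to each agent's recent prediction error, so harder-to-model agents receive more particles. The function name, the error-proportional heuristic, and the minimum-particle floor are illustrative assumptions, not the paper's exact algorithm.

```python
def allocate_particles(prediction_errors, total_budget, min_particles=10):
    """Assign each agent a particle count proportional to its prediction error."""
    total_error = sum(prediction_errors)
    if total_error == 0:
        # No error signal yet: split the budget evenly.
        share = total_budget // len(prediction_errors)
        return [share] * len(prediction_errors)
    allocation = []
    for err in prediction_errors:
        n = int(total_budget * err / total_error)
        # Keep a floor so no agent's filter degenerates entirely.
        allocation.append(max(n, min_particles))
    return allocation

# Agents with more mispredictions (here counted as 1, 3, 6) get larger budgets.
print(allocate_particles([1, 3, 6], total_budget=1000))  # [100, 300, 600]
```

Reallocating after each game round keeps the total computational cost constant while concentrating inference effort on the agents whose nested beliefs are hardest to track.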
Benchmarking of a Novel POS Tagging Based Semantic Similarity Approach for Job Description Similarity Computation
Joydeep Mondal, Sarthak Ahuja, Kushal Mukherjee, Sudhanshu S. Singh, Gyana Parija
European Semantic Web Conference (ESWC) 2018
Most solutions providing hiring analytics involve mapping provided job descriptions to a standard job framework, thereby requiring computation of a document similarity score between two job descriptions. Finding semantic similarity between a pair of documents is a problem that has yet to be solved satisfactorily across all domains/contexts. Most document similarity methods require a large corpus of data for training the underlying models. In this paper we compare three methods of document similarity for job descriptions: topic modeling (LDA), doc2vec, and a novel part-of-speech-tagging-based document similarity calculation method (POSDC). LDA and doc2vec require a large corpus of data to train, while POSDC exploits a domain-specific property of descriptive documents (such as job descriptions) that enables us to compare two documents in isolation. The POSDC method is based on an "action-object-attribute" representation of documents that allows meaningful comparisons. We use Stanford CoreNLP and NLTK WordNet to do a multi-level semantic match between the actions and their corresponding objects. We use scikit-learn for topic modeling and gensim for doc2vec. We compare the results from these three methods based on the IBM Kenexa Talent Frameworks job taxonomy.
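A minimal, hypothetical sketch of the POSDC idea: each document is reduced to (action, object, attribute) triplets, and two documents are compared by matching actions first, then the objects and attributes of matched actions. A toy synonym table stands in for the WordNet-based semantic match used in the paper; all names and data here are illustrative.

```python
# Toy stand-in for a WordNet-style semantic match.
SYNONYMS = {
    ("manage", "supervise"), ("develop", "build"), ("analyze", "examine"),
}

def word_match(a, b):
    """1.0 for exact or toy-synonym matches, else 0.0."""
    if a == b or (a, b) in SYNONYMS or (b, a) in SYNONYMS:
        return 1.0
    return 0.0

def triplet_similarity(doc_a, doc_b):
    """Average best-match score of doc_a's triplets against doc_b's.

    Each doc is a list of (action, object, attribute) triplets.
    """
    if not doc_a or not doc_b:
        return 0.0
    total = 0.0
    for act_a, obj_a, attr_a in doc_a:
        best = 0.0
        for act_b, obj_b, attr_b in doc_b:
            # Hierarchical match: object/attribute only count when
            # the actions themselves match.
            s_act = word_match(act_a, act_b)
            if s_act == 0.0:
                continue
            score = (s_act + word_match(obj_a, obj_b)
                     + word_match(attr_a, attr_b)) / 3.0
            best = max(best, score)
        total += best
    return total / len(doc_a)

jd1 = [("manage", "team", "engineering"), ("develop", "software", "web")]
jd2 = [("supervise", "team", "engineering"), ("build", "software", "mobile")]
print(triplet_similarity(jd1, jd2))  # high overlap despite different verbs
```

Because each document is compared through its own triplets, no training corpus is needed, which is exactly the property that distinguishes this style of method from LDA and doc2vec in the comparison above.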
#VisualHashtags: Visual Summarization of Social Media Events Using Mid-Level Visual Elements
Sonal Goel, Sarthak Ahuja, A V Subramanyam, Ponnurangam Kumaraguru
ACM Multimedia (ACMMM) 2017
In this paper we propose a methodology for visual event summarization by extracting mid-level visual elements from images associated with social media events on Twitter (#VisualHashtags). The key research question is: which elements can visually capture the essence of a viral event, and hence explain its virality and summarize it? Compared to existing approaches to visual event summarization on social media data, we aim to discover #VisualHashtags, i.e., meaningful patches that can become the visual analog of the regular text hashtags on Twitter. Our algorithm incorporates a multi-stage filtering process and social-popularity-based ranking to discover mid-level visual elements, which overcomes the challenges faced by direct application of existing methods.
Similarity Computation Exploiting The Semantic And Syntactic Inherent Structure Among Job Titles
Sarthak Ahuja, Joydeep Mondal, Sudhanshu S. Singh, David G. George
International Conference on Service-Oriented Computing (ICSOC) 2017
Solutions providing hiring analytics involve mapping company-provided job descriptions to a standard job framework, thereby requiring computation of a similarity score between two jobs. Most systems doing so apply document similarity computation methods to all pairs of provided job descriptions. This approach can be computationally expensive and is adversely impacted by the quality of the job descriptions, which often include information not relevant to the job or candidate qualifications. We propose a method to narrow down the pairs of job descriptions to be compared by comparing job titles first. The observation that each job title can be decomposed into three components (domain, function, and attribute) forms the basis of our method. We first train machine learning models to identify these three components in any given job title. Next, we do a semantic match between the three identified components and use those match scores to create a composite similarity score between any pair of job titles. The elegance of this solution lies in the fact that job titles are the most concise definition of a job, and the resulting matches can easily be verified by human experts. Our results show that the approach provides extremely reliable matches.
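A hypothetical sketch of the title-first filtering step: each job title is decomposed into (domain, function, attribute) components, here via a stand-in lookup table instead of the trained models described above, and a weighted component match decides whether the far costlier full-description comparison is worth running at all. The weights, threshold, and exact-match scoring are illustrative assumptions.

```python
# Stand-in for the trained component classifier; None marks a missing attribute.
TITLE_COMPONENTS = {
    "senior software engineer": ("software", "engineer", "senior"),
    "software developer": ("software", "developer", None),
    "senior sales manager": ("sales", "manager", "senior"),
}

def title_similarity(title_a, title_b, weights=(0.5, 0.4, 0.1)):
    """Weighted exact match over (domain, function, attribute) components."""
    comps_a = TITLE_COMPONENTS[title_a]
    comps_b = TITLE_COMPONENTS[title_b]
    score = 0.0
    for w, a, b in zip(weights, comps_a, comps_b):
        if a is not None and a == b:
            score += w
    return score

def candidate_pairs(titles, threshold=0.5):
    """Keep only title pairs similar enough to justify full JD comparison."""
    pairs = []
    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            if title_similarity(titles[i], titles[j]) >= threshold:
                pairs.append((titles[i], titles[j]))
    return pairs

titles = list(TITLE_COMPONENTS)
print(candidate_pairs(titles))
```

In the real system the exact-match test would be a semantic match (so "engineer" and "developer" would also partially match); the point of the sketch is the pruning structure, which cuts the quadratic document-comparison cost down to only the title pairs that survive the threshold.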
Multi-Level Clustering Technique Leveraging Expert Insight
State-of-the-art clustering algorithms operate well on numeric data, but for textual data they rely on conversion to a numeric representation. This conversion is done by adopting approaches like TF-IDF, Word2Vec, etc., which require large amounts of contextual data for learning. Such contextual data may not always be available for a given domain. We propose a novel algorithm that incorporates Subject Matter Experts' (SME) inputs in lieu of contextual data to effectively cluster a mix of textual and numeric data. We leverage simple semantic rules provided by SMEs to do a multi-level iterative clustering, executed on the Apache Spark platform for accelerated outcomes. The semantic rules are used to generate a large number of small clusters, which are qualitatively merged using the principles of graph colouring. We present results from a Recruitment Process Benchmarking case study on data from multiple jobs, where we applied the proposed technique to create suitable job categories for establishing benchmarks. This approach provides far more meaningful insights than the traditional approach, where benchmarks are calculated for all jobs put together.
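A hypothetical sketch of the merge step: SME semantic rules first yield many small clusters; clusters that a rule says must stay separate are joined by edges in a conflict graph, and greedy graph colouring then merges all mutually compatible clusters (same colour = merged). The greedy colouring order and the toy conflict data are illustrative assumptions.

```python
def greedy_colouring(n_clusters, conflicts):
    """Assign each cluster the lowest colour unused by its conflicting neighbours."""
    colours = {}
    neighbours = {i: set() for i in range(n_clusters)}
    for a, b in conflicts:
        neighbours[a].add(b)
        neighbours[b].add(a)
    for node in range(n_clusters):
        taken = {colours[n] for n in neighbours[node] if n in colours}
        colour = 0
        while colour in taken:
            colour += 1
        colours[node] = colour
    return colours

def merge_by_colour(colours):
    """Group cluster ids sharing a colour into merged clusters."""
    merged = {}
    for cluster, colour in colours.items():
        merged.setdefault(colour, []).append(cluster)
    return list(merged.values())

# Clusters 0 and 1 conflict (an SME rule separates them), as do 2 and 3.
colours = greedy_colouring(4, conflicts=[(0, 1), (2, 3)])
print(merge_by_colour(colours))  # [[0, 2], [1, 3]]
```

Colouring guarantees that no merged group contains two clusters an SME rule keeps apart, while still collapsing the large number of small rule-generated clusters into a few meaningful categories.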
These include filed patents and unpublished research work.
Virtual-Reality Based Interactive Audience Simulation
Methods, systems and computer program products for generating virtual reality (VR)-based interactive audience simulations are provided herein. A computer-implemented method includes determining one or more situational and location characteristics for a given performance by a user, generating a VR-based simulated audience for the given performance based at least in part on the determined situational and location characteristics, presenting the VR-based simulated audience to a user during the given performance utilizing a VR headset, utilizing one or more sensors to measure one or more aspects of the given performance before the VR-based simulated audience, and generating real-time feedback adjusting the VR-based simulated audience presented to the user utilizing the VR headset based at least in part on the measured aspects of the given performance.
Creating and using Triplet Representations to Assess Similarity between Job Description Documents
David G. George, Sudhanshu S. Singh, Joydeep Mondal, Sarthak Ahuja, John A. Medicke, Amanda Klabzuba
Patent Application US15/854837 2019
A method, system and computer program product for assessing similarity between two job description documents. Job description documents consist of sentences framed in a particular manner, where the sentences are represented as a set of actions, an object corresponding to each action, and a set of attributes corresponding to the object. The two job description documents are parsed to generate a first and a second set of an action-object-attribute triplet representation, where the first set of the action-object-attribute triplet representation is associated with the first job description document and the second set of the action-object-attribute triplet representation is associated with the second job description document. A similarity score between the first and second sets of action-object-attribute triplet representations is then calculated by hierarchically matching the first and second sets of action-object-attribute triplet representations across the job description documents. In this manner, similar job positions/job descriptions may be more accurately identified.
Candidate Selection using a Gaming Framework
Sarthak Ahuja, Ritwik Chaudhuri, Manish Kataria, Manu Kuchhal, Gyana R. Parija, Sudhanshu S. Singh
Patent Application US15/842066 2019
One embodiment provides a method, including: receiving a requisition for a job position, the requisition having a plurality of recruiters, each having influence in selecting a candidate; generating a profile for an ideal candidate comprising (i) a plurality of attributes and (ii) weights corresponding to each of the attributes; receiving, for a plurality of candidates, profiles for each of the candidates; comparing the profile of each of the plurality of candidates against the ideal candidate, using a distance method computation to determine the distance between the plurality of candidates and the ideal candidate based upon the weights; ranking the plurality of candidates and providing the ranking to each of the plurality of recruiters; receiving input from each of the plurality of recruiters that modifies the ranking, recalculating the weights of the attributes based upon the modified ranking, and modifying the ranking; and providing a final ranking of the plurality of candidates.
Cogniculture: Towards a Better Human-Machine Co-evolution
Research in Artificial Intelligence is breaking technology barriers every day. New algorithms and high-performance computing are making possible things we could only have imagined earlier. People in the AI community hold a diverse set of opinions regarding the pros and cons of AI mimicking human behavior. Instead of worrying about AI advancements, we propose the novel idea of cognitive agents, both human and machine, living together in a complex adaptive ecosystem, collaborating on human computation to produce essential social goods while promoting the sustenance, survival, and evolution of the agents' life cycles. We highlight several research challenges and technology barriers to achieving this goal. We propose a governance mechanism around this ecosystem to ensure ethical behavior by all cognitive agents. Along with a novel set of Cogniculture use-cases, we discuss the road map ahead for this journey.
Smartphone Audio Based Distress Detection
Anil Sharma, Sarthak Ahuja, Mayank Gautam, and Sanjit Kaul
Independent Project 2016
We investigate an unobtrusive and 24x7 human distress detection and signaling system, Always Alert, that requires the smartphone, and not its human owner, to be on alert. The system leverages the microphone sensor, at least one of which is available on every phone, and assumes the availability of a data network. We propose a novel two-stage supervised learning framework, using support vector machines (SVMs), that executes on a user's smartphone and monitors natural vocal expressions of fear (screaming and crying in our study) when a human being is in harm's way. The challenge is to achieve a high distress detection rate while ensuring that the false alarm rate (FAR) is a manageable overhead as a typical smartphone user goes about living life as usual. We train the learning framework with carefully selected audio fingerprints of distress and of varied environmental contexts. Exploiting the time-contiguous nature of false alarms further allows us to reduce the FAR. We show the feasibility of using our framework anytime and anywhere by testing it over many hours of audio fingerprints recorded by volunteers on their smartphones as they went about their daily routines. We are able to achieve high distress detection rates at an average overhead equivalent to about one Facebook post every 3 to 4 hours.
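An illustrative sketch of the time-contiguity idea: genuine distress tends to persist across consecutive audio windows, while false alarms are isolated, so an alarm is raised only when at least k consecutive windows are flagged by the classifier (the two-stage SVM itself is abstracted away here, and the window labels below are made up).

```python
def suppress_isolated_alarms(window_flags, k=3):
    """Raise an alarm only during runs of >= k consecutive positive windows."""
    run = 0
    alarms = []
    for flag in window_flags:
        # Length of the current run of positive windows.
        run = run + 1 if flag else 0
        alarms.append(run >= k)
    return alarms

# A lone positive window (a door slam, say) is suppressed; a sustained
# scream spanning four windows triggers from its third window onward.
flags = [0, 1, 0, 0, 1, 1, 1, 1, 0]
print(suppress_isolated_alarms(flags))
```

The parameter k trades detection latency against false alarm rate: larger k suppresses more isolated misfires but delays the alarm for genuine, sustained distress.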