Arjun Majumdar

PhD Student, Georgia Tech


I am currently a PhD Student at Georgia Tech where I am advised by Dhruv Batra and work closely with Devi Parikh. My research is in the areas of computer vision and machine learning. Recently, I have been interested in problems in the Embodied AI space, where my ultimate goal is to develop agents that are able to use a vision system to navigate and accomplish goals in diverse real-world environments.

In the summer of 2021, I am interning at Amazon where I am working with Jesse Thomason and Gaurav Sukhatme. In the summer of 2020, I was an intern at FAIR working with Ross Girshick. Previously, I worked at MIT Lincoln Laboratory where my research focused on problems like visual question answering, semantic segmentation, and image-to-image translation.

Research Interests

  • Embodied AI
  • Computer Vision
  • Vision and Language

email / google scholar / github / cv


SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation

Abhinav Moudgil, Arjun Majumdar, Harsh Agrawal, Stefan Lee, Dhruv Batra, NeurIPS, 2021.

Abstract: Natural language instructions for visual navigation often use scene descriptions (e.g., "bedroom") and object references (e.g., "green chairs") to provide a breadcrumb trail to a goal location. This work presents a transformer-based vision-and-language navigation (VLN) agent that uses two different visual encoders -- a scene classification network and an object detector -- which produce features that match these two distinct types of visual cues. In our method, scene features contribute high-level contextual information that supports object-level processing. With this design, our model is able to use vision-and-language pretraining (i.e., learning the alignment between images and text from large-scale web data) to substantially improve performance on the Room-to-Room (R2R) and Room-Across-Room (RxR) benchmarks. Specifically, our approach leads to improvements of 1.8% absolute in SPL on R2R and 3.7% absolute in SR on RxR. Our analysis reveals even larger gains for navigation instructions that contain six or more object references, which further suggests that our approach is better able to use object features and align them to references in the instructions.

/ paper / cite

Sim-to-Real Transfer for Vision-and-Language Navigation

Peter Anderson, Ayush Shrivastava, Joanne Truong, Arjun Majumdar, Devi Parikh, Dhruv Batra, Stefan Lee, CoRL, 2020.

Abstract: We study the challenging problem of releasing a robot in a previously unseen environment, and having it follow unconstrained natural language navigation instructions. Recent work on the task of Vision-and-Language Navigation (VLN) has achieved significant progress in simulation. To assess the implications of this work for robotics, we transfer a VLN agent trained in simulation to a physical robot. To bridge the gap between the high-level discrete action space learned by the VLN agent, and the robot’s low-level continuous action space, we propose a subgoal model to identify nearby waypoints, and use domain randomization to mitigate visual domain differences. For accurate sim and real comparisons in parallel environments, we annotate a 325 m² office space with 1.3 km of navigation instructions, and create a digitized replica in simulation. We find that sim-to-real transfer to an environment not seen in training is successful if an occupancy map and navigation graph can be collected and annotated in advance (success rate of 46.8% vs. 55.9% in sim), but much more challenging in the hardest setting with no prior mapping at all (success rate of 22.5%).

/ paper / code / video / cite

Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, Dhruv Batra, ECCV, 2020. (spotlight)

Abstract: Following a navigation instruction such as 'Walk down the stairs and stop at the brown sofa' requires embodied AI agents to ground scene elements referenced via language (e.g. 'stairs') to visual content in the environment (pixels corresponding to 'stairs').

We ask the following question -- Can we leverage abundant 'disembodied' web-scraped vision-and-language corpora (e.g. Conceptual Captions) to learn visual groundings (what do 'stairs' look like?) that improve performance on a relatively data-starved embodied perception task (Vision-and-Language Navigation)? Specifically, we develop VLN-BERT, a visiolinguistic transformer-based model for scoring the compatibility between an instruction ('...stop at the brown sofa') and a sequence of panoramic RGB images captured by the agent. We demonstrate that pretraining VLN-BERT on image-text pairs from the web before fine-tuning on embodied path-instruction data significantly improves performance on VLN -- outperforming the prior state-of-the-art in the fully-observed setting by 4 absolute percentage points on success rate. Ablations of our pretraining curriculum show each stage to be impactful -- with their combination resulting in further positive synergistic effects.

/ paper / code / video / cite

Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments

Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, Stefan Lee, ECCV, 2020.

Abstract: We develop a language-guided navigation task set in a continuous 3D environment where agents must execute low-level actions to follow natural language navigation directions. By being situated in continuous environments, this setting lifts a number of assumptions implicit in prior work that represents environments as a sparse graph of panoramas with edges corresponding to navigability. Specifically, our setting drops the presumptions of known environment topologies, short-range oracle navigation, and perfect agent localization. To contextualize this new task, we develop models that mirror many of the advances made in prior settings as well as single-modality baselines. While some of these techniques transfer, we find significantly lower absolute performance in the continuous setting -- suggesting that performance in prior 'navigation-graph' settings may be inflated by the strong implicit assumptions.

/ paper / code / website / cite

Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning

David Mascharka*, Philip Tran, Ryan Soklaski, Arjun Majumdar*, CVPR, 2018. *Indicates equal contribution

Abstract: Visual question answering requires high-order reasoning about an image, which is a fundamental capability needed by machine systems to follow complex directives. Recently, modular networks have been shown to be an effective framework for performing visual reasoning tasks. While modular networks were initially designed with a degree of model transparency, their performance on complex visual reasoning benchmarks was lacking. Current state-of-the-art approaches do not provide an effective mechanism for understanding the reasoning process. In this paper, we close the performance gap between interpretable models and state-of-the-art visual reasoning methods. We propose a set of visual-reasoning primitives which, when composed, manifest as a model capable of performing complex reasoning tasks in an explicitly-interpretable manner. The fidelity and interpretability of the primitives’ outputs enable an unparalleled ability to diagnose the strengths and weaknesses of the resulting model. Critically, we show that these primitives are highly performant, achieving state-of-the-art accuracy of 99.1% on the CLEVR dataset. We also show that our model is able to effectively learn generalized representations when provided a small amount of data containing novel object attributes. Using the CoGenT generalization task, we show more than a 20 percentage point improvement over the current state of the art.

/ paper / poster / code / demo / cite

Detecting Intracranial Hemorrhage with Deep Learning

Arjun Majumdar, Laura Brattain, Brian Telfer, Chad Farris, Jonathan Scalera, EMBC, 2018.

Abstract: Initial results are reported on automated detection of intracranial hemorrhage from CT, which would be valuable in a computer-aided diagnosis system to help the radiologist detect subtle hemorrhages. Previous work has taken a classic approach involving multiple steps of alignment, image processing, image corrections, handcrafted feature extraction, and classification. Our current work instead uses a deep convolutional neural network to simultaneously learn features and classification, eliminating the multiple hand-tuned steps. Performance is improved by computing the mean output for rotations of the input image. Postprocessing is additionally applied to the CNN output to significantly improve specificity. The database consists of 134 CT cases (4,300 images), divided into 60, 5, and 69 cases for training, validation, and test. Each case typically includes multiple hemorrhages. Performance on the test set was 81% sensitivity per lesion (34/42 lesions) and 98% specificity per case (45/46 cases). The sensitivity is comparable to previous results (on different datasets), but with a significantly higher specificity. In addition, insights are shared to improve performance as the database is expanded.

/ paper / cite

Improving SAR Automatic Target Recognition using Simulated Images under Deep Residual Refinements

Miriam Cha, Arjun Majumdar, H.T. Kung, Jarred Barber, ICASSP, 2018.

Abstract: In recent years, convolutional neural networks (CNNs) have been successfully applied for automatic target recognition (ATR) in synthetic aperture radar (SAR) data. However, it is challenging to train a CNN with high classification accuracy when labeled data is limited. This is often the case with SAR ATR in practice, because collecting large amounts of labeled SAR data is both difficult and expensive. Using a simulator to generate SAR images offers a possible solution. Unfortunately, CNNs trained on simulated data may not be directly transferable to real data. In this paper, we introduce a method to refine simulated SAR data based on deep residual networks. We learn a refinement function from simulated to real SAR data through a residual learning framework, and use the function to refine simulated images. Using the MSTAR dataset, we demonstrate that a CNN-based SAR ATR system trained on simulated data under residual network refinements can yield much higher classification accuracy as compared to a system trained on simulated images, and so can training on real data augmented with these simulated data under refinements compared to training with real data alone.

/ paper / cite

© 2021 Arjun Majumdar