I previously worked at MIT Lincoln Laboratory where my research focused on a variety of problems including visual question answering, semantic segmentation, and image-to-image translation.
In vision and language navigation (VLN), an agent must follow a set of natural language instructions to reach a goal location. In pre-explored environments, an agent may consider multiple paths before ultimately selecting the correct one. Typically, this path selection problem is solved by accumulating the probabilities assigned to each action along the path and/or evaluating the path using a generative "speaker" model. However, these approaches have difficulty generalizing to the long tail of visual and semantic concepts that an agent needs to jointly understand to solve this task. In our work, we are investigating the alternative approach of directly training a discriminative model for path selection. Specifically, we aim to leverage cross-modal representations that are first pre-trained on a large-scale vision and language corpus and then fine-tuned for the path selection problem. Results from our approach are coming soon.
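As an illustrative sketch of the accumulation baseline described above (the function names and the toy per-step probabilities are hypothetical, not from our system), candidate paths can be ranked by summing the log-probabilities the agent's policy assigned to each action:

```python
import math

def path_score(action_log_probs):
    """Score a candidate path by summing per-step action log-probabilities
    (equivalent to multiplying the raw probabilities)."""
    return sum(action_log_probs)

def select_path(candidates):
    """Pick the candidate path with the highest accumulated score.
    `candidates` maps a path id to its per-step action probabilities."""
    scores = {
        path_id: path_score([math.log(p) for p in probs])
        for path_id, probs in candidates.items()
    }
    return max(scores, key=scores.get)

# Hypothetical action probabilities for three candidate paths.
candidates = {
    "path_a": [0.9, 0.8, 0.7],
    "path_b": [0.6, 0.9, 0.9],
    "path_c": [0.5, 0.5, 0.99],
}
print(select_path(candidates))  # path_a
```

Working in log space avoids numerical underflow on long paths; a discriminative path-selection model would replace this accumulated score with a learned compatibility between the instruction and the full path.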
David Mascharka*, Philip Tran, Ryan Soklaski, Arjun Majumdar*, CVPR, 2018.
*Indicates equal contribution
Visual question answering requires high-order reasoning about an image, which is a fundamental capability needed by machine systems to follow complex directives. Recently, modular networks have been shown to be an effective framework for performing visual reasoning tasks. While modular networks were initially designed with a degree of model transparency, their performance on complex visual reasoning benchmarks was lacking. Current state-of-the-art approaches do not provide an effective mechanism for understanding the reasoning process. In this paper, we close the performance gap between interpretable models and state-of-the-art visual reasoning methods. We propose a set of visual-reasoning primitives which, when composed, manifest as a model capable of performing complex reasoning tasks in an explicitly-interpretable manner. The fidelity and interpretability of the primitives’ outputs enable an unparalleled ability to diagnose the strengths and weaknesses of the resulting model. Critically, we show that these primitives are highly performant, achieving state-of-the-art accuracy of 99.1% on the CLEVR dataset. We also show that our model is able to effectively learn generalized representations when provided a small amount of data containing novel object attributes. Using the CoGenT generalization task, we show more than a 20 percentage point improvement over the current state of the art.
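To give a flavor of how composed primitives yield an interpretable reasoning trace (this toy 1-D scene and these module names are hypothetical stand-ins, not the paper's actual learned modules), a question like "how many red cubes?" can be answered by chaining attention-style primitives whose intermediate masks remain inspectable:

```python
def attend(features, target):
    """Produce a soft attention mask highlighting cells matching `target`."""
    return [1.0 if f == target else 0.0 for f in features]

def intersect(mask_a, mask_b):
    """AND-style primitive: keep attention only where both masks agree."""
    return [min(a, b) for a, b in zip(mask_a, mask_b)]

def count(mask):
    """Reduce an attention mask to a scalar answer."""
    return sum(mask)

# A toy scene: parallel lists of per-cell color and shape attributes.
scene_colors = ["red", "blue", "red", "green"]
scene_shapes = ["cube", "cube", "sphere", "cube"]

# "How many red cubes?" composed as count(intersect(attend(...), attend(...)))
answer = count(intersect(attend(scene_colors, "red"),
                         attend(scene_shapes, "cube")))
print(answer)  # 1.0
```

Because each primitive emits an explicit attention mask, every intermediate step of the reasoning chain can be visualized and diagnosed, which is the transparency property the abstract highlights.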
Arjun Majumdar, Laura Brattain, Brian Telfer, Chad Farris, Jonathan Scalera, EMBC, 2018.
Initial results are reported on automated detection of intracranial hemorrhage from CT, which would be valuable in a computer-aided diagnosis system to help the radiologist detect subtle hemorrhages. Previous work has taken a classic approach involving multiple steps of alignment, image processing, image corrections, handcrafted feature extraction, and classification. Our current work instead uses a deep convolutional neural network to simultaneously learn features and classification, eliminating the multiple hand-tuned steps. Performance is improved by computing the mean output for rotations of the input image. Postprocessing is additionally applied to the CNN output to significantly improve specificity. The database consists of 134 CT cases (4,300 images), divided into 60, 5, and 69 cases for training, validation, and test. Each case typically includes multiple hemorrhages. Performance on the test set was 81% sensitivity per lesion (34/42 lesions) and 98% specificity per case (45/46 cases). The sensitivity is comparable to previous results (on different datasets), but with a significantly higher specificity. In addition, insights are shared to improve performance as the database is expanded.
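The rotation-averaging step described above is a simple form of test-time augmentation. A minimal sketch (the toy one-pixel "model" and 2x2 image are hypothetical; the paper uses a deep CNN on CT slices) might look like:

```python
def rotate90(img):
    """Rotate a square image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def predict_with_rotation_averaging(model, image, rotate):
    """Average the model's output over 0/90/180/270-degree rotations
    of the input, smoothing out orientation sensitivity."""
    outputs = []
    rotated = image
    for _ in range(4):
        outputs.append(model(rotated))
        rotated = rotate(rotated)
    return sum(outputs) / len(outputs)

# Toy stand-in "model" that reads one corner pixel, so each rotation
# produces a different raw output and the average is easy to check.
toy_model = lambda img: img[0][0]
image = [[1, 2], [3, 4]]
print(predict_with_rotation_averaging(toy_model, image, rotate90))  # 2.5
```

In practice the averaged quantity would be the CNN's per-image hemorrhage probability, with the specificity-improving postprocessing applied afterward.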
Miriam Cha, Arjun Majumdar, H.T. Kung, Jarred Barber, ICASSP, 2018.
In recent years, convolutional neural networks (CNNs) have been successfully applied for automatic target recognition (ATR) in synthetic aperture radar (SAR) data. However, it is challenging to train a CNN with high classification accuracy when labeled data is limited. This is often the case with SAR ATR in practice, because collecting large amounts of labeled SAR data is both difficult and expensive. Using a simulator to generate SAR images offers a possible solution. Unfortunately, CNNs trained on simulated data may not be directly transferable to real data. In this paper, we introduce a method to refine simulated SAR data based on deep residual networks. We learn a refinement function from simulated to real SAR data through a residual learning framework, and use the function to refine simulated images. Using the MSTAR dataset, we demonstrate that a CNN-based SAR ATR system trained on residual-network-refined simulated data yields much higher classification accuracy than one trained on unrefined simulated images, and that augmenting real training data with these refined simulated images outperforms training on real data alone.
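The core idea of residual refinement is that the learned network predicts a correction that is added to the simulated input rather than regenerating the image from scratch. A minimal sketch (the fixed toy `residual_fn` and the 1-D "image" are hypothetical stand-ins for the trained deep residual network and a real SAR image):

```python
def refine(simulated, residual_fn):
    """Residual refinement: add a learned correction to each element of
    the simulated input instead of synthesizing it from scratch."""
    return [s + residual_fn(s) for s in simulated]

# Hypothetical "learned" residual: a fixed 10% correction standing in
# for a trained deep residual network.
residual_fn = lambda pixel: 0.1 * pixel
simulated = [0.2, 0.5, 1.0]
refined = refine(simulated, residual_fn)
print(refined)
```

Predicting only the residual keeps the refiner close to the identity mapping, so the refined images preserve the simulator's target geometry while absorbing the simulated-to-real appearance gap.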
© 2019 Arjun Majumdar