Learning to distinguish and perform human actions.

My main research area lies on the border between natural language understanding and visual event representation. My investigation starts from a simple question that is very difficult to answer: given a linguistic input such as "Can you give me some pepper?" or "Can you pour me some milk?", how does a person, using their perception of the surrounding environment, their knowledge of object properties, and the communicative common ground, execute the act as intended by the other party? Can an AI agent do the same? More specifically, I am interested in the mapping among linguistic representations of events (the differences between eventive expressions such as slide, pour, throw, or spatial prepositions such as on, in, toward, around), visual representations (feature-based, as recorded by cameras and sensors), and programmatic representations (a dynamic plan the agent can act on). For more details, see:

  • ISA 2016 paper: we presented ECAT, an event annotation toolkit that we used to annotate captured videos of human-object interaction with linguistic and programmatic descriptions.
  • ESANN 2017 paper and QR 2017 paper: we presented a common framework for learning to distinguish events from their visual representations using sequential modeling (see the sketch after this list).
  • AAAI-SS 2018 paper (current work): I aim to learn programmatic representations of events so that they can be performed by simulated agents.
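
To make the sequential-modeling setup concrete, here is a minimal sketch (not the exact architecture from the ESANN/QR papers) of an LSTM classifier in PyTorch that predicts an event verb from a sequence of per-frame visual feature vectors; the feature dimension, number of classes, and input tensors are placeholders.

```python
# Minimal sketch: classify event verbs such as "slide", "pour", "throw"
# from sequences of per-frame visual features with an LSTM.
import torch
import torch.nn as nn

class EventClassifier(nn.Module):
    def __init__(self, feature_dim=64, hidden_dim=128, num_events=3):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_events)

    def forward(self, frames):              # frames: (batch, time, feature_dim)
        _, (h_n, _) = self.lstm(frames)     # h_n: (1, batch, hidden_dim)
        return self.out(h_n[-1])            # logits over event classes

model = EventClassifier()
dummy_clips = torch.randn(8, 30, 64)        # 8 clips, 30 frames, 64-dim features
logits = model(dummy_clips)                 # shape: (8, 3)
```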


Computational linguistics

Word vector distributed representation: I became interested in the distributed representation of words after the arrival of the Word2Vec toolkit. In our lab, we ran many experiments, either modifying Word2vec to better represent and disambiguate event verbs or analyzing its behavior on semantic tasks.

In my paper submitted to EMNLP 2015, I proposed a modified model, called Skipgram Backward-Forward, that utilizes the thematic ordering of verb arguments to create better representations for verbs. The resulting verb vectors are used for verb disambiguation based on the Corpus Pattern Analysis theory and corpus.
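
For context, the sketch below trains standard skip-gram embeddings with gensim, the baseline that the Backward-Forward variant builds on; the thematic reordering of verb arguments is not shown, and the toy corpus is a placeholder.

```python
# Standard skip-gram training with gensim (sg=1 selects skip-gram).
from gensim.models import Word2Vec

corpus = [
    ["the", "waiter", "poured", "the", "milk"],
    ["she", "threw", "the", "ball", "across", "the", "yard"],
]
model = Word2Vec(corpus, vector_size=100, window=5, sg=1, min_count=1)
vector = model.wv["poured"]   # 100-dimensional verb vector
```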

In another line of research, we examined the performance of word2vec on different semantic analogy tasks along the dimension of syntagmatic versus paradigmatic word relations (a point of view taken from Ferdinand de Saussure and other semioticians of the early 20th century).
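
The sketch below shows how such probes can be run with gensim's pretrained vectors; the model name is one of gensim's downloadable datasets and is used here only for illustration.

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # pretrained vectors, illustration only

# Paradigmatic probe: words substitutable in the same slot (man : king :: woman : ?)
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Syntagmatic probe: words that tend to co-occur rather than substitute for each other
print(wv.similarity("coffee", "drink"))
```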


Sentiment analysis: We collect stock market reviews in Vietnamese from online sources and analyze their sentiment polarity. This is ongoing work that started when I was working with KapitalAMC as their technology consultant; we want to give our platform's customers a better picture of market trends. Currently we have several teams working on different functions: (1) crawling data from different sources (forums, social networks, news); (2) annotating it with positive or negative judgments at the phrase level; and (3) learning to predict sentiment toward individual stocks.
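
A minimal sketch of the phrase-level polarity step (3), assuming annotated phrases are available as (text, label) pairs; the Vietnamese phrases and labels below are toy examples, not our data.

```python
# TF-IDF features + logistic regression as a simple phrase-level polarity baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

phrases = [
    "cổ phiếu tăng mạnh",       # "the stock rises strongly"
    "lợi nhuận vượt kỳ vọng",   # "profits beat expectations"
    "thị trường giảm sâu",      # "the market falls sharply"
    "cổ phiếu bị bán tháo",     # "the stock is being sold off"
]
labels = ["positive", "positive", "negative", "negative"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(phrases, labels)
print(clf.predict(["thị trường tăng"]))   # unseen phrase: "the market rises"
```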


Temporal expression and event ordering: One of my early research experiences at Brandeis was on the ordering of events and time expressions. Using the Tarsqi toolkit as the processing pipeline, I implemented an SVM classifier with a tree kernel that achieved state-of-the-art results.
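
Tree kernels are not built into common ML libraries, but the interface is easy to sketch: scikit-learn's SVC accepts a precomputed Gram matrix, so a custom kernel only has to fill that matrix. The toy_kernel below is a stand-in for illustration, not the actual tree kernel over parse structures, and the data is random.

```python
import numpy as np
from sklearn.svm import SVC

def toy_kernel(a, b):
    # placeholder similarity between two feature vectors; a real tree kernel
    # would compare the syntactic trees of the event/time-expression pair instead
    return float(np.dot(a, b))

def gram(X, Y):
    return np.array([[toy_kernel(x, y) for y in Y] for x in X])

X_train = np.random.rand(20, 5)              # toy event-pair features
y_train = np.random.randint(0, 2, size=20)   # toy temporal-relation labels
X_test = np.random.rand(4, 5)

clf = SVC(kernel="precomputed")
clf.fit(gram(X_train, X_train), y_train)
pred = clf.predict(gram(X_test, X_train))    # test rows vs. training columns
```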


Other research interests

Gesture and sign language: My first research experience was a project on Vietnamese sign language initiated by our Ministry of Science and Technology in 2010. I collected video captures of signers and applied machine learning methods to produce linguistic outputs: I used OpenCV to extract features from the video (background removal, hand and face blob detection) and an HMM to learn the translation output. The project received second prize in a student research competition at my undergraduate university.
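
A rough sketch of that preprocessing stage with OpenCV (assuming OpenCV 4's findContours signature); the video path, area threshold, and downstream HMM step are placeholders.

```python
# Background subtraction and blob (contour) detection on a signer video.
import cv2

cap = cv2.VideoCapture("signer_clip.avi")          # hypothetical input video
subtractor = cv2.createBackgroundSubtractorMOG2()

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)                  # foreground (signer) mask
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    blobs = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 500]
    # blobs (hands, face) would then be tracked over time and fed to an HMM
    # as observation sequences
cap.release()
```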

In the context of Communicating with Computers (CwC), a DARPA program, our lab is currently investigating gesture semantics as interpreted programs and gesture grammars as state-transition machines. We model gesture as one modality of multimodal communication between human and computer. For more information, please refer to our lab papers Communicating and Acting: Understanding Gesture in Simulation Semantics (poster at IWCS 2017) or Object Embodiment in a Multimodal Simulation.
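
As a toy illustration of the state-transition-machine view (not the grammar used in the CwC work), gesture phases can be modeled as states driven by observed hand events; the state and symbol names below are hypothetical.

```python
# A tiny deterministic state machine over gesture phases.
TRANSITIONS = {
    ("idle", "hand_raised"): "preparation",
    ("preparation", "peak_motion"): "stroke",
    ("stroke", "hand_lowered"): "retraction",
    ("retraction", "at_rest"): "idle",
}

def run(observations, state="idle"):
    for obs in observations:
        state = TRANSITIONS.get((state, obs), state)   # stay put on unknown input
    return state

print(run(["hand_raised", "peak_motion"]))   # -> "stroke"
```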


Manual and automated descriptions for movie scenes: I'm also interested in applying sequential modeling methods to the problem of generating textual descriptions from movie snippets. So far I haven't been very successful: the textual descriptions in this dataset are highly stylistic, which prevents the algorithm from producing reasonable results.
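
One standard setup for this task is a sequence-to-sequence captioner: a recurrent encoder over per-frame features and a recurrent decoder over description tokens. The PyTorch sketch below is a generic version of that idea, not my exact model, with placeholder dimensions and data.

```python
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, feat_dim=512, hid=256, vocab=5000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hid, batch_first=True)
        self.embed = nn.Embedding(vocab, hid)
        self.decoder = nn.GRU(hid, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, frames, tokens):
        _, h = self.encoder(frames)              # summarize the clip into h
        dec_out, _ = self.decoder(self.embed(tokens), h)
        return self.out(dec_out)                 # per-step vocabulary logits

model = CaptionModel()
frames = torch.randn(2, 40, 512)                 # 2 clips, 40 frames of CNN features
tokens = torch.randint(0, 5000, (2, 12))         # teacher-forced description tokens
logits = model(frames, tokens)                   # (2, 12, 5000)
```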