==== Week 2 - Olga Russakovsky - Jan 15th (4011 DBH @ 11AM) ====
  
**Paper:** Designing and Overcoming Challenges in Large-Scale Object Detection

**Abstract:** There are four key components when tackling large-scale visual recognition: (1) the data needs to have sufficient variety to represent the target world, (2) the algorithms need to be powerful enough to learn from this data, (3) the data needs to be annotated with enough information for algorithms to learn from, and (4) the algorithms need to be fast enough to process this large data. I’ll talk about my PhD work on each of these four components in the context of large-scale object detection.\\
\\
The main part of the talk will focus on the difficulties of collecting and annotating diverse data when designing the object detection task of the ImageNet Large Scale Visual Recognition Challenge ([[http://image-net.org/challenges/LSVRC/|http://image-net.org/challenges/LSVRC/]]). ILSVRC is a benchmark in object classification and detection on hundreds of object categories and millions of images. The challenge has been run annually since 2010, attracting participation from more than fifty institutions. In 2014 the challenge had a record number of submissions (123 entries from 36 teams) and appeared in international media including the New York Times, MIT Technology Review and CBC/Radio-Canada. In this talk I’ll describe some of the challenges of scaling up the object detection task of ILSVRC by more than an order of magnitude compared to previous datasets (e.g., the PASCAL VOC).\\
\\
I will conclude by discussing my current work and future plans. As computers are becoming exceedingly good at recognizing many object categories, I argue that it’s time to step back and consider the long tail: what pieces are still missing and preventing us from recognizing //every// object and understanding //every// pixel in an image? I believe that we’ll need to closely examine the performance of our algorithms on individual classes (instead of focusing on average accuracy across hundreds of categories), revisit the idea of object description rather than categorization, and consider the advantages of close human-machine collaboration.

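The per-class point in the final paragraph is easy to make concrete. Below is a tiny illustration with made-up numbers (not ILSVRC results): a mean over hundreds of categories can look excellent while completely hiding a long-tail class on which the model always fails.

<code python>
# Hypothetical per-class accuracies for 100 imaginary categories:
# 99 common classes at 95%, plus one rare class the model never gets right.
import numpy as np

per_class = np.array([0.95] * 99 + [0.0])

print(f"mean accuracy: {per_class.mean():.3f}")  # ~0.94 -- looks great
print(f"worst class:   {per_class.min():.3f}")   # 0.0 -- invisible in the mean
</code>
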
**Bio:** Olga Russakovsky ([[http://ai.stanford.edu/~olga|http://ai.stanford.edu/~olga]]) is a PhD student at Stanford University advised by Professor Fei-Fei Li. Her main research interests are in large-scale object detection and recognition. For the past two years she has been the lead organizer of the international ImageNet Large Scale Visual Recognition Challenge which has been featured in the New York Times, MIT Technology Review, and other international media venues. She has organized several workshops at top-tier computer vision conferences: the ImageNet challenge workshop at ICCV’13 and ECCV’14, the upcoming workshop on Large-Scale Visual Recognition and Retrieval at CVPR’15, and the new Women in Computer Vision workshop at CVPR’15. During her PhD she collaborated closely with NEC Laboratories America and with Yahoo! Research Labs. She was awarded the NSF Graduate Fellowship and the CRA undergraduate research award.
  
==== Week 3 - Bailey - Jan 23rd ====
  
**Abstract**:
==== Week 4 - Sam - Jan 29th ====
  
**Topic**:
  
**Abstract**:
==== Week 5 - Shu - Feb 5th ====
  
**Paper**: Beyond R-CNN detection: Learning to Merge Contextual Attribute
  
**Abstract**: We will briefly review R-CNN [1], which in effect performs classification over thousands of objectness regions extracted from the image. We will see what it misses: the interaction between objects and context within the image. When people make use of contextual information in addition to the CNN, performance improves [2]. This is also supported by a recent and interesting study [3], which compares action classification performance between state-of-the-art computer vision methods and a linear SVM over fMRI data. The conclusions of the paper are very interesting, but we emphasize the most "trivial" yet convincing one: the human brain exploits semantic inference for action classification, which is absent in computer vision methods for this task. So exploiting contextual information is a reasonable step toward improving detection. But how can we represent, extract, and utilize contextual information? To answer these questions, I will present two other papers that are seemingly unrelated to them. The first is [4], which presents how to represent, learn, and use texture attributes to improve texture and material classification; the second is [5], which uses patch-match techniques for chair detection in a finer way. Based on these two papers, we will try to answer the questions: how can we represent, learn, and use contextual information to boost detection?
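
For readers unfamiliar with the pipeline the abstract refers to, here is a minimal sketch of the R-CNN-style "score each region independently" recipe. Every function below (propose_regions, extract_feature, classify) is a hypothetical stand-in for the real components (selective search, CNN features, per-class SVMs); the sketch mainly shows where context fails to enter the computation.

<code python>
# A toy R-CNN-style pipeline: propose class-agnostic regions, extract a
# feature from each, classify each region on its own. Stand-ins only --
# not the actual R-CNN code.
import numpy as np

def propose_regions(image, step=32, size=64):
    """Stand-in for selective search: dense square windows."""
    h, w = image.shape[:2]
    return [(x, y, size, size)
            for y in range(0, h - size + 1, step)
            for x in range(0, w - size + 1, step)]

def extract_feature(image, box):
    """Stand-in for CNN features: a tiny intensity histogram."""
    x, y, bw, bh = box
    crop = image[y:y + bh, x:x + bw]
    hist, _ = np.histogram(crop, bins=16, range=(0, 255))
    return hist / max(hist.sum(), 1)

def classify(feature, weights, bias):
    """Stand-in for the per-class SVMs: one linear score per class."""
    return weights @ feature + bias

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(256, 256)).astype(np.float32)
W, b = rng.normal(size=(3, 16)), np.zeros(3)  # 3 dummy classes

# Note what this pipeline ignores: each region is scored in isolation,
# with no term modeling the other objects or the surrounding context.
detections = [(box, int(classify(extract_feature(image, box), W, b).argmax()))
              for box in propose_regions(image)]
print(len(detections), "regions scored independently of each other")
</code>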
==== Week 6 - Minhaeng - Feb 12th ====

**Paper**: Knowing a good HOG filter when you see it: Efficient selection of filters for detection

**Abstract**: [[http://ttic.uchicago.edu/~smaji/papers/goodParts-eccv14.pdf|http://ttic.uchicago.edu/~smaji/papers/goodParts-eccv14.pdf]]

==== Week 7 - Phuc - Feb 19th @ 10AM ====
  
**Paper**:
  
**Abstract**:
==== Week 7 - Yi - Feb 19th @ 5PM in DBH 4013 ====
  
**Paper**: Deep learning!
  
**Abstract**:
==== Week 8 - Peiyun - Feb 26th ====
  
-**Paper**:+**Paper** Long-term Recurrent Convolutional Networks for Visual Recognition and Description
  
**Abstract**:

Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or “temporally deep”, are effective for tasks involving sequences, visual and otherwise. We develop a novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and demonstrate the value of these models on benchmark video recognition tasks, image description and retrieval problems, and video narration challenges. In contrast to current models which assume a fixed spatio-temporal receptive field or simple temporal averaging for sequential processing, recurrent convolutional models are “doubly deep” in that they can be compositional in spatial and temporal “layers”. Such models may have advantages when target concepts are complex and/or training data are limited. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Long-term RNN models are appealing in that they can directly map variable-length inputs (e.g., video frames) to variable-length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent long-term models are directly connected to modern visual convnet models and can be jointly trained to simultaneously learn temporal dynamics and convolutional perceptual representations. Our results show such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.

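As a rough illustration of the “doubly deep” architecture the abstract describes, here is a minimal sketch assuming PyTorch: a small per-frame convnet feeding an LSTM, with a linear head emitting one output per time step. The class name and layer sizes are illustrative, not the paper’s actual configuration.

<code python>
# Minimal LRCN-style sketch: a CNN encodes each frame (spatial "layer"),
# an LSTM models dynamics over the frame features (temporal "layer").
import torch
import torch.nn as nn

class LRCNSketch(nn.Module):
    def __init__(self, num_classes=10, feat_dim=64, hidden_dim=128):
        super().__init__()
        # Per-frame CNN encoder (tiny, for illustration only).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, feat_dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # LSTM over the sequence of per-frame features.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):                    # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1))    # (B*T, feat_dim)
        seq, _ = self.lstm(feats.view(b, t, -1))  # (B, T, hidden_dim)
        return self.head(seq)                     # one prediction per frame

# Variable-length input, variable-length output: 7 frames in, 7 scores out.
video = torch.randn(2, 7, 3, 64, 64)
print(LRCNSketch()(video).shape)                  # torch.Size([2, 7, 10])
</code>

Because the convnet and the LSTM sit in one module, backpropagation trains the spatial and temporal parts jointly, which is the joint training the abstract emphasizes.
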
==== Week 9 - Raul - Mar 5th ====
  
**Abstract**:
==== Week 10 - Greg - Mar 12th ====
  
**Paper**: