In this project we aimed to build a construction/blockage classifier with minimal labelling effort. Rather than manually sifting through thousands of hours of video footage to source relevant training samples, we ran fast nearest-neighbour search (FAISS) over image features extracted from our video corpus, querying it with known construction samples sourced from Mapillary. This gave us a quasi-clean training set that could be cleaned up with minimal effort. After cleaning the training set, we trained the final linear layer of a ResNet50 (pre-trained on ImageNet) to build the blockage/construction classifier.
In our first iteration we:
- Re-sample the videos and extract key frames
- Extract MAC (Maximum Activations of Convolutions) features for all key frames with a pre-trained ResNet50 model
- Use the MAC features to train/build an index with FAISS
- Query the index for semantically similar scenarios sorted by approximate L2 distance
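The steps above can be sketched as follows. This is a minimal NumPy illustration, not our production code: MAC features are obtained by max-pooling a convolutional feature map over its spatial dimensions, and the FAISS index is replaced by a brute-force L2 search (at corpus scale, something like `faiss.IndexFlatL2` plays this role). The feature maps are random stand-ins for real ResNet50 activations.

```python
import numpy as np

def mac_features(feature_map):
    """MAC descriptor: max-pool a C x H x W conv feature map over H and W,
    then L2-normalise the resulting C-dimensional vector."""
    v = feature_map.max(axis=(1, 2))
    return v / np.linalg.norm(v)

def l2_search(index_vecs, query_vec, k=5):
    """Brute-force L2 nearest-neighbour search (stand-in for FAISS)."""
    dists = np.linalg.norm(index_vecs - query_vec, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]

rng = np.random.default_rng(0)
# Pretend feature maps from a pre-trained ResNet50 (2048 channels, 7x7 spatial).
corpus_maps = [rng.random((2048, 7, 7)) for _ in range(100)]
index = np.stack([mac_features(m) for m in corpus_maps])

query = mac_features(rng.random((2048, 7, 7)))
ids, dists = l2_search(index, query, k=5)
```

`ids` are the key frames most similar to the query, sorted by L2 distance, which is exactly what the FAISS index returns at scale.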
In our second iteration we work with videos directly; we:
- Re-sample videos and extract short sequences (~32 frames)
- Compute representations using a pre-trained video model
- Index the representations for approximate nearest neighbour search
- Query the index for semantically similar video sequences
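A minimal sketch of the re-sampling step above: given the frame indices of a video, cut them into fixed-length clips of roughly 32 frames. The window length and stride here are illustrative choices, not the exact parameters we used.

```python
def clip_windows(frame_ids, length=32, stride=16):
    """Split a list of frame indices into fixed-length, possibly
    overlapping clips; a trailing remainder shorter than `length`
    is dropped."""
    return [frame_ids[i:i + length]
            for i in range(0, len(frame_ids) - length + 1, stride)]

frames = list(range(100))          # frame indices of one re-sampled video
clips = clip_windows(frames)
print(len(clips), len(clips[0]))   # → 5 32
```

Each clip is then fed to the pre-trained video model to compute one representation per sequence.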
See our companion project for video summarisation. Following the work in arxiv.org/abs/1502.04681, we train a sequence-model-based autoencoder to learn video sequence vectors without supervision, which we then use for indexing and search.
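In the spirit of that paper, such a sequence autoencoder can be sketched with an LSTM in PyTorch: the encoder's final hidden state is the fixed-length sequence vector used for indexing, and the model is trained to reconstruct the input frame features. All dimensions and the single-layer architecture are illustrative assumptions, not our exact setup.

```python
import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    """LSTM autoencoder: encode a frame-feature sequence into one vector,
    then decode it back; the code vector is what gets indexed."""
    def __init__(self, feat_dim=2048, hidden_dim=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x):                      # x: (batch, seq, feat_dim)
        _, (h, _) = self.encoder(x)            # h: (1, batch, hidden_dim)
        code = h[-1]                           # sequence vector for search
        # Simple decoder variant: feed the code at every time step.
        dec_in = code.unsqueeze(1).expand(-1, x.size(1), -1)
        dec_out, _ = self.decoder(dec_in)
        return self.out(dec_out), code

model = SeqAutoencoder(feat_dim=64, hidden_dim=16)
x = torch.randn(4, 32, 64)                     # 4 clips of ~32 frames
recon, code = model(x)
loss = nn.functional.mse_loss(recon, x)        # reconstruction objective
```

Once trained, only the encoder is kept: each clip's `code` vector goes into the FAISS index.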
Since we did not have examples of blockages or construction sites for Berlin and elsewhere, we sourced a few construction samples from Mapillary and used them to query our index. We then applied query expansion to further improve retrieval. Some of the retrieved results from our corpus are shown below.
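One simple form of query expansion, sketched below in NumPy under the assumption of average query expansion (average the query vector with its top-k retrieved neighbours, re-normalise, and query again); brute-force L2 search stands in for the FAISS index:

```python
import numpy as np

def l2_topk(index_vecs, q, k):
    """Indices of the k nearest vectors to q under L2 distance."""
    return np.argsort(np.linalg.norm(index_vecs - q, axis=1))[:k]

def expanded_query(index_vecs, q, k=5):
    """Average query expansion: mean of the query and its top-k
    neighbours, re-normalised, for a second retrieval round."""
    neighbours = index_vecs[l2_topk(index_vecs, q, k)]
    q_new = np.vstack([q[None, :], neighbours]).mean(axis=0)
    return q_new / np.linalg.norm(q_new)

rng = np.random.default_rng(1)
index = rng.random((200, 128))
index /= np.linalg.norm(index, axis=1, keepdims=True)

q = index[0] + 0.01 * rng.random(128)          # a query close to item 0
q /= np.linalg.norm(q)

first = l2_topk(index, q, 10)                  # initial retrieval
second = l2_topk(index, expanded_query(index, q), 10)  # expanded round
```

The expanded query pulls in vectors that are close to the whole neighbourhood of the original query, which tends to surface more true positives.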
Retrieved results
Blockage detection results after training the last linear layer of the ResNet50 model.
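Training only the last linear layer amounts to fitting a linear classifier on frozen ResNet50 features (a linear probe). As a sketch, assuming the 2048-d pooled features have already been extracted for each cleaned training image, scikit-learn's logistic regression can play the role of that layer; the synthetic features below are stand-ins for real extractions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for frozen ResNet50 pooled features (2048-d per image):
# labels 1 = construction/blockage, 0 = background.
n = 200
feats = rng.normal(size=(n, 2048))
labels = (rng.random(n) < 0.5).astype(int)
feats[labels == 1, :10] += 2.0        # make class 1 roughly separable

# The "last linear layer": a logistic regression on frozen features.
clf = LogisticRegression(max_iter=1000).fit(feats, labels)
train_acc = clf.score(feats, labels)
```

In the real pipeline the fitted weights correspond to the final fully connected layer placed on top of the frozen ResNet50 backbone.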