Revolutionising Surveillance: From Traditional to Language-Enabled Action Recognition Models

Posted on 22 July 2024




In the world of surveillance and security, accurately recognising actions from video footage is crucial. Traditional computer vision models have served this purpose well, but advancements in technology are paving the way for even more sophisticated approaches. One such advancement is the emergence of language-enabled computer vision models. This post introduces the concept, compares it with traditional models, and highlights how it can enhance action recognition (AR) in surveillance and security.

Traditional Computer Vision Models: The Basics

Traditional computer vision models are designed to identify and categorise objects or actions in images and videos. These models operate on a fixed set of output classes. For instance, a model trained to recognise activities might only distinguish between a set number of actions, such as walking, running, or jumping. This is similar to having a predefined dictionary of actions the model can recognise. While effective, this approach has limitations:

  1. Fixed Output Classes: The model can only recognise actions it was explicitly trained on.
  2. Limited Flexibility: Introducing new actions or nuances to existing actions requires retraining the model with additional labelled data.
  3. Scalability Issues: Gathering and labelling large datasets for every possible action is time-consuming and expensive.

Despite these limitations, traditional models have been the backbone of many surveillance systems, offering reliable performance for specific, well-defined tasks.

An interpretation of "traditional computer vision models" by Dall-E

Language-Enabled Computer Vision Models

Language-enabled computer vision models represent a significant leap forward. These models are trained on massive datasets that pair images or videos with corresponding textual descriptions, often scraped directly from the internet; some of these datasets contain upwards of 1 billion image-text pairs. One such pair could be, for example, a picture of a person jogging with the caption “Man wearing trainers jogging in Central Park, New York”. By training on these pairs, the model learns to associate each image with its natural language counterpart and vice versa. In other words, the model can reason both from vision to text (i.e. which caption most closely matches this image or video?) and from text to vision (i.e. which image or video most closely matches this caption?). It can do so freely, meaning that any caption can be supplied to compare against. It follows that language-enabled vision models have no fixed set of output classes.
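To make this training objective concrete, below is a minimal sketch of the CLIP-style contrastive loss that underpins many of these models. It is a simplified illustration, not any particular model's exact implementation: the embedding dimension, batch size, and temperature are placeholder values, and the random tensors stand in for the outputs of real image and text encoders.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric contrastive loss over a batch of image-text pairs."""
    # Normalise embeddings so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Similarity between every image and every caption in the batch.
    logits = image_emb @ text_emb.t() / temperature
    # The matching caption for image i sits at index i (the diagonal).
    targets = torch.arange(logits.size(0), device=logits.device)
    # Pull matching pairs together, push mismatched pairs apart, in both directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Stand-ins for encoder outputs: a batch of 8 image and 8 caption embeddings.
loss = clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```

Trained at scale with an objective like this, images land close to their own captions in a shared embedding space, which is exactly what enables matching in both directions.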

Key Advantages:

  1. Zero-Shot Predictions: The model is capable of predicting an output class it has never seen before. This is possible because these models are not bound to a fixed set of output classes but can flexibly match images with natural language and vice versa (see the sketch after this list).
  2. Enhanced Generalisation: Training on diverse, very large datasets allows these models to generalise better across different scenarios, making them more adaptable to varied surveillance environments.
  3. Flexible and Scalable: Unlike traditional models, language-enabled models do not require extensive retraining to recognise new actions. This flexibility makes them highly scalable and easier to maintain: labels in the form of natural language can be supplied freely while the model is deployed, accommodating new prediction tasks.
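To illustrate zero-shot prediction, here is a minimal sketch that scores a single image against a set of free-form captions using the publicly available CLIP model via the Hugging Face transformers library. The model name is real, but the captions and the image path are illustrative placeholders.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# A publicly available language-enabled vision model.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate actions expressed as free-form captions -- no fixed class list.
captions = ["a person walking", "a person running", "a person climbing a fence"]

image = Image.open("frame.jpg")  # e.g. a single frame from a camera feed
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Similarity of the image to each caption, turned into probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs[0].tolist())))
```

None of these captions need to have appeared as training classes; changing the task is as simple as editing the list.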

An interpretation of "language-enabled computer vision models" by Dall-E

Applying Language-Enabled Models in Real-World Scenarios

Given the advantages and general overview of how language-enabled computer vision models work, we conclude with an example in which a language-enabled AR model is applied to a real-world surveillance scenario.

Consider a scenario in which a traditional action recognition model is applied to a set of CCTV cameras in a business park, where the operator is especially interested in preventing trespassing. To do so, the operator pays attention to people forcing locks, climbing gates, and walking in specific zones. With a traditional computer vision model, we could distinguish three output classes: walking, picking a lock, and climbing. A model can then be taught to predict these three fixed output classes by training it on a dataset containing these actions. After training, the model can be applied to this surveillance task, but it can only recognise the three actions it was trained on and has no notion of context when recognising them.
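A minimal sketch of what this fixed-class setup looks like in PyTorch is shown below. The feature dimension is a placeholder and the video backbone that produces the features is omitted; the point is that the three classes are baked into the final layer.

```python
import torch
import torch.nn as nn

class FixedActionClassifier(nn.Module):
    """A conventional action classifier with its classes baked into the head."""
    def __init__(self, feature_dim=512, num_classes=3):
        super().__init__()
        # Output size is fixed: recognising a fourth action means retraining.
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, clip_features):
        return self.head(clip_features)  # logits for walking / lock picking / climbing

model = FixedActionClassifier()
features = torch.randn(1, 512)               # stand-in for features from a video backbone
prediction = model(features).argmax(dim=-1)  # always one of the three trained classes
```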

Now consider that same business park surveillance scenario, but with a language-enabled AR model. As we have previously noted, a language-enabled AR model is already pretrained on a very large set of video-text pairs. This means that it already has a very broad notion of how certain actions relate to certain natural language captions and vice versa. To improve its performance, the model may be finetuned on data of very specific actions, such as picking a lock. In doing so, it picks up niche actions that may have been scarce or absent in its pretraining data. Finetuning is usually beneficial to a language-enabled model’s performance but is not required. Then, the operator can simply compose a set of captions (also called prompts) that best describe the actions the model needs to distinguish between. In our case, these prompts could be: “a person climbing a gate”, “a person picking a lock”, “a person not wearing a safety vest walking on the perimeter after dark”. The model will then compare the video streamed from the CCTV cameras against the supplied set of prompts and pick the most likely prompt, if any.
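As a sketch of how this could look in code, the snippet below scores a short clip against the operator's prompts by averaging per-frame CLIP embeddings and applying a similarity threshold. This is a simplification: production systems typically use dedicated video-text models rather than frame averaging, and the threshold value here is purely illustrative.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = [
    "a person climbing a gate",
    "a person picking a lock",
    "a person not wearing a safety vest walking on the perimeter after dark",
]

def classify_frames(frames, prompts, threshold=0.25):
    """frames: a list of PIL images sampled from the CCTV stream."""
    inputs = processor(text=prompts, images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    # Average the frame embeddings into a single clip-level embedding.
    video_emb = F.normalize(image_emb.mean(dim=0, keepdim=True), dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    similarities = (video_emb @ text_emb.t()).squeeze(0)
    best = similarities.argmax().item()
    # Only report a prompt when its similarity clears the tuned threshold.
    return prompts[best] if similarities[best].item() > threshold else None
```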

It should be noted that these captions can be swapped at any time to fit specific situations or timeframes. That is, if a different set of actions has to be recognised on a given day, the operator can simply design a set of prompts that fits that situation and apply it to the model instantly, as in the sketch below.
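Using the hypothetical classify_frames helper from the previous sketch, swapping tasks amounts to nothing more than passing a different prompt list; the prompt texts below are again illustrative.

```python
# Night shift: a different set of behaviours matters after dark.
night_prompts = [
    "a person climbing a gate in the dark",
    "a vehicle parked near the entrance with its lights off",
]
# 'frames' is the same list of sampled frames as in the previous sketch.
alert = classify_frames(frames, night_prompts)  # no retraining, no redeployment
```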

Given this example, it is easy to see that language-enabled models are far superior to traditional models in terms of detailed predictions, flexibility, and generalisation to other domains. Given a different set of prompts and possibly some complementary finetuning, the same model that is applied to this business park surveillance scenario could be applied to an entirely different domain, like detecting violence in city streets or detecting suspicious behaviour in prisons. At the same time, the model is better able to distinguish contextual details, like what a person is wearing or the time of day, leading to more informed predictions than simply predicting the base action being performed.

Downsides

Despite the impressive advantages demonstrated by language-enabled AR models, there are a few downsides to keep in mind. First and foremost, these model architectures are typically slower than traditional AR models, because most successful language-enabled AR models use larger, more complex backbones than their older counterparts. In practice, this means they are more costly to (pre)train and require more GPU power during operation; a server handling multiple video streams at the same time needs to be adequately equipped to do so. Finally, these models are relatively new and at the same time very flexible, a combination that can sometimes cause unpredictable behaviour in operation.

Conclusion

Flagship models like ChatGPT and Dall-E are revolutionising the way people interact with computers, but the applications of machine learning models that can reason with natural language don’t stop there. In this article we have shown that language-enabled computer vision models are revolutionising the field of AI-driven CCTV surveillance. These new models offer unparalleled flexibility, scalability, and contextual understanding, making them invaluable for modern surveillance systems. All in all, we distinguish two key advantages of language-enabled computer vision models over traditional computer vision models:

  • Dynamic Recognition: With language-video models, surveillance systems can dynamically recognise and categorise new actions as they occur, without needing predefined categories. This is particularly useful in identifying (new) unusual or suspicious activities that were not part of the training data. For that same reason, they can also perform more granular recognition tasks when more detail is provided in the prompt.
  • Contextual Understanding: By understanding the context provided by textual descriptions, these models can offer more accurate and nuanced interpretations of actions. For example, distinguishing between a person running for exercise and a person running away in a panic is a contextual nuance that a language-enabled model can pick up but that is typically harder for computer vision models that are not language-enabled.