Vedant S. Joshi

Hi, I am a second-year Master's student in the Computer Science and Engineering department at UC San Diego. In the summer of 2024, I interned with the Video Engineering Team at Apple under Dr. Javier Movellan. My work focused on enhancing data mixtures for pre-training large-scale multi-modal models that can locate/ground visual concepts described in expressive natural language. In parallel, I worked on establishing metrics that visualise the level of vision-to-language alignment in such models.

Previously, I worked as a Vision & Imaging Engineer at Tonbo Imaging, where my role was to build robust vision models that improved the spatial awareness of automobiles on Indian roads. My models covered object localization during day & night (using feature-fusion strategies between the RGB & infrared domains), self-supervised depth map estimation & open-world object detection.

Currently I am open to full-time roles in computer vision, multi-modal & LLM research, starting March 2025.


Email  /  CV  /  Resume  /  Github  /  LinkedIn  /  Kaggle



Research

How to make machines perceive the world?
This is the question that drives my research every morning. My fundamental goal is to understand the inner workings of vision-language models & to make a sincere contribution towards building dynamic, modular mechanisms that match the level of human intelligence: mechanisms that efficiently represent a multitude of input modalities in a single structured latent space & are contextually modifiable based on the situation at hand.

Areas of Interest:

  • Representation learning
  • Multimodal LLMs (Vision Language models)
  • Self-supervised learning
  • Multimodal latent space explainability
  • Open World Learning
  • Reflectance Models
  • Model quantization & pruning
  • Pseudo-labelled data generation


Attention Splat
3D Scene Editing

Vedant S. Joshi, Prof. Manmohan Chandraker | Project, 2024

3D Gaussian Splatting (3DGS) represents the state-of-the-art in 3D scene reconstruction, functioning by projecting 3D Gaussians onto 2D camera planes and rasterizing images. Recently, 3DGS has been adapted to various tasks beyond simple 3D scene reconstruction, tasks that were previously explored using Neural Radiance Fields (NeRFs). These tasks include 3D object segmentation, scene relighting, and modeling in-the-wild scenes, often requiring the addition of feature vectors to points in 3D space, as represented implicitly by NeRFs. Recent work has focused on attaching extra feature vectors to these Gaussians as optimization parameters to hold additional information.


However, the optimization of these features occurs in the 2D image space after rasterization. While this approach allows faithful updating and optimization of features, we believe it lacks 3D awareness. Our goal is to improve the quality of the feature vectors learnt for each Gaussian by making them aware of their local & global neighbourhood. We propose contextualizing these vectors with self-attention in the explicit 3D representation, using local & global transformers. Our initial experiments highlight several implementation challenges associated with introducing the computationally expensive O(n²) self-attention mechanism. We provide solutions to address these challenges and demonstrate the effectiveness of our approach.
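To make the idea concrete, here is a minimal, hypothetical PyTorch sketch of local self-attention over per-Gaussian features. The module name, neighbourhood size & brute-force k-NN are illustrative, not the project's actual implementation:

```python
# Sketch: contextualizing per-Gaussian features with local self-attention.
# Assumes N Gaussians with positions `xyz` (N, 3) and features `feats` (N, D).
import torch
import torch.nn as nn

class LocalGaussianAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, k: int = 16):
        super().__init__()
        self.k = k  # local neighbourhood size (includes the Gaussian itself)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, xyz: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # Brute-force k-NN in 3D; quadratic in N, so only viable for small scenes.
        dists = torch.cdist(xyz, xyz)                      # (N, N)
        knn = dists.topk(self.k, largest=False).indices    # (N, k)
        neighbours = feats[knn]                            # (N, k, D)
        query = feats.unsqueeze(1)                         # (N, 1, D)
        out, _ = self.attn(query, neighbours, neighbours)  # attend within neighbourhood
        return feats + out.squeeze(1)                      # residual feature update

# Example: 4096 Gaussians with 32-dim feature vectors
xyz, feats = torch.rand(4096, 3), torch.rand(4096, 32)
ctx = LocalGaussianAttention(dim=32)(xyz, feats)           # (4096, 32)
```

At scale, the brute-force neighbour search would need to be replaced by a spatial data structure, which is exactly the kind of implementation challenge of O(n²) attention noted above.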


YZR-Net
Self-supervised Hidden representations Invariant to Transformations for profanity detection [S.H.I.T]

Vedant S. Joshi, Dr. Sivanagaraja Tatinati, Dr. Yubo Wang | arXiv preprint, 2022

In this new age of online learning, students are still getting accustomed to virtual sessions. Watching recorded videos in the name of attending classes seems to be the trend followed by many EdTech players, but at Vedantu we understand the problems associated with such passive setups. We therefore provide live classes where students can chat directly with their teachers as well as peers & enjoy a simulated offline-class experience in an online world. The chat framework is crucial to our platform for improving the flow of information between everyone involved. Sometimes miscreants in the class use this framework to post insults or abusive language that disturbs the decorum of the session. Ill-intentioned messages directed at another student, a teacher, a racial group or a gender can foster negativity in the mind of the receiver. Even though such situations occur rarely, they must be addressed instantaneously because of the magnitude of their impact on a student's mind.


Since students don't strictly adhere to syntax or grammatical rules in a chat environment, our solution needs to be invariant to noisy substitutes for a given word. There are also scenarios where creative players come up with clever workarounds, such as self-censoring (f**k) or random character deletions (fck), to fool our detection system while maintaining the negative intent of the chat being posted on our platform. YZR-Net is our NLP adaptation of the image SSL framework SimCLR. It is trained on a multi-pair, instance-discrimination objective for words, & since we use a dictionary of unique words we are certain that the negative pairs being generated are semantically separate. The framework suited our NLP use case because we had the added challenge of detecting profanity in Hinglish, an unstructured, transliterated language. Due to the lack of research or pre-trained models in Hinglish, we had to formulate our learning objective on word syntax instead of learning the loosely defined semantics of the language. Our final goal was to learn a structured latent space where representations of profane words & their augmented counterparts lie in close proximity (metric: cosine similarity).
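For illustration, the sketch below shows hypothetical character-level augmentations of the kind described (self-censoring & random deletion) & how a positive pair would be drawn; the function names are mine, not from the paper:

```python
# Illustrative character-level augmentations mimicking chat noise.
import random

def self_censor(word: str, n: int = 2) -> str:
    """Replace up to n random interior characters with '*' (e.g. f**k)."""
    chars = list(word)
    interior = range(1, len(chars) - 1)
    for i in random.sample(interior, min(n, len(interior))):
        chars[i] = "*"
    return "".join(chars)

def random_delete(word: str, n: int = 1) -> str:
    """Delete n random characters (e.g. fck)."""
    chars = list(word)
    for i in sorted(random.sample(range(len(chars)), min(n, len(chars))), reverse=True):
        del chars[i]
    return "".join(chars)

# A positive pair for the instance-discrimination objective: two independently
# augmented views of the same dictionary word; all other words act as negatives.
word = "profanity"
view_a, view_b = self_censor(word), random_delete(word)
```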

Our system improved the baseline regex recall by 10%, & precision was maintained by using a false-positive dictionary of non-profane words (similar in syntax but different in semantics) that are removed in the pre-processing pipeline. The key achievement of our model is that we now have a compact dictionary of key profane tokens only, & this dictionary can be updated without retraining YZR-Net in case a new profane token is coined by our creative students! The implementation is useful only when a profane token appears in the chat; therefore, as part of version 2, we are working on Hierarchical Attention Networks to capture cases where a complex combination of non-profane tokens sends an ill-intentioned message.
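A rough, hypothetical sketch of the resulting inference-time lookup, with `encode` standing in for the trained YZR-Net encoder & an arbitrary similarity threshold:

```python
# Hypothetical lookup: flag a chat token when its embedding is close (cosine)
# to any key profane token, after the false-positive dictionary filter.
import numpy as np

def is_profane(token, profane_vecs, false_positives, encode, thresh=0.8):
    if token in false_positives:          # syntactically similar but benign words
        return False
    v = encode(token)                     # (D,) embedding from the trained encoder
    sims = profane_vecs @ v / (np.linalg.norm(profane_vecs, axis=1) * np.linalg.norm(v))
    return bool(sims.max() >= thresh)     # close to any key profane token?
```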


Looking For A Match
Self-supervised Clustering For Automatic Doubt Matching In e-learning Platforms

Vedant S. Joshi, Dr. Sivanagaraja Tatinati, Dr. Yubo Wang | arXiv preprint, 2022

Doubts are a natural outcome of any student's learning journey, & solving them immediately is imperative for any EdTech platform so that every child's steady growth in learning can be maintained. In this era of Big Data, the volume of data generated on Vedantu is huge, & the same holds for the doubts asked on our platform. The traditional way of solving doubts through Subject Matter Experts (SMEs) is time-consuming, redundant & infeasible: assigning one SME per doubt for every student is unimaginable. A system that can find a possible solved match for an asked doubt & create clusters of semantically similar doubts, reducing the redundancy of SMEs answering the same question, would therefore help our platform immensely.


In this work we solved a sub-problem of diagram-based matching for doubt questions. For non-diagram questions we rely on the power of OCRs & transformers to build strong text-matching engines, but for pure reverse image search we came up with a diagram extraction & matching module that prevents re-answering of the same question & reduced the redundancy of diagram-based doubts from 9.5 lakh (950,000) individual points to 2.5 lakh (250,000) clusters in our database. Our solution utilised the SOTA self-supervised framework BYOL because there are a lot of similar diagram images in our training set, & the instance-discrimination objective in SimCLR & MoCo had a high chance of pushing apart the latent representations of two semantically relevant diagram images in negative pairs, thereby hurting our top-5 matching scores.


Our implementation is an improved version of BYOL, trained on diagram images extracted by a custom-trained Scaled-YOLOv4 module that achieved a mean average precision of 90% for diagram detection. The main contribution of our work was a new, domain-specific augmentation pipeline that modelled the possible noises a student generates while uploading an image to our platform. The newly designed augmentations were guided by mutual-information metrics to capture flashes, skews, random camera noise etc. while not losing the semantic relevance of the matching pairs. The augmentations played a crucial role in learning noise-invariant, compressed representations of diagrams & aided in achieving more accurate as well as relevant matches for a given input query.
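As a loose approximation of such a pipeline, the sketch below uses off-the-shelf torchvision transforms as stand-ins for flash, skew & camera noise; the actual transforms & magnitudes were tuned with mutual-information metrics & will differ:

```python
# Approximate diagram-augmentation pipeline; expects a float image tensor in [0, 1].
import torch
from torchvision import transforms

add_sensor_noise = transforms.Lambda(
    lambda x: (x + 0.02 * torch.randn_like(x)).clamp(0, 1)  # random camera noise
)

diagram_augment = transforms.Compose([
    transforms.RandomAffine(degrees=5, shear=10),          # page skew from handheld capture
    transforms.ColorJitter(brightness=0.4, contrast=0.3),  # flash / uneven lighting
    transforms.GaussianBlur(kernel_size=3),                # soft focus
    add_sensor_noise,
])
# Two independent draws of `diagram_augment` on the same diagram form a BYOL view pair.
```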

Our image-similarity matrices show the improved convergence of our models, with stronger clustering along their diagonals. To deploy the model, the vectors computed by our custom BYOL were searched using approximate K-NN algorithms such as Hierarchical Navigable Small Worlds (HNSW), & the search space was reduced by clustering the vectors with UMAP & HDBSCAN. Each cluster was represented by a centroid, which became a key vector in the HNSW search space.
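A minimal sketch of that deployment path, using the public APIs of umap-learn, hdbscan & hnswlib (file names & hyperparameters are placeholders):

```python
# Compress BYOL vectors into cluster centroids, then index them with HNSW.
import numpy as np
import umap, hdbscan, hnswlib

vectors = np.load("byol_diagram_vectors.npy")             # (N, D) encoder outputs (placeholder)

low_dim = umap.UMAP(n_components=8).fit_transform(vectors)
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(low_dim)

# One centroid per cluster becomes a key vector in the HNSW search space
# (label -1 marks HDBSCAN noise points, which are skipped here).
centroids = np.stack([vectors[labels == c].mean(axis=0) for c in set(labels) if c != -1])

index = hnswlib.Index(space="cosine", dim=vectors.shape[1])
index.init_index(max_elements=len(centroids), ef_construction=200, M=16)
index.add_items(centroids, np.arange(len(centroids)))

query_vec = vectors[0]                                    # stand-in for a new doubt's embedding
ids, dists = index.knn_query(query_vec, k=5)              # approximate top-5 cluster matches
```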


For Your Eyes Only
Character Level Model For Lip Reading [Won Best Paper]

Vedant S. Joshi, Dr. Ebin Deni Raj | 2021 8th International Conference on Smart Computing & Communications (ICSCC)

The ability of the human mind to handle multiple input sources at once & make sense of the environment we are in is truly extraordinary. Our mind is also dynamic enough to adapt to situations where we lose certain input sources & still make the best possible use of the information available to us. We tend to think that speech understanding depends entirely on hearing, but vision also plays a key hidden role that helps us disambiguate many confusing scenarios; people who have trouble hearing rely heavily on their vision to understand speech. Based on this observation, I worked on my thesis, titled For Your Eyes Only, to build an end-to-end deep learning system that accurately maps a set of lip movements in a given video to its corresponding characters.


The system is trained on a subset of words from the large-scale Oxford-BBC Lip Reading in the Wild (LRW) dataset. The whole problem statement is formulated around single-word prediction, character by character, because the aim of the thesis was a system that learns the lip-movement-to-character mapping using only a limited set of words in training. Character-level prediction makes our system more generalized in case we encounter a previously unseen word; such out-of-vocabulary words cannot be handled by simple classification models.

We train the model on 112 words, & the videos for each are processed according to the timestamp at which the word is spoken. To further ease the task of learning in the spatial domain, we use the dlib facial-landmarks library to extract only the lip region from these videos, with some buffer. Speaking-speed normalisation is also applied so that every video is 22 frames long in the temporal dimension, supporting efficient batching of training data into tensors.
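A hedged sketch of that lip-region extraction step, using dlib's standard 68-point landmark model in which points 48-67 cover the mouth; the buffer size is illustrative:

```python
# Crop the lip region from a video frame using dlib facial landmarks.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# 68-point landmark model, downloaded separately from the dlib model zoo
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_lips(frame: np.ndarray, buffer: int = 10) -> np.ndarray:
    """Return the lip region of a BGR frame, padded by `buffer` pixels."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    face = detector(gray)[0]                     # assumes one speaker per frame
    shape = predictor(gray, face)
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    x0, y0 = np.maximum(pts.min(axis=0) - buffer, 0)
    x1, y1 = pts.max(axis=0) + buffer
    return frame[y0:y1, x0:x1]
```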

For baseline performance we repurposed DeepMind's LipNet model from a sentence-level to a word-level predictor & trained it from scratch on the LRW subset. To learn complex frame interactions & improve model explainability, we added a Bahdanau attention mechanism between the 3D ConvNet encoder & the GRU decoder, depicted by the attention heatmap. The final prediction is made with greedy CTC decoding, which collapses a character that was scaled across multiple frames during alignment learning into a single output.
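For reference, greedy CTC decoding reduces to taking the argmax per frame, collapsing repeats & dropping the blank symbol; a minimal sketch (alphabet size & blank index assumed):

```python
# Greedy CTC decoding for one video clip of frame-level character scores.
import torch

def ctc_greedy_decode(logits: torch.Tensor, blank: int = 0) -> list:
    """logits: (T, C) scores over C symbols for T frames."""
    path = logits.argmax(dim=-1).tolist()   # best symbol index per frame
    decoded, prev = [], None
    for p in path:
        if p != prev and p != blank:        # collapse repeats, drop blanks
            decoded.append(p)
        prev = p
    return decoded

# e.g. 22 normalised frames over an assumed 28-symbol alphabet (blank + characters)
prediction = ctc_greedy_decode(torch.randn(22, 28))
```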



Quantized Coconut Detection Models with Edge Devices

Vedant S. Joshi, Dr. Ebin Deni Raj, Jeena Thomas | Journal of Interconnection Networks, 2022

Owing to its landscape & natural conditions, the coconut is an important fruit in Kerala. Coconut farms are an important part of the state's economy, & many citizens' livelihoods depend on this fruit. Plucking coconuts from palm trees is a huge challenge since it involves scaling large heights, & without proper equipment it can lead to serious injuries for daily-wage workers. Inspired by this, I started work on my Honours degree thesis project, Coco-Layers.

The whole project was divided into 2 modules:

  • Hardware: Building an autonomous drone with flight stabilization for the curation of a novel coconut dataset capturing variances in real-world lighting conditions, the multiple scales at which coconuts occur in the wild & possible hindrances that could confuse the overall detection system.

  • Software: An on-board object detection module that supported edge-based real-time inference without losing detection accuracy.

We experimented on the Nvidia Jetson Nano & Raspberry Pi 3B+ using the Nvidia TensorRT and TensorFlow Lite quantisation kits to reduce model resource consumption. To push the limits further, a frame-buffer handling mechanism was written in OpenCV to reduce calls to main memory & achieve faster frame-processing rates. The peak accuracy achieved in the study was 0.4 mean average precision at a 22 FPS detection rate (using a Tiny YOLOv4) in real-time flight. The downside of fast processing was flight time reduced from 12 minutes to 6 minutes, which we kept as a goal to improve in the next phase.
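A minimal sketch of the TensorFlow Lite post-training quantisation path (the TensorRT path is analogous); the saved-model path & calibration file are hypothetical placeholders:

```python
# Post-training int8 quantisation of a detector with the TFLite converter.
import numpy as np
import tensorflow as tf

def representative_data():
    # ~100 real frames from drone footage drive the int8 calibration
    for frame in np.load("calibration_frames.npy")[:100]:
        yield [frame[None].astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("tiny_yolov4_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]      # enable quantisation
converter.representative_dataset = representative_data    # full-integer calibration
tflite_model = converter.convert()

with open("tiny_yolov4_quant.tflite", "wb") as f:
    f.write(tflite_model)
```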



Self-supervised Frameworks: Kaggle Implementations

In my free time I contribute to the open-source community by writing Kaggle notebooks that approach the latest research in self-supervised learning from a mathematical point of view. My notebooks are step-by-step implementations of the latest frameworks on small-scale datasets, with links to other research papers that might aid fellow researchers in their problem statements. Some of my implementations are as follows:

  • BYOL
  • SimCLR
  • SCARF
  • Deep K-Means
  • Selective Search