selected projects from personal, academic, and professional work.
image for illustration only (source)
I like reading on Kindle because my notes and highlights actually get saved somewhere. Kindle dumps those into a text file called 'My Clippings.txt', and I wanted a way to view those highlights easily.
I've also been doing some backend/data engineering at work, so I thought this project would be a good way to solidify what I've been learning there.
So I wrote some pipelines to do just that! They extract my data from the file into a DB and add some metadata per book. I also made a Streamlit app to go with it! (There's a rough sketch of the parsing step further down.)
It was fun, so I went ahead and added some quick analytics to it hehe. Check out the app in the link below!
Dagster, SQLAlchemy, Streamlit, scikit-learn, pandas, etc.
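For flavor, here's a minimal sketch of what the extraction step looks like: parse the standard 'My Clippings.txt' format (entries separated by "==========") and load each highlight into a table with SQLAlchemy. The Dagster orchestration and the per-book metadata enrichment are left out, and the names (Highlight, parse_clippings) and the schema are illustrative, not the actual pipeline code.

```python
# Illustrative sketch: parse Kindle's 'My Clippings.txt' and load highlights
# into a SQLite database via SQLAlchemy. Names and schema are hypothetical.
from sqlalchemy import Column, Integer, String, Text, create_engine
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Highlight(Base):
    __tablename__ = "highlights"
    id = Column(Integer, primary_key=True)
    title = Column(String)          # book title + author line
    metadata_line = Column(String)  # location / timestamp line
    text = Column(Text)             # the highlight itself

def parse_clippings(path):
    """Yield (title, metadata_line, text) tuples from My Clippings.txt."""
    raw = open(path, encoding="utf-8-sig").read()
    for entry in raw.split("=========="):
        lines = [ln.strip() for ln in entry.strip().splitlines() if ln.strip()]
        if len(lines) >= 3:
            yield lines[0], lines[1], "\n".join(lines[2:])

engine = create_engine("sqlite:///clippings.db")
Base.metadata.create_all(engine)
with Session(engine) as session:
    for title, meta, text in parse_clippings("My Clippings.txt"):
        session.add(Highlight(title=title, metadata_line=meta, text=text))
    session.commit()
```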
image for illustration only (source)
(This is not supposed to be the abstract.) I've always been interested in the brain, cognition, psychology, and the like... long story short, I found eye movements to be a related and exciting area of research, so I combined that with my interest in unsupervised learning and tried to explore representation learning for eye movement signals.
After a hundred-plus research papers, lots of abstract maths, and higher electricity bills, I ended up with two models: one is an autoencoder built on temporal convolutional networks (TCNs), and the other is also a TCN but trained with contrastive learning. Links down below! It was a glorious time when my left brain was the happiest.
a lot of PyTorch, then the usual scikit-learn, numpy, and pandas
(Autoencoder) ICPR 2020 Paper, Presentation, Code
(Contrastive Learning) EUSIPCO 2021 Paper, Presentation
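To give a feel for the autoencoder variant, here's a minimal PyTorch sketch: a dilated 1D-convolutional (TCN-style) encoder compresses a gaze signal into a fixed-size embedding, and a decoder reconstructs the signal from it. The channel sizes, kernel widths, and 2-channel (x, y) input are illustrative assumptions rather than the published architecture, and the contrastive-learning variant isn't shown.

```python
# Minimal sketch of a TCN-style autoencoder for 1D eye-movement signals.
# Architecture details are illustrative, not the published setup.
import torch
import torch.nn as nn

class TCNAutoencoder(nn.Module):
    def __init__(self, in_channels=2, hidden=32, latent=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(in_channels, hidden, kernel_size=5, dilation=1, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, dilation=2, padding=4),
            nn.ReLU(),
            nn.Conv1d(hidden, latent, kernel_size=5, dilation=4, padding=8),
            nn.AdaptiveAvgPool1d(1),          # pool over time -> fixed-size embedding
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(latent, hidden, kernel_size=4, stride=4),
            nn.ReLU(),
            nn.ConvTranspose1d(hidden, hidden, kernel_size=4, stride=4),
            nn.ReLU(),
            nn.ConvTranspose1d(hidden, in_channels, kernel_size=4, stride=4),
        )

    def forward(self, x):                     # x: (batch, channels, time)
        z = self.encoder(x)                   # (batch, latent, 1)
        recon = self.decoder(z)               # reconstruction, same length as input
        return z.squeeze(-1), recon

signal = torch.randn(8, 2, 64)                # fake batch of 64-sample gaze signals
embedding, recon = TCNAutoencoder()(signal)
loss = nn.functional.mse_loss(recon, signal)  # reconstruction objective
```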
image for illustration only (source)
2022
Customer data for one of TM's clients is very important. They have millions of customer and transaction records lying around, and they wanted to deduplicate those and then augment them with third-party data. (Of course, there were added complications due to the nature of the records, but I won't go into those for confidentiality.)
Text preprocessing at scale using BigQuery, then supercharged fuzzy matching using Python. By 'supercharged' I mean layer after layer of business rules and matching rules, topped off with the quintessential TF-IDF fuzzy matching method. Everything runs in under 3 hours on a 32 GB VM!
BigQuery, layers upon layers of pandas vector operations, some scikit-learn
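Here's a toy sketch of just the TF-IDF fuzzy-matching layer; the business and matching rules that run before it, and the sparse/blocked tricks you'd need at millions of rows, are omitted. The sample records and variable names are made up for illustration.

```python
# Toy sketch: character n-gram TF-IDF + cosine similarity for fuzzy matching.
# Sample data and names are illustrative only.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

customers = pd.Series(["Juan Dela Cruz", "Maria Santos", "Jose P. Rizal"])
third_party = pd.Series(["juan de la cruz", "ma. santos", "j. rizal"])

# Character 3-grams are robust to typos, spacing, and abbreviation quirks.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
left = vectorizer.fit_transform(customers)
right = vectorizer.transform(third_party)

scores = cosine_similarity(left, right)       # (n_customers, n_third_party)
best = scores.argmax(axis=1)                  # best third-party match per customer

matches = pd.DataFrame({
    "customer": customers,
    "match": third_party.iloc[best].values,
    "score": scores.max(axis=1),
})
print(matches)
```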
image for illustration only (source)
2021
Our team at TM was able to create nicely performing prediction models on sampled geographic points, but then we needed to roll them out to the whole country. The naïve solution (iteratively reading data from BQ into pandas), however... took 5 days.
I investigated the bottlenecks and hacked around the code, queries, BQ tables, and GCS. In the end, I found that the right mix of those tools (e.g. using clustered and partitioned BQ tables, exporting the data to GCS, and looping over the files) dropped the runtime to 1 day. It was a happy day!
BigQuery, GCS, pandas, scikit-learn
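Roughly, the faster path looks like the sketch below, assuming the google-cloud-bigquery, google-cloud-storage, and gcsfs/pandas stack. The project, dataset, bucket, table, feature, and model names are placeholders, and a trained model is assumed to exist already; this is not the production code.

```python
# Sketch of the "export to GCS, then loop over shards" pattern.
# All names are placeholders; a trained model is assumed to be saved already.
from google.cloud import bigquery, storage
import pandas as pd
import joblib

client = bigquery.Client(project="my-project")             # placeholder project

# 1) Export the (clustered, partitioned) table to sharded CSVs on GCS.
extract_job = client.extract_table(
    "my-project.my_dataset.grid_points",                   # placeholder table
    "gs://my-bucket/exports/points-*.csv",                 # '*' becomes shard numbers
)
extract_job.result()                                        # wait for the export

# 2) Loop over the shards and score each chunk locally instead of
#    pulling rows through the BQ API.
model = joblib.load("model.joblib")                         # previously trained model
feature_columns = ["feat_1", "feat_2", "feat_3"]            # placeholder features
bucket = storage.Client().bucket("my-bucket")
for blob in bucket.list_blobs(prefix="exports/points-"):
    chunk = pd.read_csv(f"gs://my-bucket/{blob.name}")      # reads from GCS via gcsfs
    chunk["prediction"] = model.predict(chunk[feature_columns])
    chunk.to_csv(f"gs://my-bucket/predictions/{blob.name.split('/')[-1]}", index=False)
```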
image for illustration only (source)
2020
It's a tradition at Thinking Machines to use AI/ML to generate something Christmas-y every holiday season. I was one of the Christmas elves for 2020, and our theme was NLP!
I scraped Christmas-related songs and poems and finetuned a deep language model on them so that it would learn to embody the holiday spirit. The final output is a customizable virtual greeting card, which we served on TM's website -- it was a cool project! More details in the blog post below.
GPT-2 on Hugging Face (PyTorch), Scrapy, FastAPI
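As a taste of the generation side, here's a sketch using Hugging Face transformers: load a finetuned GPT-2 checkpoint and sample a greeting from a user prompt. The checkpoint path and sampling settings are illustrative, not the production setup.

```python
# Sketch: sample a holiday greeting from a finetuned GPT-2 checkpoint.
# Checkpoint path and generation settings are placeholders.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("path/to/finetuned-christmas-gpt2")

prompt = "Dear friend, this Christmas"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=60,
    do_sample=True,          # sampling keeps the greetings varied
    top_p=0.92,
    temperature=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```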
image for illustration only (source)
2019
A friend and I were interested in exploring ML for neuroscience. We thought it'd be fun to try to predict what image a person is viewing based on their fMRI brain scans.
This is understandably a hard problem (it's an exciting area of research though, with some papers finding success already), so we kept things simple at first -- we tried to predict the dominant colors in the image, hypothesizing that viewing certain colors would have some distinguishable manifestation in the brain.
(...we weren't very successful, but that's research experimentation life... and we really only did this during our free time...)
PyTorch
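For illustration, a toy version of the setup might look like this: treat a masked, flattened fMRI volume as a feature vector and train a small classifier to predict the viewed image's dominant-color bin. The voxel count, number of color bins, and architecture here are all made up; this is not the actual experiment code.

```python
# Toy sketch: small classifier from flattened fMRI features to color bins.
# Shapes, data, and architecture are illustrative only.
import torch
import torch.nn as nn

n_voxels, n_colors = 5000, 8                  # e.g. 8 coarse color bins (assumed)

model = nn.Sequential(
    nn.Linear(n_voxels, 256),
    nn.ReLU(),
    nn.Dropout(0.5),                          # fMRI datasets are tiny; regularize hard
    nn.Linear(256, n_colors),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on a fake batch of scans and color labels.
scans = torch.randn(16, n_voxels)
labels = torch.randint(0, n_colors, (16,))
optimizer.zero_grad()
loss = criterion(model(scans), labels)
loss.backward()
optimizer.step()
```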
image for illustration only (source)
2019
For a Reinforcement Learning class, we wanted to see if adding an incentive to an image captioning model would increase its performance.
We used this CVPR paper as a jumping-off point and reformulated the problem as an RL problem (using Self-Critical Sequence Training). We incentivized the model to maximize the similarity between the generated caption's contextual embeddings and the ground-truth caption's, where the contextual embeddings were extracted using a BERT model.
(...we weren't very successful, but that's research experimentation life...)
PyTorch, BERT
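A minimal sketch of that reward signal, assuming bert-base-uncased and mean-pooled token embeddings (the pooling choice is an assumption, and the SCST training loop around the reward is not shown):

```python
# Sketch: reward = cosine similarity between BERT embeddings of the
# generated caption and the ground-truth caption. Pooling is an assumption.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def caption_embedding(caption: str) -> torch.Tensor:
    tokens = tokenizer(caption, return_tensors="pt")
    hidden = bert(**tokens).last_hidden_state      # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)           # mean-pool over tokens

def reward(generated: str, ground_truth: str) -> float:
    return torch.cosine_similarity(
        caption_embedding(generated),
        caption_embedding(ground_truth),
        dim=0,
    ).item()

print(reward("a dog runs on the beach", "a dog is running along the shore"))
```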
image for illustration only (source)
2017
We came in as fresh seniors to the computer vision and machine intelligence lab, and our professor was into medical images. That led us to the lumbar vertebrae -- specifically, fractured ones, which we wanted to detect and classify.
We had to do a lot of 3D image processing to augment our dataset (it was cool tinkering around with 3D data!), while my partner took care of finetuning a VGG model. Everything turned out OK.
And this would be the reason I would go back to do a master's a year later.
TensorFlow, OpenCV
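A rough sketch of the finetuning setup in modern Keras terms (the 3D preprocessing and augmentation pipeline is omitted, and the input size, classifier head, and class labels are illustrative assumptions, not the original code):

```python
# Sketch: finetune ImageNet VGG16 features with a small classification head.
# Input size, head, and class labels are assumptions for illustration.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                        # freeze the pretrained features

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(3, activation="softmax"),    # e.g. 3 hypothetical fracture classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, validation_data=..., epochs=10)
```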