Projects

selected projects from personal, academic, and professional work.

Featured

Kindle Highlights ✨

a personal + learning project

The Situation

I like reading on Kindle because my notes and highlights actually get saved somewhere. Kindle dumps those into a text file called 'My Clippings', and I wanted an easy way to view those highlights.

I've also been doing some backend/data engineering at work, so I thought this project would be a good way to solidify what I've learned there.

The Solution

So I wrote some pipelines to do just that! They extract my data from the file into a DB and add some metadata per book. I also made a Streamlit app to go with it! 🙂
It was fun so I went ahead and added some quick analytics to it hehe. Check out the app in the link below! 👇🏻
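
To give a feel for the extraction step, here is a minimal sketch of parsing the clippings file. It is not the project's actual code: it assumes the usual 'My Clippings' layout (a title line, a metadata line, the highlight text, then a `==========` separator), and the names `Clipping`, `parse_clippings`, and `SAMPLE` are made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

SEPARATOR = "=========="  # assumed delimiter between entries


@dataclass
class Clipping:
    book: str  # title line, usually "Title (Author)"
    meta: str  # "- Your Highlight on page ... | Added on ..." line
    text: str  # the highlighted passage itself


def parse_clippings(raw: str) -> list[Clipping]:
    """Split the raw file contents into one Clipping per highlight."""
    clippings = []
    for block in raw.split(SEPARATOR):
        # Strip whitespace and a possible byte-order mark, then drop blank lines.
        lines = [ln.strip("\ufeff \t\r") for ln in block.strip().splitlines()]
        lines = [ln for ln in lines if ln]
        if len(lines) < 3:
            continue  # bookmarks and empty entries carry no highlight text
        book, meta, *body = lines
        clippings.append(Clipping(book=book, meta=meta, text=" ".join(body)))
    return clippings


# Made-up sample in the assumed layout: one highlight and one bookmark.
SAMPLE = """Example Book (Some Author)
- Your Highlight on page 12 | Location 180-182 | Added on Monday, January 4, 2021 10:15:32 PM

A highlighted sentence.
==========
Example Book (Some Author)
- Your Bookmark on page 40 | Location 600 | Added on Tuesday, January 5, 2021 9:00:00 AM


==========
"""
```

Calling `parse_clippings(SAMPLE)` keeps the highlight and skips the bookmark. From there, each `Clipping` could be written to a database table and joined with per-book metadata.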

The Stack

Dagster, SQLAlchemy, Streamlit, scikit-learn, pandas, etc.

The Sources

Streamlit App

GitHub

Embeddings for gaze patterns 👁

my master's thesis

The Situation

(this is not supposed to be the abstract): I've always been interested in the brain, cognition, psychology, and the like... long story short I found eye movements to be a related and exciting area of research, so I combined that with my interest in unsupervised learning and tried to explore representation learning for eye movement signals 👀 🤖

The Solution

After a hundred+ research papers, lots of abstract maths, and higher electricity bills, I ended up with two models: one is an autoencoder built with TCNs (temporal convolutional networks), and the other is also a TCN but is trained with contrastive learning. Links down below! It was a glorious time when my left 🧠 was the happiest.
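
For readers new to TCNs, the core building block is a dilated causal convolution over a 1D signal. The sketch below writes it in plain NumPy so the indexing is visible. It is only an illustration, not the thesis code (which used PyTorch), and the function names are made up.

```python
import numpy as np


def causal_dilated_conv(x: np.ndarray, weights: np.ndarray, dilation: int) -> np.ndarray:
    """1D causal convolution: y[t] = sum_k weights[k] * x[t - k * dilation].

    Samples before the start of the signal are treated as zeros, so the
    output has the same length as the input and never looks at the future.
    """
    kernel_size = len(weights)
    pad = (kernel_size - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for k, w in enumerate(weights):
        # Shift the signal back by k * dilation steps and accumulate.
        start = pad - k * dilation
        y += w * padded[start:start + len(x)]
    return y


def receptive_field(kernel_size: int, dilations: list) -> int:
    """Number of past samples one output can see, with one conv per dilation."""
    return 1 + (kernel_size - 1) * sum(dilations)
```

With `x = [1, 2, 3, 4, 5]`, `weights = [1, 1]`, and `dilation=2`, each output is `x[t] + x[t - 2]`, giving `[1, 2, 4, 6, 8]`. Stacking layers with growing dilations (say 1, 2, 4) widens the receptive field quickly, which is what makes this kind of network a good fit for long signals.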

The Stack

a lot of PyTorch, then the usual scikit-learn, numpy, and pandas

The Sources

Final Defense Presentation

(Autoencoder) ICPR 2020 Paper, Presentation, Code

(Contrastive Learning) EUSIPCO 2021 Paper, Presentation

Other projects

Fuzzy matching for 10M records

an unsexy but important task

The Situation

Customer data is very important to one of TM's clients. They have millions of customer information and transaction records lying around, and they wanted to deduplicate those and then augment them with third-party data. (Of course, there were added complications due to the nature of the records, but I won't go into that for confidentiality.)

The Solution

Text preprocessing at scale using BigQuery, then supercharged fuzzy matching using Python. By supercharged I mean layer after layer of business rules and matching rules, topped off with the quintessential TF-IDF fuzzy matching method. Everything ran in under 3 hours on a 32 GB VM!
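
The TF-IDF fuzzy matching idea can be sketched as follows: turn each string into character n-gram TF-IDF vectors, then pair each query with its most similar candidate by cosine similarity. This is a toy illustration rather than the client code. The n-gram range, the threshold, and the function name `fuzzy_match` are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def fuzzy_match(queries, candidates, threshold=0.6):
    """Return (query, best_candidate, score) for pairs scoring >= threshold."""
    # Character n-grams make the vectors robust to typos and spacing changes.
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3), lowercase=True)
    cand_matrix = vectorizer.fit_transform(candidates)
    query_matrix = vectorizer.transform(queries)
    # Rows are L2-normalised by default, so the dot product is cosine similarity.
    scores = (query_matrix @ cand_matrix.T).toarray()
    matches = []
    for i, query in enumerate(queries):
        j = int(scores[i].argmax())
        if scores[i, j] >= threshold:
            matches.append((query, candidates[j], float(scores[i, j])))
    return matches
```

For example, `fuzzy_match(["juan de la cruz"], ["Juan dela Cruz", "Maria Santos"], threshold=0.3)` pairs the query with "Juan dela Cruz" (made-up names). The dense `.toarray()` call is only fine for small inputs. At millions of rows the similarity matrix would have to stay sparse and be computed in chunks.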

The Stack

BigQuery, layers upon layers of pandas vector operations, some scikit-learn

Model rollout for a whole country in less than 24 hours

a fun hacking-around at work

The Situation

Our team at TM was able to create nicely-performing prediction models on sampled geographic points, but then we needed to roll them out to the whole country. The naïve solution (iteratively reading data from BQ into pandas), however... takes 5 days.

The Solution

I investigated the bottlenecks and hacked around the code, queries, BQ tables, and GCS. In the end, I found a mix of these tools (e.g. using clustered and partitioned BQ tables, exporting the data into GCS, and looping over the files) that dropped the runtime to 1 day. It was a happy day!
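
The "loop over the files" part can be sketched like this: score one exported shard at a time so that only a single file sits in memory. It is a simplified illustration, not the production code. Local CSV paths stand in for the exported GCS files, and `predict_over_shards` and its parameters are hypothetical names.

```python
import glob
import os

import pandas as pd


def predict_over_shards(model, pattern, feature_cols, out_dir):
    """Score every exported shard matching `pattern`, one file at a time.

    `model` is anything with a scikit-learn style `.predict(X)` method.
    Returns the list of output paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for path in sorted(glob.glob(pattern)):
        shard = pd.read_csv(path)  # only this shard is held in memory
        shard["prediction"] = model.predict(shard[feature_cols])
        out_path = os.path.join(out_dir, os.path.basename(path))
        shard.to_csv(out_path, index=False)
        written.append(out_path)
    return written
```

Because each shard is independent, a loop like this is also easy to split across several processes or machines if one worker is still too slow.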

The Stack

BigQuery, GCS, pandas, scikit-learn

AI-generated holiday greetings 🎄

as a designated Christmas elf

The Situation

It's a tradition at Thinking Machines to use AI/ML to generate something Christmas-y every holiday season. I was one of the Christmas elves for 2020, and our theme was NLP!

The Solution

I scraped Christmas-related songs and poems and fine-tuned a deep language model on them so that the model learns to embody the holiday spirit ✨. The final output is a customizable virtual greeting card, which we served on TM's website -- it was a cool project! More details in the blog post below.

The Stack

GPT-2 on Huggingface (PyTorch), ScraPy, FastAPI

The Sources

Thinking Machines Blog Post

Model Training Code

Decoding images from brain fMRI?

a curiosity project

The Situation

A friend and I were interested in exploring ML for Neuroscience. We thought it'd be fun to try to predict what image a person is viewing based on their fMRI brain scans.

The Solution

This is understandably a hard problem (it's an exciting area of research though! with some papers finding some success already), so we kept this simple at first -- we tried to predict the dominant colors in the image, hypothesizing that viewing certain colors has some distinguishable manifestation in the brain.

(...we weren't very successful, but that's research experimentation life... 🤷🏻‍♀️ and we really only did this during our free time...)

The Stack

PyTorch

Can we improve an image captioning model using BERT embeddings?

a mini project

The Situation

For a Reinforcement Learning class, we wanted to see if adding an incentive to a model would improve its performance.

The Solution

We used this CVPR paper as a jump-off point, and reformulated the problem as an RL problem (using Self-Critical Sequence Training). We incentivized the model to maximize the similarity between the contextual embeddings of the generated caption and those of the ground-truth caption. Contextual embeddings were extracted using a BERT model.

(...we weren't very successful, but that's research experimentation life... 🤷🏻‍♀️)
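
The reward setup described above can be sketched in a few lines. In Self-Critical Sequence Training, the reward of a sampled caption is compared against the reward of the model's own greedy caption, and the difference scales the log-probability of the sampled tokens. This NumPy version only shows the arithmetic and is not our class project code. The embeddings are assumed to be precomputed, non-zero vectors (e.g. pooled BERT outputs), and the function names are made up.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two non-zero embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def self_critical_advantage(sample_emb, greedy_emb, truth_emb) -> float:
    """Reward of the sampled caption minus the greedy-decoding baseline."""
    reward = cosine_similarity(sample_emb, truth_emb)
    baseline = cosine_similarity(greedy_emb, truth_emb)
    return reward - baseline


def scst_loss(sample_log_probs, sample_emb, greedy_emb, truth_emb) -> float:
    """Loss for one caption: -(reward - baseline) * sum of token log-probs."""
    advantage = self_critical_advantage(sample_emb, greedy_emb, truth_emb)
    return -advantage * float(np.sum(sample_log_probs))
```

When the sampled caption is closer to the ground truth than the greedy one, the advantage is positive and minimizing the loss pushes up the probability of the sampled tokens. In a real training loop the log-probabilities would come from the captioning model so that gradients can flow through them.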

The Stack

PyTorch, BERT

The Sources

GitHub

Classifying vertebrae fractures from 3D CT Scans

our undergraduate thesis

The Situation

We came in as fresh seniors to the computer vision and machine intelligence lab, and our professor was into medical images. That led us to lumbar vertebrae, specifically fractured ones, which we wanted to detect and classify.

The Solution

We had to do a lot of 3D image processing to augment our dataset (it was cool tinkering around with 3D data!), while my partner took care of fine-tuning a VGG model. Everything turned out OK.

And this is the reason I went back to do a master's one year later.

The Stack

TensorFlow, OpenCV

The Sources

Presentation

Paper