Projects

image for illustration only (source)

Fuzzy matching for 10M records

an unsexy but important task

2022

The Situation

Customer data is very important to one of TM's clients. They have millions of customer and transaction records lying around, and they want to deduplicate those and then augment them with third-party data. (Of course, there are added complications due to the nature of the records, but I won't go into that for confidentiality.)

The Solution

Text preprocessing at scale using BigQuery, then supercharged fuzzy matching using Python. By supercharged I mean layer after layer of business rules and matching rules, topped off with the quintessential TF-IDF fuzzy matching method. Everything runs in under 3 hours on a 32 GB VM!
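To give a feel for the TF-IDF step, here is a minimal sketch, not the actual client pipeline. It assumes character n-gram TF-IDF vectors compared by cosine similarity, which is a common way to do this kind of matching. The function name `fuzzy_match`, the 0.5 threshold, and the sample names are all made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def fuzzy_match(queries, candidates, threshold=0.5):
    """Return (query, best_candidate, score) for queries scoring above threshold."""
    queries, candidates = list(queries), list(candidates)

    # Character n-grams inside word boundaries tolerate typos and abbreviations.
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3), lowercase=True)
    # Fit on both sides so queries and candidates share one vocabulary.
    tfidf = vectorizer.fit_transform(candidates + queries)
    candidate_vecs = tfidf[: len(candidates)]
    query_vecs = tfidf[len(candidates):]

    # Rows are L2-normalized by default, so a dot product is a cosine similarity.
    # Densifying is fine for a toy example; at scale you would keep this sparse.
    scores = (query_vecs @ candidate_vecs.T).toarray()

    matches = []
    for i, query in enumerate(queries):
        best = scores[i].argmax()
        if scores[i, best] >= threshold:
            matches.append((query, candidates[best], float(scores[i, best])))
    return matches


# Hypothetical records, purely for illustration.
candidates = ["Juan Dela Cruz", "Maria Clara Santos", "Jose Rizal"]
queries = ["juan dela curz", "Ma. Clara Santos", "Andres Bonifacio"]
for query, match, score in fuzzy_match(queries, candidates):
    print(f"{query!r} -> {match!r} ({score:.2f})")
```

In this toy run the first two queries find their intended records and the third finds nothing above the threshold. On millions of records, the business and matching rules would come first to shrink the set of pairs that ever reach this step.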

The Stack

BigQuery, layers upon layers of pandas vector operations, some scikit-learn

image for illustration only (source)

Model rollout for a whole country in less than 24 hours

a fun hacking-around at work

2021

The Situation

Our team at TM was able to create nicely-performing prediction models on sampled geographic points, but then we needed to roll them out to the whole country. The naïve solution (iteratively reading data from BQ into pandas), however, takes 5 days.

The Solution

I investigated the bottlenecks and hacked around the code, queries, BQ tables, and GCS. In the end, I found a mix of these tools (e.g. use clustered and partitioned BQ tables, export the data into GCS, and loop over the files) that dropped the runtime to 1 day. It was a happy day!
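The "loop over the files" part can be sketched roughly like this. It is a simplified stand-in, not the real rollout code: it reads local CSV files where the real job read exports sitting in GCS, and `predict_shards`, `toy_predict`, and the feature names are hypothetical.

```python
import glob
import os
import tempfile

import pandas as pd


def predict_shards(input_pattern, output_dir, predict_fn):
    """Score each exported shard on its own so memory use stays bounded."""
    os.makedirs(output_dir, exist_ok=True)
    total_rows = 0
    for path in sorted(glob.glob(input_pattern)):
        shard = pd.read_csv(path)
        shard["prediction"] = predict_fn(shard)
        shard.to_csv(os.path.join(output_dir, os.path.basename(path)), index=False)
        total_rows += len(shard)
    return total_rows


def toy_predict(df):
    # Stand-in for a fitted model's predict(); the feature names are made up.
    return 2 * df["feature_a"] + df["feature_b"]


with tempfile.TemporaryDirectory() as tmp:
    # Simulate three shards exported from a warehouse table.
    for i in range(3):
        pd.DataFrame({"feature_a": [i, i + 1], "feature_b": [1, 1]}).to_csv(
            os.path.join(tmp, f"shard-{i:012d}.csv"), index=False
        )
    n_rows = predict_shards(
        os.path.join(tmp, "shard-*.csv"), os.path.join(tmp, "out"), toy_predict
    )
    first = pd.read_csv(os.path.join(tmp, "out", "shard-000000000000.csv"))

print(n_rows)
```

Because each shard is handled independently, the same loop can be split across processes or machines without changing the per-file logic.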

The Stack

BigQuery, GCS, pandas, scikit-learn

image for illustration only (source)

AI-generated holiday greetings 🎄

as a designated Christmas elf

2020

The Situation

It's a tradition at Thinking Machines to use AI/ML to generate something Christmas-y every holiday season. I was one of the Christmas elves for 2020, and our theme was NLP!

The Solution

I scraped Christmas-related songs and poems and fine-tuned a deep language model on them so that the model learns to embody the holiday spirit ✨. The final output is a customizable virtual greeting card, which we served on TM's website -- it was a cool project! More details in the blog post below.

The Stack

GPT-2 on Hugging Face (PyTorch), Scrapy, FastAPI

The Sources

Thinking Machines Blog Post

Model Training Code

image for illustration only (source)

Decoding images from brain fMRI?

a curiosity project

2019

The Situation

A friend and I were interested in exploring ML for Neuroscience. We thought it'd be fun to try to predict what image a person is viewing based on their fMRI brain scans.

The Solution

This is understandably a hard problem (it's an exciting area of research though, with some papers finding some success already!), so we kept this simple at first -- we tried to predict the dominant colors in the image, hypothesizing that viewing certain colors has some distinguishable manifestation in the brain.

(...we weren't very successful, but that's research experimentation life... 🤷🏻‍♀️ and we really only did this during our free time...)
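For the curious, here is one simple way a "dominant color" label could be computed from an image. This is an assumed sketch, not necessarily how we built our targets: it buckets RGB pixels into coarse bins and returns the center of the most common bin. `dominant_color` and the sample pixels are made up for illustration.

```python
from collections import Counter


def dominant_color(pixels, bins_per_channel=4):
    """Return the center of the most common coarse RGB bin.

    Assumes 8-bit channels and a bins_per_channel that divides 256 evenly.
    """
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    (r_bin, g_bin, b_bin), _ = counts.most_common(1)[0]
    half = step // 2
    return (r_bin * step + half, g_bin * step + half, b_bin * step + half)


# Toy "image": mostly reddish pixels with a few blue ones.
pixels = [(250, 10, 10)] * 5 + [(240, 20, 5)] * 3 + [(10, 10, 200)] * 4
print(dominant_color(pixels))  # → (224, 32, 32)
```

A coarse label like this turns the decoding task into classification over a small set of color bins, which is far easier to set up than reconstructing full images.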

The Stack

PyTorch

image for illustration only (source)

Can we improve an image captioning model using BERT embeddings?

a mini project

2019

The Situation

For a Reinforcement Learning class, we wanted to see if adding an incentive to a model will increase its performance.

The Solution

We used this CVPR paper as a jumping-off point, and reformulated the problem as an RL problem (using Self-Critical Sequence Training). We incentivized the model to maximize the similarity between the generated caption's contextual embeddings and those of the ground-truth caption. Contextual embeddings were extracted using a BERT model.

(...we weren't very successful, but that's research experimentation life... 🤷🏻‍♀️)
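The self-critical idea can be sketched in a few lines. This is a toy version under stated assumptions: plain Python lists stand in for BERT embeddings and for the model's token log-probabilities, and `scst_loss` is a hypothetical name. In real training the log-probabilities would be PyTorch tensors, and gradients would flow through them only.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def scst_loss(sample_log_probs, sampled_emb, greedy_emb, reference_emb):
    """Self-critical loss for one caption: -(r_sample - r_greedy) * sum(log p)."""
    reward_sample = cosine_similarity(sampled_emb, reference_emb)
    # The greedy caption's reward is the baseline the sample has to beat.
    reward_greedy = cosine_similarity(greedy_emb, reference_emb)
    advantage = reward_sample - reward_greedy
    return -advantage * sum(sample_log_probs)


# Toy numbers: the sampled caption matches the reference, the greedy one does not.
loss = scst_loss(
    sample_log_probs=[-0.5, -1.5],
    sampled_emb=[1.0, 0.0],
    greedy_emb=[0.0, 1.0],
    reference_emb=[1.0, 0.0],
)
print(loss)  # → 2.0
```

When the sampled caption beats the greedy one, the advantage is positive and minimizing the loss raises the sample's log-probability. When it does worse, the sign flips and the sample is pushed down.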

The Stack

PyTorch, BERT

The Sources

GitHub

image for illustration only (source)

Classifying vertebrae fractures from 3D CT Scans

our undergraduate thesis

2017

The Situation

We came in as fresh seniors to the computer vision and machine intelligence lab, and our professor was into medical images. That led us to lumbar vertebrae, specifically fractured ones, which we wanted to detect and classify.

The Solution

We had to do a lot of 3D image processing to augment our dataset (it was cool tinkering around with 3D data!), while my partner took care of fine-tuning a VGG model. Everything turned out OK.

This project is also the reason I went back to do a master's a year later.
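As a flavor of the 3D augmentation mentioned above, here is a minimal sketch. It uses plain NumPy rather than the OpenCV/TensorFlow code from the thesis, and it assumes volumes are stored as (depth, height, width) arrays. `augment_volume` is a hypothetical helper, and whether flips and rotations are anatomically sensible depends on the dataset.

```python
import numpy as np


def augment_volume(volume):
    """Yield rotated and mirrored copies of a (depth, height, width) volume."""
    for k in range(4):
        # Rotate every axial slice by k * 90 degrees (k = 0 is the original).
        rotated = np.rot90(volume, k=k, axes=(1, 2))
        yield rotated
        # Mirror the rotated volume along the width axis.
        yield np.flip(rotated, axis=2)


# Toy volume: 2 slices of 3x3 voxels with distinct values.
volume = np.arange(2 * 3 * 3).reshape(2, 3, 3)
augmented = list(augment_volume(volume))
print(len(augmented), augmented[0].shape)
```

Each input volume yields eight variants (four rotations, each with a mirrored copy).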

The Stack

TensorFlow, OpenCV

The Sources

Presentation

Paper
