CS 234: Computational Methods for the Analysis of Biomolecular Data

Project: Classifying High Occupancy Target (HOT) regions and Low Occupancy Target (LOT) regions from DNA sequences

Regulatory binding factors are not randomly distributed across the human genome. Of these, 50% are found clustered in distinct regions of the genome. These regions are called High Occupancy Target (HOT) regions; the remaining regions of the genome are correspondingly termed Low Occupancy Target (LOT) regions. Approximately 90% of the constituents of human HOT regions show strong enrichment of promoters, while 10-20% show context-specific enrichment.

With advances in machine learning and with known HOT positions in the genome, we can train a model to classify these regions. My goal is to find a model that is also interpretable.

This webpage is dedicated to the project work for CS 234!

Project Updates

Oct 13

  • Created the homepage for the project.
  • Downloaded the dataset from DeepLoc.
  • Preprocessed protein sequences into 2D vectors.
  • Exploratory Data Analysis: Notebook-1
Oct 28
  • Changed the project to HOT classification.
  • Preprocessed sequences using k-mers. Used pretrained k-mer vectors for classification.
  • Notebook: Notebook Link
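
The k-mer step in the Oct 28 update can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names, the choice of k, the stride of 1, and the idea of averaging the pretrained vectors into one feature vector per sequence are all assumptions, and `vectors` is a hypothetical dictionary standing in for whatever pretrained k-mer embedding was used.

```python
def kmers(sequence, k=6):
    """Return all overlapping k-mers of a DNA sequence (stride 1)."""
    sequence = sequence.upper()
    # An empty list is returned when the sequence is shorter than k.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def mean_kmer_vector(sequence, vectors, k=6):
    """Average the vectors of the k-mers found in `vectors`.

    `vectors` is a hypothetical dict mapping k-mer -> list of floats.
    K-mers missing from the dict are skipped; None is returned if no
    k-mer of the sequence has a vector.
    """
    found = [vectors[t] for t in kmers(sequence, k) if t in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]


# Toy example with made-up 2-dimensional vectors for 3-mers.
toy_vectors = {"ACG": [1.0, 0.0], "CGT": [0.0, 1.0]}
print(kmers("ACGTAC", k=3))
print(mean_kmer_vector("ACGT", toy_vectors, k=3))
```

The resulting fixed-size vector could then be passed to any standard classifier. Averaging discards k-mer order, which is one reason a sequence model may be worth trying afterwards.
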
Nov 11
  • Preprocessed sequences to a fixed length of 1000. Used a deep learning model for classification and evaluated performance on an unseen test set.
  • Notebook: Notebook Link
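
A rough sketch of the fixed-length preprocessing from the Nov 11 update is shown below. The update does not say how sequences were brought to length 1000 or how they were encoded, so truncating on the right, padding with "N", and one-hot encoding over A, C, G, T are assumptions made for illustration, as are the function names.

```python
ALPHABET = "ACGT"


def fix_length(sequence, length=1000, pad="N"):
    """Truncate or right-pad a sequence so it has exactly `length` bases."""
    sequence = sequence.upper()[:length]
    return sequence + pad * (length - len(sequence))


def one_hot(sequence):
    """Encode a DNA sequence as a list of 4-element rows (A, C, G, T).

    Bases outside the alphabet, such as the padding symbol N,
    are encoded as an all-zero row.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in sequence:
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


# Every sequence becomes a 1000 x 4 matrix, whatever its original length.
encoded = one_hot(fix_length("ACGTTGCA"))
print(len(encoded), len(encoded[0]))
```

A matrix of this shape is a common input format for convolutional or recurrent models, though the architecture used in the notebook may expect something different.
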