Entry for the Stanford RNA 3D Folding Kaggle competition — predicting 3D atomic coordinates for RNA molecules from primary sequence, evaluated by structural similarity against experimentally determined targets.
What’s in the repo
Five iterative pipelines (pipeline1 → pipeline5) progressing from baseline featurization to more sophisticated architectures, plus:
- TM-align integration (TMalign.cpp, compiled locally) for structural scoring against ground-truth coordinates: the closer the predicted tertiary structure, the higher the score.
- Evaluation harness (evaluation.ipynb) running per-molecule structural comparisons and summarizing by length bin, family, and difficulty.
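The length-bin summary could look roughly like the sketch below. It is not taken from evaluation.ipynb: the bin edges, labels, function name, and scores are all hypothetical, chosen only to show the grouping step.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import mean

# Hypothetical bin edges in nucleotides; the notebook's actual bins are not stated.
LENGTH_EDGES = [50, 100, 200]
LABELS = ["<50", "50-99", "100-199", ">=200"]


def summarize_by_length(results):
    """Average per-molecule scores within each sequence-length bin.

    results: iterable of (sequence_length, score) pairs.
    Returns {bin_label: mean_score}, in bin order, for non-empty bins only.
    """
    buckets = defaultdict(list)
    for length, score in results:
        # bisect_right maps a length to the index of the bin it falls in.
        label = LABELS[bisect_right(LENGTH_EDGES, length)]
        buckets[label].append(score)
    return {label: mean(buckets[label]) for label in LABELS if label in buckets}


# Made-up (length, score) pairs for illustration only.
example = [(30, 0.6), (45, 0.4), (120, 0.3), (250, 0.1)]
summary = summarize_by_length(example)
print(summary)
```

The same grouping would apply to family or difficulty by swapping the length lookup for a categorical key.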
Approach
RNA folding sits at an awkward intersection: much less data than protein folding (no RNA-level equivalent of AlphaFold’s training set), but the same underlying physics. The pipelines explore featurization choices — base-pairing priors, sequence embeddings, pairwise distance maps — and evaluate which features carry the most signal once you score by TM-align rather than per-residue error.
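To see why the choice of metric matters, here is a minimal sketch comparing RMSD with a TM-score-style average. It assumes the prediction is already superposed on the target with residues paired one-to-one. Real TM-align also searches for the best alignment and superposition, which this omits. The distance scale d0 = 3.0 and the toy coordinates are illustrative assumptions, not competition values.

```python
import math


def rmsd(pred, ref):
    """Root-mean-square deviation over paired 3D points."""
    squared = [math.dist(p, r) ** 2 for p, r in zip(pred, ref)]
    return math.sqrt(sum(squared) / len(squared))


def tm_style_score(pred, ref, d0):
    """Average of 1 / (1 + (d_i / d0)^2) over paired points.

    This is only the TM-score summand on pre-superposed coordinates;
    d0 is supplied by the caller (assumption: a fixed illustrative value).
    """
    terms = [1.0 / (1.0 + (math.dist(p, r) / d0) ** 2) for p, r in zip(pred, ref)]
    return sum(terms) / len(ref)


# Toy target: ten residues spaced along the x axis.
ref = [(float(i), 0.0, 0.0) for i in range(10)]

# Prediction A: every residue is off by 3 units.
pred_uniform = [(x, 3.0, 0.0) for x, _, _ in ref]

# Prediction B: nine residues exact, one far off, tuned to give the same RMSD.
pred_outlier = list(ref)
pred_outlier[-1] = (9.0, math.sqrt(90.0), 0.0)

D0 = 3.0  # hypothetical distance scale
for name, pred in [("uniform", pred_uniform), ("outlier", pred_outlier)]:
    print(name, round(rmsd(pred, ref), 3), round(tm_style_score(pred, ref, D0), 3))
```

Both predictions have the same RMSD, but the TM-style score is much higher for the one that gets most residues right and misses a single one badly. Each residue's contribution is bounded, so one large error cannot dominate the way it does in a squared-error average.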
Primary artifacts (trained checkpoints, 290 GB of competition sequence data) are kept locally; the repo carries the pipeline code and evaluation stack.