This repository contains the submission for the M2 Major Module Coursework of the MPhil in Data Intensive Science programme at the University of Cambridge. It includes all relevant code and the final report.
All code is implemented in Python and organised for clarity and reproducibility. A LaTeX-formatted report is also included.
The environment uses Python 3.12, and all required packages are compatible with it. Please ensure your Conda installation is up to date to avoid version-resolution issues.
├── LICENSE
├── README.md
├── requirements.txt
├── environment.yml
├── pyproject.toml
├── 2a_preprocessing.ipynb
├── 2b_flops_test.ipynb
├── 3a_grid_search.ipynb
├── 3b_context_length.ipynb
├── 3c_final_model.ipynb
├── model_comparison.ipynb
├── ctx_len_experiment/
│ ├── best_lm_head_bias_lr0.0001_rank8_ctx{128,512,768}.pt
│ ├── best_lora_state_lr0.0001_rank8_ctx{128,512,768}.pt
│ ├── best_result_lr0.0001_rank8_ctx{128,512,768}.json
│ └── val_inference_metrics_lr0.0001_rank8_ctx{128,512,768}.json
├── data/
│ └── lotka_volterra_data.h5
├── docs/
│ ├── conf.py
│ ├── index.rst
│ ├── make.bat
│ ├── Makefile
│ ├── source/
│ └── _build/
│ └── html/
│ ├── index.html
│ └── ...
├── figures/
│ ├── grad_norm_{ctx_len,final,grid_search}.png
│ ├── lora.png
│ ├── lora_a_mean_norm_{ctx_len,final,grid_search}.png
│ ├── lora_b_mean_norm_{ctx_len,final,grid_search}.png
│ ├── model_comparison.png
│ ├── Qwen2_{Decoder_Layer,Model}.png
│ ├── train_loss_{ctx_len,final,grid_search}.png
│ └── val_loss_{ctx_len,final,grid_search}.png
├── final_model/
│ ├── best_lm_head_bias_lr0.0001_rank8_ctx768.pt
│ ├── best_lora_state_lr0.0001_rank8_ctx768.pt
│ ├── best_result_lr0.0001_rank8_ctx768.json
│ └── val_inference_metrics_lr0.0001_rank8_ctx768.json
├── lr_rank_experiment/
│ ├── best_lm_head_bias_lr{1e-05,5e-05,0.0001}_rank{2,4,8}.pt
│ ├── best_lora_state_lr{1e-05,5e-05,0.0001}_rank{2,4,8}.pt
│ ├── best_result_lr{1e-05,5e-05,0.0001}_rank{2,4,8}.json
│ └── val_inference_metrics_lr{1e-05,5e-05,0.0001}_rank{2,4,8}.json
├── report/
│ └── main.pdf
├── src/
│ ├── flops_counter.py
│ ├── lora_utils.py
│ ├── preprocessor.py
│ ├── qwen.py
│ └── __init__.py
├── tables/
│ ├── validation_mae_{ctx_length,grid_search}.csv
│ └── validation_mse_{ctx_length,grid_search}.csv
After setting up the environment using the provided environment.yml or requirements.txt, you can run the .ipynb notebooks directly. However, please note that you will need to configure your own Weights & Biases (wandb) account - this includes running wandb login with your API key before using the logging features.
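For example, you can authenticate from Python before launching the notebooks (the key below is a placeholder for your own):

```python
# Minimal sketch: authenticate wandb before running the notebooks.
# Replace the placeholder with your own API key from https://wandb.ai/authorize.
import wandb

wandb.login(key="YOUR_WANDB_API_KEY")  # equivalent to running `wandb login` in a terminal
```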
The figures/ folder contains all plots generated during the experiments. The folders lr_rank_experiment/, ctx_len_experiment/, and final_model/ store the best model outputs for each run, including the LoRA state, the LM-head bias, and the corresponding validation inference metrics. The .pt files can be loaded into the base model using load_lora_weights and lm_head.bias.data.copy_.
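A minimal loading sketch follows; the base-model name and the exact signature of load_lora_weights are assumptions, so check src/lora_utils.py and src/qwen.py for the actual interfaces:

```python
# Sketch only: base-model name and load_lora_weights signature are assumptions.
import torch
from transformers import AutoModelForCausalLM
from src.lora_utils import load_lora_weights  # repo utility; verify its actual signature

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # assumed base model

# Inject the trained LoRA matrices into the base model.
lora_state = torch.load("final_model/best_lora_state_lr0.0001_rank8_ctx768.pt", map_location="cpu")
load_lora_weights(model, lora_state)

# Restore the trained LM-head bias (assumes a bias was added to lm_head during training,
# as in the repo's training code).
bias = torch.load("final_model/best_lm_head_bias_lr0.0001_rank8_ctx768.pt", map_location="cpu")
model.lm_head.bias.data.copy_(bias)
```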
The src/ directory includes utility code used across different stages of the project - such as data preprocessing, FLOPs counting, and model wrappers - serving as a central toolkit.
tables/ contains the MSE and MAE values obtained from the evaluate_inference() function under different hyperparameter configurations.
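These CSVs can be inspected directly with pandas; the exact column layout depends on how evaluate_inference() writes them:

```python
# Load the grid-search validation metrics; column names depend on evaluate_inference().
import pandas as pd

mse = pd.read_csv("tables/validation_mse_grid_search.csv")
mae = pd.read_csv("tables/validation_mae_grid_search.csv")
print(mse.head())
```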
To download this repository, run:
git clone https://gitlab.developers.cam.ac.uk/phy/data-intensive-science-mphil/assessments/m2_coursework/yz929.git

You can install the dependencies using either Conda or pip.
Using Conda
conda env create -f environment.yml
conda activate m2_env

To deactivate:
conda deactivate

Using pip
pip install -r requirements.txt

Note: PyTorch is installed via pip because official conda packages are no longer supported.
- If you have an NVIDIA GPU, pip will try to install the appropriate CUDA version automatically.
- If you want to ensure compatibility or manually choose a version, visit: https://pytorch.org/get-started/locally/
To install with a specific CUDA version (e.g., 11.8), run:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

If you are using CPU only:
pip install torch torchvision torchaudio

After setting up the environment and your wandb API key, you can run the notebooks, updating the directory paths and wandb parameters to match your own setup.
All required data files are included in the data/ directory - no external download is necessary.
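To verify the data file, you can list its contents with h5py; no dataset names are assumed here, they are discovered at runtime:

```python
# Inspect the HDF5 file; prints every group/dataset stored inside it.
import h5py

with h5py.File("data/lotka_volterra_data.h5", "r") as f:
    f.visit(print)
```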
Sphinx-based automatic documentation for the project's toolkit is located at:
docs/_build/html/index.html

You can open this file in a browser to explore the full API reference and module descriptions.
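If the prebuilt HTML is missing, it should be possible to regenerate it with the provided build files by running make html from the docs/ directory (or make.bat html on Windows).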
The report is written in LaTeX and located inside the report/ directory. The compiled version can be found as main.pdf.
ChatGPT was used to assist in proofreading and formatting the LaTeX report. This included improvements to grammar, clarity, and LaTeX formatting (equations, tables).
All suggestions were critically reviewed and selectively integrated to maintain academic integrity.
This repository is licensed under the MIT License. For more details, see the LICENSE file.