This repository contains the submission for the M2 Major Module Coursework of the MPhil in Data Intensive Science programme at the University of Cambridge. It includes all relevant code and the final report.
All code is implemented in Python and organised for clarity and reproducibility. A LaTeX-formatted report is also included.
The environment uses Python 3.12, and all required packages are compatible with it. Please ensure your Conda installation is up to date to avoid version-resolution issues.
├── LICENSE
├── README.md
├── requirements.txt
├── environment.yml
├── pyproject.toml
├── 2a_preprocessing.ipynb
├── 2b_flops_test.ipynb
├── 3a_grid_search.ipynb
├── 3b_context_length.ipynb
├── 3c_final_model.ipynb
├── model_comparison.ipynb
├── ctx_len_experiment/
│ ├── best_lm_head_bias_lr0.0001_rank8_ctx{128,512,768}.pt
│ ├── best_lora_state_lr0.0001_rank8_ctx{128,512,768}.pt
│ ├── best_result_lr0.0001_rank8_ctx{128,512,768}.json
│ └── val_inference_metrics_lr0.0001_rank8_ctx{128,512,768}.json
├── data/
│ └── lotka_volterra_data.h5
├── docs/
│ ├── conf.py
│ ├── index.rst
│ ├── make.bat
│ ├── Makefile
│ ├── source/
│ └── _build/
│ └── html/
│ ├── index.html
│ └── ...
├── figures/
│ ├── grad_norm_{ctx_len,final,grid_search}.png
│ ├── lora.png
│ ├── lora_a_mean_norm_{ctx_len,final,grid_search}.png
│ ├── lora_b_mean_norm_{ctx_len,final,grid_search}.png
│ ├── model_comparison.png
│ ├── Qwen2_{Decoder_Layer,Model}.png
│ ├── train_loss_{ctx_len,final,grid_search}.png
│ └── val_loss_{ctx_len,final,grid_search}.png
├── final_model/
│ ├── best_lm_head_bias_lr0.0001_rank8_ctx768.pt
│ ├── best_lora_state_lr0.0001_rank8_ctx768.pt
│ ├── best_result_lr0.0001_rank8_ctx768.json
│ └── val_inference_metrics_lr0.0001_rank8_ctx768.json
├── lr_rank_experiment/
│ ├── best_lm_head_bias_lr{1e-05,5e-05,0.0001}_rank{2,4,8}.pt
│ ├── best_lora_state_lr{1e-05,5e-05,0.0001}_rank{2,4,8}.pt
│ ├── best_result_lr{1e-05,5e-05,0.0001}_rank{2,4,8}.json
│ └── val_inference_metrics_lr{1e-05,5e-05,0.0001}_rank{2,4,8}.json
├── report/
│ └── main.pdf
├── src/
│ ├── flops_counter.py
│ ├── lora_utils.py
│ ├── preprocessor.py
│ ├── qwen.py
│ └── __init__.py
├── tables/
│ ├── validation_mae_{ctx_length,grid_search}.csv
│ └── validation_mse_{ctx_length,grid_search}.csv
After setting up the environment using the provided environment.yml or requirements.txt, you can run the .ipynb notebooks directly. However, please note that you will need to configure your own Weights & Biases (wandb) account - this includes running wandb login with your API key before using the logging features.
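For example, you can authenticate from Python before launching the notebooks (the key below is a placeholder for your own):

```python
# Minimal sketch: authenticate wandb before running the notebooks.
# Replace the placeholder with your own API key from https://wandb.ai/authorize.
import wandb

wandb.login(key="YOUR_WANDB_API_KEY")  # equivalent to running `wandb login` in a terminal
```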
The figures/ folder contains all plots generated during the experiments. The folders lr_rank_experiment/, ctx_len_experiment/, and final_model/ store the best model outputs for each run, including the LoRA state, the LM-head bias, and the corresponding validation inference metrics. The .pt files can be loaded into the base model using load_lora_weights and lm_head.bias.data.copy_.
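A minimal loading sketch follows; the base-model name and the exact signature of load_lora_weights are assumptions, so check src/lora_utils.py and src/qwen.py for the actual interfaces:

```python
# Sketch only: base-model name and load_lora_weights signature are assumptions.
import torch
from transformers import AutoModelForCausalLM
from src.lora_utils import load_lora_weights  # repo utility; verify its actual signature

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # assumed base model

# Inject the trained LoRA matrices into the base model.
lora_state = torch.load("final_model/best_lora_state_lr0.0001_rank8_ctx768.pt", map_location="cpu")
load_lora_weights(model, lora_state)

# Restore the trained LM-head bias (assumes a bias was added to lm_head during training,
# as in the repo's training code).
bias = torch.load("final_model/best_lm_head_bias_lr0.0001_rank8_ctx768.pt", map_location="cpu")
model.lm_head.bias.data.copy_(bias)
```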
The src/ directory includes utility code used across different stages of the project - such as data preprocessing, FLOPs counting, and model wrappers - serving as a central toolkit.
tables/ contains the MSE and MAE values obtained from the evaluate_inference() function under different hyperparameter configurations.
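These CSVs can be inspected directly with pandas; the exact column layout depends on how evaluate_inference() writes them:

```python
# Load the grid-search validation metrics; column names depend on evaluate_inference().
import pandas as pd

mse = pd.read_csv("tables/validation_mse_grid_search.csv")
mae = pd.read_csv("tables/validation_mae_grid_search.csv")
print(mse.head())
```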
To download this repository, run:
git clone https://gitlab.developers.cam.ac.uk/phy/data-intensive-science-mphil/assessments/m2_coursework/yz929.git

You can install the dependencies using either Conda or pip.
Using Conda
conda env create -f environment.yml
conda activate m2_env

To deactivate:
conda deactivate

Using pip
pip install -r requirements.txt

Note: PyTorch is installed via pip because official conda packages are no longer supported.
- If you have an NVIDIA GPU, pip will try to install the appropriate CUDA version automatically.
- If you want to ensure compatibility or manually choose a version, visit: https://pytorch.org/get-started/locally/
To install with a specific CUDA version (e.g., 11.8), run:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

If you are using CPU only:
pip install torch torchvision torchaudio

After setting up the environment and your wandb API key, you can run the notebooks, updating the directory paths and wandb parameters to match your own setup.
All required data files are included in the data/ directory - no external download is necessary.
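To verify the data file, you can list its contents with h5py; no dataset names are assumed here, they are discovered at runtime:

```python
# Inspect the HDF5 file; prints every group/dataset stored inside it.
import h5py

with h5py.File("data/lotka_volterra_data.h5", "r") as f:
    f.visit(print)
```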
Sphinx-based automatic documentation for the project's toolkit is located at:
docs/_build/html/index.html

You can open this file in a browser to explore the full API reference and module descriptions.
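If the prebuilt HTML is missing, it should be possible to regenerate it with the provided build files by running make html from the docs/ directory (or make.bat html on Windows).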
The report is written in LaTeX and located inside the report/ directory. The compiled version can be found as main.pdf.
ChatGPT was used to assist in proofreading and formatting the LaTeX report. This included improvements to grammar, clarity, and LaTeX formatting (equations, tables).
All suggestions were critically reviewed and selectively integrated to maintain academic integrity.
This repository is licensed under the MIT License. For more details, see the LICENSE file.