Description
Describe the bug
cubi-tk snappy pull-raw-data does not seem to find FASTQ files whose names differ from the name of the collection that contains them. This is unfortunate because DKFZ Heidelberg, an important data provider, has its own file naming convention.
To Reproduce
# SODAR project
project=7097920f-d1ce-4014-a4f0-97f2c2ef9b81
assay=9b897bea-e233-4610-9f9c-5536ec850c3f
# Create snappy environment
mkdir -p dir/.snappy_pipeline RAW_DATA
cat > dir/.snappy_pipeline/config.yaml << __EOF
data_sets:
  exomes:
    sodar_uuid: 7097920f-d1ce-4014-a4f0-97f2c2ef9b81
    file: SampleSheet.tsv
    search_patterns:
    - {"left": "*.read1.fastq.gz", "right": "*.read2.fastq.gz"}
    search_paths:
    - /data/hdd/eblanc/tmp/tmp/2025-09-22_cubi_tk_cancer/RAW_DATA
    type: matched_cancer
    naming_scheme: only_secondary_id
__EOF
cat > dir/.snappy_pipeline/SampleSheet.tsv << __EOF
[Metadata]
schema cancer_matched
schema_version v1
title Becnel public dataset
description Multiple tumor/normal pairs with WES, WGS & RNA-seq data
[Data]
patientName sampleName libraryType folderName isTumor
case001 N1 WES case001-N1-DNA1-WES1 N
case001 T1 WES case001-T1-DNA1-WES1 Y
case002 N1 WES case002-N1-DNA1-WES1 N
case002 T1 WES case002-T1-DNA1-WES1 Y
__EOF
# Offending command
cubi-tk snappy pull-raw-data \
--tsv-shortcut cancer \
--assay-uuid $assay \
--output-directory RAW_DATA --base-path dir \
--samples case001 \
$project
Command output
I - 22.09.2025 18:00:32 - Will start at dir
I - 22.09.2025 18:00:32 - Loading configuration file and look for dataset
I - 22.09.2025 18:00:32 - => will download to RAW_DATA
I - 22.09.2025 18:00:32 - Will start at dir
W - 22.09.2025 18:00:32 - No file was found using the selected criteria.
Available files (limited to first 50):
TCRBOA1-N-WEX.read1.fastq.gz
TCRBOA1-N-WEX.read2.fastq.gz
TCRBOA1-T-WEX.read1.fastq.gz
TCRBOA1-T-WEX.read2.fastq.gz
TCRBOA2-N-WEX.read1.fastq.gz
TCRBOA2-N-WEX.read2.fastq.gz
TCRBOA2-T-WEX.read1.fastq.gz
TCRBOA2-T-WEX.read2.fastq.gz
TCRBOA3-N-WEX.read1.fastq.gz
TCRBOA3-N-WEX.read2.fastq.gz
TCRBOA3-T-WEX.read1.fastq.gz
TCRBOA3-T-WEX.read2.fastq.gz
...
S - 22.09.2025 18:00:32 - All done. Have a nice day!
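As a sanity check, the sketch below tests the configured search_patterns against the file names listed in the output above. It assumes the patterns are plain shell-style globs; how cubi-tk actually evaluates them is not shown in this report, and `classify` is a hypothetical helper, not part of cubi-tk.

```python
from fnmatch import fnmatchcase

# Patterns copied from the config.yaml in the reproduction steps.
SEARCH_PATTERNS = [{"left": "*.read1.fastq.gz", "right": "*.read2.fastq.gz"}]

# A subset of the "Available files" reported by the command.
AVAILABLE = [
    "TCRBOA1-N-WEX.read1.fastq.gz",
    "TCRBOA1-N-WEX.read2.fastq.gz",
    "TCRBOA1-T-WEX.read1.fastq.gz",
    "TCRBOA1-T-WEX.read2.fastq.gz",
]


def classify(name, patterns=SEARCH_PATTERNS):
    """Return 'left' or 'right' if the name matches a pattern, else None.

    Hypothetical helper: assumes glob semantics for the patterns.
    """
    for pattern in patterns:
        for side in ("left", "right"):
            if fnmatchcase(name, pattern[side]):
                return side
    return None


if __name__ == "__main__":
    for name in AVAILABLE:
        print(name, "->", classify(name))
```

Under that assumption every listed file matches one side of the pattern, which suggests the files are excluded by the sample or collection name rather than by the search patterns.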
Expected behavior
TCRBOA1-N-WEX.read1.fastq.gz and TCRBOA1-N-WEX.read2.fastq.gz should be downloaded to RAW_DATA/case001-N1-DNA1-WES1, and
TCRBOA1-T-WEX.read1.fastq.gz and TCRBOA1-T-WEX.read2.fastq.gz to RAW_DATA/case001-T1-DNA1-WES1.
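The expected placement can be written down as a small lookup. This is only a sketch of the desired outcome: the `PREFIX_TO_FOLDER` table and `plan_downloads` are hypothetical names, and nothing here reflects how cubi-tk resolves files internally. Only the two case001 mappings stated above are included.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from the provider's file-name prefix to the snappy
# folder name, taken from the expected behavior described in this report.
PREFIX_TO_FOLDER = {
    "TCRBOA1-N-WEX": "case001-N1-DNA1-WES1",
    "TCRBOA1-T-WEX": "case001-T1-DNA1-WES1",
}


def plan_downloads(remote_files, output_dir="RAW_DATA"):
    """Map each known remote file name to its expected local path.

    Files whose prefix is not in PREFIX_TO_FOLDER are skipped.
    """
    plan = {}
    for name in remote_files:
        # The prefix is everything before the first dot,
        # e.g. "TCRBOA1-N-WEX" for "TCRBOA1-N-WEX.read1.fastq.gz".
        prefix = name.split(".", 1)[0]
        folder = PREFIX_TO_FOLDER.get(prefix)
        if folder is None:
            continue
        plan[name] = str(PurePosixPath(output_dir) / folder / name)
    return plan


if __name__ == "__main__":
    files = ["TCRBOA1-N-WEX.read1.fastq.gz", "TCRBOA1-T-WEX.read2.fastq.gz"]
    for remote, local in plan_downloads(files).items():
        print(remote, "->", local)
```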
Additional context
I also tried using case001-N1 & case001-N1-DNA1-WES1 as values of the --sample-list argument, but without success. The error message is identical.
This feature is important for the cancer branch, because DKFZ Heidelberg (an important provider of cancer-related sequencing data) has its own file naming convention.
In particular, it uses the extension .md5sum to store checksums of raw data files. It would be very useful to have an optional flag that uploads *.md5sum files as *.md5 in the landing zone.
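Until such a flag exists, the renaming could be done locally before upload. The sketch below is a possible workaround, not cubi-tk functionality: `md5_target` and `stage_checksums` are hypothetical helpers, and it assumes that the content of a .md5sum file is acceptable as-is once the file is renamed to .md5.

```python
import shutil
from pathlib import Path


def md5_target(path: Path) -> Path:
    """Return the *.md5 path corresponding to a *.md5sum file."""
    if path.suffix != ".md5sum":
        raise ValueError(f"not a .md5sum file: {path}")
    # Only the last suffix is replaced: x.fastq.gz.md5sum -> x.fastq.gz.md5
    return path.with_suffix(".md5")


def stage_checksums(directory: Path) -> list[Path]:
    """Copy every *.md5sum in `directory` to a sibling *.md5 file.

    Existing *.md5 files are left untouched. Returns the files created.
    """
    created = []
    for source in sorted(directory.glob("*.md5sum")):
        target = md5_target(source)
        if not target.exists():
            shutil.copyfile(source, target)
            created.append(target)
    return created


if __name__ == "__main__":
    for path in stage_checksums(Path("RAW_DATA")):
        print("created", path)
```

Copying rather than renaming keeps the provider's original files intact, so the step can be repeated safely.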