snappy pull-raw-data subcommand does not find all fastq files #306

@ericblanc20

Describe the bug
cubi-tk snappy pull-raw-data does not seem to find fastq files whose names differ from the name of the collection. This is a problem because some data providers use their own file naming convention; DKFZ Heidelberg, an important provider, is one example.

To Reproduce

# SODAR project
project=7097920f-d1ce-4014-a4f0-97f2c2ef9b81
assay=9b897bea-e233-4610-9f9c-5536ec850c3f

# Create snappy environment
mkdir -p dir/.snappy_pipeline RAW_DATA
cat > dir/.snappy_pipeline/config.yaml << __EOF
data_sets:
  exomes:
    sodar_uuid: 7097920f-d1ce-4014-a4f0-97f2c2ef9b81
    file: SampleSheet.tsv
    search_patterns:
    - {"left": "*.read1.fastq.gz", "right": "*.read2.fastq.gz"}
    search_paths:
    - /data/hdd/eblanc/tmp/tmp/2025-09-22_cubi_tk_cancer/RAW_DATA
    type: matched_cancer
    naming_scheme: only_secondary_id
__EOF
cat > dir/.snappy_pipeline/SampleSheet.tsv << __EOF
[Metadata]
schema  cancer_matched
schema_version  v1
title   Becnel public dataset
description     Multiple tumor/normal pairs with WES, WGS & RNA-seq data

[Data]
patientName     sampleName      libraryType     folderName      isTumor
case001 N1      WES     case001-N1-DNA1-WES1    N
case001 T1      WES     case001-T1-DNA1-WES1    Y
case002 N1      WES     case002-N1-DNA1-WES1    N
case002 T1      WES     case002-T1-DNA1-WES1    Y
__EOF

# Offending command
cubi-tk snappy pull-raw-data \
    --tsv-shortcut cancer \
    --assay-uuid $assay \
    --output-directory RAW_DATA --base-path dir \
    --samples case001 \
    $project

Command output

I - 22.09.2025 18:00:32 - Will start at dir
I - 22.09.2025 18:00:32 - Loading configuration file and look for dataset
I - 22.09.2025 18:00:32 - => will download to RAW_DATA
I - 22.09.2025 18:00:32 - Will start at dir
W - 22.09.2025 18:00:32 - No file was found using the selected criteria.
Available files (limited to first 50):
TCRBOA1-N-WEX.read1.fastq.gz
TCRBOA1-N-WEX.read2.fastq.gz
TCRBOA1-T-WEX.read1.fastq.gz
TCRBOA1-T-WEX.read2.fastq.gz
TCRBOA2-N-WEX.read1.fastq.gz
TCRBOA2-N-WEX.read2.fastq.gz
TCRBOA2-T-WEX.read1.fastq.gz
TCRBOA2-T-WEX.read2.fastq.gz
TCRBOA3-N-WEX.read1.fastq.gz
TCRBOA3-N-WEX.read2.fastq.gz
TCRBOA3-T-WEX.read1.fastq.gz
TCRBOA3-T-WEX.read2.fastq.gz
...
S - 22.09.2025 18:00:32 - All done. Have a nice day!

Expected behavior
TCRBOA1-N-WEX.read1.fastq.gz and TCRBOA1-N-WEX.read2.fastq.gz should be downloaded to the folder RAW_DATA/case001-N1-DNA1-WES1, and
TCRBOA1-T-WEX.read1.fastq.gz and TCRBOA1-T-WEX.read2.fastq.gz to RAW_DATA/case001-T1-DNA1-WES1.
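
To make the expected layout concrete, here is a minimal sketch that reproduces it by hand on placeholder files. It is not how cubi-tk works internally: the temporary directories `src` and `dest` stand in for the SODAR collection and RAW_DATA, and the prefix-to-folder mapping is an assumption written out explicitly, since this report does not say where the tool should read it from.

```shell
set -eu

src=$(mktemp -d)    # stands in for the SODAR collection (assumption)
dest=$(mktemp -d)   # stands in for RAW_DATA (assumption)

# Empty placeholder files named like those listed in the command output.
for s in TCRBOA1-N-WEX TCRBOA1-T-WEX; do
    : > "$src/$s.read1.fastq.gz"
    : > "$src/$s.read2.fastq.gz"
done

# Assumed mapping: provider file prefix -> folderName from the sample sheet.
while read -r prefix folder; do
    mkdir -p "$dest/$folder"
    cp "$src/$prefix".read[12].fastq.gz "$dest/$folder/"
done <<EOF
TCRBOA1-N-WEX case001-N1-DNA1-WES1
TCRBOA1-T-WEX case001-T1-DNA1-WES1
EOF

# Show the resulting folder layout.
find "$dest" -type f | sort
```

Each library folder ends up holding the read1/read2 pair of the matching provider sample, which is the result the command above is expected to produce.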

Additional context
I also tried using case001-N1 and case001-N1-DNA1-WES1 as values of the --sample-list argument, but without success. The message is identical.

This feature is important for the cancer branch, because DKFZ Heidelberg (an important provider of cancer-related sequencing data) has its own file naming convention.

In particular, DKFZ uses the extension .md5sum for raw data file checksums. It would be very useful to have an optional flag that uploads *.md5sum files as *.md5 to the landing zone.
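
Until such a flag exists, a possible manual workaround is to create *.md5 copies next to the *.md5sum files before uploading. The sketch below is only an illustration on a dummy file: the `staging` directory is hypothetical, and it assumes the provider's .md5sum files already contain standard `md5sum` output, so that copying them under a new name is enough.

```shell
set -eu

# Hypothetical staging directory; replace with the directory holding the provider's files.
staging=$(mktemp -d)

# Dummy provider-style file and its checksum, for demonstration only.
printf 'dummy reads\n' > "$staging/TCRBOA1-N-WEX.read1.fastq.gz"
(cd "$staging" && md5sum TCRBOA1-N-WEX.read1.fastq.gz > TCRBOA1-N-WEX.read1.fastq.gz.md5sum)

# Copy every *.md5sum to a sibling *.md5, leaving the original untouched.
for f in "$staging"/*.md5sum; do
    cp -- "$f" "${f%.md5sum}.md5"
done

# The copied checksum file can still be verified against the data file.
(cd "$staging" && md5sum -c TCRBOA1-N-WEX.read1.fastq.gz.md5)
```

Copying rather than renaming keeps the provider's original files intact, at the cost of a few extra small files in the staging directory.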
