Skip to content

28 input file parser and classes#30

Open
marrip wants to merge 10 commits into
developfrom
28-input-file-parser-and-classes
Open

28 input file parser and classes#30
marrip wants to merge 10 commits into
developfrom
28-input-file-parser-and-classes

Conversation

@marrip

@marrip marrip commented May 27, 2026

Copy link
Copy Markdown
Contributor

I have started on implementing 3 different classes for the different types of input files and a parent class that handles the identification of dragen or localapp data. I even added a parser function but that just reads the different files as is and doesn't really do any parsing yet. I can continue with the parsing while you can have a look at the first draft of the implementation.

@marrip marrip requested review from danielvo and tinavisnovska May 27, 2026 12:17
@marrip marrip self-assigned this May 27, 2026
Comment thread src/tsoppy/general/input_classes.py Outdated
@marrip marrip force-pushed the 28-input-file-parser-and-classes branch from 33b4597 to de6026e Compare May 27, 2026 13:21
Comment thread src/tsoppy/general/input_classes.py Outdated
Comment thread src/tsoppy/general/input_classes.py Outdated
Subclasses should define `default_subpath_formats` mapping workflow names to
subpath format strings that accept `(sample, sample)` for formatting.

Attributes:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I can nitpick on the form side, I would suggest some slight attribute naming changes:

sample -> sample_id
type -> workflow_version

(these would make the attribute names more compatible with the Illumina input/output nomenclature)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with sample_id. workflow_version I find misleading. I would expect a version string like v1.0.0 and not dragen or localapp. But I agree, type might not be the best name. And, yes, please be nitpicky, I think it is good to name stuff consistently 👍

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checking your other comment, I can agree on workflow_version if we decide to use the actual version string ☺️

@danielvo danielvo May 29, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did wonder whether "workflow_version" would be too general, thinking that "Illumina_workflow_version" or even "Illumina_secondary_analysis_workflow_version" might be too long (though that would be very descriptive!). The values of the "Workflow Version" item inside the MetricsOutput.tsv files are 2.6.2.4 in case of the currently tested Dragen pipeline and ruo-2.2.0.12 in case of the Local App pipeline. "Dragen" and "Local App" are not explicitly mentioned there. One could parse some other file(s) produced by the pipelines (e.g., file named "receipt" is generated by both pipelines in the root of the output directory), though I am not sure whether that would be easier.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can keep workflow_version if we want to use the actual version string. I can have a look at the other files you mentioned. We could parse both the version string and the workflow name/type to have more fine grained control. That might be good.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at receipt but it seems that it might be even more difficult to extract information from that. I think we can parse the first line in MetricsOutput.tsv and use the first part as workflow and then in addition we use workflow_version. The combination can serve as our identifier for data source.

Comment thread src/tsoppy/general/input_classes.py Outdated
raise FileNotFoundError(
f"No workflow file found for sample {self.sample}. Searched: {self.paths}"
)
self.type = found[0]

@danielvo danielvo May 27, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the pipeline determining itself what its input is generated by, I just wonder whether doing it repeatedly, for each input file, is the way to go (some input files have the same paths in both pipelines; also, it seems unnecessary to do this determination many times) . Should we instead aim for there being a piece of code that always detects the workflow version based on one specific indicator in the root directory (e.g., the workflow version specified in the metrics output file), and then based on that the correct path for a particular input file could be picked from the default_subpath_formats dictionary? (At that point, one should double-check that the expected file is indeed there - if a sample analysis fails due to there being too few reads for the sample for example, some output files might be missing even when we have the workflow version determined correctly.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, I like the idea of checking MetricsOutput.tsv (this is the file you meant, right?). I see that Workflow Version is not including localapp or dragen, should we map it back or just work with the actual version string? For me it doesn't matter it might just be easier to read.
The next point is would we then detect the workflow version in the beginning of the pipeline or for each process or even each input file we read? I think per process makes sense?
The workflow version would then determine the path of the file depending on the default_subpath_formats or supplied changed subpath formats that are supplied by the user.
If files are missing, we can simply allow the process to fail (each sample is processed by its own process) and still allow for the pipeline to finish successfully. We can give a warning about the failed sample as well.
I am also wondering if we want a yaml file that serves as a map for the different subpath format strings so we have them all in one place and the user can supply a new file if they want to change the subpaths?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, MetricsOutput.tsv is the file I meant. I was wondering about the same, whether using the original version string would be enough. One could try to maintain some kind of mapping and monitor that there won't be conflicts between Dragen and Local App version strings, I suppose.. As long as we do support processing Local App output, I think it would be useful to make it clear in the TSOPPI_II results whether Local App or Dragen data was used on input.
We could do it per process, yes. If a process doesn't use any Local App/Dragen output or uses output for multiple runs, the determination should be done per processed run. The plan for missing files/failed samples sounds good. :-) The typical scenario is that there is one RNA sample that failed the Local App analysis - we would then want the post-processing for all other samples to go through and be given a clear error message about the one failed sample.
We did want to avoid hard-coding, so the yaml file sounds like the way to go? If the paths change in a later Dragen workflow version, we would simply add new items into the yaml path dictionaries. :-)

@tinavisnovska

Copy link
Copy Markdown
Collaborator

This reads well! I am curious about how you imagine other developers will use objects of the input classes in their code - i know that the detailed implementation is for later and you most likely discussed this already, just want to understand where we are heading in general terms.

I can imagine having yet another class (say allInputs) with objects of vcf, annotatedJson and tmbTrace being its attributes and others will in their code create objects of this allInputs class or something like that... Just would like to know how you look at this.

@marrip

marrip commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

This reads well! I am curious about how you imagine other developers will use objects of the input classes in their code - i know that the detailed implementation is for later and you most likely discussed this already, just want to understand where we are heading in general terms.

I can imagine having yet another class (say allInputs) with objects of vcf, annotatedJson and tmbTrace being its attributes and others will in their code create objects of this allInputs class or something like that... Just would like to know how you look at this.

I think we would need to have methods in those classes that allow the developers to manipulate and retrieve the data in any way they need to. We will probably have to expand the list of methods as we go. I think we have the need for some filter functions and potentially something to merge information. I guess that also touches a bit on the second part of your question. Having a combined class is one way to go. Another would be to merge or add information from one class to another. I think adding the information to the right place in the receiving class could facilitate filtering and writing to file. Keeping the three read files separate would require us to have some function that always compares position etc. which seems computationally more demanding.

@marrip marrip force-pushed the 28-input-file-parser-and-classes branch from 9e0bcfe to d9e74c3 Compare June 2, 2026 14:18
@marrip

marrip commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Hey @danielvo and @tinavisnovska

I have added a parser for the MetricsOutput.tsv which should be able to handle files with a similar structure.

https://github.com/InPreD/tsoppy/pull/30/changes#diff-380184cf0a380cb8b367e831b1344f47e025bdf9af7f900be34cc30567a73687R157

If the parser looks fine, I can start to implement what we had discussed, Daniel, and read the workflow_type and workflow_version as part of the BaseInput.

Comment thread src/tsoppy/general/input_classes.py Outdated
if any(item is None for item in df_slice.row(0)):
df_slice = _handle_row_with_nulls(df_slice)

# assume that two columns containing values are key value pairs and transpose

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the [Analysis Status] section, if there is only one sample (this can happen for example if one runs analysis on a single sample from fastq files), this assumption wouldn't hold.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, I was afraid this might be the case. We could selectively convert sections that we know are key-value pairs. Something like

  • provide a list of section names that we want to transpose
  • check if the section indeed consists of 2 columns -> fail if not
  • transpose the section

This could be either part of the parser or an additional function one can run after?

@tinavisnovska tinavisnovska Jun 3, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't number of columns number of sequenced/analysed samples? so in case of the test run the number of columns is 2 but can be anything from 1 to many (the limit depends most likely on your seq machine setup)...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have fixed it now, the parser takes a list[str] where one can specify the sections that should be key-value instead of table.

@danielvo danielvo Jun 3, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a solution that offers more options but doesn't break anything at default. :-) I haven't thought through what functionality should be taken care of by this parser and what should be handled inside the tools that simply accept data prepared by the parser.

From what I saw, there can be some general comments on top (with no section header), some comments on the bottom (with a section header..), a section actually titled "[Header]".. There can be sections with key-value pairs, sections with per-sample values (and possibly additional columns like recognized lower and higher thresholds), sections that have an arbitrary number of fixed columns (like the fusion details). But again, as there is no specification, I have no idea what else Illumina possibly does in files with this general structure, so the parser being rather general should be safe. Additional processing/parsing steps that are utilized in multiple tools should use the same functions, but these functions don't need to be triggered by the parser.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, nice, I think it is general enough now so we can parse any of those files if they follow our assumed specifications. In case we have to expand the specs, we modify it. Any additional case we find we can include in the unit tests to make sure they all work.

Comment thread src/tsoppy/general/input_classes.py Outdated
if df_slice.width == 2:
df_slice = df_slice.transpose()

# assume first row contains column names and rename

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the troublesome [Analysis Status] section, the first column of field names line is for some reason empty (consistently for both the Local App and Dragen output). :-/

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I handled it by setting the column name to "-", should be ok?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's good and it should work predictably for other similar occurrences of the issue. It's better than an empty value which the tool that processes the data section downstream would need to deal with otherwise. :-)

Comment thread src/tsoppy/general/input_classes.py Outdated
return self.data


def Parse_section_tsv(path: str) -> tuple[list[str], dict[str, polars.DataFrame]]:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's difficult to write parsers for formats that are not well defined (at least I haven't found a proper specification of the MetricsOutput.tsv format), so we unfortunately have to do with assumptions. I have previously relied on section label lines (^[section label]) to delienate the sections, but it seems that simply relying on empty lines does work as well. :-) We should perhaps create a list of assumptions that need to be checked when there is a new pipeline version (?). It might be ideal to test our assumptions automatically, but we would still need to make some assumptions in the test, I think.. with undocumented formats in play, it's hard to tell what one needs to stick to as the reliable guidelines.
The MetricsOutput.tsv files are parsed for metrics plotting purposes as well, so Omer should take advantage of this code. The <samnple_ID>_CombinedVariantOutput.tsv files have the same structure, so we should be able to advantage of this code also for TMB, MSI, GIS, etc. info extraction. :-)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this format is weird and illumina uses it in other cases as well. We can try to make it more robust. I think using the section headers would be beneficial so let me redo that part. Also the file might start with a section header, I have seen that. I will add some logic to handle both cases.
For slightly changing formats, we have unit tests. We can simply add the files we want to test against the parser and define what we expect out so we know that we can rely on the parser doing its job properly.
Yes, Omer should use this, for sure. We can also define a class for him that he can take advantage of.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to make the parser more resilient to handle slight changes in the format. Will add unit tests in the upcoming days.

Comment thread src/tsoppy/general/input_classes.py Outdated
def _handle_row_with_nulls(df: polars.DataFrame) -> polars.DataFrame:
"""Handle rows with null values removing empty columns and by filling with "-"."""

# remove any columns that are completely null

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder about this one. If run A has undefined values for its samples in one of the columns and run B does not, then the two runs would not return the same format back due to this removal. I would prefer ensuring that the returned format is always the same, with the selected placeholder value being used in the whole empty column if necessary.
Not sure whether this would be relevant, but it feels like a safer approach (?).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but wouldn't the undefined samples be NA? These columns wouldn't be removed. Also if the sample has a header (the sample name) it is also kept. Only "non-existent" columns are removed. Columns with missing values are still kept. I can rephrase the comment to make it clearer?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have rephrased now

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely null columns sound safe to remove (if there is no header and no value..). :-)

@marrip

marrip commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

Do you think it makes sense to move the parser onto its own branch/into its own PR so we can merge it in and Omer and others can start using it? I feel that the other parts of this PR might take some additional rounds of review.

@danielvo

danielvo commented Jun 3, 2026

Copy link
Copy Markdown

Do you think it makes sense to move the parser onto its own branch/into its own PR so we can merge it in and Omer and others can start using it? I feel that the other parts of this PR might take some additional rounds of review.

That could be a good idea! I don't know how quickly Omer will be able to utilize the parser, but it's quite independent of the small variant file parsing, so merging the Metrics parser in ahead of the rest could be useful with no downsides.

@tinavisnovska

Copy link
Copy Markdown
Collaborator

Do you think it makes sense to move the parser onto its own branch/into its own PR so we can merge it in and Omer and others can start using it? I feel that the other parts of this PR might take some additional rounds of review.

Sounds like a good idea to get it out on its own. However, do you want to keep the metrics file parser as a standalone function or eventually include it to a class handling the metrics input file? In the second case, i would suggest to make the pr for that class not only the parser, but not sure how you see this...

@marrip

marrip commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

Do you think it makes sense to move the parser onto its own branch/into its own PR so we can merge it in and Omer and others can start using it? I feel that the other parts of this PR might take some additional rounds of review.

Sounds like a good idea to get it out on its own. However, do you want to keep the metrics file parser as a standalone function or eventually include it to a class handling the metrics input file? In the second case, i would suggest to make the pr for that class not only the parser, but not sure how you see this...

My vision is that this is its own tool/function and classes represent input files that utilize the parser. So it would be similar to the vcf class that uses the vcf package that the class for the metrics output tsv uses the parser inside. This way, we can reuse the parser for other files that follow the structure.

@marrip marrip force-pushed the 28-input-file-parser-and-classes branch from 259f8bd to f371398 Compare June 11, 2026 13:17
@marrip marrip requested a review from danielvo June 11, 2026 13:19
@marrip

marrip commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

Now I redesigned the classes to use the MetricsOutput.tsv as the source for the workflow information and the file classes (small variants genome vcf etc.) are created from that a base class. I like this approach as the base class serves as a representation of the workflow output dir and the file classes are individual files that can be customized to the use cases.

Will do some more polishing but please feel free to start reviewing and giving feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants