28 input file parser and classes by marrip · Pull Request #30 · InPreD/tsoppy

marrip · 2026-05-27T12:17:18Z

I have started on implementing 3 different classes for the different types of input files and a parent class that handles the identification of dragen or localapp data. I even added a parser function but that just reads the different files as is and doesn't really do any parsing yet. I can continue with the parsing while you can have a look at the first draft of the implementation.

danielvo · 2026-05-27T13:51:59Z

+    Subclasses should define `default_subpath_formats` mapping workflow names to
+    subpath format strings that accept `(sample, sample)` for formatting.
+
+    Attributes:


If I can nitpick on the form side, I would suggest some slight attribute naming changes:

sample -> sample_id
type -> workflow_version

(these would make the attribute names more compatible with the Illumina input/output nomenclature)

I am ok with sample_id. workflow_version I find misleading. I would expect a version string like v1.0.0 and not dragen or localapp. But I agree, type might not be the best name. And, yes, please be nitpicky, I think it is good to name stuff consistently 👍

checking your other comment, I can agree on workflow_version if we decide to use the actual version string ☺️

I did wonder whether "workflow_version" would be too general, thinking that "Illumina_workflow_version" or even "Illumina_secondary_analysis_workflow_version" might be too long (though that would be very descriptive!). The values of the "Workflow Version" item inside the MetricsOutput.tsv files are 2.6.2.4 in case of the currently tested Dragen pipeline and ruo-2.2.0.12 in case of the Local App pipeline. "Dragen" and "Local App" are not explicitly mentioned there. One could parse some other file(s) produced by the pipelines (e.g., file named "receipt" is generated by both pipelines in the root of the output directory), though I am not sure whether that would be easier.

I think we can keep workflow_version if we want to use the actual version string. I can have a look at the other files you mentioned. We could parse both the version string and the workflow name/type to have more fine grained control. That might be good.

I looked at receipt but it seems that it might be even more difficult to extract information from that. I think we can parse the first line in MetricsOutput.tsv and use the first part as workflow and then in addition we use workflow_version. The combination can serve as our identifier for data source.

danielvo · 2026-05-27T14:33:48Z

+            raise FileNotFoundError(
+                f"No workflow file found for sample {self.sample}. Searched: {self.paths}"
+            )
+        self.type = found[0]


I like the pipeline determining itself what its input is generated by, I just wonder whether doing it repeatedly, for each input file, is the way to go (some input files have the same paths in both pipelines; also, it seems unnecessary to do this determination many times) . Should we instead aim for there being a piece of code that always detects the workflow version based on one specific indicator in the root directory (e.g., the workflow version specified in the metrics output file), and then based on that the correct path for a particular input file could be picked from the default_subpath_formats dictionary? (At that point, one should double-check that the expected file is indeed there - if a sample analysis fails due to there being too few reads for the sample for example, some output files might be missing even when we have the workflow version determined correctly.)

good point, I like the idea of checking MetricsOutput.tsv (this is the file you meant, right?). I see that Workflow Version is not including localapp or dragen, should we map it back or just work with the actual version string? For me it doesn't matter it might just be easier to read.
The next point is would we then detect the workflow version in the beginning of the pipeline or for each process or even each input file we read? I think per process makes sense?
The workflow version would then determine the path of the file depending on the default_subpath_formats or supplied changed subpath formats that are supplied by the user.
If files are missing, we can simply allow the process to fail (each sample is processed by its own process) and still allow for the pipeline to finish successfully. We can give a warning about the failed sample as well.
I am also wondering if we want a yaml file that serves as a map for the different subpath format strings so we have them all in one place and the user can supply a new file if they want to change the subpaths?

Yes, MetricsOutput.tsv is the file I meant. I was wondering about the same, whether using the original version string would be enough. One could try to maintain some kind of mapping and monitor that there won't be conflicts between Dragen and Local App version strings, I suppose.. As long as we do support processing Local App output, I think it would be useful to make it clear in the TSOPPI_II results whether Local App or Dragen data was used on input.
We could do it per process, yes. If a process doesn't use any Local App/Dragen output or uses output for multiple runs, the determination should be done per processed run. The plan for missing files/failed samples sounds good. :-) The typical scenario is that there is one RNA sample that failed the Local App analysis - we would then want the post-processing for all other samples to go through and be given a clear error message about the one failed sample.
We did want to avoid hard-coding, so the yaml file sounds like the way to go? If the paths change in a later Dragen workflow version, we would simply add new items into the yaml path dictionaries. :-)

tinavisnovska · 2026-06-01T17:33:06Z

This reads well! I am curious about how you imagine other developers will use objects of the input classes in their code - i know that the detailed implementation is for later and you most likely discussed this already, just want to understand where we are heading in general terms.

I can imagine having yet another class (say allInputs) with objects of vcf, annotatedJson and tmbTrace being its attributes and others will in their code create objects of this allInputs class or something like that... Just would like to know how you look at this.

marrip · 2026-06-02T07:55:43Z

This reads well! I am curious about how you imagine other developers will use objects of the input classes in their code - i know that the detailed implementation is for later and you most likely discussed this already, just want to understand where we are heading in general terms.

I can imagine having yet another class (say allInputs) with objects of vcf, annotatedJson and tmbTrace being its attributes and others will in their code create objects of this allInputs class or something like that... Just would like to know how you look at this.

I think we would need to have methods in those classes that allow the developers to manipulate and retrieve the data in any way they need to. We will probably have to expand the list of methods as we go. I think we have the need for some filter functions and potentially something to merge information. I guess that also touches a bit on the second part of your question. Having a combined class is one way to go. Another would be to merge or add information from one class to another. I think adding the information to the right place in the receiving class could facilitate filtering and writing to file. Keeping the three read files separate would require us to have some function that always compares position etc. which seems computationally more demanding.

marrip · 2026-06-02T14:22:07Z

Hey @danielvo and @tinavisnovska

I have added a parser for the MetricsOutput.tsv which should be able to handle files with a similar structure.

https://github.com/InPreD/tsoppy/pull/30/changes#diff-380184cf0a380cb8b367e831b1344f47e025bdf9af7f900be34cc30567a73687R157

If the parser looks fine, I can start to implement what we had discussed, Daniel, and read the workflow_type and workflow_version as part of the BaseInput.

danielvo · 2026-06-03T08:18:56Z

+        if any(item is None for item in df_slice.row(0)):
+            df_slice = _handle_row_with_nulls(df_slice)
+
+            # assume that two columns containing values are key value pairs and transpose


In the [Analysis Status] section, if there is only one sample (this can happen for example if one runs analysis on a single sample from fastq files), this assumption wouldn't hold.

good point, I was afraid this might be the case. We could selectively convert sections that we know are key-value pairs. Something like

provide a list of section names that we want to transpose

check if the section indeed consists of 2 columns -> fail if not

transpose the section

This could be either part of the parser or an additional function one can run after?

Isn't number of columns number of sequenced/analysed samples? so in case of the test run the number of columns is 2 but can be anything from 1 to many (the limit depends most likely on your seq machine setup)...

I have fixed it now, the parser takes a list[str] where one can specify the sections that should be key-value instead of table.

That sounds like a solution that offers more options but doesn't break anything at default. :-) I haven't thought through what functionality should be taken care of by this parser and what should be handled inside the tools that simply accept data prepared by the parser.

From what I saw, there can be some general comments on top (with no section header), some comments on the bottom (with a section header..), a section actually titled "[Header]".. There can be sections with key-value pairs, sections with per-sample values (and possibly additional columns like recognized lower and higher thresholds), sections that have an arbitrary number of fixed columns (like the fusion details). But again, as there is no specification, I have no idea what else Illumina possibly does in files with this general structure, so the parser being rather general should be safe. Additional processing/parsing steps that are utilized in multiple tools should use the same functions, but these functions don't need to be triggered by the parser.

ok, nice, I think it is general enough now so we can parse any of those files if they follow our assumed specifications. In case we have to expand the specs, we modify it. Any additional case we find we can include in the unit tests to make sure they all work.

danielvo · 2026-06-03T08:20:33Z

+            if df_slice.width == 2:
+                df_slice = df_slice.transpose()
+
+        # assume first row contains column names and rename


In the troublesome [Analysis Status] section, the first column of field names line is for some reason empty (consistently for both the Local App and Dragen output). :-/

I handled it by setting the column name to "-", should be ok?

I think that's good and it should work predictably for other similar occurrences of the issue. It's better than an empty value which the tool that processes the data section downstream would need to deal with otherwise. :-)

danielvo · 2026-06-03T08:37:45Z

+        return self.data
+
+
+def Parse_section_tsv(path: str) -> tuple[list[str], dict[str, polars.DataFrame]]:


It's difficult to write parsers for formats that are not well defined (at least I haven't found a proper specification of the MetricsOutput.tsv format), so we unfortunately have to do with assumptions. I have previously relied on section label lines (^[section label]) to delienate the sections, but it seems that simply relying on empty lines does work as well. :-) We should perhaps create a list of assumptions that need to be checked when there is a new pipeline version (?). It might be ideal to test our assumptions automatically, but we would still need to make some assumptions in the test, I think.. with undocumented formats in play, it's hard to tell what one needs to stick to as the reliable guidelines.
The MetricsOutput.tsv files are parsed for metrics plotting purposes as well, so Omer should take advantage of this code. The <samnple_ID>_CombinedVariantOutput.tsv files have the same structure, so we should be able to advantage of this code also for TMB, MSI, GIS, etc. info extraction. :-)

I agree, this format is weird and illumina uses it in other cases as well. We can try to make it more robust. I think using the section headers would be beneficial so let me redo that part. Also the file might start with a section header, I have seen that. I will add some logic to handle both cases.
For slightly changing formats, we have unit tests. We can simply add the files we want to test against the parser and define what we expect out so we know that we can rely on the parser doing its job properly.
Yes, Omer should use this, for sure. We can also define a class for him that he can take advantage of.

I tried to make the parser more resilient to handle slight changes in the format. Will add unit tests in the upcoming days.

danielvo · 2026-06-03T08:42:05Z

+def _handle_row_with_nulls(df: polars.DataFrame) -> polars.DataFrame:
+    """Handle rows with null values removing empty columns and by filling with "-"."""
+
+    # remove any columns that are completely null


I wonder about this one. If run A has undefined values for its samples in one of the columns and run B does not, then the two runs would not return the same format back due to this removal. I would prefer ensuring that the returned format is always the same, with the selected placeholder value being used in the whole empty column if necessary.
Not sure whether this would be relevant, but it feels like a safer approach (?).

but wouldn't the undefined samples be NA? These columns wouldn't be removed. Also if the sample has a header (the sample name) it is also kept. Only "non-existent" columns are removed. Columns with missing values are still kept. I can rephrase the comment to make it clearer?

have rephrased now

Completely null columns sound safe to remove (if there is no header and no value..). :-)

marrip · 2026-06-03T14:26:27Z

Do you think it makes sense to move the parser onto its own branch/into its own PR so we can merge it in and Omer and others can start using it? I feel that the other parts of this PR might take some additional rounds of review.

danielvo · 2026-06-03T14:41:33Z

Do you think it makes sense to move the parser onto its own branch/into its own PR so we can merge it in and Omer and others can start using it? I feel that the other parts of this PR might take some additional rounds of review.

That could be a good idea! I don't know how quickly Omer will be able to utilize the parser, but it's quite independent of the small variant file parsing, so merging the Metrics parser in ahead of the rest could be useful with no downsides.

tinavisnovska · 2026-06-03T17:29:29Z

Do you think it makes sense to move the parser onto its own branch/into its own PR so we can merge it in and Omer and others can start using it? I feel that the other parts of this PR might take some additional rounds of review.

Sounds like a good idea to get it out on its own. However, do you want to keep the metrics file parser as a standalone function or eventually include it to a class handling the metrics input file? In the second case, i would suggest to make the pr for that class not only the parser, but not sure how you see this...

marrip · 2026-06-05T07:49:00Z

Do you think it makes sense to move the parser onto its own branch/into its own PR so we can merge it in and Omer and others can start using it? I feel that the other parts of this PR might take some additional rounds of review.

Sounds like a good idea to get it out on its own. However, do you want to keep the metrics file parser as a standalone function or eventually include it to a class handling the metrics input file? In the second case, i would suggest to make the pr for that class not only the parser, but not sure how you see this...

My vision is that this is its own tool/function and classes represent input files that utilize the parser. So it would be similar to the vcf class that uses the vcf package that the class for the metrics output tsv uses the parser inside. This way, we can reuse the parser for other files that follow the structure.

… files

marrip · 2026-06-11T13:22:06Z

Now I redesigned the classes to use the MetricsOutput.tsv as the source for the workflow information and the file classes (small variants genome vcf etc.) are created from that a base class. I like this approach as the base class serves as a representation of the workflow output dir and the file classes are individual files that can be customized to the use cases.

Will do some more polishing but please feel free to start reviewing and giving feedback.

marrip requested review from danielvo and tinavisnovska May 27, 2026 12:17

marrip self-assigned this May 27, 2026

danielvo reviewed May 27, 2026

View reviewed changes

Comment thread src/tsoppy/general/input_classes.py Outdated

marrip force-pushed the 28-input-file-parser-and-classes branch from 33b4597 to de6026e Compare May 27, 2026 13:21

danielvo reviewed May 27, 2026

View reviewed changes

Comment thread src/tsoppy/general/input_classes.py Outdated

danielvo reviewed May 27, 2026

View reviewed changes

marrip force-pushed the 28-input-file-parser-and-classes branch from 9e0bcfe to d9e74c3 Compare June 2, 2026 14:18

danielvo reviewed Jun 3, 2026

View reviewed changes

marrip added 7 commits June 11, 2026 10:55

feat: add classes for vcf, tmb trace and nirvana json

dba6323

chore: sort imports

34f6149

chore: lint

3c2e3ed

chore: replace Dict with Mapping

0c801ae

feat: rename sample to sample id

1cab661

feat: workflow type and version are retrieved from MetricsOutput.tsv

89183e8

chore: add config yaml containing subpath format strings for workflow…

f371398

… files

marrip force-pushed the 28-input-file-parser-and-classes branch from 259f8bd to f371398 Compare June 11, 2026 13:17

marrip requested a review from danielvo June 11, 2026 13:19

marrip mentioned this pull request Jun 12, 2026

1 update small variant vcf list #2

Open

marrip added 3 commits June 12, 2026 10:15

docs: include more docstrings and expand existing ones

5d0d035

feat: make file parsing part of the class instance init process

404c1d3

feat: reanem TmbTrace to TmbTraceTsv

cf87d15

		return self.data


		def Parse_section_tsv(path: str) -> tuple[list[str], dict[str, polars.DataFrame]]:

Conversation

marrip commented May 27, 2026

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

danielvo May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

danielvo May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tinavisnovska commented Jun 1, 2026

Uh oh!

marrip commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

marrip commented Jun 2, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tinavisnovska Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

danielvo Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

marrip commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danielvo commented Jun 3, 2026

Uh oh!

tinavisnovska commented Jun 3, 2026

Uh oh!

marrip commented Jun 5, 2026

Uh oh!

marrip commented Jun 11, 2026

danielvo May 29, 2026 •

edited

Loading

danielvo May 27, 2026 •

edited

Loading

marrip commented Jun 2, 2026 •

edited

Loading

tinavisnovska Jun 3, 2026 •

edited

Loading

danielvo Jun 3, 2026 •

edited

Loading

marrip commented Jun 3, 2026 •

edited

Loading