28 input file parser and classes#30
Conversation
33b4597 to
de6026e
Compare
| Subclasses should define `default_subpath_formats` mapping workflow names to | ||
| subpath format strings that accept `(sample, sample)` for formatting. | ||
|
|
||
| Attributes: |
There was a problem hiding this comment.
If I can nitpick on the form side, I would suggest some slight attribute naming changes:
sample -> sample_id
type -> workflow_version
(these would make the attribute names more compatible with the Illumina input/output nomenclature)
There was a problem hiding this comment.
I am ok with sample_id. workflow_version I find misleading. I would expect a version string like v1.0.0 and not dragen or localapp. But I agree, type might not be the best name. And, yes, please be nitpicky, I think it is good to name stuff consistently 👍
There was a problem hiding this comment.
checking your other comment, I can agree on workflow_version if we decide to use the actual version string
There was a problem hiding this comment.
I did wonder whether "workflow_version" would be too general, thinking that "Illumina_workflow_version" or even "Illumina_secondary_analysis_workflow_version" might be too long (though that would be very descriptive!). The values of the "Workflow Version" item inside the MetricsOutput.tsv files are 2.6.2.4 in case of the currently tested Dragen pipeline and ruo-2.2.0.12 in case of the Local App pipeline. "Dragen" and "Local App" are not explicitly mentioned there. One could parse some other file(s) produced by the pipelines (e.g., file named "receipt" is generated by both pipelines in the root of the output directory), though I am not sure whether that would be easier.
There was a problem hiding this comment.
I think we can keep workflow_version if we want to use the actual version string. I can have a look at the other files you mentioned. We could parse both the version string and the workflow name/type to have more fine grained control. That might be good.
There was a problem hiding this comment.
I looked at receipt but it seems that it might be even more difficult to extract information from that. I think we can parse the first line in MetricsOutput.tsv and use the first part as workflow and then in addition we use workflow_version. The combination can serve as our identifier for data source.
| raise FileNotFoundError( | ||
| f"No workflow file found for sample {self.sample}. Searched: {self.paths}" | ||
| ) | ||
| self.type = found[0] |
There was a problem hiding this comment.
I like the pipeline determining itself what its input is generated by, I just wonder whether doing it repeatedly, for each input file, is the way to go (some input files have the same paths in both pipelines; also, it seems unnecessary to do this determination many times) . Should we instead aim for there being a piece of code that always detects the workflow version based on one specific indicator in the root directory (e.g., the workflow version specified in the metrics output file), and then based on that the correct path for a particular input file could be picked from the default_subpath_formats dictionary? (At that point, one should double-check that the expected file is indeed there - if a sample analysis fails due to there being too few reads for the sample for example, some output files might be missing even when we have the workflow version determined correctly.)
There was a problem hiding this comment.
good point, I like the idea of checking MetricsOutput.tsv (this is the file you meant, right?). I see that Workflow Version is not including localapp or dragen, should we map it back or just work with the actual version string? For me it doesn't matter it might just be easier to read.
The next point is would we then detect the workflow version in the beginning of the pipeline or for each process or even each input file we read? I think per process makes sense?
The workflow version would then determine the path of the file depending on the default_subpath_formats or supplied changed subpath formats that are supplied by the user.
If files are missing, we can simply allow the process to fail (each sample is processed by its own process) and still allow for the pipeline to finish successfully. We can give a warning about the failed sample as well.
I am also wondering if we want a yaml file that serves as a map for the different subpath format strings so we have them all in one place and the user can supply a new file if they want to change the subpaths?
There was a problem hiding this comment.
Yes, MetricsOutput.tsv is the file I meant. I was wondering about the same, whether using the original version string would be enough. One could try to maintain some kind of mapping and monitor that there won't be conflicts between Dragen and Local App version strings, I suppose.. As long as we do support processing Local App output, I think it would be useful to make it clear in the TSOPPI_II results whether Local App or Dragen data was used on input.
We could do it per process, yes. If a process doesn't use any Local App/Dragen output or uses output for multiple runs, the determination should be done per processed run. The plan for missing files/failed samples sounds good. :-) The typical scenario is that there is one RNA sample that failed the Local App analysis - we would then want the post-processing for all other samples to go through and be given a clear error message about the one failed sample.
We did want to avoid hard-coding, so the yaml file sounds like the way to go? If the paths change in a later Dragen workflow version, we would simply add new items into the yaml path dictionaries. :-)
|
This reads well! I am curious about how you imagine other developers will use objects of the input classes in their code - i know that the detailed implementation is for later and you most likely discussed this already, just want to understand where we are heading in general terms. I can imagine having yet another class (say allInputs) with objects of vcf, annotatedJson and tmbTrace being its attributes and others will in their code create objects of this allInputs class or something like that... Just would like to know how you look at this. |
I think we would need to have methods in those classes that allow the developers to manipulate and retrieve the data in any way they need to. We will probably have to expand the list of methods as we go. I think we have the need for some filter functions and potentially something to merge information. I guess that also touches a bit on the second part of your question. Having a combined class is one way to go. Another would be to merge or add information from one class to another. I think adding the information to the right place in the receiving class could facilitate filtering and writing to file. Keeping the three read files separate would require us to have some function that always compares position etc. which seems computationally more demanding. |
9e0bcfe to
d9e74c3
Compare
|
Hey @danielvo and @tinavisnovska I have added a parser for the If the parser looks fine, I can start to implement what we had discussed, Daniel, and read the |
| if any(item is None for item in df_slice.row(0)): | ||
| df_slice = _handle_row_with_nulls(df_slice) | ||
|
|
||
| # assume that two columns containing values are key value pairs and transpose |
There was a problem hiding this comment.
In the [Analysis Status] section, if there is only one sample (this can happen for example if one runs analysis on a single sample from fastq files), this assumption wouldn't hold.
There was a problem hiding this comment.
good point, I was afraid this might be the case. We could selectively convert sections that we know are key-value pairs. Something like
- provide a list of section names that we want to transpose
- check if the section indeed consists of 2 columns -> fail if not
- transpose the section
This could be either part of the parser or an additional function one can run after?
There was a problem hiding this comment.
Isn't number of columns number of sequenced/analysed samples? so in case of the test run the number of columns is 2 but can be anything from 1 to many (the limit depends most likely on your seq machine setup)...
There was a problem hiding this comment.
I have fixed it now, the parser takes a list[str] where one can specify the sections that should be key-value instead of table.
There was a problem hiding this comment.
That sounds like a solution that offers more options but doesn't break anything at default. :-) I haven't thought through what functionality should be taken care of by this parser and what should be handled inside the tools that simply accept data prepared by the parser.
From what I saw, there can be some general comments on top (with no section header), some comments on the bottom (with a section header..), a section actually titled "[Header]".. There can be sections with key-value pairs, sections with per-sample values (and possibly additional columns like recognized lower and higher thresholds), sections that have an arbitrary number of fixed columns (like the fusion details). But again, as there is no specification, I have no idea what else Illumina possibly does in files with this general structure, so the parser being rather general should be safe. Additional processing/parsing steps that are utilized in multiple tools should use the same functions, but these functions don't need to be triggered by the parser.
There was a problem hiding this comment.
ok, nice, I think it is general enough now so we can parse any of those files if they follow our assumed specifications. In case we have to expand the specs, we modify it. Any additional case we find we can include in the unit tests to make sure they all work.
| if df_slice.width == 2: | ||
| df_slice = df_slice.transpose() | ||
|
|
||
| # assume first row contains column names and rename |
There was a problem hiding this comment.
In the troublesome [Analysis Status] section, the first column of field names line is for some reason empty (consistently for both the Local App and Dragen output). :-/
There was a problem hiding this comment.
I handled it by setting the column name to "-", should be ok?
There was a problem hiding this comment.
I think that's good and it should work predictably for other similar occurrences of the issue. It's better than an empty value which the tool that processes the data section downstream would need to deal with otherwise. :-)
| return self.data | ||
|
|
||
|
|
||
| def Parse_section_tsv(path: str) -> tuple[list[str], dict[str, polars.DataFrame]]: |
There was a problem hiding this comment.
It's difficult to write parsers for formats that are not well defined (at least I haven't found a proper specification of the MetricsOutput.tsv format), so we unfortunately have to do with assumptions. I have previously relied on section label lines (^[section label]) to delienate the sections, but it seems that simply relying on empty lines does work as well. :-) We should perhaps create a list of assumptions that need to be checked when there is a new pipeline version (?). It might be ideal to test our assumptions automatically, but we would still need to make some assumptions in the test, I think.. with undocumented formats in play, it's hard to tell what one needs to stick to as the reliable guidelines.
The MetricsOutput.tsv files are parsed for metrics plotting purposes as well, so Omer should take advantage of this code. The <samnple_ID>_CombinedVariantOutput.tsv files have the same structure, so we should be able to advantage of this code also for TMB, MSI, GIS, etc. info extraction. :-)
There was a problem hiding this comment.
I agree, this format is weird and illumina uses it in other cases as well. We can try to make it more robust. I think using the section headers would be beneficial so let me redo that part. Also the file might start with a section header, I have seen that. I will add some logic to handle both cases.
For slightly changing formats, we have unit tests. We can simply add the files we want to test against the parser and define what we expect out so we know that we can rely on the parser doing its job properly.
Yes, Omer should use this, for sure. We can also define a class for him that he can take advantage of.
There was a problem hiding this comment.
I tried to make the parser more resilient to handle slight changes in the format. Will add unit tests in the upcoming days.
| def _handle_row_with_nulls(df: polars.DataFrame) -> polars.DataFrame: | ||
| """Handle rows with null values removing empty columns and by filling with "-".""" | ||
|
|
||
| # remove any columns that are completely null |
There was a problem hiding this comment.
I wonder about this one. If run A has undefined values for its samples in one of the columns and run B does not, then the two runs would not return the same format back due to this removal. I would prefer ensuring that the returned format is always the same, with the selected placeholder value being used in the whole empty column if necessary.
Not sure whether this would be relevant, but it feels like a safer approach (?).
There was a problem hiding this comment.
but wouldn't the undefined samples be NA? These columns wouldn't be removed. Also if the sample has a header (the sample name) it is also kept. Only "non-existent" columns are removed. Columns with missing values are still kept. I can rephrase the comment to make it clearer?
There was a problem hiding this comment.
Completely null columns sound safe to remove (if there is no header and no value..). :-)
|
Do you think it makes sense to move the parser onto its own branch/into its own PR so we can merge it in and Omer and others can start using it? I feel that the other parts of this PR might take some additional rounds of review. |
That could be a good idea! I don't know how quickly Omer will be able to utilize the parser, but it's quite independent of the small variant file parsing, so merging the Metrics parser in ahead of the rest could be useful with no downsides. |
Sounds like a good idea to get it out on its own. However, do you want to keep the metrics file parser as a standalone function or eventually include it to a class handling the metrics input file? In the second case, i would suggest to make the pr for that class not only the parser, but not sure how you see this... |
My vision is that this is its own tool/function and classes represent input files that utilize the parser. So it would be similar to the vcf class that uses the vcf package that the class for the metrics output tsv uses the parser inside. This way, we can reuse the parser for other files that follow the structure. |
259f8bd to
f371398
Compare
|
Now I redesigned the classes to use the Will do some more polishing but please feel free to start reviewing and giving feedback. |
I have started on implementing 3 different classes for the different types of input files and a parent class that handles the identification of dragen or localapp data. I even added a parser function but that just reads the different files as is and doesn't really do any parsing yet. I can continue with the parsing while you can have a look at the first draft of the implementation.