Add sec tsv parser by marrip · Pull Request #31 · InPreD/tsoppy

marrip · 2026-06-05T13:48:36Z

I have moved the section tsv parser to this branch and added unit tests along with test data, please check the test data and comment if there is anything else we want to check.

…ng for section heading instead of empty lines

danielvo · 2026-06-08T13:44:38Z

Very nice work, Martin! :-D I expect the current code to do the desired job with the Local App and Dragen files that we plan to utilize. I have a few topics we can discuss further:

There are some very elegant lines with lambda functions and polars functions, and it's very helpful that they are introduced with an explaining comment. :-D For those who are going to utilize the parser in their modules, how do we best provide a full picture of what the parsing does (i.e., what the full input -> output transformation is and which steps are taken)? The tests are very helpful, but I am thinking it would be useful with a concise summary that doesn't require full comprehension of the code. Should we try to create documentation pages while developing? Or we could perhaps have a more detailed, comprehensive description of the main Parse_section_tsv() function instead and refer to it? It would be good to do it before we move on and start forgetting the details.
The initial batch of tests is based on test file inputs - I found that convenient because the exact file structure is easy to observe there (maybe with the exception of white spaces..). Then there are additional test batches with hard-coded polars variables (which take some mental work to re-imagine as input files) - is there a specific reason for mixing those two?
For the extra_empty_lines.tsv case, I think it could be nice to also have empty lines in the middle of a section. I do not know of a case where it would be relevant right now, so there is no pressure to implement this.. but since we are relying on the [section header titles] to inform about section starts, I am thinking the parsing could eventually drop relying on empty lines completely: A section end could be defined by file end or a new section start, and the _handle_row_with_nulls() function that handles rows with nulls could also remove empty lines.

marrip · 2026-06-09T06:40:19Z

Very nice work, Martin! :-D I expect the current code to do the desired job with the Local App and Dragen files that we plan to utilize. I have a few topics we can discuss further:

Thank you 🙂

There are some very elegant lines with lambda functions and polars functions, and it's very helpful that they are introduced with an explaining comment. :-D For those who are going to utilize the parser in their modules, how do we best provide a full picture of what the parsing does (i.e., what the full input -> output transformation is and which steps are taken)? The tests are very helpful, but I am thinking it would be useful with a concise summary that doesn't require full comprehension of the code. Should we try to create documentation pages while developing? Or we could perhaps have a more detailed, comprehensive description of the main Parse_section_tsv() function instead and refer to it? It would be good to do it before we move on and start forgetting the details.

Completely agree, let's add some docs. I guess it is also good practice for all of us to document the major classes and functions to a certain extend.

The initial batch of tests is based on test file inputs - I found that convenient because the exact file structure is easy to observe there (maybe with the exception of white spaces..). Then there are additional test batches with hard-coded polars variables (which take some mental work to re-imagine as input files) - is there a specific reason for mixing those two?

The only reason that I am mixing them is because different functions require different input types and it is mainly for checking if the tests work for different cases. I understand that as a reader it is difficult to follow. One solution could be to add a small description for each so you do not have to do any mental gymnastics but rather know instantly what the case is and you can assess whether cases are missing or not.

For the extra_empty_lines.tsv case, I think it could be nice to also have empty lines in the middle of a section. I do not know of a case where it would be relevant right now, so there is no pressure to implement this.. but since we are relying on the [section header titles] to inform about section starts, I am thinking the parsing could eventually drop relying on empty lines completely: A section end could be defined by file end or a new section start, and the _handle_row_with_nulls() function that handles rows with nulls could also remove empty lines.

I see your point, however, premature optimization comes with a cost of spending more time but it might not have a use case for us. Let's wait until we actually encounter files that have empty rows within a section. If we keep the rest of the interface consistent, it is no problem for the places it was already used in other modules. Does that sound ok?

danielvo · 2026-06-09T07:57:36Z

Completely agree, let's add some docs. I guess it is also good practice for all of us to document the major classes and functions to a certain extend.

Excellent. Do you have a vision of how to structure the documentation? Classes, functions, tools, ... As per Tina's comment to TSOPPI docs, we might want to have the technical documentation ("how to run things", what the inputs are, etc.), as well as documentation for those who want to be able to interpret the outputs (understanding the resulting plots is a completely different thing than running the analysis correctly).

The only reason that I am mixing them is because different functions require different input types and it is mainly for checking if the tests work for different cases. I understand that as a reader it is difficult to follow. One solution could be to add a small description for each so you do not have to do any mental gymnastics but rather know instantly what the case is and you can assess whether cases are missing or not.

That's a good solution. And: Very nice with the tests in the first place!

(...) Does that sound ok?

It does indeed! It's not like we don't have anything else to do.. :-) We can explicitly mention the assumption that 'empty lines are not appearing in the middle of a section' in the docs (together with any other implicit assumptions that we can identify). If it should become a problem in the future, the docs might help identify the issue.

marrip · 2026-06-09T08:10:32Z

Excellent. Do you have a vision of how to structure the documentation? Classes, functions, tools, ... As per Tina's comment to TSOPPI docs, we might want to have the technical documentation ("how to run things", what the inputs are, etc.), as well as documentation for those who want to be able to interpret the outputs (understanding the resulting plots is a completely different thing than running the analysis correctly).

I started putting everything in docs and also think we should move the development section from the readme in docs. For now, we have decisions for adrs and we have references that hold any kind of reference information (for now publications and specifications for the functions and classes). I would add guides for any information on how to use things and move the development section there. Information about how to use tsoppy can also be included in Guides. We can then make the README rather general and point to the relevant documentation in docs.

Regarding interpretation of results, I am not sure if that should be part of the repo. These information would be for the molecular biologists, right? We could have a separate repo for that. Otherwise, I might opt for the actual nextflow workflow as this will link all tools and tsoppy together which produces the results that are used for interpretation. However, I think we can also decide on that later and even move it if we realize later that we made the wrong choice. What do you think?

It does indeed! It's not like we don't have anything else to do.. :-) We can explicitly mention the assumption that 'empty lines are not appearing in the middle of a section' in the docs (together with any other implicit assumptions that we can identify). If it should become a problem in the future, the docs might help identify the issue.

Will add a comment.

danielvo · 2026-06-09T08:22:27Z

I started putting everything in docs and also think we should move the development section from the readme in docs. For now, we have decisions for adrs and we have references that hold any kind of reference information (for now publications and specifications for the functions and classes). I would add guides for any information on how to use things and move the development section there. Information about how to use tsoppy can also be included in Guides. We can then make the README rather general and point to the relevant documentation in docs.

Superb!

Regarding interpretation of results, I am not sure if that should be part of the repo. These information would be for the molecular biologists, right? We could have a separate repo for that. Otherwise, I might opt for the actual nextflow workflow as this will link all tools and tsoppy together which produces the results that are used for interpretation. However, I think we can also decide on that later and even move it if we realize later that we made the wrong choice. What do you think?

I am not sure what the best solution would be. Mixing the technical documentation and an interpretation guide on the same page would probably be an annoyance to all users. :-p So I believe that separating the two in some way would make sense (that being said, they can both link to each other in a convenient way). We can indeed decide on this later, I just wanted to bring it up so that we are aware of it. :-)

Will add a comment.

Thanks! :-)

marrip · 2026-06-09T08:29:59Z

I am not sure what the best solution would be. Mixing the technical documentation and an interpretation guide on the same page would probably be an annoyance to all users. :-p So I believe that separating the two in some way would make sense (that being said, they can both link to each other in a convenient way). We can indeed decide on this later, I just wanted to bring it up so that we are aware of it. :-)

I agree, linking them somehow would be good. Thanks for bringing it up ☺️

danielvo

Excellent! I have nothing else to add. Many thanks, Martin! :-D

marrip added 6 commits June 5, 2026 13:28

feat: move parser to file_parser module and change behavior to checki…

985d65d

…ng for section heading instead of empty lines

fix: don't assume key value store but have user define it

d4e3fb4

chore: ruff lint

93f331b

docs: update comment to make purpose clearer

bd0737b

fix: add init.py to mark subpackage general

4a83af3

test: add test_data and unittests for Parse_section_Tsv

60d28ee

marrip requested review from danielvo and tinavisnovska June 5, 2026 13:48

marrip self-assigned this Jun 5, 2026

marrip added 9 commits June 5, 2026 16:05

feat: intercept file not found and no data and log

962e93a

test: add test cases and data for empty file and non-existent

a7a2712

fix: correct output signature for _get_section_idx

dbe533f

test: add unit tests for _get_section_idx

18dfbde

test: add unit tests for _parse_headers and _handle_row_with_nulls

56a10e2

chore: lint ruff and isort

f4898e9

chore: lint ruff

0b9fbd7

chore: lint ruff

2a59e8d

chore: install polars

ea01cf0

danielvo reviewed Jun 8, 2026

View reviewed changes

Comment thread src/tsoppy/general/file_parser.py Outdated

docs: update docstring with @danielvo 's suggestion

2b31aae

marrip added 4 commits June 9, 2026 09:52

docs: add short comments to describe test cases

016a5dd

chore: include pytest in dev deps

abd8898

docs: include description of Parse_section_tsv

86b33d9

docs: include header and MLA citation for cyvcf paper

ac891ae

marrip requested a review from danielvo June 9, 2026 07:54

danielvo reviewed Jun 9, 2026

View reviewed changes

Comment thread tests/general_file_parser_test.py Outdated

docs: correct docstring according to @danielvo 's suggestion

0978c77

danielvo approved these changes Jun 9, 2026

View reviewed changes

marrip merged commit 743cfc1 into develop Jun 10, 2026
2 checks passed

marrip deleted the add-sec-tsv-parser branch June 10, 2026 13:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add sec tsv parser#31

Add sec tsv parser#31
marrip merged 21 commits into
developfrom
add-sec-tsv-parser

marrip commented Jun 5, 2026

Uh oh!

Uh oh!

danielvo commented Jun 8, 2026

Uh oh!

marrip commented Jun 9, 2026 •

edited

Loading

Uh oh!

danielvo commented Jun 9, 2026

Uh oh!

marrip commented Jun 9, 2026

Uh oh!

danielvo commented Jun 9, 2026

Uh oh!

marrip commented Jun 9, 2026

Uh oh!

Uh oh!

danielvo left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

marrip commented Jun 5, 2026

Uh oh!

Uh oh!

danielvo commented Jun 8, 2026

Uh oh!

marrip commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danielvo commented Jun 9, 2026

Uh oh!

marrip commented Jun 9, 2026

Uh oh!

danielvo commented Jun 9, 2026

Uh oh!

marrip commented Jun 9, 2026

Uh oh!

Uh oh!

danielvo left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

marrip commented Jun 9, 2026 •

edited

Loading