Skip to content

Add sec tsv parser#31

Merged
marrip merged 21 commits into
developfrom
add-sec-tsv-parser
Jun 10, 2026
Merged

Add sec tsv parser#31
marrip merged 21 commits into
developfrom
add-sec-tsv-parser

Conversation

@marrip

@marrip marrip commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

I have moved the section tsv parser to this branch and added unit tests along with test data, please check the test data and comment if there is anything else we want to check.

@marrip marrip requested review from danielvo and tinavisnovska June 5, 2026 13:48
@marrip marrip self-assigned this Jun 5, 2026
Comment thread src/tsoppy/general/file_parser.py Outdated
@danielvo

danielvo commented Jun 8, 2026

Copy link
Copy Markdown

Very nice work, Martin! :-D I expect the current code to do the desired job with the Local App and Dragen files that we plan to utilize. I have a few topics we can discuss further:

  1. There are some very elegant lines with lambda functions and polars functions, and it's very helpful that they are introduced with an explaining comment. :-D For those who are going to utilize the parser in their modules, how do we best provide a full picture of what the parsing does (i.e., what the full input -> output transformation is and which steps are taken)? The tests are very helpful, but I am thinking it would be useful with a concise summary that doesn't require full comprehension of the code. Should we try to create documentation pages while developing? Or we could perhaps have a more detailed, comprehensive description of the main Parse_section_tsv() function instead and refer to it? It would be good to do it before we move on and start forgetting the details.

  2. The initial batch of tests is based on test file inputs - I found that convenient because the exact file structure is easy to observe there (maybe with the exception of white spaces..). Then there are additional test batches with hard-coded polars variables (which take some mental work to re-imagine as input files) - is there a specific reason for mixing those two?

  3. For the extra_empty_lines.tsv case, I think it could be nice to also have empty lines in the middle of a section. I do not know of a case where it would be relevant right now, so there is no pressure to implement this.. but since we are relying on the [section header titles] to inform about section starts, I am thinking the parsing could eventually drop relying on empty lines completely: A section end could be defined by file end or a new section start, and the _handle_row_with_nulls() function that handles rows with nulls could also remove empty lines.

@marrip

marrip commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Very nice work, Martin! :-D I expect the current code to do the desired job with the Local App and Dragen files that we plan to utilize. I have a few topics we can discuss further:

Thank you 🙂

  1. There are some very elegant lines with lambda functions and polars functions, and it's very helpful that they are introduced with an explaining comment. :-D For those who are going to utilize the parser in their modules, how do we best provide a full picture of what the parsing does (i.e., what the full input -> output transformation is and which steps are taken)? The tests are very helpful, but I am thinking it would be useful with a concise summary that doesn't require full comprehension of the code. Should we try to create documentation pages while developing? Or we could perhaps have a more detailed, comprehensive description of the main Parse_section_tsv() function instead and refer to it? It would be good to do it before we move on and start forgetting the details.

Completely agree, let's add some docs. I guess it is also good practice for all of us to document the major classes and functions to a certain extend.

  1. The initial batch of tests is based on test file inputs - I found that convenient because the exact file structure is easy to observe there (maybe with the exception of white spaces..). Then there are additional test batches with hard-coded polars variables (which take some mental work to re-imagine as input files) - is there a specific reason for mixing those two?

The only reason that I am mixing them is because different functions require different input types and it is mainly for checking if the tests work for different cases. I understand that as a reader it is difficult to follow. One solution could be to add a small description for each so you do not have to do any mental gymnastics but rather know instantly what the case is and you can assess whether cases are missing or not.

  1. For the extra_empty_lines.tsv case, I think it could be nice to also have empty lines in the middle of a section. I do not know of a case where it would be relevant right now, so there is no pressure to implement this.. but since we are relying on the [section header titles] to inform about section starts, I am thinking the parsing could eventually drop relying on empty lines completely: A section end could be defined by file end or a new section start, and the _handle_row_with_nulls() function that handles rows with nulls could also remove empty lines.

I see your point, however, premature optimization comes with a cost of spending more time but it might not have a use case for us. Let's wait until we actually encounter files that have empty rows within a section. If we keep the rest of the interface consistent, it is no problem for the places it was already used in other modules. Does that sound ok?

@marrip marrip requested a review from danielvo June 9, 2026 07:54
@danielvo

danielvo commented Jun 9, 2026

Copy link
Copy Markdown

Completely agree, let's add some docs. I guess it is also good practice for all of us to document the major classes and functions to a certain extend.

Excellent. Do you have a vision of how to structure the documentation? Classes, functions, tools, ... As per Tina's comment to TSOPPI docs, we might want to have the technical documentation ("how to run things", what the inputs are, etc.), as well as documentation for those who want to be able to interpret the outputs (understanding the resulting plots is a completely different thing than running the analysis correctly).

The only reason that I am mixing them is because different functions require different input types and it is mainly for checking if the tests work for different cases. I understand that as a reader it is difficult to follow. One solution could be to add a small description for each so you do not have to do any mental gymnastics but rather know instantly what the case is and you can assess whether cases are missing or not.

That's a good solution. And: Very nice with the tests in the first place!

(...) Does that sound ok?

It does indeed! It's not like we don't have anything else to do.. :-) We can explicitly mention the assumption that 'empty lines are not appearing in the middle of a section' in the docs (together with any other implicit assumptions that we can identify). If it should become a problem in the future, the docs might help identify the issue.

@marrip

marrip commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Excellent. Do you have a vision of how to structure the documentation? Classes, functions, tools, ... As per Tina's comment to TSOPPI docs, we might want to have the technical documentation ("how to run things", what the inputs are, etc.), as well as documentation for those who want to be able to interpret the outputs (understanding the resulting plots is a completely different thing than running the analysis correctly).

I started putting everything in docs and also think we should move the development section from the readme in docs. For now, we have decisions for adrs and we have references that hold any kind of reference information (for now publications and specifications for the functions and classes). I would add guides for any information on how to use things and move the development section there. Information about how to use tsoppy can also be included in Guides. We can then make the README rather general and point to the relevant documentation in docs.

Regarding interpretation of results, I am not sure if that should be part of the repo. These information would be for the molecular biologists, right? We could have a separate repo for that. Otherwise, I might opt for the actual nextflow workflow as this will link all tools and tsoppy together which produces the results that are used for interpretation. However, I think we can also decide on that later and even move it if we realize later that we made the wrong choice. What do you think?

It does indeed! It's not like we don't have anything else to do.. :-) We can explicitly mention the assumption that 'empty lines are not appearing in the middle of a section' in the docs (together with any other implicit assumptions that we can identify). If it should become a problem in the future, the docs might help identify the issue.

Will add a comment.

@danielvo

danielvo commented Jun 9, 2026

Copy link
Copy Markdown

I started putting everything in docs and also think we should move the development section from the readme in docs. For now, we have decisions for adrs and we have references that hold any kind of reference information (for now publications and specifications for the functions and classes). I would add guides for any information on how to use things and move the development section there. Information about how to use tsoppy can also be included in Guides. We can then make the README rather general and point to the relevant documentation in docs.

Superb!

Regarding interpretation of results, I am not sure if that should be part of the repo. These information would be for the molecular biologists, right? We could have a separate repo for that. Otherwise, I might opt for the actual nextflow workflow as this will link all tools and tsoppy together which produces the results that are used for interpretation. However, I think we can also decide on that later and even move it if we realize later that we made the wrong choice. What do you think?

I am not sure what the best solution would be. Mixing the technical documentation and an interpretation guide on the same page would probably be an annoyance to all users. :-p So I believe that separating the two in some way would make sense (that being said, they can both link to each other in a convenient way). We can indeed decide on this later, I just wanted to bring it up so that we are aware of it. :-)

Will add a comment.

Thanks! :-)

@marrip

marrip commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

I am not sure what the best solution would be. Mixing the technical documentation and an interpretation guide on the same page would probably be an annoyance to all users. :-p So I believe that separating the two in some way would make sense (that being said, they can both link to each other in a convenient way). We can indeed decide on this later, I just wanted to bring it up so that we are aware of it. :-)

I agree, linking them somehow would be good. Thanks for bringing it up ☺️

Comment thread tests/general_file_parser_test.py Outdated

@danielvo danielvo left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! I have nothing else to add. Many thanks, Martin! :-D

@marrip marrip merged commit 743cfc1 into develop Jun 10, 2026
2 checks passed
@marrip marrip deleted the add-sec-tsv-parser branch June 10, 2026 13:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants