78 commits
7bfa850
Script for query
rdurnik Jan 27, 2026
ca87cb4
Script for abstracts download
rdurnik Jan 27, 2026
1bed026
Query clean up
rdurnik Jan 27, 2026
50f8ced
Script for chemical identification
rdurnik Jan 27, 2026
36ed78e
Script for matching of chemical names to database of relevant chemicals
rdurnik Jan 27, 2026
e5fcdc9
Lint
rdurnik Jan 27, 2026
95044b4
Script to normalize chemical
rdurnik Jan 27, 2026
5759105
Updated matching to take list of chemical names (Excel file)
rdurnik Jan 27, 2026
409c8c9
Script to generate MeSH terms dataframe
rdurnik Jan 27, 2026
f3258e2
Script for find relationships
rdurnik Jan 27, 2026
dffc889
IDs are saved as txt file
rdurnik Jan 27, 2026
4aa629b
Clean up
rdurnik Jan 27, 2026
9204380
Added output dir
rdurnik Jan 28, 2026
12849a8
Abstracts are saved as tsv file
rdurnik Jan 28, 2026
9803780
Script for PDF download
rdurnik Jan 28, 2026
0f55a39
MeSH terms df saved as tsv
rdurnik Jan 28, 2026
2c5ff4a
Lint
rdurnik Jan 28, 2026
e691d6f
Chemical identification saved as file
rdurnik Jan 28, 2026
95c1524
Normalize chemical takes saved output and outputs a file
rdurnik Jan 28, 2026
2e91525
Chemical matching takes file and returns a file
rdurnik Jan 28, 2026
36239be
Find chemicals keeps text
rdurnik Jan 28, 2026
0c93372
Script to parse PDFs
rdurnik Jan 29, 2026
fd189fe
Renamed file
rdurnik Jan 29, 2026
56c0604
Abstract download fixed for Europe PMC
rdurnik Jan 29, 2026
19d6d32
Renamed column names
rdurnik Jan 29, 2026
ace2fe8
Script to find relationships
rdurnik Jan 29, 2026
7e331f2
Initial version chemical identification wrappers
rdurnik Feb 10, 2026
0bfc9c7
Made find chemicals into xml file
rdurnik Feb 25, 2026
b6f33b5
Updated version
rdurnik Feb 25, 2026
75acb8f
Replaced tsv test file with txt
rdurnik Feb 25, 2026
b401fd7
Changed extension
rdurnik Feb 25, 2026
c893ba7
Changed download abstracts into xml
rdurnik Feb 25, 2026
1adeee5
Fixed download abstracts name
rdurnik Mar 5, 2026
2ea1ee4
Fixed description
rdurnik Mar 5, 2026
d72834d
Download abstracts fixes
rdurnik Mar 5, 2026
12f21a3
Download PDF as XML
rdurnik Mar 5, 2026
a63a4a8
Removed print
rdurnik Mar 5, 2026
219fa0c
Updated text data set
rdurnik Mar 5, 2026
722fead
IDs test data set
rdurnik Mar 5, 2026
6e58b46
Find relationships XML
rdurnik Mar 5, 2026
5e09fd9
Removed MeSH term generation
rdurnik Mar 5, 2026
972f9b0
PDF parsing XML
rdurnik Mar 5, 2026
589c8ad
Query literature XML
rdurnik Mar 5, 2026
96c7a70
Changed chemical identification to output one chemical per line
rdurnik Mar 5, 2026
fb0df48
Chemical normalization
rdurnik Mar 5, 2026
67cebd2
Matching chemicals
rdurnik Mar 5, 2026
9d05d10
Description fix
rdurnik Mar 6, 2026
9320772
Needs to be defined before the other imports
rdurnik Mar 6, 2026
d1c3642
Fixed macros
rdurnik Mar 6, 2026
6961497
Renamed normalization to normalization PubChem
rdurnik Mar 6, 2026
43447bd
Changed description
rdurnik Mar 6, 2026
8a8114d
Merge branch 'master' into new_wrappers
hechth Mar 24, 2026
832eb76
removed imports
rdurnik Mar 25, 2026
89d1f73
added pmc
rdurnik Mar 25, 2026
2d2c0df
updated database selection
rdurnik Apr 9, 2026
13a7ee9
updated to get pdf
rdurnik Apr 9, 2026
20c5436
updated database selection
rdurnik Apr 9, 2026
42482c7
added openai key macro
rdurnik Apr 9, 2026
57d83a5
updated to openai_api_key_credentials
rdurnik Apr 9, 2026
fd700d2
updated name
rdurnik Apr 9, 2026
0f37098
wrapper to get full text
rdurnik Apr 9, 2026
01190f0
Name update
rdurnik Apr 10, 2026
8322fcb
added potential llm api key to pdf parsing
rdurnik Apr 10, 2026
ec59595
updated macros for llms
rdurnik Apr 10, 2026
f8c5550
find relationships with llms
rdurnik Apr 10, 2026
50156b6
updated pdf desc
rdurnik Apr 10, 2026
6891035
changed desc to text
rdurnik Apr 20, 2026
0deaeba
llm find chemicals
rdurnik Apr 20, 2026
87aefb8
new test file
rdurnik Apr 20, 2026
54b7e5c
normalization using llms
rdurnik Apr 20, 2026
30c8ba6
clean up, tests fix
rdurnik Apr 20, 2026
8d8e4c0
name change
rdurnik Apr 20, 2026
71e57a8
pdf test data
rdurnik Apr 21, 2026
e89e76f
requirement not supposed to be in inputs
rdurnik Apr 21, 2026
9cde057
pdf input file name change
rdurnik Apr 21, 2026
86dfc8f
file name rename
rdurnik Apr 21, 2026
4b7a32a
test fixes
rdurnik Apr 21, 2026
f07680a
added mesh terms normalization
rdurnik Apr 21, 2026
6 changes: 3 additions & 3 deletions tools/aoptk/.shed.yml
@@ -4,13 +4,13 @@ remote_repository_url: "https://github.com/rdurnik/aoptk"
homepage_url: "https://github.com/rdurnik/aoptk"
categories:
- Machine Learning
-description: "AOP-toolkit (aoptk) is a Python package designed to support the development of Adverse Outcome Pathways (AOPs) that require extensive data mining."
+description: "AOP-toolkit (aoptk) is a Python package designed to support data mining and analysis of toxicological outcomes."
long_description: |
-"AOP-toolkit (aoptk) is a Python package developed to support the construction of Adverse Outcome Pathways (AOPs) that require extensive mining and integration of toxicological data from heterogeneous sources. It enables researchers to collect literature from databases such as PubMed and Europe PMC, extract relevant information from full-text publications, and analyze complex, unstructured data using large language models. The toolkit also provides functionality for normalizing chemical names across publications, helping ensure consistency and interoperability."
+"AOP-toolkit (aoptk) is a Python package for mining and analyzing toxicological and biomedical literature. Originally developed to support the construction of Adverse Outcome Pathways (AOPs), it provides general-purpose tools for retrieving, processing, and analyzing scientific publications."
auto_tool_repositories:
name_template: "{{ tool_id }}"
description_template: "{{ tool_name }} tool from the aoptk package"
suite:
name: suite_aoptk
-description: AOP-toolkit (aoptk) is a Python package developed to support the construction of Adverse Outcome Pathways (AOPs) that require extensive mining and integration of toxicological data from heterogeneous sources.
+description: AOP-toolkit (aoptk) is a Python package for mining and analyzing toxicological and biomedical literature. Originally developed to support the construction of Adverse Outcome Pathways (AOPs), it provides general-purpose tools for retrieving, processing, and analyzing scientific publications.
type: repository_suite_definition
10 changes: 6 additions & 4 deletions tools/aoptk/aoptk_chemical_identifier.xml
@@ -6,19 +6,21 @@

<requirements>
<expand macro="requirements"/>
<expand macro="email_credentials"/>
</requirements>

<command detect_errors="exit_code"><![CDATA[
chemical-identifier
--query '$query'
- --literature_database "$literature_database"
+ --literature_database "$literature_database_pubmed_europepmc"
--chemical_database "$chemical_database"
--outdir .
\${EMAIL:+--email \$EMAIL}
]]></command>

<inputs>
-<expand macro="inputs"/>
+<expand macro="query"/>
+<expand macro="literature_database_pubmed_europepmc"/>
<param argument="--chemical_database" type="data" format="xlsx" label="Chemical database" help="Custom chemical database with toxicologically relevant chemicals. Excel file with single column: chemical_name. Examples can be found in Citations." />
</inputs>

@@ -31,13 +33,13 @@
<!-- Hint: You can use [ctrl+alt+t] after defining the inputs/outputs to auto-scaffold some basic test cases. -->
<test>
<param name="query" value="hepg2 thioacetamide"/>
-<param name="literature_database" value="pubmed"/>
+<param name="literature_database_pubmed_europepmc" value="pubmed"/>
<param name="chemical_database" location="https://zenodo.org/records/16532456/files/tg_gates.xlsx?download=1"/>
<output name="Chemicals_per_publication" file="chemicals_per_publication_test.xlsx" compare="sim_size" delta="100"/>
</test>
<test>
<param name="query" value="hepg2 thioacetamide spheroid"/>
-<param name="literature_database" value="europepmc"/>
+<param name="literature_database_pubmed_europepmc" value="europepmc"/>
<param name="chemical_database" location="https://zenodo.org/records/16532456/files/tg_gates.xlsx?download=1"/>
<output name="Publications_per_chemical" file="publications_per_chemical.xlsx" compare="sim_size" delta="100"/>
</test>
58 changes: 58 additions & 0 deletions tools/aoptk/aoptk_chemical_matching.xml
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
<tool id="aoptk_match_chemicals" name="aoptk match chemicals" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="25.1" license="MIT">
<description>Match chemical entities.</description>
<macros>
<import>macros.xml</import>
</macros>

<requirements>
<expand macro="requirements"/>
</requirements>

<command detect_errors="exit_code"><![CDATA[
python3 '${match_chemicals}'
]]></command>

<configfiles>
<configfile name="match_chemicals">
import pandas as pd

## Both inputs must contain a "heading" column, which is the join key.
chemicals_df_1 = pd.read_csv("$input_file_1", sep="\t")
chemicals_df_2 = pd.read_csv("$input_file_2", sep="\t")
## The outer join keeps headings that occur in only one of the two files.
merged_files = chemicals_df_1.merge(
    chemicals_df_2,
    on="heading",
    how="outer",
)
merged_files.to_csv("merged_chemicals.tsv", sep="\t", index=False)

</configfile>
</configfiles>

<inputs>
<param name="input_file_1" type="data" format="tabular" label="First TSV with heading column." help="First input TSV file; must contain a heading column." />
<param name="input_file_2" type="data" format="tabular" label="Second TSV with heading column." help="Second input TSV file; must contain a heading column." />
</inputs>

<outputs>
<data name="merged_chemicals" format="tabular" from_work_dir="merged_chemicals.tsv" label="Merged chemicals with heading." />
</outputs>

<tests>
<test>
<param name="input_file_1" value="test-data/normalized.tsv"/>
<param name="input_file_2" value="test-data/normalized.tsv"/>
<output name="merged_chemicals" file="test-data/normalized.tsv" compare="sim_size" delta="100"/>
</test>
</tests>

<help><![CDATA[
Chemical Matching
===================

Matches chemical entities from two TSV files by outer-joining them on the shared heading column.

]]></help>

</tool>
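The outer join on `heading` performed by this tool can be illustrated with a minimal pandas sketch. The two toy tables and their values are hypothetical stand-ins for the normalized TSV inputs; only `DataFrame.merge` with `how="outer"` on a `heading` column is taken from the tool itself.

```python
import pandas as pd

# Hypothetical toy inputs; the real tool reads two TSV files instead.
left = pd.DataFrame(
    {
        "chemical": ["paracetamol", "TAA"],
        "heading": ["acetaminophen", "thioacetamide"],
    }
)
right = pd.DataFrame(
    {
        "chemical": ["APAP", "CCl4"],
        "heading": ["acetaminophen", "carbon tetrachloride"],
    }
)

# Outer join on the shared key: headings found in only one table are kept.
merged = left.merge(right, on="heading", how="outer")
print(merged)
```

Headings present in only one input are kept, with missing values in the columns of the other input. Overlapping non-key columns such as `chemical` receive the pandas default suffixes `_x` and `_y`.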
62 changes: 62 additions & 0 deletions tools/aoptk/aoptk_chemical_normalization_llm.xml
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
<tool id="aoptk_normalize_chemicals_llm" name="aoptk normalize chemicals llm" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="25.1" license="MIT">
<description>Normalize chemical entities using LLMs.</description>
<macros>
<import>macros.xml</import>
</macros>

<requirements>
<expand macro="requirements"/>
<expand macro="openai_api_key_credentials"/>
</requirements>

<command detect_errors="exit_code"><![CDATA[
python3 '${normalize_chemicals_llm}'
]]></command>

<configfiles>
<configfile name="normalize_chemicals_llm">
import os
from aoptk.text_generation_api import TextGenerationAPI
from aoptk.chemical import Chemical
import pandas as pd

openai_key = os.environ.get("OPENAI_KEY")
## Create the client once and reuse it for every row.
text_generation_api = TextGenerationAPI(model="$llm_model", api_key=openai_key)
chemical_list = pd.read_csv("$chemical_list", sep="\t")["chemical"].tolist()
chemicals = pd.read_csv("$chemicals", sep="\t")
chemicals["chemical"] = chemicals["chemical"].apply(
    lambda x: text_generation_api.normalize_chemical(chemical=Chemical(x), chemical_list=chemical_list)
)
chemicals["heading"] = chemicals["chemical"].apply(lambda chem: chem.heading)
chemicals.to_csv("normalized_chemicals.tsv", sep="\t", index=False)

</configfile>
</configfiles>

<inputs>
<expand macro="llm_models"/>
<param name="chemicals" type="data" format="tabular" label="TSV with chemical column." help="Input tsv file with chemical column." />
<param name="chemical_list" type="data" format="tabular" label="TSV with reference chemical list." help="Input TSV file whose chemical column lists the reference chemicals to match against." />
</inputs>

<outputs>
<data name="normalized_chemicals" format="tabular" from_work_dir="normalized_chemicals.tsv" label="Chemicals with heading generated." />
</outputs>

<tests>
<test>
<param name="chemicals" value="test-data/chemicals.tsv"/>
<param name="chemical_list" value="test-data/chemicals.tsv"/>
<output name="normalized_chemicals" file="test-data/normalized.tsv" compare="sim_size" delta="100"/>
</test>
</tests>

<help><![CDATA[
Chemical Normalization LLMs
===========================

Normalizes chemical entities with an LLM by matching each input chemical against a provided chemical list.

]]></help>

</tool>
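The contract of this tool (one input chemical plus a reference list in, one matched name out) can be sketched without an LLM. In the sketch below, `difflib` fuzzy matching from the standard library is swapped in for the LLM call purely for illustration; the function name `match_to_list`, the cutoff value, and the sample names are assumptions and are not part of aoptk.

```python
import difflib
from typing import List, Optional


def match_to_list(chemical: str, chemical_list: List[str], cutoff: float = 0.8) -> Optional[str]:
    """Return the closest entry of chemical_list, or None if nothing is similar enough.

    Stand-in for the LLM call: plain string similarity, not a language model.
    """
    # Compare case-insensitively but return the entry as written in the list.
    lowered = {entry.lower(): entry for entry in chemical_list}
    hits = difflib.get_close_matches(chemical.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


reference = ["Thioacetamide", "Acetaminophen", "Carbon tetrachloride"]
matched = match_to_list("thioacetamid", reference)
unmatched = match_to_list("xyzzy", reference)
print(matched, unmatched)
```

String similarity only catches spelling variants. Resolving true synonyms (for example a trade name to a systematic name) is the part that the tool delegates to the LLM.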
55 changes: 55 additions & 0 deletions tools/aoptk/aoptk_chemical_normalization_mesh.xml
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
<tool id="aoptk_normalize_chemicals_mesh" name="aoptk normalize chemicals mesh" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="25.1" license="MIT">
<description>Normalize chemical entities using MeSH terms.</description>
<macros>
<import>macros.xml</import>
</macros>

<requirements>
<expand macro="requirements"/>
</requirements>

<command detect_errors="exit_code"><![CDATA[
python3 '${normalize_chemicals_mesh}'
]]></command>

<configfiles>
<configfile name="normalize_chemicals_mesh">
from aoptk.chemical import Chemical
from aoptk.normalization.mesh_terms import MeshTerms
import pandas as pd

## Build the normalizer once instead of once per row.
mesh_terms = MeshTerms()
chemicals = pd.read_csv("$input_file", sep="\t")
chemicals["chemical"] = chemicals["chemical"].apply(
    lambda x: mesh_terms.normalize_chemical(Chemical(x))
)
chemicals["heading"] = chemicals["chemical"].apply(lambda chem: chem.heading)
chemicals.to_csv("normalized_chemicals.tsv", sep="\t", index=False)

</configfile>
</configfiles>

<inputs>
<param name="input_file" type="data" format="tabular" label="TSV with chemical column." help="Input tsv file with chemical column." />
</inputs>

<outputs>
<data name="normalized_chemicals" format="tabular" from_work_dir="normalized_chemicals.tsv" label="Chemicals with heading generated." />
</outputs>

<tests>
<test>
<param name="input_file" value="test-data/chemicals.tsv"/>
<output name="normalized_chemicals" file="test-data/normalized.tsv" compare="sim_size" delta="100"/>
</test>
</tests>

<help><![CDATA[
Chemical Normalization MeSH Terms
=================================

Tool to normalize chemical entities using MeSH Terms.

]]></help>

</tool>
55 changes: 55 additions & 0 deletions tools/aoptk/aoptk_chemical_normalization_pubchem.xml
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
<tool id="aoptk_normalize_chemicals_pubchem" name="aoptk normalize chemicals pubchem" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="25.1" license="MIT">
<description>Normalize chemical entities using PubChem API.</description>
<macros>
<import>macros.xml</import>
</macros>

<requirements>
<expand macro="requirements"/>
</requirements>

<command detect_errors="exit_code"><![CDATA[
python3 '${normalize_chemicals_pubchem}'
]]></command>

<configfiles>
<configfile name="normalize_chemicals_pubchem">
from aoptk.normalization.pubchem_api import PubChemAPI
from aoptk.chemical import Chemical
import pandas as pd

## Build the normalizer once instead of once per row.
pubchem_api = PubChemAPI()
chemicals = pd.read_csv("$input_file", sep="\t")
chemicals["chemical"] = chemicals["chemical"].apply(
    lambda x: pubchem_api.normalize_chemical(Chemical(x))
)
chemicals["heading"] = chemicals["chemical"].apply(lambda chem: chem.heading)
chemicals.to_csv("normalized_chemicals.tsv", sep="\t", index=False)

</configfile>
</configfiles>

<inputs>
<param name="input_file" type="data" format="tabular" label="TSV with chemical column." help="Input tsv file with chemical column." />
</inputs>

<outputs>
<data name="normalized_chemicals" format="tabular" from_work_dir="normalized_chemicals.tsv" label="Chemicals with heading generated." />
</outputs>

<tests>
<test>
<param name="input_file" value="test-data/chemicals.tsv"/>
<output name="normalized_chemicals" file="test-data/normalized.tsv" compare="sim_size" delta="100"/>
</test>
</tests>

<help><![CDATA[
Chemical Normalization PubChem
==============================

Tool to normalize chemical entities using PubChem API.

]]></help>

</tool>
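The normalization tools share one pattern: wrap each name in a `Chemical`, pass it to a normalizer, and read back `heading`. The sketch below mimics that pattern with stand-in classes. `StubChemical`, `StubNormalizer`, and the synonym table are hypothetical and do not reproduce the behavior of aoptk, PubChem, or MeSH.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubChemical:
    """Hypothetical stand-in for aoptk.chemical.Chemical."""

    name: str
    heading: Optional[str] = None


class StubNormalizer:
    """Hypothetical stand-in for a normalizer such as PubChemAPI or MeshTerms."""

    def __init__(self) -> None:
        # Toy synonym table; a real normalizer would query an external resource.
        self.synonyms = {"paracetamol": "acetaminophen", "apap": "acetaminophen"}

    def normalize_chemical(self, chemical: StubChemical) -> StubChemical:
        # Fall back to the lower-cased input name when no synonym is known.
        key = chemical.name.strip().lower()
        chemical.heading = self.synonyms.get(key, key)
        return chemical


normalizer = StubNormalizer()  # built once and reused for every name
names = ["Paracetamol", "APAP", "thioacetamide"]
headings = [normalizer.normalize_chemical(StubChemical(n)).heading for n in names]
print(headings)
```

Building the normalizer once outside the loop avoids repeating any setup work for every row, which matters if construction is expensive.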
74 changes: 74 additions & 0 deletions tools/aoptk/aoptk_download_abstracts.xml
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
<tool id="aoptk_download_abstracts" name="aoptk download abstracts" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="25.1" license="MIT">
<description>Download abstracts for a list of publication IDs.</description>
<macros>
<import>macros.xml</import>
</macros>

<requirements>
<expand macro="requirements"/>
<expand macro="email_credentials"/>
</requirements>

<command detect_errors="exit_code"><![CDATA[
python3 '${download_abstracts}'
]]></command>

<configfiles>
<configfile name="download_abstracts">
from aoptk.literature.databases.pubmed import PubMed
from aoptk.literature.databases.europepmc import EuropePMC
from aoptk.literature.abstract import Abstract
from Bio import Entrez
import os

## Read one publication ID per line, skipping blank lines.
with open("$input_file", "r") as f:
    ids = [line.strip() for line in f if line.strip()]
email = os.environ.get("EMAIL")

if "${literature_database_pubmed_europepmc}" == "pubmed":
    Entrez.email = email
    ## PubMed.__new__ creates the object without calling __init__; the ID list is then set directly.
    pubmed = PubMed.__new__(PubMed)
    pubmed.id_list = ids
    abstracts = pubmed.get_abstracts()
elif "${literature_database_pubmed_europepmc}" == "europepmc":
    europepmc = EuropePMC("")
    europepmc.id_list = ids
    abstracts = europepmc.get_abstracts()
else:
    raise ValueError("Select a valid database.")

## Write one text file per abstract, named after its publication ID.
for abstract in abstracts:
    with open(f"{abstract.publication_id}.txt", "w") as f:
        f.write(abstract.text)

</configfile>
</configfiles>

<inputs>
<expand macro="literature_database_pubmed_europepmc"/>
<param name="input_file" type="data" format="txt" label="List of IDs to search for." help="Input text file with one publication ID per line." />
</inputs>

<outputs>
<collection name="abstracts" type="list" label="Downloaded abstracts">
<discover_datasets pattern="(?P&lt;designation&gt;.+)\.txt$" format="txt" visible="false" />
</collection>
</outputs>

<tests>
<test>
<param name="input_file" value="test-data/ids.txt"/>
<output_collection name="abstracts" type="list" count="2"/>
</test>
</tests>

<help><![CDATA[
Download Abstracts
===================

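Downloads abstracts from PubMed or Europe PMC for a list of publication IDs (one ID per line) and saves each abstract as a separate text file.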

]]></help>

</tool>
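The file handling around the download step can be sketched with the standard library alone: read one ID per line, write one `<id>.txt` file per abstract, and derive a collection element name from each file name. The helper names `read_ids` and `write_abstracts`, the sample IDs, and the regular expression are assumptions for illustration; the actual download through aoptk is replaced by placeholder text.

```python
import re
import tempfile
from pathlib import Path


def read_ids(path):
    """Return the non-empty, stripped IDs found in the file, one per line."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def write_abstracts(abstracts, outdir):
    """Write each (publication_id, text) pair to <outdir>/<publication_id>.txt."""
    written = []
    for publication_id, text in abstracts:
        target = Path(outdir) / f"{publication_id}.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target.name)
    return written


# A pattern of this form keeps only .txt files and drops the extension.
PATTERN = re.compile(r"(?P<designation>.+)\.txt$")

with tempfile.TemporaryDirectory() as tmp:
    id_file = Path(tmp) / "ids.txt"
    id_file.write_text("12345\n\n  67890  \n", encoding="utf-8")
    ids = read_ids(id_file)
    # Placeholder texts stand in for the downloaded abstracts.
    names = write_abstracts([(i, f"abstract for {i}") for i in ids], tmp)
    designations = [PATTERN.match(name).group("designation") for name in names]

print(ids, names, designations)
```

Skipping blank lines keeps empty IDs out of the request, and anchoring the pattern on `.txt` keeps unrelated files in the working directory out of the collection.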