from collections.abc import Collection

import pytest
from datasets import Dataset
from pydantic import JsonValue

from ether0.data import (
    SMILES_PATTERN,
    get_problem_categories_from_datasets,
    get_problem_category,
)
from ether0.models import RewardFunctionInfo
from ether0.rewards import EVAL_FUNCTIONS


def test_get_problem_categories_from_datasets(ether0_benchmark_test: Dataset) -> None:
    assert get_problem_categories_from_datasets(ether0_benchmark_test) == {
        "functional-group",
        "molecule-completion",
        "molecule-formula",
        "molecule-name",
        "oracle-solubility",
        "property-cat-eve",
        "property-cat-safety",
        "property-cat-smell",
        "property-regression-adme",
        "property-regression-ld50",
        "property-regression-pka",
        "reaction-prediction",
        "retro-synthesis",
        "simple-formula",
    }


UNVERIFIABLE_PROBLEM_CATEGORY_PREFIXES_TO_EXCLUDE: Collection[str] = {
    "oracle-solubility",  # 'ideal' is not actually an answer
    "retro-synthesis",  # 'ideal' is not actually an answer
}


def test_evals(ether0_benchmark_test: Dataset) -> None:
    failures = []
    for row in ether0_benchmark_test:
        reward_info = RewardFunctionInfo.model_validate(row["solution"])
        fxn_name, answer_info, problem_type = tuple(reward_info.model_dump().values())
        problem_category = get_problem_category(problem_type)
        if (
            problem_category in UNVERIFIABLE_PROBLEM_CATEGORY_PREFIXES_TO_EXCLUDE
            or problem_category
            == "molecule-completion"  # Molc had no 'ideal's when this was made
        ):
            continue
        metadata: dict[str, JsonValue] = {}
        try:
            if problem_category.startswith("property"):
                yhat = answer_info
            else:
                assert row["ideal"]
                yhat = row["ideal"]
            assert (
                EVAL_FUNCTIONS[fxn_name](yhat=yhat, y=answer_info, metadata=metadata)
                == 1.0
            )
        except AssertionError:
            failures.append((problem_category, row["id"], metadata))
    assert not failures
TEST_REASONING_TEXT = (
"Let's analyze the given molecules and try to predict their LD50 values. LD50"
" refers to the lethal dose at which 50% of the test organisms die. A lower LD50"
" means higher toxicity, and a higher LD50 indicates lower toxicity. We need to"
" identify structural features that relate to toxicity.\n\nThe question leaves open"
" the possibility that none of the compounds have an LD50 of 320 mg/kg. Let's"
" consider each molecule individually:\n\n1."
" ClC1=C(C=CC(=C1)Cl)C1(OCC(O1)COC1=CC=C(C=C1)N1CCN(CC1)C(C)=O)CN1C=NC=C1: This"
" molecule appears to be quite complex. It has a dichloro-substituted aromatic"
" ring, an ether linkage, a morpholine ring, a piperazine ring, and an imidazole"
" ring. The presence of two chlorine atoms on the phenyl ring could suggest some"
" interaction with biological targets. The molecule also has a morpholine and"
" piperazine moiety which could contribute to binding with receptors or enzymes."
" The presence of an amide group might indicate some polarity, but the overall"
" structure looks relatively lipophilic (nonpolar) given the aromatic rings and"
" alkyl chains.\n\n2."
" ClC1=C(C=CC(=C1)Cl)[C@]1(OC[C@@H](O1)COC1=CC=C(C=C1)N1CCN(CC1)C1=CC=C(C=C1)N1C(N(N=C1)[C@H](CC)C)=O)CN1N=CN=C1:" # noqa: E501
" This is a very complex molecule, with multiple rings, stereocenters, and"
" heteroatoms. It's a distinct structure and appears to be larger than the first"
" molecule. We can see a furan ring, a pyrazole ring, an amide group, and other"
" major differences. This change in the rings and other functional groups is likely"
" to significantly change the molecular properties compared to the first"
" molecule.\n\n3."
" [2H]C(C(=O)N1CCN(CC1)C1=CC=C(C=C1)OCC1O[C@@](OC1)(CN1C=NC=C1)C1=C(C=C(C=C1)Cl)Cl)([2H])[2H]:" # noqa: E501
" This molecule, labeled with deuterium, has multiple rings including a piperazine,"
" furan, a substituted imidazole, and a dichlorinated phenyl ring. It also includes"
" an ester group which is sometimes associated with higher toxicity compared to"
" simple ethers.\n\nThinking about general principles of toxicity, lipophilicity"
" (fat solubility) is often related to higher toxicity. A molecule with a marked"
" lipophilic character can often accumulate in fatty tissues and interact with the"
" cell membrane, affect cellular transport or receptor activity. This could lead to"
" higher toxicity by interfering with normal cellular function. Similarly, the"
" presence of chlorine atoms can sometimes contribute to toxicity due to possible"
" metabolic activation to reactive intermediates. However, the position and nature"
" of other substituents and functional groups can influence how chlorine"
" substitutions modulate toxicity. For example, some chlorinated compounds are"
" relatively non-toxic.\n\nConsidering the size and complexity of the molecules, we"
" should think about their potential metabolic pathways. Large molecules can be"
" metabolized through various pathways, potentially leading to reactive"
" intermediates that interact with biological molecules. Metabolites of these"
" compounds might be more or less toxic than the initial molecules, and the"
" metabolic pathways themselves might be quite different. Perhaps one of the"
" metabolites could be the reason for an LD50 of 320 mg/kg. Alternatively, a"
" compound might be relatively non-toxic in itself, but its presence can alter"
" enzyme activity or other metabolic processes and indirectly lead to cell"
" damage.\n\nComparing the three molecules. Molecules 1 and 2 share some structural"
" features like the dichloro-substituted aromatic ring and the presence of a"
" morpholine ring system. However, they also have distinct differences in the"
" connectivity and presence of additional rings, including likely some more polar"
" and/or sterically bulky substituents. Molecule 3 has different ring systems and"
" the addition of both a deuterated methyl group and an ester group which adds"
" polar character and can often activate adjacent portions of the molecule by"
" metabolic oxygenation.\n\nLet's think about bioreactivity beyond simple chemical"
" interactions. Structures can influence how a molecule interacts with biological"
" receptors or enzymes. The size and shape of these molecules and the nature of the"
" functional groups can determine the extent of the molecule's binding interactions"
" with biomolecules. Some conformationally adaptable structures might bind strongly"
" to targets and interfere with crucial pathways, which can lead to toxicity."
" Therefore, weaknesses in essential molecular machinery could have similar"
" negative effects if bound by those biomolecules.\n\nIf one of these molecules has"
" an LD50 of 320 mg/kg, it suggests moderate toxicity. It could be that one of the"
" molecules doesn't have the necessary structural features to interact strongly"
" with critical biological targets for high toxicity, and/or it might be"
" metabolized to relatively non-toxic products, such as carbon dioxide and water."
" Thus, while the molecules share some features with other potentially bioactive"
" molecules, it could be that they themselves are not exceptionally potent."
)
NO_SMILES_TEXT = "This text does not contain any SMILES"


@pytest.mark.parametrize(
    ("text", "expected_answer"),
    [
        (
            TEST_REASONING_TEXT,
            [
                "ClC1=C(C=CC(=C1)Cl)C1(OCC(O1)COC1=CC=C(C=C1)N1CCN(CC1)C(C)=O)CN1C=NC=C1",
                "ClC1=C(C=CC(=C1)Cl)[C@]1(OC[C@@H](O1)COC1=CC=C(C=C1)N1CCN(CC1)C1=CC=C(C=C1)N1C(N(N=C1)[C@H](CC)C)=O)CN1N=CN=C1",
                "[2H]C(C(=O)N1CCN(CC1)C1=CC=C(C=C1)OCC1O[C@@](OC1)(CN1C=NC=C1)C1=C(C=C(C=C1)Cl)Cl)([2H])[2H]",
            ],
        ),
        (
            NO_SMILES_TEXT,
            [],
        ),
    ],
)
def test_extract_smiles_from_text(text: str, expected_answer: list[str]) -> None:
    assert sorted(SMILES_PATTERN.findall(text)) == sorted(expected_answer)