Data correction 21.06.24
We reviewed and corrected several structures in response to remarks from challenge participants.
The corrections were done mainly for mixtures and salts.
The structure check was extended to prioritise structures provided by the ACS CAS service.
In cases where the CAS suggestions were ambiguous, we used structures retrieved from
PubChem. The structural information and descriptors (see below) were also updated.
Leaderboard set was released 15.08.24
You may use this set to increase the accuracy of your model.
The Challenge data were collected from the EPA dataset, which was split into training, leaderboard and blind sets.
The datasets (SMILES and activity for the training set) can be retrieved from the original
EPA dataset or downloaded from these links:
You can use these sets to develop and test models directly within OCHEM and/or to calculate
and export descriptors for external model development.
A preliminary analysis (OCHEM results for the training set) showed that the five best
cross-validation Root Mean Squared Error (RMSE) values were 24-25% and 21-23% for the
training and the leaderboard sets, respectively. Data for the leaderboard
set were released on August 15th.
The higher accuracy obtained for the leaderboard set could be the result of (a) more
accurate structural information for the molecules in this set and/or (b) the data split procedure used.
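For participants who build models outside OCHEM, the cross-validation RMSE quoted above can be approximated along the following lines. This is only a minimal sketch, not the OCHEM procedure: it assumes 5-fold cross-validation with pooled out-of-fold predictions, and the names rmse, cross_validated_rmse, fit_predict and mean_baseline are hypothetical helpers introduced here for illustration.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root Mean Squared Error between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def cross_validated_rmse(x, y, fit_predict, n_folds=5, seed=0):
    """Pool out-of-fold predictions over n_folds and return their RMSE.

    fit_predict(train_x, train_y, test_x) is a hypothetical callback that
    trains any model on the training part and returns predictions for test_x.
    """
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    # Every n_folds-th shuffled index goes into the same fold.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    predictions = [None] * len(y)
    for held_out in folds:
        held_out_set = set(held_out)
        train = [i for i in indices if i not in held_out_set]
        fold_pred = fit_predict(
            [x[i] for i in train],
            [y[i] for i in train],
            [x[i] for i in held_out],
        )
        for i, p in zip(held_out, fold_pred):
            predictions[i] = p
    return rmse(y, predictions)


def mean_baseline(train_x, train_y, test_x):
    """Trivial reference model: always predict the training-set mean."""
    mean = sum(train_y) / len(train_y)
    return [mean] * len(test_x)
```

Any model can be plugged in through the callback; mean_baseline ignores the descriptors and only gives a reference value that a descriptor-based model should beat. The resulting RMSE is in the same units as the activity values.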
The performance on the leaderboard set of a Random Forest model developed using AlogPS + OEstate descriptors (the baseline model) is shown as the entry for itetko_acs.
This user did not participate in the Challenge, nor did the other organisers of the Challenge.