Convert from DeepSMILES to SMILES using C. This package provides a C library, a (simple) command-line converter, and a C extension for Python.
This package only implements the "rings=True" and "branches=True" options from the Python DeepSMILES implementation.
This package also includes code used to fuzz-test the RDKit SMILES parser using libFuzzer from the LLVM project.
This package was developed by Andrew Dalke, an independent software developer and consultant in chemical informatics. If you are interested in hiring his services, send an email to firstname.lastname@example.org .
The cdeepsmiles executable converts about 800,000 DeepSMILES lines
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.
This package uses the Ragel State Machine Compiler for the low-level DeepSMILES tokenization.
You only need to install Ragel if you are going to modify the state machine description in cdeepsmiles.rl, because the distribution includes the Ragel-generated C code.
See cdeepsmiles.h for the public API and main.c for an example of use.
The cdeepsmiles_internal.h API might be useful for those working with DeepSMILES variants.
By default, "make" builds the libcdeepsmiles.a library and the cdeepsmiles executable.
The executable demonstrates how to use the API. It reads from stdin in "DeepSMILES" format, where the line starts with a DeepSMILES, followed by an optional id/title. The DeepSMILES ends with the first whitespace character.
Each DeepSMILES is parsed and written to stdout as a SMILES string. By default the id/title is included after the SMILES string, so the result may be used as a SMILES file. Use the --no-id command-line option to omit the ids.
Error diagnostics are written to stderr.
The cdeepsmiles executable links with the libcdeepsmiles.a library, which you may find useful for your own code.
The distribution includes a Python
setup.py for the Python/C extension named "cdeepsmiles".
The module contains the function
decode, which takes a DeepSMILES string and returns a SMILES string. It raises a ValueError if the DeepSMILES could not be parsed.
    >>> import cdeepsmiles
    >>> cdeepsmiles.decode("CCO)NP4")
    'C1C(O)NP1'
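The line format read by the executable and the ValueError behaviour of decode can be combined in a few lines of Python. The following is a minimal sketch, not the package's own code: the helper name convert_lines is hypothetical, the decoder is passed in as a parameter (in real use it would be cdeepsmiles.decode), and skipping blank lines and joining the SMILES and id with a single space are assumptions of this sketch.

```python
def convert_lines(lines, decode, include_id=True):
    """Convert "DeepSMILES [id/title]" lines to SMILES lines.

    `decode` is any callable that maps a DeepSMILES string to a SMILES
    string and raises ValueError on failure (for example cdeepsmiles.decode).
    Returns (results, errors): converted lines and error messages.
    """
    results, errors = [], []
    for lineno, line in enumerate(lines, 1):
        # The DeepSMILES ends at the first whitespace; the rest is the id/title.
        fields = line.split(None, 1)
        if not fields:
            continue  # assumption: blank lines are skipped
        deepsmiles = fields[0]
        title = fields[1].rstrip() if len(fields) > 1 else ""
        try:
            smiles = decode(deepsmiles)
        except ValueError as err:
            errors.append(f"line {lineno}: {err}")
            continue
        if include_id and title:
            results.append(f"{smiles} {title}")
        else:
            results.append(smiles)
    return results, errors
```

With the extension installed, something like `convert_lines(sys.stdin, cdeepsmiles.decode)` would roughly mirror the default behaviour of the command-line converter, and `include_id=False` would mirror --no-id.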
The unit tests are in the tests/ directory:
    % cd tests/
    % python -m unittest discover
The Makefile includes two build targets for use in fuzzing with libFuzzer from the LLVM project. You will need to configure the Makefile yourself because these targets are not meant to be portable. You'll also need to read the libFuzzer documentation.
The fuzz_cdeepsmiles binary fuzzes the DeepSMILES to SMILES converter in the cdeepsmiles package itself. Its build script creates libcdeepsmiles_fuzzy.a, which is the same as libcdeepsmiles.a except that it is built with the compiler flags for fuzzing.
The fuzz_SmilesToMol binary fuzzes the RDKit::SmilesToMol function in the RDKit C++ API. My conjecture is that fuzzing that function directly with SMILES strings is difficult because it's almost impossible for a non-syntax-directed fuzzer to match parentheses and ring-closure values.
Instead, the fuzzer might find it easier to modify a DeepSMILES, which can then be transformed into a SMILES for further checking.
Experiments with libFuzzer seem to show that using DeepSMILES generates more interesting and complex structures than using SMILES.
Some aspects of the DeepSMILES syntax appear to make it more difficult for fuzzing tools to work. To work around this, I have developed FuzzSMILES as a DeepSMILES variant. It is not meant to handle the full SMILES chemistry space. In particular, it does not cover the inorganics, nor does it allow large macrocycles or branches, nor does it allow disconnected fragments.
The differences are:
A. The atom symbols are not based on the SMILES atoms. Instead, they are upper-case letters, each with a pre-defined mapping to a SMILES atom. These are:

FuzzSMILES atom | SMILES atom
--------------- | -----------
A | c
B | B
C | C
F | F
I | I
L | Cl
M | n
N | N
O | O
P | P
Q | [nH]
R | Br
S | S
T | s
U | o
V | [Si]
W | [Se]
X | [se]
Y | [C]
Z | [Sn]

These were chosen from the most common atom tokens in ChEMBL which were not typically seen as counter-ions. New symbols may be added and current ones may be renamed in the future.
These are all one-letter symbols, which makes it very easy to parse.
B. Bracket atoms and atom modifiers are not supported. There is no way to specify isotopes, chirality, hydrogen count, or charge.
A future version may use the lower-case letters to encode the most common atom symbols with charge and hydrogen count. Isotopes will never be supported.
C. The bond symbols are '[', '\', and ']' for single, double, and triple bonds, respectively. Stereochemistry (the '/' and '\' bonds in SMILES), aromatic (':'), and quadruple ('$') bonds are not supported.
D. The closures are the ASCII characters from '0'/ASCII 48, which corresponds to a '3' in DeepSMILES, that is, a ring of size 3, to '='/ASCII 61, which corresponds to a '%16' in DeepSMILES, that is, a ring of size 16.
In DeepSMILES, the closures 0, 1, and 2 were never valid, so the valid closures were the single digits 3 to 9, the double-digit values %nn, and the higher-digit %(nnn). I conjecture that this change in syntax makes it hard for methods like fuzzing and machine learning techniques like genetic algorithms to work well.
This change means that FuzzSMILES does not handle large rings. Larger rings may be supported in the future with new notation.
E. The branches are the ASCII characters from '!'/ASCII 33, which corresponds to a ')' in DeepSMILES, to '/'/ASCII 47, which corresponds to a ')))))))))))))))' (15 ')'s).
This change means that FuzzSMILES does not handle R-groups with more than 15 (heavy) atoms. Larger branches may be supported in the future with new notation.
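The character arithmetic in D and E can be sketched in Python as follows. This is only an illustration of the mapping described above, not the library's implementation; the function names are hypothetical.

```python
def closure_to_ring_size(ch):
    """Ring size for a FuzzSMILES closure character ('0' -> 3 ... '=' -> 16)."""
    code = ord(ch)
    if not 48 <= code <= 61:
        raise ValueError(f"not a FuzzSMILES closure character: {ch!r}")
    return code - 48 + 3


def closure_to_deepsmiles(ch):
    """DeepSMILES closure token: a single digit for 3-9, %nn for 10-16."""
    size = closure_to_ring_size(ch)
    return str(size) if size <= 9 else f"%{size}"


def branch_to_deepsmiles(ch):
    """DeepSMILES close parentheses for a branch character ('!' -> 1 ... '/' -> 15)."""
    code = ord(ch)
    if not 33 <= code <= 47:
        raise ValueError(f"not a FuzzSMILES branch character: {ch!r}")
    return ")" * (code - 32)
```

Because both ranges are contiguous runs of ASCII characters, every byte in the range is a valid token, which is the property the conjecture above relies on.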
F. Dot disconnection ('.') is not supported.
The API function cdeepsmiles_parse_fuzzsmiles parses a FuzzSMILES into the standard cdeepsmiles_t parser data structure.
The --fuzz flag of the cdeepsmiles executable tells it to parse the input as FuzzSMILES instead of as a DeepSMILES.
The cdeepsmiles Python extension implements the function decode_fuzzsmiles, which decodes a FuzzSMILES into a SMILES.
The two fuzzing tools support a compile-time option to use a FuzzSMILES instead of a DeepSMILES. This will likely change to use an environment variable so it can be selected at run-time.
    % echo 'AB[CFI\LM]OPQ"A0' | ./cdeepsmiles --fuzz
    cB-CFI=Cln1#O(P[nH])c1
    % echo 'AAAO\!AAA3' | ./cdeepsmiles --fuzz
    c1cc(O)ccc1
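To make rules A through F concrete, here is a minimal Python sketch that rewrites a FuzzSMILES string as the equivalent DeepSMILES text. It is an illustration only: the library parses FuzzSMILES directly into its cdeepsmiles_t structure rather than going through text, and the function name fuzzsmiles_to_deepsmiles is hypothetical. The sketch also assumes that the backslash before '!' in the second example above is shell escaping rather than a double bond, since the output contains no double bond.

```python
# FuzzSMILES atom letters and their SMILES atoms (rule A).
ATOMS = {
    "A": "c", "B": "B", "C": "C", "F": "F", "I": "I", "L": "Cl",
    "M": "n", "N": "N", "O": "O", "P": "P", "Q": "[nH]", "R": "Br",
    "S": "S", "T": "s", "U": "o", "V": "[Si]", "W": "[Se]", "X": "[se]",
    "Y": "[C]", "Z": "[Sn]",
}

# FuzzSMILES bond symbols and their SMILES bonds (rule C).
BONDS = {"[": "-", "\\": "=", "]": "#"}


def fuzzsmiles_to_deepsmiles(fuzz):
    """Translate a FuzzSMILES string into DeepSMILES text, one character at a time."""
    out = []
    for pos, ch in enumerate(fuzz):
        code = ord(ch)
        if ch in ATOMS:
            out.append(ATOMS[ch])
        elif ch in BONDS:
            out.append(BONDS[ch])
        elif 33 <= code <= 47:
            # Rule E: '!' is one ')', up to '/' for fifteen.
            out.append(")" * (code - 32))
        elif 48 <= code <= 61:
            # Rule D: '0' is ring size 3, up to '=' for ring size 16.
            size = code - 48 + 3
            out.append(str(size) if size <= 9 else f"%{size}")
        else:
            raise ValueError(f"unsupported character {ch!r} at position {pos}")
    return "".join(out)
```

For the first example above, `fuzzsmiles_to_deepsmiles(r'AB[CFI\LM]OPQ"A0')` gives `cB-CFI=Cln#OP[nH]))c3`, which is the DeepSMILES form of the SMILES shown in the output.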
NOTE: the FuzzSMILES code has not been extensively tested and there are no automated tests for it.