#cdeepsmiles 0.6 (in-development)

Convert from DeepSMILES to SMILES using C. This package provides a C library, a (simple) command-line converter, and a C extension for Python.

This package only implements the "rings=True" and "branches=True" options from the Python DeepSMILES implementation.

This package also includes code used to fuzz-test the RDKit SMILES parser using libFuzzer from the LLVM project.

#Advertising

This package was developed by Andrew Dalke, an independent software developer and consultant in chemical informatics. If you are interested in hiring his services, send an email to dalke@dalkescientific.com .

#Performance

The cdeepsmiles executable converts about 800,000 DeepSMILES lines per second.

#License

The copyright to this package is held by Andrew Dalke (dalke@dalkescientific.com). It is licensed under the GNU GPLv3 license:

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <https://www.gnu.org/licenses/>.

#Dependencies

This package uses the Ragel State Machine Compiler for the low-level DeepSMILES tokenization.

You will only need to install ragel if you are going to modify the state machine description in cdeepsmiles.rl, because the distribution includes the ragel-generated C code in cdeepsmiles.c.

#Public API

See cdeepsmiles.h for the public API. See main.c for an example of use.

The cdeepsmiles_internal.h API might be useful for those working with DeepSMILES variants.

#cdeepsmiles executable

By default, "make" builds the libcdeepsmiles.a library and the cdeepsmiles executable.

The executable demonstrates how to use the API. It reads from stdin in "DeepSMILES" format, where each line starts with a DeepSMILES string, followed by an optional id/title. The DeepSMILES string ends at the first whitespace character.
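
As an illustration of that input format, here is a minimal Python sketch of how one line could be split into its DeepSMILES and id/title parts. The function name `split_record` is hypothetical, and the handling of whitespace around the title is an assumption; see main.c for what the executable actually does.

```python
def split_record(line):
    """Split one input line into (deepsmiles, title).

    The DeepSMILES is everything up to the first whitespace character.
    Assumption: the title is the rest of the line with surrounding
    whitespace removed, or None if nothing follows the DeepSMILES.
    """
    line = line.rstrip("\r\n")
    for i, ch in enumerate(line):
        if ch.isspace():
            title = line[i:].strip()
            return line[:i], (title or None)
    return line, None


if __name__ == "__main__":
    print(split_record("CCO)NP4 example-id\n"))
    print(split_record("CCO)NP4\n"))
```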

Each DeepSMILES is parsed and written to stdout as a SMILES string. By default the id/title is included after the SMILES string, so the result may be used as a SMILES file. Use the --no-id command-line option so the ids are not included.

Error diagnostics are written to stderr.

#libcdeepsmiles.a library

The cdeepsmiles executable links with the libcdeepsmiles.a library, which you may find useful for your own code.

#Python library

The distribution includes a Python setup.py for the Python/C extension named "cdeepsmiles".

The module contains the function decode, which takes a DeepSMILES string and returns a SMILES string. It raises a ValueError if the DeepSMILES could not be parsed.

    >>> import cdeepsmiles
    >>> cdeepsmiles.decode("CCO)NP4")
    'C1C(O)NP1'
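
Because decode raises ValueError on unparsable input, batch conversion usually wants to collect failures instead of stopping at the first one. The following is a minimal sketch of that pattern. The helper name `decode_all` is hypothetical, and the decoder is passed in as a parameter so the sketch does not depend on the extension being installed.

```python
def decode_all(decode, deepsmiles_list):
    """Decode each DeepSMILES string with the given decode function.

    Returns two lists: (input, smiles) pairs for successes and
    (input, error message) pairs for inputs that raised ValueError.
    """
    converted, failed = [], []
    for text in deepsmiles_list:
        try:
            converted.append((text, decode(text)))
        except ValueError as err:
            failed.append((text, str(err)))
    return converted, failed


# Intended use, assuming the cdeepsmiles extension has been built:
#   import cdeepsmiles
#   converted, failed = decode_all(cdeepsmiles.decode, ["CCO)NP4"])
```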

#Tests

The unit tests are in the tests/ subdirectory.

    % cd tests/
    % python -m unittest discover

#Fuzzing targets

The Makefile includes two build targets for use in fuzzing with libFuzzer from the LLVM project. You will need to configure the Makefile yourself as they are not meant to be portable code. You'll also need to read the libFuzzer documentation.

The fuzz_cdeepsmiles binary fuzzes the DeepSMILES to SMILES converter in the cdeepsmiles package itself. The build script for it creates a libcdeepsmiles_fuzzy.a which is the same as libcdeepsmiles.a except that it includes the compiler flags for fuzzing.

The fuzz_SmilesToMol binary fuzzes the RDKit::SmilesToMol function in the RDKit C++ API. My conjecture is that fuzzing that function directly with SMILES strings is difficult because it's almost impossible for a non-syntax-directed fuzzer to match parens and closure values.

Instead, the fuzzer might find it easier to modify a DeepSMILES, which can then be transformed into a SMILES for further checking.

#"FuzzSMILES" (Experimental)

Experiments with libFuzzer seem to show that using DeepSMILES generates more interesting and complex structures than using SMILES.

Some aspects of the DeepSMILES syntax appear to make it more difficult for fuzzing tools to work. Instead, I have developed FuzzSMILES as a DeepSMILES variant. It is not meant to handle the full SMILES chemistry space. In particular, it does not cover the inorganics, nor does it allow large macrocycles or branches, nor does it allow disconnected fragments.

The differences are:

A. The atom symbols are not based on the SMILES atoms. Instead, they are upper-case letters, each with a pre-defined mapping to a SMILES atom. These are:

  FuzzSMILES  |  SMILES
     atom     |   atom
  ----------- | -------
       A      |    c
       B      |    B
       C      |    C
       F      |    F
       I      |    I
       L      |    Cl
       M      |    n
       N      |    N
       O      |    O
       P      |    P
       Q      |    [nH]
       R      |    Br
       S      |    S
       T      |    s
       U      |    o
       V      |    [Si]
       W      |    [Se]
       X      |    [se]
       Y      |    [C]
       Z      |    [Sn]

These were chosen from the most common atom tokens in ChEMBL which were not typically seen as counter-ions. New symbols may be added and current ones may be renamed in the future.

These are all one-letter symbols, which makes it very easy to parse.
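
To show how little parsing the one-letter symbols need, here is a minimal Python sketch of the table above as a dictionary lookup. The names `FUZZ_ATOMS` and `atom_to_smiles` are hypothetical and are not part of the cdeepsmiles API.

```python
# FuzzSMILES atom symbol -> SMILES atom, copied from the table above.
FUZZ_ATOMS = {
    "A": "c", "B": "B", "C": "C", "F": "F", "I": "I",
    "L": "Cl", "M": "n", "N": "N", "O": "O", "P": "P",
    "Q": "[nH]", "R": "Br", "S": "S", "T": "s", "U": "o",
    "V": "[Si]", "W": "[Se]", "X": "[se]", "Y": "[C]", "Z": "[Sn]",
}


def atom_to_smiles(symbol):
    """Return the SMILES atom for a one-letter FuzzSMILES atom symbol."""
    try:
        return FUZZ_ATOMS[symbol]
    except KeyError:
        raise ValueError(f"unsupported FuzzSMILES atom symbol: {symbol!r}") from None


if __name__ == "__main__":
    # Translate a run of atom symbols, one character at a time.
    print("".join(atom_to_smiles(ch) for ch in "ALQ"))
```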

B. Bracket atoms and atom modifiers are not supported. There is no way to specify isotopes, chirality, hydrogen count, or charge.

A future version may use the lower-case letters to encode the most common atom symbols with charge and hydrogen count. Isotopes will never be supported.

C. The bond symbols are '[', '\', and ']' for single, double, and triple bonds, respectively. Stereochemistry (the '/' and '\' bonds in SMILES), aromatic (':'), and quadruple bonds ("$") are not supported.

D. The closures are the ASCII characters from '0' (ASCII 48), which corresponds to a '3' in DeepSMILES, that is, a ring of size 3, to '=' (ASCII 61), which corresponds to a '%16' in DeepSMILES, that is, a ring of size 16.

In DeepSMILES, the closures 0, 1, and 2 were never valid, so the valid closures were the single digits 3 to 9, the double-digit values %nn, and the higher-digit %(nnn). I conjecture that this change of syntax between ring sizes makes it hard for methods like fuzzing and machine learning techniques like genetic algorithms to work well.

This change means that FuzzSMILES does not handle large rings. Larger rings may be supported in the future with new notation.
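
The closure encoding amounts to simple arithmetic on the character code. Here is a minimal Python sketch, assuming the characters between the two stated endpoints map linearly, so that ring size = ASCII code - 48 + 3. The function names are hypothetical.

```python
def closure_ring_size(ch):
    """Return the ring size encoded by a FuzzSMILES closure character."""
    code = ord(ch)
    if not 48 <= code <= 61:  # '0' through '='
        raise ValueError(f"not a FuzzSMILES closure character: {ch!r}")
    return code - 48 + 3


def closure_to_deepsmiles(ch):
    """Return the DeepSMILES closure token for a FuzzSMILES closure character.

    Sizes 3 to 9 are single digits; sizes 10 to 16 use the %nn form.
    """
    size = closure_ring_size(ch)
    return str(size) if size <= 9 else f"%{size}"


if __name__ == "__main__":
    for ch in "03:=":
        print(repr(ch), closure_ring_size(ch), closure_to_deepsmiles(ch))
```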

E. The branches are the ASCII characters from '!'/ASCII 33, which corresponds to a ')' in DeepSMILES, to '/'/ASCII 47, which corresponds to a ')))))))))))))))' (15 ')'s).

This change means that FuzzSMILES does not handle R-groups with more than 15 (heavy) atoms. Larger branches may be supported in the future with new notation.
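
The branch encoding follows the same pattern as the closures. Here is a minimal Python sketch, assuming the characters between '!' and '/' map linearly to the number of closing parentheses (ASCII code - 33 + 1). The function name is hypothetical.

```python
def branch_to_deepsmiles(ch):
    """Return the run of ')' characters encoded by a FuzzSMILES branch character."""
    code = ord(ch)
    if not 33 <= code <= 47:  # '!' through '/'
        raise ValueError(f"not a FuzzSMILES branch character: {ch!r}")
    return ")" * (code - 33 + 1)


if __name__ == "__main__":
    for ch in '!"/':
        print(repr(ch), branch_to_deepsmiles(ch))
```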

F. Dot disconnection ('.') is not supported.

#Using FuzzSMILES

The API function cdeepsmiles_parse_fuzzsmiles parses a FuzzSMILES into the standard cdeepsmiles_t parser data structure.

The --fuzz flag of the cdeepsmiles executable tells it to parse the input as FuzzSMILES instead of as a DeepSMILES.

The cdeepsmiles Python extension implements the function decode_fuzzsmiles, which decodes a FuzzSMILES into a SMILES.

The two fuzzing tools support a compile-time option to use a FuzzSMILES instead of a DeepSMILES. This will likely change to use an environment variable so it can be selected at run-time.

    % echo 'AB[CFI\LM]OPQ"A0' | ./cdeepsmiles --fuzz
    cB-CFI=Cln1#O(P[nH])c1
    % echo 'AAAO\!AAA3' | ./cdeepsmiles --fuzz
    c1cc(O)ccc1

NOTE: the FuzzSMILES code has not been extensively tested and there are no automated tests for it.