evil_sdf
========

Test suite to see how chemistry toolkit implementations handle
different aspects of the SD file syntax.

While there is an SD format specification ("ctfile.pdf"), it is
incomplete and ambiguous, and has changed over time. Different
toolkits can and do have different interpretations of how to handle
specific cases.

The goal of this project is to identify these differences.  Do not
consider this a validation suite for SD file syntax.

While I would like all of the toolkits to work the same way, it's
unreasonable to make that demand. For example, I've never seen
real-world data with 'S SKP' records, or SD records with a terminal
line that was anything other than '$$$$', and I prefer the stricter de
facto SD format over the documented one.

Instead, use it as a way to spot corner cases, identify how you want
to handle them, and perhaps document your decisions for your users.

RUN TESTS
=========

'evil_sdf.py' is the main testing program. It knows how to run the
test suite against the OEChem, Open Babel, and RDKit chemistry
toolkits, as well as chemfp's record-oriented SDF parser.

List of supported toolkits
--------------------------

You may not have all of these toolkits. Here's how to get a list of
what 'evil_sdf.py' thinks you have installed:

  % python evil_sdf.py -L
  Available toolkits: openeye rdkit openbabel chemfp
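
Here's a minimal sketch of how such toolkit detection can work (my
own guess at the mechanism, not the actual evil_sdf.py
implementation): try to import each toolkit's Python bindings and
keep the ones that succeed.

  import importlib

  # Candidate toolkits and the module used to detect each one.
  TOOLKIT_MODULES = {
      "openeye": "openeye.oechem",
      "rdkit": "rdkit.Chem",
      "openbabel": "openbabel",
      "chemfp": "chemfp",
  }

  def available_toolkits():
      found = []
      for name, module_name in TOOLKIT_MODULES.items():
          try:
              importlib.import_module(module_name)
          except ImportError:
              continue
          found.append(name)
      return found

  print("Available toolkits:", " ".join(available_toolkits()))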

Run all of the suites
---------------------

I'll now run the tests using the 'rdkit' toolkit, which is the default
if it's installed:

  % python evil_sdf.py -t rdkit 
    ==== Start of suite 'non_ascii' ====
  WARNING: rdkit: tag order checks disabled for this toolkit
  AGREE: non_ascii/HBr.sdf
  AGREE: non_ascii/HBr_latin1.sdf
    ...  many lines removed ...


The tests are arranged as test suites. Each suite has multiple test
cases.

Each test suite is in its own directory. The 'non_ascii' tests are in
the 'non_ascii/' subdirectory:

  % ls -l non_ascii/
  total 112
  -rw-r--r--  1 dalke  admin   170 May 10 16:20 HBr.sdf
  -rw-r--r--  1 dalke  admin   174 May 10 16:26 HBr_0xFF.sdf
  -rw-r--r--  1 dalke  admin   167 May 10 15:13 HBr_latin1.sdf
  -rw-r--r--  1 dalke  admin   168 May 10 15:14 HBr_utf8.sdf
  -rw-r--r--  1 dalke  admin  4249 May 10 15:02 alpha-lactose_utf8.sdf
  -rw-r--r--  1 dalke  admin  3271 May  9 02:06 chebi_with_non_ascii_title.sdf
  -rw-r--r--  1 dalke  admin   192 May 12 02:46 tag_0xFF.sdf
  -rw-r--r--  1 dalke  admin   191 May 12 04:00 tag_data_0xFF.sdf
  -rw-r--r--  1 dalke  admin   178 May 12 03:57 tag_data_latin1.sdf
  -rw-r--r--  1 dalke  admin   175 May 12 04:00 tag_data_utf8.sdf
  -rw-r--r--  1 dalke  admin   173 May 12 02:38 tag_latin1.sdf
  -rw-r--r--  1 dalke  admin   173 May 12 02:42 tag_utf8.sdf
  -rw-r--r--  1 dalke  admin  2811 May 12 03:59 test_suite.json
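
Each suite directory pairs the *.sdf test files with a
'test_suite.json' definition, so a script can discover the available
suites with a glob (a small sketch):

  import glob, os

  # Each directory containing a 'test_suite.json' is a test suite.
  for path in sorted(glob.glob("*/test_suite.json")):
      print("suite:", os.path.dirname(path))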

You can run a test suite by name. I'll test OEChem on the non_ascii
test set:

  % python evil_sdf.py -t openeye non_ascii
    ==== Start of suite 'non_ascii' ====
  AGREE: non_ascii/HBr.sdf
  AGREE: non_ascii/HBr_latin1.sdf
  AGREE: non_ascii/HBr_utf8.sdf
  AGREE: non_ascii/HBr_0xFF.sdf
  AGREE: non_ascii/chebi_with_non_ascii_title.sdf
  AGREE: non_ascii/alpha-lactose_utf8.sdf
  AGREE: non_ascii/tag_latin1.sdf
  AGREE: non_ascii/tag_utf8.sdf
  AGREE: non_ascii/tag_0xFF.sdf
  AGREE: non_ascii/tag_data_latin1.sdf
  AGREE: non_ascii/tag_data_utf8.sdf
  AGREE: non_ascii/tag_data_0xFF.sdf
    ==== End of suite 'non_ascii' (agree: 12 disagree: 0) ====

Actually, all of the chemistry toolkits handle the 'non_ascii' test
suite without a problem. Next I'll use a suite that does show problems.

Run one suite
-------------

The Open Babel and RDKit toolkits store SD tag fields as
general-purpose properties, with a mapping from a tag name to the tag
data. Both use the same table to store toolkit-specific properties.
Open Babel also has a special "Comment" property used for the comment
line of an SD file.

It's possible for an evil SD file to have a tag which uses those
special internal names. As it turns out, there's no way to get the
corresponding data, because the SD parser overwrites that field.
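
For a concrete look at that shared table in RDKit, here's a short
example using one of the test files from earlier (the exact output
depends on your RDKit version):

  from rdkit import Chem

  # The record title and the SD tags land in the same per-molecule
  # property table; internal entries use reserved names like '_Name'.
  supplier = Chem.SDMolSupplier("non_ascii/HBr.sdf")
  mol = next(supplier)
  print(mol.GetProp("_Name"))      # the title line of the record
  print(mol.GetPropsAsDict())      # the SD tags, stored as properties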

Here's what I mean. I'll use the 'openbabel' toolkit to run the
'special_tags' suite:

  % python evil_sdf.py -t openbabel special_tags
    ==== Start of suite 'special_tags' ====
  WARNING: openbabel: occurrences of tag 'Comment' will not be checked for this toolkit
  WARNING: openbabel: occurrences of tag 'MOL Chiral Flag' will not be checked for this toolkit
  WARNING: openbabel: occurrences of tag 'OpenBabel Symmetry Classes' will not be checked for this toolkit
  AGREE: special_tags/rdkit_Name.sdf
  AGREE: special_tags/rdkit_stereochemDone.sdf
  AGREE: special_tags/rdkit__computedProps.sdf
  AGREE: special_tags/rdkit_numArom.sdf
  MISMATCH: special_tags/openbabel_mol_chiral_flag.sdf: tags do not match (found 2, expected 1)
  --- openbabel
  +++ reference
  @@ -1,2 +1,1 @@
  -'MOL Chiral Flag': '0'
   'MOL Chiral Flag': 'left? or right?'
  
    Comment: Open Babel combines SD tags with internal properties like 'MOL Chiral Flag'
  
  MISMATCH: special_tags/openbabel_openbabel_symmetry_classes.sdf: tags do not match
  --- openbabel
  +++ reference
  @@ -1,1 +1,1 @@
  -'OpenBabel Symmetry Classes': '1 1 1 1 1 1'
  +'OpenBabel Symmetry Classes': 'Morgan, FTW'
  
    Comment: Open Babel combines SD tags with internal properties like 'OpenBabel Symmetry Classes'
  
  AGREE: special_tags/openbabel_comment.sdf
    ==== End of suite 'special_tags' (agree: 5 disagree: 2) ====

The three 'WARNING' lines point out that by default evil_sdf will skip
the named tags if they are found in the record and are not listed as
one of the tags to check.

The 'AGREE' lines mean that the toolkit returned what the test case expected.

The first 'MISMATCH' line says the toolkit returned two tags for 'MOL
Chiral Flag' but the test case expected only one. After that is a diff
of the result, which shows that Open Babel returned both "0" and
"left? or right?" for the value.

The second 'MISMATCH' line says that Open Babel returned the expected
'OpenBabel Symmetry Classes' tag, but with the value "1 1 1 1 1 1",
when the test case expected "Morgan, FTW".

(The 'Morgan' here is an homage to Morgan of the Morgan algorithm.)


Finally, there's a test case for the 'Comment' tag. Internally, Open
Babel distinguishes "PairData" entries from other kinds of data,
including "CommentData" entries. This test makes sure that
distinction works correctly.

Run one test case
-----------------

I'll use OEChem to execute a single test case from the 'nul_char'
suite, which tests how the different toolkits handle data with an
embedded NUL ("\0") character. 

  % python evil_sdf.py -t openeye nul_char/nul_in_tag_data1.sdf 
    ==== Start of suite 'nul_char' ====
  MISMATCH: nul_char/nul_in_tag_data1.sdf: tags do not match
  --- openeye
  +++ reference
  @@ -1,1 +1,1 @@
  -'name': 'methane'
  +'name': 'meth\x00ane'
  
    Comment: It is not reasonable to expect NUL in a data line (including the first line).
  
    ==== End of suite 'nul_char' (agree: 0 disagree: 1) ====

This one says that OEChem will accept tag data which contains an
embedded NUL character (the diff output uses Python's repr()
encoding), but it will remove the NUL from the string.
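
For what it's worth, output in that style can be reconstructed with
Python's difflib (a sketch of the idea only; the actual reporting
code may differ, for example in the hunk headers):

  import difflib

  def diff_tags(toolkit_name, toolkit_tags, reference_tags):
      # Compare the repr() of each (tag name, tag data) pair from the
      # toolkit against the reference pairs from test_suite.json.
      got = ["%r: %r" % pair for pair in toolkit_tags]
      expected = ["%r: %r" % pair for pair in reference_tags]
      return "\n".join(difflib.unified_diff(
          got, expected, fromfile=toolkit_name, tofile="reference",
          lineterm=""))

  print(diff_tags("openeye",
                  [("name", "methane")],
                  [("name", "meth\x00ane")]))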

The comment is my view that this test is bogus. The toolkit should be
free to do what it wants with such unreasonable data.

What is "cr"?
-------------

The 'valid' suite includes tests like:

  AGREE: valid/tag_with_dollars.sdf
  AGREE: valid/tag_with_dollars_cr.sdf

The tests without the '_cr.sdf' suffix use the Unix newline
convention of '\n' while the tests with the '_cr.sdf' suffix use the
Windows newline convention of '\r\n'. The '\r' is a 'carriage
return', often abbreviated 'CR' or, in my case, 'cr'.

The CR test cases are auto-generated from the non-CR test cases.
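
The transformation itself is simple. Here's a sketch, assuming the
source files use only bare '\n' newlines (this is my reconstruction,
not the project's actual generator code):

  def make_cr_variant(src_path, dst_path):
      # Rewrite a Unix-newline test file using Windows CRLF newlines.
      with open(src_path, "rb") as f:
          data = f.read()
      with open(dst_path, "wb") as f:
          f.write(data.replace(b"\n", b"\r\n"))

  make_cr_variant("valid/tag_with_dollars.sdf",
                  "valid/tag_with_dollars_cr.sdf")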


Test suite definition
=====================

The test driver is in Python, but the test data is stored in a JSON
file to make it easier to use in other systems. Each test suite is in
its own directory. The file 'test_suite.json' contains the test suite
data, and looks like this:

  % cat nul_char/test_suite.json 
  {
   "evil_sdf_version": 1, 
   "name": "nul_char", 
   "test_cases": [
    {
     "filename": "methane.sdf", 
     "title": "Nothing unusual", 
     "num_carbons": 1, 
     "tags": [
      [
       "name", 
       "methane"
      ]
     ], 
     "comment": "This is not an evil test case"
    }, 
    {
     "filename": "nul_in_title.sdf", 
     "title": "The NUL is >\u0000< here", 
        .... many lines deleted ...
    }
   ]
  }

The top-level object contains three properties:

   "evil_sdf_version" - a version number for the configuration data
   "name" - the name of the test suite
   "test_cases" - a list of test case configuration objects

Each test case configuration object contains the following:

   "filename" - this is both the name of the test and the name of the
      file which contains the test data
   "title" - the first line of the SD record
   "num_carbons" - the number of carbons (currently unused)
   "tags" - a list of tag data pairs, each expressed as a two-element
      list [tag name, tag data]
   "comment" - a note about the test case
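
Since this is plain JSON, a driver in another language, or a quick
script, can read it directly. For example, in Python:

  import json

  with open("nul_char/test_suite.json") as f:
      suite = json.load(f)

  print(suite["name"], "(version %d)" % suite["evil_sdf_version"])
  for case in suite["test_cases"]:
      print(case["filename"], "-", case["title"])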

NOTE:

The title, tag name, and tag data may contain non-ASCII and
non-Unicode bytes, but JSON does not support byte strings. Instead, if
the first four characters of the string are "hex:" then the rest of
the string should be hex-decoded into a byte string.
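
A minimal sketch of that decoding rule (the helper name is my own):

  import binascii

  def decode_field(value):
      # "hex:"-prefixed fields encode raw bytes that JSON cannot
      # represent directly; anything else is an ordinary text string.
      if value.startswith("hex:"):
          return binascii.unhexlify(value[4:])
      return value

  assert decode_field("methane") == "methane"
  assert decode_field("hex:6d65746800616e65") == b"meth\x00ane"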

The SD format does not specify an encoding. I leave it to the toolkits
to figure out what to do.

Create the test suite
---------------------

The test suite configuration may be in JSON format, but I didn't
create it directly. The primary configuration is in
'make_test_suites.py'. I run that program to generate the test suite
JSON files (if needed) for each of the subdirectories:

  % python  make_test_suites.py
  Completed suite 'valid'
  Completed suite 'non_ascii'
  Completed suite 'nul_char'
  Completed suite 'special_tags'

This also generates the CR variants for all of the 'valid' test cases.