wavemol team mailing list archive
Message #00002
Let's go!
So, let's start.
At the moment, I am writing most of the information on the wiki, at
http://forthescience.org/wiki/MolecularDatabase
Plan for development:
- I will very soon commit a prototype mock database interface that
returns fake data. This will let us design a database interface that
satisfies our needs; we will replace it with the real thing at a later
stage. Having this mock data provider will also give us the chance to
start developing the web interface, so it's important to get the db to
satisfy our needs. (I will detail them in a later mail.)
- We still have to solve the problem of the ID. All the chemists out there
consider a molecule as a set of atoms connected by bonds, so they can
use InChIKey. We don't. For this project, the molecule _can_ be a set of
atoms connected by bonds, but in most cases it is a set of
atoms at given positions in XYZ space. How do we assign an ID to such a
molecule? At the moment, my algo does the following. The ID is composed
of three parts: AA-<brutepart>-<geometrypart>
For the brute formula part: take the atomic numbers of all the atoms
composing the molecule, sort them, and join them with a \n.
e.g. for H2O, take 1,1,8
(sort) 1,1,8
(join with \n) "1\n1\n8"
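As a minimal sketch of the brute formula step (the function name is made up for illustration, it is not taken from the actual code):

```python
def brute_formula_string(atomic_numbers):
    """Sort the atomic numbers and join them with a newline.

    Hypothetical helper: the name and signature are assumptions.
    """
    return "\n".join(str(z) for z in sorted(atomic_numbers))


# Water given as O, H, H: atomic numbers 8, 1, 1
water = brute_formula_string([8, 1, 1])
print(repr(water))
```

Because the numbers are sorted first, the order in which the atoms are listed does not matter for this part.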
For the geometry part:
make a list of strings, one per atom, each made of four entries, with the
format string "%d %15.10f %15.10f %15.10f". The values are the atomic
number (the first field, %d) and the three XYZ coordinates in bohr.
Join these strings with \n. This algo is actually wrong in the
current implementation, as I am not taking into account that the same
atoms can be listed in a different order.
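A rough sketch of the geometry step follows. The function name and the sort_lines switch are assumptions for illustration. Sorting the formatted lines is only one possible way to remove the dependence on atom order, it is not what the current implementation does.

```python
def geometry_string(atoms, sort_lines=False):
    """Build the geometry string from (atomic_number, x, y, z) tuples.

    Coordinates are expected in bohr. Hypothetical helper, not the
    project's actual code.
    """
    lines = ["%d %15.10f %15.10f %15.10f" % (z, x, y, zc)
             for (z, x, y, zc) in atoms]
    if sort_lines:
        # Assumption: a plain lexicographic sort of the formatted lines
        # makes the result independent of the input atom order.
        lines.sort()
    return "\n".join(lines)


# Illustrative two-atom input, listed in two different orders
a = [(8, 0.0, 0.0, 0.0), (1, 0.0, 0.0, 1.8)]
b = [(1, 0.0, 0.0, 1.8), (8, 0.0, 0.0, 0.0)]
print(geometry_string(a) == geometry_string(b))
print(geometry_string(a, sort_lines=True) == geometry_string(b, sort_lines=True))
```

One thing to watch with any fix of this kind: tiny numerical noise, or -0.0 versus 0.0, still changes the formatted text and therefore the ID.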
In both cases, the resulting string is hashed and made human-readable
with the following algo: compute the CRC32 of the string and write it as
a hex string, zero-padded on the left. Convert the hex into the actual
characters (raw bytes) and b32encode the obtained data.
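The hashing step could look roughly like this with the Python standard library. The function name is hypothetical, and the ASCII encoding and the padding to 8 hex digits (32 bits) are my assumptions.

```python
import base64
import zlib


def hash_part(text):
    """CRC32 -> zero-padded hex -> raw bytes -> base32 text.

    Hypothetical helper; encoding the input as ASCII is an assumption.
    """
    crc = zlib.crc32(text.encode("ascii")) & 0xFFFFFFFF  # force unsigned
    hex_digest = "%08x" % crc              # 8 hex digits, zero-padded
    raw = bytes.fromhex(hex_digest)        # the 4 actual bytes
    return base64.b32encode(raw).decode("ascii")


print(hash_part("1\n1\n8"))
```

Four bytes always encode to seven base32 characters plus one "=" padding character, so every part is 8 characters long. Whether to strip the "=" is left open here.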
This algo is not guaranteed to be perfect. Any proposal is welcome, and
since the first two letters of the ID (the AA) are reserved, we can change
it later. It also has a high incidence of collisions, due to the CRC32.
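To put a number on the collision risk, here is a back-of-the-envelope estimate using the standard birthday approximation. It assumes the CRC32 values are spread uniformly over the 32 bits, which is an idealization; the function name is made up.

```python
import math


def collision_probability(n, bits=32):
    """Approximate chance that at least two of n hashes coincide.

    Birthday approximation, assuming uniformly distributed hash values.
    """
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * 2 ** bits))


for n in (1000, 10000, 100000):
    print(n, round(collision_probability(n), 4))
```

Under that assumption, the chance of at least one collision in a single 32-bit part reaches about 50% at roughly 77,000 distinct strings.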
The same molecule, rotated or translated in space, gives different IDs.
This is mostly intentional, as moving a molecule can have an effect on
the results. In this sense, two spatially rototranslated molecules are
different.
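To illustrate the point about translation, here is a self-contained sketch that puts the pieces together. All names are hypothetical and the water coordinates are only illustrative values, not real data.

```python
import base64
import zlib


def _hash(text):
    # CRC32 as 8 hex digits, turned into 4 raw bytes, then base32 text
    crc = zlib.crc32(text.encode("ascii")) & 0xFFFFFFFF
    return base64.b32encode(bytes.fromhex("%08x" % crc)).decode("ascii")


def molecule_id(atoms):
    """atoms: list of (atomic_number, x, y, z), coordinates in bohr."""
    brute = "\n".join(str(z) for z in sorted(a[0] for a in atoms))
    geometry = "\n".join("%d %15.10f %15.10f %15.10f" % tuple(a)
                         for a in atoms)
    return "AA-%s-%s" % (_hash(brute), _hash(geometry))


# Illustrative coordinates only
water = [(8, 0.0, 0.0, 0.0), (1, 0.0, 1.43, 1.11), (1, 0.0, -1.43, 1.11)]
# Same molecule, shifted by 1 bohr along x
shifted = [(z, x + 1.0, y, zc) for (z, x, y, zc) in water]

print(molecule_id(water))
print(molecule_id(shifted))
```

The brute formula part of the two IDs is identical, while the geometry part differs.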
If you have any questions, feel free to ask.