wavemol team mailing list archive

Let's go!

To: wavemol@xxxxxxxxxxxxxxxxxxx
From: Stefano Borini <stefano.borini@xxxxxxxxxxxxxxxx>
Date: Tue, 08 Sep 2009 01:02:35 +0200
User-agent: Thunderbird 2.0.0.19 (Macintosh/20081209)

So, let's start.

At the moment, I am writing most of the information on the wiki, at

http://forthescience.org/wiki/MolecularDatabase

Plan for development:

- I will very soon commit a prototype mock database interface which will return fake data. This will allow us to design a database interface that satisfies our needs. We will replace it with the real thing at a later stage. Having this mock data provider will also give us the chance to start developing the web interface, so it's important to get the db to satisfy our needs. (I will detail them in a later mail.)

- We have the problem of the ID to be solved. All the chemists out there consider a molecule as a set of atoms connected by bonds, so they can use InChIKey. We don't. For this project, the molecule _can_ be a set of atoms connected by bonds, but in the most frequent cases it is a set of atoms at given positions in XYZ space. How do we assign an ID to a molecule? At the moment, my algo does the following. The ID is composed of three parts: AA-<brutepart>-<geometrypart>

For the brute formula part: take all the atomic numbers of the atoms composing the molecule, sort them, and join them with \n.

e.g. for H2O: take 1,1,8
      (sort) 1,1,8
      (join with \n) "1\n1\n8"
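
As a minimal sketch of the brute formula step described above (the function name is hypothetical, and sorting numerically rather than as strings is an assumption, since the mail does not say which is meant):

```python
def brute_formula_string(atomic_numbers):
    """Sort the atomic numbers and join them with newlines.

    Assumption: the sort is numeric, so 8 comes before 10.
    """
    return "\n".join(str(z) for z in sorted(atomic_numbers))


# Water: one oxygen (Z=8) and two hydrogens (Z=1), in arbitrary input order.
water = [8, 1, 1]
brute = brute_formula_string(water)
print(repr(brute))
```

Because the numbers are sorted first, the input order of the atoms does not affect this part of the ID.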

For the geometry part:

Make a list of strings, one per atom, each with four entries and the following format string: "%d %15.10f %15.10f %15.10f". Values are filled with the atomic number (first field) and the three XYZ coordinates in bohr. Join these strings with \n. This algo is actually wrong in the current implementation, as I am not taking into account the possibility of a different order of the lines.
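
A rough sketch of the geometry string, using the format string given above. The function name and the tuple layout are hypothetical. The optional sort_lines flag is only one possible way to address the line-order problem and is not part of the current algo:

```python
LINE_FORMAT = "%d %15.10f %15.10f %15.10f"


def geometry_string(atoms, sort_lines=False):
    """Build the geometry string from (Z, x, y, z) tuples, coordinates in bohr.

    sort_lines is an illustrative assumption: sorting the formatted lines
    makes the result independent of the order in which atoms are listed.
    """
    lines = [LINE_FORMAT % (z, x, y, zc) for (z, x, y, zc) in atoms]
    if sort_lines:
        lines = sorted(lines)
    return "\n".join(lines)


# Illustrative water geometry (approximate values, in bohr).
water = [
    (8, 0.0, 0.0, 0.0),
    (1, 0.0, 1.43, 1.11),
    (1, 0.0, -1.43, 1.11),
]
reordered = list(reversed(water))

# Same molecule, different atom order: the unsorted strings differ.
print(geometry_string(water) == geometry_string(reordered))
# With sorted lines the two strings agree.
print(geometry_string(water, True) == geometry_string(reordered, True))
```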

For both cases, the resulting string is hashed and made human-readable with the following algo: compute a hex representation of a CRC32 of the string, zero padded on the left side. Convert the hex into actual characters and b32encode the obtained data.
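
One possible reading of this hashing step, as a sketch with Python's standard library. The function name is hypothetical, and encoding the string as ASCII before hashing is an assumption (needed in Python 3, where crc32 works on bytes). The trailing "=" padding produced by b32encode is kept here, since the mail does not say whether it is stripped:

```python
import base64
import binascii
import zlib


def hash_string(text):
    """CRC32 -> zero-padded hex -> raw bytes -> base32 text."""
    # Mask to get an unsigned 32-bit value on every Python version.
    crc = zlib.crc32(text.encode("ascii")) & 0xFFFFFFFF
    # Eight hex digits, zero padded on the left.
    hex_repr = "%08x" % crc
    # "Convert the hex into actual characters": four raw bytes.
    raw = binascii.unhexlify(hex_repr)
    # Four bytes encode to seven base32 characters plus one "=".
    return base64.b32encode(raw).decode("ascii")


print(hash_string("1\n1\n8"))
```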

This algo is not guaranteed to be perfect. Any proposal is welcome, and as the first two letters of the ID (the AA) are reserved, we can change it later. It also has a high incidence of collisions, due to the CRC32.

The same molecule, rotated or translated in space, gives different IDs. This is mostly intentional, as moving a molecule can have an effect on the results. In this sense, two spatially rototranslated molecules are different.
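
To illustrate, here is a sketch that assembles a full ID under the same assumptions as the description above and compares a molecule with a translated copy. All function names, the "-" separator handling, and the water coordinates are illustrative, not taken from the actual implementation:

```python
import base64
import binascii
import zlib


def hash_string(text):
    """CRC32 -> zero-padded hex -> raw bytes -> base32 (ASCII input assumed)."""
    crc = zlib.crc32(text.encode("ascii")) & 0xFFFFFFFF
    return base64.b32encode(binascii.unhexlify("%08x" % crc)).decode("ascii")


def brute_string(atoms):
    """Sorted atomic numbers joined with newlines."""
    return "\n".join(str(z) for z in sorted(a[0] for a in atoms))


def geometry_string(atoms):
    """One formatted line per atom, in input order, joined with newlines."""
    return "\n".join("%d %15.10f %15.10f %15.10f" % a for a in atoms)


def molecule_id(atoms, prefix="AA"):
    """Compose <prefix>-<brutepart>-<geometrypart>."""
    parts = [prefix, hash_string(brute_string(atoms)),
             hash_string(geometry_string(atoms))]
    return "-".join(parts)


def translate(atoms, dx, dy, dz):
    """Return a copy of the molecule shifted by (dx, dy, dz) bohr."""
    return [(z, x + dx, y + dy, zc + dz) for (z, x, y, zc) in atoms]


# Illustrative water geometry (approximate values, in bohr).
water = [(8, 0.0, 0.0, 0.0), (1, 0.0, 1.43, 1.11), (1, 0.0, -1.43, 1.11)]
moved = translate(water, 1.0, 0.0, 0.0)

print(molecule_id(water))
print(molecule_id(moved))
```

The brute formula part is the same for both, while the geometry strings differ, so the geometry part of the ID changes (barring a CRC32 collision).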


If you have any questions, feel free to ask.