
wavemol team mailing list archive

Let's go!

 

So, let's start.

At the moment, I am writing most of the information on the wiki, at

http://forthescience.org/wiki/MolecularDatabase

Plan for development:

- I will very soon commit a prototype mock database interface that returns fake data. This will allow us to design a database interface that satisfies our needs (I will detail them in a later mail). We will replace the mock with the real thing at a later stage. Having this mock data provider will also let us start developing the web interface right away, so it's important to get the db interface right.

- We still have to solve the problem of the ID. All the chemists out there consider a molecule as a set of atoms connected by bonds, so they can use the InChIKey. We can't. For this project, the molecule _can_ be a set of atoms connected by bonds, but in the most frequent case it is a set of atoms at given positions in XYZ space. How do we assign an ID to such a molecule? At the moment, my algo does the following. The ID is composed of three parts: AA-<brutepart>-<geometrypart>

For the brute formula part: take the atomic numbers of all the atoms composing the molecule, sort them and join them with a \n.
e.g. in H2O take 1,1,8
      (sort) 1,1,8
      (join with \n) "1\n1\n8"
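
As a minimal sketch of the brute formula string described above (Python 3 is assumed, the sort is assumed to be numeric, and the name brute_formula_string is made up for illustration):

```python
def brute_formula_string(atomic_numbers):
    """Sort the atomic numbers numerically and join them with newlines."""
    return "\n".join(str(z) for z in sorted(atomic_numbers))


# Water: oxygen has atomic number 8, hydrogen 1.
# The order in which the atoms are given does not matter.
water = [1, 8, 1]
print(repr(brute_formula_string(water)))  # '1\n1\n8'
```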

For the geometry part:
make a list of strings, one per atom, each with four fields, using the format string "%d %15.10f %15.10f %15.10f". The fields are the atomic number (first field) and the three XYZ coordinates in bohr. Join these strings with \n. This algo is actually wrong in the current implementation, as I am not taking into account that the same atoms can be listed in a different order, which changes the resulting string.
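
The geometry string, and the ordering problem mentioned above, can be sketched as follows. This is only an illustration in Python 3: the function name, the sort_lines flag and the water coordinates are assumptions, and sorting the formatted lines is just one possible fix, not what the current implementation does.

```python
LINE_FORMAT = "%d %15.10f %15.10f %15.10f"


def geometry_string(atoms, sort_lines=False):
    """Build the geometry string from (atomic_number, x, y, z) tuples.

    Coordinates are expected in bohr. With sort_lines=True the formatted
    lines are sorted before joining, so the result no longer depends on
    the order of the atoms (an assumed fix, not the current behaviour).
    """
    lines = [LINE_FORMAT % tuple(atom) for atom in atoms]
    if sort_lines:
        lines.sort()
    return "\n".join(lines)


# Illustrative water geometry; the numbers are made up for the example.
water = [(8, 0.0, 0.0, 0.0), (1, 0.0, 1.43, 1.11), (1, 0.0, -1.43, 1.11)]
shuffled = [water[2], water[0], water[1]]

# Same atoms, different order: the plain strings differ.
print(geometry_string(water) == geometry_string(shuffled))  # False

# After sorting the lines, the order of the input is irrelevant.
print(geometry_string(water, sort_lines=True)
      == geometry_string(shuffled, sort_lines=True))  # True
```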

For both parts, the resulting string is hashed and made human-readable with the following algo: compute the CRC32 of the string and write it in hex, zero-padded on the left to 8 digits. Convert the hex digits into the bytes they represent and b32encode the obtained data.
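
A minimal sketch of this hashing step, assuming Python 3 and the standard zlib, binascii and base64 modules (the function name readable_hash is made up, and whether the trailing "=" padding is kept in the real IDs is not stated here):

```python
import base64
import binascii
import zlib


def readable_hash(text):
    """CRC32 -> 8 hex digits -> 4 raw bytes -> base32 text."""
    # Mask to get an unsigned 32-bit value on every Python version.
    crc = zlib.crc32(text.encode("ascii")) & 0xFFFFFFFF
    hex_digits = "%08x" % crc              # zero-padded on the left
    raw = binascii.unhexlify(hex_digits)   # the 4 bytes the hex stands for
    return base64.b32encode(raw).decode("ascii")


# Four bytes always encode to 7 base32 characters plus one "=" of padding.
print(readable_hash("1\n1\n8"))
```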

This algo is not guaranteed to be perfect. Any proposal is welcome, and since the first two letters of the ID (the AA) are reserved, we can change the algo later. It also has a high incidence of collisions, due to the CRC32, which is only 32 bits wide.
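
To give a rough feeling for the collision problem, here is a back-of-the-envelope estimate in Python 3 using the usual birthday approximation. It assumes the 32-bit values are spread uniformly, which CRC32 only approximates, so take the numbers as an order of magnitude. The function name is made up.

```python
import math


def collision_probability(n, bits=32):
    """Approximate chance that at least two of n hashes coincide."""
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * 2 ** bits))


# With about 77,000 distinct geometries the chance of at least one
# collision in the geometry part is already around 50%.
for n in (1000, 10000, 77163, 1000000):
    print(n, round(collision_probability(n), 4))
```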

The same molecule, rotated or translated in space, gives different IDs. This is mostly intentional, as moving a molecule can have an effect on the results. In this sense, two molecules that differ only by a rotation or a translation are different molecules.
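
The effect of a translation on the ID can be illustrated by putting the pieces together. This is a sketch in Python 3 under the same assumptions as the description above; molecule_id and readable_hash are hypothetical names, the coordinates are made up, and crc.to_bytes(4, "big") is used as a shortcut equivalent to going through the zero-padded hex string.

```python
import base64
import zlib


def readable_hash(text):
    """CRC32 of the text, as 4 big-endian bytes, base32-encoded."""
    crc = zlib.crc32(text.encode("ascii")) & 0xFFFFFFFF
    return base64.b32encode(crc.to_bytes(4, "big")).decode("ascii")


def molecule_id(atoms):
    """Build AA-<brutepart>-<geometrypart> from (Z, x, y, z) tuples in bohr."""
    brute = "\n".join(str(z) for z in sorted(a[0] for a in atoms))
    geometry = "\n".join(
        "%d %15.10f %15.10f %15.10f" % tuple(a) for a in atoms
    )
    return "AA-%s-%s" % (readable_hash(brute), readable_hash(geometry))


# Illustrative water geometry and the same molecule shifted by 1 bohr in x.
water = [(8, 0.0, 0.0, 0.0), (1, 0.0, 1.43, 1.11), (1, 0.0, -1.43, 1.11)]
moved = [(z, x + 1.0, y, zc) for (z, x, y, zc) in water]

original_id = molecule_id(water)
moved_id = molecule_id(moved)

# The brute formula part is the same, the geometry part is not.
print(original_id)
print(moved_id)
print(original_id.split("-")[1] == moved_id.split("-")[1])
print(original_id == moved_id)
```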

If you have any questions, feel free to ask.