## Tuesday, October 25, 2011

### "Post-doc with experience in CDK programming wanted"

SMARTCyp (see papers below) is an integrated computational approach that mixes cheminformatics with molecular modeling approaches to predict the metabolic fate of molecules. This fate is important to various biological aspects of small molecules, and the metabolism can active a prodrug into a drug, make a toxic compound non-toxic, and a non-toxic compound risky.

The tool has been well received by the community, complementing other approaches. Now, the reason why I blog about these papers now, is that the tool uses the CDK for the cheminformatics parts, which I find really cool. In fact, the project has resulted in good feedback on the CDK. In fact, the project received further funding creating a short-term open position to continue research on enzymatic reactivity of molecules in the cytochrome P450 family, as outlined below.

So, this peer-review is a bit more on the impact of the CDK and SMARTCyp on the academic landscape, than on the content.of the paper.

Patrik Rydberg wrote on the CDK LinkedIn group and the CDK user mailing list about an open post-doc position where CDK expertise is welcomed:
I'm seeking candidates for a post-doc position in applied cheminformatics at the University of Copenhagen. We are working on drug metabolism prediction models, and our results are as far as possible made into open source software based on the CDK. The group has previously developed the SMARTCyp site-of-metabolism prediction software which is based on CDK, and this project aims to extend the scope of our cytochrome P450 project.

Employer: department of medicinal chemistry, faculty of pharmaceutical sciences, University of Copenhagen
Location: Copenhagen, Denmark
Position: post-doc
Duration: 7 months, starting january 2012
Project: drug metabolism by cytochromes P450

Experiences required:
Machine learning methods
Cheminformatics
java programming

Bonus for experience in java programming using the Chemistry Development Kit (CDK), experience in development of ligand based virtual screening methods, and experience of work on the cytochrome P450 enzyme family.
The SMARTCyp paper can be found linked to below, with the details of how the CDK and SMARTCyp interoperate and make P450 predictions.

Rydberg, P., Gloriam, D., & Olsen, L. (2010). The SMARTCyp cytochrome P450 metabolism prediction server Bioinformatics, 26 (23), 2988-2989 DOI: 10.1093/bioinformatics/btq584

Rydberg, P., Gloriam, D., Zaretzki, J., Breneman, C., & Olsen, L. (2010). SMARTCyp: A 2D Method for Prediction of Cytochrome P450-Mediated Drug Metabolism ACS Medicinal Chemistry Letters, 1 (3), 96-100 DOI: 10.1021/ml100016x

## Saturday, October 22, 2011

I reported earlier how to I uploaded the ChemPedia (RIP) data onto Kasabi. But for ChEMBL-RDF I have used the pytassium tool, not just because it has a cool name :) I discovered yesterday, however, that I did not write down in this lab notebook, what steps I needed to take to reproduce it. And I just wanted to uploaded new triples to the ChEMBL-RDF data set on Kasabi.

The new triples I wanted to upload, link the new public CHEMBL identifiers (like CHEMBL25 for aspirin) to the internal ChEMBL database identifier I used for ChEMBL 09 for the URIs. So, I am adding a lot of triples like:

<http://data.kasabi.com/dataset/chembl-rdf/09/molecule/m517180> <http://www.w3.org/2002/07/owl#sameAs>
<http://data.kasabi.com/dataset/chembl-rdf/09/chemblid/CHEMBL1>

And the pytassium code I use to upload this to Kasabi looks like:

import pytassium
import time

dataset = pytassium.Dataset('chembl-rdf','XXX')

# Store the contents of a turtle file
dataset.store_file('chemblids.nt', media_type='text/plain')

So, that omission in my log book has been corrected now.

## Thursday, October 20, 2011

### CDK & File Formats #1: MDL molfiles and bond order 4

I just had a conference call on one of the translational cheminformatics projects I am involved in: Bioclipse-OpenTox. A paper about this project has been submitted, and we are writing up a more practice oriented book chapter (almost done). In writing up a use case, we ran into a recurrent problem: proper cheminformatics handling of input files. Ola suggested to start writing more extensive documentation on what users of the CDK are supposed to do when reading a file. So, here I start a new series.

Introduction
But before I start writing up how to work with MDL molfiles in the CDK, I like to stress three key design principles in the CDK:
1. there is no single solution to everything (aka there are multiple solutions to the same problem)
2. algorithms must be modular (the LEGO building block approach)
3. the user is responsible for using the right blocks at the right time
These principles have a profound effect on the usability of the CDK. And here I must stress the point I made recently on usability: what happens if you neglect less abundant personas. Well, in this case, the abundancy is actually somewhat different. But, CDK libraries target a few personas: one of these is the scientist that uses cheminformatics as a mere tool, who doesn't know graph theory, let alone file formats; this personas works in the field of translational cheminformatics. Let's call him Tony. The other personas has a more extensive education in cheminformatics (e.g. former Gasteiger lab, Sheffield, former CAOS/CAMM (like me), etc) or are actively involved in cheminformatics research (like that of Christoph, and many, many more). This personas we will call Carry.

The CDK, with its limited resources, must target both Tony as well as Carry; both have widely different needs, and somewhere, in our free hours, we must work on solutions for both personas. Carry and Tony are, of course, exaggerations, and all readers (and me too) are linear combinations of these personas. In fact, you can even be a Carry in some parts of the CDK, while being a Tony in others. The CDK has so much functionality nowadays, that even I have unexplored corners.

And, practically, the building block approach is important to Carry (she might want to plug in her own aromaticity model), but a killer for Tony.

Missing information
A prominent, recurrent problem for the first personas, is to deal with missing information. And input files have missing information. Some formats are more explicit than others. This series will focus on these aspects, and discuss how the CDK can be used to add that missing information.

Another important aspect is that the CDK data model cannot hold all information. For example, I have no clue how the CDK should read MDL files with a muonium (browse the cdk-devel mailing list archives of this month). Here too, this is not always clear to Tony, who does not have the time to read file format specifications, nor CDK interfaces. He just wants the CDK to do its thing which it is supposed to be good at.

MDL formats
So, here we are. We have a MDL .mol file. They are pretty much the community standard, and even pretty Open too. You can now find the specs in the ctfile.pdf, readily available on the web. Actually, they are no longer called MDL formats, but Symyx formats, umm, Accelrys formats. These formats define a number of file formats, including the aforementioned Accelrys molfile, the Symyx SD file, but also query formats, used to store queries against their database software.

Like any file format, they support a number of features. For example, MDL files cannot represent a bond order 4, a quadruple bond. Organic chemistry doesn't need them. Moreover, hydrogens are often implicit, as they can easily be added later, and both memory and disk space is expensive (think 80-ies). Stereochemistry is wedge-bond-based and others.

Well, more recent MDL formats have become more powerful. The V3000 format can do much more then the V2000 format, or even the pre-V2000 format.

Now, the trigger for the start of this series is the bond order 4 in MDL files. Strictly speaking this is not part of the molfile format, nor of the SD file format; instead, it's part of the query format. However, the community ignored that part of the specification, and the molfiles and SD files are commonly using this query type to represent aromatic bonds. Well, or bonds that can be both single or double.

Now, the CDK does not have a structure to represent a single OR a double bond order. The CDK only has SINGLE, DOUBLE, TRIPLE, and QUADRUPLE (see IBond.Order). Moreover, it has a separate mechanism to indicate if a bond is aromatic. A flag is used for that. This allows the CDK to store both kekule-ized bond order localization and aromaticity perception information separately.

So, if a MDL molfile (or SD file) is read with the CDK in RELAXED mode with the MDLV2000Reader, order 4 bonds are read as SINGLE bonds with a flag indicating it is aromatic. If only the CDK had IBond.Order.UNKNOWN. This is scheduled for master. Mind you, this is a complicated patch, because a lot of algorithm introspect the bond order information, which will have to be updated to handle UNKNOWN bond orders.

This is a nasty co-incidence (interaction effect) killing Tony's use case: the MDL molfiles in the wild have information that the CDK cannot represent, sadly. Now, if users would have been paying for CDK releases, we could have assigned a developer on it. We have the mechanisms in place to buy a copy of the CDK, so that's not really an excuse (price negotiable; maybe 4999 SEK, but you are welcome to buy a campus-wide CDK version for more). If enough people / companies would do this, we could hire a developer to implement this particular use case. This would be a great way to support the project!

Anyway, the purpose of this series is not to rant about these things, but to describe how file formats might be read with the CDK. Here's a recipe with inline comments (in the Groovy syntax):

Mode.RELAXED
);

// perceive atom types
AtomContainerManipulator
.percieveAtomTypesAndConfigureAtoms(
azulene
)

DefaultChemObjectBuilder.getInstance()
);
AtomContainerManipulator
.convertImplicitToExplicitHydrogens(
azulene
);

// if bond order 4 was present,
// deduce bond orders
dbst = new DeduceBondSystemTool();
azulene = dbst.fixAromaticBondOrders(azulene);

Carry would know this. In fact, Carry would probably have some comments on this recipe. Well, Carry can leave those in the comments of this post.

## Tuesday, October 18, 2011

### The Blue Obelisk Shoulders for Translational Cheminformatics

I guess reader of my blog already heard about it via other channels (e.g. via Noel's blog post), but our second Blue Obelisk paper is out. In the past five-ish years since Peter instantiated this initiative, it has created a solid set of shoulder on which to developed Open Source-based cheminformatics solutions. I created the following diagram for the paper, showing how various Blue Obelisk projects interoperate (image is CC-BY, from the paper):

It shows a number of Open Standards (diamonds), one Open Data set (rectangles), and Open Source projects (ovals). What does diagram is not showing, is the huge amount of further Open Source cheminformatics projects around, that use one or more of the components listed here, but which do not link themselves to the Blue Obelisk directly. And there are many indeed, both proprietary and Open.

I am proud of this diagram: it really shows that the interoperability we set out in the first paper worked out very well! This makes the Blue Obelisk an excellent set of shoulders to do translational cheminformatics.

Translational cheminformatics?? Well, I have been looking for a while for a good term for my research regarding all that hacking on the CDK, Bioclipse, etc. Now, that's the translation of my core molecular chemometrics research to other scientific fields, like metabolomics, toxicology, etc.

Guha, R., Howard, M., Hutchison, G., Murray-Rust, P., Rzepa, H., Steinbeck, C., Wegner, J., & Willighagen, E. (2006). The Blue Obelisk - Interoperability in Chemical Informatics Journal of Chemical Information and Modeling, 46 (3), 991-998 DOI: 10.1021/ci050400b

O'Boyle NM, Guha R, Willighagen EL, Adams SE, Alvarsson J, Bradley JC, Filippov IV, Hanson RM, Hanwell MD, Hutchison GR, James CA, Jeliazkova N, Lang AS, Langner KM, Lonie DC, Lowe DM, Pansanel J, Pavlov D, Spjuth O, Steinbeck C, Tenderholt AL, Theisen KJ, & Murray-Rust P (2011). Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on. Journal of cheminformatics, 3 (1), 37 PMID: 21999342 DOI: 10.1186/1758-2946-3-37

## Wednesday, October 12, 2011

### Tricks I learned today #2: importing ontologies into a Semantic MediaWiki

I learned a second trick today (see also this first); this one is about the Semantic MediaWiki (SMW). I was using a trick I learned from RDFIO before, setting Equivalent and Original URIs (though the difference between those, I lost). But I ran into the problem that these equivalent URIs cannot contain hashes (#), or not always it seems.

After some googling, I did not find an answer, and turned to the SMW IRC channel. Saruman was helping out and pointed me to the Equivalant URI wiki page. I had looked at that earlier, but now turned to the Import vocabulary page. While this was not what Saruman wanted me to look at, it did turn out a nice workaround. The wiki page shows how to import external ontologies. This trick requires you to craft a wiki page with a special name, starting with MediaWiki:Smw_import_ followed by a namespace. It exemplifies this with the FOAF ontology, which is in fact an ontology I was also using equivalent URIs for.

So, I created a new MediaWiki:Smw_import_foaf page, with this content:

  http://xmlns.com/foaf/0.1/|
[http://www.foaf-project.org/ Friend Of A Friend]
name|Type:String
homepage|Type:URL
depiction|Type:URL
phone|Type:String
knows|Type:Page
member|Type:Page


And another page, MediaWiki:Smw_import_rdfs, with this:

  http://www.w3.org/2000/01/rdf-schema#|
[http://www.w3.org/TR/rdf-schema/ RDF Schema]
seeAlso|Type:URL

Then, in a Property page, which I want to make equivalent with a property in FOAF or RDF Schema, I can now simply use [[imported from::foaf:homepage]] or [[imported from::rdfs:seeAlso]], in Property:Has_homepage and Property:See_also respectively.

That's a neat trick, /me thinks!

### Tricks I learned today #1: as.integer() on factor levels

I normally work with full numerical data, not categorical data. R, when using read.csv() seems to recognize such categories and marks the column as to have factor levels. This is useful indeed. However, I wanted to make a PCA biplot on this data, so was looking for ways to convert this to class numbers. After some googling we, Anna and me, ran into as.integer() which can be used on the factor levels. So, today I learned this trick:

> a = as.factor(c("A", "B", "A", "C"))
> b = as.integer(factor(a))

Well, probably basic to many, it was new to me :)

Now, wondering if it is equally easy to convert it into a multi-column matrix where each column indicates class membership (thus, resulting in three columns for the above...). That's another trick I need to learn...

## Tuesday, October 11, 2011

### Blogs I Follow: Henry Rzepa

Just in case you have not run into Henry's blog yet, check it out. His blog makes me so jealous I did not follow up on my basic quantum chemistry education. Implementing Hartree-Fock in Fortran is not nearly as interesting or useful as the stuff he has been blogging about. A second reason you should, is his brilliant use of Jmol (Henry is one of those using Jmol for more than 10 years). This is his blog in action:

His double gravatar is not vanity, but a bug in one of blog extensions he is using creating those blue z icons. I think the French in the blog title is, while unlikely for British I always understood, deliberate.

Henry, if you're reading this... what about adding permalinks to those Jmol visualizations (or DOIs, like data cites)? Second, I'd love to be able to download to be able to download a visualization, and open it in Jmol or Bioclipse. Would such be possible in CML or JVXL (think datument)?

### Useless statistics: blog visitor by OS

The market is seriously changing now. Another year, and Microsoft Windows is no longer the majority OS. Of course, my blog is very specific, and these statistics do not map well to global market shares.

## Monday, October 10, 2011

### Call for Help: categorizing Open Data repositories for chemistry

Where to host chemistry data? This was the question two people asked a few weeks ago:
I had these two blog posts open in my browser since about the time they were blogged, intending to reply. But I could not come up with a good answer, despite I was hoping to do so. For RDF-based data there are a few options now, such as Kasabi and Science 3.0. Also, for crystallography data there is the Crystallography Open Database, and for quantum chemical calculations there is Quixote. And, of course, annotated NMR spectra can go into the NMRShiftDB.

But for chemistry data in general I do not know a solution. What to do with JDX files, with images of chromatograms? BioTorrents perhaps? But that is mostly for large data sets, and does not have a clear indexing approach. ChemSpider, as Jean-Claude has been doing for spectra (see this YouTube video)? ChemSpider does not have a solution to extracting the Open Data.

These are features such a repository must have:
1. allows you to specify who is the owner, creator, or similar
2. allows you to license the data, or, to waive your rights, per Panton Principles
4. allows users to automate data extraction
5. data should be indexed, at least by InChI (which just got a 1.04 release)
6. support any format
Optionally, these extras are welcome:
1. semantic annotation of repository content
2. provide CMLRSS feeds of new content
So, hereby this call for help: let's categorize what repositories are around that fulfill the 6 required features (or come very close). We can start of using regular blogging practices, by blogging solutions, ideas, comments, etc in reply, or use the commenting facilities here. Any activity in this area is appreciated and most welcomed by the community.

## Saturday, October 08, 2011

### An ontology for QSAR and cheminformatics

QSAR and QSPR are the fields that statistically correlate chemical substance features with (biological) activities (QSAR) or properties (QSPR). The chemical substance can be molecular structures, drug (which are not uncommonly mixtures), and true mixture like nanomaterials (NanoQSAR). Readers of this blog know I have been working towards making these kind of studies more reproducible for many years now.

Parts of this full story include the Blue Obelisk Data Repository (BODR), QSAR-ML, the CDK for descriptor calculation, the Blue Obelisk Descriptor Ontology (BODO, doi:), still used by the CDK, and in the past by JOELib too, and much, much more. Really, I still feel that the statistics is by far the easiest bit in QSAR modeling.

New in this list of tools to make QSAR more reproducible, is the CHEMINF ontology, which further formalizes cheminformatics computation. In a collaboration with Janna and Christoph (EBI), Michel and Leonid (Carlton University), and Nico (formerly at Cambridge, now at CSIRO), we have cooked up an ontology, and the computational bits of it are captured by the below figure from the paper that just appeared in PLoS ONE.

Both the paper and the ontology have a Creative Commons license. The ontology has already been used by Leonid in other papers, and I have been using it already in the RDF-ed version of ChEMBL.

Next steps for me regarding this ontology is to convert to BODO to be based on CHEMINF, but highly interesting too is a reformulation of QSAR-ML to be based on CHEMINF. The QSAR markup language was long started before RDF came into the picture, so please forgive us for now using RDF from the start there.

One particularly interesting aspect this ontology captures is the difference between molecular entities and mixtures. Not uncommonly, QSAR studies correlate drugs to their binding affinities, even if those drugs are in fact mixtures of stereoisomers. While 0D, 1D, and 2D descriptors are not affected, geometrical descriptors most certainly are. Moveover, the modeled endpoint is very possibly the property of only one of the stereoisomers, most certainly for binding affinities. Yet, many QSAR study reports in literature do not record such details. The CHEMINF ontology defines the terms you need to publish such details.

Hastings, J., Chepelev, L., Willighagen, E., Adams, N., Steinbeck, C., & Dumontier, M. (2011). The Chemical Information Ontology: Provenance and Disambiguation for Chemical Data on the Biological Semantic Web PLoS ONE, 6 (10) DOI: 10.1371/journal.pone.0025513

## Sunday, October 02, 2011

### CDK's getNaturalExactMass(): isotopic and atomic weights

A recurrent question for the CDK project is the about the AtomContainerManipulator.getNaturalExactMass() and why it return NaN for some elements. There are various incarnations of the issues here, but the key here is the difference between the various weights a molecule can have.

Monoisotopic molecular weight
The weight of a single molecule is well-defined, but requires you to know which isotopes are present in the molecule, which you typically do not. Each isotope of an element has a natural abundance. For example, carbon isotopes found on earth are mostly 12C, and a bit of 13C. In fact, 12C is about 99 times more abundant than 13C. The same goes for 1H and 2H: the first is way more abundant than the latter. This means that the molecular weight of a single methane molecule can only be calculated which isotopes are present in the molecule. But, the most occurring combination, is the molecule with just the major isotopes, SMILES: [1H][12C]([1H])([1H])[1H]. A few elements have two isotopes which are almost equally abundant, such as bromine.

Natural molecular weight
However, in experimental chemistry we are not looking at individual molecules (well, we can nowadays), but typically at substances of those molecules. Substances contains very many single molecules (remember the Avogadro's constant). So, we have a mixture of molecules with different isotope ratios. As indicated earlier, the majority of those molecules contain only major isotopes, but as a substance we rather just take the natural mass of the molecular, which is an average weight for that element (called atomic weights), giving the natural abundances of the element's isotopes. Mind you, these atomic weights are not constant, as recently made the news. Only isotopic weights are constant, the atomic weights depend on the ratio of isotopes, which varies around the world.

But, returning to AtomContainerManipulator.getNaturalExactMass() method, it should be noted that atomic weights are also only defined for elements which in fact are found on earth; elements which are naturally found, and for which the isotopic abundances can be calculated. Thus, atomic weights are not defined for synthetic elements, which we have more and more.

So, if you call this getNaturalExactMass() method for a molecular structures which has one or more elements with that do not occur naturally, you will get an NaN answer. In fact, I guess the method should throw a CDKException with a message like "Hey dude, you have non-natural elements in this molecule!".

Getting practical: technetium
Technetium (Tc) is such an element which does not have any naturally occurring isotopes, so the CDK cannot calculate a atomic weight. And that is the answer to this bug report.

### Robert Hanson: "Jmol 12.2.0 ready for release"

Robert Hanson is, as most of my readers now, the current project leader and lead developer of Jmol, assisted by many translators (you can help here) and Nico as release manager. Robert just sent out the message that the next stable series is ready for release. With 224 new features, he shows his enthusiasm for Jmol has not quite slowed down yet :) I sometime miss working on the successful Jmol, but I do not regret giving way to Miguel Howard and Robert Hanson which have taken Jmol to higher levels.

One of the new cool features is the contact command, which Robert demoed at the Gordon Research Conference on Visualization In Science & Education earlier this summer (yeah, autumn does not quite seem to have started in Europe yet). With this, supramolecular interactions can be conveniently visualized (click the images for the Jmol applet version):

And in protein:

Of course, creating the Jmol script to create these informative visualizations is another story. The online documentation helps, as well as Herrez' book. I still hope to find time at some point to have interactive documention in Bioclipse, perhaps based on content from Proteopedia, which uses Jmol in a MediaWiki, and notice all the green links, which run a Jmol script to highlight a particular bit of chemistry for that biological system. Recommended reading.

Congratulations to all involved in this Jmol (upcoming) release!