Sequencing Antibiotics using Bioinformatics

The human genome is made up of about 3 billion base pairs — the tiny little characters which make up the sentences in the instruction manual that is our genetic code. The mechanism which converts these instructions from an “idea” to an “action” is governed by the backbone of high school genetics units: the Central Dogma of Molecular Biology.

This basically states that DNA (the raw genetic material) encodes RNA which encodes proteins (the things which actually carry out different bodily functions). More specifically, DNA is transcribed to a strand of RNA comprised of four analagous ribonucleotides — adenine, guanine, cytosine and uracil. The RNA strand is then read off and depending on the content, it is translated into a particular protein.

During translation, RNA is converted to a protein by

  1. Partitioning the strand into “codons”: non overlapping sets of three consecutive bases.
  2. Matching each codon with it’s respective amino acid based on the genetic code.

Since there are 4³ = 64 different codons, but only 20 amino acids, some amino acids are coded for by multiple different codons and some codons translate to the “stop codon” (denoted by a “*”) which just interrupts the translation process.

For a long time, the central dogma was the explanation for how every single protein was fabricated in the body. That changed in 1963 when Nobel Laureate Edward Tatum shut down ribosomes in a bacteria called Bacillia Breivs. The ribosome is the site of translation from RNA into proteins, and so Tatum predicted that by inhibiting the ribosome from function, protein generation should come to a full stop.

But, what actually happened was that protein generation did come to a full stop except for a handful of proteins including tyrocidines and gramicidins. So, if aren’t produced by the ribosome, then where do they come from?

It turns out that there are a subset of proteins called non ribosomal peptides which are not produced by the ribosome but by another protein called NRP synthetase.

Now, if we did want to somehow sequence these peptides, a couple complications emerge

  1. We can’t do so by making use of a genome because these proteins do not correlate to DNA.
  2. Most NRPs are cyclic — so rather than being coded for in a linear fashion which have one obvious start and end, there are instead multiple possible “starts” and “ends”.

The first problem is overcome by using data generated by a mass spectrometer as a basis for analysis. A mass spectrometer is a machine which shatters molecules into pieces and then weighs the resulting fragments. It measures in Daltons which also denotes the mass of one nuclear particle like a proton or a neutron.

The mass of a molecule is the sum total of all the masses of the atoms that make it up and is also known as it’s integer mass. For instance, water (H_2O) would have a mass of 2*1 + 8 = 10 Da. One subtlety is that one Dalton does not exactly correspond to the mass of one nuclear particle, which results in molecules having masses with fractions attatched. Just for the sake of simplicity, we will take masses to be integers.

Each different non ribosomal peptide has it’s own mass which is between 57 and 200 Da.

Let’s say that we shatter Tyrocidine (LFPWFNQYVK) into a bunch of parts using the mass spectrometer. One possible result would give us LFP and WFNQYVK and another possible result would give us YVKLF and PWFNQ. The specific collection generated by one “shattering” is called the experimental spectrum.

This is where the second point (NRPs being cyclic vs. linear) merits a bit more explanation. Cyclic peptides and linear peptides would be expected to have different spectrums. For instance, if the NQEL was linear, then for the spectrum, we could only get shattered pieces from the set {N, Q, E, L, NQ, QE, EL, NQE, QEL, NQEL}. But if NQEL was cyclic, then for the spectrum, we could get values from the set {N, Q, E, L, NQ, QE, EL, LN, NQE, QEL, ELN, LNQ, NQEL}.

The experimental spectrum is often compared to the theoretical spectrum which would contain all the possible masses of the cyclic peptide. For instance, the theoretical spectrum of NQEL would be

Notice that it contains duplicates when necessary (i.e. NQ and EL both have the same mass so 242 is included twice).

This leads us to the problem of finding a cyclic peptide whose theoretical spectrum matches a given theoretical spectrum (i.e. trying to guess what cyclic peptide a given set of experimental data may correlate to). This is called the Cyclopeptide Sequencing Problem.

Let’s say that the parent mass of an spectrum the mass of the entire peptide (the largest one in the specturum). For instance, the parent mass of NQEL is 484. A brute force approach would take every peptide whose theoretical linear spectrum has a parent mass of the given experimental spectrum and then see if the resulting theoretical spectrum is the same as the experimental spectrum.

But of course, we should be able to do better than brute force because this gets time intensive for long strings. Instead we should take a branch and bound approach which involves brute force but in a slightly more clever way.

This algorithm makes use of a “leaderboard” of potential peptides each of whose mass is less than the Parent Mass of the experimental spectrum. At each step the algorithm

  1. expands the list of existing leaderboard peptides by creating a version with each of the possible letters tacked on to the end. Ex the peptide “GA” would be replaced by “GAG”, “GAE”, “GAD”…etc. This is the branching step.
  2. gets rid of the peptides whose mass is larger than the Parent Mass. This is the bounding step.
  3. For each peptide whose mass is equal to the parent mass, it compares it’s theoretical linear spectrum to the spectrum in question. If they are equal, we save the peptide, or else we discard it.

Now, this algorithm is designed to handle instances were the spectrum we are given is theoretical and so would perfectly match the theoretical spectrum of any of the peptides in the leaderboard. But, what if we are given an experimental spectrum which has some inconsistencies — some additonal masses or some missing ones from the theoretical spectrum of the peptide that it is supposed to represent.

In order to relax the requirements somewhat, we will develop a scoring function which returns a score based on how similar a theoretical spectrum of a peptide in the leaderboard is to the experimental spectrum. Then, our goal would be to return the spectrum with maximum score.

This inevitably means that our leaderboard of peptides would grow to contain more peptides than it did before. So, we somehow need to make sure that our leaderboard doesn’t grow out of control. This is done by introducing a cut which means that at every step, only the N highest scoring peptides are kept on the leaderboard and the rest are discarded.

The smallest amino acids in a theoretical spectrum (between 57 and 200 Da) should be the ones that correlate to specific amino acids. But, what if we totally miss one of the amino acids in the experimental spectrum, for instance, NQEL with out the E.

We define a convolution to be an operation which takes the positive differences between all possible pairs of masses in the experimental spectrum.

We can integrate this idea into Cyclopeptide sequencing as follows:

  1. compute the convolution.
  2. choose the M most frequent elements to form the alphabet of amino acids which we will use for expansion.
  3. run the previously defined branch and bound leaderboard cyclopeptide sequencing algorithm using this algorithm.

In the next article, we will be focusing on finding the closest occurrences of patterns with in a length of genome. You can also check out my previous article which had to do with using bioinformatics to sequence the human genome.

Activator at The Knowledge Society | A Sandwich or Two Founder