I'm always learning something from the readers of the Loom. Yesterday, I wrote about how scientists had inserted their names into a synthetic genome, and how such signatures would erode away like graffiti inside real organisms. But how about the opposite case--what if evolution has produced sequences of DNA that happen to form words? In the comment thread, Peter Ellis asked,
What actually is the longest word (in any language) encoded by the reference human genome? If I had the time and computer power I'd have a look... Guesstimate - it'll be somewhere in the 4-5 letter range, depending on letter frequency in the target language.
Bear in mind the rules of this game: the letters are the one-letter abbreviations for the amino acids specified by codons (three bases of DNA). There are only 20 amino acids in most living things, so you can't spell every word unless you use alternatives, like using V for U. (Here's a table.) Ron then replied:
Just wander over to NCBI and blast to your heart's content. Taking "gvesstimate" (note the classical spelling) and checking against the protein refseq database finds:
>ref|NP_939322.1| Putative peptide ABC transport system ATP-binding protein [Corynebacterium diphtheriae NCTC 13129]
Length=560
GENE ID: 2649530 DIP0959 | protein coding [Corynebacterium diphtheriae NCTC 13129] (10 or fewer PubMed links)
Score = 26.1 bits (54), Expect = 215, Method: Composition-based stats.
Identities = 9/11 (81%), Positives = 10/11 (90%), Gaps = 0/11 (0%)
Query  1    GVESSTIMATE  11
            GVESS I+ATE
Sbjct  278  GVESSEILATE  288
(sorry about the lack of proper formatting)
Knock yourself out. I do have vague recollections of someone doing something similar a long time ago, when the database was much, much smaller.
I had not heard about anyone trying this before, but it sounds like a lot of fun. I'm a complete novice when it comes to reading genomes with BLAST, so I won't try. But if anyone wants to post the longest word they can find, let's see what you get. (Maybe I'll get my word-guru brother to team up with a geneticist...that would be interesting.) If you think about it, life on Earth is probably coming up with stray words in its many genomes, which then turn to gibberish (to our eyes), only to produce new words for us to find. The four-billion-year word search, as it were.
Update: Stephen Matheson offers easy step-by-step instructions. Thanks! Without a Z in the genetic code, I can't make an egotistic search for Zimmer. But here's Darwin lurking in bacteria.
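For anyone who wants to check whether a word is even spellable before heading over to BLAST, here is a minimal sketch in Python of the rules of the game described above. It assumes only the 20 standard amino acids count as letters, and the only substitution it knows is the V-for-U swap mentioned earlier. The names `AMINO_ACID_LETTERS`, `SUBSTITUTIONS`, `to_peptide`, and `missing_letters` are my own inventions for illustration, not part of any BLAST tool.

```python
import string

# One-letter codes for the 20 standard amino acids.
AMINO_ACID_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")

# Stand-ins for letters with no amino acid. Only V for U comes from
# the post; any further substitutions would be your own house rules.
SUBSTITUTIONS = {"U": "V"}


def missing_letters():
    """Return the letters of the alphabet that no standard amino acid uses."""
    return sorted(set(string.ascii_uppercase) - AMINO_ACID_LETTERS)


def to_peptide(word):
    """Turn a word into a protein query string, or None if it can't be spelled."""
    letters = []
    for ch in word.upper():
        ch = SUBSTITUTIONS.get(ch, ch)  # apply the V-for-U style swap
        if ch not in AMINO_ACID_LETTERS:
            return None  # e.g. the Z in "Zimmer"
        letters.append(ch)
    return "".join(letters)


if __name__ == "__main__":
    print("Unavailable letters:", " ".join(missing_letters()))
    for word in ["guesstimate", "Darwin", "Zimmer"]:
        print(word, "->", to_peptide(word))
```

A word that comes back as a string can be pasted into a protein search as the query; a word that comes back as `None` needs more creative spelling.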