## gene duplications (II)

Many protein structural domains correspond to distinct exons. Neighboring exons are often similar as a result of internal gene duplications.

If two copies of a gene (or domain) are in close proximity to each other on the genome, then the probability for additional duplication in the region is greatly enhanced.

Suppose we have genes A and B in two different species. If A and B are related to each other via speciation, then they are orthologous. If they are related by duplication then they are paralogous.

The introns-early vs. introns-late debate. The introns-early theory argues that introns were present in ancient genes as linkers between functional protein modules (the exons) and were later lost in prokaryotes. The introns-late theory says that introns were integrated into functional genes only after the emergence of the endosymbiotic process and eukaryotic cells.

## movies

I have watched two stunning movies recently: Monster (2003) and Blue is the Warmest Color (2013). They are driven by two of the most spectacular performances by actresses that I have seen: Charlize Theron as Aileen and Adele Exarchopoulos as Adele.

Both films deal with confused sexuality, loneliness in modern society, and a friendship/relationship that sadly but unavoidably runs its course. Both explore these issues from the perspective of a pair of women. The two films also exemplify the differences between American and European cinema.

Monster is shocking in its violence and brutality. It may be based on a true story, but the film is undeniably a freak show, a circus act, a proverbial train crash that we know cannot be real but can’t look away from. The characters, Aileen in particular, are victims of circumstances, both direct social forces and indirect mental disorders caused by abuse. There’s a sense of inevitability in her march towards self-destruction, and she has little choice.

Blue is the Warmest Color is a much more subtle, artistic and, ultimately, more interesting film. Even as a 15-year-old student, Adele straddles the edge of society. Most of the time, she is a very normal, popular girl at school. But at times, her eyes glaze over and there’s a hollowness behind them (similar to Matteo’s eyes). She takes comfort in food, sleep and other basics of life but there’s little joy in it. She comes to life in moments with friends, especially when they are marching in protest or dancing at picnics. She cries often and we don’t know why. Perhaps Adele doesn’t know herself. Her relationship with Emma is driven by raw physical attraction, but they have different outlooks on life. Emma is in the art world, is intellectual and comes from a sophisticated family. At times, she talks like a feminist artist pamphlet.

Adele likes art but not at the intellectual level. She is content to be a physical muse and a cook and server at parties. Throughout the film, she looks like a high school student (Adele was 18-19 at the time of the filming); a girl in the world of grownups. Even though at least 6 years pass during the movie, she never ages. She’s never comfortable in the grownup world of artists/intellectuals, but she’s troubled and has lost some of the innocence of childhood. She enjoys working with kindergarten kids but otherwise drifts without ambition. Is she happy?

Emma loses the wildness of her student days (symbolized by the dyed blue hair) and they inevitably drift apart. The film is less a love story and more about the confusing coming-of-age of one girl. What does it mean to fall in love beyond the physical attraction? Adele’s sexual ambiguity highlights this question. Her interactions with Thomas (the first boyfriend) are just as natural and compatible as her interactions with Emma.

The film has several recurring motifs: the color blue, close-up shots of Adele sleeping with her mouth open, voluptuous eating, Adele fussing with her hair. These motifs stabilize the movie and give the viewers a constant stream of familiar handles. They are also quirky tics that make the characters much more real.

Monster: 4/5

Blue is the Warmest Color: 5/5

## duplication

There’s strong empirical evidence that gene duplication is a major driver of innovations in evolution. However, it has received only scant modeling and theoretical analysis, especially compared to the amount of work in population genetics. Moreover, most of the efforts have gone into modeling duplication at the sequence level rather than at the level of function.

Basic numbers of gene duplication. In yeast, the rate is $10^{-6}$ per gene per generation. In worm, it is $10^{-7}$ per gene per generation. The rate at which duplicated genes are fixed in eukaryotes is 0.01 per gene per million years, suggesting that the vast majority of duplications do not reach fixation. Humans and chimps gain and lose genes (0.004 gains and losses/gene/my) faster than other primates (0.0024 gene/my), and almost 3X faster than non-primate mammals (0.0014 gene/my).

Mechanisms of gene duplication. There are three main mechanisms: unequal cross-over, retroposition, and chromosomal/genome duplication. Unequal cross-over is often caused by similar stretches of DNA sequences, including transposable elements, that accidentally recombine.

Effects of gene duplication. The majority of duplicated genes quickly decay into pseudogenes. The ones that remain functional fall into several types:

1. Amplification. The more expression, the better.
2. Neofunctionalization. One copy retains the original function and the other copy diverges to a new function through drift and selection.
3. Subfunctionalization. The original gene might be performing multiple jobs (pleiotropy), and the new copies each perform a subset of the jobs better than before.

It seems that subfunctionalization is particularly common (and may be the most common outcome). Duplicate gene expression patterns tend to diverge quickly after the duplication event. OR genes are a good example of this. However, many duplicate genes (even very old ones) are functionally redundant to some extent, as observed from deletion studies.

Other genomic innovations. MicroRNAs have expanded and functionally diversified via gene duplication. RNA-based duplications involve retroposition of mRNA back into the genome and result in “stripped-down” genes devoid of introns. Thousands of retrocopies and >100 functional retrogenes have been identified in the human genome. De novo emergence of genes is rare but has been observed in open reading frames. New genes may also arise from domesticated genomic parasites such as retroviruses and transposons. Many newly formed genes tend to be specifically expressed in the testis. The testis is the most rapidly evolving organ, and hence may be under especially strong selection for new genes.

## Ira Glass

Last night I went to a wonderful performance by Ira Glass (of This American Life fame) called 3 acts, 2 dancers and 1 radio host. It’s a combination of mediums that I have not seen on stage before: nonfiction storytelling + interpretive dance set to lively music. Truly excellent. I have to say that this, the Heart of Robin Hood, the Glass Menagerie and sleep walking are the best and most memorable performances I have seen over the last few years.

All three of Ira’s stories really connected with me. The first one was about the inevitable tension that arises from repetitive performance of creative acts. The second story talked about love, from middle school dances to marriages. And the third was about loss.

Being a professional host, Ira is excellent at making people feel at ease and asking questions, but he admitted to not being very good at sharing about himself (being “emotionally present”). This is something that I feel about myself.

## Reflections from the second Simons workshop

The workshop has been an interesting meeting of minds. The talks were quite variable in quality and I will reflect on some of the themes below. In general I find the standard lecture-based setup of this kind of workshop a bit sleepy (esp. after a dozen talks) and not conducive to deep interactions.

What if we organize it as follows: a few hour-long plenaries by people like Les, Haussler, Eric Siggia, and Christos. Each talk is 35 min of material with a 10 min halftime discussion and 10 min at the end, plus plenty of questions mixed in. A few hour-long chalk talks (at most 2 slides). A bunch of “lightning talks” each lasting ~5 mins. This leaves time for standard coffee breaks plus two hours each day for speed dating. The idea is that a pair of researchers meets for 30 min to discuss, so each person meets with 4 others every day. Each participant submits some research interests and the organizers pair up people who might have interesting discussions. Ideally it would be a mix of people from your field and people from other fields, with priority given to people who don’t already know each other. In a workshop of 70 participants, each person meets 20 strangers, which is a sizable fraction. The idea is to spur new interactions outside existing cliques and especially between juniors and seniors.

On to the content. This workshop raises several important questions: how does theory make impact, what is the role of worst-case analysis, how can TCS contribute to understanding evolution, does evolution need more understanding, is this a potentially impactful intersection?

In theoretical/modeling research it is very easy to become detached from science and data. It is tempting to: 1. make up problems to solve; 2. make up metaphors; 3. work on very particular extensions of existing models. I believe all of these are traps that can potentially derail a research agenda from having real scientific and social impact. There are many intellectually interesting problems, so with a few exceptions I don’t consider pure mental stimulation to be a main priority.

To avoid these traps, it’s useful to have a checklist to evaluate a potential modeling idea:

1. Is there a real scientific, empirical puzzle? It’s best if there is a concrete conflict in data begging for explanation. Short of this, there should be an empirical phenomenon (read: data driven) that’s interesting and not well understood.
2. The approach and model should be new. I firmly believe that in theory and in science the bulk of the impact is in conceptualizing a new framework and getting the first results. After that, it tends to get less interesting and much more technical. A double whammy.
3. If you are still working on a model developed by Fisher more than 50 years ago, there had better be a significant new element such as new data or new structure. Otherwise the results are incremental in the worst case. In general it’s useful to ask: what am I bringing that’s unique to the problem?

I think a lot of the work in theoretical evolution, modeling and evolutionary algorithms falls into one of these three traps. For people coming from the physics/math side, 1 and 3 are more common. For CS folks, 2 is a danger.

## Simons workshop 2

Awards for favorite talks.

Eric Siggia. Evolving regulatory networks to fit target morphological patterns. Positive (which means greedy, I believe) evolution of network topology, parameters, and outputs reproduces known morphology. It’s important that topology and parameters coevolve. If given the final topology, it is then hard to tune the parameters. Duplication seems very important here.

Carsten Witt. Time complexity of EA as randomized algorithms. I normally think of EA as pretty blah but this talk was crystal clear. The idea is that it’s hard to analyze EA directly and the main work-horse is drift analysis, which is basically local analysis of how much progress is made in one step on average. Additive drift. Multiplicative drift. Chernoff bound analysis of deviation from expectation. Recurrence analysis.

David Haussler. Evolution of the human brain (probably the overall winner). Several strands of evidence suggest that Notch2NL significantly contributes to the increase in human brain size. It’s a fairly recent duplication; present in human, chimp, gorilla. In tissue culture, it seems to prolong the stem stage, leading to larger brains. Ectopic overexpression in mouse fetus changes morphology. Deletion of that region in humans leads to small-brain illness. It is also interesting that the duplication became a pseudogene and was reactivated via gene conversion before having these functional effects.

Other interesting ideas/tidbits.

• Steve Frank. The Gaussian distribution is an attractor. What are other attractor distributions/functions common in nature? Perhaps the Hill function is an attractor in the sense that a broad class of interactions all result in Hill-function-like shapes.
• Guy Sella. Fisher’s geometric model of genetic architecture.
• Edo Kussell. Molecular memory via the lac operon.
• Carl Bergstrom. Defensive complexity. The immune system is so complex in order to make it harder to break/hijack.

Submodular optimization has been receiving a lot of attention in machine learning recently. It is a nice way to generalize convex optimization to sets, and has the nice property that a very simple greedy algorithm is: 1) pretty close to optimal and 2) pretty much the best one can do.

If $\Omega$ is a set, a set function $f: 2^{\Omega} \rightarrow \mathbb{R}$ is submodular if it satisfies one of the following equivalent conditions:

1. If $S, T \subseteq \Omega$, then $f(S)+f(T) \geq f(S \cup T) + f(S \cap T)$.

2. If $S \subseteq T$ and $x \in \Omega \setminus T$, then $f(T \cup \{x\}) - f(T) \leq f(S \cup \{x\}) - f(S)$.

The conditions capture the intuition of diminishing marginal returns: adding an element to a larger set helps no more than adding it to a smaller one.
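To make the diminishing-returns condition concrete, here is a minimal brute-force sketch that checks it on a tiny ground set. The helper names (`powerset`, `is_submodular`), the toy collection `SETS`, and the two example functions are all made up for illustration; the check is exponential in the size of the ground set, so it is only meant for toy cases.

```python
from itertools import combinations

def powerset(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_submodular(f, ground):
    """Brute-force check: for all S within T and x outside T, the marginal
    gain of x on T must not exceed the marginal gain of x on S."""
    ground = frozenset(ground)
    subsets = list(powerset(ground))
    for S in subsets:
        for T in subsets:
            if not S <= T:
                continue
            for x in ground - T:
                if f(T | {x}) - f(T) > f(S | {x}) - f(S):
                    return False
    return True

# Hypothetical toy collection of named sets (illustrative data only).
SETS = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}}

def cover(chosen):
    """Coverage function: number of elements in the union of the chosen sets."""
    covered = set()
    for name in chosen:
        covered |= SETS[name]
    return len(covered)

def squared_size(chosen):
    """Counterexample: marginal gains grow with set size, so not submodular."""
    return len(chosen) ** 2

coverage_ok = is_submodular(cover, SETS)
squared_ok = is_submodular(squared_size, SETS)
```

With these toy inputs the coverage function passes the check, while `squared_size` fails it because adding an element to a larger set gains more, not less.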

We call a submodular function monotone if $\forall S \subseteq T, f(S) \leq f(T)$. Monotone submodular optimization includes problems such as Max Cover: given a collection of sets and $k$, find the $k$ sets whose union is the largest. Max Cover is NP-hard. What’s very nice is that any monotone submodular maximization problem of this form (pick at most $k$ elements) can be approximated to within $1-1/e$ by a simple Greedy algorithm.

The Greedy algorithm works in iterations. In the first round, find the single element $x_1 \in \Omega$ to maximize $f(\{x_1\})$. Next, given $x_1$, find $x_2$ to maximize $f(\{x_1, x_2\})$, and so on. It’s actually pretty easy to show by induction that this gives a $1-1/e$ approximation to the optimal solution for any fixed $k$.
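As a rough illustration of the Greedy procedure, here is a minimal sketch for Max Cover. The function name `greedy_max_cover`, the alphabetical tie-breaking, and the example collection are assumptions made for this sketch, not part of any particular library.

```python
def greedy_max_cover(sets, k):
    """Greedily pick up to k named sets, each round taking the set that
    adds the most not-yet-covered elements (ties broken by name order)."""
    chosen = []
    covered = set()
    for _ in range(min(k, len(sets))):
        best_name, best_gain = None, 0
        for name in sorted(sets):
            if name in chosen:
                continue
            gain = len(sets[name] - covered)  # marginal gain of this set
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:  # nothing adds new elements; stop early
            break
        chosen.append(best_name)
        covered |= sets[best_name]
    return chosen, covered

# Hypothetical example collection (illustrative data only).
example = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {5, 6},
    "D": {1, 6},
}
picked, covered = greedy_max_cover(example, k=2)
```

Here the first round picks `A` (gain 4) and the second picks `C` (gain 2), covering all six elements. In general Greedy is not guaranteed to find the optimum, only to be within the $1-1/e$ factor.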