Some highlights from this year’s RECOMB.

- M. Madan Babu gave a nice talk on disordered proteins. He did a good job of describing the prevalent paradigm (sequence -> structure -> function) and then presenting a different framework (sequence -> disorder -> function). Even though this is an oversimplification and I don’t remember all the details of the talk, the general concepts are intriguing and memorable.
- Bonnie Berger also gave a nice keynote on compressive genomics. It ticked all the boxes of a good talk: one single, clear conceptual idea, repeated until it hits home, and lots of different applications.

For example, let x = [1,2]. If we say y = x, then x and y point to the same memory that holds [1,2]. x.append(3) acts on this memory and changes both x and y to [1,2,3]. However, if we then set x = [1,2,3], x points at a different piece of memory holding a new [1,2,3] while y still points to the original list, so the variables become disjoint. If the RHS of ‘=’ is a variable, then the LHS is pointed at the memory associated with the RHS. If the RHS of ‘=’ is a concrete object like a string, list, or dict, then a new piece of memory is created and the LHS variable now points to it. Likewise, if the RHS is var + something, a new piece of memory is created. Variables are not tied to other variables; they only point to specific memory.
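A minimal sketch of this aliasing behavior (plain Python, nothing beyond the rules above):

```python
x = [1, 2]
y = x                  # y now points to the same memory as x
x.append(3)            # mutates the shared list
print(x, y)            # [1, 2, 3] [1, 2, 3]

x = [1, 2, 3]          # RHS is a concrete object: new memory, x is rebound
x.append(4)            # no longer affects y
print(x, y)            # [1, 2, 3, 4] [1, 2, 3]

a = [1, 2]
b = a
print(a is b)          # True: both names point to the same memory
b = a + []             # RHS is var + something: new memory is created
print(a is b, a == b)  # False True: equal contents, different memory
```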

The standard form of an SDP is $\max_X \; C \bullet X$ such that $A_i \bullet X = b_i$ and $X \succeq 0$.

Here $C$, $X$, and the $A_i$ are all matrices, and the operation $\bullet$ corresponds to entry-wise multiplication followed by a sum: $C \bullet X = \sum_{ij} C_{ij} X_{ij}$. An equivalent formulation of SDP is to write $X_{ij} = v_i^T v_j$ and solve $\max \sum_{ij} C_{ij} \, v_i^T v_j$ such that $\sum_{ij} (A_k)_{ij} \, v_i^T v_j = b_k$. So we can either view it as *linear* programming with matrices and a funny PSD constraint, or as *quadratic* programming where the variables are vectors. Note that PSD matrices form a convex cone in $\mathbb{R}^{n \times n}$. If we simply treat it as a generic convex optimization problem, then everything scales with the $n^2$ entries of $X$ and is not efficient.

**MAX-CUT.** Given a graph with weight $w_{ij}$ on edge $e_{ij}$, the goal is to partition the vertices into two sets so that the total weight of the crossing edges is maximized. This is an NP-hard problem and can be formulated as an integer program. Assign a binary variable $x_i \in \{+1, -1\}$ to each vertex $i$ such that $x_i = +1$ if $i$ belongs to the left partition and $x_i = -1$ otherwise. Then MAX-CUT is equivalent to:

$$\max_x \; \frac{1}{2} \sum_{(i,j) \in E} w_{ij} (1 - x_i x_j) \quad \text{such that } x_i^2 = 1 \text{ for all } i.$$

As a relaxation, let each $x_i$ be a vector $v_i \in \mathbb{R}^n$, and solve the following SDP:

$$\max_v \; \frac{1}{2} \sum_{(i,j) \in E} w_{ij} (1 - v_i^T v_j) \quad \text{such that } \|v_i\| = 1 \text{ for all } i.$$

The solutions are vectors on the unit sphere. We think of the vector $v_i$ as representing the node $i$ in some $n$-dim feature space and $v_i^T v_j$ as measuring the similarity between the two points. *Is there a way to interpret these features?*

To derive a hard partition, we randomly select a hyperplane through the origin and put all the vectors on one side of the hyperplane into one partition. The neat thing is that the probability of $v_i$ and $v_j$ being separated by the hyperplane is $\theta_{ij}/\pi$, where $\theta_{ij}$ is the angle between them, and this is quite close to the corresponding SDP term $\frac{1}{2}(1 - v_i^T v_j)$: the ratio is always at least $0.878$. So the expected cut from the randomized rounding is close to the optimal value of the SDP, which is guaranteed to be at least the maximum MAX-CUT value. Truly an amazing algorithm.
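A small numpy check of the rounding step. Solving the SDP itself needs a dedicated solver, so here I just take two unit vectors at a known angle and verify by Monte Carlo that a random hyperplane separates them with probability $\theta/\pi$; the vectors and trial count are my own choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# two unit vectors at angle theta in the plane
theta = 2 * np.pi / 3
V = np.array([[1.0, 0.0],
              [np.cos(theta), np.sin(theta)]])

# rounding step: draw a random hyperplane normal r per trial,
# assign each vector to a side according to sign(v . r)
trials = 200_000
R = rng.standard_normal((trials, 2))      # one random normal per trial
sides = np.sign(V @ R.T)                  # shape (2, trials), entries +/-1
sep_prob = np.mean(sides[0] != sides[1])  # fraction of trials separating them

print(sep_prob)        # close to theta / pi = 2/3
print(theta / np.pi)
```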


A few other observations:

1. Smartphones everywhere. More people use them than in the U.S., and they seemed to be of high quality too! In stark contrast with the slow and sparse internet connections for desktops.

2. Amazing transportation infrastructure: shiny new train stations, airports, highways and high-speed rail that shrinks the country.

3. Over-construction. This is especially severe in mid-tier cities like Fuzhou and in smaller rural areas, where colonies of tall, clonal apartment complexes have sprouted in the midst of farmland and deserted countryside.

**Neural word embedding as implicit matrix factorization.** word2vec (skip-gram with negative sampling) is implicitly factorizing a shifted PMI word-by-context matrix, so it is closely related to an SVD of that matrix. The negative-sampling algorithm is computationally more efficient than computing the SVD. Some of the preprocessing, e.g. eliminating rare words, improves the results.
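The matrix view can be illustrated schematically: build a positive PMI word-by-context matrix from a toy corpus and factorize it with SVD. The corpus, window size, and dimension below are my own toy choices, not the paper's setup:

```python
import numpy as np
from collections import Counter

# toy corpus and context window (purely illustrative)
corpus = "the cat sat on the mat the dog sat on the rug".split()
window = 1

vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
n = len(vocab)

# word-by-context co-occurrence counts within the window
C = np.zeros((n, n))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            C[idx[w], idx[corpus[j]]] += 1

# positive PMI matrix; with negative-sampling shift k=1 this is plain PMI
total = C.sum()
pw = C.sum(axis=1, keepdims=True) / total
pc = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((C / total) / (pw * pc))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# rank-d truncated SVD; rows of U * sqrt(S) serve as word embeddings
U, S, Vt = np.linalg.svd(ppmi)
d = 2
emb = U[:, :d] * np.sqrt(S[:d])
print({w: emb[idx[w]].round(2) for w in vocab})
```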

**Large scale canonical correlation analysis with iterative least squares.** A generalized dual power iteration method for computing CCA. Iteratively runs regressions of the two datasets onto lower-dimensional subspaces.

**Hamming ball auxiliary sampling for FHMM**. Gibbs sampling for FHMM samples one chain at a time while fixing all the other chains, which mixes very slowly. The paper introduces auxiliary variables (one for each latent state), so that each column of latent variables is sampled uniformly on a Hamming ball around the vector of auxiliary variables. This speeds up mixing.

**Talks**

**Privacy in the land of plenty.** Dwork. Add noise to preserve differential privacy. The noise only depends on the statistical algorithm you want to run and does not scale with the size of the data, so the larger the dataset, the more tolerable the noise. There is a natural connection between privacy and the generalization performance of an algorithm: add noise to the training data.

**Randomized experimental design for causal graph discovery.**

**Games, networks, and people**. Kearns. Interesting experiments where people occupy nodes on a graph and perform actions. In the biased consensus experiment, a minority of hubs seems to win over a majority of leaves.

**Constrained Bayesian Optimization**. Gramacy. Use augmented Lagrangian formulation of constrained optimization. Apply Bayesian Opt to each term of the Lagrangian separately. Seems to work for a very low dimensional toy problem.

**cuDNN**

**Po-Ling Loh**. Convex loss function and non-convex regularizer. One can prove that the convex term dominates at large scales and the sum is only non-convex near the global optimum. Therefore all the local optima are close to the global optimum in parameter space.

**Jure**. Stanford large networks collection. RINGO: a simple data structure for graphs. A ~$30k machine with a terabyte of memory.

**Guestrin**. GraphLab Create. A Python system to work with tables and networks. Seems similar to pandas but more optimized for large data, with additional network-analysis primitives. Columnar encoding is better for compression for some reason. SFrame. SGraph. Maintains locality in edge traversal.

**IBM**. HIV prediction using EMR.

**Brudno**. Deep phenotyping using patient information + model organisms + gene knockouts. Phenotips. Human phenotype ontology. Create patient databases to reduce diagnostic odyssey of rare diseases.

**Bengio**. Underfitting is the main problem for DNNs because the model capacity is so large. Curriculum learning (scheduling the training examples) makes optimization easier? How do humans get around saddle points?

**Chris Moore**. Phase transition in stochastic block model: both information-theoretic and in terms of the convergence time of Belief Propagation.

**Kevin Murphy**. Knowledge vault and procedure learning in recipes.


- Typing. For numpy arrays, if we have an int array, then adding a float to an element of the array will automatically be truncated back to an int. Operations on individual elements of the array are cast to the original dtype of the array; each np.array has exactly one dtype for all of its elements. So for calculations, always initialize as floats.
- It can be confusing whether two variables point to the same data or to different data. To be safe, initialize the variable with a fresh object, e.g. temp = [0], so that it owns its own memory.
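A quick demonstration of the casting behavior:

```python
import numpy as np

a = np.array([1, 2, 3])       # dtype is inferred as an integer type
a[0] = a[0] + 0.7             # the float 1.7 is cast back to int: truncated to 1
print(a)                      # [1 2 3] -- the fractional part silently vanished
print(a.dtype)

b = np.array([1, 2, 3], dtype=float)   # initialize as floats instead
b[0] = b[0] + 0.7
print(b)                      # [1.7 2.  3. ]
```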

Here are some recent ideas:

1. Brain computer interface. By recording the activity of several dozen neurons in our brain as we perform a specific task (think picking up a cup of water and bringing it to our lips), we can train a mapping from physical activity to neural signals. We can invert this mapping, so that given a set of brain signals, we can infer what activity it corresponds to. This is useful for people with physical disabilities, for example: if they can think, “pick up the cup”, and we can decipher it, then we can pick up the cup for them. This works to some extent, but not very well. The reason is somatosensory perception, which, as far as I can tell, is the idea that we are aware of our body. Touch is an example of somatosensation, and apparently normal individuals who have temporarily lost touch have a hard time picking up small objects even if their motor system is not impaired. The idea is that a brain computer interface without feedback to the brain that somehow mimics somatosensation will be inaccurate. OK, so let’s build better neural UIs by stimulating some neurons to imitate such feedback.

2. EPAS1 gene and high altitude adaptation. One reason that Tibetans are well adapted to high altitude is several mutations in the EPAS1 gene, which is clearly under strong positive selection. Was it selection on a de novo mutation or on standing variation? It actually turned out to be an introgression at EPAS1 from Denisovans that was present in the initial population and then spread due to selection. A clean (and very rare) example of beneficial archaic gene flow into a human population. Very cool!

3. Godel’s incompleteness theorem, as told by the incomparable Christos. My favorite Simons lunch! “This statement is unprovable.” This is the type of self-referential statement that drives logicians up the wall. If it is true, then the logical system is incomplete: there is a true statement with no proof. If it is false, then it is provable, so we can “prove” a false statement and the framework is inconsistent. Hmmm. So to show that a mathematical system, say integer arithmetic, is incomplete, we want to encode this self-referential statement in mathematical language. To do this rigorously, the key idea is arithmetization. We can assign each symbol a number, so that each logical statement (e.g. for all x there exists y such that x = 2y or x = 2y + 1) corresponds to a number. Moreover each proof, which is just a string of logical statements, is also just a number! Then we can encode a self-referential statement as a number.

4. Grobner basis. Given a multivariate polynomial f and a set of polynomials g1, g2, …, is f in the ideal generated by {g’s}? This is equivalent to asking whether we can write f as a linear combination of the g’s where the coefficients are other polynomials. The straightforward thing to do is to take f, divide by g1 as much as possible, then divide the remainder by g2, and so on. If in the end the remainder is 0, then f is in the ideal. The problem is that the outcome depends on the order of the g’s, so we might need to try all the orderings before conclusively deciding whether f is in the ideal or not! Let I = <g’s> be the ideal. A Grobner basis of I is a set of polynomials h in I such that <leadterm(h)> = <leadterm(I)>. Moreover, <h> = I. These h’s form a nice basis of I, meaning that the order of the h’s does not matter in the division, so we can test for ideal membership easily. In the worst case it takes doubly exponential time to find a Grobner basis (which is not unique), but it works well in practice.
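The arithmetization step in item 3 can be made concrete with a toy encoding. The symbol table and two-digit scheme below are my own invention, not Godel's actual numbering:

```python
# toy Godel-style numbering: each symbol gets a fixed two-digit code
SYMBOLS = ["∀", "∃", "x", "y", "=", "+", "0", "1", "(", ")"]
CODE = {s: str(i + 10) for i, s in enumerate(SYMBOLS)}   # "∀" -> "10", ...
DECODE = {v: k for k, v in CODE.items()}

def encode(statement):
    """A statement (string of symbols) becomes one integer."""
    return int("".join(CODE[s] for s in statement))

def decode(number):
    digits = str(number)
    return "".join(DECODE[digits[i:i + 2]] for i in range(0, len(digits), 2))

# a proof, being a string of statements, would concatenate into a number too
g = encode("∀x∃y(x=y+y)")
print(g)
print(decode(g))
```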
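The ideal-membership test in item 4 can be run with sympy's groebner tools; a sketch with arbitrary example polynomials of my own:

```python
from sympy import symbols, groebner, expand

x, y = symbols("x y")
g1, g2 = x**2 + y, x*y - 1

# Groebner basis of the ideal I = <g1, g2> under lex order
G = groebner([g1, g2], x, y, order="lex")

# f is built as x*g1 + y*g2, so f is in I by construction
f = expand(x*g1 + y*g2)
print(G.contains(f))        # True: division by G leaves remainder 0

print(G.contains(x + 1))    # False: x + 1 is not in I
```

G.contains reduces the polynomial by the basis and checks that the remainder is zero, so the order of the generators no longer matters.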


If two copies of a gene (or domain) are in close proximity to each other on the genome, then the probability for additional duplication in the region is greatly enhanced.

Suppose we have genes A and B in two different species. If A and B are related to each other via speciation, then they are orthologous. If they are related by duplication then they are paralogous.

Introns early vs Introns late debate. Introns early theory argues that introns were present in ancient genes as linkers between functional protein modules that are the exons. They are lost in prokaryotes. Introns late theory says that introns were integrated into functional genes only after the emergence of endosymbiotic process and eukaryotic cells.

Both films deal with confused sexuality, loneliness in modern society, and friendships/relationships that sadly but unavoidably run their course. Both explore these issues from the perspective of a couple of women. The two films also exemplify the differences between American and European cinema.

Monster is shocking in its violence and brutality. It may be based on a true story, but the film is undeniably a freak show, a circus act, a proverbial train crash that we know cannot be real but we can’t look away from. The characters, Aileen in particular, are victims of circumstance, of both direct social forces and indirect mental disorders caused by abuse. There’s a sense of inevitability in her march towards self-destruction; she has little choice.

Blue is the Warmest Color is a much more subtle, artistic and, ultimately, more interesting film. Even as a 15-year-old student, Adele straddles the edge of society. Most of the time, she is a very normal, popular girl at school. But at times, her eyes glaze over and there’s a hollowness behind them (similar to Matteo’s eyes). She takes comfort in food, sleep and the other basics of life, but there’s little joy in it. She comes to life in moments with friends, especially when they are marching in protest or dancing at picnics. She cries often and we don’t know why. Perhaps Adele doesn’t know herself. Her relationship with Emma is driven by raw physical attraction, but they have different outlooks on life. Emma is in the art world, is intellectual and comes from a sophisticated family. At times, she talks like a feminist artist pamphlet.

Adele likes art but not at an intellectual level. She is content to be a physical muse and a cook and server at parties. Throughout the film, she looks like a high school student (Adele was 18-19 at the time of the filming); a girl in the world of grownups. Even though at least six years pass during the movie, she never ages. She’s never comfortable in the grownup world of artists and intellectuals, but she’s troubled and has lost some of the innocence of childhood. She enjoys working with kindergarten kids but otherwise drifts without ambition. Is she happy?

Emma loses the wildness of her student days (symbolized by the dyed blue hair) and they inevitably drift apart. The film is less a love story and more the confusing coming-of-age of one girl. What does it mean to fall in love beyond physical attraction? Adele’s sexual ambiguity highlights this question. Her interactions with Thomas (her first boyfriend) are just as natural and compatible as her interactions with Emma.

The film has several recurring motifs: the color blue, close-up shots of Adele sleeping with her mouth open, voluptuous eating, Adele fussing with her hair. These motifs stabilize the movie and give the viewers a constant stream of familiar handles. They are also quirky tics that make the characters much more real.

Monster: 4/5

Blue is the warmest color: 5/5


**Basic numbers of gene duplication.** Duplication rates have been measured per gene per generation in both yeast and worm. The rate at which duplicated genes are fixed in eukaryotes is 0.01 per gene per million years, suggesting that the vast majority of duplications do not reach fixation. Humans and chimps gain and lose genes (0.004 gain and loss/gene/my) faster than other primates (0.0024 gene/my), and almost 3X faster than non-primate mammals (0.0014 gene/my).

**Mechanisms of gene duplication.** There are three main mechanisms: unequal cross-over, retroposition, and chromosomal/genome duplication. Unequal cross-over is often caused by similar stretches of DNA sequences, including transposable elements, that accidentally recombine.

**Effects of gene duplication.** The majority of duplicated genes quickly decay into pseudogenes. Of the functional ones, there are three main fates:

- Amplification. More expression the better.
- Neofunctionalization. One copy retains the original function and the other diverges to a new function through drift and selection.
- Subfunctionalization. The original gene might be performing multiple jobs (pleiotropy), and the new copies each perform a subset of the jobs better than before.

It seems that subfunctionalization is particularly (maybe the most) common. Duplicate gene expression patterns tend to diverge quickly after the duplication event. OR (olfactory receptor) genes are a good example of this. However, many duplicate genes (even very old ones) are functionally redundant to some extent, as observed in deletion studies.

**Other genomic innovations.** **Micro-RNAs** have expanded and functionally diversified via gene duplication. **RNA-based duplications** involve retroposition of mRNA back into the genome and result in “stripped-down” genes devoid of introns. Thousands of retrocopies and >100 functional retrogenes have been identified in the human genome. **De novo emergence** of genes is rare but has been observed in open reading frames. New genes may also arise from domesticated genomic parasites such as retroviruses and transposons. Many newly formed genes tend to be specifically expressed in the **testis**. The testis is the most rapidly evolving organ, and hence may exert especially strong selection for new genes.