## RECOMB 2015

I attended my first RECOMB over the last few days. It was an excellent conference. There was some discussion about the pros and cons of single-track versus multi-track conferences. Overall I prefer the single-track format, like RECOMB’s. It’s good for the community to have everyone in the same room, and it forces the speakers to give talks aimed at a general audience. The speakers get broader exposure for their work, and the audience gets to hear more diverse topics. It’s also good for the general morale of the conference for everyone to have the same reference points.

Some highlights from this year’s RECOMB.

• Madan Babu gave a nice talk on disordered proteins. He did a good job of describing the prevalent paradigm (sequence -> structure -> function) and then presenting a different framework (sequence -> disorder -> function). Even though this is an oversimplification and I don’t remember all the details of the talk, the general concepts are intriguing and memorable.
• Bonnie Berger also gave a nice keynote on compressive genomics. It had all the hallmarks of a good talk: one single, clear conceptual idea, repeated until it hits home, with lots of different applications.

Python is very convenient because we don’t have to declare a type for each variable, but this can also lead to sloppy mistakes. A common mistake occurs when two variables point to the same memory, so that operations on one variable also change the value of the other. This is not a problem for ‘simple’ immutable objects like floats or strings, where x = y behaves as if we had written x = copy.deepcopy(y), but it is an issue for mutable data structures like lists and dicts.

For example, let x = [1,2]. If we say y = x, then x and y point to the same memory holding [1,2]. x.append(3) acts on this memory and changes both x and y to [1,2,3]. However, if we then set x = [1,2,3], x points at a new piece of memory holding [1,2,3] while y still points at the original list, so the variables become disjoint. If the RHS of ‘=’ is a variable, then the LHS is pointed at the memory associated with the RHS. If the RHS of ‘=’ is a literal object like a string, list, or dict, then a new piece of memory is allocated and the LHS variable points to it. Similarly, if the RHS is var + something, a new piece of memory is created. Variables are not tied to other variables; they only point to specific memory.
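
The aliasing rules above can be checked directly in the interpreter:

```python
import copy

x = [1, 2]
y = x                   # y points at the same list object as x
x.append(3)             # mutates the shared memory
assert y == [1, 2, 3]

x = [1, 2, 3]           # rebinding x creates a new list; y is unaffected
x.append(4)
assert y == [1, 2, 3]

z = copy.deepcopy(y)    # an explicit copy breaks the link
y.append(5)
assert z == [1, 2, 3] and y == [1, 2, 3, 5]
```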

## SDP and max-cut

Semi-definite programming generalizes linear programming. One standard definition of SDP is the following:

$\min C \bullet X$ such that $A_i \bullet X = b_i, \forall i$ and $X \succeq 0$.

Here $C$, $X$, and the $A_i$ are symmetric matrices, the $b_i$ are scalars, and $A \bullet B = \sum_{i,j} a_{ij} b_{ij}$ is the entry-wise (Frobenius) inner product. An equivalent formulation of SDP is $\min \sum_{i,j} c_{ij}\, x_i \cdot x_j$ such that $\sum_{j,k} a_{ijk}\, x_j \cdot x_k = b_i$, where the variables $x_i$ are vectors. So we can either view SDP as linear programming over matrices with a funny PSD constraint, or as quadratic programming where the variables are vectors. Note that the PSD matrices $X \succeq 0$ form a convex cone in $\mathbb{R}^{n^2}$. If we simply treat SDP as a generic convex optimization problem, then everything scales as $n^2$, which is not efficient.
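
As a sanity check, here is the $\bullet$ operation on two small made-up matrices, computed entry by entry:

```python
def bullet(C, X):
    # C • X = sum_ij c_ij * x_ij, the entry-wise (Frobenius) inner product
    return sum(c * x for Crow, Xrow in zip(C, X) for c, x in zip(Crow, Xrow))

C = [[1, 0],
     [0, 2]]
X = [[3, 1],
     [1, 4]]
val = bullet(C, X)   # 1*3 + 0*1 + 0*1 + 2*4 = 11
```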

MAX-CUT. Given a graph $G = (V, E)$ with weight $w_{ij}$ on edge $e_{ij}$, the goal is to partition the vertices into two sets so that the total weight of the crossing edges is maximized. This is an NP-hard problem and can be formulated as an integer program. Assign a variable $x_i$ to each vertex such that $x_i = 1$ if $i$ belongs to the left partition and $x_i = -1$ otherwise. Then MAX-CUT is equivalent to:

$\max \frac{1}{2} \sum_{i<j} w_{ij} (1 - x_i x_j)$ such that $x_i^2 = 1, \forall i$.
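
On a small graph the integer program can be solved by brute force over all sign assignments. Here is a sketch on a hypothetical weighted triangle; the factor of $1/2$ makes each crossing edge count its weight exactly once:

```python
from itertools import product

# Hypothetical triangle: edge weights w_01 = 1, w_12 = 2, w_02 = 1.
w = {(0, 1): 1.0, (1, 2): 2.0, (0, 2): 1.0}

def cut_value(x):
    # Sum over edges of w_ij * (1 - x_i x_j) / 2: crossing edges
    # (x_i != x_j) contribute w_ij, non-crossing edges contribute 0.
    return sum(wij * (1 - x[i] * x[j]) / 2 for (i, j), wij in w.items())

# Enumerate all 2^3 assignments of +/-1 to the three vertices.
best = max(cut_value(x) for x in product([-1, 1], repeat=3))
```

The best cut here puts vertex 1 on its own side, cutting the two edges incident to it.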

As a relaxation, let $x_i$ be a vector in $\mathbb{R}^n$, and solve the following SDP:

$\min \sum_{i,j} w_{ij}\, x_i \cdot x_j$ such that $x_i \cdot x_i = 1$. The solutions $x_i$ are vectors on the unit sphere. We can think of the vector $x_i$ as representing node $i$ in some $n$-dimensional feature space and $x_i \cdot x_j$ as measuring the similarity between the two nodes. Is there a way to interpret these features?

To derive a hard partition, we randomly select a hyperplane through the origin and put all the vectors on one side of the hyperplane into one partition. The neat thing is that the probability of $x_i, x_j$ being separated by the hyperplane is $\arccos(x_i \cdot x_j)/\pi$, which is at least $0.878 \times \frac{1}{2}(1 - x_i \cdot x_j)$. So in expectation the randomized rounding recovers at least 0.878 of the SDP optimum, which in turn is an upper bound on the true MAX-CUT value. Truly an amazing algorithm.
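
The separation probability can be checked by simulation: a uniformly random hyperplane through the origin separates two unit vectors at angle $\theta$ with probability $\theta/\pi$. The vectors and trial count below are arbitrary choices for illustration.

```python
import math
import random

def separated_frequency(u, v, trials=20000, seed=0):
    # Estimate the probability that a random hyperplane (with a
    # Gaussian normal vector r) puts u and v on opposite sides.
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        r = [rng.gauss(0, 1) for _ in u]
        side_u = sum(ri * ui for ri, ui in zip(r, u)) > 0
        side_v = sum(ri * vi for ri, vi in zip(r, v)) > 0
        count += side_u != side_v
    return count / trials

# Unit vectors at 60 degrees: x_i . x_j = 0.5, so the exact
# separation probability is arccos(0.5)/pi = 1/3.
u = [1.0, 0.0]
v = [0.5, math.sqrt(3) / 2]
freq = separated_frequency(u, v)
```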

## China 2014

I traveled in China from 12/28/2014 to 1/6/2015. With Yvonne and my parents, we went on a barnstorming tour through the Longmen Grottoes, Xi’an, Fuzhou, Beijing and Tianjin. Highlights include: the headless, defaced stone statues at Longmen; the Terracotta Warriors, which for all their glory are only guarding the outermost periphery of the emperor’s tomb; the amazingly preserved city wall of Xi’an; the Forbidden City, which is somehow completely different from my memory; the Muslim district in Xi’an and its amazing street food.

A few other observations:

1. Smartphones everywhere. More people use them than in the U.S., and they seemed to be of high quality too! In stark contrast with the slow and sparse internet connections for desktops.

2. Amazing transportation infrastructure: shiny new train stations, airports, highways and high-speed rail that shrinks the country.

3. Over-construction. This is especially severe in mid-tier cities like Fuzhou and in smaller towns, where colonies of tall, clonal apartment complexes have sprouted in the midst of farmland and deserted countryside.

## nips 2014

Posters

Neural word embedding as implicit matrix factorization. word2vec with negative sampling implicitly factorizes a (shifted) PMI word-by-context matrix. The negative-sampling algorithm is computationally more efficient than an explicit SVD. Some of the preprocessing, e.g. eliminating rare words, improves the results.

Large scale canonical correlation analysis with iterative least squares. A generalized dual power iteration method for computing CCA. Iteratively running regressions of the two datasets onto lower dimensions.

Hamming ball auxiliary sampling for FHMM. Gibbs sampling for FHMMs updates one chain at a time while fixing all the others, which mixes very slowly. The paper introduces auxiliary variables (one for each latent state), so that each column of latent variables is sampled uniformly on a Hamming ball around the vector of auxiliary variables. Speeds up mixing.

Talks

Privacy in the land of plenty. Dwork. Add noise to preserve differential privacy. The noise only depends on the statistical algorithm you want to run and does not scale with the size of the data, so the larger the dataset the more tolerable the noise is. Natural connection between privacy and the generalization performance of the algorithm. Add noise to training data.
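
A minimal sketch of the noise-addition idea, using the standard Laplace mechanism for a counting query (the query value and parameters below are made up): the noise scale is sensitivity/ε, independent of the dataset size, so the relative error shrinks as the data grows.

```python
import random

def laplace_mechanism(value, sensitivity, epsilon, rng):
    # Release value + Laplace(0, sensitivity/epsilon); this is
    # epsilon-differentially private for a query with the given L1
    # sensitivity. A Laplace draw is the difference of two
    # independent Exp(1) variables scaled by the noise scale.
    scale = sensitivity / epsilon
    return value + scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

rng = random.Random(0)
# A counting query (sensitivity 1) with true answer 42, at epsilon = 1.
noisy = laplace_mechanism(42, 1, 1.0, rng)
# Averaging many independent releases recovers the true answer,
# which is why a single release is still accurate on large data.
avg = sum(laplace_mechanism(42, 1, 1.0, rng) for _ in range(20000)) / 20000
```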

Randomized experimental design for causal graph discovery.

Games, networks, and people. Kearns. Interesting experiments where people occupy nodes on a graph and perform actions. In the biased consensus experiment, a minority of hubs seems to win over a majority of leaves.

Constrained Bayesian Optimization. Gramacy. Use augmented Lagrangian formulation of constrained optimization. Apply Bayesian Opt to each term of the Lagrangian separately. Seems to work for a very low dimensional toy problem.

cuDNN

Po-Ling Loh. Convex loss function and non-convex regularizer. Can prove that the convex term dominates at large scales and the sum is only non-convex near the global optimum. Therefore all the local optima are close to the global optimum in parameter space.

Jure. Stanford large networks collection. RINGO: a simple data structure for graphs. \$30k terabyte memory machine.

Guestrin. Graphlab Create. A python system to work with tables and networks. Seems similar to pandas but more optimized for large data and with additional network analysis primitives. Columnar encoding is better for compression for some reason. SFrame. SGraph. Maintains locality in edge traversal.

IBM. HIV prediction using EMR.

Brudno. Deep phenotyping using patient information + model organisms + gene knockouts. Phenotips. Human phenotype ontology. Create patient databases to reduce diagnostic odyssey of rare diseases.

Bengio. Underfitting is the main problem for DNNs because the model capacity is so large. Curriculum learning (scheduling the training examples) makes optimization easier? How do humans get around saddle points?

Chris Moore. Phase transition in stochastic block model: both information-theoretic and in terms of the convergence time of Belief Propagation.

Kevin Murphy. Knowledge vault and procedure learning in recipes.

Python is useful for many things, but in trying to be simple it’s also terrible in several ways.

• Typing. Each np.array has exactly one dtype for all of its elements. For an int array, assigning a float to an element silently truncates it toward zero; operations on individual elements are cast back to the array’s original dtype. For numerical calculation, always initialize arrays as floats.
• It can be confusing whether two variables point to the same data or to different data. When in doubt, initialize a fresh object (e.g. temp = [0]) or make an explicit copy.
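
A small demonstration of the dtype pitfall (requires numpy):

```python
import numpy as np

a = np.array([1, 2, 3])        # integer dtype inferred from the literals
a[0] = 2.9                      # silently truncated toward zero
# a[0] is now 2, not 2.9

b = np.array([1.0, 2.0, 3.0])  # initializing with floats avoids this
b[0] = 2.9                      # stored exactly
```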

## amazing ideas

Talk is cheap; ideas are valuable. There are so many talks at Berkeley, and at Simons in particular, and most of them are forgotten within a few weeks. The best talks convey one fresh and engaging idea. It is such ideas that sustain new research. They are rare, and we can spend days and weeks foraging at conferences before encountering a new one. When we do, they should be treasured.

Here are some recent ideas:

1. Brain-computer interface. By recording the activity of several dozen neurons in our brain as we perform a specific task (think picking up a cup of water and bringing it to our lips), we can train a mapping from physical activity to neural signals. We can invert this mapping, so that given a set of brain signals, we can infer what activity they correspond to. This is useful, for example, for people with physical disabilities. If they can think, “pick up the cup”, and we can decipher it, then we can pick up the cup for them. This works to some extent, but not very well. The reason is somatosensory perception, which, as far as I can tell, is the sense we have of our own body. Touch is an example of somatosensation, and apparently individuals who have temporarily lost the sense of touch have a hard time picking up small objects even if their motor system is not impaired. The idea is that a brain-computer interface without feedback to the brain that somehow mimics somatosensation will be inaccurate. Ok, so let’s build a better neural UI by stimulating some neurons to imitate such feedback.

2. EPAS1 gene and high-altitude adaptation. One reason that Tibetans are well adapted to high altitude is several mutations in the EPAS1 gene, which is clearly under strong positive selection. Is it selection on a de novo mutation or on standing variation? It actually turned out to be an introgression from Denisovans at EPAS1 that was present in the initial population and then spread due to selection. A clean (and very rare) example of beneficial archaic gene flow into a human population. Very cool!

3. Gödel’s incompleteness theorem, as told by the incomparable Christos. My favorite Simons lunch! “This statement is unprovable.” This is the type of self-referential statement that drives logicians up the wall. If it is true, then it is a true statement the system cannot prove, so the system is incomplete. If it is false, then it is provable, so the system “proves” a false statement and is inconsistent. Hmmm. So to show that a mathematical system, say integer arithmetic, is incomplete, we want to encode this self-referential statement in the language of arithmetic. To do this rigorously, the key idea is arithmetization. We can assign each symbol a number, so that each logical statement (e.g. for all x there exists y such that x = 2y or x = 2y + 1) corresponds to a number. Moreover each proof, which is just a string of logical statements, is also just a number! Then we can encode a self-referential statement as a number.
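
A toy version of arithmetization (the symbol table below is made up; any injective assignment works): encode a string of symbols as a product of prime powers, which is decodable thanks to unique factorization.

```python
# Hypothetical symbol codes for a tiny logical alphabet.
CODE = {'forall': 1, 'x': 2, 'exists': 3, 'y': 4, '=': 5, '+': 6}
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

def godel_number(symbols):
    # Position i gets the i-th prime, raised to the code of the symbol
    # there; unique factorization makes the encoding invertible.
    n = 1
    for p, s in zip(PRIMES, symbols):
        n *= p ** CODE[s]
    return n

n = godel_number(['forall', 'x'])   # 2**1 * 3**2 = 18
```

A proof, being a sequence of statements, can be encoded the same way one level up: each statement's number becomes the exponent of a prime.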

4. Gröbner basis. Given a multivariate polynomial f and a set of polynomials g1, g2, …, is f in the ideal generated by the g’s? This is equivalent to asking whether we can write f as a combination of the g’s with polynomial coefficients. The straightforward thing to do is to divide f by g1 as much as possible, then divide the remainder by g2, and so on. If the final remainder is 0, then f is in the ideal. The problem is that the outcome depends on the order of the g’s, so we might need to try all the orderings before conclusively deciding whether f is in the ideal or not! Let I = <g’s> be the ideal. A Gröbner basis of I is a set of polynomials h1, h2, … in I such that the leading terms of the h’s generate the ideal of leading terms of I; moreover, <h’s> = I. The h’s form a nice basis of I in the sense that the order of division does not matter, so we can test for ideal membership easily. In the worst case, finding a Gröbner basis (which is not unique) takes doubly exponential time, but it works well in practice.