Neural word embedding as implicit matrix factorization. word2vec with skip-gram negative sampling implicitly factorizes a shifted PMI word-by-context matrix, which connects it to SVD of that matrix. Negative sampling is computationally more efficient than explicit SVD. Some of the preprocessing, e.g. eliminating rare words, improves the results.
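A toy sketch of the SVD-of-PMI side of this equivalence (the corpus, window size, and embedding dimension here are made up for illustration):

```python
# Build a word-context count matrix from a tiny corpus, form positive PMI,
# and take a truncated SVD to get dense word embeddings.
import numpy as np

corpus = ["the cat sat on the mat".split(),
          "the dog sat on the rug".split()]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts within a +/-1 word window.
C = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 1), min(len(sent), i + 2)):
            if j != i:
                C[idx[w], idx[sent[j]]] += 1

# Positive PMI: max(log P(w,c) / (P(w) P(c)), 0).
total = C.sum()
Pw = C.sum(axis=1, keepdims=True) / total
Pc = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    pmi = np.log((C / total) / (Pw * Pc))
ppmi = np.maximum(pmi, 0.0)

# Rank-d SVD gives d-dimensional embeddings.
U, S, Vt = np.linalg.svd(ppmi)
d = 2
embeddings = U[:, :d] * np.sqrt(S[:d])
print(embeddings.shape)  # (vocab size, d)
```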
Large scale canonical correlation analysis with iterative least squares. A generalized dual power iteration method for computing CCA: iteratively regress each dataset onto the other's current low-dimensional projection.
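A minimal sketch of the alternating-regression idea for the top canonical pair (synthetic two-view data, single direction; the actual algorithm handles many directions and large scale):

```python
# Alternately regress one view's projection onto the other view,
# normalizing each iterate -- a power-iteration-style route to CCA.
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 500, 5, 4
z = rng.normal(size=(n, 1))                           # shared latent signal
X = z @ rng.normal(size=(1, p)) + 0.1 * rng.normal(size=(n, p))
Y = z @ rng.normal(size=(1, q)) + 0.1 * rng.normal(size=(n, q))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

u = rng.normal(size=p)
for _ in range(50):
    # Regress X's current projection onto Y, then Y's onto X.
    v = np.linalg.lstsq(Y, X @ u, rcond=None)[0]
    v /= np.linalg.norm(Y @ v)
    u = np.linalg.lstsq(X, Y @ v, rcond=None)[0]
    u /= np.linalg.norm(X @ u)

corr = float((X @ u) @ (Y @ v))  # both projections are unit-norm
print(round(corr, 3))            # close to 1 for strongly shared signal
```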
Hamming ball auxiliary sampling for FHMM. Gibbs sampling for FHMMs updates one chain at a time while fixing all the others, which mixes very slowly. The paper introduces auxiliary variables so that each column of latent variables is resampled jointly, restricted to the Hamming ball around an auxiliary vector sampled uniformly from the ball around the current column. Speeds up mixing.
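A sketch of the auxiliary step: drawing a vector uniformly from the Hamming ball of radius m around the current column x, over an alphabet of size S (the radius, alphabet, and example column here are illustrative, not from the paper):

```python
# Uniform sampling from a Hamming ball: pick the distance r with probability
# proportional to the number of vectors at that distance, then perturb r
# uniformly chosen positions to uniformly chosen different symbols.
import math
import random

def sample_hamming_ball(x, m, S, rng=random):
    K = len(x)
    # Number of vectors at Hamming distance exactly r from x.
    counts = [math.comb(K, r) * (S - 1) ** r for r in range(m + 1)]
    r = rng.choices(range(m + 1), weights=counts)[0]
    u = list(x)
    for i in rng.sample(range(K), r):
        u[i] = rng.choice([s for s in range(S) if s != x[i]])
    return u

x = [0, 1, 2, 0]
u = sample_hamming_ball(x, m=2, S=3)
print(u)
```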
Privacy in the land of plenty. Dwork. Add noise to preserve differential privacy. The noise scale depends only on the query's sensitivity and the privacy budget, not on the size of the data, so the larger the dataset the more tolerable the noise. Natural connection between privacy and the generalization performance of the algorithm. Add noise to training data.
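The "noise does not scale with data size" point, sketched with the standard Laplace mechanism on a counting query (the epsilon and datasets are illustrative):

```python
# A counting query has sensitivity 1 (changing one record moves the count by
# at most 1), so the Laplace noise scale 1/epsilon is fixed -- the relative
# error shrinks as the dataset grows.
import numpy as np

rng = np.random.default_rng(0)

def private_count(data, predicate, epsilon):
    true_count = sum(predicate(x) for x in data)
    return true_count + rng.laplace(scale=1.0 / epsilon)

small = rng.integers(0, 2, size=100)
large = rng.integers(0, 2, size=100_000)
for data in (small, large):
    noisy = private_count(data, lambda x: x == 1, epsilon=0.1)
    print(len(data), round(float(noisy), 1))  # same noise scale on both
```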
Randomized experimental design for causal graph discovery.
Games, networks, and people. Kearns. Interesting experiments where people occupy nodes on a graph and perform actions. In the biased consensus experiment, a minority of hubs seems to win over a majority of leaves.
Constrained Bayesian Optimization. Gramacy. Use augmented Lagrangian formulation of constrained optimization. Apply Bayesian Opt to each term of the Lagrangian separately. Seems to work for a very low dimensional toy problem.
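A minimal sketch of the augmented-Lagrangian outer loop on a toy 1-D problem, with random search standing in for the per-term surrogate modeling (the talk fits a separate model to each term of the Lagrangian; that machinery is omitted, and the problem is made up):

```python
# Minimize (x-1)^2 subject to x <= 0.5 via an augmented Lagrangian:
# inner minimization by random search, then multiplier/penalty updates.
import numpy as np

f = lambda x: (x - 1.0) ** 2        # objective: unconstrained min at x = 1
c = lambda x: x - 0.5               # inequality constraint c(x) <= 0

def AL(x, lam, rho):
    # Powell-Hestenes-Rockafellar form for a single inequality constraint.
    t = max(0.0, lam / rho + c(x))
    return f(x) + 0.5 * rho * (t * t - (lam / rho) ** 2)

rng = np.random.default_rng(0)
lam, rho, x = 0.0, 1.0, 0.0
for _ in range(20):
    cand = rng.uniform(-2.0, 2.0, size=500)
    x = float(min(cand, key=lambda z: AL(z, lam, rho)))
    lam = max(0.0, lam + rho * c(x))    # multiplier update
    if c(x) > 1e-3:
        rho *= 2.0                      # tighten penalty while infeasible
print(round(x, 2))  # near the constrained optimum x = 0.5
```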
Po-Ling Loh. Convex loss function with a non-convex regularizer. Can prove that the convex term dominates at large scales, so the sum is nonconvex only near the global optimum. Therefore all the local optima are close to the global optimum in parameter space.
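One way to see the "bounded nonconvexity" structure behind this, using the MCP regularizer as an example (the lambda/gamma values are illustrative): the penalty itself is nonconvex, but adding a small quadratic makes it convex, so a sufficiently strongly convex loss dominates.

```python
# MCP penalty: rho(t) = lam*|t| - t^2/(2*gamma) for |t| <= gamma*lam,
# and gamma*lam^2/2 beyond. Check numerically that rho is nonconvex
# while rho(t) + t^2/(2*gamma) is convex (nonnegative second differences).
import numpy as np

lam, gamma = 1.0, 3.0

def mcp(t):
    t = np.abs(t)
    return np.where(t <= gamma * lam,
                    lam * t - t ** 2 / (2 * gamma),
                    gamma * lam ** 2 / 2)

ts = np.linspace(-6, 6, 1001)
convexified = mcp(ts) + ts ** 2 / (2 * gamma)
second_diff = np.diff(convexified, 2)
print(second_diff.min() >= -1e-9)  # convex after adding the quadratic
```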
Jure. Stanford large networks collection. RINGO: a simple data structure for graphs. $30k terabyte memory machine.
Guestrin. GraphLab Create. A Python system to work with tables and networks. Seems similar to pandas but more optimized for large data and with additional network-analysis primitives. Columnar encoding compresses better because each column stores values of a single type, often with long runs and few distinct values. SFrame. SGraph. Maintains locality in edge traversal.
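A toy illustration of why columns compress well (not GraphLab internals): a single column's long runs collapse under run-length encoding.

```python
# Run-length encode a column: repeated values become (value, count) pairs.
def run_length_encode(column):
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

city_column = ["SF"] * 500 + ["NYC"] * 300 + ["LA"] * 200
print(run_length_encode(city_column))  # [['SF', 500], ['NYC', 300], ['LA', 200]]
```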
IBM. HIV prediction using EMR.
Brudno. Deep phenotyping using patient information + model organisms + gene knockouts. Phenotips. Human phenotype ontology. Create patient databases to reduce diagnostic odyssey of rare diseases.
Bengio. Underfitting is the main problem for DNNs because the model capacity is so large. Curriculum learning (scheduling the training examples) makes optimization easier? How do humans get around saddle points?
Chris Moore. Phase transition in stochastic block model: both information-theoretic and in terms of the convergence time of Belief Propagation.
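For the symmetric two-group case, the detectability threshold I believe the talk refers to (the Kesten-Stigum point, from memory of the Decelle-Krzakala-Moore-Zdeborova line of work) is, with average within- and between-group degrees $c_{\mathrm{in}}$ and $c_{\mathrm{out}}$:

```latex
(c_{\mathrm{in}} - c_{\mathrm{out}})^2 > 2\,(c_{\mathrm{in}} + c_{\mathrm{out}})
```

Below this, no algorithm can detect the communities better than chance; above it, belief propagation succeeds, but its convergence time diverges as the threshold is approached.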
Kevin Murphy. Knowledge vault and procedure learning in recipes.