Seminar | Johns Hopkins Algorithms and Complexity

[Theory Seminar] Xue Chen

Feb 27 @ 12:00 pm – 1:00 pm

Speaker: Xue Chen
Affiliation: Northwestern University

Title: Active Regression via Linear-Sample Sparsification

Abstract:

We present an approach that improves the sample complexity for a variety of curve fitting problems, including active learning for linear regression, polynomial regression, and continuous sparse Fourier transforms. In the active linear regression problem, one would like to estimate the least squares solution \beta^* minimizing ||X \beta – y||_2 given the entire unlabeled dataset X \in \R^{n \times d} but only observing a small number of labels y_i. We show that O(d/\eps) labels suffice to find an \eps-approximation \wt{\beta} to \beta^*:

E[||X \wt{\beta} – X\beta^*||_2^2] \leq \eps ||X \beta^* – y||_2^2.

This improves on the best previous result of O(d \log d + d/\eps) from leverage score sampling. We also present results for the inductive setting, showing when \wt{\beta} will generalize to fresh samples; these apply to continuous settings such as polynomial regression. Finally, we show how the techniques yield improved results for the non-linear sparse Fourier transform setting.

Bio: Xue Chen is broadly interested in randomized algorithms and the use of randomness in computation. Specific areas include Fourier transform, learning theory and optimization, and pseudorandomness. He obtained his Ph.D. at the University of Texas at Austin, under the supervision of David Zuckerman. Currently, he is a postdoctoral fellow in Northwestern University.

Mon

Rediet Abebe

Mar 4 @ 12:00 pm – 1:00 pm

Speaker: Rediet Abebe
Affiliation: Cornell University

Title: Using Search Queries to Understand Health Information Needs in Africa

Abstract:
Access to healthcare and health information is of major global
concern. The stark inequality in the availability of health data by
country, demographic groups, and socioeconomic status impedes the
identification of major public health concerns and implementation of
effective interventions. This data gap ranges from basic disease
statistics, such as disease prevalence rates, to more nuanced
information, such as public attitudes. A key challenge is
understanding health information needs of under-served and
marginalized communities. Without understanding people’s everyday
needs, concerns, and misconceptions, health organizations lack the
ability to effectively target education and programming efforts.

In this presentation, we focus on the lack of comprehensive,
high-quality data about information needs of individuals in developing
nations. We propose an approach that uses search data to uncover
health information needs of individuals in all 54 nations in Africa.
We analyze Bing searches related to HIV/AIDS, malaria, and
tuberculosis; these searches reveal diverse health information needs
that vary by demographic groups and geographic regions. We also shed
light on discrepancies in the quality of content returned by search
engines.

We conclude with a discussion on computationally-informed
interventions both on- and off-line in health and related domains and
the Mechanism Design for Social Good research initiative.

Bio:
Rediet Abebe is a computer scientist with a strong interest in the
promotion of equality and justice. Her research focuses on algorithms,
AI, and their applications to social good. As part of this research
agenda, she co-founded and co-organizes Mechanism Design for Social
Good (MD4SG), an interdisciplinary, multi-institutional research
initiative with over 300 individuals. She is also a co-founder and
co-organizer of Black in AI, an international network of over 1000
individuals focused on increasing the presence and inclusion of Black
and African researchers in AI. Her research is deeply influenced by
her upbringing in her hometown of Addis Ababa, Ethiopia, where she
lived until moving to the U.S. in 2009. Her work has been generously
supported by fellowships and scholarships through Facebook, Google,
the Cornell Graduate School, and the Harvard-Cambridge Fellowship.

[Theory Seminar] Grigory Yaroslavtsev

Mar 6 @ 12:00 pm – 1:00 pm

Speaker: Grigory Yaroslavtsev
Affiliation: Indiana University, Bloomington

Title: Advances in Hierarchical Clustering for Vector Data
Abstract:
Compared to the highly successful flat clustering (e.g. k-means), despite its important role and applications in data analysis, hierarchical clustering has been lacking in rigorous algorithmic studies until late due to absence of rigorous objectives. Since 2016, a sequence of works has emerged and gave novel algorithms for this problem in the general metric setting. This was enabled by a breakthrough by Dasgupta, who introduced a formal objective into the study of hierarchical clustering.

In this talk I will give an overview of our recent progress on models and scalable algorithms for hierarchical clustering applicable specifically to high-dimensional vector data. I will first discuss various linkage-based algorithms (single-linkage, average-linkage) and their formal properties with respect to various objectives. I will then introduce a new projection-based approximation algorithm for vector data. The talk will be self-contained and doesn’t assume prior knowledge of clustering methods.

Based on joint works with Vadapalli (ICML’18) and Charikar, Chatziafratis and Niazadeh (AISTATS’19)

[Theory Seminar] Arka Rai Choudhury

Mar 27 @ 12:00 pm – 1:00 pm

Speaker: Arka Rai Choudhury
Affiliation: Johns Hopkins University

Title: Finding a Nash Equilibrium is No Easier than Breaking Fiat-Shamir

Abstract:
The Fiat-Shamir heuristic transforms a public-coin interactive proof into a non-interactive argument, by replacing the verifier with a cryptographic hash function that is applied to the protocol’s transcript. Constructing hash functions for which this transformation is sound is a central and long-standing open question in cryptography.

We show that solving the END-OF-METERED-LINE problem is no easier than breaking the soundness of the Fiat-Shamir transformation when applied to the sumcheck protocol. In particular, if the transformed protocol is sound, then any hard problem in #P gives rise to a hard distribution in the class CLS, which is contained in PPAD. Our result opens up the possibility of sampling moderately-sized games for which it is hard to find a Nash equilibrium, by reducing the inversion of appropriately chosen one-way functions to #SAT.

Our main technical contribution is a stateful incrementally verifiable procedure that, given a SAT instance over n variables, counts the number of satisfying assignments. This is accomplished via an exponential sequence of small steps, each computable in time poly(n). Incremental verifiability means that each intermediate state includes a sumcheck-based proof of its correctness, and the proof can be updated and verified in time poly(n).

Joint work with Pavel Hubacek, Chethan Kamath, Krzysztof Pietrzak, Alon Rosen, and Guy Rothblum.

Apr

[Theory Seminar] Amitabh Basu

Apr 10 @ 12:00 pm – 1:00 pm

Speaker: Amitabh Basu
Affiliation: JHU

Title: Admissibility of solution estimators for stochastic optimization

Abstract:
We look at stochastic optimization problems through the lens of statistical decision theory. In particular, we address admissibility, in the statistical decision theory sense, of the natural sample average estimator for a stochastic optimization problem (which is also known as the empirical risk minimization (ERM) rule in learning literature). It is well known that for general stochastic optimization problems, the sample average estimator may not be admissible. This is known as Stein’s paradox in the statistics literature. We show in this paper that for optimizing stochastic linear functions over compact sets, the sample average estimator is admissible.

Apr

[Theory Seminar] Daniel Reichman

Apr 17 @ 12:00 pm – 1:00 pm

Speaker: Daniel Reichman
Affiliation: Princeton University

Title: Contagious sets in bootstrap percolation

Abstract:
Consider the following activation process in undirected graphs: a vertex is active either if it belongs to a set of initially activated vertices or if at some point it has at least r active neighbors. This process (commonly referred to as bootstrap percolation) has been studied in several fields including combinatorics, computer science, statistical physics and viral marketing. A contagious set is a set whose activation results with the entire graph being active. Given a graph G, let m(G,r) be the minimal size of a contagious set.

I will survey upper and lower bounds for m(G,r) in general graphs and for graphs with special properties (random and pseudo-random graphs, graphs without short cycles) and discuss many unresolved questions.

Based on joint work with Amin Coja-Oghlan, Uriel Feige and Michael Krivelevich.

May

[Theory Seminar] Xuan Wu

May 1 @ 12:00 pm – 1:00 pm

Speaker: Xuan Wu
Affiliation: Johns Hopkins

Title: Coreset for Ordered Weighted Clustering

Abstract:

Ordered k-Median is a generalization of classical clustering problems such as k-Median and k-Center, that offers a more flexible data analysis, like easily combining multiple objectives (e.g., to increase fairness or for Pareto optimization). Its objective function is defined via the Ordered Weighted Averaging (OWA) paradigm of Yager (1988), where data points are weighted according to a predefined weight vector, but in order of their contribution to the objective (distance from the centers).

Coreset is a powerful data-reduction technique which summarizes the data set into a small (weighted) point set while approximately preserving the objective value of the data set to all centers. When there are multiple objectives (weights), the above standard coreset might have limited usefulness, whereas in a \emph{simultaneous} coreset, which was introduced recently by Bachem and Lucic and Lattanzi (2018), the above approximation holds for all weights (in addition to all centers). Our main result is the first construction of simultaneous coreset for the Ordered k-Median problem of small size.

In this talk, I will introduce the basics of coreset construction for the clustering problem and the main ideas of our new results. Finally, we discuss some remaining open problems.

This talk is based on joint work with Vladimir Braverman, Shaofeng Jiang, and Robert Krauthgamer.

Oct

Fri

[Theory Seminar] Christopher Musco

Oct 11 @ 12:30 pm – 1:30 pm

Speaker: Christopher Musco
Affiliation: NYU

Title: Structured Covariance Estimation

Abstract:
Given access to samples from a distribution D over d-dimensional vectors, how many samples are necessary to learn the distribution’s covariance matrix, T? Moreover, how can we leverage a priori knowledge about T’s structure to reduce this sample complexity?

I will discuss this fundamental statistical problem in the setting where T is known to have Toeplitz structure. Toeplitz covariance matrices arise in countless signal processing applications, from wireless communications, to medical imaging, to time series analysis. In many of these applications, we are interested in learning algorithms that only view a subset of entries in each d-dimensional vector sample from D. We care about minimizing two notions of sample complexity 1) the total number of vector samples taken and 2) the number of entries accessed in each vector sample. The later goal typically equates to minimizing equipment or hardware requirements.

I will present several new non-asymptotic bounds on these sample complexity measures. We will start by taking a fresh look at classical and widely used algorithms, including methods based on selecting entries from each sample according to a “sparse ruler”. Then, I will introduce a novel sampling and estimation strategy that improves on existing methods in many settings. Our new approach for learning Toeplitz structured covariance utilizes tools from random matrix sketching, leverage score sampling for continuous signals, and sparse Fourier transform algorithms. It fits into a broader line of work which seeks to address fundamental problems in signal processing using tools from theoretical computer science and randomized numerical linear algebra.

Bio:
Christopher Musco is an Assistant Professor in the Computer Science and Engineering department at NYU’s Tandon School of Engineering. His research focuses on the algorithmic foundations of data science and machine learning. Christopher received his Ph.D. in Computer Science from the Massachusetts Institute of Technology and B.S. degrees in Applied Mathematics and Computer Science from Yale University.

Oct

[Theory Seminar] Guy Kortsarz

Oct 16 @ 12:00 pm – 1:00 pm

Speaker: Guy Kortsarz
Affiliation: Rutgers Universty – Camden

Title: A survey on the Directed Steiner tree problem
Abstract:
The directed Steiner problem is one of the most important problems in optimization, and in particular is more general than Group Steiner and other problems.

I will discuss the (by now classic) 1/\epsilon^3 n^epsilon approximation for the problem by Charikar et al (the algorithm was invented by Kortsarz and Peleg and is called recursive greedy. A technique who people in approximation should know). The running time is more than n^{1/epsilon}. One of the most important open questions in Approximation Algorithms is if there is a polynomial time polylog ratio for this problem. This is open from 1997.

I will discuss the Group Steiner problem ( a special case of the Directed Steiner problem) and the Directed Steiner Forest (a generalization of the Directed Steiner problem) and many more related problems.

Nov

[Theory Seminar] Jiapeng Zhang

Nov 6 @ 12:00 pm – 1:00 pm

Speaker: Jiapeng Zhang
Affiliation: Harvard University

Title:An improved sunflower bound

Abstract:

A sunflower with $r$ petals is a collection of $r$ sets so that the intersection of each pair is equal to the intersection of all. Erdos and Rado in 1960 proved the sunflower lemma: for any fixed $r$, any family of sets of size $w$, with at least about $w^w$ sets, must contain a sunflower. The famous sunflower conjecture is that the bound on the number of sets can be improved to $c^w$ for some constant $c$. Despite much research, the best bounds until recently were all of the order of $w^{cw}$ for some constant c. In this work, we improve the bounds to about $(log w)^w$.

Joint work with Ryan Alweiss, Shachar Lovett and Kewen Wu.

Nov

Fri

[Theory Seminar] Robert Krauthgamer

Nov 8 @ 12:00 pm – 1:00 pm

Speaker: Robert Krauthgamer
Affiliation: Weizmann Institute of Science

Title: On Solving Linear Systems in Sublinear Time

Abstract:
I will discuss sublinear algorithms that solve linear systems locally. In
the classical version of this problem, the input is a matrix S and a vector
b in the range of S, and the goal is to output a vector x satisfying Sx=b.

We focus on computing (approximating) one coordinate of x, which potentially
allows for sublinear algorithms. Our results show that there is a
qualitative gap between symmetric diagonally dominant (SDD) and the more
general class of positive semidefinite (PSD) matrices. For SDD matrices, we
develop an algorithm that runs in polylogarithmic time, provided that S is
sparse and has a small condition number (e.g., Laplacian of an expander
graph). In contrast, for certain PSD matrices with analgous assumptions, the
running time must be at least polynomial.

Joint work with Alexandr Andoni and Yosef Pogrow.

FOCS Workshops

Nov 9 all-day

FOCS

Nov 10 – Nov 12 all-day

Dec

[Theory Seminar] Yasamin Nazari

Dec 4 @ 12:00 pm – 1:00 pm

Speaker: Yasamin Nazari
Affiliation: Johns Hopkins University

Title: Sparse Hopsets in Congested Clique

Abstract:

We give the first Congested Clique algorithm that computes a sparse hopset with polylogarithmic hopbound in polylogarithmic time. Given a graph G=(V,E), a (β,ϵ)-hopset H with “hopbound” β, is a set of edges added to G such that for any pair of nodes u and v in G there is a path with at most β hops in G∪H with length within (1+ϵ) of the shortest path between u and v in G.

Our hopsets are significantly sparser than the recent construction of Censor-Hillel et al. [6], that constructs a hopset of size Õ (n^{3/2}), but with a smaller polylogarithmic hopbound. On the other hand, the previously known constructions of sparse hopsets with polylogarithmic hopbound in the Congested Clique model, proposed by Elkin and Neiman [10],[11],[12], all require polynomial rounds.

One tool that we use is an efficient algorithm that constructs an ℓ-limited neighborhood cover, that may be of independent interest.

Finally, as a side result, we also give a hopset construction in a variant of the low-memory Massively Parallel Computation model, with improved running time over existing algorithms.

Dec

[Theory Seminar] Richard Shea

Dec 11 @ 12:00 pm – 1:00 pm

Speaker: Richard Shea

Affiliation: Applied and Computational Math program, Johns Hopkins University

Title: Progress towards building a Dynamic Hawkes Graph

Abstract:

This talk will introduce the Dynamic Hawkes Graph. It builds from developments in multivariate Hawkes Processes, locally stable Hawkes distributions, and representations of the Hawkes process as an Integer Valued autoregressive (INAR) fit. The background to enable understanding of the Dynamic Hawkes Graph will be the bulk of the talk. Richard is presenting this as part of his Master’s thesis.

[Theory Seminar] Aditya Krishnan

Mar 11 @ 12:00 pm – 1:00 pm

Speaker: Aditya Krishnan
Affiliation: Johns Hopkins University

Title: Schatten Norms in Matrix Streams: The Role of Sparsity

Abstract:
Spectral functions of large matrices contain important structural information about the underlying data, and are thus becoming increasingly important to efficiently compute. Many times, large matrices representing real-world data are sparse or doubly sparse (i.e., sparse in both rows and columns), and are accessed as a stream of updates, typically organized in row-order. In this setting, where space (memory) is the limiting resource, all known algorithms require space that is polynomial in the dimension of the matrix, even for sparse matrices. We address this challenge by providing the first algorithms whose space requirement is independent of the matrix dimension, assuming the matrix is doubly-sparse and presented in row-order.

In addition, we prove that multiple passes are unavoidable in this setting and show extensions of our primary technique, including a trade-off between space requirements and number of passes. Our algorithms approximate the Schatten p-norms, which we use in turn to approximate other spectral functions, such as logarithm of the determinant, trace of matrix inverse, and Estrada index.

Joint work with Vladimir Braverman, Robert Krauthgamer and Roi Sinoff.

[Theory Seminar] Arnold Filtser

Mar 25 @ 12:00 pm – 1:00 pm

Speaker: Arnold Filtser
Affiliation: Columbia University

Title: TBD
Abstract: TBD

Sep

[Theory Seminar] Welcome and Introductions

Sep 2 @ 12:00 pm – 1:00 pm

Sep

[Theory Seminar] Jasper Lee

Sep 16 @ 12:00 pm – 1:00 pm

Speaker: Jasper Lee
Affiliation: Brown University

Title: Real-Valued Sub-Gaussian Mean Estimation, Optimal to a (1+o(1)) Multiplicative Factor

Abstract:
We revisit one of the most fundamental problems in statistics: given access to independent samples from a 1D random variable (with finite but unknown mean and variance), what is the best way to estimate the mean, in terms of error convergence with respect to sample size? The conventional wisdom is to use the sample mean as our estimate. However it is known that the sample mean has optimal convergence if and only if the underlying distribution is (sub-)Gaussian. Moreover, it can even be exponentially slower than optimal for certain heavy-tailed distributions. On the other hand, the median-of-means estimator (invented and reinvented in various literature) does have sub-Gaussian convergence for all finite-variance distributions, albeit in the big-O sense with a sub-optimal multiplicative constant. The natural remaining question then, is whether it is possible to bridge the gap, to have an estimator that has optimal sub-Gaussian concentration with an optimal constant, for all finite-variance distributions.

In this talk, we answer the question affirmatively by giving an estimator that converges with the optimal constant inside the big-O, up to a (1+o(1)) multiplicative factor. Our estimator is furthermore computable in time linear in the sample size. The convergence analysis involves deriving tail bounds using tools from linear and convex programming, which may be of independent interest.

Joint work with Paul Valiant.

Sep

[Theory Seminar] Edinah Gnang