Dr. James G. Shanahan has spent the past 25 years developing and researching cutting-edge artificial intelligence systems. He has (co)founded several companies, including Church and Duncan Group Inc. (2007), a boutique consultancy in large-scale AI which he runs in San Francisco; RTBFast (2012), a real-time bidding engine infrastructure play for digital advertising systems; and Document Souls (1999), a document-centric anticipatory information system. In 2012 he went in-house as the SVP of Data Science and Chief Scientist at NativeX, a mobile ad network that was acquired by MobVista in early 2016. In addition, he has held appointments at AT&T (Executive Director of Research), Turn Inc. (founding chief scientist), Xerox Research, Mitsubishi Research, and Clairvoyance Corp (a spinoff research lab from CMU). Dr. Shanahan has been affiliated with the University of California at Berkeley (and Santa Cruz) since 2008, where he teaches graduate courses on big data analytics, large-scale machine learning, and stochastic optimization. He also advises several high-tech startups (including Quixey, Aylien, VoxEdu, and others) and is executive VP of science and technology at Irish Innovation Center (IIC). He has published six books, more than 50 research publications, and over 20 patents in the areas of machine learning and information processing. Dr. Shanahan received his PhD in engineering mathematics from the University of Bristol, U.K., and holds a Bachelor of Science degree from the University of Limerick, Ireland. He is an EU Marie Curie fellow. In 2011 he was selected as a member of the Silicon Valley 50 (Top 50 Irish Americans in Technology).
Liang Dai is a Ph.D. candidate in the Technology Information and Management department at UC Santa Cruz, where he does research on data mining for digital marketing, including campaign evaluation, online experiment design, and customer value improvement. Liang received his B.S. and M.S. from the Information Science and Electronic Engineering department at Zhejiang University, China. He also works as an applied research scientist at Facebook, focusing on data modeling for ads products. He has hands-on experience with end-to-end, large-scale data mining projects on distributed platforms such as AWS, Hadoop, and Spark.
In the continuing big data revolution, Apache Spark’s open-source cluster computing framework has overtaken Hadoop MapReduce as the big data processing engine of choice. Spark maintains MapReduce’s linear scalability and fault tolerance, but offers two key advantages: Spark is much faster – as much as 100x faster for certain applications; and Spark is much easier to program, due to its inclusion of APIs for Python, Java, Scala, SQL and R, plus its user-friendly core data abstraction, the distributed data frame. In addition, Spark goes far beyond traditional batch applications to support a variety of compute-intensive tasks, including interactive queries, streaming data, machine learning, and graph processing.
This tutorial offers you an accessible introduction to large-scale distributed machine learning and data mining, and to Spark and its potential to revolutionize academic and commercial data science practices. The tutorial includes discussions of algorithm design, presentation of illustrative algorithms, relevant case studies, and practical advice and experience in writing Spark programs and running Spark clusters. Part 1 familiarizes you with fundamental Spark concepts, including Spark Core, functional programming à la MapReduce, RDDs/data frames/datasets, the Spark Shell, Spark Streaming and online learning, Spark SQL, MLlib, and more. Part 2 gives you hands-on algorithmic design and development experience with Spark, including building algorithms from scratch: decision tree learning, association rule mining (Apriori), graph processing algorithms such as PageRank and shortest path, gradient descent algorithms such as support vector machines and matrix factorization, distributed parameter estimation, and deep learning. Your homegrown implementations will shed light on the internals of Spark’s MLlib libraries and on typical challenges in parallelizing machine learning algorithms. You will also see examples of industrial applications and deployments of Spark.
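To give a flavor of the functional, MapReduce-style programming covered in Part 1, here is a minimal word-count sketch. It uses only the Python standard library and does not call Spark. The helpers `flat_map` and `reduce_by_key` are hypothetical stand-ins written for this illustration; they only loosely mirror the shape of the corresponding RDD operations and run on a single machine.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain


def flat_map(func, records):
    """Apply func to every record and flatten the resulting lists.

    Hypothetical single-machine stand-in for a flatMap-style operation.
    """
    return list(chain.from_iterable(func(r) for r in records))


def reduce_by_key(func, pairs):
    """Group (key, value) pairs by key, then fold each group with func.

    Hypothetical single-machine stand-in for a reduceByKey-style operation.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}


def word_count(lines):
    # "Map" phase: split each line into lowercase words.
    words = flat_map(lambda line: line.lower().split(), lines)
    # Emit a (word, 1) pair for every occurrence.
    pairs = [(word, 1) for word in words]
    # "Reduce" phase: sum the counts for each distinct word.
    return reduce_by_key(lambda a, b: a + b, pairs)


counts = word_count(["spark is fast", "spark is easy to program"])
```

Because the per-word sums are associative and commutative, the same computation could in principle be split across many machines and combined in any order, which is the property that MapReduce-style engines rely on.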
Keywords. Distributed systems, HDFS, Spark, Hadoop, large-scale distributed machine learning, online learning, deep learning, Spark Streaming, mobile advertising
Materials. You will receive electronic handouts and a web-based Jupyter (IPython) notebook with example code and data. You will deploy Spark on your own multicore laptop and use it to run and develop the examples. So that you can rapidly provision remote Spark clusters on the fly, we plan to work with Amazon Web Services to provide you with free access to Amazon's Elastic Compute Cloud (EC2) during the tutorial.
This tutorial is intended for industry practitioners and researchers who wish to learn the best practices for large-scale data science using next-generation tools. No prior familiarity with Spark, distributed systems, distributed algorithm design, or large-scale machine learning is required.
Gain an integrated view of the data processing pipeline as the tutorial highlights its components, including exploratory data analysis, feature extraction, supervised learning, and model evaluation.
Be introduced to the underlying statistical and algorithmic principles required to develop scalable machine learning pipelines, and gain hands-on experience in applying these principles using Apache Spark, a cluster computing system well suited for large-scale machine learning tasks.
Learn how to implement scalable algorithms for fundamental statistical models (linear regression, logistic regression, matrix factorization, principal component analysis) while tackling key data science problems from various domains: mobile advertising, personalized recommendation, and consumer segmentation.
Learn about data-intensive industrial applications and deployments of Spark in fields such as mobile advertising. You will leave with an understanding of scalability challenges and the tradeoffs associated with distributed processing of large datasets.
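As an illustration of what a scalable implementation of a statistical model such as linear regression can look like, the sketch below fits a one-feature linear model by gradient descent, with the gradient computed as a sum of per-partition partial gradients. This is a single-machine simulation in plain Python under stated assumptions, not Spark code and not the tutorial's own implementation: the function names, the toy data, the learning rate, and the iteration count are all chosen for this example only.

```python
def partition(data, num_partitions):
    """Split data into num_partitions roughly equal slices (round-robin)."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def partial_gradient(part, w, b):
    """Squared-error gradient contribution of one partition.

    Returns (sum of err * x, sum of err, number of points), so that
    partial results from different partitions can simply be added.
    """
    grad_w = grad_b = 0.0
    for x, y in part:
        err = (w * x + b) - y
        grad_w += err * x
        grad_b += err
    return grad_w, grad_b, len(part)


def fit_linear_regression(data, num_partitions=4, lr=0.05, iterations=2000):
    """Fit y = w * x + b by batch gradient descent over partitioned data."""
    parts = partition(data, num_partitions)
    w = b = 0.0
    for _ in range(iterations):
        # "Map": each partition computes its gradient independently.
        partials = [partial_gradient(p, w, b) for p in parts]
        # "Reduce": add up the partial gradients and point counts.
        grad_w = sum(p[0] for p in partials)
        grad_b = sum(p[1] for p in partials)
        n = sum(p[2] for p in partials)
        # Update the shared parameters with the averaged gradient.
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Toy, noise-free data generated from y = 2x + 1 (assumed for illustration).
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
w, b = fit_linear_regression(data)
```

The design point is that only the small tuple of partial sums needs to travel between the partitions and the place where parameters are updated, while the data itself stays put. The cost of repeating that exchange on every iteration is one example of the tradeoffs that arise in distributed processing of large datasets.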