* This tutorial is not included in Tutorial Day Registration (10 Nov). To attend this tutorial, please register for the Main Conference.

Tutorial on Massively Scalable Production Grade Deep Learning with the Microsoft Cognitive Toolkit *

Full-day hands-on tutorial at CIKM 2017, Tuesday 7 November 2017

Sayan Pathak

Microsoft Research, University of Washington

Sayan Pathak, PhD, is a Principal Engineer and Machine Learning Scientist on the Cognitive Toolkit (CNTK) team at Microsoft. He has published and commercialized cutting-edge computer vision and machine learning technology for big data problems in the medical imaging, neuroscience, computational advertising, and social network domains. He has developed ML-based technologies that have been FDA 510(k) cleared for clinical use in the US. Prior to joining Microsoft, he worked at the Allen Institute for Brain Science, where he published in top journals such as Nature, Nature Neuroscience, and IEEE journals. He has over 25 peer-reviewed journal papers and has presented at conferences across the globe.

He has also been a faculty member at the University of Washington for 15 years and has actively collaborated with faculty at the Indian Institute of Technology for over 4 years. He has been a consultant to several startups and principal investigator on several US National Institutes of Health (NIH) grants. He received his BS from the Indian Institute of Technology, Kharagpur, India, in 1994. He earned his MS and PhD in Bioengineering (with computer vision specialization) in 1996 and 2000, respectively.

Frank Seide

Microsoft Research

Frank Seide, a native of Hamburg, Germany, is a Principal Researcher at Microsoft Research, and an architect of Microsoft's Cognitive Toolkit for deep learning. His current research focus is on deep neural networks for conversational speech recognition. Together with co-author Dong Yu, he was the first to show the effectiveness of deep neural networks for recognition of conversational speech, and he was part of the effort to reach human parity on this task in 2016. Throughout his career, he has been interested in and worked on a broad range of topics and components of automatic speech recognition, including spoken-dialogue systems, recognition of Mandarin Chinese, and, particularly, large-vocabulary recognition of conversational speech with application to audio indexing, transcription, speech-to-speech translation, and distributed neural-network model training.

In 1993, Frank received a Master's degree in electrical engineering from the University of Technology of Hamburg-Harburg, Germany, and joined the speech research group of Philips Research in Aachen, Germany, to work on spoken-dialogue systems. He then transferred to Taiwan as one of the founding members of Philips Research East-Asia, Taipei, to lead a research project on Mandarin speech recognition. In June 2001, he joined the speech group at Microsoft Research Asia, Beijing, initially as a Researcher, since 2003 as Project Leader for offline speech applications, and since October 2006 as Research Manager. In 2014, Frank joined the Speech & Dialogue group at MSR Redmond as a Principal Researcher.

What is this tutorial about?

Train neural networks like the Microsoft product groups do! This hands-on tutorial introduces the Microsoft Cognitive Toolkit (formerly known as CNTK), a scalable open-source deep-learning toolkit for Windows and Linux that has been used for Skype, Cortana, Bing, Xbox, and much more. For example, Microsoft product groups achieved a major breakthrough when they used CNTK to learn to recognize conversational speech as well as a human can. In general, CNTK trains models as fast as or faster than its competitors and scales up well on modern GPUs (see the performance comparison from HKBU and recent demos from NVIDIA at SC’16 and Cray at NIPS’16). CNTK also supports popular network types (feed-forward, convolutional, and recurrent networks) and offers APIs for languages including Python and C++.

This tutorial is split into three parts. In Part 1, we introduce how the toolkit has been used in different domains within and outside Microsoft. We also show how customers use it both on-premises and on the Azure cloud. In Part 2, we dive into the structure of deep learning models and elaborate on CNTK’s optimizations that make it scale well across multiple servers and GPUs. Part 3 is a hands-on session where you will use Jupyter notebooks to train several types of deep neural networks, taken from use cases in the vision, text, time series and speech domains. In particular, you will work with residual networks using fully connected CNN, which achieved record-breaking effectiveness in recognizing objects in the ImageNet challenge; RNN/LSTM for sequence-to-sequence and reading comprehension of text documents (ReasoNet); autoencoders for semi-supervised learning; Generative Adversarial Networks (GAN) for unsupervised learning; and reinforcement learning for enabling machines to optimally perform tasks such as playing games. Network bandwidth permitting, we will also explore scalable learning on Azure with end-to-end use cases ranging from simple and introductory to very advanced. Time permitting, we will discuss integration of CNTK with Spark for scaling out.
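To make the "structure of deep learning models" mentioned for Part 2 a little more concrete, here is a minimal sketch of the forward pass of a small feed-forward network. It is written in plain Python and is not CNTK code; the function names, layer sizes, and hand-picked weights are illustrative assumptions, not material from the tutorial.

```python
import math


def dense(inputs, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into probabilities (shifted for numerical stability)."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input -> dense + ReLU hidden layer -> dense + softmax output layer."""
    hidden = relu(dense(inputs, hidden_w, hidden_b))
    return softmax(dense(hidden, output_w, output_b))


if __name__ == "__main__":
    # Hypothetical parameters: 2 inputs, 3 hidden units, 2 output classes.
    hidden_w = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
    hidden_b = [0.0, 0.1, -0.1]
    output_w = [[1.0, -1.0, 0.5], [-0.5, 0.7, 0.2]]
    output_b = [0.0, 0.0]
    print(forward([1.0, 2.0], hidden_w, hidden_b, output_w, output_b))
```

A toolkit adds what this sketch leaves out: automatic differentiation, training loops, and GPU execution.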

Keywords. Deep learning, large-scale distributed machine learning, online learning, text (sequence to sequence, reasoning networks), computer vision (ResNet, GANs, video) and reinforcement learning, cloud based distributed computing, convolutional neural networks, LSTMs, Recurrence, GAN, autoencoders, Azure, Spark

Materials. The tutorial’s code and data will be made available through Jupyter iPython notebooks; CNTK itself can be downloaded from GitHub. The hands-on sessions require a laptop with Windows 7+ or Linux, and a CUDA-capable GPU is recommended.

Who should attend?

Current or future deep-learning practitioners and researchers looking for a tool that is easy to use yet efficient and scalable across multi-machine GPU clusters for real-world workloads. In Part 1, those new to deep learning will gain an understanding of what deep learning can achieve and its applications in the field. Part 2 requires familiarity with basic concepts from programming, linear algebra, and probability. Part 3 requires basic programming skills. If you are not familiar with Python, you will be able to run the code and gain experience. If you are, you will be able to explore more on your own.

What will I get out of this tutorial?

If your background and interest are primarily non-technical, e.g. if you are an executive or business-oriented practitioner, you will learn how deep learning is becoming pervasive in different areas, including how we at Microsoft and our external partners are using it. If you are technically minded, you will learn how to perform classification and regression using supervised, semi-supervised and unsupervised techniques. If you are an advanced user, you will be exposed to different advanced algorithms and learn to code, modify and scale them using state-of-the-art techniques.

You will leave this tutorial with working recipes for the different use cases, along with the experience of running them as jobs on Azure (network bandwidth permitting). Time permitting, you will also learn how to distribute the workload by integrating with Spark.

Detailed Outline

Section: Topics

Intro to deep learning: Application, Performance and scalability

Scalability & performance: Comparison of different toolkits, Performance scalability tricks and high-level introduction to Azure & Spark

Deep dive: Dissect a feed forward network, Fun with recurrence

Hands-on (beginners): Predict cancer using logistic regression, Classify digits with feed forward, Analyze time series data with LSTM, Integrate with Spark

Hands-on (intermediate):
    Text / Language: Sequence classification, Sequence to sequence learning
    Vision: ResNet, Auto encoder

Hands-on (advanced):
    Text: Sequence to sequence with attention, ReasoNet
    Vision: GAN, Video processing
    Reinforcement learning: Atari game or self-driving car

Tutorial on Large Scale Distributed Data Science from Scratch with Apache Spark 2.0 & Deep Learning *

Full-day hands-on tutorial at CIKM 2017, Wednesday 8 November 2017

Dr. James G. Shanahan

CEO and Chief Scientist at Church and Duncan Group and UC Berkeley

Dr. James G. Shanahan has spent the past 25 years developing and researching cutting-edge artificial intelligence systems. He has (co-)founded several companies, including Church and Duncan Group Inc. (2007), a boutique consultancy in large scale AI which he runs in San Francisco; RTBFast (2012), a real-time bidding engine infrastructure play for digital advertising systems; and Document Souls (1999), a document-centric anticipatory information system. In 2012 he went in-house as the SVP of Data Science and Chief Scientist at NativeX, a mobile ad network that was acquired by MobVista in early 2016. In addition, he has held appointments at AT&T (Executive Director of Research), Turn Inc. (founding chief scientist), Xerox Research, Mitsubishi Research, and Clairvoyance Corp (a spinoff research lab from CMU). Dr. Shanahan has been affiliated with the University of California at Berkeley (and Santa Cruz) since 2008, where he teaches graduate courses on big data analytics, large-scale machine learning, and stochastic optimization. He also advises several high-tech startups (including Quixey, Aylien, VoxEdu, and others) and is executive VP of science and technology at the Irish Innovation Center (IIC). He has published six books, more than 50 research publications, and over 20 patents in the areas of machine learning and information processing. Dr. Shanahan received his PhD in engineering mathematics from the University of Bristol, U.K., and holds a Bachelor of Science degree from the University of Limerick, Ireland. He is an EU Marie Curie fellow. In 2011 he was selected as a member of the Silicon Valley 50 (Top 50 Irish Americans in Technology).

Liang Dai

NativeX, University of California Santa Cruz

Liang Dai is a Ph.D. candidate in the Technology Information and Management department at UC Santa Cruz, where he does research in data mining for digital marketing, including campaign evaluation, online experiment design, and customer value improvement. Liang received his B.S. and M.S. from the Information Science and Electronic Engineering department at Zhejiang University, China. He also works as an applied research scientist at Facebook, focusing on data modeling for ads products. He has hands-on experience with end-to-end, large-scale data mining projects on distributed platforms such as AWS, Hadoop, and Spark.

What is this tutorial about?

In the continuing big data revolution, Apache Spark’s open-source cluster computing framework has overtaken Hadoop MapReduce as the big data processing engine of choice. Spark maintains MapReduce’s linear scalability and fault tolerance, but offers two key advantages: Spark is much faster – as much as 100x faster for certain applications; and Spark is much easier to program, due to its inclusion of APIs for Python, Java, Scala, SQL and R, plus its user-friendly core data abstraction, the distributed data frame. In addition, Spark goes far beyond traditional batch applications to support a variety of compute-intensive tasks, including interactive queries, streaming data, machine learning, and graph processing.

This tutorial offers you an accessible introduction to large-scale distributed machine learning and data mining, and to Spark and its potential to revolutionize academic and commercial data science practices. The tutorial includes discussions of algorithm design, presentation of illustrative algorithms, relevant case studies, and practical advice and experience in writing Spark programs and running Spark clusters. Part 1 familiarizes you with fundamental Spark concepts, including Spark Core, functional programming à la MapReduce, RDDs/data frames/datasets, the Spark Shell, Spark Streaming and online learning, Spark SQL, MLlib, and more. Part 2 gives you hands-on algorithmic design and development experience with Spark, including building algorithms from scratch such as decision tree learning, association rule mining (aPriori), graph processing algorithms such as PageRank and shortest path, gradient descent algorithms such as support vector machines and matrix factorization, distributed parameter estimation, and deep learning. Your homegrown implementations will shed light on the internals of Spark’s MLlib libraries and on typical challenges in parallelizing machine learning algorithms. You will also see examples of industrial applications and deployments of Spark.
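As a rough illustration of what "building algorithms from scratch" in a MapReduce style can look like, the sketch below implements PageRank by power iteration using explicit map and reduce-by-key steps. It is plain Python on a single machine, not Spark code; the function name, the toy graph, and the parameter defaults are assumptions made for this example.

```python
from collections import defaultdict


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank written as explicit map and reduce steps.

    `links` maps each page to the list of pages it links to. Every page must
    appear as a key (use an empty list for pages without outgoing links).
    """
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Map step: each page emits (target, share) pairs for its out-links.
        contributions = []
        dangling_mass = 0.0
        for page, targets in links.items():
            if targets:
                share = ranks[page] / len(targets)
                contributions.extend((target, share) for target in targets)
            else:
                # A page without out-links spreads its rank over all pages.
                dangling_mass += ranks[page]

        # Reduce step: sum the shares by key (the target page).
        received = defaultdict(float)
        for target, share in contributions:
            received[target] += share

        # Combine the summed shares with the damping (teleport) term.
        ranks = {
            page: (1.0 - damping) / n
            + damping * (received[page] + dangling_mass / n)
            for page in pages
        }
    return ranks


if __name__ == "__main__":
    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, rank in sorted(pagerank(toy_graph).items()):
        print(page, round(rank, 4))
```

In a distributed setting the map and reduce steps would run in parallel over partitions of the link table; the sketch keeps them sequential so the data flow is easy to follow.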

Keywords. Distributed systems, HDFS, Spark, Hadoop, large-scale distributed machine learning, online learning, deep learning, Spark Streaming, mobile advertising

Materials. You will receive electronic handouts and a web-based iPython Jupyter notebook with example code and data. You will deploy Spark on your own multicore laptop to run and develop examples there. So that you can rapidly provision remote Spark clusters on the fly, we plan to work with Amazon Web Services to provide you with free access to Amazon's Elastic Compute Cloud (EC2) during the tutorial.

Who should attend?

Industry practitioners and researchers who wish to learn the best practices for large scale data science using next generation tools. No prior familiarity with Spark, distributed systems, how to distribute algorithms, or large-scale machine learning is required.

What will I get out of this tutorial?

Gain an integrated view of the data processing pipeline as the tutorial highlights its components, including exploratory data analysis, feature extraction, supervised learning, and model evaluation.

Get introduced to the underlying statistical and algorithmic principles required to develop scalable machine learning pipelines, and gain hands-on experience in applying these principles using Apache Spark, a cluster computing system well suited for large-scale machine learning tasks.

Learn how to implement scalable algorithms for fundamental statistical models (linear regression, logistic regression, matrix factorization, principal component analysis) while tackling key data science problems from various domains: mobile advertising, personalized recommendation, and consumer segmentation.

Learn about data intensive industrial applications and deployments of Spark in fields such as mobile advertising. You will leave with an understanding of scalability challenges and the tradeoffs associated with distributed processing of large datasets.

Detailed Outline

Section: Topics

Spark introduction: History of Spark, Introduction to data analysis with Spark, Downloading Spark and getting started on your laptop

Parallel computing: Divide and conquer, semaphores, barriers, shared nothing architectures

Core Spark: Spark basics, Functional programming, Transformations and actions, MapReduce patterns, Data frames and datasets, RDD (no keys) and Pair RDDs, PySpark, Scala and Spark Shell, Broadcast variables

Spark APIs: Java, Scala, Python, R, SQL

Data analysis and handling with Spark: Tools for exploratory data analysis, Standardization, Reservoir sampling, SparkSQL and joins, Statistics in Spark

Algorithms and programming in Spark: Algorithmic design and development with Spark

Developing algorithms from scratch: Decision tree learning, Naive Bayes

Association rule mining: aPriori algorithm

Graph processing algorithms: PageRank, Shortest path, Friend of friends, TextRank

Unsupervised algorithms: Expectation maximization

Gradient descent algorithms: Support vector machines, Matrix factorization

Deep Learning: Word2Vec/Glove, multilayer perceptrons, recurrent neural networks, convolutional neural networks, BackProp, deep learning with Apache Spark and TensorFlow

Spark at scale: Spark on your laptop versus Spark on an EC2 cluster

Spark libraries: SparkSQL, MLlib, GraphX, Spark Streaming

Online learning: Online learning for classification and for regression via gradient descent

Spark deployments and case studies: Mobile advertising, Recommendation engines, Security

Spark 2.0 and beyond: Desiderata for large scale data processing environments
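
The outline's "online learning for classification via gradient descent" topic can be sketched as follows: a logistic regression model that is updated one example at a time as data streams in. This is a minimal illustration in plain Python rather than Spark or MLlib code; the class name, the synthetic data stream, and the learning rate are assumptions chosen for the example.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


class OnlineLogisticRegression:
    """Binary classifier trained by stochastic gradient descent."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, features):
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return sigmoid(z)

    def update(self, features, label):
        """One gradient step on the log loss of a single labeled example."""
        error = self.predict_proba(features) - label
        for i, x in enumerate(features):
            self.weights[i] -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error


def make_stream(n_examples, seed=0):
    """Hypothetical data stream: the label is 1 when x1 + x2 > 0."""
    rng = random.Random(seed)
    for _ in range(n_examples):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        yield (x1, x2), 1 if x1 + x2 > 0 else 0


if __name__ == "__main__":
    model = OnlineLogisticRegression(n_features=2)
    for features, label in make_stream(5000):
        model.update(features, label)  # the model never sees the full dataset
    print(model.weights, model.bias)
```

Because each update touches only one example, the same loop works whether the data arrives from a file, a socket, or a streaming system.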