Machine Learning with Spark - Second Edition.

Bibliographic Details
Online Access: Full text (MCPHS users only)
Main Author: Dua, Rajdeep
Other Authors: Ghotra, Manpreet Singh, Pentreath, Nick
Format: Electronic eBook
Language: English
Published: Birmingham : Packt Publishing, 2016
Edition: 2nd ed.
Local Note: ProQuest Ebook Central
Table of Contents:
  • Cover
  • Credits
  • About the Authors
  • About the Reviewer
  • www.PacktPub.com
  • Customer Feedback
  • Table of Contents
  • Preface
  • Chapter 1: Getting Up and Running with Spark
  • Installing and setting up Spark locally
  • Spark clusters
  • The Spark programming model
  • SparkContext and SparkConf
  • SparkSession
  • The Spark shell
  • Resilient Distributed Datasets
  • Creating RDDs
  • Spark operations
  • Caching RDDs
  • Broadcast variables and accumulators
  • SchemaRDD
  • Spark data frame
  • The first step to a Spark program in Scala
  • The first step to a Spark program in Java
  • The first step to a Spark program in Python
  • The first step to a Spark program in R
  • SparkR DataFrames
  • Getting Spark running on Amazon EC2
  • Launching an EC2 Spark cluster
  • Configuring and running Spark on Amazon Elastic MapReduce
  • UI in Spark
  • Supported machine learning algorithms by Spark
  • Benefits of using Spark ML as compared to existing libraries
  • Spark Cluster on Google Compute Engine
  • Dataproc
  • Hadoop and Spark Versions
  • Creating a Cluster
  • Submitting a Job
  • Summary
  • Chapter 2: Math for Machine Learning
  • Linear algebra
  • Setting up the Scala environment in IntelliJ
  • Setting up the Scala environment on the Command Line
  • Fields
  • Real numbers
  • Complex numbers
  • Vectors
  • Vector spaces
  • Vector types
  • Vectors in Breeze
  • Vectors in Spark
  • Vector operations
  • Hyperplanes
  • Vectors in machine learning
  • Matrix
  • Types of matrices
  • Matrix in Spark
  • Distributed matrix in Spark
  • Matrix operations
  • Determinant
  • Eigenvalues and eigenvectors
  • Singular value decomposition
  • Matrices in machine learning
  • Functions
  • Function types
  • Functional composition
  • Hypothesis
  • Gradient descent
  • Prior, likelihood, and posterior
  • Calculus
  • Differential calculus
  • Integral calculus
  • Lagrange multipliers
  • Plotting
  • Summary
  • Chapter 3: Designing a Machine Learning System
  • What is Machine Learning?
  • Introducing MovieStream
  • Business use cases for a machine learning system
  • Personalization
  • Targeted marketing and customer segmentation
  • Predictive modeling and analytics
  • Types of machine learning models
  • The components of a data-driven machine learning system
  • Data ingestion and storage
  • Data cleansing and transformation
  • Model training and testing loop
  • Model deployment and integration
  • Model monitoring and feedback
  • Batch versus real time
  • Data Pipeline in Apache Spark
  • An architecture for a machine learning system
  • Spark MLlib
  • Performance improvements in Spark ML over Spark MLlib
  • Comparing algorithms supported by MLlib
  • Classification
  • Clustering
  • Regression
  • MLlib supported methods and developer APIs
  • Spark Integration
  • MLlib vision
  • MLlib versions compared
  • Spark 1.6 to 2.0
  • Summary
  • Chapter 4: Obtaining, Processing, and Preparing Data with Spark
  • Accessing publicly available datasets
  • The MovieLens 100k dataset
  • Exploring and visualizing your data
  • Exploring the user dataset
  • Count by occupation
  • Movie dataset
  • Exploring the rating dataset
  • Rating count bar chart
  • Distribution of number of ratings
  • Processing and transforming your data
  • Filling in bad or missing data
  • Extracting useful features from your data
  • Numerical features
  • Categorical features
  • Derived features
  • Transforming timestamps into categorical features
  • Extract time of day
  • Text features
  • Simple text feature extraction
  • Sparse Vectors from Titles
  • Normalizing features
  • Using ML for feature normalization
  • Using packages for feature extraction
  • TF-IDF
  • IDF
  • Word2Vector
  • Skip-gram model
  • Standard scaler
  • Summary
  • Chapter 5: Building a Recommendation Engine with Spark
  • Types of recommendation models
  • Content-based filtering
  • Collaborative filtering
  • Matrix factorization
  • Explicit matrix factorization
  • Implicit Matrix Factorization
  • Basic model for Matrix Factorization
  • Alternating least squares
  • Extracting the right features from your data
  • Extracting features from the MovieLens 100k dataset
  • Training the recommendation model
  • Training a model on the MovieLens 100k dataset
  • Training a model using Implicit feedback data
  • Using the recommendation model
  • ALS Model recommendations
  • User recommendations
  • Generating movie recommendations from the MovieLens 100k dataset
  • Inspecting the recommendations
  • Item recommendations
  • Generating similar movies for the MovieLens 100k dataset
  • Inspecting the similar items
  • Evaluating the performance of recommendation models
  • ALS Model Evaluation
  • Mean Squared Error
  • Mean Average Precision at K
  • Using MLlib's built-in evaluation functions
  • RMSE and MSE
  • MAP
  • FP-Growth algorithm
  • FP-Growth Basic Sample
  • FP-Growth Applied to MovieLens Data
  • Summary
  • Chapter 6: Building a Classification Model with Spark
  • Types of classification models
  • Linear models
  • Logistic regression
  • Multinomial logistic regression
  • Visualizing the StumbleUpon dataset
  • Extracting features from the Kaggle/StumbleUpon evergreen classification dataset
  • StumbleUponExecutor
  • Linear support vector machines
  • The naive Bayes model
  • Decision trees
  • Ensembles of trees
  • Random Forests
  • Gradient-Boosted Trees
  • Multilayer perceptron classifier
  • Extracting the right features from your data
  • Training classification models
  • Training a classification model on the Kaggle/StumbleUpon evergreen classification dataset
  • Using classification models
  • Generating predictions for the Kaggle/StumbleUpon evergreen classification dataset
  • Evaluating the performance of classification models
  • Accuracy and prediction error
  • Precision and recall
  • ROC curve and AUC
  • Improving model performance and tuning parameters
  • Feature standardization
  • Additional features
  • Using the correct form of data
  • Tuning model parameters
  • Linear models
  • Iterations
  • Step size
  • Regularization
  • Decision trees
  • Tuning tree depth and impurity
  • The naive Bayes model
  • Cross-validation
  • Summary
  • Chapter 7: Building a Regression Model with Spark
  • Types of regression models
  • Least squares regression
  • Decision trees for regression
  • Evaluating the performance of regression models
  • Mean Squared Error and Root Mean Squared Error
  • Mean Absolute Error
  • Root Mean Squared Log Error
  • The R-squared coefficient
  • Extracting the right features from your data
  • Extracting features from the bike sharing dataset
  • Training and using regression models
  • BikeSharingExecutor
  • Training a regression model on the bike sharing dataset
  • Linear regression
  • Generalized linear regression
  • Decision tree regression
  • Ensembles of trees
  • Random forest regression
  • Gradient boosted tree regression
  • Improving model performance and tuning parameters
  • Transforming the target variable
  • Impact of training on log-transformed targets
  • Tuning model parameters
  • Creating training and testing sets to evaluate parameters
  • Splitting data for Decision tree
  • The impact of parameter settings for linear models
  • Iterations
  • Step size
  • L2 regularization
  • L1 regularization
  • Intercept
  • The impact of parameter settings for the decision tree
  • Tree depth
  • Maximum bins
  • The impact of parameter settings for the Gradient Boosted Trees
  • Iterations
  • MaxBins
  • Summary
  • Chapter 8: Building a Clustering Model with Spark
  • Types of clustering models
  • k-means clustering
  • Initialization methods
  • Mixture models
  • Hierarchical clustering
  • Extracting the right features from your data
  • Extracting features from the MovieLens dataset
  • K-means - training a clustering model
  • Training a clustering model on the MovieLens dataset
  • K-means - interpreting cluster predictions on the MovieLens dataset
  • Interpreting the movie clusters
  • K-means - evaluating the performance of clustering models
  • Internal evaluation metrics
  • External evaluation metrics
  • Computing performance metrics on the MovieLens dataset
  • Effect of iterations on WSSSE
  • Bisecting K-means
  • Bisecting K-means - training a clustering model
  • WSSSE and iterations
  • Gaussian Mixture Model
  • Clustering using GMM
  • Plotting the user and item data with GMM clustering
  • GMM - effect of iterations on cluster boundaries
  • Summary
  • Chapter 9: Dimensionality Reduction with Spark
  • Types of dimensionality reduction
  • Principal components analysis
  • Singular value decomposition
  • Relationship with matrix factorization
  • Clustering as dimensionality reduction
  • Extracting the right features from your data
  • Extracting features from the LFW dataset
  • Exploring the face data
  • Visualizing the face data
  • Extracting facial images as vectors
  • Loading images
  • Converting to grayscale and resizing the images
  • Extracting feature vectors
  • Normalization
  • Training a dimensionality reduction model
  • Running PCA on the LFW dataset
  • Visualizing the Eigenfaces
  • Interpreting the Eigenfaces
  • Using a dimensionality reduction model
  • Projecting data using PCA on the LFW dataset
  • The relationship between PCA and SVD
  • Evaluating dimensionality reduction models