Advanced analytics with Spark : patterns for learning from data at scale
Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills · O'Reilly Media, Incorporated, 2nd Edition, Jul 06, 2017
English [en] · PDF · 5.5MB · 2017 · 📘 Book (non-fiction) · 🚀/lgli/lgrs/nexusstc/upload/zlib
description
By Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills. Previous edition published 2015. Includes index.
Alternative filename
upload/bibliotik/A/Advanced Analytics with Spark - Sandy Ryza, Uri Laserson, Sean .pdf
Alternative filename
nexusstc/Advanced Analytics with Spark: Patterns for Learning from Data at Scale/f8f3f8f67afce35d2a65f0ab15b04b0c.pdf
Alternative filename
lgli/Advanced Analytics with Spark 2nd ed.pdf
Alternative filename
lgrsnf/Advanced Analytics with Spark 2nd ed.pdf
Alternative title
Advanced analytics with Spark : patterns from learning from data at scale
Alternative author
Sanford Ryza, Uri Laserson, Sean Owen, Joshua Wills
Alternative author
Ryza, Sandy, Laserson, Uri, Owen, Sean, Wills, Josh
Alternative edition
United States, United States of America
Alternative edition
Second edition, Sebastopol, CA, 2017
Alternative edition
Second edition, Beijing, 2017
metadata comments
0
metadata comments
lg2120399
metadata comments
producers:
Antenna House PDF Output Library 6.2.609 (Linux64)
metadata comments
{"edition":"2","last_page":275,"publisher":"O’Reilly Media"}
Alternative description
Copyright
Table of Contents
Foreword
Preface
What’s in This Book
The Second Edition
Using Code Examples
O’Reilly Safari
How to Contact Us
Acknowledgments
Chapter 1. Analyzing Big Data
The Challenges of Data Science
Introducing Apache Spark
About This Book
The Second Edition
Chapter 2. Introduction to Data Analysis with Scala and Spark
Scala for Data Scientists
The Spark Programming Model
Record Linkage
Getting Started: The Spark Shell and SparkContext
Bringing Data from the Cluster to the Client
Shipping Code from the Client to the Cluster
From RDDs to Data Frames
Analyzing Data with the DataFrame API
Fast Summary Statistics for DataFrames
Pivoting and Reshaping DataFrames
Joining DataFrames and Selecting Features
Preparing Models for Production Environments
Model Evaluation
Where to Go from Here
Chapter 3. Recommending Music and the Audioscrobbler Data Set
Data Set
The Alternating Least Squares Recommender Algorithm
Preparing the Data
Building a First Model
Spot Checking Recommendations
Evaluating Recommendation Quality
Computing AUC
Hyperparameter Selection
Making Recommendations
Where to Go from Here
Chapter 4. Predicting Forest Cover with Decision Trees
Fast Forward to Regression
Vectors and Features
Training Examples
Decision Trees and Forests
Covtype Data Set
Preparing the Data
A First Decision Tree
Decision Tree Hyperparameters
Tuning Decision Trees
Categorical Features Revisited
Random Decision Forests
Making Predictions
Where to Go from Here
Chapter 5. Anomaly Detection in Network Traffic with K-means Clustering
Anomaly Detection
K-means Clustering
Network Intrusion
KDD Cup 1999 Data Set
A First Take on Clustering
Choosing k
Visualization with SparkR
Feature Normalization
Categorical Variables
Using Labels with Entropy
Clustering in Action
Where to Go from Here
Chapter 6. Understanding Wikipedia with Latent Semantic Analysis
The Document-Term Matrix
Getting the Data
Parsing and Preparing the Data
Lemmatization
Computing the TF-IDFs
Singular Value Decomposition
Finding Important Concepts
Querying and Scoring with a Low-Dimensional Representation
Term-Term Relevance
Document-Document Relevance
Document-Term Relevance
Multiple-Term Queries
Where to Go from Here
Chapter 7. Analyzing Co-Occurrence Networks with GraphX
The MEDLINE Citation Index: A Network Analysis
Getting the Data
Parsing XML Documents with Scala’s XML Library
Analyzing the MeSH Major Topics and Their Co-Occurrences
Constructing a Co-Occurrence Network with GraphX
Understanding the Structure of Networks
Connected Components
Degree Distribution
Filtering Out Noisy Edges
Processing EdgeTriplets
Analyzing the Filtered Graph
Small-World Networks
Cliques and Clustering Coefficients
Computing Average Path Length with Pregel
Where to Go from Here
Chapter 8. Geospatial and Temporal Data Analysis on New York City Taxi Trip Data
Getting the Data
Working with Third-Party Libraries in Spark
Geospatial Data with the Esri Geometry API and Spray
Exploring the Esri Geometry API
Intro to GeoJSON
Preparing the New York City Taxi Trip Data
Handling Invalid Records at Scale
Geospatial Analysis
Sessionization in Spark
Building Sessions: Secondary Sorts in Spark
Where to Go from Here
Chapter 9. Estimating Financial Risk Through Monte Carlo Simulation
Terminology
Methods for Calculating VaR
Variance-Covariance
Historical Simulation
Monte Carlo Simulation
Our Model
Getting the Data
Preprocessing
Determining the Factor Weights
Sampling
The Multivariate Normal Distribution
Running the Trials
Visualizing the Distribution of Returns
Evaluating Our Results
Where to Go from Here
Chapter 10. Analyzing Genomics Data and the BDG Project
Decoupling Storage from Modeling
Ingesting Genomics Data with the ADAM CLI
Parquet Format and Columnar Storage
Predicting Transcription Factor Binding Sites from ENCODE Data
Querying Genotypes from the 1000 Genomes Project
Where to Go from Here
Chapter 11. Analyzing Neuroimaging Data with PySpark and Thunder
Overview of PySpark
PySpark Internals
Overview and Installation of the Thunder Library
Loading Data with Thunder
Thunder Core Data Types
Categorizing Neuron Types with Thunder
Where to Go from Here
Index
About the Authors
Colophon
Alternative description
The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by presenting examples and a set of self-contained patterns for performing large-scale data analysis with Spark. You'll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques (classification, collaborative filtering, and anomaly detection, among others) to fields such as genomics, security, and finance. If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you'll find these patterns useful for working on your own data applications.
date open sourced
2017-09-29