Get Hands-On Experience Training and Applying Machine Learning Models on Big Data
Data science is one of the most sought-after skill sets in the tech world today, but data science is only part of the equation. Getting data into the right format and the right location so it can be classified by machine learning algorithms requires a specialized toolbox. Welcome to the world of big data pipelines!
Building Machine Learning Pipelines teaches you the skills required to create and use the infrastructure behind modern intelligent systems. You’ll learn how to collect and enrich data to train and then apply machine learning models using Spark, Kafka, and Elasticsearch – some of the hottest technologies in the big data industry.
By implementing a real-world use case, you’ll build batch and streaming big data pipelines to train and apply machine learning models.
Who This Book Is For
This book is intended for software engineers and data scientists with basic to intermediate skills in the Python programming language who want to build machine learning pipelines and gain hands-on skills that are mission-critical at many major tech companies. Because the book assumes readers can navigate the command line interface and write basic Python, it is not intended for someone simply interested in learning to code.
What You’ll Get
- A case-method approach for learning to train and apply machine learning models on big data using Spark: batch processing of Bitcoin Historical Data for model training and predictive analysis of streaming data from the GDAX digital asset exchange.
- A comprehensive tutorial showing you how to build machine learning pipelines to handle batch and streaming data using a real-world use case.
- Code files to build and run machine learning pipelines locally on your computer using Docker, Python, Kafka and more.
- Everything you need to quickly and easily visualize your enriched data using Elasticsearch.
Want More Detail?! You’ve Got It!
Here’s the preliminary outline for the book so you’ll know just what you’ll be getting:
Chapter One: Introduction
Gain an initial familiarity with machine learning pipelines.
- What’s a machine learning pipeline good for?
- Our goal for the book
- The real-world use case you’ll be implementing: batch processing of Bitcoin Historical Data for model training and predictive analysis of streaming data from the GDAX digital asset exchange
- What we assume you know
- What we don’t assume you know
- System requirements
Chapter Two: Machine Learning Pipelines Overview
Dig into the moving parts of machine learning pipelines.
- Types, uses and examples of machine learning pipelines
- General phases of an ML pipeline
- Examples of real-world pipelines we’ve worked on
Chapter Three: What We’re Building and Setting Up Your Environment
Get details on the two data pipelines we’ll be building, then download, install, and configure everything you need to create them.
- The two pipelines we’ll be building: batch and streaming
- The tech we’re using: Python, Kafka, Apache Spark, Elasticsearch, Kibana
- Everything you’ll need: Git, Docker, Python (Anaconda), and a text editor
Chapter Four: Get Hands-On With the Pieces of the Pipeline
Get a hands-on introduction to each piece of the pipeline. By the end of this chapter you’ll be familiar with industry-standard components used in many pipelines today. To give you a feel for the level of detail, short Kafka, Spark, and Elasticsearch sketches follow the outline below.
- What is Kafka?
- How do you create a topic?
- How do you produce data to a topic?
- How do you consume from a topic?
- What is Spark?
- How do you create an RDD?
- How do you perform a map/reduce operation on an RDD?
- How do you output or store data from an RDD?
- What is Elasticsearch?
- How do you create an index?
- How do you index data?
- How do you search your index?
- What is Kibana?
- How do you access Kibana?
- How do you search your data?
- How do you create a dashboard?
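Here’s a taste of the Kafka piece. This is a minimal sketch using the kafka-python client, not the book’s exact code; the broker address (localhost:9092), the topic name, and the sample message are assumptions for illustration.

```python
# Minimal sketch: produce a JSON message to a Kafka topic, then consume it.
# Broker address, topic name, and message contents are illustrative assumptions.
import json

from kafka import KafkaConsumer, KafkaProducer

# Serialize Python dicts to JSON bytes on the way in.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("transactions", {"price": 6500.0, "size": 0.25})
producer.flush()

# Read the topic from the beginning and deserialize each message back to a dict.
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # {'price': 6500.0, 'size': 0.25}
```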
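And the Spark piece: a minimal sketch of creating an RDD, running a map/reduce over it, and storing the result. It assumes a local PySpark installation, and the sample values and output path are made up for illustration.

```python
# Minimal sketch: create an RDD, run a map/reduce over it, and store the output.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-basics")

# Create an RDD from an in-memory list of (hypothetical) trade sizes.
sizes = sc.parallelize([0.5, 1.25, 0.75, 2.0])

# Map each size to a dollar value at an assumed price, then reduce to a total.
total = sizes.map(lambda size: size * 6500.0).reduce(lambda a, b: a + b)
print(total)

# Store the RDD as text files, one per partition (path is illustrative).
sizes.saveAsTextFile("output/trade_sizes")
```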
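Finally, the Elasticsearch piece: a minimal sketch of indexing a document and searching for it with the official Python client (pre-8.x style calls). The index name, document fields, and query are assumptions for illustration.

```python
# Minimal sketch: index a document into Elasticsearch and search for it.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Index a single (hypothetical) enriched transaction document; refresh so it is
# immediately visible to the search below.
es.index(
    index="transactions",
    body={"price": 6500.0, "size": 0.25, "side": "buy"},
    refresh=True,
)

# Search the index for buy-side transactions.
results = es.search(index="transactions", body={"query": {"match": {"side": "buy"}}})
for hit in results["hits"]["hits"]:
    print(hit["_source"])
```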
Chapter Five: Batch Process Bitcoin Data for Model Training
A comprehensive tutorial on creating a pipeline for batch processing data and training a machine learning model. A sketch of this pipeline’s overall shape follows the outline below.
- Build a Python application to download a CSV file for processing.
- Use Spark to process the file and perform a number of enrichments.
- Use Spark to train an ML model on the enriched data.
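To show the shape of this batch pipeline, here’s a minimal sketch using Spark’s DataFrame and MLlib APIs. The CSV path, column names, derived feature, and the choice of LinearRegression are illustrative assumptions rather than the book’s exact code.

```python
# Minimal sketch: load historical Bitcoin data, derive a feature, train a model.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("btc-batch-training").getOrCreate()

# Load the downloaded CSV (path and schema are illustrative assumptions).
df = spark.read.csv("data/bitcoin_historical.csv", header=True, inferSchema=True)

# Enrichment example: the spread between the high and low prices.
df = df.withColumn("spread", F.col("High") - F.col("Low"))

# Assemble features and train a regression model on the closing price.
assembler = VectorAssembler(inputCols=["Open", "Volume", "spread"], outputCol="features")
model = LinearRegression(featuresCol="features", labelCol="Close").fit(assembler.transform(df))

# Persist the trained model so the streaming pipeline can load it later.
model.save("models/btc_linear_regression")
```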
Chapter Six: Performing Predictive Analysis on Streaming Cryptocurrency Transaction Data
Perform predictive analysis on streaming data using the model trained in the previous chapter. This chapter uses the GDAX websocket feed. A sketch of the websocket-to-Kafka step follows the outline below.
- Build a Python application to connect to the GDAX websocket feed and push data to a Kafka topic.
- Use Spark Streaming to process the data from the Kafka topic and enrich it.
- Use Spark Streaming to apply the ML model to the enriched data and index the results into Elasticsearch.
- Use Kibana to visualize the data in Elasticsearch.
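As a preview of the first step, here’s a minimal sketch that connects to the GDAX websocket feed with the websocket-client library and forwards each message to a Kafka topic. The subscription message format and topic name are assumptions for illustration; check the GDAX API documentation for the exact format.

```python
# Minimal sketch: stream GDAX trade messages into a Kafka topic.
import json

import websocket  # pip install websocket-client
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def on_open(ws):
    # Subscribe to matched trades for BTC-USD (format assumed from the GDAX docs).
    ws.send(json.dumps({
        "type": "subscribe",
        "product_ids": ["BTC-USD"],
        "channels": ["matches"],
    }))

def on_message(ws, message):
    # Forward every raw message to Kafka for Spark Streaming to pick up.
    producer.send("gdax-transactions", json.loads(message))

ws = websocket.WebSocketApp(
    "wss://ws-feed.gdax.com",
    on_open=on_open,
    on_message=on_message,
)
ws.run_forever()
```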
Chapter Seven: Next Steps for Heading to Production
Suggested next steps for putting the data pipeline(s) into production.
- Deploy and scale using AWS services (EC2, ECS, Lambda, etc.)
- Monitor using Datadog and Grafana
- Going meta: Docker containers that monitor Docker containers
Robert Dempsey is a tested leader and technology professional delivering solutions and products that solve tough business challenges. His experience forming and leading agile teams, combined with more than 17 years of technology experience, enables him to solve complex problems while always keeping the bottom line in mind. He founded and built three startups in tech and marketing, developed and sold online applications, consulted for Fortune 500 and Inc. 500 companies, authored the Python Business Intelligence Cookbook, and has spoken nationally and internationally on software development and agile project management. He has expertise in from-the-front leadership and mentoring, microservices architectures and API development, cloud services (particularly Amazon Web Services), and distributed data gathering and processing systems.
Brandon Rose is a technologist who has spent his career seeking ways to use data to transform organizations across numerous industries in both the public and private sectors. As a management consultant, Brandon helped transform how Emergency Medical Services are delivered using a data-driven approach. As a Data Strategist in an R&D shop in the Federal Government, Brandon wrangled huge volumes of unstructured data to deliver mission-critical information to some of the top leaders across the U.S. Government and to fight human trafficking. As an entrepreneur, he has created businesses in the data-as-a-service, online reputation management, and lead generation spaces while consulting for multiple big data brokerages. Throughout, he has leveraged big data technology, machine learning, and cloud infrastructure to deliver efficient, effective, and transformative technology.
Get a Behind-The-Scenes Look Into the Writing of the Book
Sign up below to get a behind-the-scenes look at the process we’re going through to write this book. You’ll see the entire journey of taking an idea from a mind map through to publishing.