Books and resources

Reinforce what you learned and broaden your horizons!

Category | Title / Location | Author | Description
Data mining Introduction to Data Mining

Pang-Ning Tan, Michael Steinbach and Vipin Kumar
Introduction to Data Mining presents fundamental concepts and algorithms for those learning data mining for the first time. Each major topic is organized into two chapters, beginning with basic concepts that provide necessary background for understanding each data mining technique, followed by more advanced concepts and algorithms.
Data mining Exploratory Data Analysis

John Tukey
The approach in this introductory book is that of informal study of data. Methods range from plotting picture-drawing techniques to rather elaborate numerical summaries. Several of the methods are the original creations of the author, and all can be carried out either with pencil or aided by hand-held calculator.
Text analysis Mining the Social Web

Matthew A. Russell
How can you tap into the wealth of social web data to discover who's making connections with whom, what they're talking about, and where they're located? With this expanded and thoroughly revised edition, you'll learn how to acquire, analyze, and summarize data from all corners of the social web, including Facebook, Twitter, LinkedIn, Google+, GitHub, email, websites, and blogs.
Text analysis Social Media Mining with R

Nathan Danneman and Richard Heimann
Whether you are an undergraduate who wishes to get hands-on experience working with social data from the Web, a practitioner wishing to expand your competencies and learn unsupervised sentiment analysis, or you are simply interested in social data analysis, this book will prove to be an essential asset. No previous experience with R or statistics is required. This book provides detailed instructions on how to obtain, process, and analyze a variety of socially-generated data while providing a theoretical background to help you accurately interpret your findings. You will be shown R code and examples of data that can be used as a springboard as you get the chance to undertake your own analyses of business, social, or political data.
Network analysis Analyzing the Social Web

Jennifer Golbeck
Analyzing the Social Web provides a framework for the analysis of public data currently available and being generated by social networks and social media, like Facebook, Twitter, and Foursquare. Access and analysis of this public data about people and their connections to one another allows for new applications of traditional social network analysis techniques that let us identify things like who are the most important or influential people in a network, how things will spread through the network, and the nature of people's relationships. Analyzing the Social Web introduces you to these techniques, shows you their application to many different types of social media, and discusses how social media can be used as a tool for interacting with the online public.
Network analysis Linked

Albert-Laszlo Barabasi
A cocktail party. A terrorist cell. Ancient bacteria. An international conglomerate. All are networks, and all are a part of a surprising scientific revolution. In Linked, Albert-Laszlo Barabasi, a top expert in the new science of networks, takes us on an intellectual adventure to prove that social networks, corporations, and living organisms are more similar than previously thought. Barabasi shows that grasping a full understanding of network science will someday allow us to design blue-chip businesses, stop the outbreak of deadly diseases, and influence the exchange of ideas and information. Linked tells the story of the true science of the future and of experiments in statistical mechanics on the internet, all vital parts of what would eventually be called the Barabasi-Albert model.
Visualization R Graphics Cookbook

Winston Chang
This practical guide provides more than 150 recipes to help you generate high-quality graphs quickly, without having to comb through all the details of R's graphing systems. Each recipe tackles a specific problem with a solution you can apply to your own project, and includes a discussion of how and why the recipe works. Most of the recipes use the ggplot2 package, a powerful and flexible way to make graphs in R. If you have a basic understanding of the R language, you're ready to get started.
Visualization The Visual Display of Quantitative Information

Edward Tufte
This is a classic book on statistical graphics, charts and tables. It discusses the theory and practice of designing data graphics and includes 250 illustrations of the best (and a few of the worst) statistical graphics, with detailed analysis of how to display data for precise, effective, quick analysis. Topics covered include design of high-resolution displays, editing and improving graphics, the data-ink ratio, time-series, relational graphics, data maps, multivariate designs, and detection of graphical deception: design variation vs. data variation.
Visualization The Elements of Graphing Data

William S. Cleveland
This book covers many visualization methods in science and technology. The Elements of Graphing Data includes many new ideas and methods and is an excellent methodological resource for researchers.
Predictive analytics Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die

Eric Siegel
This book is easily understood by all readers. Rather than a "how to" for hands-on techies, the book entices lay-readers and experts alike by covering new case studies and the latest state-of-the-art techniques. You have been predicted by companies, governments, law enforcement, hospitals, and universities. Their computers say, "I knew you were going to do that!" These institutions are seizing upon the power to predict whether you're going to click, buy, lie, or die. Why? For good reason: predicting human behavior combats financial risk, fortifies healthcare, conquers spam, toughens crime fighting, and boosts sales.
Predictive analytics Data Smart: Using Data Science to Transform Information into Insight

John W. Foreman
Data Science gets thrown around in the press like it's magic. Major retailers are predicting everything from when their customers are pregnant to when they want a new pair of Chuck Taylors. It's a brave new world where seemingly meaningless data can be transformed into valuable insight to drive smart business decisions. But how does one exactly do data science? Do you have to hire one of these priests of the dark arts, the "data scientist," to extract this gold from your data? Nope. Data science is little more than using straightforward steps to process raw data into actionable insight. And in Data Smart, author and data scientist John Foreman will show you how that's done within the familiar environment of a spreadsheet.
Predictive analytics Doing Data Science

Cathy ONeil & Rachel SchuttNow that people are aware that data can make the difference in an election or a business model, data science as an occupation is gaining ground. But how can you get started working in a wide-ranging, interdisciplinary field that's so clouded in hype? This insightful book, based on Columbia University's Introduction to Data Science class, tells you what you need to know. In many of these chapter-long lectures, data scientists from companies such as Google, Microsoft, and eBay share new algorithms, methods, and models by presenting case studies and the code they use. If you're familiar with linear algebra, probability, and statistics, and have programming experience, this book is an ideal introduction to data science.
Predictive analytics Data Science for Business

Foster Provost and Tom Fawcett
Written by renowned data science experts Foster Provost and Tom Fawcett, Data Science for Business introduces the fundamental principles of data science, and walks you through the "data-analytic thinking" necessary for extracting useful knowledge and business value from the data you collect. This guide also helps you understand the many data-mining techniques in use today. Based on an MBA course Provost has taught at New York University over the past ten years, Data Science for Business provides examples of real-world business problems to illustrate these principles. You'll not only learn how to improve communication between business stakeholders and data scientists, but also how to participate intelligently in your company's data science projects. You'll also discover how to think data-analytically, and fully appreciate how data science methods can support business decision-making.
Predictive analytics Collective Intelligence

Toby Segaran
This fascinating book demonstrates how you can build Web 2.0 applications to mine the enormous amount of data created by people on the Internet. With the sophisticated algorithms in this book, you can write smart programs to access interesting datasets from other web sites, collect data from users of your own applications, and analyze and understand the data once you've found it.
Predictive analytics Data Classification: Algorithms and Applications

Charu C. Aggarwal
Research on the problem of classification tends to be fragmented across such areas as pattern recognition, database, data mining, and machine learning. Addressing the work of these different communities in a unified way, Data Classification: Algorithms and Applications explores the underlying algorithms of classification as well as applications of classification in a variety of problem domains, including text, multimedia, social network, and biological data.
Predictive analytics Semi-Supervised Learning

Olivier Chapelle, Bernhard Schölkopf and Alexander Zien
In the field of machine learning, semi-supervised learning (SSL) occupies the middle ground, between supervised learning (in which all training examples are labeled) and unsupervised learning (in which no label data are given). Interest in SSL has increased in recent years, particularly because of application domains in which unlabeled data are plentiful, such as images, text, and bioinformatics. This first comprehensive overview of SSL presents state-of-the-art algorithms, a taxonomy of the field, selected applications, benchmark experiments, and perspectives on ongoing and future research.
Statistics and probability Think Stats

Allen B. Downey
If you know how to program, you have the skills to turn data into knowledge using the tools of probability and statistics. This concise introduction shows you how to perform statistical analysis computationally, rather than mathematically, with programs written in Python. You'll work with a case study throughout the book to help you learn the entire data analysis process, from collecting data and generating statistics to identifying patterns and testing hypotheses. Along the way, you'll become familiar with distributions, the rules of probability, visualization, and many other tools and concepts.
Statistics and probability Think Bayes

Allen B. Downey
If you know how to program with Python and also know a little about probability, you're ready to tackle Bayesian statistics. With this book, you'll learn how to solve statistical problems with Python code instead of mathematical notation, and use discrete probability distributions instead of continuous mathematics. Once you get the math out of the way, the Bayesian fundamentals will become clearer, and you'll begin to apply these techniques to real-world problems. Bayesian statistical methods are becoming more common and more important, but not many resources are available to help beginners. Based on undergraduate classes taught by author Allen Downey, this book's computational approach helps you get a solid start.
Statistics and probability Statistics in Plain English

Timothy C. Urdan
This book provides a brief, simple overview of statistics to help readers gain a better understanding of how statistics work and how to interpret them correctly. Each chapter describes a different statistical technique, ranging from basic concepts like central tendency and describing distributions to more advanced concepts such as t tests, regression, repeated measures ANOVA, and factor analysis. Each chapter begins with a short description of the statistic and when it should be used. This is followed by a more in-depth explanation of how the statistic works. Finally, each chapter ends with an example of the statistic in use, and a sample of how the results of analyses using the statistic might be written up for publication. A glossary of statistical terms and symbols is also included.
Regression and time series analysis Introduction to Time Series Analysis and Forecasting

Douglas C. Montgomery, Cheryl L. Jennings and Murat Kulahci
Introduction to Time Series Analysis and Forecasting is great for readers who need to apply the methods and models presented but have little background in mathematics and statistics. This book presents the underlying theories of time series analysis that are needed to analyze time-oriented data and construct real-world short- to medium-term statistical forecasts. Authored by highly experienced academics and professionals in engineering statistics, the Second Edition features discussions on both popular and modern time series methodologies as well as an introduction to Bayesian methods in forecasting. Introduction to Time Series Analysis and Forecasting, Second Edition also includes:
* Over 300 exercises from diverse disciplines including health care, environmental studies, engineering, and finance
* More than 50 programming algorithms using JMP(R), SAS(R), and R that illustrate the theory and practicality of forecasting techniques in the context of time-oriented data
R The Art of R Programming

Norman Matloff
R is the world's most popular language for developing statistical software: Archaeologists use it to track the spread of ancient civilizations, drug companies use it to discover which medications are safe and effective, and actuaries use it to assess financial risks and keep economies running smoothly. The Art of R Programming takes you on a guided tour of software development with R, from basic types and data structures to advanced topics like closures, recursion, and anonymous functions. No statistical knowledge is required, and your programming skills can range from hobbyist to pro.
R R for Everyone

Jared P. Lander
Using the open source R language, you can build powerful statistical models to answer many of your most challenging questions. Most R books assume far too much knowledge to be of help. R for Everyone is the solution. Drawing on his unsurpassed experience teaching new users, professional data scientist Jared P. Lander has written the perfect tutorial for anyone new to statistical programming and modeling. Organized to make learning easy and intuitive, this guide focuses on the 20 percent of R functionality you'll need to accomplish 80 percent of modern data tasks.
Python Python for Data Analysis

Wes McKinney
Python for Data Analysis is concerned with the nuts and bolts of manipulating, processing, cleaning, and crunching data in Python. It is also a practical, modern introduction to scientific computing in Python, tailored for data-intensive applications. This is a book about the parts of the Python language and libraries you'll need to effectively solve a broad set of data analysis problems. This book is not an exposition on analytical methods using Python as the implementation language.
Python Learning Python

Mark Lutz
Get a comprehensive, in-depth introduction to the core Python language with this hands-on book. Based on author Mark Lutz's popular training course, this updated fifth edition will help you quickly write efficient, high-quality code with Python. It's an ideal way to begin, whether you're new to programming or a professional developer versed in other languages.
Python Learn Python in One Day and Learn It Well

Jamie Chan
Have you always wanted to learn computer programming but are afraid it'll be too difficult for you? Or perhaps you know other programming languages but are interested in learning the Python language fast? This book is for you. You no longer have to waste your time and money learning Python from lengthy books or complicated Python tutorials. The best way to learn Python is by doing. This book includes a complete project at the end of the book that requires the application of all the concepts taught previously. Working through the project will not only give you an immense sense of achievement, it'll also help you retain the knowledge and master the language.
Big data technology Hadoop: The Definitive Guide

Tom White
Ready to unlock the power of your data? With this comprehensive guide, you'll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.
Big data technology Advanced Analytics with Spark

Sandy Ryza, Uri Laserson, Sean Owen, and Josh Willis
In this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by example.
Big data technology Hive

Edward Capriolo, Dean Wampler and Jason Rutherglen
Need to move a relational database application to Hadoop? This comprehensive guide introduces you to Apache Hive, Hadoop's data warehouse infrastructure. You'll quickly learn how to use Hive's SQL dialect, HiveQL, to summarize, query, and analyze large datasets stored in Hadoop's distributed filesystem. This example-driven guide shows you how to set up and configure Hive in your environment, provides a detailed overview of Hadoop and MapReduce, and demonstrates how Hive works within the Hadoop ecosystem. You'll also find real-world case studies that describe how companies have used Hive to solve unique problems involving petabytes of data.
Big data technology HBase

Lars George
If you're looking for a scalable storage solution to accommodate a virtually endless amount of data, this book shows you how Apache HBase can fulfill your needs. As the open source implementation of Google's BigTable architecture, HBase scales to billions of rows and millions of columns, while ensuring that write and read performance remain constant. Many IT executives are asking pointed questions about HBase. This book provides meaningful answers, whether you're evaluating this non-relational database or planning to put it into practice right away.
Big data technology Mahout in Action

Sean Owen, Robin Anil, Ted Dunning and Ellen Friedman
A computer system that learns and adapts as it collects data can be really powerful. Mahout, Apache's open source machine learning project, captures the core algorithms of recommendation systems, classification, and clustering in ready-to-use, scalable libraries. With Mahout, you can immediately apply to your own projects the machine learning techniques that drive Amazon, Netflix, and others. Mahout in Action is a hands-on introduction to machine learning with Apache Mahout. Following real-world examples, the book presents practical use cases and then illustrates how Mahout can be applied to solve them. Includes a free audio- and video-enhanced ebook.
Databases MongoDB: The Definitive Guide

Kristina Chodorow
Manage the huMONGOus amount of data collected through your web application with MongoDB. This authoritative introduction, written by a core contributor to the project, shows you the many advantages of using document-oriented databases, and demonstrates how this reliable, high-performance system allows for almost infinite horizontal scalability.
Other Analyzing the Analyzers

Harlan Harris, Sean Murphy and Marck Vaisman
Despite the excitement around "data science," "big data," and "analytics," the ambiguity of these terms has led to poor communication between data scientists and organizations seeking their help. In this report, authors Harlan Harris, Sean Murphy, and Marck Vaisman examine their survey of several hundred data science practitioners in mid-2012, when they asked respondents how they viewed their skills, careers, and experiences with prospective employers. The results are striking.
Additional resources R-Bloggers.com R-Bloggers.com is a central hub (i.e., a blog aggregator) of content collected from bloggers who write about R (in English). The site helps R bloggers and users connect and follow the "R blogosphere."
Additional resources cran.r-project.org CRAN is a network of FTP and web servers around the world that store identical, up-to-date versions of code and documentation for R. Here you can find the latest R packages and help files.
Additional resources https://plot.ly/r/ An R library of interactive visualization packages.
Additional resources www.htmlwidgets.org A library of dynamic R visualizations. Learn how to create an R binding for your favorite JavaScript library and enable use of it in the R console, in R Markdown documents, and in Shiny web applications.
Additional resources cran.r-project.org/web/packages/ Database of R packages
Additional resources R packages by industry and subject area A searchable index of R packages and functions by industry of application and field of study
Additional resources Full Time Data Science Bootcamps
Masters in Data Science
A list of full-time data science bootcamps - what to look for and what to avoid!
Statistics and probability www.statsoft.com/Textbook
StatSoft
The Electronic Statistics Textbook begins with an overview of the relevant elementary (pivotal) concepts and continues with a more in-depth exploration of specific areas of statistics, organized by "modules" and accessible by buttons, representing classes of analytic techniques. A glossary of statistical terms and a list of references for further study are included.