Agriculture | R package by Kevin Wright, kw.stat@gmail.com | This package contains datasets from published papers and books relating to agriculture including field crops, tree crops, animal studies, and a few others. |
Biology | The National Center for Biotechnology Information | GEO is a public functional genomics data repository supporting MIAME-compliant data submissions. Array- and sequence-based data are accepted. Tools are provided to help users query and download experiments and curated gene expression profiles. |
Biology | The National Institutes of Health Microbiome Project | A collection of data about the human genome |
Biology | Broad Institute | A collection of genomics cancer data |
Biology | Interdisciplinary Computing and Complex BioSystems (ICOS) research group | A database of protein structures. The ICOS PSP benchmarks repository contains an adjustable real-world family of benchmarks suitable for testing the scalability of classification/regression methods. When we test a machine learning method we usually choose a test suite containing datasets with a broad set of characteristics, as we are interested in knowing how the learning method reacts to a variety of scenarios. The PSP field provides us with a whole family of real-world classification/regression problems that can be adjusted almost arbitrarily in terms of number of variables, number of classes, class balance, etc. Thus, these datasets are an ideal benchmark suite for data mining methods. |
Consumer retail | Best Buy | Provides access to Best Buy's product data |
Consumer retail | Walmart | The Walmart Open API provides access to its extensive product catalog |
Crime | Montgomery County, MD | Traffic citations in Montgomery County, MD |
Crime | City of Cambridge, MA | City of Cambridge, MA crime data |
Crime | University of Maryland | A global terrorism database of attacks and their perpetrators |
Data set aggregators & APIs | Kansas City, MO | Kansas City public data |
Data set aggregators & APIs | University of California at Irvine | An aggregation of data sets that can be used to test and practice machine learning algorithms |
Data set aggregators & APIs | City of Cambridge, MA | Open data provided by the city of Cambridge, MA |
Data set aggregators & APIs | asdfree.com | Various survey data |
Data set aggregators & APIs | NYC open data | A collection of datasets about the population and economy of New York City |
Data set aggregators & APIs | The United Kingdom Data Service | The UK Data Service provides access to over 6,000 digital data collections for research and teaching purposes covering an extensive range of key economic and social data, both quantitative and qualitative, and spanning many disciplines and themes. |
Data set aggregators & APIs | New York Times | A list of APIs provided by the New York Times about a range of subjects including articles, blog and political data |
Data set aggregators & APIs | Canada's Open Government portal | Here you can explore how the Government of Canada is working with the national and international open government community to create greater transparency and accountability, increase citizen engagement, and drive innovation and economic opportunities through Open Data, Open Information, and Open Dialogue. |
Data set aggregators & APIs | Wiki | Open government initiatives archive |
Data set aggregators & APIs | Simply Stats | List of cities with open data |
Data set aggregators & APIs | City of London, U.K. | Various statistics about the population and economy of London, U.K. |
Data set aggregators & APIs | Statistics New Zealand Tatauranga Aotearoa | Statistics collected by the New Zealand government |
Data set aggregators & APIs | NYC Open Data | Hundreds of data sets containing information about New York City |
Data set aggregators & APIs | DataPortals | A list of open data portals around the world |
Data set aggregators & APIs | U.S. Open Data | The home of the U.S. Government's open data. Here you will find data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more. |
Data set aggregators & APIs | United Kingdom Open Data | An archive of data made available by the British government |
Data set aggregators & APIs | France Open Data | An archive of data made available by the French government |
Data set aggregators & APIs | The City of San Francisco | An archive of data about San Francisco |
Data set aggregators & APIs | U.S. Federal Government Agencies (Data.gov) | An archive of data from various U.S. government agencies |
Data set aggregators & APIs | Causality Workbench | A broad aggregation of data sets intended to test machine learning skills and algorithms |
Data set aggregators & APIs | Kaggle | Data sets from the world's largest data science and machine learning competition organizer |
Data set aggregators & APIs | Machine Learning Data Set Repository | A broad range of data sets intended to be mined and examined with machine learning techniques |
Data set aggregators & APIs | Reddit | Data repositories posted on Reddit |
Data set aggregators & APIs | R | A collection of datasets that come with R |
Data set aggregators & APIs | Google | Google's directory of public data sets on a broad range of topics |
Data set aggregators & APIs | Stats4Stem | A collection of datasets that come with R |
Data set aggregators & APIs | The Washington Post | A collection of data sets on demographics, health, safety, real estate, sports, education and government & politics assembled by The Washington Post |
Data set aggregators & APIs | Data Market | This is a collection of time series data on a broad variety of topics |
Data set aggregators & APIs | Augmented Intel | Searchable list of public data mining data sets |
Data set aggregators & APIs | ProgrammableWeb | A collection of APIs across a broad range of sectors |
Data set aggregators & APIs | Federal Emergency Management Agency (FEMA) | An aggregation of FEMA data sets about housing, public assistance, hazard mitigation, etc. |
Data set aggregators & APIs | Yahoo! | PlaceSpotter is a web service that identifies places mentioned in text, disambiguates those places, and returns unique identifiers (WOEIDs) for each. This also includes information about how many times the place was found in the text, and where in the text it was found. |
Data set aggregators & APIs | University College London | A database of web searches and click-throughs aggregated by University College London |
Economics & Demographics | CIA World Factbook | The World Factbook, produced for US policymakers and coordinated throughout the US Intelligence Community, marshals facts on every country, dependency, and geographic entity in the world. The Factbook provides information on the history, people, government, economy, energy, geography, communications, transportation, military, and transnational issues for 267 world entities. |
Economics & Demographics | Knoema | An aggregation of over 1,000 data sets about the population and economic development of numerous countries |
Economics & Demographics | United Nations | A collection of 34 databases with over 64 million records on economic and demographic trends across the globe |
Economics & Demographics | City Data | A collection of profiles of all cities in the U.S. |
Economics & Demographics | CKAN | An aggregation of open data sites from around the world including federal government and city government data |
Economics & Demographics | The World Bank | An aggregation of demographic and economic data sets from around the world |
Economics & Demographics | Gapminder | A collection of demographic and economic data from around the world as well as dynamic visualizations of these data sets |
Economics & Demographics | The Organisation for Economic Co-operation and Development (OECD) | An archive of census and economic data from around the world collected by the OECD |
Economics & Finance | International Monetary Fund (IMF) | Various data sets provided by the International Monetary Fund |
Economics & Finance | The Federal Reserve Board | A wide variety of economic data provided by The Federal Reserve Board of the United States |
Economics & Finance | University of Maryland | Several thousand economic time series, produced by a number of U.S. Government agencies and distributed in a variety of formats and media, can be found here. Data has been put into a standard, highly efficient, easy-to-use form for personal computers and made publicly available through this site. These series include national income and product accounts (NIPA), labor statistics, price indices, current business indicators, and industrial production. |
Economics & Finance | Chicago Board Options Exchange | Data on trading of financial options |
Economics & Finance | Federal Reserve Bank of St. Louis | Economic research data compiled by the Federal Reserve Bank of St. Louis |
Economics & Finance | NASDAQ | Stock market and financial data provided by NASDAQ |
Economics & Finance | Yahoo! Finance | Stock market and financial data provided by Yahoo! |
Economics & Finance | Google Finance | Stock market and financial data provided by Google |
Economics & Finance | Australian Bureau of Statistics | Census and economics data collected by the Australian Bureau of Statistics |
Economics & Finance | KIVA | A database of microfinance loans extended to small businesses around the world |
Entertainment | Columbia University | Feature data and metadata for 1 million songs |
Entertainment | Freebase | A community-curated database of well-known people, places, and things |
Environment | Climatic Research Unit (University of East Anglia) | A collection of data about weather patterns throughout the world |
Environment | Arizona State University | A collection of geospatial data that is well suited for geographic analysis |
Government & Politics | Archive-It | The leading web archiving service for collecting and accessing cultural heritage on the web |
Healthcare | Medicare | Medicare claims data |
Healthcare | Medicare | Medicare provider utilization and payment data |
Healthcare | Centers for Disease Control and Prevention (CDC) | Data sets about the health of the U.S. population |
Healthcare | Medicare | The National Health Expenditure Accounts (NHEA) are the official estimates of total health care spending in the United States. Dating back to 1960, the NHEA measures annual U.S. expenditures for health care goods and services, public health activities, government administration, the net cost of health insurance, and investment related to health care. The data are presented by type of service, sources of funding, and type of sponsor. |
Healthcare | Centers for Disease Control and Prevention (CDC) | US CDC Public Health datasets |
Healthcare | The Food and Drug Administration | OpenFDA provides APIs for a number of high-value structured datasets, including adverse events, drug product labeling, and recall enforcement reports. |
Healthcare | The Office on Women's Health | The system provides state- and county-level data for all 50 states, the District of Columbia, and US territories and possessions. Data are available by gender, race and ethnicity and come from a variety of national and state sources. The system is organized into eleven main categories, including demographics, mortality, natality, reproductive health, violence, prevention, disease and mental health. Within each main category, there are numerous subcategories. |
Healthcare | AARP | The AARP Public Policy Institute analyzes and publishes a wide range of state-specific data related to Americans 50+. |
Networks data | Princeton University | The International Networks Archive collects extensive current and historical data in numerous areas. All of the data is public and available for free download. |
Networks data | Carnegie Mellon University | Network data provided by Ancestry.com |
Networks data | The Koblenz Network Collection (KONECT) | KONECT contains over a hundred network datasets of various types, including directed, undirected, bipartite, weighted, unweighted, signed and rating networks. The networks of KONECT are collected from many diverse areas such as social networks, hyperlink networks, authorship networks, physical networks, interaction networks and communication networks. |
Networks data | Carnegie Mellon University | This is the collection of Enron e-mails released after one of the largest frauds in U.S. history was publicly revealed. This data can be used for text mining and network analysis. |
Networks data | Stanford University | This is one of the largest publicly available collections of network data provided by Stanford University for research purposes |
Networks data | Wikipedia | Wikipedia allows its entire database to be downloaded. One file that is available for download is a list of all page-to-page links. This might therefore be an excellent intermediate-sized data set to try out techniques such as PageRank. |
Networks data | Tore Opsahl's blog | A database of social and transportation network data |
Politics | New York Times | With the Campaign Finance API, you can retrieve data from United States Federal Election Commission filings. |
Politics | USAspending.gov | USAspending.gov is the publicly accessible, searchable website mandated by the Federal Funding Accountability and Transparency Act of 2006 to give the American public access to information on how their tax dollars are spent. |
Politics | Google | For any U.S. residential address, you can look up who represents that address at each elected level of government. During supported elections, you can also look up polling places, early vote locations, candidate data, and other election official information. |
Politics | Federal Election Commission | U.S. campaign finance reports and data |
Public opinion | General Social Survey | The GSS contains a standard 'core' of demographic, behavioral, and attitudinal questions, plus topics of special interest. Many of the core questions have remained unchanged since 1972 to facilitate time-trend studies as well as replication of earlier findings. The GSS takes the pulse of America, and is a unique and valuable resource. It has tracked the opinions of Americans over the last four decades. |
Public opinion | University of California, Los Angeles | This is a collection of survey data of the American public |
Space | NASA | NSSDCA archives more than 230 TB of digital data from about 550 mostly-NASA space science spacecraft, of which the most important 7 TB are electronically accessible. |
Sports | NBAStuffer.com | NBA play-by-play data |
Transportation | Capital Bikeshare | A collection of data about the Washington, D.C. bikeshare program |
Transportation | CarQuery | CarQuery API is an easy-to-use, JSON-based API for retrieving detailed car information, including year, make, model, trim, and specifications. |
Transportation | Massachusetts Institute of Technology | A compilation of airline data ranging from airplane types to operating data |
Transportation | United States Department of Transportation | A database of transportation statistics for the United States. Some data sets can be used to analyze the transportation network in the U.S. |
Web traffic | Google Trends | A compilation of data about what people are Googling around the world |
Data set aggregators & APIs | R packages | Run this code in R or RStudio to list every data set included with the packages you have installed: data(package = .packages(all.available = TRUE)) |