These public data sources can be used for machine learning and deep learning research. Datasets are an integral part of the field of machine learning.
Finding a good machine learning dataset is often the biggest hurdle a developer has to clear before starting a data science project. Whether you’re new to machine learning or a professional data scientist, a good dataset is the key to extracting actionable insights.
Below is an up-to-date list of freely available data sources.
Datasets covering population demographics and a huge number of economic and development indicators from across the world.
The International Monetary Fund publishes data on international finances, debt rates, foreign exchange reserves, commodity prices and investments.
Data on educational institutions and education demographics from the US and around the world.
The UK’s largest collection of social, economic and population data.
A large number of polls providing data on public opinion about political and sporting issues.
The FBI is responsible for compiling and publishing national crime statistics, with free data available at national, state and county level.
Here you can find data on law enforcement agencies, jails, parole and probation agencies and courts.
Offers a free package with access to datasets covering world population, currencies, development indicators and weather data.
Public datasets covering planets and stars gathered by NASA’s space exploration missions.
Statistics compiled and published by the United Nations on international trade. Includes Comtrade Lab, a showcase of how cutting-edge analytics and tools are used to extract value from the data.
Up-to-date information on financial markets from around the world, including stock price indexes, commodities and foreign exchange.
Examine and analyze data on internet search activity and trending news stories around the world.
The advantage Twitter has over the others is that most conversations are public. This means that huge amounts of data are available through its API on who is talking about what, where, when and why.
Entire texts of academic papers, journals, books and legal case law.
As with Twitter, Instagram posts and conversations are public by default. Its APIs allow likes, mentions and business details to be analyzed.
The world’s largest open database of companies.
Information about job vacancies, candidates, salaries and employee satisfaction is available through their developer API.
Datasets in a number of formats drawn from the web’s largest resource on movies, television and people working in those industries.
Datasets on books, including catalogues from libraries around the world.
13,000 collated and labelled images of human faces, for use in developing applications involving facial recognition.
Microsoft’s open machine learning datasets for training systems in reading comprehension and question answering.
Collection of open datasets contributed by data scientists involved in machine learning projects.
Data on millions of online sales and auctions from eBay.
Information on nearly 4 million historical specimens in the London museum’s collection, as well as scientific sound recordings of the natural world.
More than one petabyte of data from particle physics experiments carried out by CERN.
Dataset hosted at archive.org covering music released around the world, for use in image processing research.
Over one billion public comments posted to Reddit between 2007 and 2015, for training language algorithms.
Freely available datasets covering everything from agriculture to weather.
Collates data from the body which oversees the network of EV charge points across the Republic of Ireland and Northern Ireland.
Pollution and air quality data from across London.
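Whichever source you pick, it helps to sanity-check a download before training on it. The sketch below is only an illustration: it assumes the data arrives as a CSV file with a header row (the list above does not say which formats each source offers), and both the `summarize_csv` helper and the sample rows are hypothetical. It counts the rows, lists the columns and reports how many values are missing in each column.

```python
import csv
import io
from collections import Counter


def summarize_csv(text_stream):
    """Return (row_count, column_names, missing_counts) for a CSV text stream.

    Hypothetical helper for illustration; it assumes the first line is a header.
    """
    reader = csv.DictReader(text_stream)
    missing = Counter()
    row_count = 0
    for row in reader:
        row_count += 1
        for column, value in row.items():
            if column is None:
                # Extra unnamed fields in an over-long row; ignore them here.
                continue
            # DictReader fills absent trailing fields with None.
            if value is None or value.strip() == "":
                missing[column] += 1
    return row_count, list(reader.fieldnames or []), dict(missing)


# Made-up sample standing in for a downloaded file (not real data).
SAMPLE = "country,year,population\nA,2020,100\nB,2020,\nC,2021,300\n"

rows, columns, missing = summarize_csv(io.StringIO(SAMPLE))
print(rows, columns, missing)
```

For a real file, pass an open file handle instead of the in-memory sample, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever you saved the download.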