The top data science datasets right now


This post was originally published by Matt Przybyla at Towards Data Science

  1. Introduction
  2. Kaggle
  3. Datasets
  4. Summary
  5. References

Over time, you might notice the same datasets appearing in data science blogs, undergraduate studies, graduate courses, and online learning. Some reflect current world events; others are general but extremely popular datasets used for practicing and showcasing data science techniques and processes. At their best, these datasets promote the greater good by bringing intelligent minds together to solve a pressing issue. There are several sites that host datasets, but I find myself returning to the same one: Kaggle. This platform offers countless datasets and ranks them by trending metrics. I will be discussing four of the top 10 data science datasets right now.

As data becomes easier to obtain, what you do with it matters even more. These datasets list calls to action, tasks, and inspirations, so if you are unsure of how to handle the data, this part of the dataset information can be quite useful.

Kaggle [2] is a platform for data analysts, data scientists, and machine learning engineers that allows them to collaborate on solving problems, compete, and, overall, learn from one another. At the time of writing, there are nearly 46,000 datasets on Kaggle. You can filter the datasets by ‘Hottest’, ‘Most Votes’, ‘New’, ‘Updated’, and ‘Usability’.

The datasets I will be describing in this article are sorted by the ‘Hottest’ filter and consist of four of the top 10 datasets.

Below, I will highlight names, descriptions, and facts about four of the most popular datasets on Kaggle. Some datasets also have calls to action, tasks, inspiration, and prizes. Of course, in these unprecedented times, the top dataset pertains to COVID-19.

Description —

This dataset has around 7,900 votes. Its main purpose is to serve as an artificial intelligence (AI) challenge with AI2, CZI, MSR, Georgetown, NIH, and The White House. This open dataset was created in response to the COVID-19 pandemic and consists of nearly 15 GB of data. There are about 17 tasks associated with this dataset. An example of a task would be ‘What do we know about COVID-19 risk factors?’. It is recommended that data scientists use this dataset with natural language processing and AI techniques to support the fight against this prevalent disease.

This alone is what separates Kaggle from other dataset websites: it encourages people from different backgrounds to come together to fight for a pressing cause.

Alongside the description, a dataset has other key features, including the ‘Call to Action’ and ‘Prizes’.
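To make the natural language processing recommendation a little more concrete, here is a minimal sketch of how one might rank documents against a task question by simple keyword overlap. It is only an illustration: the abstracts are invented placeholders rather than entries from the dataset, the stopword list is hand-picked, and the function names (`tokenize`, `score`, `rank`) are hypothetical. A real approach would likely use stronger techniques such as TF-IDF or embeddings.

```python
import re
from collections import Counter

# A small, hand-picked stopword list (an assumption, not a standard list).
STOPWORDS = {"what", "do", "we", "know", "about", "the", "of", "a",
             "in", "and", "is", "for", "to", "are", "as"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(question, document):
    """Count how often the question's non-stopword terms occur in the document."""
    terms = set(tokenize(question)) - STOPWORDS
    counts = Counter(tokenize(document))
    return sum(counts[term] for term in terms)


def rank(question, documents):
    """Return (title, score) pairs sorted from most to least relevant."""
    scored = [(title, score(question, text)) for title, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder abstracts, invented purely for illustration.
abstracts = {
    "paper_a": "Age and smoking are discussed as risk factors for severe COVID-19.",
    "paper_b": "A survey of protein folding methods.",
    "paper_c": "Risk assessment of hospital ventilation.",
}

question = "What do we know about COVID-19 risk factors?"
ranking = rank(question, abstracts)

for title, points in ranking:
    print(title, points)
```

Counting raw term matches keeps the idea easy to follow, but it ignores document length and word importance, so treat it as a starting point rather than a recommended method.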

Call to Action —

Use data science to build text and data mining tools that help answer the scientific questions posed.

Prizes —

A $1,000 award per task.

Description —

This dataset describes electricity in India from 2017 to 2020. It is 265 KB in size. The dataset context mentions that India has been a part of rapid growth in electricity over nearly the past 35 years and, in turn, has shown an increase in the economy, exports, infrastructure, and household incomes. The main tags include computing, education, news, energy, renewable energy, and research. The inspiration behind the dataset is to discover how data science can impact renewable and non-renewable energy sources in India.

Description —

This unique dataset includes features spanning financial matters, brain research, national insights, and wellbeing. The exact factors are:

GDP per capita, Health Life Expectancy, Social support, Freedom to make life choices, Generosity, Corruption Perception, Residual error

At about 116 KB, this dataset has six separate CSVs, one for each year from 2015 to 2020. There is one task associated with this dataset: ‘Compare countries by happiness and other human metrics’. The goal of this dataset, as with any dataset, is ultimately up to you. It offers a different approach to quantifying happiness.
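As a rough sketch of how the per-year CSVs could be combined to compare countries, the example below reads one small CSV per year into a single lookup and then ranks countries and measures change over time. The column names `Country` and `Score`, the country names, and all values are made up for illustration; the real files may use different headers in different years, so check them before adapting this.

```python
import csv
import io

# Made-up sample rows standing in for two of the yearly files.
# Column names are assumptions, not the dataset's confirmed headers.
SAMPLE_FILES = {
    2019: "Country,Score\nAlpha,7.5\nBeta,6.0\n",
    2020: "Country,Score\nAlpha,7.7\nBeta,5.8\n",
}


def load_scores(files):
    """Combine per-year CSV text into a {country: {year: score}} mapping."""
    scores = {}
    for year, text in files.items():
        for row in csv.DictReader(io.StringIO(text)):
            scores.setdefault(row["Country"], {})[year] = float(row["Score"])
    return scores


def change(scores, country, start, end):
    """Return the score difference for one country between two years."""
    return scores[country][end] - scores[country][start]


scores = load_scores(SAMPLE_FILES)

# Countries ordered from highest to lowest score in 2020.
ranked_2020 = sorted(scores, key=lambda c: scores[c][2020], reverse=True)

print(ranked_2020)
print(round(change(scores, "Alpha", 2019, 2020), 2))
```

With the real files, each `io.StringIO(text)` would be replaced by an opened file handle, and the same mapping could then hold all six years.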

Similar to the COVID-19 dataset, this data can support work on a pressing health topic that affects several countries. The good news, according to the context of this dataset, is that malaria is preventable and curable. The features in this 212 KB dataset include, but are not limited to, country, year, and the number of cases. There are three CSVs: reported_numbers.csv, estimated_numbers.csv, and incidence_per_100_pop_at_risk.csv. The one task of this dataset is to ‘Explore whether the no. of cases of malaria increases every year?’.
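The task of exploring whether cases increase every year can be sketched as a simple check per country. The rows below are invented, and the column names (`Country`, `Year`, `No. of cases`) are assumptions about how the reported numbers might be laid out, so verify them against the actual CSV before reusing this.

```python
import csv
import io
from collections import defaultdict

# Invented rows for illustration; the column names are assumptions.
SAMPLE = """Country,Year,No. of cases
Alpha,2015,100
Alpha,2016,120
Alpha,2017,150
Beta,2015,300
Beta,2016,250
Beta,2017,280
"""


def cases_by_country(text):
    """Group case counts into a {country: {year: cases}} mapping."""
    series = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(text)):
        series[row["Country"]][int(row["Year"])] = int(row["No. of cases"])
    return series


def increases_every_year(yearly):
    """True if cases strictly rise between each pair of consecutive recorded years.

    A series with a single year is trivially True, since there is nothing to compare.
    """
    years = sorted(yearly)
    return all(yearly[a] < yearly[b] for a, b in zip(years, years[1:]))


series = cases_by_country(SAMPLE)
result = {country: increases_every_year(yearly) for country, yearly in series.items()}

print(result)
```

A strict year-over-year comparison is only one way to read the task. Real data may have missing years or blank counts, which this sketch does not handle.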

All in all, these are just some of the most popular datasets on the prominent platform Kaggle. There are thousands more, but these are some of the most voted and relevant datasets right now. They cover health (mainly COVID-19 and malaria), power/electricity, and happiness. To find more information, including detailed features/columns, sources of data, and examples of how to use each dataset with code and visualizations, check out the respective links attached to each title of the dataset.
