Gould Data Knowledge Base

Kaggle Tutorial Overview

The Importance of Validity:

Kaggle is a great source for finding data, especially sports data. But it is up to you to judge the validity and authority of the data you find. It is a wonderful resource, but you should always be asking the question, "Do I feel comfortable using this data for my project?"

Goals:

  • Find a dataset using specific search terms.
  • Read and understand what’s in your dataset.
  • Download and check the validity of your dataset.

Start:

  • Head to Kaggle. Then click “Datasets” in the top row.
  • You will need to make an account at some point if you want to download a dataset, comment, or create a kernel, but it’s not necessary yet.

Finding a Dataset

In the picture, a few of the important search tools are highlighted.

File Types:

  • Most of you will want to stick to the CSV (comma-separated values) file type. That will allow you to open the dataset in Microsoft Excel (or similar programs). The other file types are used for databases and web development. These are only recommended if you have past experience with them or are eager to do some extra work.
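If you would rather peek at a CSV file with Python than with Excel, the sketch below shows one way to do it using only the standard library. The function name `preview_csv` and the sample columns and values are made up for illustration; they are not from any real Kaggle dataset.

```python
import csv
import io
from itertools import islice


def preview_csv(source, max_rows=5):
    """Return the header row and up to max_rows data rows from an open CSV file."""
    reader = csv.reader(source)
    header = next(reader, [])  # an empty file yields an empty header
    rows = list(islice(reader, max_rows))
    return header, rows


# Stand-in for a downloaded file; the columns and values are invented.
sample = io.StringIO("Country,Year,Score\nA,2016,7.5\nB,2016,7.4\n")
header, rows = preview_csv(sample)
print(header)
for row in rows:
    print(row)
```

For a file you have actually downloaded, you would pass an open file instead of the sample text, for example `open("2016.csv", newline="", encoding="utf-8")`. Note that every value comes back as text, so numbers need converting before you do any math with them.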

Sort By:

  • This lets you sort results by relevance, votes, date released, or hotness. Relevance or votes (popularity) are probably what you’re looking for.

Search Datasets:

  • For a phrase: put it in quotes, e.g. “This is my phrase”.

Understanding Your Dataset: It’s More Than Just a CSV File

Once you’ve made your search, click on a result and it will take you to that dataset’s page. Again, a few features are highlighted.

Download:

  • Both of the download buttons on the right will download all of the data sources. If you are not signed in, clicking download will prompt you to create an account, which is as simple as signing in with Google and accepting some terms. If you scroll down, you will see some visualizations of the current data sources.

Visualizations:

  • If you click on “2016.csv”, for example, it will show you some visualizations of that file. It does not show you visualizations across all of the sources.

Kernels:

  • If you want to contribute to the open source community and you know some Python, you can create a new kernel, which can run a Python script on this dataset. Click on “Kernels” to see what some people have done with this dataset. This is a kernel for the World Happiness Report.

Validity:

  • Click on “Overview” to see some information about the dataset and its source. On the popular datasets, some people have made interesting visualizations or created statistical models. If you go to “Insights”, you can see some statistics regarding the use of the dataset and the kernels that people have made. If you click on the creator, which in this case is “Sustainable Development Solutions Network”, you can view their profile. In some cases, they will have their LinkedIn profile or their website attached. These are great ways to check who is behind the data and whether you trust them as a source.
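Checking who published the data is one part of validity; a quick look inside the downloaded file can be another. The sketch below is one possible set of basic checks, not an official Kaggle procedure: it counts blank cells per column, exact duplicate rows, and rows whose length does not match the header. The function name `sanity_check` and the sample data are invented for illustration.

```python
import csv
import io
from collections import Counter


def sanity_check(source):
    """Report row count, blank cells per column, duplicate rows, and ragged rows."""
    reader = csv.reader(source)
    header = next(reader, [])
    blanks = Counter()  # column name -> number of blank cells
    seen = Counter()    # row contents -> number of times that row appears
    ragged = 0          # rows with more or fewer cells than the header
    total = 0
    for row in reader:
        total += 1
        seen[tuple(row)] += 1
        if len(row) != len(header):
            ragged += 1
        for name, value in zip(header, row):
            if not value.strip():
                blanks[name] += 1
    # Every appearance of a row after its first counts as a duplicate.
    duplicates = sum(count - 1 for count in seen.values())
    return {
        "rows": total,
        "blank_cells": dict(blanks),
        "duplicate_rows": duplicates,
        "ragged_rows": ragged,
    }


# Invented sample: one blank Score and one repeated row.
sample = io.StringIO("Country,Year,Score\nA,2016,7.5\nB,2016,\nA,2016,7.5\n")
report = sanity_check(sample)
print(report)
```

A clean report does not prove the data is trustworthy, but lots of blank cells or duplicate rows are a good reason to go back and ask whether you feel comfortable using the dataset.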

Wrap Up

This tutorial helped you find a dataset on Kaggle using specific search tools. It also helped you understand the different features of your dataset's page, such as the kernels and visualizations that Kaggle provides.