We have provided a new way to contribute to Awesome Public Datasets; the original PR entrance directly on the repo is closed forever. This is a topic-centric list of high-quality public data sources, collected and tidied from blogs, answers, and user responses.
Most of the datasets listed below are free; however, some are not. Other amazingly awesome lists can be found in sindresorhus's awesome list.
Social media and social networking sites are online platforms where people can connect with their real-life family, friends, and colleagues, and build new relations with others who share similar interests.
The most popular English-language social media sites are Twitter, Facebook, and Reddit. Social media data is the largest, most dynamic dataset about human behavior.
It gives social scientists and business experts a world of new opportunities to understand people, groups, and society.
Sentiment analysis is the most common way that machine learning is applied to social media data. For example, when a new product is released, your customers might tweet about it or leave a review on Amazon.
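As a minimal illustration of the idea (not a production model, and not any method the article describes), a lexicon-based sentiment scorer over short posts might look like this; the word lists are toy assumptions:

```python
# Toy lexicon-based sentiment scoring -- a hedged sketch, not a trained model.
POSITIVE = {"love", "great", "awesome", "good", "excellent"}
NEGATIVE = {"hate", "bad", "terrible", "awful", "broken"}

def sentiment(text):
    """Return 'positive', 'negative', or 'neutral' by counting lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Real systems replace the hand-made lexicons with a model trained on labeled data, but the input/output shape is the same: raw post text in, sentiment label out.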
One way to gather social media data is to use a web scraping tool that extracts data from social media channels, such as Facebook, Twitter, LinkedIn, and Instagram. Please note that for some social networking sites, using data from their platform is a violation of their terms of service. You should read the terms of service carefully to avoid legal issues. Another good place to start is the official API documentation for social media sites like Facebook and Twitter.
This will tell you how to build a query and how to search for posts with exact words. Lionbridge AI provides custom social media datasets in many languages for your specific machine learning project needs.
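To make "building a query" concrete, here is a generic sketch of constructing a search URL for an exact phrase; the endpoint and parameter names are hypothetical placeholders, since each social media API defines its own base URL, parameters, and authentication:

```python
from urllib.parse import urlencode

def build_search_url(base, phrase, count=100):
    """Build a query URL that searches for posts containing the exact phrase.

    `base` is a hypothetical endpoint; real APIs also require auth headers
    or tokens, which are omitted here.
    """
    params = {"q": f'"{phrase}"', "count": count}  # quotes request an exact match
    return f"{base}?{urlencode(params)}"

url = build_search_url("https://api.example.com/search", "new product")
```

Check the API's own documentation for the actual query syntax before relying on any of these names.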
Article by Rei Morikawa, June 03.
We consider all the YouTube videos to form a directed graph, where each video is a node in the graph.
If a video b is in the related video list (first 20 only) of a video a, then there is a directed edge from a to b. Our crawler uses a breadth-first search to find videos in the graph. We define the initial set of 0-depth video IDs, which the crawler reads into a queue at the beginning of the crawl. When processing each video, it checks the list of related videos and adds any new ones to the queue. Given a video ID, the crawler first extracts information from the YouTube API, which contains all the meta-data except age, category, and related videos.
The crawler then scrapes the video's webpage to obtain the remaining information. The crawl went to more than four depths, finding approximately thousand videos in about five days. In the following weeks we ran the crawler every two to three days, each time defining the initial set of videos from the lists of "Most Viewed", "Top Rated", and "Most Discussed", for "Today" and "This Week". On average, the crawl finds 73 thousand distinct videos each time in less than 9 hours.
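The breadth-first crawl described above can be sketched as follows. Here `related_videos` is a stand-in for the combination of API call and page scrape; the function and parameter names are illustrative, not the crawler's actual code:

```python
from collections import deque

def bfs_crawl(seed_ids, related_videos, max_depth=4):
    """Breadth-first search over the related-video graph.

    `seed_ids` is the initial set of 0-depth video IDs read into the queue;
    `related_videos(vid)` returns the related video IDs of `vid` (a
    placeholder for the YouTube API call plus webpage scrape).
    """
    seen = set(seed_ids)
    queue = deque((vid, 0) for vid in seed_ids)
    while queue:
        vid, depth = queue.popleft()
        if depth >= max_depth:
            continue
        # Only the first 20 related videos define edges in the graph.
        for rel in related_videos(vid)[:20]:
            if rel not in seen:
                seen.add(rel)
                queue.append((rel, depth + 1))
    return seen
```

The queue ensures videos are processed level by level, so all videos at depth d are visited before any at depth d+1.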
All the 35 datasets can be downloaded from here. In each package, there are: 1 "0. To study the growth trend of video popularity, we also updated the statistics of some previously found videos. For this crawl we only retrieve the number of views for relatively new videos, uploaded after February 15th. This crawl was performed once a week from March 5th to April 16th, which results in seven datasets. The 7 datasets can be downloaded from here.
In each package, there are: 1 "update. We updated the statistics of videos once a week for 21 weeks. The new data are also presented here. We have also separately crawled the video file size and video bitrate information. To get the file size, the crawler retrieves the response information from the server when requesting to download the video file and extracts the information on the size of the download.
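Extracting the download size from the server's response typically means reading the `Content-Length` header. A minimal sketch, assuming a plain HTTP endpoint (the helper names are ours, not the crawler's):

```python
from urllib.request import Request, urlopen

def parse_content_length(headers):
    """Extract the download size in bytes from response headers (case-insensitive)."""
    for name, value in headers.items():
        if name.lower() == "content-length":
            return int(value)
    return None

def remote_file_size(url):
    """Issue a HEAD request and read Content-Length, roughly as a crawler might.

    A HEAD request returns only the headers, so the video itself is never
    downloaded just to learn its size.
    """
    req = Request(url, method="HEAD")
    with urlopen(req) as resp:
        return parse_content_length(dict(resp.headers))
```

Note that some servers stream responses with chunked encoding and omit `Content-Length` entirely, so the `None` case has to be handled.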
Some videos have the bitrate embedded in the FLV video meta-data, which the crawler extracts after downloading the meta-data of the video file.

We all know that to build up a machine learning project, we need a dataset. Generally, these machine learning datasets are used for research purposes. A dataset is a collection of homogeneous data.
A dataset is used to train and evaluate the machine learning model, and it plays a vital role in building an efficient and reliable system. If your dataset is noise-free and standard, then your system will give better accuracy. At present, we are enriched with numerous datasets: business-related data, medical data, and many more. However, the actual problem is to find the relevant ones according to the system requirements.
For developing a machine learning or data science project, it's important to gather relevant data and create a noise-free, feature-enriched dataset. Below we present the 20 best machine learning datasets in such a way that you can download each dataset and develop your own machine learning project.
After analyzing the web for hours and hours, we have outlined this list to boost your machine learning knowledge. ImageNet is one of the best datasets for machine learning. Generally, it is used in the computer vision research field. This project is an image dataset that is organized according to the WordNet hierarchy.
In WordNet, each concept is described using a synset: a set of synonymous words or word phrases. Another notable machine learning dataset for classification problems is the breast cancer diagnostic dataset.
This breast cancer diagnostic dataset is built from digitized images of fine needle aspirates of breast masses. In each digitized image, the features of the cell nuclei are outlined. We all know that sentiment analysis is a popular application of natural language processing (NLP).
Are you interested in building a sentiment analyzer? Then this Twitter sentiment analysis dataset is for you; it is a text processing task that may help you enhance your machine learning skills. One of the most renowned text classification problems is news classification. To develop your news classifier, you need a standard dataset, and this BBC news dataset is well suited. There are five predefined classes.
The five classes are business, entertainment, politics, sport, and technology. Do you want to work with handwritten digits?
This machine learning dataset is for image recognition; it is a well-known and interesting machine learning dataset. A convenient fact about this dataset is that it offers separate instances for training and for testing.
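Not every dataset ships with a predefined train/test split like this one. When it doesn't, you can make the split yourself; a minimal sketch (the function name and fractions are our own, not part of any dataset's tooling):

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle a dataset and split it into train and test portions.

    Shuffling before splitting avoids ordering bias (e.g. data sorted by
    class label); a fixed seed keeps the split reproducible.
    """
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]
```

Evaluating only on the held-out test portion is what lets you estimate how the model behaves on data it has never seen.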
We all know natural language processing is about text data.
This repo contains starter code for training and evaluating machine learning models over the YouTube-8M dataset. The code gives an end-to-end working example for reading the dataset, training a TensorFlow model, and evaluating the performance of the model.
The starter code requires Tensorflow. If you haven't installed it yet, follow the instructions on tensorflow. This code has been tested with Tensorflow 1. Going forward, we will continue to target the latest released version of Tensorflow. Please verify that you have Python 3. Please see our dataset website for up-to-date download instructions.
So the structure should look like. Train using train. TLDR: frame-level features are compressed, and this flag uncompresses them. NOTE: This script can be slow the first time it runs, because it reads the TFRecord data and builds a label cache. Once the label cache is built, subsequent evaluations will be much faster.
12 Best Social Media Datasets for Machine Learning
We find it useful to keep the tensorboard instance always running as we train and evaluate different models. If your Tensorflow installation has GPU support, e.g.
You can verify your installation by running. If at least one GPU was found, the forward and backward passes will be computed with the GPUs, whereas the CPU will be used primarily for the input and output pipelines. If you have multiple GPUs, the current default behavior is to use only one of them. This option requires you to have an appropriately configured Google Cloud Platform account. To create and configure your account, please make sure you follow the instructions here. Please also verify that you have Python 3.
You can browse the storage buckets you created on Google Cloud, for example, to access the trained models, prediction CSV files, etc. Alternatively, you can use the 'gsutil' command to download the files directly. For example, to download the output of the inference code from the previous section to your local machine, run: All gcloud commands should be done from the directory immediately above the source code. You should be able to see the source code directory if you run 'ls'.
As you are developing your own models, you will want to test them quickly to flush out simple problems without having to submit them to the cloud.
In the 'gsutil' command above, the 'package-path' flag refers to the directory containing the 'train. The module-name refers to the specific python script which should be executed (in this case, the train module). It may take several minutes before the job starts running on Google Cloud. When it starts, you will see outputs like the following: At this point you can disconnect your console by pressing "ctrl-c".
The model will continue to train indefinitely in the Cloud.
If you are using Google Cloud Shell, you can instead click the Web Preview button on the upper left corner of the Cloud Shell window and select "Preview on port ". This will bring up a new browser tab with the Tensorboard view.
Anonymous Microsoft Web Data
Audiology Standardized
Breast Cancer Wisconsin Original
Breast Cancer Wisconsin Prognostic
Breast Cancer Wisconsin Diagnostic
Chess King-Rook vs.
Contraceptive Method Choice
Molecular Biology Promoter Gene Sequences
Molecular Biology Protein Secondary Structure
Molecular Biology Splice-junction Gene Sequences
Page Blocks Classification
Optical Recognition of Handwritten Digits
Pen-Based Recognition of Handwritten Digits
Qualitative Structure Activity Relationships
Low Resolution Spectrometer
Teaching Assistant Evaluation

WRI produces and curates data sets as part of our commitment to turn information into action.
These products are based on our research, which is held to traditional academic standards of excellence, including objectivity and rigor. The database covers approximately 30, power plants from countries. Aqueduct Global Maps 2. Featured: Aqueduct Global Flood Risk Maps. For the current scenario, we used hydrological data from through for generating flood inundations for 9 return periods, from 2-year flood to year flood, and GDP, population, and land use data for assessing flood impacts.
Featured Aqueduct Global Maps 2.
In response to this demand, the World Resources Institute developed the map of blast and poison fishing (1 km grid) for use in the Reefs at Risk Revisited project, as a component of the model of overfishing and destructive fishing pressure on coral reefs.
This layer designates the threat of blast and poison fishing. Displays tree cover loss in the country from to , represented by pixel values, respectively. Featured: Intact Forest Landscapes. The Intact Forest Landscapes (IFL) data set identifies unbroken expanses of natural ecosystems within the zone of forest extent that show no signs of significant human activity and are large enough that all native biodiversity, including viable populations of wide-ranging species, could be. These include: past thermal stress, i.e.
Shipping Activity: This dataset was used as base data in Reefs at Risk. Shipping activity data were used in the model of threat to coral reefs from marine-based pollution and damage in the Reefs at Risk Revisited project. The certification includes plantations managed by the mill and plantations managed by other suppliers, including smallholders. To have its oil mill certified, a palm oil producer must show a

Digital Globe GEO1 Satellite Imagery: Effective emergency planning and response requires quick and easy access to accurate, up-to-date information.
Digital Globe QB01 Satellite Imagery: Effective emergency planning and response requires quick and easy access to accurate, up-to-date information. Featured: Sarawak oil palm concessions. This data set provides the boundaries of known oil palm concessions for the state of Sarawak, Malaysia, and was compiled from available public documents. Where available, associated information provided with this data set includes licensee name and permit number. Reefs at Risk Revisited Local Threats Data: Threats from coastal development, marine-based pollution and damage, overfishing and destructive fishing, and watershed-based pollution were analyzed separately.
These threats were integrated into the Integrated Local Threat index. Past thermal stress was integrated with local threats. Digital Globe WV01 Satellite Imagery: Effective emergency planning and response requires quick and easy access to accurate, up-to-date information.
Bleaching observations: This dataset was used as base data in Reefs at Risk. Point locations of reported observations of coral bleaching between and .