[AI News | Forbes] 20+ Amazing (And Free) Data Sources Anyone Can Use To Build AIs



When we talk about artificial intelligence (AI) in business and society today, what we really mean is machine learning (ML). This refers to applications that use algorithms (sets of instructions) to become increasingly good at performing a particular task as they are exposed to more and more data relating to that task.
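
As a rough illustration of "becoming better with more data", here is a minimal Python sketch that is not taken from the article. It uses a toy one-nearest-neighbour classifier; the hidden rule (`x >= 7`), the example points, and all function names are assumptions made up for this illustration.

```python
def true_label(x):
    """Hidden rule the learner is trying to pick up (an assumption for this toy)."""
    return 1 if x >= 7 else 0


def predict(train, x):
    """Predict the label of x by copying the label of the closest training example."""
    _, nearest_label = min(train, key=lambda example: abs(example[0] - x))
    return nearest_label


def accuracy(train):
    """Fraction of the test points 0..9 that the learner labels correctly."""
    test_points = range(10)
    correct = sum(predict(train, x) == true_label(x) for x in test_points)
    return correct / len(test_points)


# Labelled examples (x, label), revealed to the learner a few at a time.
examples = [(0, 0), (9, 1), (6, 0), (7, 1)]

# Accuracy after seeing the first 2, 3, and 4 examples.
accuracies = [accuracy(examples[:n]) for n in (2, 3, 4)]
print(accuracies)  # [0.8, 0.9, 1.0]
```

On this toy data, accuracy on the ten test points rises as the training set grows from two to four examples. Real ML systems are far more complex, but the pattern is the same one the paragraph describes.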


These tasks could be anything from answering questions and creating text or images (as demonstrated by apps like ChatGPT or Dall-E) to recognizing images (computer vision) or navigating autonomous vehicles from A to B.

All of these tasks require data, and businesses that want to train their own ML algorithms in order to automate their day-to-day tasks need sources of data.

What types of data are there?

Business data is commonly divided into two categories: internal and external data.

Internal data is data collected by organizations themselves from within their own operations. This commonly includes financial data, customer feedback data, HR data, operational data, and many more sources. Data collected by an organization monitoring its own operations is said to be proprietary data, and is valuable because it gives information specific to that business.

External data comes from sources outside of the organization and is typically collected from third-party data sources such as those listed below. If data is freely available to anyone, it is called open data.

Beyond this, data can also be classified as structured, unstructured, or semi-structured.

Structured data is information that fits neatly into tables. Sales data showing what products a business sold, when, where, and at what price is an example of internal, structured data. Alternatively, a business might choose to analyze historical market data and economic indicators to predict future movements in the markets it operates in (structured, external data).
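
To make the idea of structured data concrete, here is a small Python sketch. The sample rows, the column names, and the `revenue_by_store` helper are all hypothetical, invented for illustration; the point is only that table-shaped data can be read and aggregated with very little code.

```python
import csv
import io
from collections import defaultdict

# Hypothetical internal sales table: what was sold, when, where, and at what price.
SALES_CSV = """product,date,store,unit_price,quantity
widget,2023-05-01,Seoul,10.00,3
gadget,2023-05-01,Busan,25.00,1
widget,2023-05-02,Busan,10.00,2
"""


def revenue_by_store(csv_text):
    """Sum unit_price * quantity for each store in the CSV text."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["store"]] += float(row["unit_price"]) * int(row["quantity"])
    return dict(totals)


totals = revenue_by_store(SALES_CSV)
print(totals)  # {'Seoul': 30.0, 'Busan': 45.0}
```

Because every row has the same named columns, a question such as "revenue per store" reduces to a simple loop over the rows.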

Unstructured data is everything else – for example, pictures, videos, text, and social media posts. It can certainly contain valuable insights but is more difficult to analyze. AI, however, has proven particularly useful for extracting meaning from unstructured data. Image recognition algorithms, for example, might tell a business useful facts about customer behavior by analyzing in-store CCTV images (internal, unstructured data). A business might also find valuable insights by analyzing images related to it that are posted on social media (unstructured, external data).
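
The article names semi-structured data without giving an example. One common case is JSON, where records carry labelled fields but need not share a fixed set of columns, and may embed free text. The sketch below, using hypothetical post records and field names, flattens such records into table-like rows. It only counts words and tags, so it is a stand-in for the AI-based analysis described above, not an example of it.

```python
import json

# Hypothetical social media posts: labelled fields, but not every post has every field.
POSTS_JSON = """
[
  {"user": "a", "text": "Love the new store layout!", "tags": ["store"]},
  {"user": "b", "text": "Checkout queue was too slow", "likes": 4}
]
"""


def to_rows(posts_json):
    """Flatten semi-structured posts into rows that all share the same columns."""
    rows = []
    for post in json.loads(posts_json):
        rows.append({
            "user": post["user"],
            "likes": post.get("likes", 0),            # missing field -> default
            "tag_count": len(post.get("tags", [])),   # missing list -> empty
            "word_count": len(post["text"].split()),  # crude measure of the free text
        })
    return rows


rows = to_rows(POSTS_JSON)
for row in rows:
    print(row)
```

Once the optional fields are given defaults, the rows can be handled like any other structured table.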

Luckily, data is everywhere. Whatever you’re trying to do, if it requires external data, there’s likely to be a source for it online. Governments, research institutions, private companies, and non-governmental organizations all routinely make data freely available for research and even commercial purposes. So here are some of the best sources of free online data available in 2023.

Data Search Engines and Repositories

Google Dataset Search – A search engine for datasets cataloged by Google; use it to find data on just about anything you could need.

AWS Open Data Search – Another dataset search engine, this one provided by Amazon Web Services (AWS).

Microsoft Research Open Data – Free, open datasets collected by Microsoft, with a mainly scientific focus.

UCI Machine Learning Repository – A repository of more than 600 open datasets curated and maintained by the University of California, Irvine, and made available for the purpose of training machine learning algorithms.

Kaggle Datasets – Online data science platform Kaggle also offers a curated catalog of datasets covering everything from university rankings to trending Google searches, retail sales, online movie reviews, and crime statistics.

Reddit R/Datasets – A vast collection of datasets submitted by users of the online community site Reddit, covering hundreds of subjects.

Government and Inter-Governmental Organization Datasets

Data.Gov – Open data portal provided by the US government, hosting nearly a quarter of a million datasets published by government agencies.

Data.Census.Gov – If you’re specifically looking for US demographic data, this is a good place to start!

Data.EU – The European Union's open data portal contains data from EU organizations and member state governments.

Data.gov.uk – Open datasets published by UK government agencies.

World Health Organization Data – Datasets related to global health and wellbeing.

World Bank Open Data – Datasets related to economic development, international financial markets, social indicators, and environmental issues.

Image Data

Google Open Images – Millions of images classified and labeled in various ways, suitable for training many different types of computer vision algorithms.

ImageNet Open Dataset – Another dataset consisting of labeled images that’s free to use for non-commercial machine learning applications.

COCO Dataset – Common Objects in Context (COCO) is a dataset consisting of over 200,000 images selected for training object detection and captioning algorithms.

Sound Data

Mozilla Common Voice – An open dataset of voice recordings that can be used to train any AI application that involves speech.

Audioset – Another Google-curated dataset, this one focusing on sounds and containing hundreds of thousands of 10-second samples broken down into categories such as musical instruments, vehicles, and vocals.

Million Song Dataset – Audio features and metadata for one million contemporary popular music tracks.

Text Data

Wikidata – A free, structured knowledge base run by the Wikimedia Foundation and used by Wikipedia, with database downloads available in a number of different formats.

Common Crawl – An open repository of data scraped from the world wide web, famously used to train the GPT large language models powering ChatGPT and many other chatbots.

Other and Miscellaneous Datasets

Amazon Reviews – A database of around 35 million reviews for Amazon products, including product information and ratings.

Waymo Open Dataset – Alphabet’s autonomous driving subsidiary Waymo makes a huge amount of data collected via self-driving vehicles publicly accessible, including sensor data from cameras and LiDAR.

Apolloscape Dataset – More autonomous driving data, this time provided by Baidu’s open-source Apollo platform.
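
For readers who want to work with this list programmatically, the sources above could be kept in a small lookup table. This is only a sketch: the dictionary layout and the `sources_for` helper are assumptions, while the category and source names are copied from the article.

```python
# Category -> source names, as listed in the article.
DATA_SOURCES = {
    "Data Search Engines and Repositories": [
        "Google Dataset Search", "AWS Open Data Search",
        "Microsoft Research Open Data", "UCI Machine Learning Repository",
        "Kaggle Datasets", "Reddit R/Datasets",
    ],
    "Government and Inter-Governmental Organization Datasets": [
        "Data.Gov", "Data.Census.Gov", "Data.EU", "Data.gov.uk",
        "World Health Organization Data", "World Bank Open Data",
    ],
    "Image Data": ["Google Open Images", "ImageNet Open Dataset", "COCO Dataset"],
    "Sound Data": ["Mozilla Common Voice", "Audioset", "Million Song Dataset"],
    "Text Data": ["Wikidata", "Common Crawl"],
    "Other and Miscellaneous Datasets": [
        "Amazon Reviews", "Waymo Open Dataset", "Apolloscape Dataset",
    ],
}


def sources_for(keyword):
    """Return sources whose category name contains the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [
        name
        for category, names in DATA_SOURCES.items()
        if keyword in category.lower()
        for name in names
    ]


total = sum(len(names) for names in DATA_SOURCES.values())
print(total)                 # 23
print(sources_for("image"))  # ['Google Open Images', 'ImageNet Open Dataset', 'COCO Dataset']
```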


Forbes@Bernard Marr
May 17, 2023
