IEMOCAP
The most famous and widely used dataset.
EmoDB
RAVDESS
Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)
What I use is the audio-only speech portion of RAVDESS (16-bit, 48 kHz .wav). The full dataset, including speech, song, and video (24.8 GB), is available on Zenodo, and more details can be found in the PLoS ONE paper.
This portion consists of 1440 files in total. 24 actors (12 male, 12 female) each recorded 60 trials of two lexically-matched statements.
That is, 60 trials per actor x 24 actors = 1440 files.
The statements are spoken in a neutral North American accent.
Emotions: calm, happy, sad, angry, fearful, surprise, and disgust expressions.
Emotional intensity: each emotion is produced at two levels (normal, strong). In addition, there is a neutral expression.
There are two statements: 01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door"
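The figure of 60 trials per actor can be checked by enumerating the combinations. This is only a small sketch: it assumes two repetitions per statement (as the Repetition field of the filename suggests) and that neutral exists only at normal intensity. The constant and function names are my own, not part of the dataset.

```python
from itertools import product

# Values taken from the description in this post.
EMOTIONS = ["calm", "happy", "sad", "angry", "fearful", "surprise", "disgust"]
INTENSITIES = ["normal", "strong"]
STATEMENTS = ["Kids are talking by the door", "Dogs are sitting by the door"]
REPETITIONS = [1, 2]  # assumption: two repetitions per statement
N_ACTORS = 24


def trials_per_actor():
    """List every (emotion, intensity, statement, repetition) combination."""
    # Seven emotions, each at two intensity levels.
    emotional = list(product(EMOTIONS, INTENSITIES, STATEMENTS, REPETITIONS))
    # Neutral has no strong intensity, so it only appears at "normal".
    neutral = list(product(["neutral"], ["normal"], STATEMENTS, REPETITIONS))
    return emotional + neutral


trials = trials_per_actor()
print(len(trials))             # 60
print(len(trials) * N_ACTORS)  # 1440
```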
File naming convention:
Each filename consists of seven numeric identifiers (e.g., 03-01-06-01-02-01-12.wav), in this order:
- Modality (01 = full-AV, 02 = video-only, 03 = audio-only). --> Only 03 is considered here.
- Vocal channel (01 = speech, 02 = song). --> Only 01 is considered here.
- Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
- Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
- Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
- Repetition (01 = 1st repetition, 02 = 2nd repetition).
- Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).
example: 03-01-06-01-02-01-12.wav
- Audio-only (03)
- Speech (01)
- Fearful (06)
- Normal intensity (01)
- Statement "dogs" (02)
- 1st Repetition (01)
- 12th Actor (12). Female, as the actor ID number is even.
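The naming rule above maps directly onto a small parser. The following is a minimal sketch, not an official RAVDESS tool: the function names `parse_ravdess_filename` and `is_audio_only_speech` are hypothetical, and the lookup tables simply copy the codes listed in this post.

```python
from pathlib import Path

# Code tables copied from the naming convention described above.
MODALITY = {"01": "full-AV", "02": "video-only", "03": "audio-only"}
VOCAL_CHANNEL = {"01": "speech", "02": "song"}
EMOTION = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}
INTENSITY = {"01": "normal", "02": "strong"}
STATEMENT = {
    "01": "Kids are talking by the door",
    "02": "Dogs are sitting by the door",
}


def parse_ravdess_filename(path):
    """Split a RAVDESS filename into its seven labelled fields."""
    parts = Path(path).stem.split("-")
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}: {path}")
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    actor_id = int(actor)
    return {
        "modality": MODALITY[modality],
        "vocal_channel": VOCAL_CHANNEL[channel],
        "emotion": EMOTION[emotion],
        "intensity": INTENSITY[intensity],
        "statement": STATEMENT[statement],
        "repetition": int(repetition),
        "actor": actor_id,
        # Odd-numbered actors are male, even-numbered actors are female.
        "gender": "male" if actor_id % 2 == 1 else "female",
    }


def is_audio_only_speech(path):
    """Keep only the subset used here: modality 03 and vocal channel 01."""
    info = parse_ravdess_filename(path)
    return info["modality"] == "audio-only" and info["vocal_channel"] == "speech"


info = parse_ravdess_filename("03-01-06-01-02-01-12.wav")
print(info["emotion"], info["intensity"], info["gender"])  # fearful normal female
```

Returning a plain dictionary keeps the result easy to load into a table later; an unknown code raises a `KeyError` rather than being silently mislabelled.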
How to cite the RAVDESS
Academic citation
If you use the RAVDESS in an academic publication, please use the following citation: Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391. https://doi.org/10.1371/journal.pone.0196391.
All other attributions
If you use the RAVDESS in a form other than an academic publication, such as in a blog post, school project, or non-commercial product, please use the following attribution: "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)" by Livingstone & Russo is licensed under CC BY-NC-SA 4.0.
MSP-Improv
An emotional audiovisual database of spontaneous improvisations
Audiovisual emotional data
12 actors (6 male, 6 female)
20 target sentences of varying length
Emotions: happiness, sadness, anger, and neutral state
Data size: 8,438 speaking turns, of which only 652 correspond to the target sentences.
(We collected 8,438 speaking turns, out of which 652 turns correspond to the target sentences.)
There are 6 sessions; each session is a dyadic interaction between two speakers.
Each session consists of 20 target sentences.
The folder notation is the sentence number followed by the intended emotion, so the S01A folder contains recordings for target sentence 1 with the intended emotion Angry.
The intended emotions are Angry, Happy, Sad, and Neutral (A, H, S, N).
Inside each target sentence folder there are 4 folders corresponding to the recording scenario:
P: recordings during preparations (natural spontaneous interactions).
R: recordings of the target sentence read.
S: recordings of improvised scene turns.
T: recordings of the target sentence in the improvised scene.
File naming notation, e.g. MSP-IMPROV-S01A-M01-P-FM01:
M01: male speaker 01 in the dyadic interaction.
FM01: female listener, male speaker, turn 01.
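The notation above can be split into fields with a regular expression. This is a rough sketch based only on the readme description quoted here: the function name `parse_msp_improv_name` and the field names are my own, and the two-letter code before the turn number is kept as a raw string (`pair`) because I have only the one-line explanation of it.

```python
import re

# Codes as described in the readme notes above.
INTENDED_EMOTION = {"A": "Angry", "H": "Happy", "S": "Sad", "N": "Neutral"}
GENDER = {"M": "male", "F": "female"}

# MSP-IMPROV-S<sentence><emotion>-<speaker>-<scenario>-<pair><turn>[.wav]
PATTERN = re.compile(
    r"^MSP-IMPROV-S(?P<sentence>\d{2})(?P<emotion>[AHSN])"
    r"-(?P<speaker_gender>[MF])(?P<speaker_number>\d{2})"
    r"-(?P<scenario>[PRST])"
    r"-(?P<pair>[MF]{2})(?P<turn>\d{2})(?:\.wav)?$"
)


def parse_msp_improv_name(name):
    """Split an MSP-IMPROV file name into its labelled fields."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"not an MSP-IMPROV file name: {name}")
    return {
        "sentence": int(match["sentence"]),
        "emotion": INTENDED_EMOTION[match["emotion"]],
        "speaker_gender": GENDER[match["speaker_gender"]],
        "speaker_number": int(match["speaker_number"]),
        "scenario": match["scenario"],  # one of P, R, S, T
        "pair": match["pair"],          # raw code such as "FM" or "MF"
        "turn": int(match["turn"]),
    }


info = parse_msp_improv_name("MSP-IMPROV-S01A-M01-P-FM01")
print(info["sentence"], info["emotion"], info["scenario"], info["turn"])
# 1 Angry P 1
```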
Session 1 ~ 6
S01A S02H S03N S04S S06A S07H S08N S09S S11A S12H S13N S14S S16A S17H S18N S19S
S01H S02N S03S S05A S06H S07N S08S S10A S11H S12N S13S S15A S16H S17N S18S S20A
S01N S02S S04A S05H S06N S07S S09A S10H S11N S12S S14A S15H S16N S17S S19A S20H
S01S S03A S04H S05N S06S S08A S09H S10N S11S S13A S14H S15N S16S S18A S19H S20N
S02A S03H S04N S05S S07A S08H S09N S10S S12A S13H S14N S15S S17A S18H S19N S20S
P R S T
Below this level, the number of files differs from folder to folder.
MSP-IMPROV-S01A-F02-P-FM01.wav MSP-IMPROV-S01A-F02-P-MF03.wav MSP-IMPROV-S01A-M02-P-MF01.wav
MSP-IMPROV-S01A-F02-P-MF01.wav MSP-IMPROV-S01A-M02-P-FM01.wav
MSP-IMPROV-S01A-F02-P-MF02.wav MSP-IMPROV-S01A-M02-P-FM02.wav
and so on.
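The folder listing above is every combination of the 20 target sentences with the 4 intended emotions, each holding the P, R, S, and T subfolders. A short sketch that regenerates those names, useful for checking a local copy, is below. The function names are hypothetical, and I have not confirmed that every session contains all 80 folders.

```python
from itertools import product

EMOTION_CODES = "AHNS"  # Angry, Happy, Neutral, Sad
SCENARIOS = "PRST"      # recording scenario subfolders


def target_folders():
    """Names such as S01A: sentence number followed by intended emotion."""
    return [f"S{n:02d}{e}" for n, e in product(range(1, 21), EMOTION_CODES)]


def scenario_dirs():
    """Relative paths such as S01A/P, one per target folder and scenario."""
    return [f"{folder}/{s}" for folder, s in product(target_folders(), SCENARIOS)]


folders = target_folders()
print(len(folders))          # 80
print(folders[:5])           # ['S01A', 'S01H', 'S01N', 'S01S', 'S02A']
print(len(scenario_dirs()))  # 320
```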
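The folder and file naming description above comes from the readme.rm inside the dataset folder.
The "turn" here seems to refer to a turn in the interaction, but the listener is not really information we need right now.
Also, I am not sure what the other files, such as Evaluate.txt, are for.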