
Comparative Analysis of Emotion Recognition Datasets

by import ysy 2022. 8. 15.
IEMOCAP: 2 speakers / 5-fold
RAVDESS: 4 speakers / 6-fold
EMO-DB: 2 speakers / 5-fold
MSP-IMPROV: 2 speakers / 6-fold
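
The post does not spell out what these figures mean. One plausible reading (an assumption) is "speakers per fold / number of folds" in a speaker-independent cross-validation. A minimal sketch of such a split for RAVDESS, using the 24 actor IDs described later in this post; the helper name `speaker_folds` is hypothetical:

```python
def speaker_folds(speakers, speakers_per_fold):
    """Split speaker IDs into consecutive, non-overlapping groups (one group per fold)."""
    if len(speakers) % speakers_per_fold != 0:
        raise ValueError("number of speakers must be divisible by speakers_per_fold")
    return [
        speakers[i:i + speakers_per_fold]
        for i in range(0, len(speakers), speakers_per_fold)
    ]


# RAVDESS actor IDs 1..24, grouped 4 per fold (assumed reading of "4 speakers / 6-fold").
ravdess_actors = list(range(1, 25))
folds = speaker_folds(ravdess_actors, 4)

for held_out in folds:
    # Every speaker in the held-out group is excluded from training.
    train = [s for s in ravdess_actors if s not in held_out]
    print("test speakers:", held_out, "| train speakers:", len(train))
```

Grouping by consecutive ID is only one choice; any grouping works as long as no speaker appears in both the training and the test side of a fold.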


The best-known and most widely used datasets







Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)

I use only the audio-only speech portion of RAVDESS (16-bit, 48 kHz .wav). The full dataset, including speech, song, and video (24.8 GB), is available on Zenodo, and more detailed information can be found in the paper in PLoS ONE.


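This portion of the data consists of 1440 files in total. 24 actors (12 male, 12 female) each recorded 60 trials of two lexically-matched statements (the two sentences share the same structure and differ in only two words).

That is, 60 trials per actor x 24 actors = 1440 files.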

The statements are spoken in a North American accent.

Emotions: calm, happy, sad, angry, fearful, surprise, and disgust expressions.

Emotional intensity: each emotion is produced at two levels (normal, strong). In addition, there is a neutral expression.

There are two statements: 01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door"


File naming convention:

Each filename is a sequence of seven two-digit numeric fields (e.g., 03-01-06-01-02-01-12.wav).

  • Modality (01 = full-AV, 02 = video-only, 03 = audio-only). --> Only 03 is considered here.
  • Vocal channel (01 = speech, 02 = song). --> Only 01 is considered here.
  • Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
  • Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
  • Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
  • Repetition (01 = 1st repetition, 02 = 2nd repetition).
  • Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).

example: 03-01-06-01-02-01-12.wav
  1. Audio-only (03)
  2. Speech (01)
  3. Fearful (06)
  4. Normal intensity (01)
  5. Statement "dogs" (02)
  6. 1st Repetition (01)
  7. 12th Actor (12)
    Female, as the actor ID number is even.
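
The naming rule above can be decoded mechanically. The following is a minimal sketch, not an official RAVDESS tool; the function name `parse_ravdess_filename` and the dictionary layout are hypothetical, while the code tables are copied from the list above.

```python
from pathlib import Path

# Code tables taken from the naming convention listed above.
MODALITY = {"01": "full-AV", "02": "video-only", "03": "audio-only"}
VOCAL_CHANNEL = {"01": "speech", "02": "song"}
EMOTION = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}
INTENSITY = {"01": "normal", "02": "strong"}
STATEMENT = {
    "01": "Kids are talking by the door",
    "02": "Dogs are sitting by the door",
}


def parse_ravdess_filename(filename):
    """Decode the seven hyphen-separated fields of a RAVDESS filename."""
    fields = Path(filename).stem.split("-")
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {filename}")
    modality, channel, emotion, intensity, statement, repetition, actor = fields
    actor_id = int(actor)
    return {
        "modality": MODALITY[modality],
        "vocal_channel": VOCAL_CHANNEL[channel],
        "emotion": EMOTION[emotion],
        "intensity": INTENSITY[intensity],
        "statement": STATEMENT[statement],
        "repetition": int(repetition),
        "actor": actor_id,
        # Odd actor IDs are male, even actor IDs are female.
        "gender": "male" if actor_id % 2 == 1 else "female",
    }


info = parse_ravdess_filename("03-01-06-01-02-01-12.wav")
print(info["emotion"], info["intensity"], info["gender"])  # fearful normal female
```

For emotion classification, `info["emotion"]` would serve as the label and `info["actor"]` as the key for speaker-independent splits.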

How to cite the RAVDESS

Academic citation

If you use the RAVDESS in an academic publication, please use the following citation: Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391. https://doi.org/10.1371/journal.pone.0196391.

All other attributions

If you use the RAVDESS in a form other than an academic publication, such as in a blog post, school project, or non-commercial product, please use the following attribution: "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)" by Livingstone & Russo is licensed under CC BY-NC-SA 4.0.

Each folder contains 60 wav files.
I am not sure why some files are included more than once...
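
The count of 60 files per actor folder is consistent with the naming rule: 7 emotions x 2 intensities x 2 statements x 2 repetitions = 56, plus neutral (normal intensity only) x 2 statements x 2 repetitions = 4. A small sketch that enumerates the expected audio-only speech filenames for one actor; the helper name `actor_filenames` is hypothetical:

```python
from itertools import product

EMOTIONS = ["01", "02", "03", "04", "05", "06", "07", "08"]
STATEMENTS = ["01", "02"]
REPETITIONS = ["01", "02"]


def actor_filenames(actor):
    """List the audio-only (03), speech (01) filenames expected for one actor."""
    names = []
    for emotion in EMOTIONS:
        # Neutral (01) exists only at normal intensity.
        intensities = ["01"] if emotion == "01" else ["01", "02"]
        for intensity, statement, repetition in product(
            intensities, STATEMENTS, REPETITIONS
        ):
            names.append(
                f"03-01-{emotion}-{intensity}-{statement}-{repetition}-{actor:02d}.wav"
            )
    return names


names = actor_filenames(12)
print(len(names))       # 60
print(len(names) * 24)  # 1440
```

Comparing this list with the actual contents of an actor folder is a quick way to spot missing or duplicated files.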



MSP-IMPROV: An emotional audiovisual database of spontaneous improvisations


Audiovisual emotion data

12 actors (6 male, 6 female)

20 target sentences of varying length

Emotions: happiness, sadness, anger, and neutral state

Data size: 8,438 speaking turns, of which only 652 correspond to the target sentences.

(We collected 8,438 speaking turns, out of which 652 turns correspond to the target sentences.)

There are 6 sessions; each session is a dyadic interaction between two speakers.
Each session consists of 20 target sentences.
The folder notation is the sentence number followed by the intended emotion, so the S01A folder contains recordings for target sentence 1 with the intended Angry emotion.
Intended emotions are (Angry, Happy, Sad, Neutral).

Inside the target sentence folder there are 4 folders corresponding to the Recording scenario

P: recordings during preparations (natural spontaneous interactions).
R: recordings of the target sentence read.
S: recordings of improvised scene turns.
T: recordings of the target sentence in the improvised scene.

File naming notation, e.g. MSP-IMPROV-S01A-M01-P-FM01

M01: male speaker 01 in the dyadic interaction.

FM01: female listener, male speaker, turn 01
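
The notation above can be split into its parts with a regular expression. This is a minimal sketch based only on the description in this post; the function name `parse_msp_improv_name` and the field names are hypothetical, and the meaning of the two letters in the last field follows the reading given above rather than a verified specification.

```python
import re

INTENDED_EMOTION = {"A": "Angry", "H": "Happy", "S": "Sad", "N": "Neutral"}

MSP_IMPROV_PATTERN = re.compile(
    r"^MSP-IMPROV-"
    r"S(?P<sentence>\d{2})(?P<emotion>[AHSN])-"   # e.g. S01A: sentence 01, Angry
    r"(?P<speaker>[MF]\d{2})-"                    # e.g. M01
    r"(?P<scenario>[PRST])-"                      # P, R, S or T
    r"(?P<listener>[MF])(?P<turn_speaker>[MF])(?P<turn>\d{2})"  # e.g. FM01
    r"(?:\.wav)?$"
)


def parse_msp_improv_name(name):
    """Split an MSP-IMPROV file name into the fields described in this post."""
    match = MSP_IMPROV_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected MSP-IMPROV file name: {name}")
    return {
        "sentence": int(match["sentence"]),
        "intended_emotion": INTENDED_EMOTION[match["emotion"]],
        "speaker": match["speaker"],
        "scenario": match["scenario"],
        # Letter meanings as read in this post (assumption): listener, then speaker.
        "listener_gender": match["listener"],
        "turn_speaker_gender": match["turn_speaker"],
        "turn": int(match["turn"]),
    }


parsed = parse_msp_improv_name("MSP-IMPROV-S01A-M01-P-FM01")
print(parsed["sentence"], parsed["intended_emotion"], parsed["scenario"])  # 1 Angry P
```

If only the intended emotion and the recording scenario matter, the remaining fields can simply be ignored.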

Session 1 ~ 6

S01A  S02H  S03N  S04S  S06A  S07H  S08N  S09S  S11A  S12H  S13N  S14S  S16A  S17H  S18N  S19S
S01H  S02N  S03S  S05A  S06H  S07N  S08S  S10A  S11H  S12N  S13S  S15A  S16H  S17N  S18S  S20A
S01N  S02S  S04A  S05H  S06N  S07S  S09A  S10H  S11N  S12S  S14A  S15H  S16N  S17S  S19A  S20H
S01S  S03A  S04H  S05N  S06S  S08A  S09H  S10N  S11S  S13A  S14H  S15N  S16S  S18A  S19H  S20N
S02A  S03H  S04N  S05S  S07A  S08H  S09N  S10S  S12A  S13H  S14N  S15S  S17A  S18H  S19N  S20S

P  R  S  T

Below this level, the number of files differs from folder to folder.

MSP-IMPROV-S01A-F02-P-FM01.wav  MSP-IMPROV-S01A-F02-P-MF03.wav  MSP-IMPROV-S01A-M02-P-MF01.wav
MSP-IMPROV-S01A-F02-P-MF01.wav  MSP-IMPROV-S01A-M02-P-FM01.wav
MSP-IMPROV-S01A-F02-P-MF02.wav  MSP-IMPROV-S01A-M02-P-FM02.wav



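Source: readme.rm inside the dataset folder.

The "turn" here seems to refer to a turn in the interaction, but the listener is not really information we need right now.

Also, I do not know what the other files, such as Evaluate.txt, are for.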

