Posts in the 'All categories' category

[Paper Review] Speaker-Distinguishable CTC: Learning Speaker Distinction Using CTC for Multi-Talker Speech Recognition

The paper sees the bottleneck of SOT (Serialized Output Training) as speaker distinction rather than token alignment. In other words, it argues that errors from attributing speech to the wrong speaker outweigh the ASR errors themselves. The work attaches SD-CTC (Speaker-Distinguishable CTC) so that the encoder learns, at frame level and without auxiliary information, "which token was produced, and by which speaker". As a result, on the LibriSpeechMix 2-speaker evaluation, cpWER drops from 4.7 to 3.5, a 26% relative error reduction, which is nearly on par with SA-SOT (3.4), a method that requires token-level timestamps. 1) Bibliograp..

Paper review/SA-ASR
· 2026. 4. 2.

[Paper Review] Modeling Overlapped Speech with Shuffles

Presents guidance for using CTC in the multi-talker task. Each token is emitted only once → it is pinned to exactly one frame. Paths are kept flexible where speech overlaps. 1) Bibliographic Info. Title: Modeling Overlapped Speech with Shuffles. Authors: Matthew Wiesner, Samuele Cornell, Alexander Polok, Lucas Ondel Yang, Lukáš Burget, Sanjeev Khudanpur. Year: 2026. Venue / Journal: arXiv preprint. 2) Problem Statement. In overlapped speech, speaker-attributed transcription and forced alignmen..

Paper review/SA-ASR
· 2026. 4. 1.

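Studying CTC (Connectionist Temporal Classification)

It took me quite a while to understand CTC. What kind of model is CTC? How is beam search possible, given that CTC emits an independent output for each frame? → They are multiplied together. CTC (Connectionist Temporal Classification): a component that derives one complete output from the trajectory of time-series information. The frame-level paths "_ A _ B C _", "A A _ B _ C", and "_ A B _ C C" all → ABC. CTC code is implemented in torch / k2 / icefall. What you need to know here: HMM, graph, token, and the actual model output. wav → encoder → linear → output. The output has as many dimensions as the total number of tokens and is mapped to a single actual output. Here, the final linear layer ..

AI/Audio processing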
· 2026. 4. 1.
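
The collapse rule quoted in the CTC post above (several frame-level paths mapping to ABC) and the "multiply" answer to the beam-search question can be sketched in a few lines. This is only an illustrative sketch, not code from torch, k2, or icefall: the function names `ctc_collapse` and `path_prob`, the `_` blank symbol, and the toy probabilities are assumptions made for the example.

```python
import math
from itertools import groupby

BLANK = "_"  # assumed blank symbol, matching the notation in the excerpt


def ctc_collapse(path, blank=BLANK):
    """Collapse a frame-level path: merge adjacent repeats, then drop blanks."""
    merged = [tok for tok, _ in groupby(path)]
    return [tok for tok in merged if tok != blank]


def path_prob(path, frame_probs):
    """Probability of one path as the product of per-frame probabilities.

    frame_probs[t] is a dict {token: probability} for frame t (toy values here).
    """
    return math.prod(frame_probs[t][tok] for t, tok in enumerate(path))


# The three paths from the excerpt all collapse to the same label sequence.
for p in ["_ A _ B C _", "A A _ B _ C", "_ A B _ C C"]:
    print(p, "->", "".join(ctc_collapse(p.split())))

# Hypothetical two-frame posterior, used only to show the multiplication.
toy_probs = [{"_": 0.6, "A": 0.4}, {"_": 0.3, "A": 0.7}]
print(path_prob(["A", "A"], toy_probs))
```

Because each frame's output is treated as independent given the input, a path's score is just this product, and the score of a label sequence is the sum over all paths that collapse to it; that is what lets a beam search compare partial hypotheses.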

[Paper Review] SOMSRED-SVC: Sequential Output Modeling with Speaker Vector Constraints for Joint Multi-Talker Overlapped ASR and Speaker Diarization

1) Bibliographic Info. Title: SOMSRED-SVC: Sequential Output Modeling with Speaker Vector Constraints for Joint Multi-Talker Overlapped ASR and Speaker Diarization. Authors: Naoki Makishima, Naotaka Kawata, Taiga Yamane, Mana Ihori, Tomohiro Tanaka, Satoshi Suzuki, Shota Orihashi, Ryo Masumura. Venue / Year: Interspeech 2025. 2) Problem Statement. In previous models, on fully overlapped speech, the serialized output seque..

Paper review/SA-ASR
· 2026. 3. 25.

[Paper Review] SOMSRED: Sequential Output Modeling for Joint Multi-talker Overlapped Speech Recognition and Speaker Diarization

Extends SOT (Serialized Output Training)-style multi-talker ASR to also perform SD (speaker diarization). Within a single autoregressive output sequence, it jointly predicts transcription + timestamp + speaker tokens, and reuses the hidden features obtained along the way as speaker embeddings to perform joint ASR + speaker diarization. Bibliographic Info. Title: SOMSRED: Sequential Output Modeling for Joint Multi-talker Overlapped Speech Recognition and Speaker Diariza..

Paper review/SA-ASR
· 2026. 3. 25.

[Paper Review] BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR

Bibliographic Info. Authors: Yuhao Liang, Fan Yu, Yangze Li, Pengcheng Guo, Shiliang Zhang, Qian Chen, Lei Xie. Year: 2023. Title: BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR. Venue: INTERSPEECH 2023. Problem Statement: SOT (Serialized Output Training)-based multi-talker ASR. Problem 1: prediction of the speaker change token () often fails. SOT inserts a speaker change token between utterances to form a single sequence, but this to..

Paper review/SA-ASR
· 2026. 3. 25.

[Paper Review] A Sidecar Separator Can Convert a Single-Talker Speech Recognition System to a Multi-Talker One

1) Bibliographic Info. Authors: Lingwei Meng, Jiawen Kang, Mingyu Cui, Yuejiao Wang, Xixin Wu, Helen Meng. Year: 2023. Title: A Sidecar Separator Can Convert a Single-Talker Speech Recognition System to a Multi-Talker One. Venue/Journal: IEEE ICASSP 2023 conference paper. 2) Problem Statement. A typical single-talker ASR system works well on non-overlapping speech, but its performance drops sharply on overlapped multi-talker speech. Training a multi-talker ASR system from scratch..

Paper review/SA-ASR
· 2026. 3. 20.

[Paper Review] Anatomy of the Modality Gap: Dissecting the Internal States of End-to-End Speech LLMs

1. Bibliographic Information. Title: Anatomy of the Modality Gap: Dissecting the Internal States of End-to-End Speech LLMs. Authors: Ming-Hao Hsu, Xueyao Zhang, Xiaohai Tian, Jun Zhang, Zhizheng Wu. Venue: arXiv preprint. Year: 2026. arXiv: 2603.01502v1 (2026-03-03). DOI: 10.48550/arXiv.2603.01502. 2. Problem Statement. Even for inputs with the same meaning, End-to-end Speech LLMs (LSLMs) show, between Text → Text (T2T) and Speech → Text (S2T), a performance gap (modal..

Paper review/Audio Language Model
· 2026. 3. 16.

[Paper Review] Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models

Why speech LLMs underperform → because of the modality gap between speech and text. Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning. We present modality gap, an intriguing geometric phenomenon of the representation space of multi-modal models. Specifically, we show that different data modalities (e.g. images and text) are embedded at arm's length in their shared representat..

Paper review/Audio Language Model
· 2026. 3. 16.

[Paper Review] Towards a Definition of Disentangled Representations

Authors: Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, Alexander Lerchner. Year: 2018 (December). Title: Towards a Definition of Disentangled Representations. Venue/Affiliation: Google DeepMind (DeepMind). Disentangled Representation Learning: there have been various attempts to modify datasets, model architectures, and so on in order to improve robustness and generalisability. Instead of fixing the architecture, learning a representation that reflects the structure of the data well..

Paper review/Disentanglement
· 2025. 12. 15.

[Paper Review] Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion

1. Bibliographic Info. Authors: Seymanur Akti, Tuan Nam Nguyen, Alexander Waibel. Year: 2025. Title: Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion. Venue: Interspeech 2025 (Rotterdam, Netherlands). Affiliation: Karlsruhe Institute of Technology, Carnegie Mellon University. 2. Problem Statement. EVC: expressive voice conversion. Going one step beyond conventional voice conversion (VC) technology, 'emotion', 'intonation', 'prosody (..

Paper review/Disentanglement
· 2025. 12. 15.

Spectral Theorem : ML

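In general, eigenvectors do not have to be orthogonal to one another. Under a skewing transformation (one with a shear component), the axes that keep their direction (the eigenvectors) may sit only 30° or 45° apart rather than 90°. Eigenvectors belonging to distinct eigenvalues are linearly independent, but they need not be orthogonal. For a real symmetric matrix, however, eigenvectors of distinct eigenvalues are always orthogonal, and an orthonormal eigenbasis can always be chosen; this is the Spectral Theorem. Spectral Theorem: [Linear Algebra] Lecture 25, Symmetric Matrix and the Spectral Theorem. In this lecture we will talk about symmetric matrices. We did cover them briefly in the previous lecture, bu..

AI/ML basic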
· 2025. 12. 11.
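
The contrast drawn in the Spectral Theorem post above (eigenvectors of a general matrix need not be orthogonal, those of a symmetric matrix are) can be checked numerically. The sketch below is a minimal pure-Python illustration for 2x2 matrices; the helper names `eig_2x2` and `axis_angle_deg` and the two example matrices are assumptions chosen for the demonstration, not taken from the post or the linked lecture.

```python
import math


def eig_2x2(m):
    """Eigenpairs of a real 2x2 matrix [[a, b], [c, d]].

    Assumes real eigenvalues and b != 0, which holds for both examples below.
    """
    (a, b), (c, d) = m
    tr, det = a + d, a * d - b * c
    disc = math.sqrt(tr * tr - 4 * det)
    lams = ((tr + disc) / 2, (tr - disc) / 2)
    # (A - lam*I) v = 0 is solved by v = (b, lam - a) when b != 0.
    return [(lam, (b, lam - a)) for lam in lams]


def axis_angle_deg(u, v):
    """Acute angle in degrees between the lines spanned by u and v."""
    dot = abs(u[0] * v[0] + u[1] * v[1])
    cos = min(1.0, dot / (math.hypot(*u) * math.hypot(*v)))
    return math.degrees(math.acos(cos))


skewed = [[2.0, 1.0], [0.0, 1.0]]     # non-symmetric: scaling plus shear
symmetric = [[2.0, 1.0], [1.0, 2.0]]  # real symmetric

for name, m in [("skewed", skewed), ("symmetric", symmetric)]:
    (_, v1), (_, v2) = eig_2x2(m)
    print(name, round(axis_angle_deg(v1, v2), 1))
```

For these two matrices the eigen-axes come out about 45° apart in the skewed case and 90° apart in the symmetric case, which matches the statement of the theorem for distinct eigenvalues.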