MUSICAL INSTRUMENT SOUND CLASSIFICATION BASED ON SPECTRAL FEATURES AND CONVOLUTIONAL NEURAL NETWORKS

Phạm Thị Thu Trang, Dương Tấn Nghĩa

doi:10.59266/houjs.2025.729

Keywords:

Artificial intelligence, Deep learning, Audio classification, Feature extraction, Convolutional neural networks

Abstract

In recent years, audio classification has attracted considerable attention owing to its broad potential for engineering applications, and audio feature extraction plays a key role in the classification process. This paper presents a musical instrument sound classification system that combines common audio feature extraction methods with deep learning models. Audio signals are converted into time-frequency representations: the Short-Time Fourier Transform (STFT), the Mel-spectrogram, Mel-frequency cepstral coefficients (MFCC), and MFCC with delta coefficients. These features are fed into three CNN models: SimpleCNN, MobileNetV2, and VGG16. Experiments on the Solo MUSIC dataset show that the Mel-spectrogram and MFCC+delta features yield the highest accuracy and F1-score. MobileNetV2 meets the requirement for lightweight computation while maintaining its effectiveness.
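The four representations named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' pipeline: the frame length, hop size, number of mel bands, number of coefficients, and all function names below are assumptions chosen for illustration, and the delta is approximated with a simple finite difference rather than any particular library's smoothing filter.

```python
import numpy as np

def stft_power(y, n_fft=512, hop=256):
    """Power spectrogram of a 1-D signal, shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return (np.abs(np.fft.rfft(frames, axis=1)) ** 2).T

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters spaced evenly on the mel scale, shape (n_mels, n_fft // 2 + 1)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    fb = np.zeros((n_mels, len(freqs)))
    for m in range(n_mels):
        lo, mid, hi = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def dct2(x, n_out):
    """Unnormalized DCT-II along axis 0, keeping the first n_out coefficients."""
    n = x.shape[0]
    k = np.arange(n_out)[:, None]
    i = np.arange(n)[None, :]
    return np.cos(np.pi * k * (2 * i + 1) / (2 * n)) @ x

def extract_features(y, sr, n_fft=512, hop=256, n_mels=40, n_mfcc=13):
    """Return the four feature maps as 2-D arrays (feature bins x frames)."""
    power = stft_power(y, n_fft, hop)
    log_mel = 10.0 * np.log10(mel_filterbank(sr, n_fft, n_mels) @ power + 1e-10)
    mfcc = dct2(log_mel, n_mfcc)
    delta = np.gradient(mfcc, axis=1)  # assumption: finite-difference delta
    return {
        "stft": power,
        "mel": log_mel,
        "mfcc": mfcc,
        "mfcc_delta": np.vstack([mfcc, delta]),
    }

# Usage on a synthetic 440 Hz tone (a stand-in for a real instrument recording).
sr = 16000
t = np.arange(sr) / sr
features = extract_features(np.sin(2 * np.pi * 440.0 * t), sr)
for name, arr in features.items():
    print(name, arr.shape)
```

Each output is a two-dimensional array that can be treated as a single-channel image, which is what makes image-style CNNs applicable to audio. Pretrained backbones such as MobileNetV2 or VGG16 typically expect three-channel inputs of a fixed size, so some resizing and channel replication would be needed before feeding these maps to them; how the paper handles that step is not stated in the abstract.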
