OPTIMIZING PERSONALIZED FASHION DESIGN BASED ON UNIQUE GARMENTS AND VIETNAMESE TEXT DESCRIPTIONS

Nguyễn Thu Phượng, Cao Thanh Tùng

doi:10.59266/houjs.2025.914


Keywords:

vision-language models, Vietnamese text descriptions, optimizing personalized fashion design, unique garments

Abstract

This study introduces a new pipeline for generating realistic human fashion images from Vietnamese text descriptions by integrating machine translation (MarianMT), natural language processing (PhoBERT), and a two-stage image generation framework inspired by Text2Human. The pipeline fine-tunes the Stable Diffusion model with LoRA (Low-Rank Adaptation) and analyzes a conditional GAN on a custom dataset, "FASHION-HITU", comprising 83 Vietnamese fashion items with detailed attributes. Using hierarchical Vector-Quantized Variational AutoEncoder (VQVAE) encoding and a mixture of experts (MoE), the system improves computational efficiency while achieving high fidelity in reproducing complex, distinctive textures and shapes. Experiments on DeepFashion-MultiModal and the custom dataset yield a Fréchet Inception Distance (FID) of 23.90 (Parsing) and 25.87 (Pose), with attribute prediction accuracy of 95.88% for denim and 89.92% for plaid, outperforming baseline methods such as HumanGAN. Although the method is limited in handling complex poses because DensePose data are lacking, future work will focus on expanding the dataset, improving pose control with ControlNet, and developing a web-based virtual try-on application to support the Vietnamese fashion industry.
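For readers unfamiliar with the FID metric reported above, the following is a minimal sketch of how a Fréchet distance between two sets of image features can be computed. It is not the authors' evaluation code. The function names `frechet_distance` and `fid_from_features` are hypothetical, and the feature arrays are assumed to have been extracted beforehand (in standard FID, from an Inception-v3 network); here any two-dimensional arrays of shape (samples, features) will do.

```python
import numpy as np


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between the Gaussians N(mu1, sigma1) and N(mu2, sigma2).

    d^2 = ||mu1 - mu2||^2 + Tr(sigma1) + Tr(sigma2) - 2 Tr((sigma1 sigma2)^(1/2))
    """
    diff = mu1 - mu2
    # The trace of the matrix square root equals the sum of the square roots
    # of the eigenvalues of sigma1 @ sigma2, which are real and non-negative
    # for covariance matrices (up to numerical noise, hence the clip).
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    eigvals = np.clip(eigvals.real, 0.0, None)
    trace_sqrt = np.sqrt(eigvals).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)


def fid_from_features(real_feats, fake_feats):
    """Fit a Gaussian to each feature set and return their Frechet distance."""
    mu_r = real_feats.mean(axis=0)
    mu_f = fake_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_f = np.cov(fake_feats, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_f, sigma_f)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(200, 4))          # stand-in for real-image features
    fake = real + 0.5                         # stand-in for generated-image features
    print(fid_from_features(real, fake))
```

Lower values indicate that the generated-image feature distribution is closer to the real one, which is why the abstract reports FID as a quality measure alongside attribute accuracy.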

References

[1]. Chen, M., Liu, Y., Yi, J., Xu, C., Lai, Q., Wang, H., . . . Xu, Q. (2024). Evaluating text-to-image generative models: An empirical study on human image synthesis. arXiv preprint arXiv:2403.05125.

[2]. Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D., & Taigman, Y. (2022). Make-a-scene: Scene-based text-to-image generation with human priors. Paper presented at the European Conference on Computer Vision.

[3]. Good, I.J. (1956). The surprise index for the evaluation of probabilistic hypotheses. Annals of Mathematical Statistics, 27(4), 1136-1138.

[4]. Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., . . . Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing systems, 27.

[5]. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in neural information processing systems, 30.

[6]. Jiang, Y., Yang, S., Qiu, H., Wu, W., Loy, C.C., & Liu, Z. (2022). Text2Human: Text-driven controllable human image generation. ACM Transactions on Graphics (TOG), 41(4), 1-11.

[7]. Junczys-Dowmunt, M., Grundkiewicz, R., Dwojak, T., Hoang, H., Heafield, K., Neckermann, T., . . . Bogoychev, N. (2018). Marian: Fast neural machine translation in C++. arXiv preprint arXiv:1804.00344.

[8]. Liu, Z., Luo, P., Qiu, S., Wang, X., & Tang, X. (2016). DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. Paper presented at the Proceedings of the IEEE conference on computer vision and pattern recognition.

[9]. Nguyen, D.Q., & Nguyen, A.T. (2020). PhoBERT: Pre-trained language models for Vietnamese. arXiv preprint arXiv:2003.00744.

[10]. Sarkar, K., Liu, L., Golyanik, V., & Theobalt, C. (2021). HumanGAN: A generative model of human images. Paper presented at the 2021 International Conference on 3D Vision (3DV).

[11]. Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.

[12]. Van Den Oord, A., & Vinyals, O. (2017). Neural discrete representation learning. Advances in neural information processing systems, 30.

[13]. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., . . . Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

[14]. Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., & Catanzaro, B. (2018). High-resolution image synthesis and semantic manipulation with conditional GANs. Paper presented at the Proceedings of the IEEE conference on computer vision and pattern recognition.

[15]. Zhang, Y., Wu, W., Loy, C.C., & Liu, Z. (2021). Humangan: Towards realistic human image generation. 12844-12853.