Development of Lip Reading Method From Video Using Deep Learning

Please use this identifier to cite or link to this item: http://ithesis-ir.su.ac.th/dspace/handle/123456789/5322

Title:	Development of Lip Reading Method From Video Using Deep Learning
Authors:	Aekapob JITTAKOTI (เอกภพ จิตตโคติ); SOPON PHUMEECHANYA (โสภณ ผู้มีจรรยา), phumeechanya_s@su.ac.th; Silpakorn University
Keywords:	Lip Reading; Convolutional Neural Network; Long Short-Term Memory
Issue Date:	28
Publisher:	Silpakorn University
Abstract:	This thesis presents a method for improving lip reading performance by analyzing keyframes with a CNN and an LSTM working together, combining image-based feature learning with sequential learning. Training on the entire raw dataset does not yield satisfactory results, so the number of frames and the choice of which frames to learn from directly affect the model's performance. Frames are selected using the Mediapipe face detection library in Python. The experiments fall into three main groups, selecting 3, 5, and 10 frames. Each group is run with full-lip and half-lip image frames; the half-lip option rests on the hypothesis that the human body is left-right symmetric, and it halves the input size so that the two options can be compared on performance. This approach to lip reading has not been tried in previous research. Lip reading can help recover speech from heavily corrupted audio-video files and can facilitate communication for hearing-impaired individuals. The experiments use the AVDigits database, an English-language database whose participants are native and non-native English speakers from 16 nationalities. The results show that the proposed models, together with the key frame selection step, significantly improve lip reading performance for both full-lip and half-lip images, with both achieving high and comparable results.
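The frame-count and half-lip options described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not state the selection rule, so evenly spaced sampling is an assumption here, and the function names `select_frames` and `half_lip`, the array sizes, and the choice of the left half are all hypothetical. The lip crops are assumed to have been extracted already (for example with a face detector), and the CNN and LSTM stages are omitted.

```python
import numpy as np


def select_frames(video: np.ndarray, k: int) -> np.ndarray:
    """Return k frames from a (frames, height, width) clip.

    Assumption: frames are taken at evenly spaced positions. The
    thesis's actual keyframe criterion is not given in the abstract.
    """
    n = video.shape[0]
    idx = np.linspace(0, n - 1, num=k).round().astype(int)
    return video[idx]


def half_lip(frames: np.ndarray) -> np.ndarray:
    """Keep only the left half of each lip crop, halving the input width."""
    width = frames.shape[2]
    return frames[:, :, : width // 2]


# Hypothetical clip: 30 lip crops of 32x64 pixels (sizes are made up).
video = np.zeros((30, 32, 64), dtype=np.float32)

for k in (3, 5, 10):  # the three experiment groups named in the abstract
    full = select_frames(video, k)
    half = half_lip(full)
    print(k, full.shape, half.shape)
```

Each resulting stack of frames would then be the sequence passed to a per-frame CNN followed by an LSTM; the half-lip variant carries half as many pixels per frame as the full-lip variant.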
URI:	http://ithesis-ir.su.ac.th/dspace/handle/123456789/5322
Appears in Collections:	Engineering and Industrial Technology

Files in This Item:

File	Description	Size	Format
640920030.pdf		11.93 MB	Adobe PDF
