Research Article | Open Access
Volume 13 | Issue 5 | Year 2026 | Article Id. IJECE-V13I5P108 | DOI : https://doi.org/10.14445/23488549/IJECE-V13I5P108
Augmenting Multi-Disease Indian Clinical Notes via Transformer-Based Symptom Fusion Techniques
Swati Varma, Anil Hingmire, Megha Trivedi, Smita Jawale, Rucha C. Samant, Neeta Deshpande
| Received | Revised | Accepted | Published |
|---|---|---|---|
| 08 Feb 2026 | 09 Mar 2026 | 09 Apr 2026 | 27 May 2026 |
Citation :
Swati Varma, Anil Hingmire, Megha Trivedi, Smita Jawale, Rucha C. Samant, Neeta Deshpande, "Augmenting Multi-Disease Indian Clinical Notes via Transformer-Based Symptom Fusion Techniques," International Journal of Electronics and Communication Engineering, vol. 13, no. 5, pp. 77-87, 2026. Crossref, https://doi.org/10.14445/23488549/IJECE-V13I5P108
Abstract
Clinical note research has succeeded worldwide, but it has placed little emphasis on real-time Indian data. This study first built models on clinical notes from the Medical Information Mart for Intensive Care (MIMIC) database, focusing on three conditions: Asthma, Myocardial Infarction (MI), and Chronic Kidney Disease (CKD). The trained models were then tested on real-time clinical data collected from two Indian hospitals. Because of privacy concerns, the volume of collected notes was limited, so two augmentation methods were devised. Method 1 used a two-step strategy: weighted symptoms and fuzzy techniques for similarity calculation, followed by synonym replacement. Method 2 augmented the data using symptoms of co-occurring diseases, and the augmentation was then extended by concatenating these notes and applying synonym replacement. Bidirectional Encoder Representations from Transformers (BERT), Distilled BERT (DistilBERT), and Symptom-Driven BERT (SMDBERT) were applied to both methods and compared against two baselines: Easy Data Augmentation (EDA) and Synthetic Minority Over-sampling Technique (SMOTE). Method 1 outperformed both SMOTE and EDA for all models, while Method 2 gave superior results with DistilBERT and SMDBERT, attaining accuracies of 0.98 and 0.96, respectively, compared to 0.92 and 0.94 with EDA.
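As a rough illustration of the two steps named for Method 1, the sketch below scores a note by fuzzily matching weighted symptoms and then produces a variant through synonym replacement. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the fuzzy technique, so a character-level similarity ratio from Python's standard `difflib` stands in for it, and the symptom weights, synonym table, threshold, and function names are all hypothetical.

```python
import random
import re
from difflib import SequenceMatcher

# Hypothetical symptom weights and synonym table, for illustration only;
# none of these values come from the paper.
SYMPTOM_WEIGHTS = {"wheezing": 0.9, "shortness of breath": 0.8, "chest pain": 0.7}
SYNONYMS = {
    "shortness of breath": ["breathlessness", "dyspnea"],
    "chest pain": ["chest discomfort"],
}


def fuzzy_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for the fuzzy step."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def weighted_symptom_score(note: str, weights: dict, threshold: float = 0.8) -> float:
    """Sum the weights of symptoms that fuzzily match some window of the note."""
    words = re.findall(r"[a-z]+", note.lower())
    score = 0.0
    for symptom, weight in weights.items():
        n = len(symptom.split())
        # Slide a window of the symptom's word length across the note.
        windows = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
        if any(fuzzy_ratio(symptom, w) >= threshold for w in windows):
            score += weight
    return score


def replace_synonyms(note: str, synonyms: dict, rng: random.Random) -> str:
    """Replace each known term with a randomly chosen synonym (case-insensitive)."""
    augmented = note
    for term, options in synonyms.items():
        pattern = re.compile(re.escape(term), re.IGNORECASE)
        augmented = pattern.sub(lambda _match: rng.choice(options), augmented)
    return augmented


if __name__ == "__main__":
    note = "Patient reports wheezng and shortness of breath."
    print(weighted_symptom_score(note, SYMPTOM_WEIGHTS))
    print(replace_synonyms(note, SYNONYMS, random.Random(0)))
```

The misspelled "wheezng" in the sample note still counts toward the score because the match is fuzzy rather than exact, which is the property that matters for noisy hand-typed clinical notes. Passing a seeded `random.Random` keeps the augmented output reproducible.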
Keywords
DistilBERT, Data Augmentation, Real-time Clinical Notes, Neuro-Fuzzy Approach, Transfer Learning.