Analysis of Tuberculosis Disease Case Growth From Medical Record Data, Viewed Through Clustering Algorithms (Case Study: Islamic Hospital Bogor)

Tuberculosis is a chronic infectious disease caused by Mycobacterium tuberculosis infection. It spreads from one person to another through airborne transmission and is most commonly found in the Asian region. Indonesia currently ranks second after India in the number of tuberculosis cases. Case findings by province in Indonesia show that West Java Province is among the largest contributors of tuberculosis cases, and the tuberculosis case rate in Bogor Regency is one of the highest in West Java. This is the foundation for the present research, conducted at Islamic Hospital Bogor, which seeks to determine the average age and the gender of patients who are more susceptible to tuberculosis. One way to understand the growth of tuberculosis cases is clustering with data mining techniques, here the k-means, fuzzy c-means, and Gaussian mixture algorithms, which are used to identify the growth of tuberculosis cases by age range and gender. The research results are expected to provide new insights for decision-makers on preventive measures, provision of healthcare facilities, and medication.
Attribution-ShareAlike 4.0 International (CC BY-SA


Introduction
Tuberculosis is a chronic infectious disease caused by infection with Mycobacterium tuberculosis. It is most commonly found in Southeast Asia (44%) and Africa (24%) (Istiharoh, Djannah, & Ajmala, 2022). Indonesia currently ranks second after India in tuberculosis (TB) burden, with 969 thousand cases and 93 thousand deaths per year, equivalent to 11 deaths per hour (WHO, 2022). In terms of TB case findings by province in Indonesia in 2017, the highest number was in West Java Province, with 31,598 cases in a population of 48,037,827, followed by East Java with 22,585 cases in a population of 39,292,972, and Central Java with a population of 34,257,865 (Yani, Pebrianti, & Purnama, 2022).
Tuberculosis is tied to the relationship between humans and their environment, especially in urban areas with the largest and densest populations, so accurate information about the urban environment of tuberculosis-affected areas is important. According to the tuberculosis data for West Java Province published by West Java Open Data, in 2020 every city and regency in West Java recorded tuberculosis cases, ranging from 320 cases in Banjar Regency, the lowest, to 10,248 cases in Bogor Regency, the highest in West Java (Fadhlan Sulistiyo Hidayat, Rizma Berliana Putri Affandi, & Virgaria Zuliana, 2022).
This is the basis for the present study, carried out at Bogor Islamic Hospital, which aims to find out the average age and the gender of patients who are more susceptible to tuberculosis. One way to determine the growth of tuberculosis cases is to cluster the data with data mining techniques. Here, three clustering algorithms (k-means, fuzzy c-means, and Gaussian mixture) are used to determine the growth of tuberculosis cases by age range and gender. The results are expected to provide new information that related parties can take into account in decision making, for example on preventive measures and the provision of health facilities and medicines.

Materials and Methods
This section presents the literature review and the methods that provide the basic knowledge for the topic under study.

Tuberculosis
Tuberculosis (TB) is a curable chronic infectious disease caused by Mycobacterium tuberculosis infection. It spreads from one person to another through the air, carried in phlegm droplets from tuberculosis patients. Infected patients release droplets containing TB bacilli when they cough, sneeze, or talk, and people who inhale these bacilli can become infected with tuberculosis (Oktaviani, Sumarni, & Supriyanto, 2023).

Data Mining
Data mining integrates data modeling and analytics. Although it is based on several disciplines, data mining differs from them in its orientation towards the end rather than the means of achieving it: it draws on all of these disciplines to extract patterns, describe trends, and predict behavior from information. Data mining is only one stage, though the most important one, in the process of knowledge discovery in databases (KDD). KDD is defined as a non-trivial process for identifying valid, new, potentially useful, and ultimately understandable patterns in often large data sets, and for extracting relevant information from available databases. The KDD methodology is an iterative and interactive process in which the subject's experience is combined with various analytical techniques, including machine learning (ML) algorithms for pattern recognition and model development (Palacios, Reyes-Suárez, Bearzotti, Leiva, & Marchant, 2021).

Clustering
Clustering is an unsupervised procedure for organizing data into groups of items that share a pattern typical of each group. Clustering procedures can be classified as hierarchical or non-hierarchical. Hierarchical methods group objects into clusters and define relationships between the items in a cluster. In contrast, non-hierarchical methods group items into clusters without establishing relationships between objects in the same cluster (Agapito, Milano, & Cannataro, 2022).

K-means
The K-Means algorithm is one of the clustering algorithms used in data mining to group data. It partitions data into two or more groups in a non-hierarchical way: data with the same characteristics are placed in the same group, while data with different characteristics are placed in other groups (H. Syukron et al., 2022). The algorithm starts from a specified value of K (the number of categories) and tries to assign a given set of samples to K groups according to the criterion expressed in the equation given by Shirazy, Hezarkhani, Shirazi, Khakmardan, and Rooki (2022).
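The procedure described above can be sketched in a few lines. This is a minimal illustration of the standard Lloyd iteration for one-dimensional data such as patient ages, not the implementation used in the study. The function name `kmeans_1d`, the deterministic initialization, and the sample ages are assumptions made for the example.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on a list of numbers; returns (centers, labels)."""
    ordered = sorted(values)
    # Assumed initialization: spread the starting centers across the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins the cluster with the nearest center.
        labels = [min(range(k), key=lambda j: (v - centers[j]) ** 2) for v in values]
        # Update step: each center moves to the mean of its current members.
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:  # no center moved, so the partition is stable
            break
        centers = new_centers
    return centers, labels


ages = [2, 3, 4, 5, 21, 24, 27, 30, 62, 65, 70, 71]  # made-up ages, not hospital data
centers, labels = kmeans_1d(ages, k=3)
print(centers)  # [3.5, 25.5, 67.0]
```

Each returned center is the mean age of one cluster, which is the kind of per-cluster average age the study later compares across algorithms.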

Fuzzy c-means
The Fuzzy C-Means method groups data by degree of membership. Each degree of membership ranges from 0 to 1, so a data point may belong to a cluster only partially. Fuzzy C-Means uses fuzzy clustering to assign every data point to each cluster with its own degree of membership, and this degree expresses how strongly the data point is present in that cluster. The Fuzzy C-Means algorithm has an excellent advantage in detecting high-level clusters and revealing relationships between various cluster models (Syukron, Fayyad, Fauzan, Ikhsani, & Gurning, 2022).

Gaussian mixture
The Gaussian Mixture Model is a method that models, or clusters, a dataset as several groups of data that each follow a Gaussian (normal) probability distribution. The method assumes that all data points come from a mixture of Gaussian distributions, each with its own distribution parameters (Joko Riyono et al., 2022).
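The Gaussian mixture idea can be made concrete with a small expectation-maximization (EM) loop for one-dimensional data. This is a sketch under stated assumptions, not the study's implementation: the function names, the starting means and variances, the variance floor `min_var`, and the sample ages are all illustrative choices.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(values, means, variances, iterations=50, min_var=1e-3):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances, resp)."""
    k, n = len(means), len(values)
    means, variances = list(means), list(variances)
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate the weight, mean, and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var, min_var)  # floor keeps a component from collapsing
            weights[j] = nj / n
    return weights, means, variances, resp


ages = [2, 3, 4, 5, 21, 24, 27, 30, 62, 65, 70, 71]  # made-up ages, not hospital data
weights, means, variances, resp = fit_gmm_1d(
    ages, means=[5.0, 25.0, 65.0], variances=[50.0, 50.0, 50.0]
)
# A hard cluster label is the component with the largest responsibility.
labels = [max(range(3), key=lambda j: r[j]) for r in resp]
```

Unlike a hard partition, each row of `resp` is a probability distribution over the components, so a point between two groups can be shared between them.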

Literature review
A Systematic Literature Review (SLR) was chosen in order to ground the study in previous research related to mathematical proof ability. The research stages comprise data collection, data analysis, and conclusion drawing (Niken Shofiana Dewi et al., 2023). At the data collection stage, the researchers gathered primary data at Bogor Islamic Hospital: tuberculosis data for 2022, taken directly from patient medical records.

Results and Discussions
This study aims to analyze the growth of tuberculosis (TB) cases by applying clustering algorithms. Clustering is a data analysis technique that groups entities with similar characteristics. The dataset is a collection of medical records from 2,217 patients diagnosed with tuberculosis. It consists of three main attributes that carry critical information about each patient: Control Month, Gender, and Age.
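As an illustration of how records with these three attributes might be prepared for clustering by gender, the sketch below groups ages per gender. The class name `PatientRecord`, the field encodings, and the sample rows are assumptions for the example. The source does not describe the actual file layout.

```python
from dataclasses import dataclass


@dataclass
class PatientRecord:
    control_month: int  # 1-12 (assumed encoding)
    gender: str         # "M" or "F" (assumed encoding)
    age: int            # age in years


# Made-up rows for illustration; these are not hospital data.
records = [
    PatientRecord(1, "M", 4),
    PatientRecord(1, "F", 27),
    PatientRecord(2, "M", 63),
    PatientRecord(3, "F", 3),
]

# Collect the ages of each gender so that each list can be clustered separately.
ages_by_gender = {}
for record in records:
    ages_by_gender.setdefault(record.gender, []).append(record.age)
print(ages_by_gender)  # {'M': [4, 63], 'F': [27, 3]}
```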

Output
The Silhouette Score and the Davies-Bouldin Score are two evaluation metrics used to measure clustering quality in data analysis. Both indicate how well the data have been grouped into meaningful clusters.
The Silhouette Score measures how close each sample is to its own cluster compared with the nearest neighboring cluster. It therefore reflects both how well objects fit within their own cluster and how well that cluster is separated from the others. The higher the Silhouette Score, the better the clustering result, although the value should be read together with a visual interpretation of the clusters.
The Davies-Bouldin Score measures how well separated the clusters are. It averages, over all clusters, the similarity between each cluster and the cluster most similar to it, where similarity compares the scatter within the two clusters with the distance between their centers. The lower the Davies-Bouldin Score, the better the separation between clusters.
Choosing the number of clusters involves a trade-off between the Silhouette Score and the Davies-Bouldin Score, the main goal being a number of clusters with good separation and cohesion. Based on these two metrics, and considering the overall performance of the algorithms, 9 clusters was concluded to be the most suitable choice. This number gives a good balance between cluster separation and cohesion, reflected in a relatively high Silhouette Score and a relatively low Davies-Bouldin Score.
The clustering analysis of tuberculosis patients by age and sex leads to several observations, set out in the subsections below.
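Both metrics can be written out directly for one-dimensional data, which may help in reading the reported scores. The sketch below follows the standard definitions, using absolute difference as the distance. It is not the code used in the study, and the function names and sample values are assumptions. scikit-learn provides equivalent `silhouette_score` and `davies_bouldin_score` functions in `sklearn.metrics`.

```python
def silhouette_score_1d(values, labels):
    """Mean silhouette coefficient; needs at least two clusters."""
    clusters = sorted(set(labels))
    scores = []
    for i, x in enumerate(values):
        own = [abs(x - y) for j, y in enumerate(values) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # common convention for a cluster with one member
            continue
        a = sum(own) / len(own)  # cohesion: mean distance to the point's own cluster
        b = min(                 # separation: mean distance to the nearest other cluster
            sum(abs(x - y) for y, lab in zip(values, labels) if lab == c) / labels.count(c)
            for c in clusters if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def davies_bouldin_score_1d(values, labels):
    """Mean over clusters of the worst scatter-to-separation ratio.

    Assumes at least two clusters with distinct centers.
    """
    clusters = sorted(set(labels))
    centers, scatter = {}, {}
    for c in clusters:
        members = [v for v, lab in zip(values, labels) if lab == c]
        centers[c] = sum(members) / len(members)
        scatter[c] = sum(abs(v - centers[c]) for v in members) / len(members)
    worst = [
        max((scatter[c] + scatter[d]) / abs(centers[c] - centers[d]) for d in clusters if d != c)
        for c in clusters
    ]
    return sum(worst) / len(worst)


values = [1, 2, 3, 11, 12, 13]   # made-up values forming two tight groups
good = [0, 0, 0, 1, 1, 1]        # labels that follow the groups
mixed = [0, 1, 0, 1, 0, 1]       # labels that ignore the groups
sil_good = silhouette_score_1d(values, good)
sil_mixed = silhouette_score_1d(values, mixed)
db_good = davies_bouldin_score_1d(values, good)
db_mixed = davies_bouldin_score_1d(values, mixed)
```

On this toy data the well-separated labeling gets the higher Silhouette Score and the lower Davies-Bouldin Score, which is the direction of the trade-off discussed above.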

Age Difference between Men and Women with Tuberculosis
The age distribution of male and female tuberculosis patients varies considerably. Male patients fall into several groups that cover a wider age range, including both younger and older groups. Female patients, by contrast, tend to concentrate in particular age groups, namely the young and the older groups. The presence of a very young male group (under 5 years) may indicate a risk of mother-to-child transmission of tuberculosis or exposure to the disease at an early age.

Young Age Group in Women
There is a group of female tuberculosis patients of very young age (under 5 years). This could indicate mother-to-child transmission or exposure to tuberculosis at an early age. Special care and attention are needed to prevent transmission and to ensure proper care for this group.

Elderly Age Group in Men and Women
Some clustering methods in particular yield older groups of male and female tuberculosis patients. This could indicate a higher risk among the elderly. Prevention, early detection, and appropriate treatment are needed to address these cases.

Differences in clustering methods
Different clustering methods can produce different age groups. For example, the Fuzzy C-Means method tends to produce more groups with greater age variation, while the Gaussian Mixture and K-Means methods tend to produce more focused age groups.
These observations give a view of the age profile of male and female tuberculosis patients as grouped by the clustering methods. This information can give healthcare professionals insight for designing more effective prevention, detection, and treatment strategies for different age groups and genders.
The results in this table show that the average ages of the clusters generated by the Fuzzy C-Means and K-Means algorithms tend to be similar. For example, in the first cluster the average age was 3.71 for both Fuzzy C-Means and K-Means, and in the last cluster it was 68.13 for Fuzzy C-Means and 67.48 for K-Means. This suggests that the two algorithms tend to generate groups with similar age characteristics. The Gaussian Mixture algorithm differs more in some cases. For example, in the first cluster its mean age was 3.34, lower than for Fuzzy C-Means and K-Means, which suggests that Gaussian Mixture can generate different age groups from the other algorithms in some situations. In summary, this table gives an overview of how the three clustering algorithms behave in age grouping: there are notable differences in some cases, especially for Gaussian Mixture, while Fuzzy C-Means and K-Means tend to produce more similar average ages per cluster.
Overall, the average percentage of TB patients is highest in the 0-13 age group, at around 34.52%, followed by the 46-60 age group at 19.27%. The 14-30, 31-45, and 61-85 age groups have lower percentages of 15.53%, 17.08%, and 13.61%, respectively. These results suggest that in the male population, children and the elderly tend to be more susceptible to TB infection, while young adults have a lower risk.
This information indicates that in the female population, young adolescents (14-33 years) and children (0-13 years) are the groups that tend to be susceptible to TB infection. However, the middle adult group (34-54 years) and the elderly group (55-81 years) are not free from risk either.
These results suggest that TB patients in the male population are spread evenly across age groups, but young children and adolescents (1-28 years) appear to be more susceptible to infection. Older age groups, especially 59-86 years, also have a significant risk of the disease.
These results indicate that the 9-34 age group, especially adolescents and young adults, has a higher risk of TB infection in the female population. The 35-51 age group also has a significant risk.
In general, the 1-12 and 30-45 age groups have a higher number of TB cases in the male population, with average percentages of about 34.52% and 17.44%, respectively. They are followed by the 13-29, 46-59, and 60-86 age groups, with average percentages of around 15.16%, 18.08%, and 14.79%, respectively. These results suggest that the 1-12 and 30-45 age groups have a higher risk of TB infection in the male population.
Overall, the 1-13 and 14-33 age groups have a fairly high number of TB cases in the female population, with average percentages of about 28.07% and 27.99%, respectively. They are followed by the 34-53 and 54-81 age groups, with average percentages of around 23.26% and 20.68%, respectively. These results suggest that the 1-33 age group has a higher risk of TB infection in the female population.
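The percentages quoted above are shares of cases per age range. A minimal sketch of that arithmetic follows. The function name `age_group_shares` and the sample ages are assumptions, and the bin edges are simply copied from the first set of age ranges mentioned in the text.

```python
def age_group_shares(ages, bins):
    """Percentage of cases falling in each inclusive (low, high) age range."""
    counts = {b: 0 for b in bins}
    for age in ages:
        for low, high in bins:
            if low <= age <= high:
                counts[(low, high)] += 1
                break
    total = sum(counts.values())
    if total == 0:
        return {f"{low}-{high}": 0.0 for low, high in bins}
    return {f"{low}-{high}": round(100 * c / total, 2) for (low, high), c in counts.items()}


ages = [3, 5, 9, 20, 35, 50, 52, 70]  # made-up ages, not hospital data
bins = [(0, 13), (14, 30), (31, 45), (46, 60), (61, 85)]
shares = age_group_shares(ages, bins)
print(shares)  # {'0-13': 37.5, '14-30': 12.5, '31-45': 12.5, '46-60': 25.0, '61-85': 12.5}
```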

Cluster Growth
The fundamental difference between these three algorithms lies in the mathematical approach and logic behind the grouping of data. Fuzzy c-means uses fuzzy membership degrees to indicate the extent to which a data point belongs to each cluster. The Gaussian mixture focuses on modeling the probability distribution of the data, assuming the data come from Gaussian distributions that may overlap. K-Means, on the other hand, assigns each data point to the cluster whose center is nearest.
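The contrast between the three approaches can be seen by asking each of them where a single age belongs, given the same fixed cluster centers. The sketch below uses the standard fuzzy c-means membership formula with fuzzifier `m` and a Gaussian mixture with assumed equal weights and variances. The function names, centers, and parameter values are illustrative assumptions, not values from the study.

```python
import math


def kmeans_assign(x, centers):
    """K-means: index of the single nearest center (hard assignment)."""
    return min(range(len(centers)), key=lambda j: abs(x - centers[j]))


def fcm_memberships(x, centers, m=2.0):
    """Fuzzy c-means: membership degrees in [0, 1] that sum to 1."""
    dists = [abs(x - c) for c in centers]
    if min(dists) == 0.0:  # x lies exactly on a center
        hits = [d == 0.0 for d in dists]
        return [1.0 / sum(hits) if h else 0.0 for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists]


def gmm_responsibilities(x, means, variances, weights):
    """Gaussian mixture: posterior probability of each component for x."""
    dens = [
        w * math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
        for mu, var, w in zip(means, variances, weights)
    ]
    total = sum(dens)
    return [d / total for d in dens]


centers = [5.0, 25.0, 65.0]  # assumed cluster centers (ages in years)
age = 12
hard = kmeans_assign(age, centers)
fuzzy = fcm_memberships(age, centers)
soft = gmm_responsibilities(age, centers, [50.0] * 3, [1 / 3] * 3)
```

K-means returns one index, while fuzzy c-means and the Gaussian mixture both return a graded share for every cluster. The last two differ in how the share is computed: from distance ratios in fuzzy c-means, and from probability densities in the Gaussian mixture.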

Conclusion
The results of this study provide a deeper understanding of the growth characteristics of TB disease and its implications for decision making in the health sector. Clustering analysis with three different algorithms, namely Fuzzy C-Means (FCM), Gaussian Mixture (GM), and K-Means, provided varied views of the growth of TB disease across age groups and genders.
Based on these results, several conclusions can be drawn as follows:

TB Growth in Age Groups
Clustering results consistently show that children, especially those aged 3-4 years, tend to have a higher risk of TB infection. This pattern holds in both the male and the female population. Children were therefore identified as the group most vulnerable to TB disease.

The Effect of Sex on TB Growth
Despite differences in the distribution of the number of TB cases between men and women, clustering results show that the age group of male children is more susceptible to TB disease than the corresponding female age group. Nonetheless, these results confirm that young age groups remain the more vulnerable to infection, regardless of gender.

Health and Decision Implications
The information generated from this clustering analysis has important implications for TB prevention and management. Attention to children, particularly those aged 3-4 years, is critical in formulating effective prevention strategies. In addition, awareness of the higher risk in younger age groups should serve as a basis for allocating health resources. Through clustering analysis and interpretation of its results, this study contributes to understanding the growth characteristics of TB disease in the male and female populations at Bogor Islamic Hospital (RSI Bogor). This information is expected to guide efforts to prevent and manage TB disease more effectively and efficiently, especially with a focus on children.

Table 4. Clustering Algorithm Score

Table 5. Average Age by Gender Cluster

Table 6. Average Age by Cluster

Table 5 illustrates the average age in clusters generated by three clustering algorithms, namely Fuzzy C-Means, Gaussian Mixture, and K-Means. Each row in the table represents a different age group and each column represents a different clustering algorithm.

Table 8. Cluster Growth Fuzzy c-means Male

Table 9. Cluster Growth FCM Female

Table 10. Cluster Growth GM Male

Table 11. Cluster Growth GM Female

Table 12. Cluster Growth K-means Male

Table 13. Cluster Growth K-means Female