Data Mining using Big Data in Health Informatics



Data created in health informatics has increased tremendously and now qualifies as big data. Analyzing this data effectively opens numerous opportunities to gain knowledge that improves healthcare delivery. Health informatics collects medical information and uses it to improve both medical knowledge and medical practice. This paper presents research that uses big data tools and strategies to analyze health informatics data collected at multiple levels.

Technology has made it possible for health informatics to deal with big data, most visibly through data mining. As a result, the diagnosis, treatment, assistance and healing of sick persons have become simpler. The outcome has been progress in healthcare output (HCO), defined as the quality of healthcare that end users receive. Health informatics covers diverse fields such as “bio informatics, image informatics, clinical informatics, public health informatics, translational bio informatics (TBI)” (Kamesh, Neelima & Priya, 2015).

Research in health informatics draws on information from specific levels of human existence. Bioinformatics relies on data from the molecular level, neuroinformatics uses data from the tissue level, clinical informatics employs data from patients, and public health informatics applies population data (Kamesh, Neelima & Priya, 2015). TBI can draw on data from all levels, since it uses that data to answer questions at the clinical level. Studies conducted in these different fields make it possible to advance the healthcare system.

The essay gives an overview of big data as it applies to health informatics and evaluates the different fields within health informatics. In addition, it works through the different levels of big data to demonstrate how mining this data contributes meaningfully to healthcare delivery.

Big Data

Big data is defined as the procedures and tools that make it possible for an organization to produce, use and handle vast data sets (Herland, Khoshgoftaar & Wald, 2014). In health informatics, big data is defined using five V’s. These are:

  • Volume – the vast amount of data involved

  • Velocity – the pace at which new data is created

  • Variety – the level of complexity in the data

  • Veracity – the degree to which the data can be trusted

  • Value – the quality of the data with respect to the intended results

Data used in health informatics studies exhibits all five V’s. Volume derives from the vast amount of information held in patient records across different datasets, for example those that include MRI images. Volume may also derive from social media when it is used to study a specific healthcare issue. Velocity arises when new medical information arrives at a fast pace. It is discernible when observing real-time events, such as monitoring a sick individual’s medical condition via medical sensors. It is also observable when tracking a calamity via updated social media posts (Herland, Khoshgoftaar & Wald, 2014).

Variety refers to datasets that have many different kinds of independent attributes, that are gathered from different sources, or that are complex enough to require viewing at several data levels across health informatics. Veracity becomes an issue when dealing with noisy, incomplete and erroneous information. Such data may be collected from defective clinical sensors, so it is crucial to analyze it properly. Health informatics data has high value because the objective is to enhance healthcare output (Herland, Khoshgoftaar & Wald, 2014).

Big data can both replace and support decisions made by healthcare providers. It offers strong analytical power for gaining more understanding of, and insight into, the factors that affect health. In addition, big data enhances statistical validity, which may result in improved analysis compared with conventional health research techniques (Vithiatharan, 2014). As a result, data mined from sources such as electronic health records can be aligned with personal lifestyles, which enhances patient care, makes it possible to detect health risks and predict trends in public health, and minimizes healthcare expenses. Avoidable errors detected via big data also save healthcare costs. This demonstrates that big data benefits not only public health but also brings economic advantages through cost savings.

Health Informatics Data Subfields

Health informatics comprises several subfields: neuroinformatics, bioinformatics, public health informatics and clinical informatics. When conducting a health informatics study, two levels must be taken into consideration. They are the level from which the data is collected and the level at which the research question is posed. The question level may differ from the data level used in the research.

Neuroinformatics – this subfield entails the study of brain image information, or tissue level information. The objective is to find out how the brain functions and to link data collected from brain imaging to medical events (Kamesh, Neelima & Priya, 2015), and in doing so to enhance medical knowledge at different levels. The subfield is used in this essay to stand in for the wider field of “medical image informatics”, since restricting the scope to brain images makes it possible to cover more research while still collecting ample information to constitute big data.

Bioinformatics – bioinformatics might not be regarded as part of conventional health informatics. Instead, the studies conducted in the subfield act as a significant source of health information at diverse levels. The subfield concentrates on systematic study aimed at understanding how the human body functions, using molecular level data (McDonald & Brown, 2013). Additionally, bioinformatics makes it possible to develop techniques for managing data efficiently. The growing amount of information in healthcare has increased the importance of developing data mining and analysis methods that are effective, responsive and capable of handling big data. Bioinformatics information such as gene expression data continues to grow because technology can generate ever more molecular information for every person. Hence, bioinformatics falls under the category of big volume in big data (McDonald & Brown, 2013). The large volume of molecular data has meant that the resulting computational problems are solved using software.

Public health informatics – the subfield uses data mining to analyze population data in order to gain medical knowledge. The information comes from the general public and is either collected using conventional approaches, that is, from hospital records, or collected from the internet through social networking sites (Ryu & Song, 2014). Population information exhibits big velocity, volume and variety.

Clinical informatics – the study entails making predictions that assist medical practitioners in reaching better, quicker and more precise decisions concerning their patients. Achieving this is possible through the analysis of patient data (Ryu & Song, 2014). Clinical questions are a crucial question level for health informatics because they deal directly with sick individuals. The term clinical is used in health informatics both directly and indirectly. However, the subfield name, clinical informatics, refers specifically to patient information (Kamesh, Neelima & Priya, 2015). There is a large gap between clinical research and the use of research findings in medical practice. Most clinical decisions derive from information that has been applied and has worked before. Clinical informatics is an important subfield in healthcare because it will help bridge the gap from research to practice.

Analysis of Big Data using the Different Levels

Molecular level data – information collected at the molecular level frequently faces the challenge of high dimensionality. This means that the data comprises a very large number of independent attributes. The reason is that molecular level data covers numerous potential molecules, each of which is expressed in the dataset as a feature.
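One common way to cope with high dimensionality is to rank the features and keep only the most informative ones. The following is a minimal sketch of such a filter, assuming two patient classes and a simple signal-to-noise score; the function name `rank_genes`, the score and the toy expression values are illustrative assumptions, not the method used in the studies cited in this essay.

```python
from statistics import mean, pstdev

def rank_genes(samples, labels, top_k=2):
    """Return the indices of the top_k features that best separate two classes.

    Each feature is scored by |mean_a - mean_b| / (std_a + std_b), a simple
    signal-to-noise ratio. This is an illustrative filter only.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("expected exactly two classes")
    scores = []
    for j in range(len(samples[0])):
        a = [s[j] for s, y in zip(samples, labels) if y == classes[0]]
        b = [s[j] for s, y in zip(samples, labels) if y == classes[1]]
        spread = pstdev(a) + pstdev(b) + 1e-9  # small constant avoids division by zero
        scores.append((abs(mean(a) - mean(b)) / spread, j))
    scores.sort(reverse=True)  # highest score first
    return [j for _, j in scores[:top_k]]

# Toy data (made up): 4 patients x 3 genes, two classes "A" and "B".
samples = [
    [5.0, 1.0, 3.0],
    [5.2, 0.9, 1.0],
    [1.0, 1.1, 2.9],
    [1.1, 1.0, 1.1],
]
labels = ["A", "A", "B", "B"]
print(rank_genes(samples, labels, top_k=1))  # gene 0 separates the classes best
```

In a real gene expression dataset the number of features would be in the thousands, which is why such a reduction step usually precedes classification.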

Research on molecular level data uses gene expression information to answer clinical questions. Analyzing gene expression data makes it possible for physicians to predict medical outcomes. For instance, the data makes it possible to determine which cancer subtype a patient has. In addition, gene expression data can be used to predict the likelihood that a patient will relapse within a given period after a cancer diagnosis (Salazar et al., 2011). This is apparent in two studies that use gene expression to make predictions about cancer patients. One applies gene expression to group leukemia into two distinct subclasses. The other applies gene expression to predict the likelihood of relapse among cancer patients.

The first study created a gene expression profile that made it possible to group patients into either the lymphoid or the myeloid leukemia subclass (Kamesh, Neelima & Priya, 2015). Gene expression samples were obtained from every patient included in the research. The research then employed pairwise classification to differentiate between perfect match and mismatch. The method achieved improved results compared with the usual diagnostic approaches. It shows that the application of “microarray gene expression data patients” provides a reliable method of classifying sick persons into the different forms of leukemia (Kamesh, Neelima & Priya, 2015).

Similarly, the researchers in the second study use different sets of gene probes. They then use a test to determine which gene expression profiles indicate a likely relapse within a five-year period. Together, the studies demonstrate the significance of gene expression in making medical predictions (Kamesh, Neelima & Priya, 2015). If the processes employed in these studies are applied to other cancer types, they will assist physicians in the early diagnosis and treatment of cancer patients.

Tissue level data – analysis of this data helps answer human-scale questions through the creation of a brain connectivity map that makes medical predictions possible.

The current study of brain tissue relies on neuroimaging. However, researchers note that it is inaccurate because it fails to incorporate histological techniques for examining real brain tissue (Ragupathi & Ragupathi, 2014). In addition, the difficulty of matching Magnetic Resonance Imaging (MRI) measurements to anatomical measurements impedes the creation of an inclusive human brain connectivity model. This is because MRIs differ in scale and resolution from histological techniques (Ragupathi & Ragupathi, 2014). Thus, the results obtained from the two sources have different sizes and quality.

Nevertheless, the use of tissue level data promises improvements in medical predictions. The data makes it possible to create a complete connectivity image of the brain that enhances knowledge of how the brain functions (Wang, Li & Perrizo, 2015). The connectivity map makes it possible to understand why individuals present with specific brain disorders. It gives physicians an opportunity for simpler diagnosis and early detection of future diseases. Such advances in medical practice also increase the possibility of physicians preventing mental illnesses. This is because mining the tissue level data of sick individuals creates a valuable opportunity for medical practitioners to conduct studies on the brain. Studies using this data are current and provide a basis for better medical practice in the future (Wang, Li & Perrizo, 2015).

MRI on its own has proven insufficient. However, combining MRI measurements with a patient’s tissue level data improves the chances of finding connections between physical diseases and specific brain locations. Such information helps physicians make diagnoses and medical predictions. This kind of data analysis could greatly enhance healthcare because, as technology develops, physicians may be able to use brain MRIs to determine whether individuals have a certain illness. Early diagnosis, in turn, leads to effective treatment and reduces the harm associated with certain illnesses.

Patient level data – data mining using patient level information helps answer clinical questions such as predicting a patient’s readmission to the ICU and the likelihood of a patient passing away following discharge from the ICU. The data is also applicable when making medical predictions from streams of data.

Currently, it is not possible to predict how well a patient will do once released from the Intensive Care Unit (ICU). However, technological developments such as data mining of patient level data might make an important contribution here. The data makes it possible for physicians to build prediction models that inform them of the health condition of their patients following ICU discharge (Kamesh, Neelima & Priya, 2015). The models employ physiological variables that are relevant to the prediction. One example of such a variable is the patient’s age at discharge.

Such data analysis can improve clinical discharge processes by identifying which patients are healthy enough to be discharged from the ICU and which require further treatment before being released. The aim of the analysis is to determine what causes patients to return to the ICU or why some pass away after release. Discovering why sick people pass away or relapse after release from the ICU is an important step in prolonging and saving lives. If physicians can predict which patients should not be discharged, and therefore give them more care, then mortality and relapse rates will fall.

Data at this level is also significant in making clinical predictions. This is possible because data streams can be used to predict the condition of sick persons in real time. A data stream is data that requires constant evaluation in order to provide instantaneous results. Mining these data streams in order to analyze patient data results in real-time diagnosis as well as prognosis (Herland, Khoshgoftaar & Wald, 2014). Hence, physicians are in a better position to make fast and more precise clinical decisions. The physician is able to begin solving a medical problem as soon as it arises without losing time developing a plan to solve it.
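The idea of constant evaluation over a data stream can be illustrated with a sliding window. The sketch below is only an assumption about how such monitoring might look: the function name `rolling_alerts`, the window size, the threshold and the heart-rate readings are all hypothetical, and real clinical systems use far richer models.

```python
from collections import deque

def rolling_alerts(readings, window=3, threshold=100.0):
    """Yield (index, rolling_mean) whenever the mean of the last `window`
    readings exceeds `threshold`. Readings are consumed one at a time,
    so the function works on any iterable, including a live stream."""
    recent = deque(maxlen=window)  # automatically drops the oldest reading
    for i, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window:
            avg = sum(recent) / window
            if avg > threshold:
                yield i, avg

# Hypothetical heart-rate readings (beats per minute) from a bedside sensor.
heart_rate = [80, 82, 85, 110, 120, 125, 90, 85]
for index, avg in rolling_alerts(heart_rate, window=3, threshold=100.0):
    print(f"reading {index}: rolling mean {avg:.1f} exceeds threshold")
```

Averaging over a window rather than reacting to single readings is one simple way to reduce false alarms from a noisy sensor.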

Population level data – physicians, hospital records and clinics act as a convenient source for collecting health informatics data. However, the development of social media has led to a rise in the exchange of information over the internet, and this includes health information. There are numerous online sources of health information, such as Facebook, blogs, Google searches and Twitter, among other sites. The outcome of this vast array of social data is the availability of big data on health (Signorini, Segre & Polgreen, 2011). This kind of big data has the potential to produce numerous breakthroughs in the medical field.

There is a lot to learn from the health information found on social media platforms. This includes information about the outbreak of illnesses, immediate tracking of dangerous illnesses following their outbreak, better knowledge of illnesses internationally, and an exceptionally easy way for individuals to find answers to their medical queries (Signorini, Segre & Polgreen, 2011). Hence, data mining of social media content makes it possible to provide helpful information about an illness to sick individuals. In addition, when individuals post online about the outbreak of an illness, the information can be used to track the spread of the disease or even its severity.

Data mining at the population level shows that social media platforms such as message boards are effective in assisting sick persons. The boards act as a source of information that is helpful in answering clinical queries. For instance, through the creation of health informatics platforms, it is possible to help individuals connect with others who have suffered from similar illnesses. People are able to discuss the treatments that have worked for them, inform others about the expected side effects of certain medication, and share information on recommended hospitals and suitable medication (Herland, Khoshgoftaar & Wald, 2014).

Using population level data will make it possible for medical practitioners to devise a new system for using social media health forums. Such a system involves identifying the sick individual’s medical condition based on individual health information, finding forums for users with a similar illness, and implementing a metric to analyze and categorize forum topics in order to determine what information will be helpful to patients. The forums can also be used to create clinical pathways that assist physicians in tracking the wellbeing of their patients (Herland, Khoshgoftaar & Wald, 2014). From the moment an individual is diagnosed with a specific illness, physicians create a clinical pathway that follows the individual’s treatment and health development.

The use of the internet and social networking sites has become more widespread. People continuously use these platforms to search for and share health related information. It is likely that social media information can be used to connect patients to meaningful health information. Although population level data from social media has not been widely tested to demonstrate its effectiveness, data mining of such information might be extended to assist physicians in diagnosing and treating their patients (Herland, Khoshgoftaar & Wald, 2014). This is because physicians can access helpful information from other physicians and patients on social media platforms, analyze it, and collect and use what is meaningful. Both patients and physicians may post on health forums where they seek clarification or direction on what medical action to take.

It is possible to use search queries to track epidemics using population level data. Normally, the Centers for Disease Control and Prevention (CDC) makes information on epidemics public after a period of a week or two. However, by using search queries, it might be possible to release epidemic information to the public faster than with conventional approaches such as relying on CDC reports (Herland, Khoshgoftaar & Wald, 2014). Such data mining can help hospitals and medical practitioners become aware of when and where an epidemic has broken out. This is because real-time updates make it possible to act fast to curb the spread of the illness and to assist infected individuals.

Research on the use of search queries has been applied to outbreaks of influenza-like illness. An automated system capable of analyzing big volume was used. Its steps comprise selecting keywords, filtering the keywords, defining a combined search index, and fitting a regression model to the keyword index. The keywords ought to relate to factors that may affect the influenza outbreak. Using regression analysis, researchers have tested how well search queries track an epidemic. The results indicate that search query data is an important tool for rapidly and precisely detecting the occurrence of an influenza-like disease outbreak. Such findings can be applied to tracking other outbreaks (Signorini, Segre & Polgreen, 2011).
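The final two steps of such a pipeline, building a combined index and fitting a regression model to it, might look as follows. This is a minimal sketch assuming an ordinary least-squares line and an index formed by summing weekly keyword counts; the function names and all numbers are made up for illustration and do not come from the cited research.

```python
def combined_index(keyword_counts):
    """Sum the weekly counts of all selected keywords into one index per week."""
    return [sum(week) for week in zip(*keyword_counts.values())]

def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical weekly search counts for two selected keywords.
keyword_counts = {
    "flu symptoms": [1, 1, 2, 3],
    "fever": [0, 1, 1, 1],
}
# Hypothetical weekly rate of influenza-like illness (percent of visits).
ili_rate = [3.0, 5.0, 7.0, 9.0]

index = combined_index(keyword_counts)   # one value per week
slope, intercept = fit_line(index, ili_rate)

# Estimate the illness rate for a new week from its search index alone.
new_week_index = 5
print(slope * new_week_index + intercept)
```

Once fitted, the model lets the illness rate be estimated from search activity as soon as the week’s queries are available, rather than waiting for official reports.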

Analysis of population level data also demonstrates the possibility of using Twitter posts to track outbreaks. Twitter is a social media site that lets users post information online as tweets. It has millions of users, which means that a single tweet can potentially be viewed by millions (Signorini, Segre & Polgreen, 2011). These users come from different nations. Such a big volume of users increases the chance of generating helpful information on epidemics from the tweets posted. However, there is also a high likelihood of noisy data. Hence, it is through data mining that the important data is found. This line of study creates the possibility of building a system that tracks Twitter data in order to create a global epidemic map. As a result, populations and physicians get to learn of an epidemic as soon as it happens (Signorini, Segre & Polgreen, 2011).
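A first step in separating useful tweets from noise is a keyword filter that counts relevant posts per day. The sketch below is a deliberately simple assumption about how that could be done; the keyword set, the function name `daily_mentions` and the example tweets are hypothetical, and it does not use the Twitter API.

```python
from collections import Counter

# Hypothetical keyword set; a real study would select and validate its own.
KEYWORDS = {"flu", "influenza", "fever"}

def daily_mentions(tweets, keywords=KEYWORDS):
    """Count, per day, the tweets that contain at least one keyword.

    `tweets` is an iterable of (day, text) pairs. Matching is done on whole
    words after stripping basic punctuation and lowercasing.
    """
    counts = Counter()
    for day, text in tweets:
        words = {w.strip(".,!?#").lower() for w in text.split()}
        if words & keywords:
            counts[day] += 1
    return dict(counts)

# Made-up example posts.
tweets = [
    ("day 1", "Down with the flu today"),
    ("day 1", "Great game last night!"),
    ("day 2", "Fever and chills, probably influenza."),
    ("day 2", "#flu season is here"),
]
print(daily_mentions(tweets))
```

The resulting daily counts could then be plotted by location or fed into a regression model to follow an outbreak over time.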


Using big data brings numerous benefits to health informatics because it makes it possible to conduct research that will eventually improve healthcare delivery. Big data refers to data that exhibits the five V’s: value, volume, veracity, velocity and variety. Analysis of big data can both replace and support decisions made by healthcare providers. Compared with conventional health research approaches, big data enhances statistical validity, leading to better healthcare analysis.

Health informatics comprises different subfields that provide the different levels of data needed for data mining. Neuroinformatics involves studying brain images and provides tissue level data. Bioinformatics provides molecular level data. It is a subfield that focuses on systematic study with the goal of understanding how the human body functions. Public health informatics provides population level data. This is information gathered from the populace through social media platforms and the internet. Such information is characterized by big volume, velocity and variety: it comes from different sources and is therefore varied, the amount of information is large, and people are constantly posting information online, which gives it high velocity. Clinical informatics provides patient level data that is helpful in making health related predictions.

Research on molecular level data demonstrates that physicians can use gene expression in the early diagnosis and treatment of cancer patients. Tissue level data from brain studies makes it possible to use brain imaging in making medical predictions. Patient level data analysis lets physicians make predictions about the health condition of their patients, for instance whether a patient is healthy enough to be discharged from the ICU. Analysis of population level data creates the possibility of predicting epidemics through real-time reporting and assessment of an outbreak as soon as it emerges.


Herland, M., Khoshgoftaar, T. M. & Wald, R. (2014). A review of data mining using big data in health informatics. Journal of Big Data, 1(2), 1-35.

Kamesh, D. B. K., Neelima, V. & Priya, R. (2015). A review of data mining using big data in health informatics. Journal of Scientific and Research Publications, 5(3), 1-7.

McDonald, E. & Brown, C. T. (2013). Working with big data in bioinformatics. CoRR, 1–18. Retrieved from

Ragupathi, W. & Ragupathi, V. (2014). Big data analytics in healthcare: promise and potential. Health Information Science and Systems, 2(3), 1-10.

Ryu, S. & Song, T. (2014). Big data analysis in healthcare. Healthcare Research Information, 20(4), 247-248.

Salazar, R., Roepman, P., Capella, G., Moreno, V., Simon, I., Dreezen, C., Lopez-Doriga, A., Santos, C… (2011). Gene expression signature to improve prognosis prediction of stage II and III colorectal cancer. Journal of Clinical Oncology, 29, 17–24.

Signorini, A., Segre, A. M. & Polgreen, P. M. (2011). The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic. PLoS ONE, 6(5).

Vithiatharan, R. N. (2014). The potentials and challenges of big data in public health. Australian eHealth Informatics and Security Conference, 22-27.

Wang, B., Li, R. & Perrizo, W. (2015). Big data analytics in bioinformatics and healthcare. Hershey, PA: Medical Information Science Reference.