Data cleaning and knowledge discovery in process data

dc.contributor.advisor: Edgar, Thomas F. [en]
dc.contributor.committeeMember: Wojsznis, Willy [en]
dc.contributor.committeeMember: Djurdjanovic, Dragan [en]
dc.contributor.committeeMember: Rochelle, Gary T. [en]
dc.contributor.committeeMember: Baldea, Michael [en]
dc.contributor.committeeMember: Daniels, Michael J. [en]
dc.creator: Xu, Ph. D., Shu [en]
dc.date.accessioned: 2016-02-09T16:35:17Z [en]
dc.date.accessioned: 2018-01-22T22:29:27Z
dc.date.available: 2016-02-09T16:35:17Z [en]
dc.date.available: 2018-01-22T22:29:27Z
dc.date.issued: 2015-12 [en]
dc.date.submitted: December 2015 [en]
dc.date.updated: 2016-02-09T16:35:17Z [en]
dc.description.abstract: This dissertation presents several methods for overcoming Big Data challenges, with an emphasis on data cleaning and knowledge discovery in process data. Data cleaning and knowledge discovery was chosen as the main research area because of its importance from both theoretical and practical points of view. The theoretical background and recent developments of data cleaning methods are reviewed from four aspects: missing data imputation, outlier detection, noise removal and time delay estimation. Moreover, the impact of contaminated data on model performance and the corresponding improvement obtained by data cleaning methods are analyzed through both simulated and industrial case studies. The results provide a starting point for further advanced methodology development. It is hard to find a universally applicable method for data cleaning since every data set may have its own distinctive features. Thus, available methods have to be customized so that the quality of the data set is guaranteed. An integrated data cleaning scheme is proposed, which incorporates model building and performance evaluation, to provide guidance in tuning the parameters of data cleaning methods and to prevent over-cleaning. A case study based on industrial data was used to verify the feasibility and effectiveness of the proposed method, during which a partial least squares (PLS) model was built and three univariate data cleaning procedures were tested. A time series Kalman filter (TSKF) is proposed that successfully handles outlier detection in dynamic systems, where normal process changes often mask the existence of outliers. The TSKF method combines a time series model fitting procedure with a modified Kalman filter to deal with additive outlier (AO) and innovational outlier (IO) detection problems in dynamic process data sets. A comparative analysis of TSKF and available methods is performed on simulated and real chemical plant data.
Root cause diagnosis of plant-wide oscillations is provided as a concrete example of data cleaning and knowledge discovery in process data. Plant-wide oscillations can negatively influence the overall control performance of the process, and the detection results are often affected by noise at different frequency ranges. To address this problem, an information transfer method combining the spectral envelope algorithm with spectral transfer entropy is proposed to detect and diagnose such oscillations within a specific frequency range, mitigating the effects of measurement noise. The feasibility and effectiveness of the proposed method are verified and compared with available methods through both simulated and industrial case studies. [en]
dc.description.department: Chemical Engineering [en]
dc.format.mimetype: application/pdf [en]
dc.identifier: doi:10.15781/T2XT13 [en]
dc.identifier.uri: http://hdl.handle.net/2152/32920 [en]
dc.language.iso: en [en]
dc.subject: Data cleaning [en]
dc.subject: Knowledge discovery [en]
dc.title: Data cleaning and knowledge discovery in process data [en]
dc.type: Thesis [en]

Files