Preprocessing

Background

Data sources

Details

XXX

XXX

Total records

Data range

Age 15-20

Age 21-30

Age 31-40

Age 41-50

Age >50

Number of clinics

Nature of the data - known issues

  • Missing complete date of birth

  • Age fields contain a mix of client ages and client years of birth. In very rare instances, a client's actual date of birth is included

Pre-processing implementation

Supporting a variety of data sources through a polymorphic interface has required defining reference types that make the interface more intuitive, modular and reusable. The reference types are defined below.
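As a rough illustration of the idea, and not the project's actual code, the sketch below shows how reference types could share one interface so that the same loop runs over differently shaped sources. The names `ReferenceType`, `clean`, `apply`, `UpperText` and `apply_all` are hypothetical, and rows are assumed to be plain dictionaries.

```python
from abc import ABC, abstractmethod


class ReferenceType(ABC):
    """Hypothetical base class: one cleaner per logical field."""

    def __init__(self, src_column, output):
        self.src_column = src_column  # column name in the source data
        self.output = output          # column name for the cleaned value

    @classmethod
    def config(cls, src_column, output):
        # Mirrors the config(...) style used on this page.
        return cls(src_column, output)

    @abstractmethod
    def clean(self, value):
        """Return the cleaned form of a single raw value."""

    def apply(self, row):
        # Copy the row so the source record is left unchanged.
        cleaned = dict(row)
        cleaned[self.output] = self.clean(row.get(self.src_column))
        return cleaned


class UpperText(ReferenceType):
    """Toy reference type used only to demonstrate the interface."""

    def clean(self, value):
        return None if value is None else str(value).strip().upper()


def apply_all(rows, cleaners):
    # Each source supplies its own cleaners; the loop itself never changes.
    result = []
    for row in rows:
        for cleaner in cleaners:
            row = cleaner.apply(row)
        result.append(row)
    return result
```

Under this assumption, adding a new data source means configuring a different list of cleaners rather than changing the loop that applies them.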

Reference types

  • Full name - This is a string of names with no predefined ordering. The first name may come last in the string, or the surname may come first. The output of this cleaning step needs to be limited to the number of name tokens required. For instance, if cleaning yields 6 name tokens, the configuration can specify that only 2 tokens (first name and surname) are returned, or 3 tokens (first name, middle name and last name)

FullName.config(column="name", output=["firstname", "middlename", "lastname"], output_cols=["first", "middle", "last", "other"])
  • Age - This is an integer field, but it tends to include year values. The cleaned output is an integer

Age.config(src_column="age", output_type=int)
  • Birth year - This is obtained in two ways: 1) calculated, or 2) retrieved from the age column when that column holds a year value

BirthYear.config(src_column="age", output="birth_year")
BirthYear.calc(src_col=["birthyear", "scr_date"], output="birth_year")
  • Screening date - cleaned as a datetime value

  • VIA - cleaned by matching and replacing values

  • HIV result - cleaned by matching and replacing values
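
The cleaning rules listed above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function names are hypothetical, the cut-offs (ages up to 120, birth years from 1900 onward) are assumed thresholds, and name tokens are assigned to output columns by position only, since the source gives no rule for telling a first name from a surname.

```python
import re
from datetime import date


def split_age_or_year(raw, max_age=120):
    """Return (age, birth_year); at most one of the two is set."""
    match = re.fullmatch(r"\s*(\d{1,4})\s*", str(raw))
    if not match:
        return None, None
    number = int(match.group(1))
    if 0 < number <= max_age:                   # assumed plausible age range
        return number, None
    if 1900 <= number <= date.today().year:     # assumed plausible birth years
        return None, number
    return None, None


def birth_year_from_age(age, screening_date):
    # Approximate: may be off by one year, depending on whether the
    # birthday had already passed on the screening date.
    return screening_date.year - age


def limit_name_tokens(full_name, output_cols):
    """Split on whitespace and keep at most len(output_cols) tokens."""
    kept = full_name.split()[: len(output_cols)]
    # Pad with None so every output column is present.
    kept += [None] * (len(output_cols) - len(kept))
    return dict(zip(output_cols, kept))


def map_value(raw, mapping):
    """Match a raw value case-insensitively; return None if unknown."""
    return mapping.get(str(raw).strip().lower())
```

For example, `split_age_or_year("1987")` yields a birth year rather than an age, and `map_value(" Positive ", {"positive": "POS"})` normalises a free-text result to a single code. The mapping dictionary here is invented for illustration; the real match-and-replace values for VIA and HIV result are not given on this page.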
