1 / 10

Data Cleaning in Financial Modules

Data Cleaning in Financial Modules. Workshop in Frankfurt Mario Schnalzenberger. Which Modules exactly. EP Pensions and Benefits (ep078*, ep094*) Income from various sources ( ep201, ep045, ep314, ep041, ep205, ep207, ep305) AS

thora
Download Presentation

Data Cleaning in Financial Modules

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Cleaning in Financial Modules Workshop in Frankfurt Mario Schnalzenberger

  2. Which Modules exactly • EP • Pensions and Benefits (ep078*, ep094*) • Income from various sources (ep201, ep045, ep314, ep041, ep205, ep207, ep305) • AS • Financial assets and dividends from them (as003, as005, as007, as009, as011, as015, as017, as058, as021, as024, as027, as030, as032*, as034*, as042, as049, as051) • Total debts (as051) • Savings of other HH members (as069) • HC – out of pocket payments (hc045, hc047, hc049, hc051) • FT – financial transfers (ft004*, ft011*, ft018) • HO – rents and value of real property • (ho003, ho005, ho008, ho015, ho020, ho024, ho027, ho030) • HH – household income (hh002, hh011, hh017) • CO – amout spent (co002, co003, co004, co011)

  3. 1. Check for outliers • Merge all relevant sections (include also IWER – id from dataset origin, makes it easier for the agency) • Recode „Don‘t Know“, „Refuse“ as missing • (eg. use „mvdecode _all, mv(9e+20=.a \ 8e+20=.b)” in STATA) • Use graphical methods (e.g. histograms if that is easier for you) • Search for outliers in each section: use percentiles, min, max or other automatic procedures (see hadimvo)

  4. Here you see that without correcting the outliers, all data is concentrated in few bins Corrected data is distributed nicely Graphical check

  5. Automatic check for the FT section foreach var of varlist ft004_1 ft004_2 ft004_3 ft011_1 ft011_2 ft018_1 { noi disp "`var'" noi sum `var', detail hadimvo `var' if `var'>0, gen(g_`var') sort laptop date1 time1 noi list laptop sampid respid `var' if g_`var'==1 } egen g_ft=rsum(g_ft*) sort laptop date1 time1 noi disp "Rows with more than one outlier in FT" noi list laptop sampid respid g_ft if g_ft>1

  6. Look at those outputs • Find real outliers & typos • Find currency problems (this may be hard in various countries) • Make a list (eg. Excel – File) for the agency • Those issues should be easy to fix.

  7. In addition to ask the agency • Check things which are duplicate in the data • Gross and net income of same source (but Attention: sometimes apply to different years)! • Income (ep078* ep094* and others) + hh002 + hh011 ~= hh017 • It often occured that hh002 was set to the partners income (exactly), even though this is wrong (see question text)

  8. Find absolute brackets • Maximum pensions (maybe per type, in Austria 3850 Euro per month) • Find some useful brackets for assets (we used 100000 Euro for many of them) • Look for proportions of large values in the whole dataset (reasonable?) • Look for wealthy households

  9. Don‘t forget • Usually we have a skewed distribution (monetary numbers) • Therefore outliers are more easily found on the maximum side! • No outliers will be usually found on the minimum side. • Check also the minimal numbers (e.g. we had houses worth 0, 2 or 5 Euro, penisons of 0 or 1 Euro a month, …) – see also Dimitri‘s presentation

  10. Other issues • Periodicity of payments (e.g. in Austria ALL(!) pensions are paid monthly only) • Sometimes the respondents told they got additional payments (last month, i.e. ep201 is higher than in usual month), but those payments (ep314) are equal to or even higher than ep201 • This would mean they have no or even negative regular income payment (obviously wrong)

More Related