Financial informatics
This presentation is the property of its rightful owner.
Sponsored Links
1 / 41

Financial Informatics: PowerPoint PPT Presentation


  • 69 Views
  • Uploaded on
  • Presentation posted in: General

Financial Informatics:. Startup, low-cost, dataload Challenges and Solutions. What are we talking about today?. Financial Data, more specifically stock market data as an example The basic design of a MySQL database that contains a daily history of stock prices

Download Presentation

Financial Informatics:

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Financial Informatics:

  • Startup, low-cost, dataload Challenges and Solutions


What are we talking about today?

  • Financial Data, more specifically stock market data as an example

  • The basic design of a MySQL database that contains a daily history of stock prices

  • Building a stock machine and some of the challenges posed

  • Some large data ‘gotchas’ and solves

  • Some large mysql ‘gotchas’ and solves


Financial data

  • csv records of what happened that day or a signal

  • often have unexplaind anomalies

  • daily arrival of row data which doesn’t conform to spec


1. Big Picture

  • Who is your audience? Make your analytics and application work with a small dataset first

  • Market data rules: You can’t scrape Yahoo

  • QA is not a bad word: Data Quality is key

  • What’s a security?

  • What’s a corporate action?

  • OLAP: This is once a day processing

  • Take performance of your dev boxes seriously: Dell 2950 with 32GB of ram, 6 disks, RAID10.


Where does financial data come from?

  • Thompson / Reuters

  • McGraw Hill / Interactive Data

  • Securities and Exchange Commission

  • Dow Jones

  • Standard and Poors

  • Bloomberg

  • Lots of ‘boutique’ $100M companies


Market data rules

  • Information about a security’s trade on an exchange is owned by the exchange and distributed to those who have made a license agreement (Reuters, Interactive Data, et al.) Your license agreement with these 3rd parties will start at $20k-$50k a year

  • Scraping yahoo, msn money, Forbes or another site is infringement

  • There are different license levels with financial data providers, redistribution usually costs more than a quantitative black box

  • After three days most data is less valuable / expensive, you may get a bargain for dev phase

  • Working with financial data providers is a slow process, it may take you 8 weeks from your initial point of contact with a rep before securing a license agreement. Work with your business decision team to prepare for this

  • Even indexes like S&P 500 and industry data is under license.


What Data Do You Need?

  • Historical Price - Everyone needs this for charts, models, etc

  • Corporate Actions - Adjustments going forward for historical data

  • Real-time Price - You may want this for real-time charts (100’s of Megs a Day)

  • SEC Filings - You may want to decompose for quant models or present reports to users

  • 3rd Party Quant Data - Black box trading solution, quant box


Don’t load everything day 1

  • AAPL, INTL, T, X, XOM, DVW, DELL, GE

  • S&P 500

  • Russell 3000

  • FTSE

  • APAC

  • OTC / PINK / BB

  • Mutual Funds

  • Money Market

  • Indexes


What’s a security?

  • Stocks, bonds, mutual funds and more

  • In this context traded on an exchange

  • A note held for you by your broker

  • Represents a debt to be paid by issuer -or-

  • Represents a share of the issuer -or-

  • Represents a bet on the issuer -or-

  • Represents an index of multiple securities -or-

  • Represents another abstraction of ownership or bet


SECURITY


What’s a corporate action?

  • A change to an attribute of a security or a security’s price

  • Split; reverse split

  • Dividend

  • Name change

  • Listing; delisting;

  • Exchange change

  • Notes change

  • Regional change

  • Currency change


CORPORATE_ACTIONS (ABRIDGED) 70 Cols!!!


QA is not a bad word

  • QA of financial data is much different than qa of software

  • row data can arrive empty, wrong, portions missing

  • row data can fail to arrive

  • stocks may be priced wrong

  • corporate actions may be for the wrong stock

  • Canadian stock can be listed in the us with Canadian dollar prices

  • all kinds of other fun

  • You must have Excel jockeys to identify and explain noise to:

    • Engineers

    • Your data provider

    • Your customers


2. Table Designs


2. Table Designs

  • SECURITY - Attributes of a security

  • RAW_PRICE - Attributes of a security’s trades from csv, unadjusted

  • PRICE - Attributes of a security’s trades, adjusted for corporate actions

  • CORPORATE_ACTIONS - Change records of a security or price attributes

  • JOBS - Attributes of a job

  • COUNTRY - A reference table for a security’s country

  • EXCHANGE - A reference table for a security’s exchange

  • REGION - A reference table for a security’s region

  • SOURCE - A reference table of the data provider for a security


SECURITY


SECURITY

  • security_id is your abstraction of data industry identifiers

    • SECURITY_ID, your identifier int NOT NULL AUTO_INCREMENT unsigned pk

    • SECURITY_NAME, exchanges name for company

    • SOURCE_ID, what data provides this char(1)

    • CUSIP, us and canada unique identifier, char(9)

    • TICKER, an identifier, a gotcha, varchar(14)

    • SYMBOL, an identifier, a gotcha, varchar(14)

    • EXCHANGE_ID, what exchange is it traded on

    • REGION_ID, what region does this trade in int

    • COUNTRY_ID, what country does this trade in int


SECURITY table

  • Uses internal identifier SECURITY_ID

  • If you’re experimenting with different providers, SOURCE_ID should be added to pk

  • Holds providers key for a security (ric, symbol, ticker, cusip)

  • 500k rows max


RAW_PRICE


RAW_PRICE table (Load Everything)

  • The rows just as they’ve come from the provider with an artificial key

  • Price corrections with asof_date in the past may come in, check for these

  • Sometimes attributes don’t exist in source files, missing asof, open, etc, not null loses the whole row and it might take days to get another one resent


PRICE


PRICE table

  • SECURITY_ID, your identifier int unsigned pk

  • ASOF_DATE, what data provides this char(1) pk

  • OPEN, the opening price decimal

  • LOW, the low price for the day decimal

  • CLOSE, the closing price for the day

  • HIGH, the high price for the day

  • VOLUME, how many shares sold that day

  • SPLIT ADJUSTMENT: (REUTERS, not COMSTOCK) multiplier decimal


PRICE table

  • Only one price per security per day

  • Validation happens from RAW_PRICE to PRICE

  • Instead of bouncing rows you may consider a suspect data flag which bubbles up to UI


CORPORATE_ACTIONS


CORPORATE_ACTIONS table

  • Comstock: Splits and Reverses are in this file

  • Reuters: Splits and Reverses are in price file

  • Denormalized - Boo!

  • Much of this information is display information

  • Changes to exchange or trading status are in here (bankruptcy, emerging from bankruptcy, changing from NASDAQ to OTC.BB, etc)

  • Dividend information is in here too


COUNTRY, REGION, EXCHANGE


COUNTRY, REGION, EXCHNAGE tables

  • COUNTRY, keeps track of what country a security trades in

    • USA

    • CANADA

  • REGION, keeps track of what region a security trades in

    • NORTH AMERICA

    • APAC

  • EXCHANGE, keeps track of what Exchange a security is traded on

    • VANCOUVER

    • NASDAQ

    • NASDAQ OTC.BB


SOURCE


SOURCE table

  • Keeps track of who provides what data in the security table

  • Good to side-by-side comparisons where data comes from two different providers

  • Helps build organizational knowledge over what providers have good data-quality


Data Gotchas

  • Do: load everything, don’t build constraints based on provider specs prior to understanding the data

  • Do: use 5.0.31 or above with innodb

  • Do: wrap batches in BEGIN / END

  • Do: set innodb_rollback_on_timeout = ON

  • Do: stage feeds in raw tables b/c if you adjust for splits in the live history table and make mistakes you’re be loading millions of rows again

  • Don’t run things like:

    • exec(“mysql -u user -e “source /feed/load_statements.sql”);

  • Don’t: foreign keys until process is hardened or never


  • 3. Gears

    • load_raw_prices();

    • daily_price_clean();

    • load_security();

    • load_price();

    • split(); Special sauce for you to write

    • undo_split(); Ditto


    An approach to data loads

    • Daily load phase 1

      • Get data from provider in csv or xml

      • Don’t translate

      • Import into raw tables

      • Run variance checks to throw alerts (~50k securities)

        • is ( yesterday n rows / today n rows ) between 99.99 and 100.01%?

    • Daily load phase 2

      • Load data into live tables

      • Make adjustments for corporate actions

      • Run your models

      • Run variance checks to throw alerts


    load_raw_prices()

    • function load_prices( $price_file ) {

    • $lines = file($price_file);

    • $counter = 0;

    • foreach ($lines as $line_num => $line ) {

    • $counter = $counter+1;

    • $row = explode(",",$line);

    • $cusip = $row[0];

    • $ric = $row[1];

    • $asof_date = $row[2];

    • $open = $row[3];

    • $high = $row[4];

    • $low = $row[5];

    • $close = $row[6];

    • $volume = str_replace( "\n", "", $row[7] );

    • $split_adjustment = str_replace( "\n", "", $row[8] );

    • $today = date('Y-m-d');

    • if($split_adjustment=='') {

    • $split_adjustment = '0.00000';

    • }


    load_raw_prices() (cont’d)

    • $query = "INSERT INTO RAW_PRICE ( CUSIP, RIC, ASOF_DATE, OPEN, HIGH, LOW, CLOSE, VOLUME, SPLIT_FACTOR, LOAD_DATE ) VALUES ( "

    • . "'" . $cusip . "',"

    • . "'" .$ric . "',"

    • . "'" .$asof_date . "',"

    • . $open . ","

    • . $high . ","

    • . $low . ","

    • . $close . ","

    • . $volume . ","

    • . $split_adjustment . ","

    • . "'" . $today . "')" ;

    • # put the rows in the raw_prices table

    • sm_query( $query );

    • if (($counter%100)==0) {

    • echo $counter . " lines processed.\n";

    • }

    • }

    • echo $counter . " total lines processed.\n";

    • }


    daily_price_clean()

    • function daily_price_clean( $source_file, $new_file ) {

    • $lines = file($source_file);

    • foreach ($lines as $line_num => $line ) {

    • # strip "-9,999,401"

    • $line = str_replace("\"-9,999,401\"","NULL",$line);

    • # strip volume quotes and commas

    • $pieces = explode("\"",$line);

    • $pieces[1] = str_replace(",","",$pieces[1]);

    • $fixed_line = implode("",$pieces);

    • # do some more funky stuff to get the date re-arranged

    • $date_repair = explode(",",$fixed_line);

    • $date_digits = explode("/",$date_repair[2]);

    • $date_repair[2] = "20" . $date_digits[2] . "-" . $date_digits[0] . "-" . $date_digits[1];

    • $fixed_line2 = implode(",",$date_repair);

    • # write out new file

    • if ( !file_exists($new_file)) {

    • touch ($new_file);

    • }

    • $handle = fopen ($new_file, 'a');

    • fwrite($handle, $fixed_line2);

    • fclose($handle);

    • }

    • }


    load_secuirty()

    • function load_security( $security_file ) {

    • $lines = file($security_file);

    • $counter = 0;

    • foreach ($lines as $line_num => $line ) {

    • $counter = $counter+1;

    • $row = explode(",",$line);

    • $cusip = $row[0];

    • $ric = $row[1];

    • $ticker = $row[2];

    • $today = date('Y-m-d');

    • $query = "INSERT INTO SECURITY ( CUSIP, RIC, TICKER, CREATED_DATE ) VALUES ( "

    • . "'" . $cusip . "',"

    • . "'" . $ric . "',"

    • . "'" . $ticker . "',"

    • . "'" . $today . "')" ;


    load_secuirty() (cont’d)

    • # put the rows in the raw_prices table

    • sm_query( $query );

    • if (($counter%100)==0) {

    • echo $counter . " lines processed.\n";

    • }

    • }

    • echo $counter . " total lines processed.\n";

    • }


    load_prices()

    • function load_prices( $date ) {

    • $query = "INSERT INTO PRICE

    • SELECT

    • S.SECURITY_ID, RP.ASOF_DATE, RP.OPEN, RP.HIGH,

    • RP.LOW, RP.CLOSE, RP.VOLUME, RP.SPLIT_FACTOR,

    • date(now())

    • FROM

    • RAW_PRICE RP,

    • SECURITY S

    • WHERE

    • S.RIC = RP.RIC

    • AND

    • RP.ASOF_DATE = '" . $date . "'";

    • echo $query ;

    • sm_query( $query );

    • }


    Dependency Task Scheduling

    • Php and shell scripts are useful tools to download and process price data

    • But cron doesn’t do a very good job of keeping track in a database of when something starts, finishes, fails, fails to start

    • If email is broken or cron isn’t reporting correctly you may not know of problems until it’s too late

    • Often a layer of metadata fails b/c of failed or weird market data, a missing price can make a graph or signal look weird to customers

      • You can’t load prices if the ftp or feed fails

      • You can’t process corporate actions until you know the price

      • You can’t get accurate calculations against time-series if there’s holes in the series

      • You can’t send signals or present accurate graphs if anything related to a security fails

    • Keeping track of failed jobs gives you a flag that can also tell your users what they’re seeing is questionable and will be corrected

    • You can report on a jobs list and throw alerts on failed jobs


    Tracking variances in data quality

    • Price weirdness:

      • yesterday’s price / today’s price

    • Row weirdness:

      • num rows yesterday / num rows today

    • Range weirdness:

      • yesterday’s average of a sum / today’s average of a


    Questions?Acknowledgements

    • Starmine: Tripp, Flanzer, Foster, Breffle, Miller

    • Cake Financial: Reed


  • Login