Financial informatics
This presentation is the property of its rightful owner.
Sponsored Links
1 / 41

Financial Informatics: PowerPoint PPT Presentation


  • 59 Views
  • Uploaded on
  • Presentation posted in: General

Financial Informatics:. Startup, low-cost, dataload Challenges and Solutions. What are we talking about today?. Financial Data, more specifically stock market data as an example The basic design of a MySQL database that contains a daily history of stock prices

Download Presentation

Financial Informatics:

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Financial informatics

Financial Informatics:

  • Startup, low-cost, dataload Challenges and Solutions


What are we talking about today

What are we talking about today?

  • Financial Data, more specifically stock market data as an example

  • The basic design of a MySQL database that contains a daily history of stock prices

  • Building a stock machine and some of the challenges posed

  • Some large data ‘gotchas’ and solves

  • Some large mysql ‘gotchas’ and solves


Financial data

Financial data

  • csv records of what happened that day or a signal

  • often have unexplaind anomalies

  • daily arrival of row data which doesn’t conform to spec


1 big picture

1. Big Picture

  • Who is your audience? Make your analytics and application work with a small dataset first

  • Market data rules: You can’t scrape Yahoo

  • QA is not a bad word: Data Quality is key

  • What’s a security?

  • What’s a corporate action?

  • OLAP: This is once a day processing

  • Take performance of your dev boxes seriously: Dell 2950 with 32GB of ram, 6 disks, RAID10.


Where does financial data come from

Where does financial data come from?

  • Thompson / Reuters

  • McGraw Hill / Interactive Data

  • Securities and Exchange Commission

  • Dow Jones

  • Standard and Poors

  • Bloomberg

  • Lots of ‘boutique’ $100M companies


Market data rules

Market data rules

  • Information about a security’s trade on an exchange is owned by the exchange and distributed to those who have made a license agreement (Reuters, Interactive Data, et al.) Your license agreement with these 3rd parties will start at $20k-$50k a year

  • Scraping yahoo, msn money, Forbes or another site is infringement

  • There are different license levels with financial data providers, redistribution usually costs more than a quantitative black box

  • After three days most data is less valuable / expensive, you may get a bargain for dev phase

  • Working with financial data providers is a slow process, it may take you 8 weeks from your initial point of contact with a rep before securing a license agreement. Work with your business decision team to prepare for this

  • Even indexes like S&P 500 and industry data is under license.


What data do you need

What Data Do You Need?

  • Historical Price - Everyone needs this for charts, models, etc

  • Corporate Actions - Adjustments going forward for historical data

  • Real-time Price - You may want this for real-time charts (100’s of Megs a Day)

  • SEC Filings - You may want to decompose for quant models or present reports to users

  • 3rd Party Quant Data - Black box trading solution, quant box


Don t load everything day 1

Don’t load everything day 1

  • AAPL, INTL, T, X, XOM, DVW, DELL, GE

  • S&P 500

  • Russell 3000

  • FTSE

  • APAC

  • OTC / PINK / BB

  • Mutual Funds

  • Money Market

  • Indexes


What s a security

What’s a security?

  • Stocks, bonds, mutual funds and more

  • In this context traded on an exchange

  • A note held for you by your broker

  • Represents a debt to be paid by issuer -or-

  • Represents a share of the issuer -or-

  • Represents a bet on the issuer -or-

  • Represents an index of multiple securities -or-

  • Represents another abstraction of ownership or bet


Security

SECURITY


What s a corporate action

What’s a corporate action?

  • A change to an attribute of a security or a security’s price

  • Split; reverse split

  • Dividend

  • Name change

  • Listing; delisting;

  • Exchange change

  • Notes change

  • Regional change

  • Currency change


Corporate actions abridged 70 cols

CORPORATE_ACTIONS (ABRIDGED) 70 Cols!!!


Qa is not a bad word

QA is not a bad word

  • QA of financial data is much different than qa of software

  • row data can arrive empty, wrong, portions missing

  • row data can fail to arrive

  • stocks may be priced wrong

  • corporate actions may be for the wrong stock

  • Canadian stock can be listed in the us with Canadian dollar prices

  • all kinds of other fun

  • You must have Excel jockeys to identify and explain noise to:

    • Engineers

    • Your data provider

    • Your customers


2 table designs

2. Table Designs


2 table designs1

2. Table Designs

  • SECURITY - Attributes of a security

  • RAW_PRICE - Attributes of a security’s trades from csv, unadjusted

  • PRICE - Attributes of a security’s trades, adjusted for corporate actions

  • CORPORATE_ACTIONS - Change records of a security or price attributes

  • JOBS - Attributes of a job

  • COUNTRY - A reference table for a security’s country

  • EXCHANGE - A reference table for a security’s exchange

  • REGION - A reference table for a security’s region

  • SOURCE - A reference table of the data provider for a security


Security1

SECURITY


Security2

SECURITY

  • security_id is your abstraction of data industry identifiers

    • SECURITY_ID, your identifier int NOT NULL AUTO_INCREMENT unsigned pk

    • SECURITY_NAME, exchanges name for company

    • SOURCE_ID, what data provides this char(1)

    • CUSIP, us and canada unique identifier, char(9)

    • TICKER, an identifier, a gotcha, varchar(14)

    • SYMBOL, an identifier, a gotcha, varchar(14)

    • EXCHANGE_ID, what exchange is it traded on

    • REGION_ID, what region does this trade in int

    • COUNTRY_ID, what country does this trade in int


Security table

SECURITY table

  • Uses internal identifier SECURITY_ID

  • If you’re experimenting with different providers, SOURCE_ID should be added to pk

  • Holds providers key for a security (ric, symbol, ticker, cusip)

  • 500k rows max


Raw price

RAW_PRICE


Raw price table load everything

RAW_PRICE table (Load Everything)

  • The rows just as they’ve come from the provider with an artificial key

  • Price corrections with asof_date in the past may come in, check for these

  • Sometimes attributes don’t exist in source files, missing asof, open, etc, not null loses the whole row and it might take days to get another one resent


Price

PRICE


Price table

PRICE table

  • SECURITY_ID, your identifier int unsigned pk

  • ASOF_DATE, what data provides this char(1) pk

  • OPEN, the opening price decimal

  • LOW, the low price for the day decimal

  • CLOSE, the closing price for the day

  • HIGH, the high price for the day

  • VOLUME, how many shares sold that day

  • SPLIT ADJUSTMENT: (REUTERS, not COMSTOCK) multiplier decimal


Price table1

PRICE table

  • Only one price per security per day

  • Validation happens from RAW_PRICE to PRICE

  • Instead of bouncing rows you may consider a suspect data flag which bubbles up to UI


Corporate actions

CORPORATE_ACTIONS


Corporate actions table

CORPORATE_ACTIONS table

  • Comstock: Splits and Reverses are in this file

  • Reuters: Splits and Reverses are in price file

  • Denormalized - Boo!

  • Much of this information is display information

  • Changes to exchange or trading status are in here (bankruptcy, emerging from bankruptcy, changing from NASDAQ to OTC.BB, etc)

  • Dividend information is in here too


Country region exchange

COUNTRY, REGION, EXCHANGE


Country region exchnage tables

COUNTRY, REGION, EXCHNAGE tables

  • COUNTRY, keeps track of what country a security trades in

    • USA

    • CANADA

  • REGION, keeps track of what region a security trades in

    • NORTH AMERICA

    • APAC

  • EXCHANGE, keeps track of what Exchange a security is traded on

    • VANCOUVER

    • NASDAQ

    • NASDAQ OTC.BB


Source

SOURCE


Source table

SOURCE table

  • Keeps track of who provides what data in the security table

  • Good to side-by-side comparisons where data comes from two different providers

  • Helps build organizational knowledge over what providers have good data-quality


Data gotchas

Data Gotchas

  • Do: load everything, don’t build constraints based on provider specs prior to understanding the data

  • Do: use 5.0.31 or above with innodb

  • Do: wrap batches in BEGIN / END

  • Do: set innodb_rollback_on_timeout = ON

  • Do: stage feeds in raw tables b/c if you adjust for splits in the live history table and make mistakes you’re be loading millions of rows again

  • Don’t run things like:

    • exec(“mysql -u user -e “source /feed/load_statements.sql”);

  • Don’t: foreign keys until process is hardened or never


  • 3 gears

    3. Gears

    • load_raw_prices();

    • daily_price_clean();

    • load_security();

    • load_price();

    • split(); Special sauce for you to write

    • undo_split(); Ditto


    An approach to data loads

    An approach to data loads

    • Daily load phase 1

      • Get data from provider in csv or xml

      • Don’t translate

      • Import into raw tables

      • Run variance checks to throw alerts (~50k securities)

        • is ( yesterday n rows / today n rows ) between 99.99 and 100.01%?

    • Daily load phase 2

      • Load data into live tables

      • Make adjustments for corporate actions

      • Run your models

      • Run variance checks to throw alerts


    Load raw prices

    load_raw_prices()

    • function load_prices( $price_file ) {

    • $lines = file($price_file);

    • $counter = 0;

    • foreach ($lines as $line_num => $line ) {

    • $counter = $counter+1;

    • $row = explode(",",$line);

    • $cusip = $row[0];

    • $ric = $row[1];

    • $asof_date = $row[2];

    • $open = $row[3];

    • $high = $row[4];

    • $low = $row[5];

    • $close = $row[6];

    • $volume = str_replace( "\n", "", $row[7] );

    • $split_adjustment = str_replace( "\n", "", $row[8] );

    • $today = date('Y-m-d');

    • if($split_adjustment=='') {

    • $split_adjustment = '0.00000';

    • }


    Load raw prices cont d

    load_raw_prices() (cont’d)

    • $query = "INSERT INTO RAW_PRICE ( CUSIP, RIC, ASOF_DATE, OPEN, HIGH, LOW, CLOSE, VOLUME, SPLIT_FACTOR, LOAD_DATE ) VALUES ( "

    • . "'" . $cusip . "',"

    • . "'" .$ric . "',"

    • . "'" .$asof_date . "',"

    • . $open . ","

    • . $high . ","

    • . $low . ","

    • . $close . ","

    • . $volume . ","

    • . $split_adjustment . ","

    • . "'" . $today . "')" ;

    • # put the rows in the raw_prices table

    • sm_query( $query );

    • if (($counter%100)==0) {

    • echo $counter . " lines processed.\n";

    • }

    • }

    • echo $counter . " total lines processed.\n";

    • }


    Daily price clean

    daily_price_clean()

    • function daily_price_clean( $source_file, $new_file ) {

    • $lines = file($source_file);

    • foreach ($lines as $line_num => $line ) {

    • # strip "-9,999,401"

    • $line = str_replace("\"-9,999,401\"","NULL",$line);

    • # strip volume quotes and commas

    • $pieces = explode("\"",$line);

    • $pieces[1] = str_replace(",","",$pieces[1]);

    • $fixed_line = implode("",$pieces);

    • # do some more funky stuff to get the date re-arranged

    • $date_repair = explode(",",$fixed_line);

    • $date_digits = explode("/",$date_repair[2]);

    • $date_repair[2] = "20" . $date_digits[2] . "-" . $date_digits[0] . "-" . $date_digits[1];

    • $fixed_line2 = implode(",",$date_repair);

    • # write out new file

    • if ( !file_exists($new_file)) {

    • touch ($new_file);

    • }

    • $handle = fopen ($new_file, 'a');

    • fwrite($handle, $fixed_line2);

    • fclose($handle);

    • }

    • }


    Load secuirty

    load_secuirty()

    • function load_security( $security_file ) {

    • $lines = file($security_file);

    • $counter = 0;

    • foreach ($lines as $line_num => $line ) {

    • $counter = $counter+1;

    • $row = explode(",",$line);

    • $cusip = $row[0];

    • $ric = $row[1];

    • $ticker = $row[2];

    • $today = date('Y-m-d');

    • $query = "INSERT INTO SECURITY ( CUSIP, RIC, TICKER, CREATED_DATE ) VALUES ( "

    • . "'" . $cusip . "',"

    • . "'" . $ric . "',"

    • . "'" . $ticker . "',"

    • . "'" . $today . "')" ;


    Load secuirty cont d

    load_secuirty() (cont’d)

    • # put the rows in the raw_prices table

    • sm_query( $query );

    • if (($counter%100)==0) {

    • echo $counter . " lines processed.\n";

    • }

    • }

    • echo $counter . " total lines processed.\n";

    • }


    Load prices

    load_prices()

    • function load_prices( $date ) {

    • $query = "INSERT INTO PRICE

    • SELECT

    • S.SECURITY_ID, RP.ASOF_DATE, RP.OPEN, RP.HIGH,

    • RP.LOW, RP.CLOSE, RP.VOLUME, RP.SPLIT_FACTOR,

    • date(now())

    • FROM

    • RAW_PRICE RP,

    • SECURITY S

    • WHERE

    • S.RIC = RP.RIC

    • AND

    • RP.ASOF_DATE = '" . $date . "'";

    • echo $query ;

    • sm_query( $query );

    • }


    Dependency task scheduling

    Dependency Task Scheduling

    • Php and shell scripts are useful tools to download and process price data

    • But cron doesn’t do a very good job of keeping track in a database of when something starts, finishes, fails, fails to start

    • If email is broken or cron isn’t reporting correctly you may not know of problems until it’s too late

    • Often a layer of metadata fails b/c of failed or weird market data, a missing price can make a graph or signal look weird to customers

      • You can’t load prices if the ftp or feed fails

      • You can’t process corporate actions until you know the price

      • You can’t get accurate calculations against time-series if there’s holes in the series

      • You can’t send signals or present accurate graphs if anything related to a security fails

    • Keeping track of failed jobs gives you a flag that can also tell your users what they’re seeing is questionable and will be corrected

    • You can report on a jobs list and throw alerts on failed jobs


    Tracking variances in data quality

    Tracking variances in data quality

    • Price weirdness:

      • yesterday’s price / today’s price

    • Row weirdness:

      • num rows yesterday / num rows today

    • Range weirdness:

      • yesterday’s average of a sum / today’s average of a


    Questions acknowledgements

    Questions?Acknowledgements

    • Starmine: Tripp, Flanzer, Foster, Breffle, Miller

    • Cake Financial: Reed


  • Login