
(7) Data, Privacy and Metrics



Presentation Transcript


  1. The Networked Economy: Information Management, Strategy, and Innovation • (7) Data, Privacy and Metrics

  2. Agenda • Role of data in decision making • Size and cost of storage

  3. River Nile

  4. Notre Dame de Paris

  5. From Faith to Data • The Era of Faith • Massive investments into cathedrals etc. • Unclear ROI (Return on Investment) • No feedback, or a long feedback cycle • The Era of Data • Massive investments into measuring, networking, communication, storage • ROI measurable • Short feedback cycle • Experiments

  6. Characteristics of our Era • What do we do with data? • Gather data • Explore data • Exploit data • Publish data • Archive data • Questions people ask about data: What does it mean? Too much of it… Will it integrate with my systems? How can I act on it? • Opportunities and challenges for marketers, publishers, agencies…

  7. Measuring information and storage • Relationship between byte (B) and bit (b): 1 byte = 8 bits

  8. Measuring information and storage

  9. Internet • Surface web • = static pages • 10 billion pages • 10 kB … 100 kB / page • → 100 TB … 1 PB total storage • Deep web • 10x size of surface web • Email • 3 billion email accounts • 10 emails / day / account • → 30 billion emails / day • 1 kB / email • → 30 TB traffic per day • → 100 petabyte / year • Storage cost (2008 ASW) • 1 petabyte = USD 100k
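
The estimates on this slide are simple multiplications, and a short script makes them easy to re-run with other inputs. The sketch below assumes decimal units (1 kB = 1,000 bytes) and uses only the slide's round numbers; the variable names are illustrative. Straight multiplication gives about 11 PB of email per year, which is an order of magnitude below the slide's rounded 100 PB figure; the slide does not say what else that figure includes.

```python
# Back-of-envelope estimates using the slide's round numbers (decimal units).
KB, TB, PB = 10**3, 10**12, 10**15

pages = 10 * 10**9                       # surface web: 10 billion pages
surface_low = pages * 10 * KB            # at 10 kB per page
surface_high = pages * 100 * KB          # at 100 kB per page

emails_per_day = 3 * 10**9 * 10          # 3 billion accounts x 10 emails per day
email_bytes_per_day = emails_per_day * 1 * KB
email_bytes_per_year = email_bytes_per_day * 365

print(f"Surface web: {surface_low / TB:.0f} TB to {surface_high / PB:.0f} PB")
print(f"Email: {email_bytes_per_day / TB:.0f} TB per day")
print(f"Email: {email_bytes_per_year / PB:.1f} PB per year by straight multiplication")
```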

  10. Turning behavior into data • Revealed preferences • Music • Search • Online trading • Online dating

  11. Additional sources of data about people • Movement • Mobile phones • GPS

  12. "Pay as you drive" insurance

  13. Everything can and will become data • Additional sources of data about people • Movement • Mobile phones, GPS • Brain activity • Neuromarketing • fMRI analysis of response to stimuli • RFIDs (Radio Frequency Identifiers) • Unique identifiers for objects, bridging physical and digital

  14. RFIDs and e-business • Facts • Price: 2 US cents • Size: 2 mm • It will happen: big business • Opportunities • Inventory systems, supply chain • Wal-Mart saves USD 8 billion per year by using RFIDs • Shipping screw-ups: 1 in 20 • Personalization • Fears • Loss of privacy • Abuse of data • Consumers need to be educated to make informed, conscious decisions about their data • This level of transparency is "native" in e-business

  15. Aspects of privacy • Information • name, address, hobbies, … • Communication • phone calls, e-mail, SMS, … • Territory • privacy of your office, home, bedroom, … • Bodily privacy • strip searches, drug tests, …

  16. Some privacy concerns • Collection and storage • Extensive amounts of personally identifiable data are collected and stored • Unauthorized secondary use • Information collected from individuals for one purpose is used for another purpose without authorization from the individuals • Improper access • Data about individuals are available to people not authorized to view or work with these data • Combining data • Personal data in disparate databases may be combined into larger databases • *Source: Smith, H. J., Milberg, S. J., Burke, S. J., "Information Privacy: Measuring Individuals' Concerns About Organizational Practices", MIS Quarterly, June 1996

  17. Errors in personal data • People worry that protections against errors in personal data are inadequate • Errors can be accidental or deliberate • People increasingly demand access to their personal information • Revealed preferences often differ from stated preferences • Perception often matters more than objective facts • Question: • Describe processes by which people can correct errors in data about themselves

  18. Different people have different privacy concerns • [Figure: four segments plotted on two axes, profile revelation and identity revelation, each on a scale of 1 to 5 running from "Never" through "Under certain conditions" to "Always"] • Privacy fundamentalist: 30% • Profiling averse: 26% • Identity concerned: 20% • Marginally concerned: 24%

  19. Privacy is becoming increasingly relevant • Personal information becomes ubiquitous with electronic transactions • Personal information is at the core of privacy • Privacy is a fundamental right that has been recognized by democratic societies across centuries and across geographies • Privacy is a proven customer concern • Privacy breaches are increasingly a relevant social cost (including for companies) • As companies have begun to treat customer information as an asset, people learn to consider their information as an asset

  20. People increasingly distrust the way companies deal with their data • Survey statement: "Most businesses handle the personal information they collect about consumers in a proper and confidential way." • Strongly/Somewhat Disagree: 1999: 34% • 2000: 43% • 2001: 56% • *Source: Ernst & Young, "Privacy: What Consumers Want", January 2003

  21. People feel increasingly less protected • Survey statement: "Existing laws and organizational practices provide a reasonable level of protection for consumer privacy today." • Strongly/Somewhat Disagree: 1999: 38% • 2000: 47% • 2001: 62% • *Source: Ernst & Young, "Privacy: What Consumers Want", January 2003

  22. Privacy backlash could have a considerable impact on a company's bottom line. • Survey question: "If you were to hear or read that a company with which you were a customer was collecting, sharing or using customers' personal information in a way you did not think was proper, which one of the following best describes what you would do?" • Stop doing business with the company: 83% • Reduce business with the company: 16% • Continue doing business with the company because it does not matter to me: 1% • *Source: Ernst & Young, "Privacy: What Consumers Want", January 2003

  23. Professionals are significantly less concerned about privacy issues when they are asked as professionals than when they are asked in private.* • Scale: 1 = unimportant, 7 = extremely important • *Source: Esrock, S. L., Ferré, J. P., "A Dichotomy of Privacy: Personal and Professional Attitudes of Marketers", Business and Society Review, 104:1, 1999, pp. 107-120

  24. Privacy Principles by the US Federal Trade Commission (FTC): five principles to ensure that privacy is respected

  25. Storage is free • [Charts: cost per gigabyte (USD) by year; hard disk storage capacity (PB, up to 10 exabytes) by year] • Dramatic drop in price (2008: 1 GB costs 10 US cents) • Exponential increase in storage

  26. sina.com on Oct 8, 1997 (web.archive.org)

  27. Internet Archive • http://web.archive.org • Stores versions of the surface web since 1996 • Collected via opt-out • 1 TB / day raw data • 1 petabyte stored in total

  28. Why now? • [Charts, 1990 to 2010: data collected per day over time, implicit (clicks etc.) vs explicit (surveys etc.); cost of communication and storage over time] • Data collected implicitly: dramatic growth over time • Data collected explicitly: amount constant over time

  29. Why now? • [Charts, 1990 to 2010: data collected per day over time, implicit (clicks etc.) vs explicit (surveys etc.); cost of communication and storage over time] • Malthus's Law of Information: • New information content is doubling every year • Time spent on information consumption is constant

  30. Why now? • Malthus's Law of Information: • New information content is doubling every year • Time spent on information consumption is constant • [Chart label: Communication]

  31. Voice over IP (VoIP) • [Chart: number of local IP phone users in the US, in millions, including forecast] • IP := Internet Protocol • Traditional phones are on their way out • Example: Skype • Skype to Skype: free • Skype to phone: 1 cent / min • Concurrent users (3/06): 5M • Why is it so inexpensive?

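  32. Number of words transmitted vs cost of transmission (US, 1960-1980) • [Chart: words produced per year in the US (in trillions) against cost of transmission per 1,000 words (in 1972 dollars); media plotted: television, radio, newspapers, magazines, direct mail, education, telephone, books, cable TV, mail, movies, data communication, fax, telegram, mailgram, telex]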

  33. Large e-business company: amount of data created per year • New data per year, by level: • Customer: 100 MB • Orders: 10 GB • Session aggregates: 1 TB • Clicks: 100 TB

  34. The iterative process of modeling and decision making • 1. (Re-)Define: business metrics, objectives and baselines • 2. Measure: collect, store, manage data • 3. Describe: exploratory data analysis • 4. Predict and evaluate: probabilistic models • 5. Decide, act, and evaluate

  35. 1. Business metrics and objectives • Stock price • Profit • Number of items sold • Number of visits • Conversion rate • Customer acquisition • Customer retention • Customer satisfaction • [Diagram label: Trade-off]

  36. 2. Measure: customer-company interactions • Orders • Overall use of the site • Buying vs selling • Searching vs browsing • Engagement: reviews, etc. • Customer service contacts • E-mail, phone • Surveys • Satisfaction • Intentions and goals • Customer service response • Resolution • Free replacement, refund • Delivery date: actual vs promised • Number of items returned in a search • E-mail campaigns and responses • [Diagram labels: Company behavior; Customer behavior]

  37. Why is it hard? • Even simple behavioral analysis requires significant infrastructure

  38. Business questions • How many people are coming to my site? • Who are they? • Where are they coming from? • What are they doing? • Who's coming back and how frequently? • How is all of this changing over time? • What is the impact of a recent site change?

  39. Twyman's Law • Any statistic that appears interesting is almost certainly a mistake • Validate "amazing" discoveries in different ways • They are usually the result of a business process • 5% of customers were born on the exact same day (including year) • 11/11/11 is the easiest way to satisfy the mandatory birth date field • For US Web sites, there will be a small sales increase on Oct 4, 2008, for European sites on Nov xx 2008
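
One way to surface the kind of anomaly described above is to check how much of a field is taken up by its single most frequent value. The following is a minimal sketch on made-up data; the helper name `most_common_share` and the sample birth dates are assumptions for illustration, not part of the original analysis.

```python
from collections import Counter

def most_common_share(values):
    """Return the most frequent value and the fraction of records that carry it."""
    counts = Counter(values)
    value, count = counts.most_common(1)[0]
    return value, count / len(values)

# Hypothetical sample of 100 birth dates: 95 distinct dates plus 5 placeholder entries.
birth_dates = ["11/11/11"] * 5 + [
    f"{month:02d}/{day:02d}/{70 + month}"
    for month in range(1, 6)
    for day in range(1, 20)
]

value, share = most_common_share(birth_dates)
print(f"Most common birth date: {value} ({share:.0%} of customers)")
```

A share this large for one exact date is a prompt to look at the form and the business process behind it before drawing conclusions about customers.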

  40. Some experiences • Synchronize clocks from all data collection points • Example: some servers were set to GMT and others to Pacific time, leading to strange anomalies • Even being a few minutes off can cause add-to-carts to appear "prior" to the search • Remove test data • QA organizations constantly test the system • Make sure the data can be identified and removed from analysis • Remove robots/bots/spiders • 5-40% of e-commerce site traffic is generated by crawlers from search engines (and students doing problem sets) • These significantly skew results unless removed
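
The three clean-up steps above can be sketched in a few lines. This is an illustration under stated assumptions rather than a production filter: the event records, the `qa-` account prefix, and the user-agent substrings are all hypothetical, and real bot detection needs more than substring matching.

```python
from datetime import datetime, timedelta, timezone

BOT_MARKERS = ("bot", "crawler", "spider")  # assumed substrings, not an exhaustive list

def to_utc(local_time, utc_offset_hours):
    """Attach the server's UTC offset to a naive timestamp and convert it to UTC."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return local_time.replace(tzinfo=tz).astimezone(timezone.utc)

def is_bot(user_agent):
    """Flag user agents that contain one of the assumed bot markers."""
    agent = user_agent.lower()
    return any(marker in agent for marker in BOT_MARKERS)

def is_test(event):
    """Flag events from QA accounts (hypothetical naming convention: 'qa-' prefix)."""
    return event["account"].startswith("qa-")

# Hypothetical events: one server logs in GMT (offset 0), another in Pacific time (offset -8).
events = [
    {"name": "add_to_cart", "time": datetime(2008, 1, 15, 13, 2), "offset": -8,
     "agent": "Mozilla/4.0", "account": "alice"},
    {"name": "search", "time": datetime(2008, 1, 15, 21, 0), "offset": 0,
     "agent": "Mozilla/4.0", "account": "alice"},
    {"name": "page_view", "time": datetime(2008, 1, 15, 21, 1), "offset": 0,
     "agent": "Googlebot/2.1", "account": "anonymous"},
    {"name": "order", "time": datetime(2008, 1, 15, 21, 3), "offset": 0,
     "agent": "Mozilla/4.0", "account": "qa-smoke-test"},
]

clean = sorted(
    (e for e in events if not is_bot(e["agent"]) and not is_test(e)),
    key=lambda e: to_utc(e["time"], e["offset"]),
)
names = [e["name"] for e in clean]
print(names)
```

Sorted by raw local timestamps, the add-to-cart (13:02 Pacific) would appear eight hours before the search (21:00 GMT); after conversion to UTC the search comes first.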

  41. Picking the right visualization is key to seeing patterns • Traffic by day • Easy to see weekends • Difficult to see other patterns • Heat map • Shows traffic colored from green to yellow to red • Utilizes cyclical nature of the week • Note 9/3 (Labor Day) and 9/11 • [Chart annotation: Weekends]
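
The heat map relies on laying out daily traffic as one row per week and one column per weekday, so that weekly cycles line up vertically and unusual days stand out. Below is a minimal sketch of that layout step only (coloring is left to whatever plotting tool is used); the traffic numbers are invented, and 2007 is assumed as the year because September 3 was Labor Day that year.

```python
from datetime import date, timedelta

def week_grid(daily_counts):
    """Arrange {date: count} into rows of ISO weeks with 7 weekday columns (Mon..Sun)."""
    grid = {}
    for day, count in daily_counts.items():
        iso_year, iso_week, iso_weekday = day.isocalendar()
        row = grid.setdefault((iso_year, iso_week), [None] * 7)
        row[iso_weekday - 1] = count
    return [grid[key] for key in sorted(grid)]

# Hypothetical traffic: 100 visits on weekdays, 60 on weekends, and a dip on Labor Day.
start = date(2007, 8, 27)
daily_counts = {}
for offset in range(14):
    day = start + timedelta(days=offset)
    daily_counts[day] = 60 if day.isoweekday() >= 6 else 100
daily_counts[date(2007, 9, 3)] = 50

grid = week_grid(daily_counts)
for row in grid:
    print(row)
```

In the grid, the weekend columns are low in every row, while the Labor Day dip appears as a single low cell in a weekday column.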

  42. Business-level lessons • Collect operational business data • Data usually not in web logs • Searches • Response times to return results • Shopping cart events • Registration forms • External events • Marketing promotions • Site changes • Choose to collect as much data as you realistically can because you do not know what might be relevant for a future question • Consider privacy issues • Often aggregated or anonymous data suffices

  43. Collection example: form errors • Here is a good example of data collection that was introduced without knowing a priori whether it would help: form errors • If a web form was filled in and a field did not pass validation, log the field and value • This was the Bluefly home page when they went live • Looking at form errors, we saw thousands of errors every day on this page • Any guesses?
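
The idea of logging the field and value on every failed validation can be sketched as follows. The form name, fields, and validation rules here are hypothetical, and the in-memory list stands in for whatever log storage a real site would use; logged values may contain personal data, so the privacy points from earlier slides apply.

```python
import re

form_error_log = []  # stand-in for persistent log storage

# Hypothetical fields and validation rules.
VALIDATORS = {
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "zip": lambda v: re.fullmatch(r"\d{5}", v) is not None,
}

def validate_form(form_name, fields):
    """Validate each known field; log the field and value for every failure."""
    ok = True
    for name, value in fields.items():
        check = VALIDATORS.get(name)
        if check is not None and not check(value):
            form_error_log.append({"form": form_name, "field": name, "value": value})
            ok = False
    return ok

accepted = validate_form("signup", {"email": "not-an-email", "zip": "94305"})
print(accepted, form_error_log)
```

Counting log entries per form and field is then enough to spot a page that produces thousands of errors a day.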

  44. Summary • Think about the problem end-to-end • Collection • Transformations • Reporting • Visualizations • Modeling • Taking action • Agree on terminology • How do you define a session? • How do you define a customer? (e.g., did every customer make a purchase?) • Beware of hidden variables when concluding causality • E.g., Simpson's paradox • Conduct controlled experiments (A/B tests) when possible: our intuition is poor
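
Simpson's paradox, mentioned above as an example of a hidden variable, is easiest to see with numbers. The sketch below uses invented counts for two site variants split by a hypothetical visitor segment: variant A converts better within each segment, yet B looks better overall because the two variants received very different mixes of visitors.

```python
# Hypothetical (conversions, visits) per variant and visitor segment.
data = {
    "A": {"returning": (81, 87), "new": (192, 263)},
    "B": {"returning": (234, 270), "new": (55, 80)},
}

def rate(conversions, visits):
    """Conversion rate as a fraction."""
    return conversions / visits

def overall(variant):
    """Conversion rate of a variant with the segments pooled together."""
    conversions = sum(c for c, _ in data[variant].values())
    visits = sum(v for _, v in data[variant].values())
    return rate(conversions, visits)

for segment in ("returning", "new"):
    a = rate(*data["A"][segment])
    b = rate(*data["B"][segment])
    print(f"{segment}: A = {a:.1%}, B = {b:.1%}")
print(f"overall: A = {overall('A'):.1%}, B = {overall('B'):.1%}")
```

A controlled experiment avoids this trap by assigning visitors to variants at random, so the segment mix is the same on both sides.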

  45. Weblog entry: 209.209.111.59 - - [29/Jun/2006:13:38:50 -0700] "GET / HTTP/1.1" 200 17497 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
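
The entry above has the shape of the common "combined" web server log layout (host, identity, user, timestamp, request, status, size, referrer, user agent). As a rough sketch, assuming that layout, a regular expression can split it into named fields; the pattern and field names below are illustrative and not taken from the slides.

```python
import re
from datetime import datetime

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = (
    '209.209.111.59 - - [29/Jun/2006:13:38:50 -0700] "GET / HTTP/1.1" 200 17497 '
    '"-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"'
)

match = LOG_PATTERN.match(line)
entry = match.groupdict()
# Parse the timestamp, keeping the server's UTC offset so clocks can be aligned later.
entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")

for key, value in entry.items():
    print(f"{key}: {value}")
```

Keeping the UTC offset in the parsed timestamp makes it possible to compare entries from servers configured for different time zones.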

  46. APPENDIX

  47. Books and Libraries • 30 million books • 1 MB per book (text) • 100 characters per line • 100 lines per page • 100 pages per book • 30 TB for text • 1 petabyte = $1 million • Are books the right medium for archiving? • Digital storage cheap: $10k for all • Scanning expensive: $10 per book • But: $300M in one shot, then done forever! • Scanning all books is half a year of the Library of Congress's budget • Books in print: 3.2 million • Books sold in US in 1999: 1.1 billion
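
The library estimate above can be reproduced step by step. The sketch assumes one byte per character and decimal units, and uses only the slide's figures. At the slide's price of $1 million per petabyte, 30 TB works out to about $30k, the same order of magnitude as the slide's rounded "$10k for all".

```python
MB, TB, PB = 10**6, 10**12, 10**15

bytes_per_book = 100 * 100 * 100     # chars per line x lines per page x pages per book
books = 30 * 10**6                   # 30 million books
text_bytes = books * bytes_per_book

storage_cost_usd = text_bytes * 1_000_000 // PB   # slide's price: 1 PB = USD 1 million
scanning_cost_usd = books * 10                    # slide's price: USD 10 per book

print(f"Per book: {bytes_per_book / MB:.0f} MB")
print(f"All text: {text_bytes / TB:.0f} TB")
print(f"Storage: ${storage_cost_usd:,}; scanning: ${scanning_cost_usd:,}")
```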

  48. Internet • Web • Surface web • = static pages • 10 billion pages • 10 kB … 100 kB / page • → 100 TB … 1 PB total storage • Deep web • = underlying databases • 10x size of surface web • → 1 … 10 petabyte • Email • 1 billion email accounts • 10 emails / day / account • → 10 billion emails / day • 1 kB / email • → 10 TB traffic per day • → 30 petabyte / year • Comparison (banner ads) • 4 billion ads / day served by DoubleClick

  49. Information production • Surveillance • 30 exabyte / year • 30M cameras • 3 frames / sec -> 100M pics / sec • 10 kB / pic -> 1 TB / sec • 100k secs / day -> 100 petabyte / day • One day of production of surveillance cameras = 1 year of all email traffic = 100+ years of data stored by Amazon.com
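
The surveillance figures chain together as a sequence of multiplications. The sketch below redoes them with unrounded intermediate values (86,400 seconds per day instead of the slide's 100k), assuming decimal units; the results land close to the slide's rounded numbers, about 78 PB per day and 28 EB per year.

```python
KB, TB, PB, EB = 10**3, 10**12, 10**15, 10**18

cameras = 30 * 10**6                       # 30 million cameras
pics_per_sec = cameras * 3                 # 3 frames per second per camera
bytes_per_sec = pics_per_sec * 10 * KB     # 10 kB per picture
bytes_per_day = bytes_per_sec * 86_400     # exact seconds per day
bytes_per_year = bytes_per_day * 365

print(f"{pics_per_sec / 10**6:.0f}M pictures per second")
print(f"{bytes_per_sec / TB:.1f} TB per second")
print(f"{bytes_per_day / PB:.1f} PB per day")
print(f"{bytes_per_year / EB:.1f} EB per year")
```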
