Integration of Biological Databases

University of Tehran, Faculty of Engineering, School of Electrical and Computer Engineering
By: Kaveh Kavousi
Supervisor: Dr. Masoud Rahgozar

Outline
• Database schema
• Conceptual schema
• Logical schema
• Physical schema
• Schema objects
• Traditional integration process
• View integration
• Data interoperability
• Inter-organizational integration
• Architecture of federated database systems
• Data warehousing
• The web era, data alignment
• Data heterogeneity
• Types of data incompatibility
• Semantic
• Quantitative

Outline (cont.)
• Challenges of managing bioinformatics databases
• Motivations for integrating biological databases

Database Schema
The structure of a database system, described in a formal language, is called the schema of that database. In a relational database, the schema defines the tables, the fields in each table, and the relationships between those fields and tables. Database schemas are generally stored in the database's data dictionary. Although a schema is defined in the language of the database, the term is often used to refer to a graphical depiction of the database structure.
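As a minimal sketch of a schema written in the database's own language and then read back from its catalog, the following uses Python's built-in `sqlite3` module. SQLite is only a stand-in here, and the `author` and `book` tables are hypothetical; SQLite's `sqlite_master` table plays the role of the data dictionary.

```python
import sqlite3

# Hypothetical two-table schema, written in SQL (the database's own language).
DDL = """
CREATE TABLE author (
    author_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE book (
    isbn      TEXT PRIMARY KEY,
    title     TEXT NOT NULL,
    author_id INTEGER REFERENCES author(author_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# SQLite stores schema definitions in its catalog table, sqlite_master.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# PRAGMA table_info lists the fields of one table; column 1 is the field name.
book_columns = [row[1] for row in conn.execute("PRAGMA table_info(book)")]

print(tables)
print(book_columns)
```

Other database systems expose the same information through their own catalogs, so the queries above are specific to SQLite.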

Database Schema (cont.)
• Levels of a database schema
  - Conceptual schema: a map of concepts and the relationships between them
  - Logical schema: a map of entities, their attributes, and the relationships between them
  - Physical schema: a particular implementation of a logical schema
  - Schema objects, such as the objects in Oracle or SQL Server

Conceptual Schema
A conceptual schema, also called a conceptual data model, is a map of concepts and the relationships between them. It describes the semantic relationships among the parts of an organization and states assumptions about the organization's nature. Specifically, it describes the entity classes, the characteristics of their attributes, and the relationships. Because this level of schema captures the semantics of an organization rather than a database design, it is expressed abstractly. It is regarded as a superset of the data model: a data model is built from one person's perspective of the outside world and has an inflexible framework, whereas a conceptual schema can encompass all such perspectives. This schema supports inheritance structures. In effect, a conceptual schema can be viewed as a combination of several logical schemas, each built on one way of thinking.

Logical Schema
A logical schema is a data model of a specific problem, expressed in terms of the capabilities and constraints of a particular data management technology. Without being tied to a specific database management product on the market, it may, depending on the technology used, employ terms such as relational tables and columns, object-oriented classes, or XML tags. This contrasts with a conceptual data model, which makes no reference to the technology used, and with a physical data model, which addresses the details of how data is stored and retrieved in the physical environment.

Physical Schema
This schema describes how data is stored on physical storage media. Storage may be centralized and contiguous, or distributed. The physical schema's view of data is closely tied to the storage method and its physical environment.

Schema Objects
A schema object is a logical structure for storing data. A schema object does not have a one-to-one correspondence with the physical files on disk that store its data. In Oracle, for example, the data of a schema object may be held in one or more Oracle tablespaces, and the user can specify how much space should be allocated to the object. Tables, indexes, views, database links, procedures, triggers, functions, and so on are examples of schema objects.

Historical Overview
• Traditional integration process
• View integration
• Data interoperability
• Inter-organizational integration
• Service-layer architecture of federated database systems
• Data warehousing
• The web era, data alignment

View Integration (1965-75)
• Moving from the differing requirements of application programs toward a database schema
[Figure: requirements of application 1, application 2, application 3, …, application n feed into an integrated conceptual schema]

Data Interoperability (1975-80)
[Figure label: + INTEGRATION]
• Location transparency: global schema
  - Global access: distributed databases
  - Global access + local access: federated databases
• Location visibility (no location transparency): no global schema
  - MultiDB views, multi-DB access language: multidatabase systems
• Unstructured or semi-structured data (files, repositories, knowledge bases, spreadsheets, …)
  - Information exchange protocols / languages: -

Inter-organizational Integration (1980-95)
• Federated databases
• Definition of interoperability: the ability of two or more systems to transfer and exchange information and to use that information
[Figure labels: I.S. = interoperability software; network; DBMS 1, DBMS 2, DBMS 3; DB1, DB3; FED. SCHEMA A, FED. SCHEMA B]

Inter-organizational Integration (cont.)
[Figure, courtesy Oracle. Labels: Synthesized Information; Order Management Application; Financial Application; CRM Application; Shipping and Distribution Application; Contract Management Application]

Architecture of Federated Database Systems
[Figure labels: Filtering; Integration; Translation: Wrappers / Mediators; local export schemas]

Data Warehousing (1995-2000)
• Architecture of a traditional data warehouse

Heterogeneous Structured Information Sources in Data Warehousing
• Applying the transformations needed for homogenization
• Integration
[Figure labels: source DBs; transformation; homogenized DBs; integration; DW]

Integration in the Web Era: Catalog Matching
• For a company to have a successful presence in the market (eBay, for example), the entries of its catalogs must be matched against those of the catalogs available in the marketplace.
• Once this matching between two different schemas is established, the next step is to create queries that automatically translate data from the existing catalogs and transfer it to the target catalog, where the information is integrated.
• After the catalogs are matched, users have uniform, integrated access to the items offered for sale.
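The translation step that follows catalog matching can be sketched as below. This is an illustration only: the field correspondence `FIELD_MAP`, the field names, and the sample entries are all hypothetical, and a real system would derive the correspondence from the schema matching step.

```python
# Hypothetical correspondence between a seller catalog's field names
# and the field names of the target (marketplace) catalog.
FIELD_MAP = {"item_name": "title", "cost": "price", "maker": "brand"}


def translate_entry(entry, field_map):
    """Rename the mapped fields of one source entry; unmapped fields are dropped."""
    return {target: entry[source]
            for source, target in field_map.items() if source in entry}


source_catalog = [
    {"item_name": "USB cable", "cost": 4.5, "maker": "Acme", "internal_code": "X1"},
    {"item_name": "Keyboard", "cost": 30.0, "maker": "Acme"},
]

# Every entry now uses the target catalog's vocabulary.
target_catalog = [translate_entry(e, FIELD_MAP) for e in source_catalog]
print(target_catalog)
```

Dropping unmapped fields such as `internal_code` is a design choice of this sketch; an integrator might instead keep them under a source-specific prefix.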

The Problem of Information Heterogeneity
[Figure labels: Taxonomy; Data "Conflicts"; Schemas; Model; Structural; Generalization; Specialization; Aggregation; Typing; Completeness; Syntactical; Semantic; Values; Language; Cognitive]

A Four-Layer Methodology for Integrating Heterogeneous Databases

Library Example (Homogeneous)
• Same content, different structures
Schema 1: Book (title, ISBN, authors (name, birthdate))
Schema 2: Author (name, birthdate, books (title, ISBN))
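Because the two library schemas hold the same content in different structures, records can be restructured from one into the other. The sketch below converts Schema 1 (books with nested authors) into Schema 2 (authors with nested books). It assumes, for illustration only, that an author is identified by the pair (name, birthdate); the sample records are hypothetical.

```python
def books_to_authors(books):
    """Restructure Book -> authors records into Author -> books records."""
    authors = {}
    for book in books:
        for a in book["authors"]:
            key = (a["name"], a["birthdate"])  # assumed author identity
            record = authors.setdefault(
                key, {"name": a["name"], "birthdate": a["birthdate"], "books": []})
            record["books"].append({"title": book["title"], "ISBN": book["ISBN"]})
    return list(authors.values())


# Hypothetical Schema 1 data.
schema1 = [
    {"title": "Book A", "ISBN": "111", "authors": [
        {"name": "Author X", "birthdate": "1950-01-01"},
        {"name": "Author Y", "birthdate": "1960-02-02"}]},
    {"title": "Book B", "ISBN": "222", "authors": [
        {"name": "Author X", "birthdate": "1950-01-01"}]},
]

schema2 = books_to_authors(schema1)
print(schema2)
```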

Student Thesis Example (Heterogeneous)
• Different data models, overlapping content
Schema S1 (OO): Person (Pin, Name); Faculty (Rank); Student (GPA)
Schema S2 (relational): Thesis (Phd-advisor, Phd-student, title)
The integrated schema (OO): Person (Pin, Name); Faculty (Rank); Student (GPA); PhD Student; Phd-advisor; Thesis (Adv., Title, Student)

Data Preprocessing
• Data cleaning
  - Filling in missing values, removing noise, removing or correcting outliers, and resolving inconsistencies and contradictions
• Data integration
  - Integrating multiple databases, data cubes, or files
• Data transformation
  - Normalization and aggregation
• Data reduction
  - Representing the data in a smaller volume such that the results of analysis do not differ significantly (for example, reducing the dimensionality of the feature vector)
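Two of these preprocessing steps can be sketched in a few lines. The slides do not prescribe particular methods, so mean imputation for missing values and min-max scaling for normalization are assumptions chosen for illustration, and the sample values are hypothetical.

```python
from statistics import mean


def fill_missing(values):
    """Data cleaning: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]


def min_max_normalize(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values equal; avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]


raw = [2.0, None, 4.0, 6.0]
cleaned = fill_missing(raw)
normalized = min_max_normalize(cleaned)
print(cleaned, normalized)
```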

Types of Incompatibility in Local Schemas
• Semantic incompatibilities
• Quantitative incompatibilities

Semantic Incompatibilities
Naming Conflicts
In any data model, the schemata incorporate names for the various entities/objects they represent. Since these schemata are designed independently, the designer of each schema uses his or her own vocabulary to name these objects. Objects in different schemata that represent the same real-world concept may therefore have dissimilar names.

Semantic Incompatibilities (cont.)
Naming Conflicts (cont.)
• Homonyms: This inconsistency arises when the same name is used for two different concepts. For example, 'SALARY' may mean weekly salary in one database, and monthly salary in another.
• Synonyms: This type of naming conflict arises when the same concept is identified by two or more names. For example, the term 'DOMESTIC CUSTOMER' in one database may refer to the same concept as the term 'BUYERS' in another database.
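One way to handle both kinds of naming conflict is a lookup keyed on the source database together with the local name. The sketch below is an assumption-laden illustration: the database labels `db1` and `db2` and the canonical names are hypothetical, while the local names come from the examples above.

```python
# (source database, local name) -> canonical name in the integrated schema.
# Synonyms map to one canonical name; homonyms map to different ones.
CANONICAL = {
    ("db1", "DOMESTIC CUSTOMER"): "CUSTOMER",
    ("db2", "BUYERS"): "CUSTOMER",
    ("db1", "SALARY"): "WEEKLY_SALARY",
    ("db2", "SALARY"): "MONTHLY_SALARY",
}


def canonical_name(source, local_name):
    """Resolve a local name; names without a known conflict pass through unchanged."""
    return CANONICAL.get((source, local_name), local_name)


print(canonical_name("db1", "DOMESTIC CUSTOMER"), canonical_name("db2", "BUYERS"))
print(canonical_name("db1", "SALARY"), canonical_name("db2", "SALARY"))
```

Keying on the source as well as the name is what separates the homonyms: the same local name resolves differently depending on where it came from.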

Semantic Incompatibilities (cont.)
Type Conflicts
These conflicts arise when the same concept is represented by different coding constructs in different schemata. For example, an object may be represented as an entity in one schema and as an attribute in another schema.

Semantic Incompatibilities (cont.)
Key Conflicts
Different keys may be assigned to the same concept in different schemata [15], [46]. For example, SS# and EMP-ID may be keys for employees in two component schemata.

Semantic Incompatibilities (cont.)
Behavioral Conflicts
These conflicts arise when different insertion/deletion policies are associated with the same class of objects in different schemata. For example, in one database the relation DEPT may exist without any employee records associated with it, whereas in another database the deletion of the last employee record may also delete the relation DEPT from the database.

Semantic Incompatibilities (cont.)
Missing Data
Different attributes may be defined for the same concept in different schemata. For example, EMP1(SSN, NAME, AGE) and EMP2(SSN, NAME, ADDRESS) may represent the same concept in two database schemata. Attribute 'AGE' is missing in EMP2, and attribute 'ADDRESS' is missing in EMP1.
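A common way to integrate two such relations is to take the union of their attributes and leave the missing ones empty. The sketch below assumes that SSN identifies the same employee in both relations and that, on a conflict, the later record wins; both are illustrative assumptions, and the sample rows are hypothetical.

```python
FIELDS = ("SSN", "NAME", "AGE", "ADDRESS")  # union of EMP1 and EMP2 attributes


def merge_employees(emp1, emp2):
    """Merge records on SSN; attributes absent from every source stay None."""
    merged = {}
    for record in [*emp1, *emp2]:
        row = merged.setdefault(record["SSN"], dict.fromkeys(FIELDS))
        row.update(record)  # later records overwrite earlier values
    return list(merged.values())


emp1 = [{"SSN": "111", "NAME": "A", "AGE": 30}]
emp2 = [{"SSN": "111", "NAME": "A", "ADDRESS": "X"},
        {"SSN": "222", "NAME": "B", "ADDRESS": "Y"}]

employees = merge_employees(emp1, emp2)
print(employees)
```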

Semantic Incompatibilities (cont.)
Levels of Abstraction
This incompatibility is encountered when information about an entity is stored at dissimilar levels of detail in two databases. For example, 'LABOR-COST' and 'MATERIAL-COST' may be stored separately in one database and combined together as 'TOTAL-COST' in a second database.
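The two levels of detail can be reconciled in one direction only: the detailed record can be rolled up to the coarser one, but the total cannot be split back into its parts. A minimal sketch, in which the `PART` field and the sample values are hypothetical:

```python
def to_total_cost(record):
    """Roll a detailed cost record up to the coarser TOTAL-COST level."""
    return {
        "PART": record["PART"],
        "TOTAL-COST": record["LABOR-COST"] + record["MATERIAL-COST"],
    }


detailed = {"PART": "P1", "LABOR-COST": 40, "MATERIAL-COST": 60}
summary = to_total_cost(detailed)
print(summary)
```

An integrated schema built this way has to adopt the coarser level unless the second database can supply the breakdown.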

Semantic Incompatibilities (cont.)
Identification of Related Concepts
For concepts in the component schemata that are not the same but are related, one needs to discover all the inter-schema properties that relate them. For example, two entities belonging to two different databases may not be equivalent, but one entity may be a generalization of the other.

Semantic Incompatibilities (cont.)
Scaling Conflicts
This incompatibility arises when the same attribute of an entity is stored in dissimilar units in different databases. For example, the attribute 'LENGTH' of an entity may be stored in centimeters in one database and in inches in another database.
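Resolving a scaling conflict amounts to converting every source to one agreed unit before values are compared or merged. The sketch below picks centimeters as the common unit, which is an arbitrary choice for illustration; the function and table names are hypothetical.

```python
CM_PER_INCH = 2.54  # exact by definition of the international inch

# Conversion factor from each source unit to the common unit (centimeters).
TO_CM = {"cm": 1.0, "in": CM_PER_INCH}


def length_in_cm(value, unit):
    """Convert a LENGTH value from its source unit to centimeters."""
    return value * TO_CM[unit]


print(length_in_cm(100, "cm"), length_in_cm(10, "in"))
```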

Quantitative Data Incompatibilities
Different Levels of Precision
Different databases may store an attribute at dissimilar levels of precision. For example, one database may record the weight of a particular part to a precision of one milligram, whereas another database may specify it only to the nearest gram.
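Values stored at different precisions can only be compared meaningfully at the coarser precision. The sketch below rounds a milligram value to whole grams before comparing it with a gram value; the function name and sample weights are hypothetical.

```python
def weights_agree(weight_mg, weight_g):
    """Compare a milligram-precision weight with a gram-precision weight
    by rounding the finer value to the coarser precision first."""
    return round(weight_mg / 1000) == weight_g


print(weights_agree(12400, 12))  # 12.4 g rounds to 12 g
print(weights_agree(12600, 12))  # 12.6 g rounds to 13 g
```

Note that Python's `round` sends exact halves to the nearest even integer, so an integrator may want an explicit tie-breaking rule for values such as 12500 mg.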

Quantitative Data Incompatibilities (cont.)
Asynchronous Updates
Since each database is managed independently, the databases may not all update a given value at the same time.

Quantitative Data Incompatibilities (cont.)
Lack of Security
Due to a lack of information security at the component databases, unauthorized users may have changed the data.

Challenges of Bioinformatics Database Management
Bioinformatics database formats:
• Flat files: GenBank, EMBL, DDBJ, PDB
• Relational databases: HGMD, MGMD
• Object-oriented database: AceDB
• XML databases: PIR, SwissProt, InterPro
Characteristics:
• The diversity/variety of data
• The representational heterogeneity
• Autonomous and web-based sources
• Varied interface and query capabilities

Motivation
• Very large heterogeneous databases. Need to link.
• Integration.
• Complex relations.

Volume and Variety
• Two interacting issues in generating information
  1. The volume is large -- we need automation
  2. The data is varied & heterogeneous
    - many autonomous sources
    - many distinct objectives
    - many incompatibilities, errors

Diversity & Heterogeneity
• A wide variety of knowledge is needed to interpret the data
• A large variety of experts is developing this knowledge
• The scope of interests differs among those experts
• The knowledge is expressed in diverse ways
• The terms differ in precise meaning: semantics
• A large variety of data types is needed
• A wide variety of representations is used
• The database and file schemas differ
• The openness and accessibility of the information differs

Heterogeneity inhibits Integration
• An essential feature of science
  - autonomy of fields
  - differing granularity and scope of focus
  - growth of fields requires new terms
• A feature of technological process
  - standards require stability
  - yesterday’s innovations are today’s infrastructure
• Must be dealt with explicitly
  - sharing, integration, and aggregation are essential
• Large quantities of data require precision

Heterogeneity among domains is natural
Interoperation creates mismatch
• Autonomy conflicts with consistency
  - Local needs have priority
  - Outside uses are a byproduct
• Heterogeneity must be addressed
  - Platform and operating systems
  - Data representation and access conventions
  - Metadata: annotations, naming, and ontology (needed to share data from distinct sources)

Obstacles to Integration
• Data spread over multiple, heterogeneous dbs
• Not all are easily queried
  - flat file sequence dbs, web sites, BLAST alignments
• Some are not even easily parsed!
• Not all represent biology optimally
  - GenBank is sequence-centric, not gene-centric
  - SwissProt is sequence-centric, not domain-centric
• Hard to keep results up-to-date
• Non-traditional query approaches are needed to exclude extraneous results

What are the Data Sources?
• Flat Files
• URLs
• Proprietary Databases
• Public Databases
• Data Marts
• Spreadsheets
• Emails
• …
