
  1. The Swiss experience with TIMSS, PISA, HarmoS and their psychometric models Erich Ramseier, Bildungsplanung und Evaluation, ED Bern

  2. Overview • PISA and TIMSS in Switzerland • HarmoS and its validation study: overview, procedure and first results • Appraisal of HarmoS and consequences • A few reflections on questions of the conference

  3. PISA and TIMSS in Switzerland • PISA – a big topic in Switzerland • PISA has received enormous public and political attention in Switzerland and Germany • A short evaluation by Network A of the OECD shows that press coverage of PISA is by far larger in Germany and Switzerland than in other countries • Probably due to the previously missing student assessment in these countries • PISA is therefore used internally to compare cantons • Assessment is important within a global trend featuring accountability, output control, and a new way of governance in public administrations and schools • Fact: Switzerland is introducing student assessment and monitoring of the education system – because of PISA, or because of a general trend in which PISA was just a suitable argument and therefore gained its popularity • PISA and TIMSS produced many interesting results • E.g. disappointing attainment in reading • Large variance of student achievement and large impact of social background on achievement • In TIMSS (Third International Mathematics and Science Study), Switzerland attained disappointing results in science. This could be explained by the mismatch between curricular priorities in Switzerland and the cognitive demands of the TIMSS test (Ramseier, 1999) • Corresponding subscales based on cognitive requirement versus terminological knowledge show r = .73 between subscales at country level (.54 without 3 outliers in 41 countries) • So it is not only a g-factor, as discussed around PISA (Rindermann, 2006, 2007)

  4. PISA and TIMSS in Switzerland • PISA and TIMSS as learning opportunities • PISA, TIMSS and other large-scale assessments use an elaborated methodology – from sampling to scaling, analysis, and reporting • Participation was a perfect opportunity to learn: methodologies were applied – but led, assisted and controlled by the international project • New challenge: Swiss education monitoring including student assessments • Swiss large-scale student assessments start with HarmoS (or gain new importance) • All methodological problems of such a study must now be solved internally • I would like to report on the methodological work in this project • Since methodological research questions have to be answered in the context of the content of the research, I will also describe the HarmoS project, which defines this context in Switzerland for today and tomorrow

  5. Overview of HarmoS • HarmoS – a large political project to harmonize compulsory education in Switzerland • HarmoS – an important political project to define and monitor education standards • HarmoS – a scientific project to propose education standards • Validation study – to check and anchor standards empirically

  6. HarmoS (General level) • Goal: harmonization of obligatory school in Switzerland • Political level: treaty among Swiss cantons (finalized in 2007) • Main content • structural benchmark figures, e.g. age of starting school • binding national educational standards • quality control: continuing educational monitoring on the national level • Educational standards will be included as an appendix • proposal of standards developed in a scientific project (end of 2007) • public consultation on these proposals – resolution by the Swiss conference of cantonal ministers of education (2008) • application in national assessments (beginning around 2011)

  7. HarmoS (Scientific Project) • Goal: proposition of standards, based on a competence model • Subjects / grades • First language • Mathematics • Second language • Science • In grades 2, 6, and 9

  8. HarmoS (Scientific Project) • Organization • For each subject a scientific consortium is responsible; composition: mostly experts in the didactics of that subject • Methodological group • Guidance and assistance regarding scaling and validation • Design of the validation study • Project coordination: Swiss conference of cantonal ministers of education • Time-frame • 2005: theoretical foundations • 2006: draft of competence model, creation of corresponding tasks, pilot tests, design of the validation study • 2007: validation study, finalization of the competence model, proposition of minimal standards

  9. Fundamentals of the Scientific Project HarmoS • Performance standards, in contrast to content standards or opportunity-to-learn standards • The project follows principles stated in a German expert report on the development of educational standards (Klieme et al., 2003). This means: • Concept of competence as defined by Weinert (2001) • Standards describe the minimum competence required (basic standards) • Standards are embedded in a model of competence • The model of competence has to be substantiated by concrete tasks and tests measuring this competence • HarmoS includes the empirical validation of the competence model in the process of developing the standards • A limited large-scale assessment is necessary for this • But competence models may not be restricted to components which can be validated at this stage • Reduced validation in grade 2

  10. Concept of competence • Elements of the definition: • Mental conditions necessary for solving problems and fulfilling a certain class of demands • Application to new situations, not only reproduction of knowledge or application of routine skills • The structure of competence derives from the structure of demands, not from the system of its constituents such as single abilities, skills, knowledge, metacognitions • Demands are socially defined: competence is embedded in culture and society, not a natural phenomenon • Learning plays a central role in the acquisition of competence • Competence is the product of a cumulative, long-term process • Content-specific knowledge plays a key role • Competencies are domain-specific, not general; domains correspond to school subjects • Mainly cognitive components, but also including motivational, volitional, and ethical components (Weinert, 2001b) • Definition in the Klieme expert report (2003, p. 72): In accordance with Weinert (2001, p. 27 ff.), we understand competencies as "the cognitive abilities and skills that an individual possesses or can acquire in order to solve specific problems, together with the associated motivational, volitional, and social dispositions to apply these problem solutions successfully and responsibly in variable situations."

  11. Model of Competence • [Diagram: common structure of the HarmoS competence models – levels × aspects of competence/action × content areas of competence] • Model of competence as a basis for standards • Components of competence, describing the content structure of the domain: content and actions • Levels of competence, describing the degree of attained competence; a level is characterized by the typical cognitive processes mastered at that level • Development of competence in the process of schooling

  12. Example: Competence Model in Mathematics • Content areas of competence • Shape and space (Form und Raum) • Numbers and variables (Zahl und Variable) • Functional relationships (funktionale Zusammenhänge) • Quantity and measure (Grössen und Masse) • Data and probability (Daten und Zufall) • Aspects of competence • Knowledge, cognition, and description (Wissen, Erkennen und Beschreiben) • Operation and calculation (Operieren und Berechnen) • Representing and formulating (Darstellen und Formulieren) • Modeling and transforming into mathematics (Mathematisieren und Modellieren) • Reasoning and justifying (Argumentieren und Begründen) • Interpretation and reflection of results (Interpretieren und Reflektieren der Resultate) • Using tools and instruments (Instrumente und Werkzeuge verwenden) • Investigation and exploration (Erforschen und Explorieren) • "Can-dos" • Descriptions of the competence in each cell, defined by a combination of content area and aspect of competence • Level of competence / of cognitive demand, influenced by • difficulty of understanding the question/problem • complexity of cognitive processing of the task • complexity of mathematical concepts and skills • Motivation • Treated separately

  13. Validation Study: Mission • General goal: to empirically check the validity of competence models and standards • Elements of validation • Selection of feasible tasks/items • Competence levels and standards have to be illustrated by tasks. Attribution of tasks to competence levels can be based upon empirical task difficulty • Checking the structure of components of competence: differentiation and correlation of content areas and aspects • Is the competence model valid for the whole of Switzerland? Or do item difficulties vary between language regions? (Between types of schools, by gender?) • Are the proposed basic standards realistic: challenging as well as attainable?

  14. General Design of the Study • Populations • Students in grade 9 (grade 6, respectively), enrolled in public schools, including students with special needs • Sample size • For most study aims, the difficulty of many tasks needs to be known, e.g. for • describing several levels of competence in 4 subjects • comparing item difficulties in 3 language regions • The sample primarily has to allow testing enough TASKS • Sample size depends on • number of items required (200 per subject) • time needed to solve an item (varying, 1-3 minutes) • number of students needed per item (150 per region) • length of testing time per student (170 minutes in 2 sessions) • expected attrition/response rate • (actual numbers vary by subject, region, …; a rough calculation is sketched below) • Planned sample size • 6'600 for grade 6 and for grade 9 • Size also sufficient to estimate competence distributions
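The arithmetic behind these figures can be made explicit. The following back-of-envelope sketch is not from the presentation: the per-subject split of testing time and the response rate are my assumptions, and it reproduces only the order of magnitude of the planned sample size.

```python
# Back-of-envelope check on the planned sample size for one grade and subject.
# The per-student time split and the response rate below are assumptions.

items_per_subject = 200          # items whose difficulty must be estimated
minutes_per_item = 2.0           # midpoint of the stated 1-3 minute range
responses_per_item_region = 150  # students needed per item and language region
regions = 3                      # German-, French-, and Italian-speaking parts

# Student-minutes of testing required to calibrate one subject
required_minutes = (items_per_subject * minutes_per_item
                    * responses_per_item_region * regions)

# Minutes each sampled student spends on this one subject: 170 minutes in
# total, split across four subjects over two half days (assumption)
minutes_per_student = 170 / 4

response_rate = 0.82             # attrition assumption, close to the attained rates

students_needed = required_minutes / minutes_per_student / response_rate
print(f"~{students_needed:,.0f} students")  # ~5,165 - same order as the planned 6'600
```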

  15. General Design of the Study • Representative sample • Necessary to estimate the percentages of the student population distributed over competence levels and reaching the standards • Two-stage sampling • First stage: schools; second stage: 2 classes within each school • Details: Renaud (2006), Ramseier & Moreau (2007); a toy sketch follows below • Attained sample (depending on subject; by design smaller for second language and science) • About 5550 in grade 6, response rate 82% • About 6050 in grade 9, response rate 85% • Testing sessions • 2 half days, for mathematics/science and for languages • Tests administered by school teachers
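As an illustration of the two-stage structure only – the actual selection used stratification and weighting as documented in Renaud (2006) and Ramseier & Moreau (2007) – here is a minimal sketch with invented school and sample counts:

```python
# Toy two-stage sample: draw schools first, then 2 classes within each school.
# School counts and the number of sampled schools are invented for illustration.
import random

schools = {f"school_{i:04d}": [f"class_{i:04d}_{j}" for j in range(random.randint(2, 8))]
           for i in range(1200)}

stage1 = random.sample(sorted(schools), k=180)   # first stage: schools
stage2 = [c for s in stage1                      # second stage: 2 classes per school
          for c in random.sample(schools[s], k=min(2, len(schools[s])))]
print(len(stage2), "classes sampled")
```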

  16. Test Design, Mathematics, grade 9 • Conditions • The test session allows for two 20-minute testing blocks before and two after the break • 269 items, grouped in 34 clusters (8 per cluster); estimated time per cluster: 20 minutes • Each cluster contains items from only one content area (items within clusters possibly grouped into fewer units, defined by a theme and common stimulus material) • Testing includes 24 science clusters • The printing arrangement allows booklets to be assigned individually to persons and to positions 1, 2, 3, 4 within the session; therefore each cluster is printed in a separate booklet • With 58 clusters to be tested in a session and an allocation of only 4 clusters to a single student, a multi-matrix design is necessary • Goal • Global goal: balanced allocation of clusters to students, and balanced and interlaced allocation of clusters to session positions, that is: • each cluster is tested equally often in total and in each position • each cluster is equally often combined with clusters from other content areas in pairs of clusters • each student gets at least one pair of mathematics clusters • pairs of clusters are equally often tested before and after the break

  17. Test Design, Mathematics, grade 9 • Procedure • A sequence = one set of all possible cluster combinations meeting the above-mentioned goals and restrictions is constructed • Copies of the sequence are combined into a series longer than the number of students • Cluster combinations within a sequence are randomly sorted • The series is matched to the stratified list of students • Each student gets a set of test materials with a specific combination of 4 booklets (see the sketch below) • This procedure deviates from the usual way of combining item clusters into a limited number of booklets based on balanced designs, e.g. by means of Youden square matrices • Result • Balance of clusters is reached, with some random deviation • A huge number of individual combinations of mathematics testing materials are distributed: in positions 1 and 2 alone, in the German-speaking part of Switzerland, 791 different combinations of mathematics clusters entered the analysis – each only once or twice; combined with positions 3 and 4, more than a million different constellations of mathematics clusters were realized • Possible unknown interactions between clusters are optimally controlled
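A minimal sketch of this allocation idea in Python. The cluster labels, the Latin-square-style construction, and the student list are stand-ins of my own; the real design enforced further constraints (content-area pairing, a mathematics pair per student, before/after-break balance) that are omitted here.

```python
# Sketch: generate balanced "sequences" of 4-booklet combinations and match
# them to the student list. Labels and construction are illustrative.
import random

clusters = [f"M{i:02d}" for i in range(1, 35)] + [f"S{i:02d}" for i in range(1, 25)]

def make_sequence(clusters, positions=4):
    """One 'sequence': a cyclic, Latin-square-style layout in which every
    cluster appears exactly once in each session position and no student
    sees the same cluster twice."""
    base = clusters[:]
    random.shuffle(base)               # fresh shuffle -> new cluster pairings
    n = len(base)
    combos = [tuple(base[(i + p) % n] for p in range(positions)) for i in range(n)]
    random.shuffle(combos)             # random order of combinations within the sequence
    return combos

students = [f"student_{i:04d}" for i in range(6600)]  # stratified list in practice

series = []
while len(series) < len(students):     # concatenate copies of (fresh) sequences
    series.extend(make_sequence(clusters))

assignment = dict(zip(students, series))  # an individual set of 4 booklets per student
```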

  18. Overall Test Design • Conditions and goals • Conditions and goals of the design are similar for science, first language and second language • In detail, restrictions varied, and the situation was more complex for the languages: • First language: a first section of 30 minutes dedicated to listening; the same audio stimulus had to be presented to whole classes, so booklets were distributed at class level • Second language: the second language was either English, French, or German – depending on the language region • Not all item clusters were tested in all language regions • Result: a large number of different booklets had to be printed and allocated to students and session positions; in total, 100'000 cluster-position-student links were realized

  19. Scaling and Analysis • Overview • Selection of items • Differential item functioning by language region • Analysis of sub-dimensions • Adaptation of the scale • Definition of competence levels and standards • Illustrated with mathematics, grade 9 • Analyses done with ConQuest, based on a generalised Rasch model

  20. Item Selection • Starting point: 269 items • Selection criterion: fit to the Rasch model (.7 < infit < 1.3) • 1 item excluded; only 2 outside .8-1.2 → not a critical condition (mean infit = .92) • Criterion: discrimination (item-total correlation): d >= .3 • Selection important: only a limited pilot test! • 16 items eliminated • 26 kept although d < .3: mostly very easy/difficult items (where d IS necessarily small!) • to illustrate the full range of the scale • or items which are preliminary questions in a unit • 252 items kept (the selection rules are sketched in code below)
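Expressed as code, the selection rules might look like the following sketch. The item-statistics table and its column names are hypothetical; in practice the infit and item-total correlations would come from the Rasch calibration.

```python
# Sketch of the item selection rules; the data frame is a hypothetical stand-in
# for the item statistics produced by the Rasch calibration.
import pandas as pd

items = pd.DataFrame({
    "item":  ["m001", "m002", "m003", "m004"],
    "infit": [0.95, 1.35, 1.02, 0.88],       # weighted mean-square fit
    "d":     [0.42, 0.31, 0.18, 0.05],       # discrimination (item-total correlation)
})

ok_fit  = items["infit"].between(0.7, 1.3, inclusive="neither")  # .7 < infit < 1.3
ok_disc = items["d"] >= 0.3                                      # discrimination criterion

# Failing a criterion only makes an item a removal candidate: very easy or very
# hard items (where d is necessarily small) can still be kept for content reasons.
removal_candidates = items[~(ok_fit & ok_disc)]
kept = items[ok_fit & ok_disc]
print(removal_candidates["item"].tolist())   # ['m002', 'm003', 'm004']
```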

  21. Item Difficulty by Language Region • Why important? • If items are scaled within regions and the person sample mean is set at 0, the mean item difficulty shows achievement differences between regions • If an item has the same meaning in each region, its deviation from the regional mean (its relative difficulty) should be similar in each region; otherwise it is a case of differential item functioning (DIF) • DIF may have many causes • translation or printing errors • incidental differences in familiarity with content • differences in mathematics instruction (familiarity with topic, notations, …) • If DIF is high and incidental causes are excluded, the existence of a nationally defined mathematics competency is challenged!

  22. Item Difficulty by Language Region (before final selection) • [Scatter plot: relative item difficulty in the German part vs. relative item difficulty in the French part; the displayed range spans 2 logits (roughly 2 SD at population level)]

  23. Treatment of DIF • Elimination of critical items • Comparison between German, French, and Italian difficulty • An item is critical if, for any region, |d_reg − d_nat| > 0.5 (d_reg / d_nat = mean item difficulty in a region / nationally) and the difference is statistically significant (p < .05) • Item treatment • 57 of 252 items critical • 11 kept for specific content reasons • 46 split into region-specific items → 138 regional items (each treated within its region as if the item were defined only there) • 17 of 138 regional items eliminated (mainly d < .3) • Result • 195 national items without large regional DIF • 11 national items with considerable regional DIF • 121 regionally used items • This item treatment allows a national scaling, but the regional language DIF remains a research question (the flagging rule is sketched below)
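A sketch of this flagging rule. The z-test on the difference of difficulty estimates is my reading of the significance condition; the estimates and standard errors would come from separate regional and national calibrations.

```python
# Sketch of the DIF flagging rule above; all numeric inputs are illustrative.
import numpy as np
from scipy import stats

def flag_dif(d_region, se_region, d_national, se_national,
             size_threshold=0.5, alpha=0.05):
    """Flag an item as critical if its regional difficulty deviates from the
    national difficulty by more than 0.5 logits AND the deviation is
    statistically significant at the 5% level."""
    diff = d_region - d_national
    z = diff / np.sqrt(se_region**2 + se_national**2)
    p = 2 * stats.norm.sf(abs(z))       # two-sided p-value
    return (abs(diff) > size_threshold) and (p < alpha)

# Example: an item 0.7 logits harder in one region than nationally
print(flag_dif(d_region=1.2, se_region=0.12, d_national=0.5, se_national=0.05))  # True
```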

  24. Sub-dimensions of competence • Estimated correlations among content areas • [Table: direct unweighted estimates of latent correlations, including regionalized items; missing = wrong; 4 of 5 content areas included in the test] • Correlations are high – facets of a general mathematics competence • Correlations are clearly below 1: content areas are empirically distinguishable (better than in PISA: correlations of math subscales .89-.92; OECD, Technical Report 2003, p. 190)

  25. Sub-dimensions of competence • Estimated correlations among aspects of competence • [Table: direct unweighted estimates of latent correlations, including regionalized items; missing = wrong] • Correlations are a little higher – but the interpretation can be similar • A model test shows the advantage of sub-dimensions – but there might be better models

  26. Adaptation of the Scale • Scale unit • Originally, most estimates of person values vary between -3 and 3 (logits) • Negative values for competencies are hard to communicate – therefore the scale is linearly transformed to a mean of 500 and a standard deviation of 100 (after weighting) • A linear transformation does not change substantive relations, but setting the mean to 500 makes the mean explicit and introduces a social norm • Position of items • The position of items on the same scale as persons is pivotal for the interpretation of Rasch scales • Originally, an item is located at the point of the scale where a student has a 50% chance of solving the task • From a pedagogical view, it is better to place the item where most students master the task; for HarmoS, we chose the point where the chance of solving is 2/3 • For dichotomous items, this is a linear transformation which does not change the interpretation (see the derivation sketched below) • But for partial-credit items (with scores 0, 1, 2, …) this may change the progression of the Thurstonian thresholds (p(X >= 1), p(X >= 2), …), since cumulative item characteristic curves are not parallel • It is doubtful whether pedagogical users of the locations of partial-credit items are aware of this problem
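For dichotomous Rasch items the 2/3 location is indeed just a constant shift: solving 1/(1 + exp(-(θ - b))) = 2/3 gives θ - b = ln 2 ≈ 0.69 logits. A sketch of both scale adaptations, with simulated person estimates (the real transformation used weighted means):

```python
# Sketch of the two scale adaptations: the linear 500/100 transformation and
# the shift of dichotomous item locations from the 50% to the 2/3 point.
# Person estimates here are simulated for illustration.
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(0.0, 1.0, 6000)   # person estimates in logits (illustrative)
mu, sd = theta.mean(), theta.std()

def scale(x):
    """Linear transformation to mean 500, SD 100 (weighted in the real study)."""
    return 500 + 100 * (x - mu) / sd

# For a dichotomous Rasch item with difficulty b, P(correct | theta) = 2/3
# exactly at theta = b + ln 2, so the "2/3 location" is a constant shift
# (about 69 points here, since one logit maps to roughly 100 scale points).
b = 0.8                              # item difficulty in logits
loc_50 = scale(b)                    # conventional 50% location
loc_67 = scale(b + np.log(2))        # HarmoS 2/3 location
print(round(loc_50), round(loc_67))

# For partial-credit items the Thurstonian thresholds p(X >= k) = 2/3 must be
# solved numerically, and the shift relative to the 50% thresholds is NOT constant.
```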

  27. Competence Levels and Basic Standards • I will be brief about levels and standards, since these were mainly defined by the consortium without much involvement of the HarmoS methodological group • Competence levels • Before the validation study started, factors of item difficulty were formulated and tasks were attributed to 4 levels. This resulted in an a priori description of the levels • The validation study showed that expected and empirical difficulty correlated, but they clearly did not match perfectly (r = .56, 142 dichotomous items) → the empirical anchoring of task difficulty is a main advantage of including a validation study in the process of standard development • Since items were attributed to levels and – by the test – to specific places on the competence scale, a first relation between levels and scale was established • Inspection of the characteristics of typical tasks in a provisional level helped to revise the description of the level and the definition of its boundary (similar to the a posteriori process in PISA)

  28. Competence Levels and Basic Standards • The primary definition of levels is their verbal description • Resulting levels on the competence scale: [figure not reproduced] • Levels are content-driven (contrary to PISA) and demonstrate high expectations • Since the breadths of the intervals are unequal, percentages are not easily comparable • Level I (breadth: 140) includes students with quite different degrees of competence

  29. Competence Levels and Basic Standards • Basic standard • The basic standard in mathematics, grade 9, corresponds to level I and to a cut-off score of 400 • Level I has to be attained in each content area and in each competence aspect • The basic standard also requires that students can contribute to the solution of level II tasks in a team • The basic standard was established in a consensus-building process in the mathematics consortium

  30. Appraisal of HarmoS and consequences • The basic results already show that the empirical validation of standards during development is worthwhile: • tasks illustrate levels of competence/difficulty in a realistic way • you know what you get from a basic standard if you know how many students attain the standard today • The study's timeline is extremely tight • 6 months or less between the end of data coding and the final report • → only a first step in a validation process; e.g. the treatment of developmental aspects has to follow • Differences between regions in the structure of mathematics competence may be important • The considerable DIF shows this importance • → needs further analysis • A national discussion about priorities in mathematics instruction is desirable (if national standards are the aim) • The presence of three languages is a challenge • not only on the level of content (see above) • but on the level of cooperation, too • Even if all do their best, sheer language problems, differences in visions, and differences in scientific background make cooperation difficult – but they are enriching, too • Cooperation in heterogeneous language groups needs a lot of time!

  31. Appraisal of HarmoS and consequences • Liberal model judgment • The selection criteria for testing the adequacy of Rasch modeling and for testing differential item functioning by language region were applied fairly generously • Reasons: • The goal is not an efficient test for reliable individual measurement, but the illustration of a broadly defined competence • Mathematics competence is primarily defined by the community teaching mathematics and the community applying mathematics. If a psychometric model does not allow this competence to be captured, one has to start a process of finding more adequate models and of revising and specifying the definition. This needs an interaction between these communities and a scientific community • In a first step it is better to apply the psychometric models reasonably, as a heuristic tool, than to unduly narrow the concept

  32. Appraisal of HarmoS and consequences • Validating competence models and standards in the present is necessary and problematic • Seeing what students can do today is necessary for creating realistic models and standards • But: standards are intended as a means to develop schools and to set new challenges. Therefore levels and tasks also have to illustrate what is wanted • A perfect match with the present is not desirable! • Application of the Rasch model • The Rasch model was necessary for this study. There is no other way to integrate the analysis of so many tasks on the basis of such a small sample • In the continuation of HarmoS the model can help to ensure continuity between the validation process and the upcoming projects for monitoring the Swiss education system → anchoring of new tests • The Rasch model did not introduce many restrictions on items • The model provides a tool for the analysis of DIF • The model allows the structure of competence aspects and content areas to be analysed empirically despite a data set with 97% of the data missing by design • Continuation of HarmoS • The validation study shows that studies for monitoring the Swiss education system require a lot of preparation and development in methodological questions

  33. A few reflections on questions of the conference • Overview • Status of hierarchical levels of difficulty/competence • Limitations of psychometrics applied in education • Motivation as part of competence – as defined by Weinert • Impact of psychometric models in a broader context

  34. Hierarchical Levels of Difficulty/Competence or a Continuous Scale? • Strict perspective of levels • Levels are characterized by cognitive processes which are mastered at that or a higher level • A task requiring such a specific process can be solved by students at this level. Students below it have no chance to solve the task – independent of their exact position on the competence scale. In fact there is no such scale, but just a class of people unable to solve the task • Latent class or mixed Rasch models are suitable to analyze such data and to identify classes (e.g. Rost, 2004a) • Sections on a Rasch scale • The probability of solving a certain item is a specific continuous function of personal competence (the ICC, a logistic function); the boundary of a level has no special meaning in this function • The ICCs of different items are the same apart from a parallel translation – independent of whether two items belong to the same or to different levels • Classes are not modeled and are irrelevant (both views are contrasted in the sketch below)
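The contrast between the two perspectives can be made concrete. The following sketch is purely illustrative; the step function is a caricature of the strict-levels view, not a fitted latent class model.

```python
# Contrast: under the Rasch model, the solving probability is a smooth
# logistic function of competence; the strict-levels view implies a step
# at the level boundary.
import numpy as np

def rasch_icc(theta, b):
    """Rasch item characteristic curve: smooth in theta, with no special
    role for any level boundary."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def strict_level_icc(theta, boundary, p_master=0.9, p_below=0.0):
    """Strict-levels caricature: students below the boundary have (almost)
    no chance, independent of their exact position."""
    return np.where(theta >= boundary, p_master, p_below)

print(rasch_icc(0.5, b=0.0))                                   # ~0.62: partial success below mastery
print(strict_level_icc(np.array([-0.5, 0.5]), boundary=0.0))   # [0.  0.9]: all-or-nothing
```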

  35. Hierarchical Levels of Difficulty/Competence or a Continuous Scale? • Scaling complex competences • Education standards describe complex competences – a large scope of tasks is included, tasks which differ in many sources of difficulty • A latent class model would need a few decisive processes – it cannot model competences of this sort • If a rich but interrelated mixture of sources of difficulty is present, a continuous scale can be appropriate • It is still possible to describe sections on this scale by typical tasks and their characteristics. Rost (2004b) describes such methods – already applied in German analyses of PISA data. In HarmoS, such levels are identified, too • If sources of difficulty are not explicitly modeled and explain much of the variability, the continuous scale has priority. Levels are then a heuristic means of communicating the meaning of a sector of the continuous competence dimension • This means of communication is important. But in the comparison between two students or two cantons, the difference in scale position, not the adherence to the same or different levels, is decisive

  36. Limitations of Psychometrics Applied in Education • General limitation • Usually, psychometric models describe interindividual differences in a competence. The relation to the development of competence is complex: if a condition of growth does not vary between individuals, there is no correlation with the competence – yet the condition may still be crucial for the development of the competence • Specific problems in education • Competencies as understood in HarmoS are defined by socially driven demands. Such demands may vary regionally or over time. The stability that is ideal for psychometric methods is perhaps missing • Competencies are (partially) learned (see Weinert's definition). In competencies defined by knowledge domains (subjects), learning opportunities are defined and organized at school. This organization may vary by region or over time. Again, the stability of competencies might be missing • In psychology, one science has the lead. If it is difficult to measure a concept, the concept may be changed. In education, the concept is anchored in the practices of teaching and in didactical values. It is more difficult, and probably undesirable, to change the concept in order to improve measurement possibilities

  37. Motivation as Part of Competence – as Defined by Weinert • Is this inclusion feasible? – a question unrelated to the importance of motivation! • Against • A big difference: competence has the character of "the more the better" – you then have to apply it reasonably. It is impossible to demonstrate more competence than you have. Motivation is of a mesotic type – it has to be adequate (Patry, 1991). The maximum is not necessarily the best • Motivational situations may vary strongly and dominate individual motivational differences, while competence is more fixed and person-bound • Example: the competence of car driving • It is fine to have a high competence in driving. This competence may be high even if you don't like driving, or if you are convinced that it is better to use public transport: liking to drive is not a constituent part of the competence • But it is a bad example: part of a high driving competence is the readiness to comply with rules and to respect the rights of other drivers. Maybe aspects of action control are more easily part of a competence than liking. The latter is mainly important insofar as avoidance would hinder the application of a competence • This question is important in a methodological debate – since it introduces many difficult methodological problems • E.g. assessment/examination situations, ideal for measuring cognitive competence, activate achievement-related aspects. Achievement motivation becomes important. But motivational situations may be quite different later, when competences are applied – or not – in everyday life

  38. Impact of Psychometric Models in a Broader Context • An impact of psychometric models is realistic • The use of a psychometric model has an impact on what is measured – it may narrow which competence is measured and therefore have an undesirable influence on tests and, indirectly, on the aims and modalities of education • But what is its importance in a broader context? • Especially if one understands "psychometric model" narrowly, as the model that links raw data to scaled values (e.g. IRT models), it is just one small piece in a large undertaking – the monitoring of education • What is the influence of the other pieces?

  39. Impact of Monitoring on Education • Top level: accountability • The main implication comes simply from the fact that there is measurement of achievement and feedback and/or publication of results; the strengthening of accountability produces perceived pressure and control • The way of measurement may influence the degree of pressure – but other factors, like public versus confidential feedback, may be more important (rankings!)

  40. Impact of Monitoring on Education • Second level: intended concept of proficiency • The scope for defining what one wants to measure is large: • competence, e.g. an orientation to problem solving and application in new contexts • a concept of competence including motivation – or purely cognitive competence • a functional definition, e.g. of mathematics literacy as in PISA (its "role for a successful life") • or taking the intrinsic value of mathematics into account, as in HarmoS (Linneweber-Lammerskitten & Wälti, 2005, p. 4; HarmoS Mathematik Konsortium, 2008, p. 11) • pure reproduction of knowledge • … • Definitions clearly influence what is measured • For most of these alternatives, different psychometric models are conceivable

  41. Impact of Monitoring on Education • Third level: range of tasks used • What is measured depends highly on the types of tasks used • only multiple-choice items and short fixed answers – enabling automatic coding • complex constructed answers needing elaborate scoring • authentic performance assessments • Costs, effort and time are more decisive for this choice than methods • Fourth level: psychometric model • IRT and other models impose restrictions – and they offer possibilities • Rasch model: e.g. exclusion of tasks with a high rate of success by chance • Restrictions have to be weighed against the possibilities, like equating, DIF analysis, …

  42. Overall Consequence • The HarmoS activity, all the research questions and methodological problems, and the certainty of the coming education monitoring mean that educational assessment in Switzerland is in a very critical and interesting phase • It is a great opportunity to engage in this emerging field of science!

  43. Thank you very much for your attention !

  44. References • Klieme, E. et al. (2003). Zur Entwicklung nationaler Bildungsstandards. Expertise. Bonn: Bundesministerium für Bildung und Forschung (BMBF). • Linneweber-Lammerskitten, H., & Wälti, B. (2005). Is the definition of mathematics as used in the PISA Assessment Framework applicable to the HarmoS Project? ZDM Zentralblatt für Didaktik der Mathematik, 37(5). • OECD. (2005). PISA 2003 Technical Report. Paris: OECD. • Patry, J. L. (1991). Transsituationale Konsistenz des Verhaltens und Handelns in der Erziehung. Bern: Lang. • HarmoS Konsortium Mathematik. (2008). Kurzbericht HarmoS Mathematik. Bern: EDK/CDIP. • Ramseier, E. (1999). Task difficulty and curricular priorities in science: Analysis of typical features of the Swiss performance in TIMSS. Educational Research and Evaluation, 5, 105-126. • Ramseier, E., & Moreau, J. (2007). Stichprobendesign und Gewichtung in der HarmoS-Validierungsstudie. Bern: EDK/CDIP. • Renaud, A. (2006). Harmonisation de la scolarité obligatoire en Suisse (HarmoS). Design général de l'enquête et échantillon des écoles. Neuchâtel: Office fédéral de la statistique. • Rindermann, H. (2006). Was messen internationale Schulleistungsstudien? Schulleistungen, Schülerfähigkeiten, kognitive Fähigkeiten, Wissen oder allgemeine Intelligenz? Psychologische Rundschau, 57(2), 69-86. • Rindermann, H. (2007). Intelligenz, kognitive Fähigkeiten, Humankapital und Rationalität auf verschiedenen Ebenen. Psychologische Rundschau, 58(2), 118-128. • Rost, J. (2004a). Lehrbuch Testtheorie – Testkonstruktion. Bern: Huber. • Rost, J. (2004b). Psychometrische Modelle zur Überprüfung von Bildungsstandards anhand von Kompetenzmodellen. Zeitschrift für Pädagogik, 50(5), 662-678. • Weinert, F. E. (2001a). Vergleichende Leistungsmessung in Schulen – eine umstrittene Selbstverständlichkeit. In F. E. Weinert (Ed.), Leistungsmessungen in Schulen (pp. 17-31). Weinheim: Beltz. • Weinert, F. E. (2001b). Concept of competence: A conceptual clarification. In D. S. Rychen & L. H. Salganik (Eds.), Defining and Selecting Key Competencies (pp. 45-65). Göttingen: Hogrefe & Huber.
