Which type of validity occurs when researchers use adequate definitions and measures of variables?

Validity, Data Sources

Michael P. McDonald, in Encyclopedia of Social Measurement, 2005

Criterion Validity

Criterion validity is the comparison of a measure against a single measure that is supposed to be a direct measure of the concept under study. Perhaps the simplest example of the use of the term validity is found in the efforts of the American National Election Study (ANES) to validate respondents' answers to the voting question on the post-election survey. Surveys, including the ANES, consistently produce estimates of the turnout rate that are unreliable and biased upwards: a greater percentage of people report that they voted than official government statistics of the number of ballots cast indicate.

To explore the reliability of the turnout measure, the ANES compared a respondent's answer to the voting question against actual voting records. A respondent's registration was also validated. While this may sound like the ideal case of validating a fallible human response against an infallible record of voting, the actual records are not without measurement error. Some people refuse to provide names or give incorrect names, either on registration files or to the ANES. Votes may be improperly recorded. Some people live outside the area where they were surveyed, so their records went unchecked. In 1984, the ANES even discovered voting records in a garbage dump. The ANES consistently could not find voting records for 12–14% of self-reported voters. In 1991, the ANES revalidated the 1988 survey and found that 13.7% of the revalidated cases produced different results than the cases initially validated in 1989. These discrepancies reduced confidence in the reliability of the ANES validation effort and, given the high costs of validation, the ANES decided to drop validation efforts for the 1992 survey.

The preceding example is one of criterion validity, where the measure to be validated is correlated with another measure that is a direct measure of the phenomenon of concern. A positive correlation between the measure and the measure it is compared against is all that is needed as evidence that a measure is valid. In some sense, criterion validity is without theory. “If it were found that accuracy in horseshoe pitching correlated highly with success in college, horseshoe pitching would be a valid measure of predicting success in college” (Nunnally, as quoted in the work of Carmines and Zeller). Conversely, no correlation, or worse, a negative correlation, would be evidence that a measure is not a valid measure of the same concept.
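As a minimal, hypothetical illustration of this correlational logic (not part of the original chapter), one could correlate a self-reported measure, such as the turnout answer above, with the criterion record; the arrays below are invented for illustration.

```python
# Minimal sketch: criterion validation as correlation between a measure and a criterion.
# self_report and record are hypothetical 0/1 arrays: a survey answer ("did you vote?")
# and the validated registration/turnout record used as the criterion measure.
import numpy as np
from scipy.stats import pearsonr

self_report = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])  # survey responses
record      = np.array([1, 1, 0, 0, 1, 0, 1, 1, 0, 0])  # validated voting records

r, p = pearsonr(self_report, record)  # for two binary variables this is the phi coefficient
print(f"criterion correlation r = {r:.2f} (p = {p:.3f})")
# A strong positive r is evidence of criterion validity; a correlation near zero
# (or negative) suggests the survey measure does not track the criterion.
```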

As the example of ANES vote validation demonstrates, criterion validity is only as good as the validity of the reference measure to which one is making a comparison. If the reference measure is biased, then valid measures tested against it may fail to find criterion validity. Ironically, two similarly biased measures will corroborate one another, so a finding of criterion validity is no guarantee that a measure is indeed valid.

Carmines and Zeller argue that criterion validation has limited use in the social sciences because often there exists no direct measure to validate against. That does not mean that criterion validation cannot be useful in certain contexts. For example, Schrodt and Gerner compared machine coding of event data against human coding to determine the validity of the coding by computer. The validity of the machine coding is important to these researchers, who identify conflict events by automatically culling through large volumes of newspaper articles. As similar large-scale data projects emerge in the information age, criterion validation may play an important role in refining the automated coding process.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B0123693985000463

Latent Dirichlet Allocation

Joshua Charles Campbell, ... Eleni Stroulia, in The Art and Science of Analyzing Software Data, 2015

6.5.1 Criterion Validity

Criterion validity relates to the ability of a method to correspond with other measurements that are collected in order to study the same concept. LDA topics are not necessarily intuitive ideas, concepts, or topics. Therefore, results from LDA may not correspond with results from topic labeling performed by humans.

An erroneous assumption frequently made by LDA users is that an LDA topic will represent a more traditional topic that humans write about, such as sports, computers, or Africa. It is important to remember that LDA topics may not correspond to an intuitive domain concept. This problem was explored in Hindle et al. [20]. Thus, working with LDA-produced topics has some hazards: for example, even if LDA produces a recognizable sports topic, it may be combined with other topics, or there may be other sports topics.
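A minimal sketch of such a criterion-style check, under the assumption of a tiny invented corpus and hypothetical human topic labels (none of this is from the chapter), could compare each document's dominant LDA topic with the human label:

```python
# Sketch: do dominant LDA topics line up with human topic labels for the same documents?
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics import adjusted_mutual_info_score

docs = ["goal match team score", "cpu disk code compiler",
        "team coach season win", "code bug compiler crash"]
human_labels = [0, 1, 0, 1]  # hypothetical human coding: 0 = sports, 1 = computers

X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
dominant_topic = lda.transform(X).argmax(axis=1)   # each document's strongest LDA topic

# Agreement between machine topics and human labels (invariant to label permutation).
print(adjusted_mutual_info_score(human_labels, dominant_topic))
# Low agreement illustrates the point above: LDA topics need not correspond to
# intuitive human topics, or one human topic may be split across several LDA topics.
```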

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780124115194000069

Wechsler Individual Achievement Test

DONNA RURY SMITH, in Handbook of Psychoeducational Assessment, 2001

Validity

Evidence of three traditional types of validity—content, construct, and criterion—was evaluated in order to demonstrate that the WIAT measures what it is intended to measure. Even though the content validity was assessed by a panel of national curriculum experts and deemed representative, school curricula vary. Users should review achievement test items to determine how closely the items match what is taught in their school.

Evidence of construct validity includes the expected intercorrelations among subtests reported by age, intercorrelations with the Wechsler scales (see Table D.6 of the WIAT manual), studies of group differences between the standardization sample and various clinical groups as well as differences between the various age/grade groups, and a multitrait-multimethod study of the WIAT and other achievement tests (Roid, Twing, O'Brien, & Williams, 1992). In summary, there was a striking consistency in the correlations among scores on the reading, mathematics, and spelling subtests of the WIAT and those of the corresponding subtests on the other achievement measures.

Since the majority of school psychologists’ assessment time is spent with students with learning disabilities (Smith, Clifford, Hesley, & Leifgren, 1992), WIAT scores were correlated with school grades, group-administered achievement tests, and clinical study groups. Flanagan (1997) notes that a strength of the WIAT is the demonstrated treatment validity because “data are reported that indicate that the WIAT effectively aids in diagnosis of educational/clinical concerns” (p. 84). Special study groups included children classified as gifted and children with mental retardation, emotional disturbance, learning disabilities, attention-deficit/hyperactivity disorder (ADHD), or hearing impairment. Mean composite scores ranged from 112.1 (SD = 9.9) to 117.8 (SD = 9.5) for gifted children, from 58.0 (SD = 10.2) to 66.3 (SD = 10.3) for children with mental retardation, and from 74.6 (SD = 12.0) to 92.8 (SD = 12.6) for children with learning disabilities. These results confirmed predicted expectations for achievement scores in each group.

Independent studies (Slate, 1994; Martelle & Smith, 1994; Saklofske, Schwean, & O'Donnell, 1996; Michalko & Saklofske, 1996) have provided additional evidence of WIAT validity. Saklofske, Schwean, and O'Donnell (1996) studied a sample of 21 children on Ritalin and diagnosed with ADHD and obtained subtest and composite means quite similar to those reported in the WIAT manual. Gentry, Sapp, and Daw (1995) compared subtest scores on the WIAT and the Kaufman Test of Educational Achievement (K-TEA) for 27 emotionally conflicted adolescents and found higher correlations between pairs of subtests (range of .79 to .91) than those reported in the WIAT manual. Because comparisons are often made between the WIAT and the WJPB-R, Martelle and Smith (1994) compared composite and cluster scores for the two tests in a sample of 48 students referred for evaluation of learning disabilities. WIAT composite score means were reported as 83.38 (SD = 10.31) on Reading, 89.32 (SD = 10.60) on Mathematics, 99.24 (SD = 11.84) on Language, and 80.32 on Writing. WJPB-R cluster score means were 87.67 (SD = 11.80) on Broad Reading, 92.09 (SD = 11.62) on Broad Mathematics, and 83.88 (SD = 8.24) on Broad Written Language. Although global scales of the WIAT and WJPB-R relate strongly to each other, mean WIAT composites were 3 to 6 points lower than mean scores on the WJPB-R clusters. Subtest analysis indicated some differences in the way skills are measured; for example, the two reading comprehension subtests (WIAT Reading Comprehension and WJPB-R Passage Comprehension) are essentially unrelated (r = .06). Study authors suggest that “for students with learning disabilities, the two subtests measure reading comprehension in different ways, resulting in scores that may vary greatly from test to test” (p. 7). In addition to a strong relationship between the WIAT Mathematics Composite and the WJPB-R Broad Mathematics cluster, WIAT Numerical Operations correlated equally well with Applied Problems (r = .63) and with Calculation (r = .57) on the WJPB-R, suggesting that WIAT Numerical Operations incorporates into one subtest those skills measured by the two WJPB-R subtests. At the same time, the WJPB-R Quantitative Concepts subtest does not have a counterpart on the WIAT. The Language Composite of the WIAT, however, is a unique feature of that test.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780120585700500082

Analyzing qualitative data

Jonathan Lazar, ... Harry Hochheiser, in Research Methods in Human Computer Interaction (Second Edition), 2017

11.4.3.1 Validity

Validity is a very important concept in qualitative HCI research in that it measures the accuracy of the findings we derive from a study. There are three primary approaches to validity: face validity, criterion validity, and construct validity (Cronbach and Meehl, 1955; Wrench et al., 2013).

Face validity is also called content validity. It is a subjective validity criterion that usually requires a human researcher to examine the content of the data to assess whether on its “face” it appears to be related to what the researcher intends to measure. Due to its high subjectivity, face validity is more susceptible to bias and is a weaker criterion compared to construct validity and criterion validity. Although face validity should be viewed with a critical eye, it can serve as a helpful technique to detect suspicious data in the findings that need further investigation (Blandford et al., 2016).

Criterion validity tries to assess how accurately a new measure can predict a previously validated concept or criterion. For example, if we developed a new tool for measuring workload, we might ask participants to complete a set of tasks while using the new tool to measure their workload. We would also ask the participants to complete the well-established NASA Task Load Index (NASA-TLX) to assess their perceived workload. We can then calculate the correlation between the two measures to find out how effectively the new tool predicts the NASA-TLX results. A higher correlation coefficient would suggest higher criterion validity. There are three subtypes of criterion validity, namely predictive validity, concurrent validity, and retrospective validity. For more details regarding each subtype, see Chapter 9, “Reliability and Validity,” in Wrench et al. (2013).
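A minimal sketch of this check, assuming hypothetical scores from the new tool and from the NASA-TLX (the numbers below are invented), would simply compute the correlation between the two:

```python
# Sketch: criterion validity as correlation between a new measure and an established one.
import numpy as np
from scipy.stats import pearsonr

new_tool = np.array([42, 55, 61, 70, 48, 66, 73, 58])   # new workload measure (hypothetical)
nasa_tlx = np.array([40, 58, 60, 75, 45, 62, 78, 55])   # established criterion (hypothetical)

r, p = pearsonr(new_tool, nasa_tlx)
print(f"r = {r:.2f}, p = {p:.3f}")   # a high positive r suggests good criterion validity
```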

Construct or factorial validity is usually adopted when a researcher believes that no valid criterion is available for the research topic under investigation. Construct validity is a validity test of a theoretical construct and examines “What constructs account for variance in test performance?” (Cronbach and Meehl, 1955). In Section 11.4.1.1 we discussed the development of potential theoretical constructs using the grounded theory approach. The last stage of the grounded theory method is the formation of a theory. The theory construct derived from a study needs to be validated through construct validity. From the technical perspective, construct or factorial validity is based on the statistical technique of “factor analysis” that allows researchers to identify the groups of items or factors in a measurement instrument. In a recent study, Suh and her colleagues developed a model for user burden that consists of six constructs and, on top of the model, a User Burden Scale. They used both criterion validity and construct validity to measure the efficacy of the model and the scale (Suh et al., 2016).
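As a sketch of the factor-analytic idea (not the analysis of Suh et al.), one could fit a factor analysis to simulated questionnaire responses in which two groups of items are driven by two latent constructs; the data and item structure below are assumptions made for illustration.

```python
# Sketch: factor analysis identifying which items load on the same underlying construct.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 200
construct_a = rng.normal(size=n)          # latent construct 1
construct_b = rng.normal(size=n)          # latent construct 2
noise = lambda: rng.normal(scale=0.5, size=n)
# Six items: items 0-2 driven by construct A, items 3-5 by construct B.
items = np.column_stack([construct_a + noise(), construct_a + noise(), construct_a + noise(),
                         construct_b + noise(), construct_b + noise(), construct_b + noise()])

fa = FactorAnalysis(n_components=2, random_state=0).fit(items)
print(np.round(fa.components_, 2))   # loadings: rows = factors, columns = items
# Items that share a construct should show large loadings on the same factor.
```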

In HCI research, establishing validity implies constructing a multifaceted argument in favor of your interpretation of the data. If you can show that your interpretation is firmly grounded in the data, you go a long way towards establishing validity. The first step in this process is often the construction of a database (Yin, 2014) that includes all the materials that you collect and create during the course of the study, including notes, documents, photos, and tables. Procedures and products of your analysis, including summaries, explanations, and tabular presentations of data can be included in the database as well.

If your raw data is well organized in your database, you can trace the analytic results back to the raw data, verifying that relevant details behind the cases and the circumstances of data collection are similar enough to warrant comparisons between observations. This linkage forms a chain of evidence, indicating how the data supports your conclusions (Yin, 2014). Analytic results and descriptions of this chain of evidence can be included in your database, providing a roadmap for further analysis.

A database can also provide increased reliability. If you decide to repeat your experiment, clear documentation of the procedures is crucial and careful repetition of both the original protocol and the analytic steps can be a convincing approach for documenting the consistency of the approaches.

Well-documented data and procedures are necessary, but not sufficient for establishing validity. A very real validity concern involves the question of the confidence that you might have in any given interpretive result. If you can only find one piece of evidence for a given conclusion, you might be somewhat wary. However, if you begin to see multiple, independent pieces of data that all point in a common direction, your confidence in the resulting conclusion might increase. The use of multiple data sources to support an interpretation is known as data source triangulation (Stake, 1995). The data sources may be different instances of the same type of data (for example, multiple participants in interview research) or completely different sources of data (for example, observation and time diaries).

Interpretations that account for all—or as much as possible—of the observed data are easier to defend as being valid. It may be very tempting to stress observations that support your pet theory, while downplaying those that may be more consistent with alternative explanations. Although some amount of subjectivity in your analysis is unavoidable, you should try to minimize your bias as much as possible by giving every data point the attention and scrutiny it deserves, and keeping an open mind for alternative explanations that may explain your observations as well as (or better than) your pet theories.

You might even develop some alternative explanations as you go along. These alternatives provide a useful reality check: if you are constantly re-evaluating both your theory and some possible alternatives to see which best match the data, you know when your theory starts to look less compelling (Yin, 2014). This may not be a bad thing—rival explanations that you might never find if you cherry-picked your data to fit your theory may actually be more interesting than your original theory. Whichever explanations best match your data, you can always present them alongside the less successful alternatives. A discussion that shows not only how a given model fits the data but how it is a better fit than plausible alternatives can be particularly compelling.

Well-documented analyses, triangulation, and consideration of alternative explanations are recommended practices for increasing analytic validity, but they have their limits. As qualitative studies are interpretations of complex datasets, they do not claim to have any single, “right” answer. Different observers (or participants) may have different interpretations of the same set of raw data, each of which may be equally valid. Returning to the study of palliative care depicted in Figure 11.2, we might imagine alternative interpretations of the raw data that might have been equally valid: comments about temporal onset of pain and events might have been described by a code “event sequences,” triage and assessment might have been combined into a single code, etc. Researchers working on qualitative data should take appropriate measures to ensure validity, all the while understanding that their interpretation is not definitive.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B978012805390400011X

Exploring How the Attribute Driven Design Method Is Perceived

Nour Ali, Carlos Solis, in Relating System Quality and Software Architecture, 2014

2.4.4 Threats to validity

As our research design is nonexperimental and we cannot make cause-effect statements, internal validity is not contemplated (Mitchell, 2004).

Face validity: The questions were shown to two researchers who were not involved in this research. They indicated that the terms efficiency and productivity, which are often used in TAM questions, are not easy to understand. As a result, the terms were explained in the introduction of the questionnaire.

Content validity: The questionnaire used is based on the established model of TAM for measuring usefulness and ease of use. The items in the questionnaire are similar to the questions used in several studies that have followed TAM.

Criterion validity: We checked whether the results behave according to the theoretical model (TAM). In this case, criterion validity can be checked with Spearman's ρ correlation. The correlations among the variables behave in the theoretically expected way. In addition, other TAM studies have found similar correlations (Davis, 1989).

Construct validity: The internal consistency of the questions was verified with Cronbach's α (a minimal computational sketch of this check and the preceding correlation check appears after these threat descriptions). To minimize bias, the researchers neither expressed opinions to the participants nor conveyed any expectations. The surveys were collected anonymously. In addition, the researchers were not involved in the creation of the ADD, and the results of the study do not affect them directly.

Conclusion validity: The main threat is the small sample used. However, in order to obtain more meaningful results, we used nonparametric tests instead of parametric tests. Because this is an exploratory study, the hypotheses built in this study can be validated in future studies with a richer sample. With respect to the random heterogeneity of subjects, the participants have roughly the same design experience and received the same training in software architecture design.

External validity: The results can be generalized to novice software architects who have received formal training in software architecture design and in the ADD method. We repeated the experiment in order to confirm our initial findings with students. The domain of the project was changed between the two experiments, but both are Web applications with similar characteristics. Untrained architects and experienced architects in practice may have different perceptions than the ones found in this study. However, we believe there are practices common to all business-related (not critical or real-time) domains. There is a threat that the academic context is not similar to an industrial one. In our case, we did not restrict the teams to working at specific hours and times, such as in a lab; the development of the tasks was flexible. The teams were also given a deadline to deliver the architecture documentation, as in the real world.

Ethical validity: The questionnaire questions and the study method were approved by The Research Ethics Committee of the University of Limerick.
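The sketch below illustrates the criterion and construct checks referenced above using hypothetical Likert-scale responses (not the study's data): a Spearman correlation between TAM scale scores and Cronbach's α for each question set.

```python
# Sketch: Spearman correlation between TAM variables and Cronbach's alpha for
# internal consistency, computed on invented Likert-scale responses.
import numpy as np
from scipy.stats import spearmanr

# Hypothetical responses (rows = participants, columns = items) for two TAM scales.
usefulness  = np.array([[5, 4, 5], [4, 4, 3], [2, 2, 3], [5, 5, 4], [3, 3, 3], [4, 5, 4]])
ease_of_use = np.array([[4, 5, 4], [4, 3, 4], [2, 3, 2], [5, 4, 5], [3, 2, 3], [5, 4, 4]])

# Criterion validity check: do scale scores correlate as TAM predicts?
rho, p = spearmanr(usefulness.mean(axis=1), ease_of_use.mean(axis=1))
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a participants x items matrix."""
    k = item_scores.shape[1]
    item_var = item_scores.var(axis=0, ddof=1).sum()
    total_var = item_scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

# Construct validity check: internal consistency of each question set.
print(cronbach_alpha(usefulness), cronbach_alpha(ease_of_use))
```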

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780124170094000028

Indicator: Methodology

K.A. Bollen, in International Encyclopedia of the Social & Behavioral Sciences, 2001

3 Properties of Indicators

Properties of the indicators are useful to both current and future researchers who plan to use them. The two most important properties are the validity and the reliability of the indicators. Validity and reliability have received the greatest attention in the case of measurement models with continuous latent variables and approximately continuous effect indicators. Indicator validity concerns whether the indicator really measures the latent variable it is supposed to measure. Construct validity, criterion validity, and content validity are types of validity that researchers sometimes examine. Construct validity, for instance, assesses whether the indicator is associated with other constructs that it is supposed to relate to and not associated with those it should not. Criterion validity compares the indicator to some standard variable that it should be associated with if it is valid. Content validity examines whether the indicators capture the concept for which the latent variable stands. See Nunnally and Bernstein (1994) for further discussion.

Reliability focuses on the consistency or ‘stability’ of an indicator in its ability to capture the latent variable. It is distinct from validity in that you can have a reliable indicator that does not really measure the latent variable. A general definition of the reliability of an indicator is the ‘true’ (latent variable) variance divided by the total indicator variance. A variety of statistics exist to estimate reliability. One of the most popular for measuring the reliability of several combined effect indicators is Cronbach's (1951) alpha, although it makes a number of assumptions that might be difficult to satisfy in practice. Alternative measures of reliability built from less restrictive assumptions are also available (Bollen 1989).
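A small simulation sketch (not from the chapter) illustrates the definition above: generate a latent true score, add measurement error to form the indicator, and take the ratio of the two variances.

```python
# Sketch: reliability = variance of the latent "true" score / total indicator variance.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
true_score = rng.normal(loc=0.0, scale=1.0, size=n)        # latent variable
indicator = true_score + rng.normal(scale=0.5, size=n)     # indicator = true score + error

reliability = true_score.var() / indicator.var()
print(round(reliability, 3))   # approximately 1 / (1 + 0.5**2) = 0.8 for these variances
```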

IRT assumes a continuous latent trait and a categorical effect indicator, usually dichotomous or ordinal. IRT focuses on other properties of categorical items or indicators such as item discrimination and item difficulty (Hambleton and Swaminathan 1985). IRT is similar to the traditional treatments of reliability and validity in that it too focuses on effect indicators. Latent class or latent structure analysis (Lazarsfeld and Henry 1968) also deals with effect indicators. It applies when we have latent categorical variables with categorical indicators. The degree of classification error of the observed categorical variables provides information on the accuracy of the indicator. The combination of a latent categorical variable with continuous effect indicators is less extensively developed than are the cases of continuous latent variables with continuous or categorical effect indicators.

The measurement properties of causal indicators are less discussed. An important point is that use of the causal indicator assumes that it is the causal indicator that directly influences the latent variable. If the causal indicator itself contains measurement error, then this needs to be part of the measurement model.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B0080430767007099

Conclusions, Future Work

Yun Yang, in Temporal Data Mining Via Unsupervised Ensemble Learning, 2017

Finally, we proposed a weighted clustering ensemble with multiple representations to provide an alternative, feature-based solution to the common problems raised by both previously proposed clustering ensemble models, such as selecting the intrinsic number of clusters, computational cost, and the combination method. The approach consists of three phases. First, temporal data are transformed into different feature spaces, which become the input to the clustering algorithm. Second, the clustering algorithm is applied for clustering analysis. Finally, a clustering ensemble over the different representations is employed, and a weighted consensus function, based on three clustering validity criteria—Modified Huber's T Index (Theodoridis et al., 1999), Dunn's Validity Index (Davies and Bouldin, 1979), and NMI (Vinh et al., 2009)—is used to find an optimal single consensus partition from the multiple partitions based on the different representations. A final agreement function then constructs the final partition from the candidates yielded by the weighted consensus function under the different validity criteria. This representation-based clustering ensemble model offers four major benefits:

1.

Through representation, the complex structures of temporal data with variable length and high dimensionality are transformed into lower-fixed dimensional feature spaces, significantly reducing computational burden, which has been demonstrated on the motion trajectories database (CAVIAR) in terms of execution time shown in Table 7.4.

2.

We see a high capability for capturing the properties of temporal data, as well as synergy from reconciling diverse partitions over different representations. This was initially demonstrated on a synthetic 2D data set, used as motivation in Section 7.2.1, with a visualization of the experimental results shown in Fig. 7.1. Moreover, a set of experiments on the time series benchmarks shown in Table 7.1 and on the motion trajectories database (CAVIAR) shown in Fig. 7.5 demonstrated the benefit of using different representations compared with using a single representation alone.

3.

The weighted consensus function shows an outstanding ability in automatic model selection and appropriate grouping of complex temporal data. This was initially demonstrated on a complex Gaussian-generated 2D data set shown in Fig. 7.2, used as motivation in Section 7.2.1, and then in a set of experiments on the time series benchmarks: Table 7.1 compares against standard temporal data clustering algorithms, Table 7.2 against three state-of-the-art ensemble learning algorithms, and Table 7.3 against the other proposed clustering ensemble models on the motion trajectories database (CAVIAR).

4.

There is enhanced flexibility in combination with most existing clustering algorithms. As a general ensemble learning framework, K-means, hierarchical clustering, and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) have each been employed as the base learner of the proposed clustering ensemble model, and each has shown promising results on the collection of time series benchmarks shown in Table 7.1. The proposed clustering ensemble model has also been successfully applied to online time-series stream clustering, as demonstrated on the Physiological Data Modeling Contest Workshop data set in Table 7.6 and Fig. 7.7.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128116548000087

Temporal Data Clustering

Yun Yang, in Temporal Data Mining Via Unsupervised Ensemble Learning, 2017

3.3.7 Normalized Mutual Information

Normalized mutual information (NMI) (Vinh et al., 2009) is proposed to measure the consistency between any two partitions; it indicates the amount of information (shared structured objects) common to the two partitions. Given a set of partitions $\{P_t\}_{t=1}^{T}$ obtained from a target data set, the NMI-based clustering validity criterion for an assessed partition $P_a$ is determined by the summation of the NMI between the assessed partition $P_a$ and each individual partition $P_t$. A high NMI value therefore represents a well-accepted partition and indicates the intrinsic structure of the target data set. However, this approach is biased toward highly correlated partitions and favors balanced structures in the data set. The NMI is calculated as follows:

(3.19) $\mathrm{NMI}(P_a,P_b)=\dfrac{\sum_{i=1}^{K_a}\sum_{j=1}^{K_b} N_{ij}^{ab}\,\log\!\left(\dfrac{N\,N_{ij}^{ab}}{N_i^a N_j^b}\right)}{\sum_{i=1}^{K_a} N_i^a\log\!\left(\dfrac{N_i^a}{N}\right)+\sum_{j=1}^{K_b} N_j^b\log\!\left(\dfrac{N_j^b}{N}\right)}$

(3.20) $\mathrm{NMI}(P)=\sum_{t=1}^{T}\mathrm{NMI}(P,P_t)$

Here $P_a$ and $P_b$ are labelings for two partitions that divide a data set of $N$ objects into $K_a$ and $K_b$ clusters, respectively. $N_{ij}^{ab}$ is the number of shared objects between clusters $C_i^a \in P_a$ and $C_j^b \in P_b$, where there are $N_i^a$ and $N_j^b$ objects in $C_i^a$ and $C_j^b$, respectively.
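A minimal sketch of the consensus criterion in Eq. (3.20), under the assumption that sklearn's NMI is used for the pairwise term (its normalization may differ in detail from Eq. (3.19)), and using toy partitions invented for illustration:

```python
# Sketch: NMI-based consensus score of Eq. (3.20) over a small set of partitions.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def consensus_nmi(assessed, partitions):
    """Sum of NMI between an assessed partition and each partition P_t."""
    return sum(normalized_mutual_info_score(assessed, p) for p in partitions)

# Toy example: three partitions of N = 6 objects.
partitions = [np.array([0, 0, 1, 1, 2, 2]),
              np.array([0, 0, 1, 1, 1, 2]),
              np.array([1, 1, 0, 0, 2, 2])]
scores = [consensus_nmi(p, partitions) for p in partitions]
best = partitions[int(np.argmax(scores))]   # partition most consistent with the ensemble
print(scores, best)
```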

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128116548000038

Affective facial computing: Generalizability across domains

Jeffrey F. Cohn, ... Zakia Hammal, in Multimodal Behavior Analysis in the Wild, 2019

19.4 Reliability and performance

Reliability of measurement. Reliability in the context of AFC refers to the extent to which labels from different sources (but of the same images or videos) are consistent. A useful distinction can be made between inter-observer reliability and inter-system reliability.

Inter-observer reliability refers to the extent to which labels assigned by different human annotators are consistent with one another. For example, inter-observer reliability is high if the annotators tended to assign images or videos the same labels (e.g., AUs). Note that, for inter-observer reliability, the “true” label of the image or video is often not knowable, so we are primarily interested in how much the annotators agreed with one another. Because such labels are used to train and evaluate supervised learning systems, inter-observer reliability matters.

Inter-system reliability refers to the extent to which labels assigned by AFC systems are consistent with labels assigned by human annotators. Inter-system reliability is also called “criterion validity” as the human labels are taken to be the gold standard or criterion measure. Inter-system reliability is the primary measure for the performance of an AFC system.

Inter-observer reliability of training data likely serves as an upper bound for what inter-system reliability is possible, and inter-observer reliability often exceeds inter-system reliability by a considerable margin [27–30]. To fulfill the goal of creating an AFC system that is interchangeable with (or perhaps even more accurate and consistent than) a trained human annotator, both forms of reliability must be maximized. Furthermore, the generalizability of the system (i.e., its inter-system reliability in novel domains) must be maximized.

Level of measurement. AFC systems typically analyze behaviors in single images or video frames, and reliability is calculated on this level of measurement. In the studies reviewed below, frame-level performance is almost always the focus. However, other levels of measurement are also possible and evaluating reliability on these levels may be appropriate for certain tasks or applications. For instance, behavioral events, which span multiple contiguous frames, may be the focus when the dynamic unfolding of behavior is of interest [31–34]. Alternatively, when behavioral tendencies over longer periods of time are of interest, a more molar approach that aggregates many behaviors (e.g., into counts, means, or proportions) may be appropriate [35]. Note that reliability may differ between levels of measurement. Aggregated annotations are often more reliable than frame-level annotations [27], but they are also less detailed. It is important to match the analyzed level of measurement to the particular use-case of the system.

Metrics for quantifying reliability. Many metrics have been proposed for estimating reliability. When categorical labels are used, percentage agreement or accuracy (i.e., the proportion of objects that were assigned the same label) is an intuitive and popular option. However, accuracy is a poor choice when the categories are highly imbalanced, such as when a facial behavior has a very high (or very low) occurrence rate and the algorithm is trying to predict when the behavior did and did not occur. In such cases, a naïve algorithm that simply guessed that every image or video contained (or did not contain) the behavior would have a high accuracy.

To attempt to resolve this issue, a number of alternative metrics have been developed including the F-score, receiver operator characteristic (ROC) curve analyses, and various chance-adjusted agreement measures. In this paper, we focus on the three most popular metrics: accuracy, the F1 score, and 2AFC. Accuracy, as stated earlier, is the percentage of agreement. The F1 score or balanced F-score is the harmonic mean of precision and recall. Finally, 2AFC is a resampling-based estimate of the area under the receiver operating characteristic (ROC) curve.
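A minimal sketch of the three metrics, using hypothetical frame-level labels and scores (2AFC is approximated here by the area under the ROC curve, which it estimates):

```python
# Sketch: accuracy, F1, and ROC AUC (a stand-in for 2AFC) on invented binary labels.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

human  = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])            # annotator labels (AU present/absent)
system = np.array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0])            # system's binary predictions
scores = np.array([.1, .2, .6, .3, .1, .2, .4, .9, .4, .2])  # system's continuous scores

print(accuracy_score(human, system))   # can be inflated when one class dominates
print(f1_score(human, system))         # harmonic mean of precision and recall
print(roc_auc_score(human, scores))    # area under the ROC curve (what 2AFC estimates)
```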

When dimensional labels are used, correlation coefficients (i.e., standardized covariances) are popular options [36]. The Pearson correlation coefficient (PCC) is a linearity index that quantifies how well two vectors can be equated using a linear transformation (i.e., with the addition of a constant and scalar multiplication). The consistency intra-class correlation coefficients (also known as ICC-C) are additivity indices that quantify how well two vectors can be equated with only the addition of a constant. Finally, the agreement intra-class correlation coefficients (also known as ICC-A) are identity indices that quantify how well two vectors can be equated without transformation. The choice of correlation type should depend on how measurements are obtained and how they will be used.
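The sketch below computes the three correlation types for two hypothetical sets of dimensional ratings, using the common two-way mean-squares formulation of the single-measure ICCs (an assumption about the computation, not the authors' code):

```python
# Sketch: PCC, consistency ICC (ICC-C), and agreement ICC (ICC-A) for two raters.
import numpy as np

x = np.column_stack([np.array([2.0, 3.1, 4.2, 5.0, 3.8, 1.9]),    # e.g., human ratings
                     np.array([2.4, 3.0, 4.6, 5.5, 4.1, 2.3])])   # e.g., system ratings
n, k = x.shape
grand = x.mean()
row_means, col_means = x.mean(axis=1), x.mean(axis=0)

msr = k * ((row_means - grand) ** 2).sum() / (n - 1)    # mean square between targets
msc = n * ((col_means - grand) ** 2).sum() / (k - 1)    # mean square between raters
mse = ((x - row_means[:, None] - col_means[None, :] + grand) ** 2).sum() / ((n - 1) * (k - 1))

pcc = np.corrcoef(x[:, 0], x[:, 1])[0, 1]                              # linearity
icc_c = (msr - mse) / (msr + (k - 1) * mse)                            # consistency (ICC-C)
icc_a = (msr - mse) / (msr + (k - 1) * mse + k / n * (msc - mse))      # agreement (ICC-A)
print(round(pcc, 3), round(icc_c, 3), round(icc_a, 3))
```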

The existence and use of so many different metrics makes comparison between studies and approaches quite difficult. Different metrics are not similarly interpretable and may behave differently in response to imbalanced categories (Fig. 19.2) [37]. As such, we compare performance scores within metrics but never across them, and we acknowledge that differences in occurrence rates between studies may unavoidably confound some comparisons.


Figure 19.2. The behavior of different metrics using simulated classifiers. The horizontal axis depicts the skew ratio while the vertical axis shows the given metric score. The different lines show the relative misclassification rates of the simulated classifiers. Adapted from [37].

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780128146019000262

Content Analysis and Television

Erica Scharrer, in Encyclopedia of Social Measurement, 2005

Reliability and Validity in Studies of Television Content

Definitions

Studies that employ the method of content analysis to examine television content are guided by the ideals of reliability and validity, as are many research methods. Reliability has to do with whether the use of the same measures and research protocols (e.g., coding instructions, coding scheme) time and time again, as well as by more than one coder, will consistently result in the same findings. If so, those results can be deemed reliable because they are not unique to the subjectivity of one person's view of the television content studied or to the researcher's interpretations of the concepts examined.

Validity refers primarily to the closeness of fit between the ways in which concepts are measured in research and the ways in which those same concepts are understood in the larger, social world. A valid measure is one that appropriately taps into the collective meanings that society assigns to concepts. The closer the correspondence between operationalizations and complex real-world meanings, the more socially significant and useful the results of the study will be. In content analysis research of television programming, validity is achieved when samples approximate the overall population, when socially important research questions are posed, and when both researchers and laypersons would agree that the ways that the study defined major concepts correspond with the ways that those concepts are really perceived in the social world.

Validity: Categories and Indicators

Valid measures of general concepts are best achieved through the use of multiple indicators of the concept in content analysis research, as well as in other methods. A study of whether television commercials placed during children's programming have “healthy” messages about food and beverages poses an example. There are certainly many ways of thinking about what would make a food or beverage “healthy.” Some would suggest that whole categories of foods and beverages may be healthy or not (orange juice compared to soda, for instance). Others would look at the amount of sugar or perhaps fat in the foods and beverages to determine how healthy they were. Still others would determine healthiness by documenting whether the foods and beverages contain vitamins and minerals. The content analysis codes or categories used to measure the healthiness of the foods and beverages shown in commercials would ideally reflect all of these potential indicators of the concept. The use of multiple indicators bolsters the validity of the measures implemented in studies of content because they more closely approximate the varied meanings and dimensions of the concept as it is culturally understood.

There are two major types of validity. External validity has to do with the degree to which the study as a whole or the measures employed in the study can be generalized to the real world or to the entire population from which the sample was drawn. It is established through sampling as well as through attempts to reduce artificiality. An example of the latter is having coders make some judgments by watching television content only once, rather than stopping and starting a videotaped program multiple times, in order to approximate how the content would be experienced by actual viewing audiences. The other type of validity is internal validity, which refers to the closeness of fit between the meanings of the concepts that we hold in everyday life and the ways those concepts are operationalized in the research. The validity of concepts used in research is determined by their prima facie correspondence to the larger meanings we hold (face validity), the relationship of the measures to other concepts that we would expect them to correlate with (construct validity) or to some external criterion that the concept typically predicts (criterion or predictive validity), and the extent to which the measures capture multiple ways of thinking of the concept (content validity).

Reliability: Training, Coding, and Establishing Intercoder Agreement

A particular strength of content studies of television is that they provide a summary view of the patterns of messages that appear on the screens of millions of people. The goal of a content analysis is that these observations are universal rather than significantly swayed by the idiosyncratic interpretations or points of view of the coder. Researchers go to great lengths to ensure that such observations are systematic and methodical rather than haphazard, and that they strive toward objectivity. Of course, true objectivity is a myth rather than a reality. Yet, content analysis research attempts to minimize the influence of subjective, personal interpretations.

In order to achieve this aim, multiple coders are used in content analysis to perform a check on the potential for personal readings of content by the researcher, or for any one of the coders to unduly shape the observations made. Such coders must all be trained to use the coding scheme to make coding decisions in a reliable manner, so that the same television messages being coded are dealt with the same way by each coder each time they are encountered. Clear and detailed instructions must be given to each coder so that difficult coding decisions are anticipated and a procedure for dealing with them is in place and is consistently employed. Most likely, many pretests of the coding scheme and coding decisions will be needed and revisions will be made to eliminate ambiguities and sources of confusion before the process is working smoothly (i.e., validly and reliably). Researchers often limit the amount of coding to be done by one coder in one sitting because the task may get tiresome, and reliable, careful thought may dwindle over time.

In addition to training coders on how to perform the study, a more formal means of ensuring reliability—calculations of intercoder reliability—is used in content analysis research. The purpose of intercoder reliability is to establish mathematically the frequency with which multiple coders agree in their judgments of how to categorize and describe content. In order to compute intercoder reliability, the coders must code the same content to determine whether and to what extent their coding decisions align. Strategies for determining how much content to use for this purpose vary, but a general rule of thumb is to have multiple coders overlap in their coding of at least 10% of the sample. If they agree sufficiently in that 10%, the researcher is confident that each can code the rest of the sample independently because a systematic coding protocol has been achieved.

A number of formulas are used to calculate intercoder reliability. Holsti's coefficient is a fairly simple calculation, deriving a percent agreement from the number of items coded by each coder and the number of times they made the exact same coding decision. Other researchers use Pearson's correlation to determine the association between the coding decisions of one coder compared to another (or multiple others). Still other formulas, such as Scott's pi, take chance agreement into consideration. There is no set standard regarding what constitutes sufficiently high intercoder reliability, although most published accounts do not fall below 70–75% agreement.
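The sketch below computes two of these indices, Holsti's percent agreement and Scott's pi, for two coders' hypothetical coding decisions (the categories and decisions are invented for illustration):

```python
# Sketch: Holsti's coefficient (percent agreement) and Scott's pi for two coders.
import numpy as np

coder1 = np.array(["pos", "neg", "pos", "neu", "pos", "neg", "neu", "pos"])
coder2 = np.array(["pos", "neg", "neu", "neu", "pos", "pos", "neu", "pos"])
n = len(coder1)

# Holsti: 2M / (N1 + N2); with both coders coding the same items, this is percent agreement.
holsti = (coder1 == coder2).mean()

# Scott's pi corrects for chance agreement using the pooled category proportions.
categories, counts = np.unique(np.concatenate([coder1, coder2]), return_counts=True)
p_expected = ((counts / (2 * n)) ** 2).sum()
scotts_pi = (holsti - p_expected) / (1 - p_expected)

print(round(holsti, 3), round(scotts_pi, 3))
```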

Balancing Validity and Reliability

In studies of television content, the goals of establishing validity and reliability must be balanced. Measures used in content analysis research could be reliable but not valid if they repeatedly uncover the same patterns of findings, but those findings do not adequately capture the concepts they are intended to measure. Furthermore, attempts to approximate in research designs the complex understandings of concepts held in the social world often strengthen validity at the same time that they threaten reliability, because the resulting measures are more nuanced and less transparent.

Consider an example in a study of television news coverage of presidential elections. The researcher wants to determine what proportion of the newscast is devoted to coverage of the presidential candidates during election season, as well as whether those candidates receive positive or negative coverage. The former portion of the research question would be relatively straightforward to study and would presumably be easily and readily agreed on by multiple coders. All of the items in the newscast could be counted and the number of items devoted to the presidential candidates could be compared to the total number (similarly, stories could be timed). The latter part of the research question, however, is likely to be less overt and relies instead on a judgment to be made by coders, rather than a mere observation of the conspicuous characteristics of the newscast. Indeed, if the researcher were to operationalize the tone of the coverage on a scale of 1 (very negative) to 5 (very positive), the judgments called for become more finely distinct, and agreement, and therefore reliability, may be compromised. On the other hand, that type of detailed measure enhances validity because it acknowledges that news stories can present degrees of positivity or negativity that are meaningful and potentially important with respect to how audiences actually respond to the stories.

The straightforward, readily observed, overt types of content for which coders use denotative meanings to make coding decisions are called “manifest” content. The types of content that require what Holsti in 1969 referred to as “reading between the lines,” or making inferences or judgments based on connotative meanings, are referred to as “latent” content. The former maximizes reliability and the latter maximizes validity. Although scholars using the method have disagreed about the best way to proceed, many suggest that it is useful to investigate both types of content and to balance their presence in a coding scheme. Coders must be trained especially well for making decisions based on latent meaning, however, so that coding decisions remain consistent within and between coders.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B0123693985005053

What are the four types of validity?

Construct validity
Content validity
Face validity
Criterion validity

What type of validity is predictive validity?

Predictive validity is a type of criterion validity, which refers to how well the measurement of one variable can predict the response of another variable.

Which type of validity measures the relationship between a measure and an outcome?

What is criterion validity? Criterion validity is used to measure the correlation between the outcome of the criterion measurement and that of your own measurement.

What are the 3 main types of measurement validity?

Here we consider three basic kinds: face validity, content validity, and criterion validity.