Assessing the suitability of forensic authorship analysis methodologies for speech data

James Tompkinson; Andrea Nini

doi:10.5281/zenodo.16308151

Abstract

Developing new analytical methods and frameworks that could be integrated into forensic speaker comparison (FSC) is a core focus of research in forensic speech science. In this paper, we explore whether methods used in forensic authorship analysis (FAA) can be applied to speech data. Our work addresses two questions: 1) whether methods borrowed from authorship analysis can be used to analyse discrete phonetic variables within a likelihood-ratio framework, and 2) whether combining auditory phonetic analysis with “higher-order” features (Gold and French 2011) such as lexis, grammar and morphology, which are frequently considered in FAA tasks, can be used for speaker comparison. Our work builds on research by Sergidou et al. (2023), who showed that frequent words have some speaker-discriminatory power and argued that this could be useful in FSC casework. We extend this work to examine how phonetic variation can be incorporated into such a framework. We analysed transcribed speech data from a random sample of 30 speakers from the West Yorkshire Regional English Database (Gold 2020) across two speaking styles (Task 1 and Task 2), using two well-known authorship analysis methods that incorporate the likelihood ratio (LR) framework: Cosine Delta (Ishihara 2021) and Phi n-gram tracing (Nini 2023). We applied these methods to transcripts that had been adapted to represent a range of phonetic features (vocalised hesitation markers, syllable-initial realisations of /θ/, intervocalic word-medial /t/, syllable-initial /l/ and realisations of the -ing suffix) to assess 1) whether algorithms used in FAA are similarly effective on phonetic feature sets of this kind and 2) whether combining “higher-order” linguistic features with segmental phonetic analysis would achieve greater speaker-discriminatory power.
Our findings support previous research suggesting that methods used to discriminate between authors can be usefully applied to transcribed speech data. We find that Cosine Delta and n-gram tracing are both effective for speaker comparison on transcribed speech data. In addition, our results show that a logistic-regression-calibrated Cosine Delta using the consonant phonetic features alone already offers valuable information. The analytical framework for this project, in which phonetic information is embedded in transcripts and then subjected to authorship analysis techniques under the likelihood ratio paradigm, could potentially be used to evaluate auditory phonetic variables systematically within a likelihood-ratio approach, even when the phonetic features are discrete.
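
As a rough illustration of the kind of score involved, the sketch below computes one common formulation of Cosine Delta: the cosine distance between z-scored relative frequencies of the most frequent tokens. It is a minimal sketch, not the authors' implementation. The function names, the `n_features` parameter, the toy transcripts and the bracketed tags such as `think[f]` are all assumptions made for this example; the abstract does not specify how phonetic features were encoded in the transcripts, and the calibration of scores into likelihood ratios is not shown.

```python
import math
from collections import Counter


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary item in one token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in vocab]


def cosine_delta(sample_a, sample_b, reference_samples, n_features=50):
    """Cosine distance between z-scored relative-frequency profiles.

    Returns a value between 0 (same profile direction) and 2 (opposite).
    """
    if len(reference_samples) < 2:
        raise ValueError("need at least two reference samples for z-scoring")

    # Feature set: the most frequent tokens across the reference samples.
    pooled = Counter(tok for sample in reference_samples for tok in sample)
    vocab = [w for w, _ in pooled.most_common(n_features)]

    # Mean and sample standard deviation of each feature in the reference set.
    ref = [relative_freqs(sample, vocab) for sample in reference_samples]
    n = len(ref)
    columns = list(zip(*ref))
    means = [sum(col) / n for col in columns]
    sds = [math.sqrt(sum((x - m) ** 2 for x in col) / (n - 1))
           for col, m in zip(columns, means)]
    keep = [i for i, sd in enumerate(sds) if sd > 0]  # skip constant features

    def z_scores(sample):
        freqs = relative_freqs(sample, vocab)
        return [(freqs[i] - means[i]) / sds[i] for i in keep]

    za, zb = z_scores(sample_a), z_scores(sample_b)
    norm = math.sqrt(sum(x * x for x in za)) * math.sqrt(sum(y * y for y in zb))
    if norm == 0:
        raise ValueError("a sample has no variation on the selected features")
    return 1.0 - sum(x * y for x, y in zip(za, zb)) / norm


# Toy transcripts with hypothetical phonetic tags: "think[f]" marks a /θ/
# realised as [f], "going[n]" an alveolar realisation of the -ing suffix.
transcripts = [
    "erm I think[f] we are going[n] to the shop".split(),
    "uh I think[θ] we are going[ŋ] to the park".split(),
    "erm they think[f] it is going[ŋ] to rain".split(),
]

score = cosine_delta(transcripts[0], transcripts[1], transcripts)
print(f"Cosine Delta between transcript 0 and 1: {score:.3f}")
```

Because the phonetic realisation is part of the token, a speaker's choice between variants enters the frequency profile alongside their lexical choices, which is one way to read the idea of embedding phonetic information in transcripts.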

References
Gold, E. (2020). WYRED - West Yorkshire Regional English Database 2016-2019. [data collection]. UK Data Service. SN: 854354, DOI: 10.5255/UKDA-SN-854354
Ishihara, S. (2021). Score-based likelihood ratios for linguistic text evidence with a bag-of-words model. Forensic Science International, 327, 110980.
Nini, A. (2023). A Theory of Linguistic Individuality for Authorship Analysis. Elements in Forensic Linguistics. Cambridge University Press.
Sergidou, E. K., Scheijen, N., Leegwater, J., Cambier-Langeveld, T., & Bosma, W. (2023). Frequent-words analysis for forensic speaker comparison. Speech Communication, 150, 1-8.

Original language	English
Number of pages	29
DOIs	https://doi.org/10.5281/zenodo.16308151
Publication status	Published - 21 Jul 2025
Event	Annual Conference of the International Association for Forensic Phonetics and Acoustics - Leiden University, The Hague, Netherlands
Duration	20 Jul 2025 → 23 Jul 2025
Conference number	33
Internet address	https://www.universiteitleiden.nl/en/events/2025/07/iafpa

Conference

Conference	Annual Conference of the International Association for Forensic Phonetics and Acoustics
Abbreviated title	IAFPA
Country/Territory	Netherlands
City	The Hague
Period	20/07/25 → 23/07/25
Internet address	https://www.universiteitleiden.nl/en/events/2025/07/iafpa

Access to Document

10.5281/zenodo.16308151

Evaluating the usefulness of embedding phonetic representations into an authorship analysis-based framework for the comparison of spoken data
Tompkinson, J. & Nini, A., 26 Jun 2024.
Research output: Contribution to conference › Abstract › peer-review

Open Access
Likelihood ratio based authorship verification methods applied to forensic voice comparison tasks
Brown, G., Nini, A. & Kirchhübel, C., 2024.
Research output: Contribution to conference › Abstract › peer-review

Open Access
A Theory of Linguistic Individuality for Authorship Analysis
Nini, A., 15 Jun 2023, Cambridge University Press. (Elements in Forensic Linguistics)
Research output: Book/Report › Book › peer-review

Forensic linguistic authorship analysis of disputed texts
Nini, A. (Participant)
Impact: Legal impacts, Societal impacts

Cite this

@conference{eb55da1295ff49289980c54f00678bcc,
title = "Assessing the suitability of forensic authorship analysis methodologies for speech data",
author = "James Tompkinson and Andrea Nini",
year = "2025",
month = jul,
day = "21",
doi = "10.5281/zenodo.16308151",
language = "English",
note = "Annual Conference of the International Association for Forensic Phonetics and Acoustics, IAFPA ; Conference date: 20-07-2025 Through 23-07-2025",
url = "https://www.universiteitleiden.nl/en/events/2025/07/iafpa",
}

TY - CONF
T1 - Assessing the suitability of forensic authorship analysis methodologies for speech data
AU - Tompkinson, James
AU - Nini, Andrea
N1 - Conference code: 33
PY - 2025/7/21
Y1 - 2025/7/21
U2 - 10.5281/zenodo.16308151
DO - 10.5281/zenodo.16308151
M3 - Abstract
T2 - Annual Conference of the International Association for Forensic Phonetics and Acoustics
Y2 - 20 July 2025 through 23 July 2025
ER -
