Relationship Extraction from biomedical text corpora
1. Bird's Eye View
2. Text processing
2.1 Input datasets
We use all PMC articles until (must check) for our analysis. The dataset is available at http://onto.eva.mpg.de/tm-results/pmc-split.tar.gz
The dataset is completely stripped of all XML tags using Attach:strip.sh
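The contents of strip.sh are not reproduced on this page. As a rough illustration of what stripping XML tags from an article involves, here is a minimal Python sketch; the regular expression, the function name `strip_xml_tags`, and the sample string are assumptions made for this example and are not taken from the script.

```python
import html
import re

# Anything between "<" and ">" is treated as a tag. This is a crude
# approximation, not a real XML parser and not the logic of strip.sh.
TAG_RE = re.compile(r"<[^>]+>")

def strip_xml_tags(text):
    """Remove tag-like spans, decode character entities, collapse whitespace."""
    without_tags = TAG_RE.sub(" ", text)
    decoded = html.unescape(without_tags)
    return " ".join(decoded.split())

sample = "<article><title>Gene &amp; disease</title><p>Some text.</p></article>"
print(strip_xml_tags(sample))
```

Replacing each tag with a space rather than the empty string keeps words from adjacent elements from being glued together.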
We used the following table of ontology terms: Attach:em.txt.gz
3. Enrichment
4. Tests
The tests used in the course of this project depend on several distributions. Let
- $Cat(O)$ = {C | C is a category in the ontology $O$},
- $N$ be the number of permutations,
- $score^{n}(C,D)$ be the score between $C$ and $D$ in the $n^{th}$ permutation,
- $NQ(x,C,D)=P(score^n(C,D)\leq x)$, $1\leq n\leq N$, be the cumulative distribution function (CDF) of $score(C, D)$.
- $nq$ be the quantile of the score distribution of the cumulated score functions,
- $DQ^{j}(x,C,D)=P(score^n(C,D)-score^n(C_j,D)\leq x)$, $1\leq n \leq N$, be the CDF of the difference between the category $C$ and its $j^{th}$ daughter category,
- $d^j$ be the score difference $score(C,D)-score(C_j, D)$,
- $MQ^{j}(x,C,D)=P(score^n(C_j,D)-score^n(C,D)\leq x)$, $1\leq n \leq N$, be the CDF of the score difference between the category $C$ and its $j^{th}$ mother category,
- $VQ_{NQ}(x)=P(Var(NQ(x,X_1, X_2))\leq x)$, for all $X_1 \in Cat(O_1)$ and $X_2 \in Cat(O_2)$, be the CDF of the variances $Var$ of the distribution $NQ(x,X_1,X_2)$, and $VQ_{DQ}$ and $VQ_{MQ}$ for the distributions $DQ^{j}(x,X_1,X_2)$ and $MQ^{j}(x,X_1,X_2)$ respectively.
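The page does not state how these distributions are estimated from the $N$ permutations. One natural reading, given here as an assumption, is the empirical CDF over the permuted scores, with the quantile $nq$ of an observed score obtained by evaluating that estimate at the observed value:

```latex
\widehat{NQ}(x,C,D) = \frac{1}{N}\sum_{n=1}^{N} \mathbf{1}\left[\, score^{n}(C,D) \leq x \,\right],
\qquad
nq = \widehat{NQ}\bigl(score(C,D),\,C,\,D\bigr)
```

Under the same assumption, $DQ^{j}$ and $MQ^{j}$ would be estimated by replacing $score^{n}(C,D)$ with the corresponding score difference in the $n^{th}$ permutation.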
Based on these distributions, we designed six tests. These tests are one-sided; the final tests combine two test results (see paper).
5. Implementation
Here we describe the various scripts that we use for the analysis and how they fit together.
5.1 Requirements
We require a JRE 1.5 and Groovy 1.5.
The following Java libraries are required:
5.2 CVS Repository
The CVS repository can be found at Savannah. For CVS access, use
- cvs -z3 -d:pserver:anonymous@cvs.savannah.nongnu.org:/sources/gfo co cm
- or this link for WebCVS access.
5.3 OboReader
The OboReader is a parser for a part of the OBO Flatfile Format.
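Which part of the format the OboReader covers is not listed here. As an illustration only, the following Python sketch reads `[Term]` stanzas and their `id`, `name`, and `is_a` tags, which are standard OBO flat-file constructs; the function name `parse_obo_terms` and the `EX:` identifiers are hypothetical, and this is not the OboReader code.

```python
def parse_obo_terms(lines):
    """Return {term id: {"name": str or None, "is_a": [parent ids]}}."""
    terms = {}
    current = None  # dict of the [Term] stanza being read, or None elsewhere
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A stanza header; only [Term] stanzas are collected.
            current = {"name": None, "is_a": []} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue  # file header, other stanza types, blank lines
        tag, _, value = line.partition(":")
        value = value.strip()
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            # Drop the trailing "! human-readable comment", if any.
            current["is_a"].append(value.split("!", 1)[0].strip())
    return terms

example = """format-version: 1.2

[Term]
id: EX:0000001
name: root category

[Term]
id: EX:0000002
name: child category
is_a: EX:0000001 ! root category

[Typedef]
id: part_of
name: part of
""".splitlines()

terms = parse_obo_terms(example)
print(terms)
```

All other tags (synonyms, definitions, relationships) are silently ignored in this sketch.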
5.4 Collocation extraction
The library uses the output of the OboReader and a text corpus (contained in a single folder; subfolders are not processed) to compute the frequencies of categories in the text and the number of times they collocate. Attach:Collocation.zip
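The counting step can be sketched as follows. This is a simplified Python illustration, not the attached library: it assumes that a category is matched by a case-insensitive substring search for its label and that two categories collocate when they appear in the same document. Neither assumption is confirmed by this page, and the function name and sample labels are hypothetical.

```python
import os
import tempfile
from collections import Counter
from itertools import combinations

def count_collocations(folder, names):
    """names maps category id -> label. Returns (occurrences, doc_freq, co_docs)."""
    occurrences = Counter()  # total number of matches per category
    doc_freq = Counter()     # number of documents mentioning each category
    co_docs = Counter()      # number of documents mentioning both categories
    for entry in sorted(os.listdir(folder)):
        path = os.path.join(folder, entry)
        if not os.path.isfile(path):
            continue  # subfolders are not processed
        with open(path, encoding="utf-8", errors="replace") as handle:
            text = handle.read().lower()
        present = []
        for cat_id, label in names.items():
            hits = text.count(label.lower())
            if hits:
                occurrences[cat_id] += hits
                doc_freq[cat_id] += 1
                present.append(cat_id)
        for pair in combinations(sorted(present), 2):
            co_docs[pair] += 1
    return occurrences, doc_freq, co_docs

# Tiny throwaway corpus to exercise the function.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "a.txt"), "w", encoding="utf-8") as f:
        f.write("Kinase activity at the membrane; the kinase is active.")
    with open(os.path.join(folder, "b.txt"), "w", encoding="utf-8") as f:
        f.write("A membrane protein.")
    os.mkdir(os.path.join(folder, "sub"))  # skipped, like any subfolder
    names = {"EX:1": "kinase", "EX:2": "membrane"}
    occurrences, doc_freq, co_docs = count_collocations(folder, names)

print(occurrences, doc_freq, co_docs)
```

Substring matching also counts inflected forms such as plurals; a real matcher would need tokenization and synonym handling.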
5.5 Generating random distributions
The attached Groovy scripts use the TM results we get from the collocations, randomly distribute them throughout the graphs, and re-calculate the scores. They output three values: the scores (normalized or non-normalized), the differences in scores to the predecessors, and the differences to the successors in the graph.
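A single permutation round might look roughly like the Python sketch below. It is heavily simplified and rests on assumptions: the scoring function is not defined on this page, so the sketch shuffles already computed scores among the category pairs instead of redistributing raw counts and re-scoring, and it computes one difference per mother-daughter edge. All names and values are hypothetical, and this is not the attached Groovy code.

```python
import random

def permute_scores(scores, rng):
    """Reassign the observed values to the same (C, D) keys in random order."""
    keys = sorted(scores)
    values = [scores[key] for key in keys]
    rng.shuffle(values)
    return dict(zip(keys, values))

def edge_differences(scores, daughters):
    """Map (mother, daughter, D) -> score(mother, D) - score(daughter, D).

    daughters maps a category to the list of its direct daughter categories.
    """
    diffs = {}
    for (cat, other), value in scores.items():
        for child in daughters.get(cat, []):
            if (child, other) in scores:
                diffs[(cat, child, other)] = value - scores[(child, other)]
    return diffs

scores = {("A", "X"): 5.0, ("B", "X"): 2.0, ("C", "X"): 1.0}
daughters = {"A": ["B", "C"]}

rng = random.Random(0)  # fixed seed so the example is repeatable
permuted = permute_scores(scores, rng)
print(permuted)
print(edge_differences(permuted, daughters))
```

Repeating the round $N$ times and storing the scores and differences per key yields the samples from which distributions such as $NQ$, $DQ$, and $MQ$ can be estimated.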
5.6 Generate distributions from randomization
These scripts take four arguments: the results obtained from text mining, a directory containing the random distribution files, and the names of the output files for pdpl and for ndpl. They calculate the quantile, the variance, and the arithmetic mean of the measured values with respect to the random distributions. The first script is intended for score files (two nodes as key), the second for diff files (three nodes as key).
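For a single key, the three statistics can be computed as in the following Python sketch. It assumes that the quantile is the fraction of permuted values less than or equal to the measured value and that the variance is the population variance; the page specifies neither, and the function name `summarize` is hypothetical.

```python
from statistics import mean, pvariance

def summarize(measured, permuted):
    """Return (quantile, mean, variance) of `measured` against `permuted`."""
    if not permuted:
        raise ValueError("need at least one permuted value")
    # Share of permuted values that do not exceed the measured value.
    quantile = sum(1 for value in permuted if value <= measured) / len(permuted)
    return quantile, mean(permuted), pvariance(permuted)

print(summarize(3.0, [1.0, 2.0, 3.0, 4.0]))  # (0.75, 2.5, 1.25)
```

The same function applies to score files and diff files; only the key (pair or triple of nodes) under which the result is stored differs.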
5.7 Calculate final test results
The tests are calculated using the script
6. Results
The best of our results can also be found in the BOWiki.
The collocation analysis of the PMC corpus results in two files:
- collocations.txt contains the number of common occurrences of two categories.
- frequencies.txt contains the number of occurrences of each category.
The first value is the total number of occurrences, the second the number of documents of (co-)occurrence.
The final results of the six tests can be found at http://onto.eva.mpg.de/tm-new/output. The file http://onto.eva.mpg.de/tm-new/output2 contains the names of the categories instead of their IDs. Full details can be found at http://onto.eva.mpg.de/tm-new.
In the course of our analysis, we generated several distributions. They are
- scoreDist.pdpl is NQ
- postDist.pdpl is MQ
- preDist.pdpl is DQ
These files contain the pair or triple of categories, then the measured value, the arithmetic mean of the distribution, the variance of the distribution, and the p-value of the measured value in the distribution.
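The exact file layout is not documented here. Assuming one record per line with whitespace-separated fields and category IDs that contain no whitespace, a line could be read as in this Python sketch; the record type, the function name, and the sample line are hypothetical.

```python
from typing import NamedTuple, Tuple

class DistRecord(NamedTuple):
    categories: Tuple[str, ...]  # pair (score files) or triple (diff files)
    measured: float
    mean: float
    variance: float
    p_value: float

def parse_dist_line(line):
    """Split one line into its category key and the four trailing numbers."""
    fields = line.split()
    if len(fields) not in (6, 7):
        raise ValueError("expected 2 or 3 categories followed by 4 numbers")
    # The last four fields are numeric; whatever precedes them is the key.
    *cats, measured, mean, variance, p_value = fields
    return DistRecord(tuple(cats), float(measured), float(mean),
                      float(variance), float(p_value))

record = parse_dist_line("EX:1 EX:2 0.5 0.1 0.02 0.99")
print(record)
```

Reading the numbers from the end of the line lets the same function handle both pair-keyed and triple-keyed files.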
The permutations are available as nonNormal.tar.gz (warning: 2.4GB in size).
