Relationship Extraction from biomedical text corpora

1.  Bird's Eye View

2.  Text processing

2.1  Input datasets

We use all PMC articles until (must check) for our analysis. The dataset is available at http://onto.eva.mpg.de/tm-results/pmc-split.tar.gz

The dataset is completely stripped of all XML tags using Attach:strip.sh
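The contents of strip.sh are not reproduced on this page. As a rough illustration of what tag stripping amounts to, the following Python sketch keeps only the text nodes of a well-formed XML document. The function name `strip_tags` is hypothetical, and real PMC files may need extra handling (DTD entities, for example) that this sketch does not attempt.

```python
import xml.etree.ElementTree as ET

def strip_tags(xml_text):
    """Return the concatenated text nodes of a well-formed XML string."""
    root = ET.fromstring(xml_text)
    # itertext() walks the tree in document order and yields only text content.
    return "".join(root.itertext())

# Hypothetical input, not taken from the PMC corpus.
example_xml = "<p>Some <i>text</i>.</p>"
plain = strip_tags(example_xml)
```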

We used the following table of ontology terms: Attach:em.txt.gz

3.  Enrichment

4.  Tests

The tests used in the course of this project depend on several distributions. Let

  • $Cat(O)=\{C \mid C \text{ is a category in the ontology } O\}$,
  • $N$ be the number of permutations,
  • $score^{n}(C,D)$ be the score between $C$ and $D$ in the $n^{th}$ permutation,
  • $NQ(x,C,D)=P(score^n(C,D)\leq x)$, $1\leq n\leq N$, be the cumulative distribution function (CDF) of $score(C, D)$,
  • $nq$ be the quantile of the score distribution of the cumulated score functions,
  • $DQ^{j}(x,C,D)=P(score^n(C,D)-score^n(C_j,D)\leq x)$, $1\leq n \leq N$, be the CDF of the difference between the category $C$ and its $j^{th}$ daughter category,
  • $d^j$ be the score difference $score(C,D)-score(C_j, D)$,
  • $MQ^{j}(x,C,D)=P(score^n(C_j,D)-score^n(C,D)\leq x)$, $1\leq n \leq N$, be the CDF of the score difference between the category $C$ and its $j^{th}$ mother category,
  • $VQ_{NQ}(x)=P(Var(NQ(\cdot,X_1, X_2))\leq x)$, taken over all $X_1 \in Cat(O_1)$ and $X_2 \in Cat(O_2)$, be the CDF of the variances $Var$ of the distributions $NQ(\cdot,X_1,X_2)$, with $VQ_{DQ}$ and $VQ_{MQ}$ defined in the same way for the distributions $DQ^{j}(\cdot,X_1,X_2)$ and $MQ^{j}(\cdot,X_1,X_2)$ respectively.

Based on these distributions, we designed SixTests. These tests are one-sided; the final tests combine two test results (see paper).
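Since the distributions are obtained from $N$ permutations, a natural way to evaluate them is as empirical CDFs. The following is a sketch under that assumption; the hat notation and the indicator $\mathbf{1}[\cdot]$ are introduced here for illustration and are not taken from the paper:

$$\widehat{NQ}(x,C,D)=\frac{1}{N}\sum_{n=1}^{N}\mathbf{1}\left[score^{n}(C,D)\leq x\right]$$

$$\widehat{DQ}^{j}(x,C,D)=\frac{1}{N}\sum_{n=1}^{N}\mathbf{1}\left[score^{n}(C,D)-score^{n}(C_j,D)\leq x\right]$$

$$\widehat{MQ}^{j}(x,C,D)=\frac{1}{N}\sum_{n=1}^{N}\mathbf{1}\left[score^{n}(C_j,D)-score^{n}(C,D)\leq x\right]$$

Evaluating $\widehat{NQ}$ at the measured $score(C,D)$ gives the fraction of permutations whose score does not exceed the observed one.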

5.  Implementation

Here we describe the various scripts that we use for the analysis and how they fit together.

5.1  Requirements

We require JRE 1.5 and Groovy 1.5.

The following Java libraries are required:

5.2  CVS Repository

The CVS repository can be found at Savannah. For CVS access, use

  • cvs -z3 -d:pserver:anonymous@cvs.savannah.nongnu.org:/sources/gfo co cm
  • or this link for WebCVS access.

5.3  OboReader

The OboReader is a parser for a part of the OBO Flatfile Format.
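The OboReader itself is not shown here. To illustrate what parsing a part of the OBO flat file format involves, the following Python sketch reads only `[Term]` stanzas and their `id`, `name`, and `is_a` tags. The function name `parse_obo_terms` and the sample identifiers are hypothetical, and nothing is implied about which tags the OboReader actually supports.

```python
def parse_obo_terms(text):
    """Parse [Term] stanzas into {id: {"name": str or None, "is_a": [ids]}}."""
    terms = {}
    current = None  # the [Term] stanza being filled, or None outside one
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("["):
            # A new stanza starts; stanzas other than [Term] are skipped.
            current = {"name": None, "is_a": []} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue  # header lines and lines in skipped stanzas
        tag, value = line.split(":", 1)
        value = value.strip()
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            # Drop the trailing "! human-readable name" comment.
            current["is_a"].append(value.split("!", 1)[0].strip())
    return terms

# Hypothetical sample, not taken from a real ontology.
SAMPLE = """format-version: 1.2

[Term]
id: EX:0001
name: root category

[Term]
id: EX:0002
name: child category
is_a: EX:0001 ! root category

[Typedef]
id: part_of
name: part of
"""
terms = parse_obo_terms(SAMPLE)
```

The resulting `is_a` lists give the mother categories of each term, which is the graph structure the later steps rely on.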

5.4  Collocation extraction

The library uses the output of the OboReader and a text corpus (contained in a single folder; subfolders are not processed) to compute the frequencies of categories in the text and the number of times they collocate. Attach:Collocation.zip
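The counting step can be sketched in Python as follows. This is not the code in Collocation.zip: it assumes that each category is matched by a single label, that matching is a case-insensitive whole-word search, and that two categories collocate when they occur in the same document. The names `read_corpus` and `count_collocations` are hypothetical.

```python
import itertools
import re
from collections import Counter
from pathlib import Path

def read_corpus(folder):
    """Yield the text of each file directly inside `folder` (subfolders are skipped)."""
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            yield path.read_text(encoding="utf-8", errors="replace")

def count_collocations(documents, labels):
    """labels maps category ID -> label. Returns (occurrences, doc_freq, co_doc_freq)."""
    patterns = {
        cid: re.compile(r"\b" + re.escape(label) + r"\b", re.IGNORECASE)
        for cid, label in labels.items()
    }
    occurrences = Counter()  # total number of matches per category
    doc_freq = Counter()     # number of documents mentioning the category
    co_doc_freq = Counter()  # number of documents mentioning both categories
    for text in documents:
        present = []
        for cid, pattern in patterns.items():
            n = len(pattern.findall(text))
            if n:
                occurrences[cid] += n
                doc_freq[cid] += 1
                present.append(cid)
        for pair in itertools.combinations(sorted(present), 2):
            co_doc_freq[pair] += 1
    return occurrences, doc_freq, co_doc_freq

# Hypothetical in-memory corpus; read_corpus("some/folder") would supply files instead.
docs = ["The cell nucleus is inside the cell.", "A nucleus without context.", "Nothing relevant."]
occ, df, co = count_collocations(docs, {"EX:1": "cell", "EX:2": "nucleus"})
```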

5.5  Generating random distributions

The attached Groovy scripts take the text-mining results we get from the collocations, randomly distribute them throughout the graphs, and recalculate the scores. They output three values: the scores (normalized or non-normalized), the differences in scores to the predecessors, and the differences to the successors in the graph.
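A single permutation round can be sketched in Python as follows. This is an illustration only, not the Groovy scripts: the score function is not reproduced, so the sketch simply shuffles already-computed values among their keys, and it assumes that scores are keyed by a pair (C, D) and that `mothers` maps each category to its mother categories. All function names are hypothetical.

```python
import random

def permute_over_categories(values, rng):
    """Return a dict with the same keys as `values` but the values shuffled among them."""
    keys = list(values)
    shuffled = list(values.values())
    rng.shuffle(shuffled)
    return dict(zip(keys, shuffled))

def edge_differences(scores, mothers):
    """Compute score(mother, D) - score(C, D) along every mother edge.

    Returns two dicts holding the same differences under two key orders:
    (C, mother, D) for the mother view and (mother, C, D) for the daughter view.
    """
    mother_diffs, daughter_diffs = {}, {}
    for (c, d), s in scores.items():
        for m in mothers.get(c, ()):
            if (m, d) not in scores:
                continue
            diff = scores[(m, d)] - s
            mother_diffs[(c, m, d)] = diff
            daughter_diffs[(m, c, d)] = diff
    return mother_diffs, daughter_diffs

# Hypothetical toy data: A is the mother of B and C; X is a category of the other ontology.
scores = {("A", "X"): 5.0, ("B", "X"): 2.0, ("C", "X"): 1.0}
mothers = {"B": ["A"], "C": ["A"]}

rng = random.Random(0)  # fixed seed so the run is repeatable
permuted = permute_over_categories(scores, rng)
observed_mother, observed_daughter = edge_differences(scores, mothers)
permuted_mother, permuted_daughter = edge_differences(permuted, mothers)
```

Repeating the permutation step $N$ times and collecting the three outputs per round yields the samples from which the distributions in section 4 can be estimated.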

5.6  Generate distributions from randomization

These scripts take four arguments: the results obtained from text mining, a directory with random distribution files, and the names of the output files for pdpl and for ndpl. They calculate the quantile, the variance, and the arithmetic mean of the measured values with respect to the random distributions. The first script is intended for score files (two nodes as key), the second for diff files (three nodes as key).
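For a single key, the summary statistics can be sketched in Python as follows. The sketch assumes that the quantile is the fraction of permuted values not exceeding the measured value and that the variance is the population variance; the scripts may use different conventions. The function name `summarize` is hypothetical.

```python
import statistics

def summarize(measured, permuted):
    """Summarize one measured value against the values from the permutations."""
    if not permuted:
        raise ValueError("at least one permuted value is required")
    # Fraction of permutations whose value does not exceed the measured one.
    quantile = sum(1 for v in permuted if v <= measured) / len(permuted)
    return {
        "quantile": quantile,
        "mean": statistics.fmean(permuted),
        "variance": statistics.pvariance(permuted),  # population variance (assumption)
    }

# Hypothetical values for one key.
summary = summarize(3.0, [1.0, 2.0, 3.0, 4.0])
```

The same function applies to score files and diff files alike; only the key (pair or triple of categories) differs.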

5.7  Calculate final test results

The tests are calculated using the script

6.  Results

The best of our results can also be found in the BOWiki.

The collocation analysis of the PMC corpus results in two files:

The first value is the total number of occurrences, the second the number of documents of (co-)occurrence.

The final results of the six tests can be found at http://onto.eva.mpg.de/tm-new/output. The file http://onto.eva.mpg.de/tm-new/output2 contains the names of the categories instead of their IDs. Full details can be found at http://onto.eva.mpg.de/tm-new.

In the course of our analysis, we generated several distributions. They are

These files contain the pair or triple of categories, then the measured value, the arithmetic mean of the distribution, the variance of the distribution, and the p-value of the measured value in the distribution.

The permutations are available as nonNormal.tar.gz (warning: 2.4GB in size).