Description
This track is powered by Bing! and Microsoft Research. UCSC collaborators at
Microsoft Research (Bob Davidson, David Heckerman) implemented a DNA sequence
detector and processed thirty days of web crawler updates, covering
roughly 40 billion webpages. The detected sequences were then mapped to the genome with BLAT.
Display Convention and Configuration
The track shows where sequences found on web pages map to the genome.
Each feature is labelled with the URL of the web page. If the web page includes
invisible metadata, the first author and year of publication
are shown instead of the URL. All
matches from one web page are grouped ("chained") together.
Web page titles are shown when you move the mouse cursor over the features.
Thicker parts of the features (drawn like exons) represent matching sequences.
Thin lines connect them to other matches from the same web page within 30 kbp.
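The grouping described above can be illustrated with a minimal Python sketch. It is not the track's actual code: the function name `chain_matches`, the tuple layout, and the rule that the gap is measured from the end of the current chain to the start of the next match are all assumptions made for illustration. Only the 30 kbp distance comes from this description.

```python
from collections import defaultdict

MAX_GAP = 30_000  # 30 kbp, the distance quoted in the track description


def chain_matches(matches, max_gap=MAX_GAP):
    """Group matches from one web page into chains (hypothetical helper).

    `matches` is an iterable of (url, chrom, start, end) tuples.
    Returns a list of (url, chrom, blocks) tuples, where `blocks` is a
    start-sorted list of (start, end) pairs.
    """
    # Matches can only be chained if they share a page and a chromosome.
    by_page = defaultdict(list)
    for url, chrom, start, end in matches:
        by_page[(url, chrom)].append((start, end))

    chains = []
    for (url, chrom), blocks in by_page.items():
        blocks.sort()
        current = [blocks[0]]
        chain_end = blocks[0][1]
        for start, end in blocks[1:]:
            # Assumed rule: extend the chain while the gap is at most max_gap.
            if start - chain_end <= max_gap:
                current.append((start, end))
                chain_end = max(chain_end, end)
            else:
                chains.append((url, chrom, current))
                current = [(start, end)]
                chain_end = end
        chains.append((url, chrom, current))
    return chains


example = [
    ("http://example.org/a", "chr1", 100, 150),
    ("http://example.org/a", "chr1", 20_000, 20_050),
    ("http://example.org/a", "chr1", 100_000, 100_040),
    ("http://example.org/b", "chr1", 120, 160),
]
for url, chrom, blocks in chain_matches(example):
    print(url, chrom, blocks)
```

In this example the first two matches from page `a` fall within 30 kbp of each other and form one chain, the third match from the same page is too far away and starts a new chain, and the match from page `b` is never merged with those from page `a`.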
Methods
Files of all types (PDFs and various Microsoft Office formats) were converted to
text. The text was then scanned for groups of words that look like DNA/RNA
sequences. These were mapped with BLAT to the human genome using the same
software as the Publication track.
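As a rough illustration of what "groups of words that look like DNA/RNA sequences" could mean, the following Python sketch joins consecutive nucleotide-only words and keeps the long runs. It is not the detector built at Microsoft Research, whose rules are not described here. The function name `find_sequences`, the allowed alphabet, the punctuation stripping, and the 20-letter minimum are all assumptions.

```python
import re

# Assumption: a word is nucleotide-like if it uses only A, C, G, T, U or N.
NUCLEOTIDE_WORD = re.compile(r"[ACGTUN]+", re.IGNORECASE)
MIN_LENGTH = 20  # hypothetical cutoff to discard short English words
STRIP_CHARS = "0123456789.,;:()[]'\"-"  # digits cover 5'/3' labels and position numbers


def find_sequences(text, min_length=MIN_LENGTH):
    """Return runs of adjacent nucleotide-like words, joined and upper-cased."""
    sequences = []
    run = []

    def flush():
        # Keep the current run only if the joined sequence is long enough.
        sequence = "".join(run)
        if len(sequence) >= min_length:
            sequences.append(sequence)
        run.clear()

    for word in text.split():
        token = word.strip(STRIP_CHARS)
        if not token:
            continue  # pure numbers or punctuation do not break a run
        if NUCLEOTIDE_WORD.fullmatch(token):
            run.append(token.upper())
        else:
            flush()
    flush()
    return sequences


text = "The forward primer was 5'-ACGTACGTAC GTACGTACGT-3' and worked well."
print(find_sequences(text))
```

The length cutoff is what keeps ordinary words such as "a" or "cat" from being reported. A real detector would need stronger filters than this, and the sequences it finds would still have to be aligned with BLAT as described above.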
Credits
DNA sequence detection by Bob Davidson at Microsoft Research.
HTML parsing and sequence mapping by Maximilian Haeussler at UCSC.
References
Aerts S, Haeussler M, van Vooren S, Griffith OL, Hulpiau P, Jones SJ, Montgomery SB, Bergman CM, Open Regulatory Annotation Consortium.
Text-mining assisted regulatory annotation.
Genome Biol. 2008;9(2):R31.
PMID: 18271954; PMC: PMC2374703
Haeussler M, Gerner M, Bergman CM.
Annotating genes and genomes with DNA sequences extracted from biomedical articles.
Bioinformatics. 2011 Apr 1;27(7):980-6.
PMID: 21325301; PMC: PMC3065681
Van Noorden R.
Trouble at the text mine.
Nature. 2012 Mar 7;483(7388):134-5.