FW: Tika content detection and crawled "remote" content


FW: Tika content detection and crawled "remote" content

Allison, Timothy B.
Dominik,
  Thanks to Sebastian and CommonCrawl, this means that we can now have far better precision and recall in selecting only MSOffice docs for our regression tests!!!


-----Original Message-----
From: Sebastian Nagel [mailto:[hidden email]]
Sent: Tuesday, July 4, 2017 6:18 AM
To: [hidden email]
Subject: Tika content detection and crawled "remote" content

Hi,

Recently I plugged Tika's content detection into Common Crawl's crawler (a modified Nutch), with the aim of getting clean and correct MIME types: the HTTP Content-Type header may contain garbage and isn't always correct [1].

For the June 2017 crawl I've prepared a comparison of the content types sent by the server in the HTTP header and those detected by Tika 1.15 [2].  It shows that the content types from Tika are definitely clean
(1,400 different content types vs. more than 6,000 content type "strings" from HTTP headers).
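
A minimal sketch of why raw header "strings" outnumber clean types: the same media type can arrive with different casing, whitespace, or parameters. The function name and the sample values are made up for illustration, and this is not how the comparison in [2] was necessarily produced.

```python
from collections import Counter


def normalize_content_type(header_value):
    """Reduce a raw Content-Type header to a bare lowercase 'type/subtype'.

    Returns 'unk' for missing or malformed values ('unk' mirrors the
    placeholder that appears in the tables of this thread).
    """
    if not header_value:
        return "unk"
    # Drop parameters such as '; charset=utf-8' and surrounding whitespace.
    media_type = header_value.split(";", 1)[0].strip().lower()
    # A well-formed media type has exactly one '/' with text on both sides.
    parts = media_type.split("/")
    if len(parts) != 2 or not all(parts):
        return "unk"
    return media_type


# Made-up header values, only for illustration.
raw_headers = [
    "text/html; charset=UTF-8",
    "TEXT/HTML",
    "text/html;charset=iso-8859-1",
    " application/pdf ",
    "",
    "garbage",
]

distinct_raw = len(set(raw_headers))
normalized = Counter(normalize_content_type(h) for h in raw_headers)
print(distinct_raw, "raw strings ->", len(normalized), "normalized types")
```

Normalization of this kind only cleans up the string; it says nothing about whether the declared type is correct, which is where content detection comes in.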

A look at the "confusions", where Content-Type and Tika differ, shows a mixed picture: some pairs are plausible, e.g., when Tika changes the type to a more precise subtype or detects a MIME type at all (HTTP Content-Type listed as "unk"):

     Count  Tika-1.15                HTTP-Content-Type
1001968023  application/xhtml+xml    text/html
   2298146  application/rss+xml      text/xml
    617435  application/rss+xml      application/xml
    613525  text/html                unk
    361525  application/xhtml+xml    unk
    297707  application/rdf+xml      application/xml


However, there are a few dubious decisions, especially in the group of server-side web scripting languages (ASP, JSP, PHP, ColdFusion, etc.):

  Count  Tika-1.15          HTTP-Content-Type
2047739  text/x-php         text/html
 681629  text/asp           text/html
 193095  text/x-coldfusion  text/html
 172318  text/aspdotnet     text/html
 139033  text/x-jsp         text/html
  38415  text/x-cgi         text/html
  32092  text/x-php         text/xml
  18021  text/x-perl        text/html

Of course, misconfigured servers may deliver script files unprocessed, but in general I wouldn't expect this to happen for millions of pages.  I've checked some of the affected URLs:

- HTML fragment (no declaration of <!DOCTYPE...> or <html> opening tag)
    https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
    http://www.privi.com/product-details.asp?cno=C10910011
    http://mental-ray.de/Root_alt/Default.asp
    http://ekyrs.org/support/index.php?action=profile
    http://cwmorse.eu5.org/lineal/mostrar.php?contador=200

- (overlong) comment block at start of HTML which "masks" the HTML declaration
    http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
    http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
    https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
    https://de.e-stories.org/categories.php?&lan=nl&art=p

- HTML with some scripting fragments ("<?php?>") present:
    http://www.eco-ani-yao.org/shien/

- others are clearly HTML (this looks more like a bug; at least, there is no simple explanation)
    http://www.proedinc.com/customer/content.aspx?redid=9
    http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
    http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
    http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79


Obviously, certain file suffixes (.php, .aspx) should get less weight than the Content-Type sent by the responding server.
Now my question: where's the best place to fix this: in the crawler [3] or in Tika?
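
To make the idea of "less weight" concrete, here is a hypothetical decision rule. It is neither Tika's nor Nutch's actual logic, and the function name, the type sets and the notion of a "generic" result are all assumptions made for this sketch.

```python
# Types that a URL suffix such as .php or .aspx would suggest (taken from the
# table above). For remote content the response is normally the script's
# output, not its source.
SCRIPT_TYPES = {
    "text/x-php", "text/asp", "text/aspdotnet", "text/x-jsp",
    "text/x-coldfusion", "text/x-cgi", "text/x-perl",
}
# Markup types a server plausibly declares for generated pages (assumption).
MARKUP_TYPES = {"text/html", "application/xhtml+xml", "text/xml", "application/xml"}
# Content-based results too generic to settle the question (assumption).
GENERIC_TYPES = {"application/octet-stream", "text/plain"}


def choose_type(http_type, magic_type, suffix_type):
    """Pick a MIME type from three hints; any hint may be None.

    http_type   -- normalized Content-Type sent by the server
    magic_type  -- type detected from the content bytes
    suffix_type -- type guessed from the URL path suffix
    """
    # A specific content-based result wins over everything else.
    if magic_type and magic_type not in GENERIC_TYPES:
        return magic_type
    # A script suffix is overruled when the server declares markup.
    if suffix_type in SCRIPT_TYPES and http_type in MARKUP_TYPES:
        return http_type
    return suffix_type or http_type or magic_type or "application/octet-stream"


print(choose_type("text/html", "text/plain", "text/x-php"))
```

Under this rule an HTML fragment served from a .php URL stays text/html, while a file with a script suffix and no usable Content-Type still falls back to the suffix.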

If anyone is interested in using the detected MIME types or anything else from Common Crawl - I'm happy to help!  The URL index [4] now contains a new field "mime-detected", which makes it easy to search or grep for confusion pairs.
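
A sketch of such a search, assuming each index line has the layout "<url key> <timestamp> <JSON object>" and that the JSON carries the HTTP type in a field named "mime" next to "mime-detected". Only "mime-detected" is named in this message; the rest, including the sample lines, is an assumption.

```python
import json


def iter_confusions(lines, detected, http_type):
    """Yield URLs whose detected and declared MIME types match the given pair."""
    for line in lines:
        try:
            # Assumed layout: "<url key> <timestamp> <JSON object>"
            _key, _timestamp, payload = line.rstrip("\n").split(" ", 2)
            record = json.loads(payload)
        except ValueError:
            # Covers both a line with too few fields and invalid JSON.
            continue
        if record.get("mime-detected") == detected and record.get("mime") == http_type:
            yield record.get("url")


# Made-up index lines using example.com, only for illustration.
sample = [
    'com,example)/index.php 20170623000000 {"url": "http://example.com/index.php", "mime": "text/html", "mime-detected": "text/x-php"}',
    'com,example)/ 20170623000001 {"url": "http://example.com/", "mime": "text/html", "mime-detected": "text/html"}',
    "this line is not an index record",
]

for url in iter_confusions(sample, "text/x-php", "text/html"):
    print(url)
```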


Thanks and best,
Sebastian


[1] https://github.com/commoncrawl/nutch/issues/3
[2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
    https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
[3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
[4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: FW: Tika content detection and crawled "remote" content

Sebastian Nagel
Yes, you'll get a few tens of thousands more (MS)Office documents thanks to Tika:

  Count    Tika-1.15                       HTTP-Content-Type
  12520    application/x-tika-msoffice     application/octet-stream
   6681    application/x-tika-ooxml        application/octet-stream
   3793    application/x-tika-msoffice     text/plain
   3515    application/x-tika-msoffice     application/force-download
   2259    application/x-tika-ooxml        application/msword
   1911    application/x-tika-msoffice     unk
   1314    application/x-tika-msoffice     application/download
   1259    application/x-tika-ooxml        unk
   1068    application/x-tika-ooxml        application/force-download
    711    application/x-tika-msoffice     file/unknown
    ...

The initial intention is, of course, to help to improve the MIME detection in Tika core.
Among the detected office formats there is one conspicuous pair:

    127    application/msword      text/vnd.graphviz

It looks like *.dot is taken as an indicator only for MS Word documents, although the suffix is also used for Graphviz files.
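
One way the two uses of *.dot could be told apart is by looking at the content instead of the suffix. The sketch below is hypothetical and not Tika's detector: it checks for the OLE2 container signature used by binary Word files and for a leading Graphviz keyword. A DOT file that starts with a comment would not be recognized by this simple check.

```python
import re

# Signature at the start of OLE2 compound files (binary .doc and .dot).
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")

# Graphviz files start with an optional 'strict', then 'graph' or 'digraph'.
GRAPHVIZ_START = re.compile(r"(strict\s+)?(di)?graph\b", re.IGNORECASE)


def classify_dot(content):
    """Guess the type of a *.dot file from its first bytes, or return None."""
    if content.startswith(OLE2_MAGIC):
        # An OLE2 container with a .dot suffix suggests a Word template.
        return "application/msword"
    head = content[:1024].decode("utf-8", errors="replace").lstrip()
    if GRAPHVIZ_START.match(head):
        return "text/vnd.graphviz"
    return None


print(classify_dot(b"digraph G { a -> b; }"))
```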

Let me know if I can help to extract any data sets!

Thanks,
Sebastian


On 07/05/2017 01:42 PM, Allison, Timothy B. wrote:

> Dominik,
>   Thanks to Sebastian and CommonCrawl, this means that we can now have far better precision and recall in selecting only MSOffice docs for our regression tests!!!



RE: FW: Tika content detection and crawled "remote" content

Allison, Timothy B.
> The initial intention is, of course, to help to improve the MIME detection in Tika core.
Absolutely agree.

> Yes, you'll get a few tens of thousands more (MS)Office documents thanks to Tika:

           Tika-1.15                       HTTP-Content-Type
  12520    application/x-tika-msoffice     application/octet-stream
   6681    application/x-tika-ooxml        application/octet-stream
   3793    application/x-tika-msoffice     text/plain

Agreed, as I look at the numbers they aren't huge, but the improvement for our test corpus development is fantastic.  Even a few thousand extra docx, for example, will help.  

My guess is that the files detected as x-tika-ooxml and x-tika-msoffice are truncated.  Common Crawl is truncating at 1MB, right?
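
If the guess about truncation holds, the effect on an OOXML file is easy to reproduce: OOXML is a ZIP container, and the ZIP directory sits at the end of the file, so a file cut off early still starts like a ZIP but can no longer be opened. The sketch below builds a stand-in archive in memory; the 1 MiB limit and the member names are assumptions, and whether this is really why the generic types show up is not confirmed in this thread.

```python
import io
import zipfile

LIMIT = 1024 * 1024  # assumed truncation limit of 1 MiB


def build_zip(member_size):
    """Build an in-memory ZIP that loosely resembles an OOXML package."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "x" * member_size)
    return buf.getvalue()


full = build_zip(2 * LIMIT)
truncated = full[:LIMIT]

# is_zipfile() looks for the end-of-central-directory record near the end.
full_ok = zipfile.is_zipfile(io.BytesIO(full))
truncated_ok = zipfile.is_zipfile(io.BytesIO(truncated))
# The local file header signature at the start survives truncation.
starts_like_zip = truncated.startswith(b"PK\x03\x04")

print(full_ok, truncated_ok, starts_like_zip)
```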

Again, WOW!!!  Thank you!!!

Cheers,

          Tim
-----Original Message-----
From: Sebastian Nagel [mailto:[hidden email]]
Sent: Wednesday, July 5, 2017 8:43 AM
To: Allison, Timothy B. <[hidden email]>
Cc: [hidden email]; POI Developers List ([hidden email]) <[hidden email]>
Subject: Re: FW: Tika content detection and crawled "remote" content


