EMF corpus


kiwiwings
Hi Tim / Dominik,

Could you give me a few pointers on how to access a pool of EMF files, e.g. (but not only) within the Common Crawl corpus? My focus is currently on rendering, but as I extend the supported records, I would also like to validate the parsing.
Tim, as the EMF parsing is relatively new, do you perhaps still have a corpus for it?

I have a few old mails about the Common Crawl corpus [2], but I guess some restructuring has taken place since then, and there might be an easier option than downloading the whole index.

Of course, office files that I can parse for embedded EMFs are also OK.
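
For OOXML files (.docx, .pptx, .xlsx), which are ordinary zip packages, embedded metafiles can be pulled out without an Office parser. The following is a minimal Python sketch under that assumption. It is not part of POI or of any tool mentioned in this thread, and the function name `extract_embedded_metafiles` is made up for illustration. The older binary formats (.doc, .ppt, .xls) are not zip-based and are not covered.

```python
import zipfile
from pathlib import PurePosixPath

# Extensions treated as metafiles; .emz is commonly a gzip-compressed EMF.
METAFILE_EXTENSIONS = {".emf", ".emz"}


def extract_embedded_metafiles(ooxml_file):
    """Return {zip entry name: raw bytes} for every .emf/.emz entry.

    `ooxml_file` may be a path or a binary file-like object.
    The whole package is scanned, so no media folder name is assumed.
    """
    found = {}
    with zipfile.ZipFile(ooxml_file) as package:
        for name in package.namelist():
            suffix = PurePosixPath(name).suffix.lower()
            if suffix in METAFILE_EXTENSIONS:
                found[name] = package.read(name)
    return found
```

Called on a Word document, this typically returns entries such as `word/media/image1.emf`. Filtering by extension is only a first pass, so the returned bytes still need a content check before they are fed to a parser.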

I have to admit that I haven't yet tested Dominik's tool [1].

Alternatively, I can use the govdocs1 corpus [3].

Best wishes,
Andi


[1] https://github.com/centic9/CommonCrawlDocumentDownload

[2] http://apache-poi.1045710.n5.nabble.com/Using-CommonCrawl-for-POI-regression-mass-testing-td5721585.html

[3] http://downloads.digitalcorpora.org/corpora/files/govdocs1/by_type/


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: EMF corpus

Dominik Stadler
Hi Andi

It is easy to change CommonCrawlDocumentDownload to fetch other mime-types,
see https://github.com/centic9/CommonCrawlDocumentDownload/tree/download_emf

However, EMF files don't appear among the top-100 MIME types of the crawls,
so they are likely included only very rarely, if at all. I started a
download run, but the first two of the 300 index files do not contain any
matching extension or MIME type.

See https://commoncrawl.github.io/cc-crawl-statistics/plots/mimetypes for
MIME type statistics of the crawls.
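
Matching index entries by extension or MIME type can be sketched as follows. This is a hedged illustration in Python, not the actual logic of CommonCrawlDocumentDownload. It assumes each index line has the layout `urlkey timestamp {json}` with `url` and `mime` keys in the JSON part, and the MIME type list is a guess at the names EMF files are served under.

```python
import json
from urllib.parse import urlsplit

WANTED_EXTENSIONS = (".emf", ".emz")
# Assumption: MIME types under which servers may report EMF files.
WANTED_MIME_TYPES = {"image/emf", "image/x-emf", "application/emf", "application/x-emf"}


def parse_index_line(line):
    """Split 'urlkey timestamp {json}' and return the JSON part as a dict."""
    _urlkey, _timestamp, payload = line.split(" ", 2)
    return json.loads(payload)


def is_emf_candidate(record):
    """True if the URL path ends in .emf/.emz or the MIME type matches."""
    path = urlsplit(record.get("url", "")).path.lower()
    mime = record.get("mime", "").lower()
    return path.endswith(WANTED_EXTENSIONS) or mime in WANTED_MIME_TYPES


def filter_index(lines):
    """Yield the parsed records of all candidate lines, skipping malformed ones."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = parse_index_line(line)
        except ValueError:  # too few fields or invalid JSON
            continue
        if is_emf_candidate(record):
            yield record
```

Checking only the URL path is a design choice of this sketch: it ignores query strings, so a URL whose query parameter merely ends in `.emf` is not treated as a match.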

Dominik.

On Sat, Oct 6, 2018 at 8:14 PM Andreas Beeker <[hidden email]> wrote:

> [...]

Re: EMF corpus

Tim Allison
At some point I extracted all EMFs from our corpus. I’ll see if that data
is still around and/or re-extract it. I’ll probably have time tomorrow or
Wednesday.

On Sun, Oct 7, 2018 at 5:01 PM Dominik Stadler <[hidden email]>
wrote:

> [...]

Re: EMF corpus

Dominik Stadler
Hi Andi,

I have now run the CommonCrawlDocumentDownload tool on crawl 2018-30. Only
144 files matched by extension; I have collected them at
https://www.dropbox.com/s/w3sxnb5l3er3kdq/downloadEMF.zip?dl=0. However,
many of them are actually HTML, mostly redirects.

5hwaterwiki2011.wikispaces.com_file_links_parana_river_wordart.emf:
empty
5hwaterwiki2011.wikispaces.com_file_view_parana_river_wordart.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
apache.org_foundation_press_kit_asf_logo_wide.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
apkpure.com_emf-fitness_com.technogym.emf:
HTML document, UTF-8 Unicode text, with very long lines, with CRLF, LF line terminators
apkpure.com_eye-monster-invasion-free_com.abula.emf:
HTML document, UTF-8 Unicode text, with very long lines, with CRLF, LF line terminators
appraiser77.ru_adds_nechaev_rakova_2.files_image001.emz:
gzip compressed data, max compression, from NTFS filesystem (NT)
atlantabusinessnetwork.org_newsletter_july_image014.emz:
ASCII text, with no line terminators
atlantabusinessnetwork.org_newsletter_july_image015.emz:
ASCII text, with no line terminators
caicedo.wikispaces.com_file_history_imagen1.emf:
empty
caicedo.wikispaces.com_file_links_imagen1.emf:
empty
chisinau.md_public_files_primaria_info_utila_rezerva_cmc.emf:
HTML document, ASCII text
demaret.se_demaret060725.emf:
HTML document, ASCII text
demaret.se_demaret5.emf:
HTML document, ASCII text
downtowntactical.com_brand.emf:
empty
encyclopedia2.thefreedictionary.com_.emf:
HTML document, UTF-8 Unicode text, with very long lines, with CRLF line terminators
extension.sophia-it.com_content_.emf:
HTML document, UTF-8 Unicode text, with very long lines, with CRLF, LF line terminators
faculty.ksu.edu.sa_a-alkathiri_publishingimages_link.emf:
empty
festivales.wikispaces.com_file_links_dia_de_los_madres.emf:
empty
informationforsurvey.com_powerprocessplant_image002.emz:
ASCII text, with no line terminators
iranapps.ir_app_com.superphunlabs.emf:
HTML document, UTF-8 Unicode text, with very long lines, with CRLF, LF line terminators
irunguns.com_brand.emf:
empty
itec-int.co.jp_isop_users_images_giziroku.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
itec-int.co.jp_isop_users_images_isopzirei.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
itec-int.co.jp_isop_users_images_nipou.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
itec-int.co.jp_isop_users_images_syuuhouzirei.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
itec-int.co.jp_isop_users_images_toukou.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
javalibs.com_artifact_org.eclipse.incquery_org.eclipse.incquery.patternlanguage.emf:
HTML document, UTF-8 Unicode text, with very long lines, with CRLF, LF line terminators
javalibs.com_artifact_org.eclipse_org.eclipse.wst.common.emf:
HTML document, UTF-8 Unicode text, with very long lines, with CRLF, LF line terminators
javalibs.com_artifact_org.eclipse.xpand_org.eclipse.xtend.typesystem.emf:
HTML document, UTF-8 Unicode text, with very long lines, with CRLF, LF line terminators
karayan.net_sourin_furusato1.files_image023.emz:
gzip compressed data, max compression, from NTFS filesystem (NT)
kentuckypawninc.com_brand.emf:
empty
llmotivation.wikispaces.com_file_links_picture1.emf:
empty
media.community.dell.com_zh_images_1000.4.2512.itroom.emf:
HTML document, ASCII text, with CRLF line terminators
media.community.dell.com_zh_images_1000.4.2513.itroom.emf:
HTML document, ASCII text, with CRLF line terminators
media.community.dell.com_zh_images_1000.4.8165.step2.emf:
HTML document, ASCII text, with CRLF line terminators
media.community.dell.com_zh_images_1000.4.8166.step3.emf:
HTML document, ASCII text, with CRLF line terminators
media.community.dell.com_zh_images_1000.4.8168.step2.emf:
HTML document, ASCII text, with CRLF line terminators
media.community.dell.com_zh_images_1000.5.2515.itroom.emf:
HTML document, ASCII text, with CRLF line terminators
mineralesygemas.com_index_archivos_image163.emz:
HTML document, ASCII text
mvnrepository.com_artifact_org.eclipse.emf:
HTML document, UTF-8 Unicode text, with very long lines
nienaltowski.net_drzewo_20r.nienaltowski.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
nsw.hia.com.au_images_last_20chance_20for_20tickets.emf:
HTML document, ASCII text, with CRLF line terminators
play.google.com_store_apps_details_id=switches.emf:
HTML document, ASCII text, with very long lines
prstv.ru_logo_prstv_logo_01.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
rightswebquest.wikispaces.com_file_history_picture1.emf:
empty
rightswebquest.wikispaces.com_file_links_picture1.emf:
empty
saf.bio.caltech.edu_ppt_g_p_i_p2i_rotated_images_ppt.emf:
XML 1.0 document, ASCII text
school22.irkutsk.ru_gimn.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
sipoc.wikispaces.com_file_history_mujer_joven-vieja.emf:
empty
sipoc.wikispaces.com_file_links_mujer_joven-vieja.emf:
empty
sipoc.wikispaces.com_file_view_mujer_joven-vieja.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
stillbeziehungen.tk_whitedrop_ammentriskele.emf:
HTML document, ASCII text
stillbeziehungen.tk_whitedrop_erolaclo2.emf:
HTML document, ASCII text
stillbeziehungen.tk_whitedrop_erolaclo.emf:
HTML document, ASCII text
sureshotguns.com_brand.emf:
empty
thcs-qcong.quangdien.thuathienhue.edu.vn_imgs_thu_muc_he_thong__nam_2013_picture1.emf:
HTML document, ASCII text
webgerman.com_presentations_animationresearchreport_files_slide0009_image033.emz:
HTML document, ASCII text, with no line terminators
wikitext.transvivid.ch_iframes_image001.emz:
HTML document, ASCII text, with CRLF, LF line terminators
working-memory-and-education.wikispaces.com_file_view_ld_and_wm_chart.emf:
HTML document, ASCII text, with very long lines
www.aibi.ph_htm_oldharvest_leaven-like-evangelism_files_image007.emz:
HTML document, ASCII text
www.aibi.ph_htm_oldharvest_leaven-like-evangelism_files_image009.emz:
HTML document, ASCII text
www.appbrain.com_app_emf-sensor_com.codebros.emf:
HTML document, UTF-8 Unicode text, with very long lines
www.chisinau.md_public_files_primaria_info_utila_rezerva_cmc.emf:
HTML document, ASCII text
www.drugfuture.com_chemdata_stremf_iminodisuccinic-acid.emf:
HTML document, ISO-8859 text, with CRLF line terminators
www.eclipse.org_projects_project-plan.php_projectid=modeling.emf:
HTML document, ASCII text, with very long lines
www.extremedemocracy.com_information_20_26_20values.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
www.goldsim.com_downloads_library_images_logos_symbol_fullname_blackgold.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
www.goldsim.com_downloads_library_images_logos_symbol_name_blackgold.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
www.goldsim.com_downloads_library_images_logos_symbol_whitegold.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
www.keywild.com_six_mountains_rullers_ruller_12_inches.emf:
HTML document, ASCII text, with no line terminators
www.keywild.com_six_mountains_rullers_ruller_6_inches.emf:
HTML document, ASCII text, with no line terminators
www.otsucci.or.jp_public_keikyo_keikyo31_image005.emz:
HTML document, ASCII text
www.otsucci.or.jp_public_keikyo_keikyo31_image007.emz:
HTML document, ASCII text
www.otsucci.or.jp_public_keikyo_keikyo32_image011.emz:
HTML document, ASCII text
www.otsucci.or.jp_public_keikyo_keikyo32_image019.emz:
HTML document, ASCII text
www.otsucci.or.jp_public_keikyo_keikyo34_image005.emz:
HTML document, ASCII text
www.otsucci.or.jp_public_keikyo_keikyo34_image009.emz:
HTML document, ASCII text
www.otsucci.or.jp_public_keikyo_keikyo35_image003.emz:
HTML document, ASCII text
www.otsucci.or.jp_public_keikyo_keikyo35_image011.emz:
HTML document, ASCII text
www.otsucci.or.jp_public_keikyo_keikyo35_image015.emz:
HTML document, ASCII text
www.rogerblench.info_language_afroasiatic_aaop_files_image003.emz:
gzip compressed data, max compression, from NTFS filesystem (NT)
www.tulaed-union.ru_111_chislenniy_20sostav.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
www.wiki.ciu20.org_file_history_background4.emf:
empty
www.wiki.ciu20.org_file_links_background4.emf:
empty
zakon4.rada.gov.ua_laws_file_imgs_21_p416467n73-2.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zakon4.rada.gov.ua_laws_file_imgs_24_p416467n103-17.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zakon4.rada.gov.ua_laws_file_imgs_24_p416467n110-19.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zakon4.rada.gov.ua_laws_file_imgs_24_p416467n112-20.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zakon4.rada.gov.ua_laws_file_imgs_24_p416467n113-21.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zakon4.rada.gov.ua_laws_file_imgs_24_p416467n122-27.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zakon4.rada.gov.ua_laws_file_imgs_24_p416467n88-8.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zakon4.rada.gov.ua_laws_file_imgs_24_p416467n93-13.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zakon4.rada.gov.ua_laws_file_imgs_24_p416467n95-14.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zavodsvet.ru_production_files_but__d0_92-28-1-500-_d0_93_d0_a0.emf:
HTML document, ISO-8859 text, with CRLF line terminators
zavodsvet.ru_production_files_but__d0_92-28-2-500-_d0_a4_d0_af.emf:
HTML document, ISO-8859 text, with CRLF line terminators
zavodsvet.ru_production_files_but__d0_92-28-7-250-_d0_9e_d0_9a.emf:
HTML document, ISO-8859 text, with CRLF line terminators
zavodsvet.ru_production_files_but__d0_92-28-7-250-_d0_a4_d0_96.emf:
HTML document, ISO-8859 text, with CRLF line terminators
zavodsvet.ru_production_files_but__d0_92-28-7-375-_d0_a0_d0_94.emf:
HTML document, ISO-8859 text, with CRLF line terminators
zavodsvet.ru_production_files_but__d0_92-31-4-500-_d0_a2_d0_a3.emf:
HTML document, ISO-8859 text, with CRLF line terminators
zavodsvet.ru_production_files_but__d0_92-31-4-700-_d0_a2_d0_a3.emf:
HTML document, ISO-8859 text, with CRLF line terminators
zavodsvet.ru_production_files_but__d0_92-5-500-_d0_9b_d0_92.emf:
HTML document, ISO-8859 text, with CRLF line terminators
zavodsvet.ru_production_files_but__d0_92_d0_9a_d0_9f-330-_d0_a1.emf:
HTML document, ISO-8859 text, with CRLF line terminators
zavodsvet.ru_production_files_but__d0_92_d0_bd-28-8-1000-_d0_a1_d0_a0.emf:
HTML document, ISO-8859 text, with CRLF line terminators
zavodsvet.ru_production_files_but__d0_92_d0_bd-28-8-500-_d0_a1_d0_a0.emf:
HTML document, ISO-8859 text, with CRLF line terminators
zavodsvet.ru_production_files_but__d0_9a_d0_9f_d0_94-1-500-_d0_91_d0_9a.emf:
HTML document, ISO-8859 text, with CRLF line terminators
zavodsvet.ru_production_files_but__d0_9a_d0_9f_d0_9c-30-1000-_d0_91_d0_95.emf:
HTML document, ISO-8859 text, with CRLF line terminators
zavodsvet.ru_production_files_but__d0_9a_d0_9f_d0_9c-30-1000-_d0_9f_d0_92.emf:
HTML document, ISO-8859 text, with CRLF line terminators
zavodsvet.ru_production_files_but__d0_9a_d0_9f_d0_9c-30-500-_d0_91_d0_95.emf:
HTML document, ISO-8859 text, with CRLF line terminators
zavodsvet.ru_production_files_but__d0_9a_d0_9f_d0_9c-30-500-_d0_91_d0_ae.emf:
HTML document, ISO-8859 text, with CRLF line terminators
zavodsvet.ru_production_files_but__d0_9a_d0_9f_d0_9c-30-500-_d0_9f_d0_92.emf:
HTML document, ISO-8859 text, with CRLF line terminators
zavodsvet.ru_production_files_but__d0_9a_d0_9f_d0_9c-30-750-_d0_91_d0_ae.emf:
HTML document, ISO-8859 text, with CRLF line terminators
zavodsvet.ru_production_files_but__d0_9a_d0_9f_d0_9c-30-750-_d0_9f_d0_92.emf:
HTML document, ISO-8859 text, with CRLF line terminators
zavodsvet.ru_production_files_but__d0_9f-8-500-_d0_a4_d0_9d.emf:
HTML document, ISO-8859 text, with CRLF line terminators
zavodsvet.ru_production_files_but_kpd-1-500-bk.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zavodsvet.ru_production_files_but_kpm-30-1000-be.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zavodsvet.ru_production_files_but_kpm-30-1000-pv.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zavodsvet.ru_production_files_but_kpm-30-500-be.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zavodsvet.ru_production_files_but_kpm-30-500-byu.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zavodsvet.ru_production_files_but_kpm-30-500-pv.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zavodsvet.ru_production_files_but_kpm-30-750-byu.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zavodsvet.ru_production_files_but_kpm-30-750-pv.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zavodsvet.ru_production_files_but_p-8-500-fn.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zavodsvet.ru_production_files_but_v-28-1-500-gr.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zavodsvet.ru_production_files_but_v-28-2-500-fya.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zavodsvet.ru_production_files_but_v-28-7-250-fzh.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zavodsvet.ru_production_files_but_v-28-7-250-ok.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zavodsvet.ru_production_files_but_v-28-7-375-rd.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zavodsvet.ru_production_files_but_v-31-4-500-tu.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zavodsvet.ru_production_files_but_v-31-4-700-tu.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zavodsvet.ru_production_files_but_v-5-500-lv.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zavodsvet.ru_production_files_but_vkp-330-s.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zavodsvet.ru_production_files_but_vn-28-8-1000-sr.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zavodsvet.ru_production_files_but_vn-28-8-500-sr.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zavodsvet.ru_production_files_but_xxi-_d0_92-28-7-375.emf:
HTML document, ISO-8859 text, with CRLF line terminators
zavodsvet.ru_production_files_but_xxi-_d0_92-28-7-500.emf:
HTML document, ISO-8859 text, with CRLF line terminators
zavodsvet.ru_production_files_but_xxi-v-28-7-375.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
zavodsvet.ru_production_files_but_xxi-v-28-7-500.emf:
Windows Enhanced Metafile (EMF) image data version 0x10000
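
The listing above shows that the extension alone does not identify real EMF data, so a content check is useful before adding files to a test corpus. A minimal sketch in Python, assuming each file fits in memory: an EMF file starts with an EMR_HEADER record (type 1, little-endian) and carries the signature bytes " EMF" at offset 40, and an EMZ file is treated here as a gzip-compressed EMF. The function names and the category labels are made up for illustration.

```python
import gzip
import struct
import zlib

EMR_HEADER = 1             # record type of the first EMF record
EMF_SIGNATURE = b" EMF"    # 0x464D4520 stored little-endian at offset 40
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_emf(data):
    """Check the EMR_HEADER record type and the signature field."""
    if len(data) < 44:
        return False
    (record_type,) = struct.unpack_from("<I", data, 0)
    return record_type == EMR_HEADER and data[40:44] == EMF_SIGNATURE


def classify(data):
    """Return 'empty', 'emf', 'emz', or 'other' for the given file content."""
    if not data:
        return "empty"
    if looks_like_emf(data):
        return "emf"
    if data[:2] == GZIP_MAGIC:
        try:
            inner = gzip.decompress(data)
        except (OSError, EOFError, zlib.error):
            return "other"  # damaged or truncated gzip stream
        return "emz" if looks_like_emf(inner) else "other"
    return "other"
```

HTML redirects and plain-text responses fall into the `other` bucket, so only files classified as `emf` or `emz` would be kept.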


Dominik.

On Mon, Oct 8, 2018 at 1:37 PM Tim Allison <[hidden email]> wrote:

> [...]

Re: EMF corpus

Dave Fisher-5
I’m following this thread because I plan to cover Common Crawl in my talk at COSCON.

Who is in charge of where the crawler is pointed, and how would one ask for additional URLs?

Regards,
Dave


> On Oct 8, 2018, at 10:31 PM, Dominik Stadler <[hidden email]> wrote:
>
> [...]
> zavodsvet.ru_production_files_but__d0_9a_d0_9f_d0_9c-30-1000-_d0_91_d0_95.emf:
> HTML document, ISO-8859 text, with CRLF line terminators
> zavodsvet.ru_production_files_but__d0_9a_d0_9f_d0_9c-30-1000-_d0_9f_d0_92.emf:
> HTML document, ISO-8859 text, with CRLF line terminators
> zavodsvet.ru_production_files_but__d0_9a_d0_9f_d0_9c-30-500-_d0_91_d0_95.emf:
> HTML document, ISO-8859 text, with CRLF line terminators
> zavodsvet.ru_production_files_but__d0_9a_d0_9f_d0_9c-30-500-_d0_91_d0_ae.emf:
> HTML document, ISO-8859 text, with CRLF line terminators
> zavodsvet.ru_production_files_but__d0_9a_d0_9f_d0_9c-30-500-_d0_9f_d0_92.emf:
> HTML document, ISO-8859 text, with CRLF line terminators
> zavodsvet.ru_production_files_but__d0_9a_d0_9f_d0_9c-30-750-_d0_91_d0_ae.emf:
> HTML document, ISO-8859 text, with CRLF line terminators
> zavodsvet.ru_production_files_but__d0_9a_d0_9f_d0_9c-30-750-_d0_9f_d0_92.emf:
> HTML document, ISO-8859 text, with CRLF line terminators
> zavodsvet.ru_production_files_but__d0_9f-8-500-_d0_a4_d0_9d.emf:
> HTML document, ISO-8859 text, with CRLF line terminators
> zavodsvet.ru_production_files_but_kpd-1-500-bk.emf:
> Windows Enhanced Metafile (EMF) image data version 0x10000
> zavodsvet.ru_production_files_but_kpm-30-1000-be.emf:
> Windows Enhanced Metafile (EMF) image data version 0x10000
> zavodsvet.ru_production_files_but_kpm-30-1000-pv.emf:
> Windows Enhanced Metafile (EMF) image data version 0x10000
> zavodsvet.ru_production_files_but_kpm-30-500-be.emf:
> Windows Enhanced Metafile (EMF) image data version 0x10000
> zavodsvet.ru_production_files_but_kpm-30-500-byu.emf:
> Windows Enhanced Metafile (EMF) image data version 0x10000
> zavodsvet.ru_production_files_but_kpm-30-500-pv.emf:
> Windows Enhanced Metafile (EMF) image data version 0x10000
> zavodsvet.ru_production_files_but_kpm-30-750-byu.emf:
> Windows Enhanced Metafile (EMF) image data version 0x10000
> zavodsvet.ru_production_files_but_kpm-30-750-pv.emf:
> Windows Enhanced Metafile (EMF) image data version 0x10000
> zavodsvet.ru_production_files_but_p-8-500-fn.emf:
> Windows Enhanced Metafile (EMF) image data version 0x10000
> zavodsvet.ru_production_files_but_v-28-1-500-gr.emf:
> Windows Enhanced Metafile (EMF) image data version 0x10000
> zavodsvet.ru_production_files_but_v-28-2-500-fya.emf:
> Windows Enhanced Metafile (EMF) image data version 0x10000
> zavodsvet.ru_production_files_but_v-28-7-250-fzh.emf:
> Windows Enhanced Metafile (EMF) image data version 0x10000
> zavodsvet.ru_production_files_but_v-28-7-250-ok.emf:
> Windows Enhanced Metafile (EMF) image data version 0x10000
> zavodsvet.ru_production_files_but_v-28-7-375-rd.emf:
> Windows Enhanced Metafile (EMF) image data version 0x10000
> zavodsvet.ru_production_files_but_v-31-4-500-tu.emf:
> Windows Enhanced Metafile (EMF) image data version 0x10000
> zavodsvet.ru_production_files_but_v-31-4-700-tu.emf:
> Windows Enhanced Metafile (EMF) image data version 0x10000
> zavodsvet.ru_production_files_but_v-5-500-lv.emf:
> Windows Enhanced Metafile (EMF) image data version 0x10000
> zavodsvet.ru_production_files_but_vkp-330-s.emf:
> Windows Enhanced Metafile (EMF) image data version 0x10000
> zavodsvet.ru_production_files_but_vn-28-8-1000-sr.emf:
> Windows Enhanced Metafile (EMF) image data version 0x10000
> zavodsvet.ru_production_files_but_vn-28-8-500-sr.emf:
> Windows Enhanced Metafile (EMF) image data version 0x10000
> zavodsvet.ru_production_files_but_xxi-_d0_92-28-7-375.emf:
> HTML document, ISO-8859 text, with CRLF line terminators
> zavodsvet.ru_production_files_but_xxi-_d0_92-28-7-500.emf:
> HTML document, ISO-8859 text, with CRLF line terminators
> zavodsvet.ru_production_files_but_xxi-v-28-7-375.emf:
> Windows Enhanced Metafile (EMF) image data version 0x10000
> zavodsvet.ru_production_files_but_xxi-v-28-7-500.emf:
> Windows Enhanced Metafile (EMF) image data version 0x10000
>
>
> Dominik.
>
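
A note for anyone repeating this run: as the `file` output in the listing shows, an .emf or .emz extension says little about the actual content, so it can help to filter the download by header before handing it to a parser. The following is only a minimal Python sketch and not part of CommonCrawlDocumentDownload; the function names are made up for illustration. It relies on two format facts: an EMF file starts with a header record of type 1 and carries the signature " EMF" at byte offset 40, and an .emz file is a gzip-compressed EMF.

```python
import gzip
import struct
import zlib
from pathlib import Path

# dSignature field of the EMF header, located at byte offset 40
EMF_SIGNATURE = b" EMF"


def looks_like_emf(data: bytes) -> bool:
    """True if data starts with a type-1 header record and has the ' EMF' signature."""
    if len(data) < 44:
        return False
    (record_type,) = struct.unpack_from("<I", data, 0)
    return record_type == 1 and data[40:44] == EMF_SIGNATURE


def classify(path: Path) -> str:
    """Return 'emf', 'emz', 'empty' or 'other' for one downloaded file (hypothetical helper)."""
    data = path.read_bytes()
    if not data:
        return "empty"
    if looks_like_emf(data):
        return "emf"
    if data[:2] == b"\x1f\x8b":  # gzip magic bytes
        try:
            if looks_like_emf(gzip.decompress(data)):
                return "emz"
        except (OSError, EOFError, zlib.error):
            pass  # truncated or corrupt gzip stream: fall through to 'other'
    return "other"
```

Calling `classify()` on every file in the unpacked download directory and keeping only the 'emf' and 'emz' results should drop the HTML redirects and the empty files. Each file is read fully into memory, which is fine for a few hundred small files but would need a streamed read for a larger corpus.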



Re: EMF corpus

Tim Allison
In reply to this post by Tim Allison
Yes. It turns out I extracted a bunch a while ago. See the 'emfs'
directory in this tar.bz2 file:
http://162.242.228.174/embedded_files/xmfs.tar.bz2

Let me know if you have any questions and/or if I can make that any
more useful for you.

Cheers,

           Tim
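
To pull only the EMFs out of such an archive without unpacking everything, something like the following Python sketch might do. The internal layout of xmfs.tar.bz2 is not described in this thread, so the assumption here is simply that the wanted files sit somewhere below a directory named 'emfs'; the function name and the flattening of the output are illustrative choices.

```python
import tarfile
from pathlib import Path, PurePosixPath


def extract_emfs(archive: str, dest: str, subdir: str = "emfs") -> list[str]:
    """Copy regular files located below a directory named `subdir` into `dest`."""
    out_dir = Path(dest)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with tarfile.open(archive, "r:bz2") as tar:
        for member in tar:
            parts = PurePosixPath(member.name).parts
            # Skip directories, links and anything outside the assumed 'emfs' directory
            if not member.isfile() or subdir not in parts[:-1]:
                continue
            # Flattened output: only the file name is kept, so files with the
            # same name in different subdirectories would overwrite each other
            target = out_dir / parts[-1]
            with tar.extractfile(member) as src:
                target.write_bytes(src.read())
            written.append(target.name)
    return written
```

For example, `extract_emfs("xmfs.tar.bz2", "emf_corpus")` would return the names of the files written to `emf_corpus`. Writing each member by hand instead of calling `extractall()` keeps member paths from the archive out of the target path.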

Re: EMF corpus

Dominik Stadler
In reply to this post by Dave Fisher-5
Hi,

The crawls themselves are defined and run by the organisation behind
http://commoncrawl.org/ (see also http://commoncrawl.org/connect/contact-us/);
we only consume the resulting freely available data.

Dominik.


On Tue, Oct 9, 2018 at 7:35 AM Dave Fisher <[hidden email]> wrote:

> I’m following this as part of my talk at COSCON which I plan to include
> common crawler.
>
> Who is in charge of where the crawler is pointed and how would one ask for
> additional URLs?
>
> Regards,
> Dave

Re: EMF corpus

Tim Allison
In reply to this post by Tim Allison
It turns out that archive is only a subset. It looks like there should be
~200k EMFs. I'll try to dig up the extraction code and re-run it.

Re: EMF corpus

kiwiwings
In reply to this post by Dave Fisher-5
Hi Dominik,

Thank you for the files. I now have more than enough :)

Andi

