{"id":18646,"date":"2017-07-21T08:21:08","date_gmt":"2017-07-21T07:21:08","guid":{"rendered":"http:\/\/www.ch.imperial.ac.uk\/rzepa\/blog\/?p=18646"},"modified":"2017-07-21T10:04:41","modified_gmt":"2017-07-21T09:04:41","slug":"accessing-raw-chemical-data-a-peek-into-the-cif-format","status":"publish","type":"post","link":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=18646","title":{"rendered":"Accessing (raw) chemical data: a peek into the CIF format."},"content":{"rendered":"<div class=\"kcite-section\" kcite-section-id=\"18646\">\n<p>There is much focus at the moment on how to ensure experimental replicability in <em>e.g.<\/em> the molecular sciences. An important aspect of that is having access to <strong>FAIR<\/strong> data; data which is findable, accessible, inter-operable and re-usable. One of the &#8220;gold standards&#8221; in chemistry is the data associated with crystal structures. Here I take an inside peek into the standard file-type for carrying crystal structure data, the CIF file (the Crystallographic Information File).<\/p>\n<p>CIF is a tightly managed format, with utility tools such as <a href=\"http:\/\/checkcif.iucr.org\">checkCIF<\/a> to validate the files and check for errors. It is also what is called a processed data format, created from structural analysis of the raw image data that emerges from a diffractometer, and is therefore what might be described as a lossy format. 
Discussing these aspects with our crystallographer here (thanks Andrew!), I began to realise that there are at least three distinctly different versions of a CIF file, each carrying a different degree of data loss.<\/p>\n<p>As my illustration I take a structure<span id=\"cite_ITEM-18646-0\" name=\"citation\"><a href=\"#ITEM-18646-0\">[1]<\/a><\/span> known by three different identifiers: <a href=\"https:\/\/www.ccdc.cam.ac.uk\/structures\/Search?Ccdcid=AZUJOW\">AZUJOW<\/a>, <a href=\"https:\/\/www.ccdc.cam.ac.uk\/structures\/search?sid=ConQuest&amp;pid=ccdc:1406199\">CCDC 1406199<\/a>\u00a0or DOI:\u00a0<a href=\"https:\/\/dx.doi.org\/10.5517\/ccdc.csd.cc1j6888\">10.5517\/ccdc.csd.cc1j6888<\/a><\/p>\n<ol>\n<li>The CIF originates with the authors and this version is 449KB in size. I have deposited it and the other two at DOI:\u00a0<a href=\"https:\/\/doi.org\/10.14469\/hpc\/2752\">10.14469\/hpc\/2752<\/a>\u00a0 for you to inspect and compare. This file is relatively large since it contains the so-called structure factors or <em>hkl<\/em> information, a snippet of which looks like:\n<pre>_shelx_hkl_file \r\n; \r\n   0   0   1 108882. 1066.19   2 \r\n   0   0   2 320.055 130.609   2 \r\n   0   0   3 18538.0 806.608   2 \r\n   0   0   4 173192. 2808.03   2 \r\n<\/pre>\n<\/li>\n<li>This information is removed using a utility known as <a href=\"http:\/\/shelx.uni-ac.gwdg.de\/SHELX\/cif.php\">shredcif<\/a> to produce a second version, known as the name_x.cif version, which reduces the size to 27KB. 
This retains information about properties such as thermal ellipsoids, bond lengths and angles but loses the <em>hkl<\/em> information.<\/li>\n<li>After the CIF is submitted to CSD, it emerges as\u00a0AZUJOW.cif, which is now just 7KB in size and lacks the bond lengths, angles etc.<\/li>\n<\/ol>\n<p>The original raw image data for this structure is not publicly available, but you can see a set of structures for which it IS available at DOI:<a href=\"https:\/\/doi.org\/10.14469\/hpc\/2297\">10.14469\/hpc\/2297<\/a>\u00a0 (published as <span id=\"cite_ITEM-18646-1\" name=\"citation\"><a href=\"#ITEM-18646-1\">[2]<\/a><\/span>), where the file sizes are typically 200-600 MB (they can get much larger).\u00a0<\/p>\n<p>So a CIF can vary in data content between 7 and 449KB, and the original &#8220;raw&#8221; data can be ten thousand times larger than this! To acquire all the flavours, you have to both access the CSD and contact the original authors (unless of course the latter have deposited their versions in an open data repository, as above).\u00a0<\/p>\n<p>Fortunately, for most chemical applications even the &#8220;lossiest&#8221; of the CIF formats is more than adequate. But for the gold standard in chemical data, you should be aware that you may still be losing access to a lot of original data in the CIF formats and of course to all of the raw diffractometer data. I think it fair to say, however, that there is now momentum to retain as much of this data as possible for posterity.<\/p>\n<h2>References<\/h2>\n    <ol class=\"kcite-bibliography csl-bib-body\"><li id=\"ITEM-18646-0\">A. Toscani, K.A. Jantan, J.B. Hena, J.A. Robson, E.J. Parmenter, V. Fiorini, A.J.P. White, S. Stagni, and J.D.E.T. Wilton-Ely, \"The stepwise generation of multimetallic complexes based on a vinylbipyridine linkage and their photophysical properties\", <i>Dalton Transactions<\/i>, vol. 46, pp. 5558-5570, 2017. 
<a href=\"https:\/\/doi.org\/10.1039\/c6dt03810g\">https:\/\/doi.org\/10.1039\/c6dt03810g<\/a>\n\n<\/li>\n<li id=\"ITEM-18646-1\">J. Almond-Thynne, A.J.P. White, A. Polyzos, H.S. Rzepa, P.J. Parsons, and A.G.M. Barrett, \"Synthesis and Reactions of Benzannulated Spiroaminals: Tetrahydrospirobiquinolines\", <i>ACS Omega<\/i>, vol. 2, pp. 3241-3249, 2017. <a href=\"https:\/\/doi.org\/10.1021\/acsomega.7b00482\">https:\/\/doi.org\/10.1021\/acsomega.7b00482<\/a>\n\n<\/li>\n<\/ol>\n\n<\/div> <!-- kcite-section 18646 -->","protected":false},"excerpt":{"rendered":"<p>There is much focus at the moment on how to ensure experimental replicability in e.g. the molecular sciences. An important aspect of that is having access to FAIR data; data which is findable, accessible, inter-operable and re-usable. One of the &#8220;gold standards&#8221; in chemistry is the data associated with crystal structures. Here I take an [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":5,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[2],"tags":[],"ppma_author":[2661],"class_list":["post-18646","post","type-post","status-publish","format-standard","hentry","category-chemical-it"],"yoast_head":"<!-- This site is optimized with the 
Yoast SEO plugin v27.5 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Accessing (raw) chemical data: a peek into the CIF format. - Henry Rzepa&#039;s Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=18646\" \/>\n<meta property=\"og:locale\" content=\"en_GB\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Accessing (raw) chemical data: a peek into the CIF format. - Henry Rzepa&#039;s Blog\" \/>\n<meta property=\"og:description\" content=\"There is much focus at the moment on how to ensure experimental replicability in e.g. the molecular sciences. An important aspect of that is having access to FAIR data; data which is findable, accessible, inter-operable and re-usable. One of the &#8220;gold standards&#8221; in chemistry is the data associated with crystal structures. Here I take an [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=18646\" \/>\n<meta property=\"og:site_name\" content=\"Henry Rzepa&#039;s Blog\" \/>\n<meta property=\"article:published_time\" content=\"2017-07-21T07:21:08+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2017-07-21T09:04:41+00:00\" \/>\n<meta name=\"author\" content=\"Henry Rzepa\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Henry Rzepa\" \/>\n\t<meta name=\"twitter:label2\" content=\"Estimated reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Accessing (raw) chemical data: a peek into the CIF format. 
- Henry Rzepa&#039;s Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=18646","og_locale":"en_GB","og_type":"article","og_title":"Accessing (raw) chemical data: a peek into the CIF format. - Henry Rzepa&#039;s Blog","og_description":"There is much focus at the moment on how to ensure experimental replicability in e.g. the molecular sciences. An important aspect of that is having access to FAIR data; data which is findable, accessible, inter-operable and re-usable. One of the &#8220;gold standards&#8221; in chemistry is the data associated with crystal structures. Here I take an [&hellip;]","og_url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=18646","og_site_name":"Henry Rzepa&#039;s Blog","article_published_time":"2017-07-21T07:21:08+00:00","article_modified_time":"2017-07-21T09:04:41+00:00","author":"Henry Rzepa","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Henry Rzepa","Estimated reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=18646#article","isPartOf":{"@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=18646"},"author":{"name":"Henry Rzepa","@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/#\/schema\/person\/2b40f7b9c872a4dc1547e040a11b6281"},"headline":"Accessing (raw) chemical data: a peek into the CIF format.","datePublished":"2017-07-21T07:21:08+00:00","dateModified":"2017-07-21T09:04:41+00:00","mainEntityOfPage":{"@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=18646"},"wordCount":516,"commentCount":7,"articleSection":["Chemical 
IT"],"inLanguage":"en-GB","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=18646#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=18646","url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=18646","name":"Accessing (raw) chemical data: a peek into the CIF format. - Henry Rzepa&#039;s Blog","isPartOf":{"@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/#website"},"datePublished":"2017-07-21T07:21:08+00:00","dateModified":"2017-07-21T09:04:41+00:00","author":{"@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/#\/schema\/person\/2b40f7b9c872a4dc1547e040a11b6281"},"breadcrumb":{"@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=18646#breadcrumb"},"inLanguage":"en-GB","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=18646"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=18646#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog"},{"@type":"ListItem","position":2,"name":"Accessing (raw) chemical data: a peek into the CIF format."}]},{"@type":"WebSite","@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/#website","url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/","name":"Henry Rzepa&#039;s Blog","description":"Chemistry with a twist","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-GB"},{"@type":"Person","@id":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/#\/schema\/person\/2b40f7b9c872a4dc1547e040a11b6281","name":"Henry 
Rzepa","image":{"@type":"ImageObject","inLanguage":"en-GB","@id":"https:\/\/secure.gravatar.com\/avatar\/897b6740f7f599bca7942cdf7d7914af5988937ae0e3869ab09aebb87f26a731?s=96&d=blank&r=g370be3a7397865e4fd161aefeb0a5a85","url":"https:\/\/secure.gravatar.com\/avatar\/897b6740f7f599bca7942cdf7d7914af5988937ae0e3869ab09aebb87f26a731?s=96&d=blank&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/897b6740f7f599bca7942cdf7d7914af5988937ae0e3869ab09aebb87f26a731?s=96&d=blank&r=g","caption":"Henry Rzepa"},"description":"Henry Rzepa is Emeritus Professor of Computational Chemistry at Imperial College London.","sameAs":["https:\/\/orcid.org\/0000-0002-8635-8390"],"url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?author=1"}]}},"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/pDef7-4QK","jetpack-related-posts":[{"id":24723,"url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=24723","url_meta":{"origin":18646,"position":0},"title":"Raw data: the evolution of FAIR data and crystallography.","author":"Henry Rzepa","date":"March 1, 2022","format":false,"excerpt":"Scientific data in chemistry has come a long way in the last few decades. Originally entangled into scientific articles in the form of tables of numbers or diagrams, it was (partially) disentangled into supporting information when journals became electronic in the late 1990s. The next phase was the introduction of\u2026","rel":"","context":"In &quot;Chemical IT&quot;","block_context":{"text":"Chemical IT","link":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?cat=2"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":25761,"url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=25761","url_meta":{"origin":18646,"position":1},"title":"Molecules of the year -2022.  
Data issues!","author":"Henry Rzepa","date":"December 13, 2022","format":false,"excerpt":"The list of molecules of the year is out now at C&E News (but you have to have an account to view the list, unlike previous years).\u2663 These three caught my eye: Electron in a cube: Synthesis and characterization of perfluorocubane as an electron acceptor,. I have already written about\u2026","rel":"","context":"Similar post","block_context":{"text":"Similar post","link":""},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":16997,"url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=16997","url_meta":{"origin":18646,"position":2},"title":"An inorganic double helix: SnIP.","author":"Henry Rzepa","date":"October 16, 2016","format":false,"excerpt":"After sixty years of searching, the first non-templated double helical carbon-free inorganic molecular structure has been reported. That is so neat that I thought to load the 3D coordinates here for you to interact with\u00a0and then to explore the prospect of using these coordinates to add some\u00a0value with\u00a0e.g. some chiroptical\u2026","rel":"","context":"In &quot;Chemical IT&quot;","block_context":{"text":"Chemical IT","link":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?cat=2"},"img":{"alt_text":"snip","src":"https:\/\/i0.wp.com\/www.ch.ic.ac.uk\/rzepa\/blog\/wp-content\/uploads\/2016\/10\/SnIP.jpg?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":15907,"url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=15907","url_meta":{"origin":18646,"position":3},"title":"Global initiatives in research data management and discovery: searching metadata.","author":"Henry Rzepa","date":"March 7, 2016","format":false,"excerpt":"The upcoming ACS national meeting in San Diego has a CINF\u00a0(chemical information division) session entitled \"Global initiatives in research data management and discovery\". 
I have highlighted here just one slide from my contribution to this session, which addresses the discovery aspect of the session. Data, if you think about it,\u2026","rel":"","context":"In &quot;Chemical IT&quot;","block_context":{"text":"Chemical IT","link":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?cat=2"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":18257,"url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=18257","url_meta":{"origin":18646,"position":4},"title":"The challenges in curating research data: one case study.","author":"Henry Rzepa","date":"April 28, 2017","format":false,"excerpt":"Research data (and its management) is rapidly emerging as a focal point for the development of research dissemination practices. An important aspect of ensuring that such data remains fit for purpose is identifying what curation activities need to be associated with it. Here I revisit one particular case study associated\u2026","rel":"","context":"In &quot;Chemical IT&quot;","block_context":{"text":"Chemical IT","link":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?cat=2"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/www.ch.ic.ac.uk\/rzepa\/blog\/wp-content\/uploads\/2017\/04\/077-1.jpg?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":24951,"url":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?p=24951","url_meta":{"origin":18646,"position":5},"title":"Raw data and the evolution of crystallographic FAIR data. Journals, processed and raw structure data.","author":"Henry Rzepa","date":"March 28, 2022","format":false,"excerpt":"In my previous post on the topic,\u00a0I introduced the concept that data can come in several forms, most commonly as \"raw\" or primary data and as a \"processed\" version of this data that has added value. 
In crystallography, the chemist is interested in this processed version, carried by a CIF\u2026","rel":"","context":"In &quot;Chemical IT&quot;","block_context":{"text":"Chemical IT","link":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/?cat=2"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"jetpack_likes_enabled":false,"authors":[{"term_id":2661,"user_id":1,"is_guest":0,"slug":"admin","display_name":"Henry Rzepa","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/897b6740f7f599bca7942cdf7d7914af5988937ae0e3869ab09aebb87f26a731?s=96&d=blank&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=\/wp\/v2\/posts\/18646","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=18646"}],"version-history":[{"count":5,"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=\/wp\/v2\/posts\/18646\/revisions"}],"predecessor-version":[{"id":18651,"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=\/wp\/v2\/posts\/18646\/revisions\/18651"}],"wp:attachment":[{"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=18646"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=18646"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=18646"},{"taxonomy":"author","embeddable":true,"href":"http
s:\/\/www.ch.ic.ac.uk\/rzepa\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fppma_author&post=18646"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}