Compute Ontario Research Day 2015

May 21
Conestoga College Institute of Technology and Advanced Learning

Contributing Speakers

Hazem Ahmed

Postdoctoral Fellow, Structural Genomics Consortium Toronto, University of Toronto

Systematic Prediction of Novel Lysine Methyltransferase Substrates – An Integrative Approach

In the nucleus of each human cell, over 3 billion base pairs of genomic DNA are packaged and highly compacted by histones in a dynamic polymer called chromatin. Histones are DNA-packaging proteins that are subject to a diverse array of Post-Translational Modifications (PTMs), which control DNA accessibility and gene expression. Such non-sequence-changing (epigenetic) modifications constitute a 'histone code' that affects cell fate without directly affecting the 'genetic code', extending the information potential of the human genome beyond the already-massive information content of DNA sequences. Histone methylation is one of the most important, yet perhaps least understood, epigenetic modifications; it impacts many biological processes and is implicated in a wide range of human diseases, particularly cancer. In addition to histone methylation, there are over 3000 non-histone substrates that are methylated by over 60 protein methyltransferases (enzymes). However, fewer than 200 enzyme/substrate relationships have been reported in the literature to date, leaving many relationships still to be discovered before a more complete picture of the human methylome can be drawn. In this study, we propose a novel bioinformatics approach to predict missing enzyme/substrate relationships. Biochemical data, structural data and biological networks are integrated to identify potential enzymes responsible for the methylation of numerous protein substrates, which could eventually inform functional studies for disease diagnosis and treatment. Our integrative, knowledge discovery-based approach makes use of large-scale experimental data from several sources, including protein crystal structures, protein physical interactions, protein functional associations (co-expression and biological pathways), existing substrates from the literature, known methylation sites from mass spectrometry analysis, multiple sequence alignment and cellular localization, in addition to a number of machine learning algorithms, such as structural energy minimization, Fisher-Jenks natural breaks classification, Markov graph clustering and text mining. We applied the proposed proteomic approach to identify novel substrates for G9a, an important lysine methyltransferase (KMT) known to be overexpressed in various human cancers including leukemia, prostate carcinoma and lung cancer. Our computational results identified novel G9a candidate substrates and selected a number of peptide segments that potentially include lysine residues methylated by G9a. Experimental validation confirmed that many of the selected peptides indeed bind to G9a as well as the three known G9a substrates that we used as positive controls. In addition to peptide array support, we also found structural, functional, physical interaction and literature support for our top hits. The proposed bioinformatics approach offers a promising route for the large-scale, rapid and inexpensive identification of G9a (and, more broadly, other KMT) substrates, which could help us understand how lysine methylation participates in wider signaling processes in health and disease.
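
One of the machine learning components mentioned above is Markov graph clustering. The following minimal sketch runs the Markov Cluster (MCL) procedure on a toy protein-association graph; the adjacency matrix and parameters are invented for illustration, and the actual study applies clustering to large functional-association networks alongside the other data sources listed above.

```python
import numpy as np

# Minimal, illustrative Markov Cluster (MCL) sketch on a toy association graph.
def mcl(adjacency, expansion=2, inflation=2.0, iterations=50):
    m = adjacency + np.eye(len(adjacency))       # add self-loops
    m = m / m.sum(axis=0)                        # column-normalize to a stochastic matrix
    for _ in range(iterations):
        m = np.linalg.matrix_power(m, expansion)  # expansion: spread random-walk flow
        m = m ** inflation                        # inflation: strengthen strong flows
        m = m / m.sum(axis=0)                     # re-normalize columns
    return m

# Hypothetical 5-protein association graph (symmetric adjacency matrix).
adj = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

converged = mcl(adj)
# Rows that retain non-zero mass after convergence act as cluster "attractors".
print(np.round(converged, 2))
```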

Charbel Azzi

master's degree student, University of Waterloo

Parallel Processing in Image-based Localization

The image-based localization problem has seen major improvements recently, with several powerful approaches presented as solutions. Nevertheless, the robustness and accuracy of these solutions remain unclear. This paper presents a comparative analysis of the main approaches, namely Brute Force Matching, Approximate Nearest Neighbor Matching, Embedded Ferns Classification, the ACG Localizer (using a visual vocabulary) and Keyframe Matching. The objective is first to uncover the specifics of each of these techniques and thereby understand the advantages and disadvantages of each. We present a comparison methodology to guarantee a fair analysis. Testing is performed on familiar datasets, where localization is determined with respect to a 3D cloud map for each set, obtained using a modern Structure-from-Motion approach. The results show that the Visual Words approach is the best localization approach, though it suffers from processing problems, especially when clustering the huge amounts of data involved. The Embedded Ferns approach also has major computational-time problems: training and testing on city-scale datasets make it hard to run in real time without additional processing solutions. We applied preliminary parallel processing techniques to both of these approaches, and the promising improvements demonstrate the need for such techniques in image-based localization.
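
As an illustration of the simplest of these pipelines, the sketch below performs brute-force descriptor matching with OpenCV; the file names and feature choice (ORB) are hypothetical stand-ins, not the exact features, datasets or matching code used in the paper.

```python
import cv2

# Brute-force matching baseline (hypothetical image files) using ORB descriptors.
img_query = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)
img_db = cv2.imread("database_view.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=2000)
kp_q, desc_q = orb.detectAndCompute(img_query, None)
kp_d, desc_d = orb.detectAndCompute(img_db, None)

# Brute force: compare every query descriptor against every database descriptor.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(desc_q, desc_d), key=lambda m: m.distance)

# 2D-3D correspondences derived from such matches would then feed a pose solver
# (e.g. cv2.solvePnPRansac) to localize the camera against the 3D cloud map.
print(f"{len(matches)} tentative matches")
```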

Adam Domurad

master's degree student, University of Waterloo

Monte Carlo Simulation of Massive Social Networks

In the last decade, popular social networks such as Twitter and Facebook have grown to include large portions of the world population. Due to their centralized nature, these networks spread information in ways unlike previously studied phenomena. For example, unlike services such as e-mail, they enable a significant amount of global content discovery. #k@ (http://hashkat.org) is a simulation engine written in C++ for studying the growth of abstracted social networks over time. The abstracted social network consists of entities of certain classes (characterizing, for example, popularity) and their asymmetric relationships. The simulation progresses through a kinetic Monte Carlo process in which events, such as relationship changes, are chosen randomly according to configurable weights. These configurable weights correspond to a network model. By experimenting on model social networks, more can be learned about the emergent behaviour that occurs in such networks.
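
A minimal sketch of a kinetic Monte Carlo step of this kind is shown below; the event types and rates are invented for illustration and do not correspond to #k@'s actual event set or configuration format.

```python
import random

# Illustrative event rates (hypothetical); in #k@ these come from the network model.
events = {
    "add_entity": 1.0,   # a new user joins the network
    "follow": 5.0,       # one entity follows another
    "unfollow": 0.5,
    "tweet": 8.0,
}

def kmc_step(t):
    """One kinetic Monte Carlo step: pick an event with probability proportional
    to its rate and advance simulated time by an exponential increment."""
    total_rate = sum(events.values())
    r = random.uniform(0.0, total_rate)
    cumulative = 0.0
    chosen = None
    for name, rate in events.items():
        cumulative += rate
        if r <= cumulative:
            chosen = name
            break
    if chosen is None:               # guard against floating-point rounding at the boundary
        chosen = name
    dt = random.expovariate(total_rate)
    return chosen, t + dt

t = 0.0
for _ in range(5):
    event, t = kmc_step(t)
    print(f"t = {t:.3f}: {event}")
```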

Julie Friddell

Associate Director, Canadian Cryospheric Information Network, University of Waterloo

Polar Data at Compute Ontario and SHARCNET

The Polar Data Catalogue is Canada's primary online source for scientific research data and information about the Arctic and Antarctica. The PDC, hosted by the Canadian Cryospheric Information Network at the University of Waterloo, provides access to over 30 TB of data files and RADARSAT satellite images from partner organizations whose researchers work in the polar regions, particularly northern Canada. The PDC collection spans the range from individual files of caribou and polar bear population surveys to large datasets such as climate simulations focused on the polar vortex and high-resolution cloud height/motion observations from Eureka in northern Nunavut. We work with researchers across Canada to provide guidance and assistance on effective stewardship of their valuable data resources, including creation of descriptive metadata and preparation of data files to meet growing national and international requirements for data archiving and sharing. To make better use of the PDC's datasets so that more people can understand and benefit from them, we have created simple interactive visualization tools for selected datasets, starting with snow cover and lake ice thickness observations across Canada. To address user requests for more graphical display of and access to the PDC data, we are beginning discussions with Compute Ontario about extending our visualizations to more complicated combinations of related datasets, such as sea ice and meteorology over the melting Arctic Ocean and the Northwest Passage. In addition to virtualized and redundant servers and storage at the University of Waterloo, the PDC files and components are replicated, for protection against data loss, on live storage infrastructure at Compute Ontario/SHARCNET. We are exploring conversion to an off-site, completely cloud-based system as Compute Canada's capacities evolve and our current hardware ages.

Marcial Garbanzo-Salas

doctorate degree student, Western University

Atmospheric Studies with Radar Data and Simulations; HPC Applications in Radar Processing

HPC is widely used in physics, and radar applications provide a wide field of study where HPC techniques can be applied. Interferometry is used in radar processing to gather information about scatterers in the sky, and wide-beam radars provide a considerable amount of information. HPC techniques greatly improve the analysis of interferometric data and the approximation of general circulation winds. Finding large numbers of probable scatterers, discriminating among them and solving the equations for the wind is no small task, and it is better approached with HPC techniques. Another area of radar work where HPC is heavily used is atmospheric simulation. Large Eddy Simulations (LES) are used to better understand atmospheric motions, the generation of turbulence and dissipation scales. In this presentation, a simulation within a simulation is used to obtain radar back-scattering information from a virtual atmosphere. The results of the interferometric processing using HPC are also discussed.

Ioannis Haranas

Adjunct Professor, Wilfrid Laurier University

Perturbations Due to Dust in Mars Orbiting Satellites

In this paper we calculate the effect of atmospheric dust on the orbital elements of a satellite. Dust storms that originate at the Martian surface may evolve into global atmospheric storms that last for months and can affect low-orbiting and lander missions. We model the dust as a velocity-squared dependent drag force acting on a satellite and derive an appropriate disturbing function that accounts for the effect of dust on the orbit, using a Lagrangian formulation. A first-order perturbation solution of Lagrange's planetary equations of motion indicates that a local dust storm cloud with a possible density of 8.323*10^(-10) kg/m^3 at an altitude of 100 km affects the orbital semimajor axis of a 1000 kg satellite by up to -0.142 m/day. Regional dust storms of the same density may affect the semimajor axis by up to -0.418 m/day. Other orbital elements are also affected, but to a lesser extent. When dust is taken into account in more detailed efforts to model the Martian gravity field, high-power supercomputing becomes very important.
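
For context, a standard textbook statement of this kind of calculation (not necessarily the exact disturbing-function formulation used by the author) writes the velocity-squared drag acceleration and the resulting rate of change of the semimajor axis as:

```latex
% C_D, A and m are the drag coefficient, cross-sectional area and satellite mass,
% rho the local dust density, v the orbital speed and mu Mars's gravitational parameter.
\mathbf{a}_D = -\frac{1}{2}\,\frac{C_D A}{m}\,\rho\, v\,\mathbf{v},
\qquad
\frac{da}{dt} = \frac{2 a^{2} v}{\mu}\, F_T ,
```

where F_T is the along-track component of the perturbing acceleration (the Gauss form of the variational equation for the semimajor axis).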

Ricardo Harripaul

doctorate degree student, University of Toronto

Mapping Loci and Genes using a Hidden Markov Model for Bipolar Affective Disorder in Consanguineous Families

Bipolar Disorder (BD) is a psychiatric disorder characterized by transitions between depression and mania, a high rate of suicide (6% over age 20) and self-harm (30-40%). This debilitating condition has no known cause, and both genetic and environmental factors contribute to its complex phenotype. We hypothesize that in rare cases, autosomal recessive mutations contribute to BD. To identify these genetic loci, 34 consanguineous Iranian families were genotyped with Affymetrix 5.0 Single Nucleotide Polymorphism microarray chips. This genotype information was analysed with the FSuite analysis pipeline and dCHIP to identify homozygosity-by-descent (HBD) regions, as well as to perform an HBD genome-wide association study to identify novel recessive risk variants. In addition, we looked for Copy Number Variations (CNVs), and through these approaches 43 large HBD regions were identified, such as a 56 Mb region on chromosome 8 (harboring candidate genes such as IMPA1 and IMPAD1), a 10 Mb region on chromosome 17 (including SLC6A4) and a 7 Mb region on 5q35.2-qter including the genes DRD1 and GRM6. Whole exome and Sanger sequencing were used to search for homozygous coding mutations within these regions. Large runs of homozygosity have been identified in BD probands, including a 400 kb locus that traverses the GRIK6 glutamate receptor gene. We have also identified 48 CNVs of interest that may disrupt candidate genes such as SYN3, SLC39A11 and S100A10. Whole exome sequencing has been applied to all families, and rare variants have been identified in more than one family for a number of genes. For instance, for ABCA13, one homozygous nonsense variant and one homozygous non-synonymous variant were identified in separate families. Potential implications of these findings for the genetics of bipolar disorder will be discussed.

Thomas Hemmy

master's degree student, Wilfrid Laurier University

Investigating the Impact of Horizontal Gene Transfer on Metabolic Conservation within Bacteria

Through the process of horizontal gene transfer, bacteria are capable of acquiring new traits and capabilities never possessed by their parents, deviating from the widely accepted view that evolution occurs in a strictly vertical fashion. Despite evidence to the contrary, the idea that bacteria inherit their metabolic capabilities directly from their parents remains quite popular, and problematic. Current studies assume that bacteria possess the same metabolisms as other members of their species, leading to the popular practice of using marker genes to predict the organisms present in environmental samples and the metabolic capabilities that these bacteria possess. In this work, completely sequenced bacterial genomes have been functionally annotated in an effort to quantify horizontal gene transfer amongst distantly related bacteria. Using several functional annotation databases, we have searched for metabolic functions shared between distant species and for the development of divergent functions within members of the same species. Preliminary results suggest that little metabolic difference can be found even between members of distantly related species; however, this appears to be due to a lack of functional annotations for novel metabolic functions. It appears that functional annotations exist primarily for metabolisms that are common across bacterial phyla.

Grigoriy Kimaev

master's degree student, University of Waterloo

Nanoscale Structure of Liquid Crystal-Nanoparticle Mixtures Using Molecular Monte Carlo Simulations

Liquid crystals are materials which exhibit a phase-ordering transition from a traditional disordered liquid phase to a phase with attributes of both a liquid and a solid. These attributes include liquid-like flow and solid-like elasticity. Thus, liquid crystal phases are also known as "mesophases" and have driven both novel and disruptive technological advances over the past few decades. The presence of liquid crystals in display technology, high-performance materials, and pervasively in biology is well known, but it is less well known that liquid crystals are inherently nanoscale in structure and dynamics. The vast majority of current technology based on LCs involves geometries and surface interactions that employ solely macroscopic liquid crystal phase behaviour. Recent advances, for example ferroelectric display technology, have slowly introduced applications which utilize the nanoscale liquid crystal response to electric fields. Leveraging liquid crystals at the nanoscale introduces significant challenges, mainly due to the possible breakdown of the standard continuum theory which has been used for device design over the past decade. Conducting simulations of liquid crystal mesogens at the molecular scale is further complicated by the fact that mesogens are anisotropic. Thus, the potential energy between a pair of mesogens depends not only on the separation distance between the mesogens, but also on their relative orientation. The Gay-Berne pair potential has been widely used and remains the most successful to date for capturing the dependence of energy on particle orientations, but it is limited because it can only capture ellipsoidal mesogen shapes. In the research presented here, we have conducted coarse-grained molecular Monte Carlo simulations to yield a more accurate understanding of the thermodynamics and structure of LC phases. We have employed an approach that approximates pairwise molecular interactions with a Lennard-Jones-type pair potential in which the range and strength parameters are expanded in terms of orthogonal functions instead of being treated as constant scalar quantities. These orthogonal functions depend on the particles' relative positions and orientations, thereby capturing shape anisotropy and the dependence of the potential energy on it. This approach allows for the representation of a broad set of pairwise interactions and mesogen shapes (uniaxial and biaxial) and, eventually, fitting of the potential using more accurate methods (atomistic molecular dynamics, quantum density functional theory, etc.). Because of the nature of the approach, we were able to simulate not only the interactions between identical anisotropic mesogens, but also the interactions between a mesogen and a spherical particle, simulating LC phase formation and behaviour in liquid crystal-nanoparticle mixtures. A comparison of the method for liquid crystal/nanoparticle mixtures to an existing continuum theory is presented.
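
To make the idea of an orientation-dependent Lennard-Jones-type interaction concrete, the sketch below uses a simple, invented angular modulation of the range and strength parameters; the actual study expands these parameters in orthogonal functions of the relative position and orientations rather than the closed forms used here.

```python
import numpy as np

def pair_energy(r_vec, u1, u2, eps0=1.0, sigma0=1.0, chi=0.5):
    """Anisotropic Lennard-Jones-type pair energy for two uniaxial mesogens with
    unit orientation vectors u1, u2 and separation vector r_vec (illustrative form)."""
    r = np.linalg.norm(r_vec)
    r_hat = r_vec / r
    # Hypothetical orientation-dependent range and strength: aligned mesogens
    # interact over a longer range, and alignment with the separation vector
    # modulates the well depth.
    sigma = sigma0 * (1.0 + 0.3 * np.dot(u1, u2) ** 2)
    eps = eps0 * (1.0 + chi * np.dot(u1, r_hat) * np.dot(u2, r_hat))
    sr6 = (sigma / r) ** 6
    return 4.0 * eps * (sr6 ** 2 - sr6)

# Example: two parallel mesogens separated side-by-side by 1.2 sigma0.
u1 = np.array([0.0, 0.0, 1.0])
u2 = np.array([0.0, 0.0, 1.0])
print(pair_energy(np.array([1.2, 0.0, 0.0]), u1, u2))
```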

Ilias Kotsireas

Professor, Wilfrid Laurier University

Efficient Algorithms for Matching Problems

We will describe a class of matching problems that arise in combinatorial design theory. We will outline several different approaches to solving these problems. These approaches give rise to algorithms that are, for the most part, amenable to parallelization, even though it is not always easy to recognize an optimal parallelization strategy.

Ruipeng Lu

doctorate degree student, Western University

Discovery of Primary, Cofactor, and Novel Transcription Factor Binding Site Motifs by Recursive, Thresholded Entropy Minimization

Transcription factors regulate gene expression by binding to related DNA sequences of target genes. Cooperative interactions between multiple bound factors can repress or activate expression of these genes. We apply Shannon information theory to discover conserved motifs recognized by these factors in ChIP-Seq data from the Encyclopedia of DNA Elements. The data consist of thousands of sequenced genomic fragments that have been co-immunoprecipitated with a particular transcription factor. Motifs are built with Bipad, a C++ program that applies Monte Carlo-based entropy minimization to search multiple alignment space for homogeneous or bipartite models. These models can be used to determine the information contents (Ri) or binding affinity of functional binding sites and to identify mutated sites. We built accurate information models for 168 transcription factors from unaligned sequences of ChIP-Seq fragments, from biological and technical replicates, and from different cell lines. The resulting models were compared between replicates, against immunoprecipitated sequences from other cell lines, and with previously determined motifs. This process was then iterated to discover additional conserved sequence patterns in the same data: the original motif was masked prior to derivation of a second model by entropy minimization. Datasets whose models consisted of low-complexity noise patterns were also thresholded to eliminate low-read-abundance ChIP-Seq peaks and then reanalyzed with Bipad. Three quality control measures were used to evaluate the accuracy of these models, including: 1) determining the Euclidean distance between the current information weight matrix and previously published motifs, 2) evaluating the linearity of Ri versus binding energy to distinguish correct motifs from noisy, low-complexity motifs, and 3) validating predicted binding sites against experimentally proven sites in known target genes.
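
As a small illustration of the individual information (Ri) measure referred to above, the sketch below scores candidate sites against a toy position frequency matrix; the motif and sites are hypothetical, and Bipad's models (including bipartite ones and the small-sample correction) are far richer.

```python
import math

# Toy 4-position DNA frequency matrix (hypothetical); Bipad derives such matrices
# from thousands of aligned ChIP-Seq fragments.
freq = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.80, "G": 0.05, "T": 0.10},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]

def individual_information(site):
    """Ri = sum over positions of (2 + log2 f(base, position)) bits for DNA,
    omitting the small-sample correction used in full treatments."""
    return sum(2.0 + math.log2(freq[i][base]) for i, base in enumerate(site))

print(f"Ri(AGCA) = {individual_information('AGCA'):.2f} bits")   # strong site
print(f"Ri(TTTT) = {individual_information('TTTT'):.2f} bits")   # poor site
```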

Fei Mao

HPC Consultant, SHARCNET

Python-based Large-scale Visual Recognition with Multiple GPUs

The AlexNet convolutional neural net architecture propelled Deep Learning into the spotlight by winning the ImageNet LSVRC competition in 2012. Since that time, it has become a baseline for empirical Deep Learning research. In this presentation, I will describe a Theano-based AlexNet implementation and its naive data parallelism on multiple GPUs. Our performance on 2 GPUs is comparable with the state-of-the-art Caffe library (Jia et al., 2014) run on 1 GPU. We released our source code on GitHub, and we believe it is the first and only open-source Python-based AlexNet implementation to date.
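
The sketch below illustrates the idea of naive (synchronous) data parallelism using NumPy in place of Theano/GPU kernels: each "GPU" receives part of the mini-batch, computes its gradient, and the gradients are averaged before a shared weight update. The linear model and all names are stand-ins, not the released AlexNet code.

```python
import numpy as np

def gradient(weights, x_batch, y_batch):
    # Linear least-squares gradient as a stand-in for AlexNet back-propagation.
    preds = x_batch @ weights
    return x_batch.T @ (preds - y_batch) / len(x_batch)

rng = np.random.default_rng(0)
weights = np.zeros(10)
x, y = rng.normal(size=(256, 10)), rng.normal(size=256)

for step in range(100):
    halves = np.array_split(np.arange(len(x)), 2)          # one slice of the mini-batch per "GPU"
    grads = [gradient(weights, x[idx], y[idx]) for idx in halves]
    weights -= 0.01 * np.mean(grads, axis=0)                # synchronous averaged update
```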

Jonah Miller

doctorate degree student, Perimeter Institute for Theoretical Physics

Discontinuous Galerkin Methods for Numerical Relativity

Discontinuous Galerkin Finite Element (DGFE) methods offer a mathematically beautiful and computationally efficient way to solve hyperbolic PDEs. The approach parallelizes well and has been very successful in computational fluid dynamics and electrodynamics. Traditionally, however, DGFE methods have only been formulated for manifestly flux-conservative systems. The BSSN formulation of the Einstein equations, a very successful formulation used in numerical relativity, is not of this form. We have therefore generalized DGFE methods to handle this type of equation. In this talk, we describe our generalized formulation of DGFE methods, their relevance and advantages, and preliminary numerical results using our formulation.
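
For reference, a standard textbook statement of the flux-conservative setting that traditional DGFE methods assume (the talk's generalization handles systems, such as BSSN, that are not of this form) is:

```latex
% A flux-conservative hyperbolic system and its elementwise DG weak form.
\partial_t \mathbf{u} + \nabla \cdot \mathbf{F}(\mathbf{u}) = 0,
\qquad
\int_{K} \phi\, \partial_t \mathbf{u}\, dV
  - \int_{K} \nabla \phi \cdot \mathbf{F}(\mathbf{u})\, dV
  + \oint_{\partial K} \phi\, \mathbf{F}^{*}(\mathbf{u}^-, \mathbf{u}^+) \cdot \hat{\mathbf{n}}\, dS = 0 ,
```

where K is a mesh element, phi a test function from the broken polynomial space, and F* a numerical flux coupling neighbouring elements.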

Ali Ramadhan

undergraduate degree student, University of Waterloo

Reconstructing Molecular Geometries of Small Molecules using Coulomb Explosion Imaging

Coulomb explosion imaging is used to produce images of simple molecules as they undergo ultrafast changes, producing "molecular movies" with frame rates of one frame per 10^-15 seconds. The pulse length acts like the shutter speed of a camera, allowing us to take a snapshot of a molecule in motion, and the intense laser radiation gives us a means to make the image, as the laser light produces a momentary electric field stronger than the one which binds the electrons to the atoms. This can remove up to six electrons from a typical triatomic molecule and cause the molecule to explode, because there are not enough electrons left to bind the positively charged ions together. We call this process a Coulomb explosion. To use this explosion as a way of imaging the molecule, we need to detect all of the fragment ions created by the explosion and measure their momenta; we can then run a simulation of the explosion to determine the original geometry of the molecule. Previously, a simplex algorithm was used to reconstruct molecular geometries. Here, however, we introduce a faster and more accurate method of reconstructing these geometries that runs in MATLAB on SHARCNET. It can also detect degenerate solutions and scales well, performance-wise, for larger molecules.
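
The Python sketch below illustrates the general simplex-style fitting idea described above (the earlier baseline, not the new method, which runs in MATLAB): a trial geometry is exploded by a toy point-charge Coulomb model, and its simulated fragment momenta are matched to "measured" ones by Nelder-Mead minimization. Charges, masses, units and the forward model are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def explode(positions, charges, masses, dt=0.02, steps=1000):
    """Integrate mutual Coulomb repulsion of point charges (toy units) and
    return the fragment momenta after a fixed propagation time."""
    pos = positions.astype(float).copy()
    vel = np.zeros_like(pos)
    for _ in range(steps):
        forces = np.zeros_like(pos)
        for i in range(len(pos)):
            for j in range(len(pos)):
                if i != j:
                    d = pos[i] - pos[j]
                    forces[i] += charges[i] * charges[j] * d / np.linalg.norm(d) ** 3
        vel += forces / masses[:, None] * dt
        pos += vel * dt
    return vel * masses[:, None]

charges = np.array([1.0, 1.0, 1.0])
masses = np.array([16.0, 12.0, 16.0])    # e.g. an OCO-like triple of singly charged ions

# "Measured" momenta generated from a known bent geometry (demo only).
true_geometry = np.array([[-1.2, 0.3, 0.0], [0.0, 0.0, 0.0], [1.2, 0.3, 0.0]])
measured = explode(true_geometry, charges, masses)

def mismatch(params):
    # Parametrize the trial geometry by the two in-plane bond vectors.
    trial = np.array([[params[0], params[1], 0.0],
                      [0.0, 0.0, 0.0],
                      [params[2], params[3], 0.0]])
    return np.sum((explode(trial, charges, masses) - measured) ** 2)

result = minimize(mismatch, x0=[-1.0, 0.0, 1.0, 0.0], method="Nelder-Mead",
                  options={"xatol": 1e-3, "maxiter": 200})
print(result.x)   # recovered bond vectors
```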

Russell Spencer

Postdoctoral Fellow, University of Waterloo

Boundary Tension Between Coexisting Phases of a Block Copolymer Blend

Self-consistent field theory (SCFT) is used to evaluate the excess free energy per unit area (i.e., tension) of different boundaries (or interfaces) between the coexisting phases of a block copolymer blend. In this first study of its kind, we focus on the boundaries separating the short- and long-period lamellar phases that form in mixtures of small and large symmetric diblock copolymers of polymerizations Ns and Nl, respectively. According to strong-segregation theory (SST), the tension is minimized when both sets of lamellae orient parallel to the boundary, but experiments tend to observe kink boundaries instead, where the domains of one lamellar phase evolve continuously into those of the other lamellar phase. Our more refined SCFT calculations, on the other hand, do predict a lower tension for the kink boundary consistent with the experimental observations. For completeness, we also examine the boundaries that form when the short-period lamellar phase disorders, and again the SCFT results are in agreement with experiment.

Shaobo Wei

master's degree student, Laurentian University

Somewhat Homomorphic Encryption Scheme for Secure Range Query Process in a Cloud Environment

Recently, with the development of cloud computing, many service models based on cloud computing have appeared, such as "Infrastructure as a Service" (IaaS), "Platform as a Service" (PaaS), and "Software as a Service" (SaaS). There is also one called "Database as a Service". This service model lets users store, manage, and access their data in a cloud database. However, the cloud database must be fully secured, because security problems currently restrict the use of this model. The corresponding research area of cloud computing is called "Cloud Security". One of the problems is that it is difficult to execute queries on encrypted data in a cloud database without any information leakage. This research proposes a secure range query process, based on a somewhat homomorphic encryption scheme, that leaks no sensitive information. The data stored in the cloud database are integers, encrypted in their binary form bit by bit. A homomorphic "greater-than" algorithm is used in the process to compare two integers. The efficiency and security of the process, and the maximum noise that can be controlled in it, are analyzed, and the parameter settings of the process are also examined. Experiments were performed to test the practicability of the secure range query process with realistic parameter settings. Since the process involves many very large integers, normal personal computers cannot compute with them efficiently. The computing capabilities of SHARCNET were used in the experiments for the large-integer computations and, at the same time, SHARCNET was also regarded as the cloud service provider of the cloud environment in the experiments.
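
The sketch below shows the bitwise "greater-than" circuit on plaintext bits; in the actual scheme each bit would be a somewhat-homomorphic ciphertext, with XOR and AND realized as homomorphic addition and multiplication. It illustrates the comparison logic only, not the encryption scheme or its parameters.

```python
def to_bits(x, width):
    """Binary representation, most significant bit first."""
    return [(x >> i) & 1 for i in reversed(range(width))]

def greater_than(a_bits, b_bits):
    """Return 1 if a > b, scanning from the most significant bit; uses only the
    XOR/AND/NOT operations that the homomorphic scheme would evaluate."""
    equal_so_far = 1
    result = 0
    for a, b in zip(a_bits, b_bits):
        # a exceeds b at this bit while all higher bits were equal; the per-bit
        # terms are disjoint, so XOR acts as an "or" when accumulating them.
        result = result ^ (equal_so_far & a & (1 ^ b))
        equal_so_far = equal_so_far & (1 ^ (a ^ b))   # still equal after this bit?
    return result

print(greater_than(to_bits(13, 8), to_bits(9, 8)))   # 1, since 13 > 9
print(greater_than(to_bits(5, 8), to_bits(9, 8)))    # 0, since 5 <= 9
```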