Pf3k pilot release 5
--------------------

Original date 2016-02-08
Last updated 2017-09-13

This directory contains data files comprising the Pf3k data release 5. This constitutes the final sample set for the pilot stage of the Pf3k project.

For more information about this data release, including terms of use and guidance on how to cite the data, please see:

http://www.malariagen.net/data/pf3k-5

This release contains an updated sample set (5.0) comprising 2640 Plasmodium falciparum samples. In addition to the release 4.0 set of:

- 2375 samples from multiple sampling sites in Africa and Asia, contributed by a number of MalariaGen Pf Community Project studies.
- 137 Senegal samples contributed by the Broad Institute.
- 5 lab clonal samples (7G8, GB4, KH02, KE01, GN01) used for validation.

this release contains new samples contributed by MalariaGen Pf Community:

- 96 lab clonal samples comprising parents and progeny from the Plasmodium falciparum genetic crosses project
- 27 mixed lab strains in varying proportions created by Jason Wendler

It also contains a set of genotype calls for the 5.0 sample set. These genotypes are based on a de-novo variant discovery. These genotypes should not be taken as a quality-controlled output of the Pf3K project and are provided for public interest and as a basis for future methods development.

Files in the release include:

- pf3k_release_5_metadata_20170804.txt.gz|xlsx : sample metadata file in gzipped tab-delimited and excel formats
- pf3k_release_5_crosses_metadata.txt|xls : sample metadata file specific to 98 crosses samples in tab-delimited and excel formats
- pf3k_release_5_mixtures_metadata.txt|xls : sample metadata file specific to 27 mixture samples in tab-delimited and excel formats
- Pfalciparum.genome.fasta.gz : 3D7 v3.1 reference genome sequences (downloaded from ftp://ftp.sanger.ac.uk/pub/project/pathogens/gff3/2015-08/Pfalciparum.genome.fasta.gz)
- Pfalciparum.genome.fasta.gz.md5 : checksum for reference genome
- BAM/[samplename].bam : analysis BAM files, one-per-sample, aligned to the 3D7_v3 reference
- BAM/md5.txt : checksums for BAM files
- 5.1/SNP_INDEL_[chromosome].combined.filtered.vcf.gz : vcf files, one-per-chromosome, for release 5.0 genotypes
- 5.1/md5.txt : checksums for vcfs
- deprecated_metadata/pf3k_release_5_metadata.txt|xls : older sample metadata file in tab-delimited and excel formats

Note on sample metadata - this release was originally accompanied by the sample metadata found in deprecated_metadata/pf3k_release_5_metadata.txt|xls. We subsequently received updates on metadata of collection sites and dates from some partners. To guarantee consistency across samples, we have replaced collection date with collection year, which is now populated for all samples, and which is a proxy for season. The correct sample metadata to use with this release is pf3k_release_5_metadata_20170804.txt.gz|xlsx.

In addition to the changes to site and collection year, this file includes three additional columns. IsFieldSample is set to True for field samples and False for lab samples. The data set includes some replicate samples (multiple sequencing of samples from the same individual). The PreferredSample column is set to True for the sample from each individual which has the greatest coverage.
For most population analyses, the sample set should be restricted to samples where PreferredSample is True. The AllSamplesThisIndividual column can be used to identify which samples come from the same individual. For example, samples PF0550-C and PF0550-Cx come from the same individual, and PF0550-C is the preferred sample for analysis in this case.

See additional README files for further details.
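
The restriction to preferred samples described above can be sketched in Python as follows. This is a minimal illustration, not part of the release: it assumes the gzipped metadata file has a header row, that the IsFieldSample and PreferredSample columns hold the literal text True or False, and that the sample identifier column is named "sample" (a hypothetical name; check the actual header and pass the right one). The function name is also made up for this example.

```python
import csv
import gzip


def load_preferred_field_samples(path, sample_column="sample"):
    """Return IDs of samples that are both field samples and preferred.

    `sample_column` is an assumed column name, not taken from the release
    documentation; adjust it to match the header of the metadata file.
    """
    selected = []
    # The metadata file is gzipped, tab-delimited text.
    with gzip.open(path, mode="rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            # Flags are assumed to be stored as the text "True" / "False".
            is_field = row["IsFieldSample"].strip().lower() == "true"
            is_preferred = row["PreferredSample"].strip().lower() == "true"
            if is_field and is_preferred:
                selected.append(row[sample_column])
    return selected
```

Calling load_preferred_field_samples("pf3k_release_5_metadata_20170804.txt.gz") would then give one sample ID per field individual. Drop the IsFieldSample test if lab samples (crosses, mixtures, validation clones) should be kept.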