| Tarun Mohapatra

JAMstack the future of web

About Dataset ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ The Uncleaned Laptop Price dataset is a collection of laptop product listings scraped from an online e-commerce website. The dataset includes information about various laptop models, such as their brand, screen size, processor, memory, storage capacity, operating system, and price. However, the dataset is uncleaned, meaning that it contains missing values, inconsistent formatting, and other errors that need to be addressed before the data can be used for analysis or modeling.

The dataset contains both categorical and numerical variables, with the majority of variables being categorical, including brand, model name, screen resolution, processor type, and operating system. Some numerical variables include screen size, memory, and storage capacity. The target variable in the dataset is the price, which is a continuous variable.

The dataset contains over 1,300 laptop listings, making it a reasonably sized dataset for analysis and modeling. The dataset may be useful for machine learning projects related to predicting the price of a laptop based on its specifications. However, before using the dataset, it would be necessary to clean and preprocess the data to address the inconsistencies and missing values.

Dataset Links

I am going to clean this dataset with mysql queries . Although it is easy to clean dataset with help of python but as most cases in real world data lies within database so it is better to do initial level data level data cleaning through sql . Here I have used lots of string functions ,DDL and DML commands of mysql

steps to do
  1. first download dataset from above link
  2. create a database name project
  3. import dataset to mysql server through import wizard and name table laptop

Initial look of data

    Select * from laptop
indexUnnamed: 0CompanyTypeNameInchesScreenResolutionCpuRamMemoryGpuOpSysWeightPrice
010AppleUltrabook13.3IPS Panel Retina Display 2560x1600Intel Core i5 2.3GHz8GB128GB SSDIntel Iris Plus Graphics 640macOS1.37kg71378.6832
121AppleUltrabook13.31440x900Intel Core i5 1.8GHz8GB128GB Flash StorageIntel HD Graphics 6000macOS1.34kg47895.5232
232HPNotebook15.6Full HD 1920x1080Intel Core i5 7200U 2.5GHz8GB256GB SSDIntel HD Graphics 620No OS1.86kg30636.0000

count of rows in dataset

    select count(*) from laptop
count(*)
01272

Let's create a backup data

    create table laptop_backup like     laptop;
    insert into laptop_backup select * from laptop;

count of rows in backup dataset

    select count(*) from laptop_backup

check for space occupied

                    select DATA_LENGTH/1024 from information_schema.tables 
                    where table_schema = 'project'
                    and table_name = 'laptop';
DATA_LENGTH/1024
0256.0

drop unnecessary column

     alter table laptop drop column `Unnamed: 0`;
     Select * from laptop;

check for null rows

        select `index` from laptop
        where Company is null and TypeName is null and Inches is null 
        and ScreenResolution is null and `Cpu` is null and Ram is null 
        and `Memory` is null and Gpu is null and OpSys is null and
        Weight is null and Price is null;

delete all the null row

        delete from laptop where `index` in (
        select `index` from laptop
        where Company is null and TypeName is null and Inches is null 
        and ScreenResolution is null and `Cpu` is null and Ram is null 
        and `Memory` is null and Gpu is null and OpSys is null and
        Weight is null and Price is null)

check for duplicate row


    select Company ,TypeName ,Inches,ScreenResolution ,`Cpu`,Ram,`Memory`,
    Gpu,OpSys,Weight,price,count(*) from laptop group by Company ,TypeName ,Inches,
    ScreenResolution ,`Cpu`,Ram,`Memory`,Gpu,OpSys,Weight,price having  count(*)>1;

delete duplicate row

    with base_query as(select min(`index`)from laptopdata group by Company ,
                        TypeName ,Inches,ScreenResolution ,
                        `Cpu`,Ram,`Memory`,Gpu,OpSys,Weight,price)
    delete from laptop where `index` not in (select * from base_query);
pd.read_sql_query("Select count(*) from laptop",conn)
count(*)
01243

Inches Column

changing Inches to datatpe to decimal
    alter table laptop modify column  Inches Decimal(10,1);

Ram Column

removing GB
    update laptop set Ram = replace(Ram,'GB','') ;

Changing data type of Ram column to integer

    alter table laptop modify column Ram INTEGER;

Weight Column

removing kg

   update laptop set Weight = replace(Weight,'kg','') ;

changing data type of Weight column to decimal

       alter table laptop modify column Weight Decimal(10,2);

delete record where weight is 0

       delete from laptop where Weight = 0;

Price

rounding of price
    update laptop set Price = round(Price);
changing price column datatype to integer
    alter table laptop modify column Price INTEGER;

OpSYS

changing OpSYS values
there are values like windows 10,windows11,windows 7 ....,we are going to convert all to windows similarly to other
    update laptop set OpSYS = 
case 
    when OpSys like '%mac%' then 'Mac'
    when OpSys like '%windows%' then 'Windows'
    when OpSys like '%Linux%' then 'Linux'
    when OpSys like '%chrome%' then 'Chrome'
    when OpSys like '%android%' then 'Android'
    when OpSys = 'No OS' then 'N/A'
    else 'Other'
end;

Gpu

strategy is to extract Brand name from Gpu column and drop the Gpu column
alter table laptop 
add column gpu_brand varchar(255) after Gpu;


update laptop set gpu_brand = substring_index(Gpu ,' ',1);


alter table laptop drop column Gpu;

Cpu

strategy is to make make three new column name cpu_brand,cpu_speed,cpu_name from cpu and finaly drop cpu column . also in cpu speed column I am going to remove GHz and convert that column to decimal
    alter table laptop
    add column cpu_brand varchar(255) after `Cpu`,
    add column cpu_name varchar(255) after cpu_brand,
    add column cpu_speed decimal(10,1) after cpu_name; 
    
    update laptop set cpu_brand = substring_index(`Cpu` ," ",1);
    
    update laptop 
    set cpu_speed = cast( replace(substring_index(`Cpu`,' ',-1),'GHz',"")  as decimal(10,1) );
    
    update laptop 
    set cpu_name = replace(replace(`Cpu`,cpu_brand,''),substring_index(`Cpu`,' ',-1),'');
    
    update laptop
    set cpu_name = substring_index(trim(cpu_name)," ",2);
    
    alter table laptop drop column `Cpu`;

Screen Resolution

strategy is to add four column name resolution_width , resolution_height ,is_touchscreen,is_ips_panel and drop ScreenResolution

    alter table laptop
    add column resolution_width integer after ScreenResolution,
    add column resolution_height integer after resolution_width,
    add column is_touchscreen integer after resolution_height,
    add column is_ips_panel integer after is_touchscreen;
    
    update laptop
    set resolution_width = substring_index(substring_index(ScreenResolution," ",-1),"x",1),
    resolution_height = substring_index(substring_index(ScreenResolution," ",-1),"x",-1),
    is_touchscreen  = ScreenResolution like '%Touch%',
    is_ips_panel  = ScreenResolution like '%IPS%';

    alter table laptop drop column ScreenResolution;

Memory

strategy is to make three new column name memory_type , primary_storage, secondary storage also all the memoryis going to be in Gb
    alter table laptop 
    add column memory_type varchar(255) after Memory,
    add column primary_storage integer after memory_type,
    add column secondary_storage integer after primary_storage;
    
    
    update laptop
    set memory_type =
    case
        when Memory like '%hybrid%' then 'Hybrid'
        when Memory like '%SSD%' and Memory like '%HDD%' then 'Hybrid'
        when Memory like '%Flash Storage%' and Memory like '%HDD%' then 'Hybrid'
        when Memory like '%SSD%' then 'SSD'
        when Memory like '%HDD%' then 'HDD'
        when Memory like '%Flash Storage%' then 'Flash Storage'
        else null
    end;
    
    update laptop set
    primary_storage = regexp_substr(substring_index(Memory,'+',1),'[0-9]+');
    
    update laptop set
    secondary_storage = 
    case 
        when Memory like '%+%' then regexp_substr(substring_index(Memory,'+',-1),'[0-9]+')
        else 0
    end;
    
    
    
    update laptop set 
     primary_storage = 
     case 
         when primary_storage <=3 then primary_storage*1024
         else primary_storage 
     end,
     
     secondary_storage = 
     case 
          when secondary_storage <=3 then secondary_storage*1024
          else secondary_storage
     end;
     
     
     alter table laptop drop column Memory;
pd.read_sql_query("Select * from laptop",conn).head(3)
indexCompanyTypeNameInchesis_ips_panelis_touchscreenresolution_widthresolution_heightcpu_brandcpu_namecpu_speedRammemory_typeprimary_storagesecondary_storagegpu_brandOpSysWeightPrice
01AppleUltrabook13.31025601600IntelCore i52.38SSD128.00IntelMac1.3771379
12AppleUltrabook13.3001440900IntelCore i51.88Flash Storage128.00IntelMac1.3447896
23HPNotebook15.60019201080IntelCore i52.58SSD256.00IntelN/A1.8630636

Stay up-to-date with my words ✍️